Practice Problem:

**Learning Goals:**

- Correct usage of `fillna()` for imputing missing values.
- Proper usage of `dropna()` to drop rows with missing values.
- Effective use of regex with `str.extract()` to extract the domain from email addresses.
- Accurate use of `str.replace()` to clean invalid emails.


In this practice example, you will work with a toy dataset that contains missing values. The toy dataset simulates a small customer database with some missing data in columns for customer names, age, and email.

You will use `pandas` to clean the data by:
1. **Imputing missing values** using the `fillna()` function.
2. **Dropping rows with missing values** using the `dropna()` function.
3. **Using regex** to clean up string data, such as extracting specific parts of a string or replacing unwanted characters.

You will need to use the following pandas functions:
- `fillna()`
- `dropna()`
- `str.extract()` with regex
- `str.replace()` with regex


#### **Task 1: Create a toy dataset**

```
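import numpy as np
import pandas as pd

# Toy customer records; np.nan marks the missing entries
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'CustomerName': ['Amy', 'Tony', np.nan, 'Arian', 'Eva'],
    'Age': [25, np.nan, 22, 34, np.nan],
    'Email': ['amy@example.com', 'tony@example', 'noami@domain.com', np.nan, 'eva123@domain.org']
}

# Build the DataFrame used in the remaining tasks
df = pd.DataFrame(data)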
```
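
Before cleaning, it can help to confirm where the gaps are. The sketch below rebuilds the Task 1 dataset and counts missing values per column with `isna().sum()`. The variable name `missing_counts` is only illustrative.

```python
import numpy as np
import pandas as pd

data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'CustomerName': ['Amy', 'Tony', np.nan, 'Arian', 'Eva'],
    'Age': [25, np.nan, 22, 34, np.nan],
    'Email': ['amy@example.com', 'tony@example', 'noami@domain.com', np.nan, 'eva123@domain.org']
}
df = pd.DataFrame(data)

# isna() marks each missing cell as True; sum() counts them per column
missing_counts = df.isna().sum()
print(missing_counts)
```

For this dataset the counts are 1 for `CustomerName`, 2 for `Age`, and 1 for `Email`.
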
#### **Task 2: Impute Missing Values in Age**:

Use the mean for imputing missing values when:

- The data is approximately normally distributed (symmetric).
- The data is missing completely at random (MCAR).
- You want to preserve the overall mean of the dataset.

Do not use the mean for imputation when:

- The data is skewed or has outliers (consider using the median).
- The data is missing not at random (MNAR).
- You need to preserve the variance or heterogeneity of the data (consider more complex methods like KNN or model-based imputation).

The `Age` column contains missing values (NaN). For this task, you will use the `fillna()` method to impute the missing values in the `Age` column with the **mean** of the existing `Age` values.
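
One possible way to do this is sketched below, assuming the dataset from Task 1. `Series.mean()` skips NaN by default, so the mean is computed from the three known ages only. The name `mean_age` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Age': [25, np.nan, 22, 34, np.nan],
})

# Mean of the non-missing ages: (25 + 22 + 34) / 3
mean_age = df['Age'].mean()

# Replace every NaN in Age with that mean
df['Age'] = df['Age'].fillna(mean_age)
print(df)
```

Both missing ages become 27.0 here. Swapping `mean()` for `median()` gives the median-based alternative mentioned above.
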


#### **Task 3: Drop Rows Where `CustomerName` or `Email` Is Missing**:

The `CustomerName` and `Email` columns each contain one missing value (NaN). For this task, use the `dropna()` function with its `subset` parameter to remove the rows where `CustomerName` or `Email` is missing.
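
A minimal sketch of `dropna()` with the `subset` parameter, assuming the Task 1 dataset. It shows both the single-column call and the two-column call. The names `name_only` and `name_and_email` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'CustomerName': ['Amy', 'Tony', np.nan, 'Arian', 'Eva'],
    'Email': ['amy@example.com', 'tony@example', 'noami@domain.com', np.nan, 'eva123@domain.org'],
})

# Drop rows only when CustomerName is missing
name_only = df.dropna(subset=['CustomerName'])

# Drop rows when CustomerName or Email is missing
name_and_email = df.dropna(subset=['CustomerName', 'Email'])

print(name_only)
print(name_and_email)
```

`dropna()` returns a new DataFrame and leaves `df` unchanged, so assign the result if you want to keep it.
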


#### **Task 4: Replace any occurrences of "@example" with "@calpoly.edu"**:

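You can modify the existing code that uses `str.replace()` to use the regular expression `r'@example\b'` (pass `regex=True` so the pattern is treated as a regex), where:

`@example` matches the literal string "@example".

`\b` is a word boundary anchor. It requires that "@example" is not immediately followed by another word character (a letter, digit, or underscore), so "@examples" is not matched. Note that `.` is not a word character, so this pattern still matches the "@example" in "@example.com". To replace only addresses that end in "@example", such as `tony@example`, anchor the pattern to the end of the string instead: `r'@example$'`.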
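
The sketch below compares two patterns on the Task 1 emails: the word-boundary pattern `r'@example\b'` and an end-of-string pattern `r'@example$'`. The second pattern and the variable names are illustrative assumptions, not part of the original task. Because `.` is not a word character, `\b` also matches inside `@example.com`.

```python
import numpy as np
import pandas as pd

emails = pd.Series(['amy@example.com', 'tony@example', 'noami@domain.com',
                    np.nan, 'eva123@domain.org'])

# Word boundary: matches "@example" whenever the next character is not a word character
with_boundary = emails.str.replace(r'@example\b', '@calpoly.edu', regex=True)

# End anchor: matches "@example" only at the end of the address
end_anchored = emails.str.replace(r'@example$', '@calpoly.edu', regex=True)

print(with_boundary)
print(end_anchored)
```

With `\b`, `amy@example.com` becomes `amy@calpoly.edu.com`. With `$`, only `tony@example` changes, becoming `tony@calpoly.edu`. Missing values stay NaN in both cases.
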



#### **Task 5: Use Regex to Extract Domain from Email**:

For the `Email` column, you will extract the domain name from each email address using a regular expression. You will:
1. Create a new column called `Domain` that contains the domain part of the email (i.e., everything after the `@` symbol).
2. Use the `str.extract()` method with a regular expression to capture the domain.
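
One way to approach this, assuming the Task 1 emails, is a single capture group after the `@`. The pattern `r'@(.+)$'` is an illustrative choice. `expand=False` makes `str.extract()` return a Series rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Email': ['amy@example.com', 'tony@example', 'noami@domain.com',
              np.nan, 'eva123@domain.org'],
})

# Capture everything after the "@" up to the end of the string
df['Domain'] = df['Email'].str.extract(r'@(.+)$', expand=False)
print(df)
```

Rows with a missing email get NaN in `Domain`, and an incomplete address such as `tony@example` yields the partial domain `example`.
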







### **Submission Instructions**:

Show all completed tasks to your instructor.

---

