In [None]:
import pandas as pd
import numpy as np

# Create a sample dataset with multiple issues
np.random.seed(42)

# Random data generation
data = {
    "Customer_ID": np.arange(1, 101),
    "Age": np.random.randint(18, 70, 100),
    "Region": np.random.choice(["North", "South", "East", "West"], 100),
    "Product": np.random.choice(["A", "B", "C"], 100),
    "Rating": np.random.uniform(1, 5, 100)
}

# Create DataFrame
df = pd.DataFrame(data)

# Introduce missing values
df.loc[::10, "Age"] = np.nan  # Missing 10% in Age column
df.loc[::15, "Region"] = np.nan  # Missing 7% in Region column (7 of 100 rows)

# Add some duplicate rows using pd.concat
df_duplicates = pd.concat([df, df.iloc[::5]], ignore_index=True)

# Introduce some outliers in the Rating column
outlier_indices = df_duplicates.iloc[::20].index  # Select every 20th row for outliers
df_duplicates.loc[outlier_indices, "Rating"] = np.random.uniform(10, 25, len(outlier_indices))  # Random outlier values

# Collapse products A and B into one label so a single value dominates the Product column
df_duplicates["Product"] = df_duplicates["Product"].replace({"A": "Special", "B": "Special"})

# Shuffle DataFrame
df_duplicates = df_duplicates.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the first few rows
df_duplicates.head()


| | Customer_ID | Age | Region | Product | Rating |
|---|---|---|---|---|---|
| 0 | 45 | 38.0 | East | Special | 3.580413 |
| 1 | 48 | 35.0 | South | C | 1.920741 |
| 2 | 5 | 60.0 | North | C | 3.233174 |
| 3 | 56 | 37.0 | East | Special | 4.533976 |
| 4 | 27 | 29.0 | North | C | 1.703701 |
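
Before cleaning, it can help to count the issues that were just introduced. The following is a minimal sketch, not part of the original notebook: `summarize_issues` is a hypothetical helper, and `toy` is a small stand-in frame rather than the dataset above.

```python
import numpy as np
import pandas as pd


def summarize_issues(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: return simple data-quality counts for a DataFrame."""
    return {
        "n_rows": len(frame),
        # Number of missing entries in each column
        "missing_per_column": {col: int(n) for col, n in frame.isna().sum().items()},
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Small stand-in frame (illustrative values, not the notebook's data)
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 25.0],
    "Region": ["North", "South", None, "North"],
})

report = summarize_issues(toy)
print(report)
```

Calling the same helper on `df_duplicates` should give a quick overview of how many values are missing and how many rows are exact repeats.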


## Step 1: Handling Missing Values

Missing values are common in real-world datasets. Here, both the numeric `Age` column and the categorical `Region` column contain missing entries.

### Imputation Strategy:
- For **Age**: fill missing values with the median, which is less sensitive to extreme values than the mean.
- For **Region**: fill missing values with the mode (the most frequent value), a common choice for categorical columns.

Let's start by addressing the missing data.
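
The strategy above can be sketched on a small stand-in frame. This is an illustration under assumed values, not the notebook's data: `toy`, `age_median`, `region_mode`, and `filled` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (illustrative values, not the notebook's data)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 100.0],
    "Region": ["North", None, "North", "South"],
})

age_median = toy["Age"].median()        # median() skips NaN by default
region_mode = toy["Region"].mode()[0]   # mode() returns a Series; take its first entry

# Passing a dict to fillna fills each column with its own value
# and returns a new frame, leaving `toy` unchanged
filled = toy.fillna({"Age": age_median, "Region": region_mode})

print(filled)
```

In this toy example the mean of the observed ages would be 50, pulled up by the single value of 100, while the median stays at 30.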


In [5]:
# Impute missing values by assigning the filled column back.
# Calling fillna(..., inplace=True) on df[col] is chained assignment:
# it raises a FutureWarning and will stop working in pandas 3.0.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Region"] = df["Region"].fillna(df["Region"].mode()[0])

# Display the result
df.head()


| | Customer_ID | Age | Region | Product | Rating |
|---|---|---|---|---|---|
| 0 | 1 | 42.0 | East | Special | 24.727613 |
| 1 | 2 | 69.0 | North | B | 2.043317 |
| 2 | 3 | 46.0 | North | C | 4.985015 |
| 3 | 4 | 32.0 | East | B | 4.861677 |
| 4 | 5 | 60.0 | North | C | 3.233174 |
