
### 1. Handling Missing Values
**Definition:**

Missing values in a dataset are entries for which no data was recorded. They can arise due to various reasons (sensor failure, skipped fields, data corruption, etc.).

- Why handle them?

  - Many ML algorithms can't process missing values directly.

  - They can bias or distort analysis if not treated properly.

  - Clean data ensures more reliable, interpretable insights.
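
Before choosing a treatment, it helps to measure how much is missing. The following is a minimal sketch on a hypothetical toy frame (the column names `speed` and `weather` are made up for illustration and are not taken from any real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the counting calls
toy = pd.DataFrame({
    "speed": [30.0, np.nan, 45.0, 50.0],
    "weather": ["Rain", None, "Clear", None],
})

missing_count = toy.isna().sum()        # number of missing entries per column
missing_pct = toy.isna().mean() * 100   # percentage of missing entries per column

print(missing_count)
print(missing_pct)
```

`isna().mean()` works because `True` counts as 1 and `False` as 0, so the mean is the missing fraction.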

### 2. Dropping and Imputing Columns/Rows with Missing Data

### 2.1 Dropping Columns/Rows

**Definition:**

Dropping removes columns or rows entirely when they are uninformative or too incomplete to repair.

- When to drop?

  - If a column has too many missing values (e.g., >30%).

  - If a row is missing crucial data and cannot be imputed reasonably.

Code Example:

In [None]:
import numpy as np
import pandas as pd

# Example DataFrame
data = {
    'A': [1, np.nan, 3, 4, 5],
    'B': [np.nan, np.nan, np.nan, 4, 5],
    'C': [1, 2, 3, np.nan, 5]
}

df = pd.DataFrame(data)
df

     A    B    C
0  1.0  NaN  1.0
1  NaN  NaN  2.0
2  3.0  NaN  3.0
3  4.0  4.0  NaN
4  5.0  5.0  5.0


In [None]:
(df.isnull().sum()/df.shape[0])*100

A    20.0
B    60.0
C    20.0
dtype: float64


In [None]:
# Drop rows with ANY missing values
df_drop_rows = df.dropna(axis=0)
print("\nAfter dropping rows with any missing values:\n", df_drop_rows)


After dropping rows with any missing values:
      A    B    C
4  5.0  5.0  5.0


In [None]:
# Drop columns with ANY missing values
df_drop_cols = df.dropna(axis=1)
print("\nAfter dropping columns with any missing values:\n", df_drop_cols)


After dropping columns with any missing values:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]


In [None]:
# Drop columns that are more than 50% missing.
# `thresh` is the minimum number of NON-missing values a column needs to be kept.
threshold = int(np.ceil(0.5 * len(df)))
df_drop_threshold = df.dropna(axis=1, thresh=threshold)
df_drop_threshold

     A    C
0  1.0  1.0
1  NaN  2.0
2  3.0  3.0
3  4.0  NaN
4  5.0  5.0
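
The row-level rule from the "When to drop?" list (drop a row when it is missing crucial data) can be sketched with the `subset` argument of `dropna`. Treating column `A` as the crucial field is an assumption made only for this illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A": [1, np.nan, 3, 4, 5],
    "B": [np.nan, np.nan, np.nan, 4, 5],
    "C": [1, 2, 3, np.nan, 5],
})

# Assumption for this sketch: "A" is the crucial column.
# Rows are dropped only when "A" is missing; gaps in "B" and "C" are left alone.
rows_kept = toy.dropna(subset=["A"])
print(rows_kept)
```

Compared with `dropna(axis=0)`, which keeps only one row of this frame, restricting the check to the crucial column keeps four.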


### 2.2 Imputing (Filling) Missing Values
**Definition:**

Imputation replaces missing values with substituted values based on various strategies.

- When to impute?

  - When the column is important for analysis.

  - When the missingness is not substantial, and values can be plausibly estimated.

Possible Imputation Techniques & Code Examples:

a) Fill with a Constant (e.g., 0, "unknown")

In [None]:
df_constant = df.fillna(0)
df_constant

     A    B    C
0  1.0  0.0  1.0
1  0.0  0.0  2.0
2  3.0  0.0  3.0
3  4.0  4.0  0.0
4  5.0  5.0  5.0
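
When columns need different constants (for example 0 for a number and "unknown" for a label), `fillna` also accepts a dict keyed by column name. A minimal sketch on a hypothetical frame (the columns `count` and `city` are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame
toy = pd.DataFrame({
    "count": [1.0, np.nan, 3.0],
    "city": ["Chicago", None, "New York"],
})

# One constant per column: 0 for the numeric column, "unknown" for the text column
filled = toy.fillna({"count": 0, "city": "unknown"})
print(filled)
```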


b) Fill with Mean / Median / Mode

In [None]:
# Fill numerical columns with mean
df_mean = df.fillna(df.mean(numeric_only=True))
df_mean

      A    B     C
0  1.00  4.5  1.00
1  3.25  4.5  2.00
2  3.00  4.5  3.00
3  4.00  4.0  2.75
4  5.00  5.0  5.00


In [None]:
# Fill numerical columns with median
df_median = df.fillna(df.median(numeric_only=True))
df_median

     A    B    C
0  1.0  4.5  1.0
1  3.5  4.5  2.0
2  3.0  4.5  3.0
3  4.0  4.0  2.5
4  5.0  5.0  5.0


In [None]:
# Fill categorical (object-dtype) columns with their mode.
# This example df is all numeric, so it comes back unchanged.
df_mode = df.apply(lambda x: x.fillna(x.mode()[0]) if x.dtype == 'O' else x)
df_mode

     A    B    C
0  1.0  NaN  1.0
1  NaN  NaN  2.0
2  3.0  NaN  3.0
3  4.0  4.0  NaN
4  5.0  5.0  5.0
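
Because the example frame has no categorical columns, the mode fill above has nothing to act on. The sketch below uses a hypothetical frame with one text column (`city`) to show the fill taking effect. Selecting columns by excluding numeric dtypes, rather than testing for object dtype, is a design choice made here so that string and category columns are covered as well:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
toy = pd.DataFrame({
    "city": ["Chicago", np.nan, "Chicago", "New York"],
    "value": [1.0, np.nan, 3.0, 4.0],
})

# Every non-numeric column is treated as categorical in this sketch
cat_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
for col in cat_cols:
    # mode() returns all most-frequent values in sorted order; [0] takes the first
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

The numeric column is left untouched, so its gap would still need a mean, median, or other numeric strategy.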


c) Forward Fill / Backward Fill (for time series or panel data)

In [None]:
# fillna(method='ffill') is deprecated; DataFrame.ffill() is the direct replacement
df_ffill = df.ffill()
df_ffill


     A    B    C
0  1.0  NaN  1.0
1  1.0  NaN  2.0
2  3.0  NaN  3.0
3  4.0  4.0  3.0
4  5.0  5.0  5.0


In [None]:
# fillna(method='bfill') is deprecated; DataFrame.bfill() is the direct replacement
df_bfill = df.bfill()
df_bfill


     A    B    C
0  1.0  4.0  1.0
1  3.0  4.0  2.0
2  3.0  4.0  3.0
3  4.0  4.0  5.0
4  5.0  5.0  5.0
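
Forward fill cannot fill a gap at the start of a column because there is no earlier value to carry forward, which is why column `B` keeps its leading missing values after a forward fill. One possible workaround, sketched here on the same example data, is to chain a backward fill after the forward fill. Whether that is appropriate depends on the data: in a time series it copies later observations into earlier rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A": [1, np.nan, 3, 4, 5],
    "B": [np.nan, np.nan, np.nan, 4, 5],
    "C": [1, 2, 3, np.nan, 5],
})

# Forward fill first, then backward fill whatever is still missing at the top
filled = toy.ffill().bfill()
print(filled)
```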


d) Advanced: KNN Imputer / Iterative Imputer (for large projects)

In [None]:
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)
df_knn

     A    B    C
0  1.0  4.5  1.0
1  2.0  5.0  2.0
2  3.0  4.5  3.0
3  4.0  4.0  4.0
4  5.0  5.0  5.0
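
The heading also mentions the Iterative Imputer, which is not demonstrated above. Below is a minimal sketch using scikit-learn's `IterativeImputer`, which models each column with missing values as a function of the other columns. The toy frame, `max_iter=10`, and `random_state=0` are assumptions chosen for illustration. The estimator is still marked experimental in scikit-learn, so it has to be enabled with an explicit import:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, needed to expose IterativeImputer
from sklearn.impute import IterativeImputer

# Hypothetical toy data where y is roughly 2 * x
toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
})

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

Observed values pass through unchanged; only the missing cells are replaced by model estimates.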


### 3. Choosing the Best Approach
**Guidelines:**

  - Drop columns if lots of information is missing (>30–50%), but consider domain importance.

  - Impute when missingness is moderate, using domain knowledge to choose the technique.

  - Use mean/median for numeric, mode for categorical, and advanced methods for critical variables.
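
These guidelines can be sketched as a small helper that looks at each column's missing fraction and dtype. The function name `suggest_strategy`, the 50% cutoff, and the wording of the suggestions are all hypothetical; the output is a starting point for discussion, not a substitute for domain judgment:

```python
import numpy as np
import pandas as pd

def suggest_strategy(frame, drop_above=0.5):
    """Return a rough per-column suggestion from missing fraction and dtype."""
    suggestions = {}
    for col in frame.columns:
        frac = frame[col].isna().mean()  # fraction of missing entries
        if frac == 0:
            suggestions[col] = "keep as is"
        elif frac > drop_above:
            suggestions[col] = "consider dropping"
        elif pd.api.types.is_numeric_dtype(frame[col]):
            suggestions[col] = "impute with mean/median"
        else:
            suggestions[col] = "impute with mode"
    return suggestions

# Hypothetical toy data
toy = pd.DataFrame({
    "A": [1, np.nan, 3, 4, 5],             # 20% missing, numeric
    "B": [np.nan, np.nan, np.nan, 4, 5],   # 60% missing
    "city": ["NY", None, "LA", "NY", "NY"],  # 20% missing, categorical
    "id": [1, 2, 3, 4, 5],                 # complete
})

result = suggest_strategy(toy)
print(result)
```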

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Create a dummy dataset with missing values
data = {
    'Age': [25, np.nan, 30, 22, 40, np.nan, 28],
    'Salary': [50000, 60000, np.nan, 52000, 58000, 62000, np.nan],
    'City': ['New York', 'Los Angeles', 'New York', np.nan, 'Chicago', 'Chicago', 'Los Angeles'],
    'Purchased': ['Yes', 'No', np.nan, 'No', 'Yes', 'Yes', 'No']
}

df_dummy = pd.DataFrame(data)
print("Original Dummy Dataset with Missing Values:")
print(df_dummy)

# Assignment Tasks (interns should attempt):
# 1. Drop columns or rows with excessive missing values.
# 2. Impute missing numerical columns (Age, Salary) with mean or median.
# 3. Impute missing categorical columns (City, Purchased) with mode.
# 4. Optionally, apply KNN imputation for numerical columns.

Original Dummy Dataset with Missing Values:
    Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1   NaN  60000.0  Los Angeles        No
2  30.0      NaN     New York       NaN
3  22.0  52000.0          NaN        No
4  40.0  58000.0      Chicago       Yes
5   NaN  62000.0      Chicago       Yes
6  28.0      NaN  Los Angeles        No


**Assignment Tasks**

1. Drop columns or rows with excessive missing values.

In [3]:
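# `thresh` is the minimum count of non-missing values, so allowing up to
# 40% missing means requiring at least 60% non-missing.
threshold = int(np.ceil(0.6 * len(df_dummy)))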
df_step1 = df_dummy.dropna(axis=1, thresh=threshold)
print("\nAfter Dropping Columns with Excessive Missing Values:")
print(df_step1)


After Dropping Columns with Excessive Missing Values:
    Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1   NaN  60000.0  Los Angeles        No
2  30.0      NaN     New York       NaN
3  22.0  52000.0          NaN        No
4  40.0  58000.0      Chicago       Yes
5   NaN  62000.0      Chicago       Yes
6  28.0      NaN  Los Angeles        No


2. Impute missing numerical columns (Age, Salary) with mean or median.

In [4]:
df_step2 = df_dummy.copy()
for col in ['Age', 'Salary']:
    if col in df_step2.columns:
        # Assign the result back; calling fillna(..., inplace=True) on a
        # selected column is deprecated and will stop working in pandas 3.0
        df_step2[col] = df_step2[col].fillna(df_step2[col].mean())   # mean
        # OR use median:
        # df_step2[col] = df_step2[col].fillna(df_step2[col].median())

print("\nAfter Imputing Numerical Columns (Mean):")
print(df_step2)


After Imputing Numerical Columns (Mean):
    Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1  29.0  60000.0  Los Angeles        No
2  30.0  56400.0     New York       NaN
3  22.0  52000.0          NaN        No
4  40.0  58000.0      Chicago       Yes
5  29.0  62000.0      Chicago       Yes
6  28.0  56400.0  Los Angeles        No


3. Impute missing categorical columns (City, Purchased) with mode.

In [5]:
df_step3 = df_dummy.copy()
for col in ['City', 'Purchased']:
    if col in df_step3.columns:
        # mode() returns every most-frequent value in sorted order; both columns
        # have ties here, so [0] picks the alphabetically first tied value.
        # Assign the result back instead of using the deprecated inplace=True.
        df_step3[col] = df_step3[col].fillna(df_step3[col].mode()[0])

print("\nAfter Imputing Categorical Columns (Mode):")
print(df_step3)


After Imputing Categorical Columns (Mode):
    Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1   NaN  60000.0  Los Angeles        No
2  30.0      NaN     New York        No
3  22.0  52000.0      Chicago        No
4  40.0  58000.0      Chicago       Yes
5   NaN  62000.0      Chicago       Yes
6  28.0      NaN  Los Angeles        No


4. Optionally, apply KNN imputation for numerical columns.

In [6]:
numeric_cols = df_dummy[['Age', 'Salary']]  # only numeric

imputer = KNNImputer(n_neighbors=2)
numeric_imputed = imputer.fit_transform(numeric_cols)

df_knn = df_dummy.copy()
df_knn[['Age', 'Salary']] = numeric_imputed

print("\nAfter Applying KNN Imputer on Numerical Columns:")
print(df_knn)


After Applying KNN Imputer on Numerical Columns:
    Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1  31.0  60000.0  Los Angeles        No
2  30.0  51000.0     New York       NaN
3  22.0  52000.0          NaN        No
4  40.0  58000.0      Chicago       Yes
5  31.0  62000.0      Chicago       Yes
6  28.0  51000.0  Los Angeles        No
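

Each task above starts again from `df_dummy`, so none of the results is fully imputed on its own. One way the numeric and categorical steps might be combined into a single cleaned frame is sketched below. It uses the median rather than the mean for numeric columns, which is an illustrative choice, and `df_clean` is a name introduced here:

```python
import numpy as np
import pandas as pd

data = {
    'Age': [25, np.nan, 30, 22, 40, np.nan, 28],
    'Salary': [50000, 60000, np.nan, 52000, 58000, 62000, np.nan],
    'City': ['New York', 'Los Angeles', 'New York', np.nan, 'Chicago', 'Chicago', 'Los Angeles'],
    'Purchased': ['Yes', 'No', np.nan, 'No', 'Yes', 'Yes', 'No']
}
df_dummy = pd.DataFrame(data)

df_clean = df_dummy.copy()

# Split columns by type: numeric columns get the median, the rest get the mode
num_cols = df_clean.select_dtypes(include="number").columns
cat_cols = df_clean.columns.difference(num_cols)

df_clean[num_cols] = df_clean[num_cols].fillna(df_clean[num_cols].median())
for col in cat_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

print(df_clean)
```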
