### 1. Handling Missing Values
**Definition:**

Missing values in a dataset are entries for which no data was recorded. They can arise for many reasons, such as sensor failure, skipped form fields, or data corruption.

- Why handle them?

  - Many ML algorithms can't process missing values directly.

  - They can bias or distort analysis if not treated properly.

  - Clean data ensures more reliable, interpretable insights.
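As a small illustration of the points above, the sketch below shows how a single missing entry affects a simple statistic. The `readings` array is made-up data, used only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; the third one was never recorded.
readings = np.array([2.0, 4.0, np.nan, 6.0])

# Plain NumPy reductions propagate NaN, so one gap hides the whole result.
plain_mean = np.mean(readings)

# NaN-aware variants skip the gap instead.
nan_aware_mean = np.nanmean(readings)

# pandas skips NaN by default (skipna=True). That is convenient, but it can
# hide how much data is missing, so it helps to count the gaps explicitly.
series = pd.Series(readings)
pandas_mean = series.mean()
n_missing = int(series.isna().sum())

print(plain_mean, nan_aware_mean, pandas_mean, n_missing)
```

Counting the gaps before computing anything makes it easier to decide later whether to drop or impute.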

### 2. Dropping and Imputing Columns/Rows with Missing Data

### 2.1 Dropping Columns/Rows

**Definition:**

Dropping removes columns or rows entirely if they're considered uninformative or too incomplete to fix.

- When to drop?

  - If a column has too many missing values (e.g., >30%).

  - If a row is missing crucial data and cannot be imputed reasonably.

Code Example:

In [1]:
import numpy as np
import pandas as pd

# Example DataFrame
data = {
    'A': [1, np.nan, 3, 4, 5],
    'B': [np.nan, np.nan, np.nan, 4, 5],
    'C': [1, 2, 3, np.nan, 5]
}

df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1.0,,1.0
1,,,2.0
2,3.0,,3.0
3,4.0,4.0,
4,5.0,5.0,5.0


In [2]:
# Percentage of missing values per column
(df.isnull().sum()/df.shape[0])*100

Unnamed: 0,0
A,20.0
B,60.0
C,20.0


In [3]:
# Drop rows with ANY missing values
df_drop_rows = df.dropna(axis=0)
print("\nAfter dropping rows with any missing values:\n", df_drop_rows)


After dropping rows with any missing values:
      A    B    C
4  5.0  5.0  5.0


In [4]:
# Drop columns with ANY missing values
df_drop_cols = df.dropna(axis=1)
print("\nAfter dropping columns with any missing values:\n", df_drop_cols)


After dropping columns with any missing values:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]


In [5]:
# Drop columns with more than 50% missing values.
# `thresh` is the minimum number of NON-missing values a column needs to be kept.
min_non_missing = int(np.ceil(0.5 * len(df)))
df_drop_threshold = df.dropna(axis=1, thresh=min_non_missing)
df_drop_threshold

Unnamed: 0,A,C
0,1.0,1.0
1,,2.0
2,3.0,3.0
3,4.0,
4,5.0,5.0
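
The "row is missing crucial data" case from the list above can be sketched with the `subset` argument of `dropna`. Treating column `A` as the crucial one is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3, 4, 5],
    'B': [np.nan, np.nan, np.nan, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
})

# Assumption for illustration: 'A' is the column the analysis cannot do without.
crucial_cols = ['A']

# Drop only the rows where a crucial column is missing; gaps elsewhere stay.
df_drop_crucial = df.dropna(subset=crucial_cols)
print(df_drop_crucial)
```

Compared with `dropna(axis=0)`, which kept a single row here, this keeps four rows and leaves the remaining gaps in `B` and `C` for imputation.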


### 2.2 Imputing (Filling) Missing Values
**Definition:**

Imputation replaces missing values with substituted values based on various strategies.

- When to impute?

  - When the column is important for analysis.

  - When the missingness is not substantial, and values can be plausibly estimated.

Possible Imputation Techniques & Code Examples:

a) Fill with a Constant (e.g., 0, "unknown")

In [6]:
df_constant = df.fillna(0)
df_constant

Unnamed: 0,A,B,C
0,1.0,0.0,1.0
1,0.0,0.0,2.0
2,3.0,0.0,3.0
3,4.0,4.0,0.0
4,5.0,5.0,5.0
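
When columns need different constants (for example `0` for a number and `"unknown"` for a label), `fillna` also accepts a dictionary keyed by column name. The sketch below uses a small made-up DataFrame; the column names `score` and `city` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type data for illustration only.
df_mixed = pd.DataFrame({
    'score': [1.0, np.nan, 3.0],
    'city': ['Paris', None, 'Rome'],
})

# One constant per column: numeric gaps become 0, text gaps become "unknown".
fill_values = {'score': 0, 'city': 'unknown'}
df_mixed_constant = df_mixed.fillna(value=fill_values)
print(df_mixed_constant)
```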


b) Fill with Mean / Median / Mode

In [7]:
# Fill numerical columns with mean
df_mean = df.fillna(df.mean(numeric_only=True))
df_mean

Unnamed: 0,A,B,C
0,1.0,4.5,1.0
1,3.25,4.5,2.0
2,3.0,4.5,3.0
3,4.0,4.0,2.75
4,5.0,5.0,5.0


In [8]:
# Fill numerical columns with median
df_median = df.fillna(df.median(numeric_only=True))
df_median

Unnamed: 0,A,B,C
0,1.0,4.5,1.0
1,3.5,4.5,2.0
2,3.0,4.5,3.0
3,4.0,4.0,2.5
4,5.0,5.0,5.0


In [9]:
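# Fill categorical columns with mode.
# Note: every column in this df is numeric (float64), so nothing is filled here
# and the output below is identical to the original df.
df_mode = df.apply(lambda x: x.fillna(x.mode()[0]) if x.dtype == 'O' else x)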
df_mode

Unnamed: 0,A,B,C
0,1.0,,1.0
1,,,2.0
2,3.0,,3.0
3,4.0,4.0,
4,5.0,5.0,5.0
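
Because the example DataFrame has no categorical columns, the mode fill above leaves it unchanged. The sketch below shows the same idea on a made-up DataFrame that does contain a text column; `color` and the helper `fill_mode` are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one categorical column.
df_cat = pd.DataFrame({
    'A': [1, np.nan, 3, 4, 5],
    'color': ['red', 'blue', np.nan, 'red', np.nan],
})

def fill_mode(col):
    """Fill non-numeric columns with their most frequent value."""
    if pd.api.types.is_numeric_dtype(col):
        return col                      # leave numeric columns untouched
    modes = col.mode()
    if modes.empty:                     # column is entirely missing
        return col
    return col.fillna(modes.iloc[0])

df_cat_mode = df_cat.apply(fill_mode)
print(df_cat_mode)
```

Checking `modes.empty` avoids the `IndexError` that `x.mode()[0]` raises when a column has no observed values at all.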


c) Forward Fill / Backward Fill (for time series or panel data)

In [10]:
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the direct equivalent
df_ffill = df.ffill()
df_ffill

Unnamed: 0,A,B,C
0,1.0,,1.0
1,1.0,,2.0
2,3.0,,3.0
3,4.0,4.0,3.0
4,5.0,5.0,5.0


In [11]:
# fillna(method='bfill') is deprecated in recent pandas; bfill() is the direct equivalent
df_bfill = df.bfill()
df_bfill

Unnamed: 0,A,B,C
0,1.0,4.0,1.0
1,3.0,4.0,2.0
2,3.0,4.0,3.0
3,4.0,4.0,5.0
4,5.0,5.0,5.0
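
Two practical details of forward and backward fill are worth a sketch: a long gap can be capped with `limit`, and a leading gap (which forward fill cannot reach) can be closed by chaining a backward fill. The daily series below is made-up data for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a leading gap and a three-day gap.
ts = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, np.nan, 14.0],
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
)

# Carry the last observation forward by at most one step,
# so long gaps are not papered over with stale values.
ts_ffill_limited = ts.ffill(limit=1)

# Forward fill everything, then backward fill whatever is left at the start.
ts_filled = ts.ffill().bfill()

print(ts_ffill_limited)
print(ts_filled)
```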


d) Advanced: KNN Imputer / Iterative Imputer (for large projects)

In [12]:
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)
df_knn

Unnamed: 0,A,B,C
0,1.0,4.5,1.0
1,2.0,5.0,2.0
2,3.0,4.5,3.0
3,4.0,4.0,4.0
4,5.0,5.0,5.0
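
The heading above also mentions the Iterative Imputer, which the cell does not demonstrate. A minimal sketch follows, assuming scikit-learn's `IterativeImputer` with its default estimator; the class is still marked experimental, so it has to be enabled explicitly. The values of `max_iter` and `random_state` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'A': [1, np.nan, 3, 4, 5],
    'B': [np.nan, np.nan, np.nan, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
})

# Each column with gaps is modelled from the other columns, round after round.
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
df_iter = pd.DataFrame(
    iter_imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)
print(df_iter.round(2))
```

With only five rows the fitted values are not meaningful; the point is the call pattern, which mirrors `KNNImputer`.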


### 3. Choosing the Best Approach
**Guidelines:**

  - Drop columns if lots of information is missing (>30–50%), but consider domain importance.

  - Impute when missingness is moderate, using domain knowledge to choose the technique.

  - Use mean/median for numeric, mode for categorical, and advanced methods for critical variables.
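
The guidelines above can be turned into a rough per-column checklist. The sketch below is only one possible encoding: the helper `suggest_strategy`, its 50% default threshold, and the sample data are assumptions for illustration, and domain knowledge should override its output.

```python
import numpy as np
import pandas as pd

def suggest_strategy(df, drop_threshold=0.5):
    """Return a rough suggestion per column based on missing share and dtype."""
    suggestions = {}
    missing_frac = df.isna().mean()          # share of missing values per column
    for col in df.columns:
        frac = missing_frac[col]
        if frac == 0:
            suggestions[col] = 'keep as is'
        elif frac > drop_threshold:
            suggestions[col] = 'consider dropping'
        elif pd.api.types.is_numeric_dtype(df[col]):
            suggestions[col] = 'impute with mean/median'
        else:
            suggestions[col] = 'impute with mode'
    return suggestions

# Hypothetical data for illustration only.
sample = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'A': [1, np.nan, 3, 4, 5],
    'B': [np.nan, np.nan, np.nan, 4, 5],
    'city': ['x', None, 'x', 'y', 'y'],
})
suggestions = suggest_strategy(sample)
print(suggestions)
```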

In [14]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Create a dummy dataset with missing values
data = {
    'Age': [25, np.nan, 30, 22, 40, np.nan, 28],
    'Salary': [50000, 60000, np.nan, 52000, 58000, 62000, np.nan],
    'City': ['New York', 'Los Angeles', 'New York', np.nan, 'Chicago', 'Chicago', 'Los Angeles'],
    'Purchased': ['Yes', 'No', np.nan, 'No', 'Yes', 'Yes', 'No']
}

df_dummy = pd.DataFrame(data)
print("Original Dummy Dataset with Missing Values:")
print(df_dummy)




Original Dummy Dataset with Missing Values:
    Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1   NaN  60000.0  Los Angeles        No
2  30.0      NaN     New York       NaN
3  22.0  52000.0          NaN        No
4  40.0  58000.0      Chicago       Yes
5   NaN  62000.0      Chicago       Yes
6  28.0      NaN  Los Angeles        No


# Assignment Tasks (interns should attempt):
# 1. Drop columns or rows with excessive missing values.

In [17]:
# Drop rows with any missing values
df_drop_rows = df_dummy.dropna(axis=0)
print("\nAfter dropping rows with any missing values:\n", df_drop_rows)


After dropping rows with any missing values:
     Age   Salary      City Purchased
0  25.0  50000.0  New York       Yes
4  40.0  58000.0   Chicago       Yes


In [19]:
# Drop columns with any missing values
df_drop_columns = df_dummy.dropna(axis=1)
print("\nAfter dropping columns with any missing values:\n", df_drop_columns)


After dropping columns with any missing values:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]



# 2. Impute missing numerical columns (Age, Salary) with mean or median.

In [20]:
# Fill numerical columns with mean
df_mean = df_dummy.fillna(df_dummy.mean(numeric_only=True))
df_mean

Unnamed: 0,Age,Salary,City,Purchased
0,25.0,50000.0,New York,Yes
1,29.0,60000.0,Los Angeles,No
2,30.0,56400.0,New York,
3,22.0,52000.0,,No
4,40.0,58000.0,Chicago,Yes
5,29.0,62000.0,Chicago,Yes
6,28.0,56400.0,Los Angeles,No


In [21]:
# Fill numerical columns with median
df_median = df_dummy.fillna(df_dummy.median(numeric_only=True))
df_median

Unnamed: 0,Age,Salary,City,Purchased
0,25.0,50000.0,New York,Yes
1,28.0,60000.0,Los Angeles,No
2,30.0,58000.0,New York,
3,22.0,52000.0,,No
4,40.0,58000.0,Chicago,Yes
5,28.0,62000.0,Chicago,Yes
6,28.0,58000.0,Los Angeles,No
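
The mean and median fills above give similar values because the dummy data has no extreme entries. The sketch below, on made-up salaries with one outlier, shows why the median is often the safer choice for skewed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one extreme value and one missing entry.
salaries = pd.Series([50000, 52000, 58000, 60000, 62000, 1_000_000, np.nan])

mean_value = salaries.mean()      # pulled far upward by the outlier
median_value = salaries.median()  # stays near the typical salary

filled_with_mean = salaries.fillna(mean_value)
filled_with_median = salaries.fillna(median_value)

print(round(mean_value, 2), median_value)
```

Here the mean-based fill would insert a salary above 200,000, while the median-based fill inserts 59,000.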



# 3. Impute missing categorical columns (City, Purchased) with mode.

In [22]:
# Copy the mean-imputed DataFrame
df_imputed = df_mean.copy()

# Fill categorical NaNs with the mode (most frequent value).
# On ties, mode() returns the tied values sorted, so [0] picks the first one alphabetically.
for col in ['City', 'Purchased']:
    mode_value = df_imputed[col].mode()[0]
    # Assign back instead of calling fillna(..., inplace=True) on a column selection,
    # which triggers a FutureWarning and stops working in pandas 3.0
    df_imputed[col] = df_imputed[col].fillna(mode_value)

print("\nAfter mean + mode imputation:\n", df_imputed)



After mean + mode imputation:
     Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1  29.0  60000.0  Los Angeles        No
2  30.0  56400.0     New York        No
3  22.0  52000.0      Chicago        No
4  40.0  58000.0      Chicago       Yes
5  29.0  62000.0      Chicago       Yes
6  28.0  56400.0  Los Angeles        No


# 4. Optionally, apply KNN imputation for numerical columns.

In [23]:
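from sklearn.impute import KNNImputer

# Work on a copy so df_dummy stays untouched
df_knn = df_dummy.copy()

# KNNImputer only accepts numeric data, so restrict it to the numeric columns
numeric_cols = ['Age', 'Salary']
imputer = KNNImputer(n_neighbors=2)
df_knn[numeric_cols] = imputer.fit_transform(df_knn[numeric_cols])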

print("\nAfter KNN Imputation (numerical only):\n", df_knn)



After KNN Imputation (numerical only):
     Age   Salary         City Purchased
0  25.0  50000.0     New York       Yes
1  31.0  60000.0  Los Angeles        No
2  30.0  51000.0     New York       NaN
3  22.0  52000.0          NaN        No
4  40.0  58000.0      Chicago       Yes
5  31.0  62000.0      Chicago       Yes
6  28.0  51000.0  Los Angeles        No
