Handling Missing Values

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample dataset with missing values
data = {'Feature1': [10, 20, np.nan, 40, 50], 
        'Feature2': [1, np.nan, 3, 4, 5]}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# Replace NaN with column mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nData after Imputation:\n", df_imputed)


Original Data:
    Feature1  Feature2
0      10.0       1.0
1      20.0       NaN
2       NaN       3.0
3      40.0       4.0
4      50.0       5.0

Data after Imputation:
    Feature1  Feature2
0      10.0      1.00
1      20.0      3.25
2      30.0      3.00
3      40.0      4.00
4      50.0      5.00
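
The fill values used above can be inspected directly, and the mean is not the only option. The following is a minimal sketch, assuming the same sample data as the cell above: it reads the per-column means from the fitted imputer's `statistics_` attribute and then repeats the imputation with `strategy='median'`, which is often preferred when a column contains outliers. The names `mean_imputer`, `median_imputer`, and `df_median` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as in the imputation cell
df = pd.DataFrame({'Feature1': [10, 20, np.nan, 40, 50],
                   'Feature2': [1, np.nan, 3, 4, 5]})

# After fitting, statistics_ holds the value used to fill each column
# (here the column means, computed with NaN entries ignored)
mean_imputer = SimpleImputer(strategy='mean').fit(df)
learned_means = mean_imputer.statistics_
print("Learned fill values (mean):", learned_means)

# The median strategy is less sensitive to extreme values than the mean
median_imputer = SimpleImputer(strategy='median')
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)
print("\nData after median imputation:\n", df_median)
```

In a train/test workflow, the imputer would normally be fitted on the training split only and then applied to the test split with `transform`, so that test data does not influence the fill values.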


Standardizing Data
Feature Scaling (Normalization & Standardization)
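Many ML models, particularly distance-based ones (k-nearest neighbors, SVMs) and those trained with gradient descent, work better when features are on comparable scales. Scikit-Learn provides:

MinMaxScaler (Normalization) – Scales each feature to a fixed range, [0, 1] by default.
StandardScaler (Standardization) – Rescales each feature to mean 0 and standard deviation 1.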

In [2]:
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = [[10, 200], [15, 250], [30, 300], [40, 400]]
df = pd.DataFrame(data, columns=['Feature1', 'Feature2'])

# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

print("\nStandardized Data:\n", scaled_data)



Standardized Data:
 [[-1.15311332 -1.18321596]
 [-0.73379939 -0.50709255]
 [ 0.52414242  0.16903085]
 [ 1.36277029  1.52127766]]
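
MinMaxScaler is listed above but not demonstrated. The following is a minimal sketch, assuming the same four-row sample data as the StandardScaler cell: it applies `MinMaxScaler` with its default range and checks the result against the formula (x - min) / (max - min) computed by hand. The variable names `normalized` and `manual` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same sample data as in the StandardScaler cell
data = [[10, 200], [15, 250], [30, 300], [40, 400]]
df = pd.DataFrame(data, columns=['Feature1', 'Feature2'])

# Default feature_range is (0, 1): each column's min maps to 0, its max to 1
minmax = MinMaxScaler()
normalized = minmax.fit_transform(df)

# Manual check of the same transformation, column by column
X = df.to_numpy(dtype=float)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print("\nNormalized Data:\n", normalized)
print("\nMatches manual formula:", np.allclose(normalized, manual))
```

Unlike standardization, min-max scaling is bounded, but a single extreme value compresses all other values in that column toward one end of the range.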


Encoding Categorical Variables
Most Scikit-Learn estimators expect numeric input, so categorical features must be converted to numbers.
Scikit-Learn provides:

Label Encoding (LabelEncoder) – Converts categories to integers 0, 1, 2, … Scikit-Learn intends LabelEncoder for target labels; OrdinalEncoder is the equivalent for input features.
One-Hot Encoding (OneHotEncoder) – Creates one binary column per category. For example, Red, Blue, and Green become three separate binary columns.
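
Label encoding, the first option in the list above, can be sketched as follows. This is a minimal illustration, assuming the same five color values used in the one-hot example; the names `colors`, `codes`, and `decoded` are illustrative only. Note that the integer codes follow the sorted order of the class names, not the order in which the values appear.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels (same values as the one-hot example)
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ lists the categories in sorted order; a category's
# position in this array is its integer code
print("Classes:", label_encoder.classes_)
print("Codes:  ", codes)

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(codes)
print("Decoded:", decoded)
```

Because the codes are ordered integers, a model may read them as a ranking (Blue < Green < Red). For nominal features with no natural order, one-hot encoding avoids that implication.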

In [3]:
from sklearn.preprocessing import OneHotEncoder

# Sample categorical data
data = [['Red'], ['Blue'], ['Green'], ['Blue'], ['Red']]
df = pd.DataFrame(data, columns=['Color'])

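# Apply OneHotEncoder; sparse_output=False returns a dense NumPy array
# (this parameter name is used by scikit-learn 1.2+; older releases call it `sparse`)
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df)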

print("\nOne-Hot Encoded Data:\n", encoded_data)



One-Hot Encoded Data:
 [[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]


Feature Selection (Removing Irrelevant Features)
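Too many features can cause overfitting. SelectKBest scores each feature against the target with a statistical test (here f_classif, the ANOVA F-test) and keeps only the k highest-scoring features.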

Example: Selecting Top 2 Features


In [4]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Select the best 2 features
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print("\nSelected Features Shape:", X_new.shape)



Selected Features Shape: (150, 2)
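
The output shape confirms that two columns were kept, but not which ones. The following is a minimal sketch, assuming the same Iris data and selector settings as the cell above: it prints the F-score of each feature and uses `get_support()` to recover the names of the selected features. The names `mask` and `selected_names` are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# scores_ holds the ANOVA F-score computed for every input feature
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.2f}")

# get_support() returns a boolean mask marking the features that were kept
mask = selector.get_support()
selected_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("\nSelected features:", selected_names)
```

Because f_classif tests each feature on its own, two highly correlated features can both score well and both be selected, even though the second adds little new information.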
