# Data Preprocessing: The Essential First Step 🧹
Data Preprocessing is a data mining technique that transforms raw data into an understandable and usable format for machine learning algorithms. Real-world data is often incomplete, inconsistent, and dirty. If the data quality is poor, the results derived from it will be unreliable too, a concept often summarized as "Garbage In, Garbage Out (GIGO)."

The most critical preprocessing steps include:

- Handling Missing Data (NaNs).

- Feature Scaling (Normalization/Standardization).

- Encoding Categorical Variables.

## 1. Dealing with Missing Data (NaN Values) 
Missing data (often represented as NaN in Pandas) occurs when no value is stored for a feature in a particular observation. How you handle missing data can significantly impact your model's performance.

### 1.1 Identifying Missing Values
The first step is always to locate and quantify the missing data.

In [2]:
# CODE CELL 1: Setup and Loading a Dataset with Known Missing Values
import pandas as pd
import numpy as np
from io import StringIO

# Create a sample dataset with missing values
data = """
Age,Salary,Gender,Purchased
35,50000,Male,No
27,48000,Female,Yes
40,,Female,No
21,52000,Male,No
44,55000,,Yes
48,79000,Male,No
38,62000,Female,No
,83000,Male,Yes
"""

df = pd.read_csv(StringIO(data))
print("Original Data:")
print(df)
# Check for missing values in each column
print("\nMissing Values Count:")
print(df.isnull().sum())



Original Data:
    Age   Salary  Gender Purchased
0  35.0  50000.0    Male        No
1  27.0  48000.0  Female       Yes
2  40.0      NaN  Female        No
3  21.0  52000.0    Male        No
4  44.0  55000.0     NaN       Yes
5  48.0  79000.0    Male        No
6  38.0  62000.0  Female        No
7   NaN  83000.0    Male       Yes

Missing Values Count:
Age          1
Salary       1
Gender       1
Purchased    0
dtype: int64
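
Raw counts are easier to judge as a share of the dataset. The sketch below rebuilds the sample data in a separate frame (the names `sample`, `missing_pct`, and `rows_with_missing` are illustrative, not part of the original notebook) and reports the percentage of missing values per column, along with the affected rows.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the sample data shown above
sample = pd.DataFrame({
    "Age": [35, 27, 40, 21, 44, 48, 38, np.nan],
    "Salary": [50000, 48000, np.nan, 52000, 55000, 79000, 62000, 83000],
    "Gender": ["Male", "Female", "Female", "Male", np.nan, "Male", "Female", "Male"],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
})

# isna() yields booleans; their column-wise mean is the fraction missing
missing_pct = sample.isna().mean() * 100
print(missing_pct.round(1))

# Select the rows that contain at least one missing value
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```

Here each of `Age`, `Salary`, and `Gender` is missing in 1 of 8 rows (12.5%), and the affected rows are 2, 4, and 7.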


### 1.2 Strategies for Handling Missing Data
There are two primary methods for addressing missing data:

- Deletion: Removing rows or columns entirely.

- Imputation: Filling in the missing values.

**Strategy A: Deletion**
- Listwise Deletion (Removing Rows): If only a few rows have missing data, you can delete those rows. This is quick but can lead to data loss if many rows are affected.

- Deleting Columns: If a feature (column) is missing in the majority of observations, it may be best to drop the entire feature.

In [5]:
# CODE CELL 2: Deleting Rows with Missing Values (Use with caution!)
df_deleted = df.dropna()
print("\nData after Row Deletion:")
print(df_deleted)


Data after Row Deletion:
    Age   Salary  Gender Purchased
0  35.0  50000.0    Male        No
1  27.0  48000.0  Female       Yes
3  21.0  52000.0    Male        No
5  48.0  79000.0    Male        No
6  38.0  62000.0  Female        No
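
Row deletion removed 3 of the 8 observations here, which shows how quickly listwise deletion shrinks a small dataset. Column deletion can be sketched in a similar way. The example below uses a hypothetical frame `toy` with a mostly empty `Notes` column and an assumed cutoff of 50% missing; both are illustrative choices, and the right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'Notes' is missing in 4 of 5 rows
toy = pd.DataFrame({
    "Age": [35, 27, 40, 21, 44],
    "Salary": [50000, 48000, np.nan, 52000, 55000],
    "Notes": [np.nan, np.nan, "vip", np.nan, np.nan],
})

# Assumed cutoff: drop columns where more than half of the values are missing
max_missing_share = 0.5
missing_share = toy.isna().mean()

# Boolean mask over columns keeps only those at or below the cutoff
toy_reduced = toy.loc[:, missing_share <= max_missing_share]
print(toy_reduced.columns.tolist())  # ['Age', 'Salary']
```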


**Strategy B: Imputation**
Imputation means estimating the missing value based on the available data in that feature.

- Mean/Median Imputation (for Numerical Data):

    - Mean: Good for normally distributed data. However, it is sensitive to outliers.

    - Median: Better for data with outliers or skewed distributions.

- Mode Imputation (for Categorical Data): Using the most frequent category.
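
To see why the median is the safer fill value when outliers are present, the small sketch below compares both statistics on a made-up salary list before and after adding one extreme value. The numbers are hypothetical and only serve to illustrate the effect.

```python
from statistics import mean, median

# Hypothetical salaries without and with a single extreme value
salaries = [48000, 50000, 52000, 55000, 62000]
salaries_with_outlier = salaries + [1_000_000]

print(mean(salaries), median(salaries))
print(mean(salaries_with_outlier), median(salaries_with_outlier))
```

The single outlier pulls the mean from 53,400 to roughly 211,000, while the median only moves from 52,000 to 53,500.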

In [19]:
# CODE CELL 3: Mean/Median Imputation using Scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer

# 1. Impute 'Age' (numerical) with the median, which is less sensitive to outliers than the mean
imputer_age = SimpleImputer(missing_values=np.nan, strategy='median')
df['Age'] = imputer_age.fit_transform(df[['Age']])

# 2. Impute 'Salary' (Numerical) with the Mean
imputer_salary = SimpleImputer(missing_values=np.nan, strategy='mean')
df['Salary'] = imputer_salary.fit_transform(df[['Salary']])

# 3. Impute 'Gender' (Categorical) with the Most Frequent value (Mode)
imputer_gender = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df['Gender'] = imputer_gender.fit_transform(df[['Gender']]).flatten()

print("\nData after Imputation:")
print(df)
print("\nMissing Values Count after Imputation:")
print(df.isnull().sum()) # Should show 0 for all columns


Data after Imputation:
    Age        Salary  Gender Purchased
0  35.0  50000.000000    Male        No
1  27.0  48000.000000  Female       Yes
2  40.0  61285.714286  Female        No
3  21.0  52000.000000    Male        No
4  44.0  55000.000000    Male       Yes
5  48.0  79000.000000    Male        No
6  38.0  62000.000000  Female        No
7  38.0  83000.000000    Male       Yes

Missing Values Count after Imputation:
Age          0
Salary       0
Gender       0
Purchased    0
dtype: int64
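
For quick exploration, the same fills can also be written with plain pandas. This is a minimal sketch on an illustrative copy of the sample data (`raw` and `filled` are names introduced here), not a replacement for `SimpleImputer`: in a train/test workflow the fill statistics should be computed on the training split only and then reused, which is what a fitted imputer does.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the sample data, including its missing values
raw = pd.DataFrame({
    "Age": [35, 27, 40, 21, 44, 48, 38, np.nan],
    "Salary": [50000, 48000, np.nan, 52000, 55000, 79000, 62000, 83000],
    "Gender": ["Male", "Female", "Female", "Male", np.nan, "Male", "Female", "Male"],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
})

filled = raw.copy()
# Median for Age, mean for Salary, most frequent category for Gender
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].mean())
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])

print(filled)
print(filled.isna().sum())
```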


## 2. Feature Scaling 
Once missing data is handled, we often need to scale the features. This is critical for algorithms that calculate distances (like KNN or SVM) or rely on gradient descent (like Linear Regression and Neural Networks).
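
A short numeric sketch shows why unscaled features distort distance calculations. The three points below are hypothetical `[Age, Salary]` pairs chosen for illustration: a 33-year age gap barely registers next to a salary difference of 1,000.

```python
import numpy as np

# Hypothetical people as [Age, Salary]
a = np.array([25.0, 30000.0])
b = np.array([58.0, 31000.0])  # very different age, similar salary
c = np.array([26.0, 60000.0])  # similar age, very different salary

# Euclidean distances on the raw, unscaled features
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab, dist_ac)

# Share of the squared distance between a and b that comes from Age
age_share = (a[0] - b[0]) ** 2 / dist_ab ** 2
print(f"{age_share:.4f}")
```

Age contributes about 0.1% of the squared distance between `a` and `b`, so a distance-based model would effectively ignore it until the features are put on comparable scales.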

**Normalization (Min-Max Scaling):** Rescales the data to a fixed range, usually 0 to 1.

**Standardization (Z-score):** Rescales the data to have a mean of 0 and a standard deviation of 1.

### 2.1 Setup and Sample Data
We start with a simple numerical dataset where the features have different scales, making scaling necessary.

In [21]:
# CODE CELL 1: Setup and Sample Data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Sample Data: Features are on different scales (Age vs. Salary)
data = {
    'Age': [25, 30, 45, 22, 58],
    'Salary': [30000, 60000, 120000, 20000, 90000],
    'Target': [0, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Separate features (X) from the target (y)
X = df[['Age', 'Salary']]
y = df['Target']

# Split data (Crucial: Fit scaler only on training data!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

print("--- Original Training Data (X_train) ---")
print(X_train)

--- Original Training Data (X_train) ---
   Age  Salary
2   45  120000
0   25   30000
3   22   20000


### 2.2 Standardization (StandardScaler)
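Standardization (or Z-score normalization) transforms each feature so that it has a mean of 0 and a standard deviation of 1:

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the feature's mean and standard deviation, computed on the training data. It is generally preferred for algorithms that expect centered, comparably scaled inputs. It is also less affected by outliers than Min-Max Scaling, because it does not squeeze the data into a fixed range set by the most extreme values, although $\mu$ and $\sigma$ are themselves still influenced by outliers.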

In [26]:
# CODE CELL 2: Standardization
# 1. Create the StandardScaler object
scaler_standard = StandardScaler()

# 2. Fit the scaler on the TRAINING data ONLY
scaler_standard.fit(X_train)

# 3. Transform both the training and test sets
X_train_scaled_standard = scaler_standard.transform(X_train)
X_test_scaled_standard = scaler_standard.transform(X_test)

# Convert back to DataFrame for better viewing
X_train_scaled_standard_df = pd.DataFrame(
    X_train_scaled_standard, 
    columns=X_train.columns, 
    index=X_train.index
)

print("\n--- Standardized Training Data (Mean=0, Std=1) ---")
print(X_train_scaled_standard_df)
print(f"\nMean of 'Age' after standardization: {X_train_scaled_standard_df['Age'].mean():.2f}")
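# StandardScaler divides by the population standard deviation (ddof=0).
# pandas' .std() defaults to the sample standard deviation (ddof=1), so pass ddof=0 to compare.
print(f"Standard Deviation of 'Age': {X_train_scaled_standard_df['Age'].std(ddof=0):.2f}")


--- Standardized Training Data (Mean=0, Std=1) ---
       Age    Salary
2  1.40400  1.408374
0 -0.55507 -0.592999
3 -0.84893 -0.815374

Mean of 'Age' after standardization: -0.00
Standard Deviation of 'Age': 1.00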
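
The z-score formula can be checked by hand with NumPy. The sketch below recomputes the standardized `Age` values for the three training rows (45, 25, 22); the variable names are illustrative. It also shows how the reported standard deviation depends on the `ddof` setting: `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np

# 'Age' values of the three training rows
age_train = np.array([45.0, 25.0, 22.0])

mu = age_train.mean()
sigma = age_train.std()  # NumPy default is ddof=0 (population std)

# Apply the z-score formula element-wise
z = (age_train - mu) / sigma
print(z)

# Population vs. sample standard deviation of the standardized values
print(z.std(ddof=0), z.std(ddof=1))
```

With `ddof=0` the result is 1, as expected. With `ddof=1` and only three rows it comes out as $\sqrt{3/2} \approx 1.22$, a gap that shrinks as the number of rows grows.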


### 2.3 Normalization (MinMaxScaler)
**Normalization** (or Min-Max Scaling) rescales the data so that all values fall within a specific range, usually 0 to 1:

$$x_{\text{scaled}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}$$

This is useful when you need bounded data (e.g., for certain Neural Network activation functions). However, it is very sensitive to outliers, because a single extreme value sets $x_{\text{min}}$ or $x_{\text{max}}$ and compresses all other values into a narrow band.

In [29]:
# CODE CELL 3: Normalization (Min-Max Scaling)
# 1. Create the MinMaxScaler object
scaler_minmax = MinMaxScaler()

# 2. Fit the scaler on the TRAINING data ONLY
scaler_minmax.fit(X_train)

# 3. Transform both the training and test sets
X_train_scaled_minmax = scaler_minmax.transform(X_train)
X_test_scaled_minmax = scaler_minmax.transform(X_test)

# Convert back to DataFrame for better viewing
X_train_scaled_minmax_df = pd.DataFrame(
    X_train_scaled_minmax, 
    columns=X_train.columns, 
    index=X_train.index
)

print("\n--- Normalized Training Data (Min=0, Max=1) ---")
print(X_train_scaled_minmax_df)
print(f"\nMinimum of 'Age' after normalization: {X_train_scaled_minmax_df['Age'].min():.2f}")
print(f"Maximum of 'Age' after normalization: {X_train_scaled_minmax_df['Age'].max():.2f}")


--- Normalized Training Data (Min=0, Max=1) ---
        Age  Salary
2  1.000000     1.0
0  0.130435     0.1
3  0.000000     0.0

Minimum of 'Age' after normalization: 0.00
Maximum of 'Age' after normalization: 1.00
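
Because the scaler learns its minimum and maximum from the training rows only, held-out data is not guaranteed to stay inside [0, 1]. The sketch below hard-codes the three training rows and the two remaining rows of the sample data as arrays (the `_demo` names are illustrative) and applies a `MinMaxScaler` fitted on the training rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns are [Age, Salary]; rows copied from the sample data above
X_train_demo = np.array([[45, 120000], [25, 30000], [22, 20000]], dtype=float)
X_test_demo = np.array([[30, 60000], [58, 90000]], dtype=float)

# Fit on the training rows only, then transform the held-out rows
scaler_demo = MinMaxScaler().fit(X_train_demo)
X_test_demo_scaled = scaler_demo.transform(X_test_demo)
print(X_test_demo_scaled)
```

An age of 58 exceeds the training maximum of 45, so it maps to (58 - 22) / (45 - 22) ≈ 1.57. This is expected behavior rather than a bug. If a strict bound is required, `MinMaxScaler` accepts `clip=True` in recent scikit-learn versions, at the cost of flattening such values to 1.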
