# Introduction to Data Preprocessing

Data preprocessing is **one of the first and most important steps** in a Machine Learning pipeline.

In this notebook, we will cover:
- Why preprocessing is needed
- Handling missing values
- Encoding categorical variables
- Feature scaling
- Splitting into train/test sets

Let's start! 🚀

## 1. Why Data Preprocessing?

Real-world datasets are often messy:
- Missing or null values
- Different data formats
- Numerical values in different ranges
- Categorical (textual) data

Preprocessing helps clean and prepare the data so that Machine Learning models can learn effectively.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# Example dataset
data = {
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, 60000, 55000, np.nan, 65000],
    'Country': ['India', 'USA', 'India', 'UK', np.nan],
    'Purchased': ['Yes', 'No', 'Yes', 'No', 'Yes']
}

df = pd.DataFrame(data)
df
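
Before cleaning anything, it can help to quantify the mess described above. The following is a minimal sketch that rebuilds the same example dataset and counts the missing values per column; the variable name `missing_counts` is illustrative only.

```python
import numpy as np
import pandas as pd

# Same example dataset as above
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, 60000, 55000, np.nan, 65000],
    'Country': ['India', 'USA', 'India', 'UK', np.nan],
    'Purchased': ['Yes', 'No', 'Yes', 'No', 'Yes'],
})

# Count missing entries in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Column dtypes show which features are numerical and which are categorical
print(df.dtypes)
```

Each of `Age`, `Salary`, and `Country` has exactly one missing entry here, which is what the next section fills in.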

## 2. Handling Missing Values

In [None]:
# Impute missing numerical values with the column mean
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

# Fill missing categorical values with the mode (most frequent value).
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about.
df['Country'] = df['Country'].fillna(df['Country'].mode()[0])
df
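
To see what mean imputation actually fills in, the sketch below fits a `SimpleImputer` on a fresh copy of the two numerical columns and inspects the learned means through its `statistics_` attribute. The names `demo`, `demo_imputer`, and `filled` are placeholders for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Fresh copy of the numerical columns, so the notebook's df is left untouched
demo = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, 60000, 55000, np.nan, 65000],
})

demo_imputer = SimpleImputer(strategy='mean')
filled = demo_imputer.fit_transform(demo)

# The means are computed from the non-missing values in each column:
# Age -> 32.5, Salary -> 57500.0
print(demo_imputer.statistics_)
print(filled)
```

If a column has outliers, `strategy='median'` is a commonly used alternative to the mean.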

## 3. Encoding Categorical Variables
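- Label Encoding: Map each category to an integer (scikit-learn's `LabelEncoder` is intended for the target column).
- One-Hot Encoding: Create one binary dummy column per category (suitable for nominal features, which have no natural order).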

In [None]:
# Label Encoding for target column 'Purchased'
le = LabelEncoder()
df['Purchased'] = le.fit_transform(df['Purchased'])

# One-Hot Encoding for 'Country'; drop_first=True drops one redundant dummy column
df = pd.get_dummies(df, columns=['Country'], drop_first=True)
df
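
`pd.get_dummies` encodes whatever categories happen to be in the frame it is given. When the same encoding has to be reapplied to new data, scikit-learn's `OneHotEncoder` is one alternative, since it remembers the categories seen during `fit`. Below is a minimal sketch, assuming unseen categories should be encoded as all zeros (`handle_unknown='ignore'`). The `'Germany'` row and the `_demo` names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LabelEncoder sorts the classes, so 'No' -> 0 and 'Yes' -> 1
le_demo = LabelEncoder()
labels = le_demo.fit_transform(['Yes', 'No', 'Yes', 'No', 'Yes'])
print(list(le_demo.classes_), labels)

# Fit the one-hot encoder on the countries seen so far (input must be 2D)
seen = np.array([['India'], ['USA'], ['India'], ['UK']], dtype=object)
ohe_demo = OneHotEncoder(handle_unknown='ignore')
ohe_demo.fit(seen)

# 'Germany' was never seen during fit, so its row is encoded as all zeros
new = np.array([['India'], ['Germany']], dtype=object)
encoded = ohe_demo.transform(new).toarray()
print(ohe_demo.categories_[0])  # learned categories, in sorted order
print(encoded)
```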

## 4. Feature Scaling
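Different features often have very different ranges (e.g., Age in tens vs. Salary in tens of thousands).

Scaling brings them to a similar scale. This matters most for distance-based and gradient-based models (e.g., k-NN, SVMs, regularized linear models), where a large-range feature would otherwise dominate; tree-based models are largely insensitive to it.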

In [None]:
# Standardize Age and Salary to zero mean and unit variance
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
df
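
`StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation (`ddof=0`). The sketch below checks this by hand on a small Age column; the values are the example ages with the missing entry already replaced by the mean, and the variable names are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example ages after mean imputation (one column, 2D as scikit-learn expects)
ages = np.array([[25.0], [30.0], [32.5], [35.0], [40.0]])

# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

scaled = StandardScaler().fit_transform(ages)

# The two results should agree up to floating-point error
print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```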

## 5. Splitting Dataset into Train/Test Sets

In [None]:
X = df.drop('Purchased', axis=1)
y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Features:\n", X_train)
print("\nTraining Labels:\n", y_train)
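
One caveat about the order used in this notebook: the imputer and scaler above were fitted on all rows before the split, so statistics from the test rows leak into training. A common alternative is to split first and fit the preprocessing steps on the training rows only. The sketch below shows this for the two numerical columns with a scikit-learn `Pipeline`; the names `raw`, `numeric_prep`, and the `_demo`/`_tr`/`_te` variables are placeholders, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Raw numerical features plus an already-encoded target (1 = Yes, 0 = No)
raw = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, 60000, 55000, np.nan, 65000],
    'Purchased': [1, 0, 1, 0, 1],
})
X_demo = raw[['Age', 'Salary']]
y_demo = raw['Purchased']

# 1) Split first, while the data is still unprocessed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2) Chain imputation and scaling so they are fitted together
numeric_prep = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# 3) Learn means and standard deviations from the training rows only,
#    then reuse those same statistics on the test rows
X_tr_prepared = numeric_prep.fit_transform(X_tr)
X_te_prepared = numeric_prep.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```

With only five rows the effect is tiny, but on real datasets this ordering keeps the test set a fair stand-in for unseen data.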

## ✅ Summary
- Dealt with missing values (mean & mode).
- Encoded categorical features.
- Scaled numerical features.
- Split dataset into train/test sets.

These steps form the **foundation of most ML pipelines**. Clean, consistently encoded, and scaled data gives your models a much better chance of learning useful patterns.