# **Data Preprocessing in Python**

## Why Data Preprocessing?

Before training a Machine Learning model, the data must be cleaned and prepared. Real-world data is often messy — it can have **missing values**, **wrong formats**, **different scales**, or **categories** that need conversion.

### Goal:

Convert raw data into a clean, numerical, standardized form ready for training a model.






## Handling Missing Data

### Why do we get missing data?

* Human error
* Data corruption
* Fields that do not apply to every record

### Check for Missing Values

In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.isnull().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
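Raw counts are easier to judge next to the share of rows they represent. The sketch below is a minimal illustration on a made-up toy frame (the `toy` data and the `summary` layout are assumptions, not part of the Titanic file) that reports both the count and the percentage of missing cells per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = toy.isnull().sum()        # number of missing cells per column
missing_pct = toy.isnull().mean() * 100   # share of rows that are missing, in percent

# Combine both views and list the sparsest columns first
summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))
```

A column that is mostly empty (like `Cabin` here) is usually a candidate for dropping, while a column with a few gaps is a candidate for filling.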


In [None]:
df[df.isnull().any(axis=1)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S


In [None]:
# pd.set_option("display.max_rows", None)
df[df['Age'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


In [None]:
df.iloc[5]

Unnamed: 0,5
PassengerId,6
Survived,0
Pclass,3
Name,"Moran, Mr. James"
Sex,male
Age,
SibSp,0
Parch,0
Ticket,330877
Fare,8.4583


In [None]:
# Same row again, after the missing Age has been filled with the column mean (see section c)
df.iloc[5]

Unnamed: 0,5
PassengerId,6
Survived,0
Pclass,3
Name,"Moran, Mr. James"
Sex,male
Age,29.699118
SibSp,0
Parch,0
Ticket,330877
Fare,8.4583


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### a) Drop Rows or Columns

In [None]:
# df.dropna(inplace=True)  # drop rows with any missing value
# df.dropna(subset=['Age'], inplace=True)
df.drop(columns=['Cabin'], inplace=True)
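The three options in the cell above remove very different amounts of data. As a rough sketch on an invented three-row frame (the `toy` values are assumptions for illustration), the following compares how many rows or columns survive each one.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, np.nan, "C85"],
    "Fare": [7.25, 8.46, 7.92],
})

rows_any = toy.dropna()                  # keeps only rows with no missing value at all
rows_age = toy.dropna(subset=["Age"])    # drops a row only if Age is missing
no_cabin = toy.drop(columns=["Cabin"])   # drops the sparse column, keeps every row

print(len(rows_any), len(rows_age), no_cabin.shape)  # 1 2 (3, 2)
```

Dropping the sparse column keeps all rows, which is why the notebook prefers it for `Cabin`.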

### b) Fill with Static Value

In [None]:
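# Assign the result back; calling fillna(..., inplace=True) on df[col] is deprecated
df['Embarked'] = df['Embarked'].fillna('Unknown')  # placeholder category
df['Age'] = df['Age'].fillna(0)                    # 0 is rarely a sensible age; shown for illustration only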

### c) Fill with Mean, Median, Mode


In [None]:
# Pick ONE strategy per column. Assign the result back instead of calling
# fillna(..., inplace=True) on df[col]: that chained form raises a FutureWarning
# and will no longer modify df in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].mean())        # mean: simple, but sensitive to outliers
# df['Age'] = df['Age'].fillna(df['Age'].median())    # median: more robust to outliers
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # mode: for categorical columns
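To see why the choice between mean and median matters, here is a small sketch on invented values (the `ages` and `ports` series are assumptions, not Titanic data). One unusually high age pulls the mean well above the typical value, while the median stays close to it.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0], name="Age")

mean_fill = ages.fillna(ages.mean())      # mean of 20, 22, 24, 80 is 36.5
median_fill = ages.fillna(ages.median())  # median of 20, 22, 24, 80 is 23.0

ports = pd.Series(["S", "C", "S", None], name="Embarked")
mode_fill = ports.fillna(ports.mode()[0])  # most frequent value is "S"

print(mean_fill[3], median_fill[3], mode_fill[3])  # 36.5 23.0 S
```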

### d) Forward/Backward Fill

In [None]:
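# fillna(method=...) is deprecated in recent pandas; use ffill()/bfill() directly
df.ffill(inplace=True)  # forward fill: copy the previous row's value down
df.bfill(inplace=True)  # backward fill: copy the next row's value up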
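The two fills behave differently at the edges of the data, which is why they are often chained. A minimal sketch on an invented series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()        # leading NaN stays: there is no earlier value to copy
backward = s.bfill()       # trailing NaN stays: there is no later value to copy
both = s.ffill().bfill()   # forward first, then backward to close the leading gap

print(forward.tolist())    # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())   # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
print(both.tolist())       # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
```

These fills make the most sense when row order carries meaning (for example, time series); on an unordered table such as the Titanic data they simply copy a neighbouring passenger's value.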

#### Final handling

In [None]:
df.drop(columns=['Cabin'], inplace=True, errors='ignore')  # no error if Cabin was already dropped above
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


## Encoding Categorical Variables

### Why Encode?

Most ML models, including scikit-learn estimators, expect numeric input, so string categories must be converted to numbers first.

### a) Label Encoding

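Used for **ordinal data** (Low < Medium < High), and often also for binary nominal features. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order of the labels, so it does not know about a custom order such as Low < Medium < High; for ordinal features, `OrdinalEncoder` with an explicit category list preserves the intended order.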

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])  # labels are sorted alphabetically: female→0, male→1
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.000000,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.000000,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.000000,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.000000,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",1,35.000000,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,27.000000,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.000000,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,29.699118,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",1,26.000000,0,0,111369,30.0000,C


#### **Important Clarification on Label Encoding**

Label Encoding gives **numbers** to categories, which can create a **false sense of order**.

| Sex    | Encoded |
| ------ | ------- |
| Male   | 1       |
| Female | 0       |

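This might be **misleading for models** like Linear Regression, KNN, SVM, which treat the codes as magnitudes and may assume “Male > Female”. With only two categories a 0/1 code is harmless; the problem becomes real with three or more unordered categories, where codes 0, 1, 2 imply an order and spacing that do not exist.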

**So, when is it OK?**

| Scenario                            | Encoding         |
| ----------------------------------- | ---------------- |
| Binary categories (e.g., 'Sex')     | Label ✅ OK       |
| Unordered with >2 categories        | ❌ Avoid Label    |
| Ordinal categories (e.g., Size)     | Label ✅          |
| Tree models (Decision Tree, RF)     | Label ✅          |
| Linear models (e.g., Logistic Reg.) | Prefer One-Hot ✅ |
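For the ordinal row of the table, the order of the integer codes matters. The sketch below uses an invented `Size` column (an assumption for illustration) to contrast `LabelEncoder`, which sorts labels alphabetically, with `OrdinalEncoder` given an explicit category order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = pd.DataFrame({"Size": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts labels alphabetically: High=0, Low=1, Medium=2
alphabetical = LabelEncoder().fit_transform(sizes["Size"])

# OrdinalEncoder with an explicit list keeps Low < Medium < High
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered = encoder.fit_transform(sizes[["Size"]]).ravel()

print(alphabetical.tolist())  # [1, 0, 2, 1]
print(ordered.tolist())       # [0.0, 2.0, 1.0, 0.0]
```

With the alphabetical codes, "High" ends up below "Low", which is the opposite of the intended order.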

---


### b) One-Hot Encoding

Use for **nominal (unordered)** categories.


| Embarked\_Q | Embarked\_S |
| ----------- | ----------- |
| 0           | 1           |
| 1           | 0           |

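`drop_first=True` drops the column for the first category (here `Embarked_C`). No information is lost: a row with 0 in both remaining columns must be Cherbourg. This avoids a redundant, perfectly correlated column.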

In [None]:
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
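As a small illustration of what `drop_first` changes, the following sketch encodes an invented four-row `Embarked` column (the `ports` frame is an assumption) with and without it. `dtype=int` is used only to print 0/1 instead of `False`/`True`.

```python
import pandas as pd

ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(ports, columns=["Embarked"], dtype=int)
reduced = pd.get_dummies(ports, columns=["Embarked"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(reduced.columns.tolist())  # ['Embarked_Q', 'Embarked_S']
print(reduced.iloc[1].tolist())  # [0, 0] -> the dropped category C
```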

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,False,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,False,True
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,False,True
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,False,True
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,False,True
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,False,False


### Encoding Summary

In [None]:
# Label Encoding for binary
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])

# One-Hot Encoding for multi-class
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

## Feature Scaling

### Why Scale?

Features with large values (e.g., Salary = 50,000) can dominate those with small values (e.g., Age = 25).

Models like:

* KNN
* SVM
* Gradient Descent-based models

are **scale-sensitive**.
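A rough way to see the problem is to measure how much each feature contributes to a Euclidean distance, as KNN would compute it. The sketch below uses three invented people and a hypothetical helper `age_share`; none of these names come from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [Age, Salary]; values are invented for illustration
people = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],
    [26.0, 58_000.0],
])

def age_share(a, b):
    """Fraction of the squared Euclidean distance that comes from the Age column."""
    diff_sq = (a - b) ** 2
    return float(diff_sq[0] / diff_sq.sum())

# Persons 0 and 1 differ by 35 years but only 500 in salary
raw_share = age_share(people[0], people[1])

scaled = StandardScaler().fit_transform(people)
scaled_share = age_share(scaled[0], scaled[1])

print(f"Age share of distance, raw:    {raw_share:.1%}")
print(f"Age share of distance, scaled: {scaled_share:.1%}")
```

On the raw values the 35-year age gap accounts for less than 1% of the squared distance, because the salary column is measured in much larger numbers. After scaling, the age gap accounts for most of it.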




### a) StandardScaler (Z-score normalization)

Formula:

$$
z = \frac{x - \mu}{\sigma}
$$

* Mean becomes **0**
* Std Dev becomes **1**



In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

#### Example: Why Mean = 0 and Std = 1?

Let’s say:

```python
data = [10, 12, 14, 16, 18]
```

1. Mean = 14
2. Std Dev ≈ 2.828 (population standard deviation, dividing by n = 5, which is what `StandardScaler` uses)

Apply standardization:

```python
(10 - 14) / 2.828  # ≈ -1.414
(12 - 14) / 2.828  # ≈ -0.707
(14 - 14) / 2.828  # = 0
(16 - 14) / 2.828  # ≈ 0.707
(18 - 14) / 2.828  # ≈ 1.414
```

These new values have:

* Mean ≈ 0
* Std Dev ≈ 1

It’s guaranteed by the formula!
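The hand calculation above can be checked in a few lines. This is a minimal sketch that applies the formula with NumPy and compares the result with `StandardScaler`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([10, 12, 14, 16, 18], dtype=float)

mu = data.mean()           # 14.0
sigma = data.std(ddof=0)   # population std: sqrt(8), about 2.828

z_manual = (data - mu) / sigma
z_sklearn = StandardScaler().fit_transform(data.reshape(-1, 1)).ravel()

print(np.round(z_manual, 3))             # about -1.414, -0.707, 0, 0.707, 1.414
print(np.allclose(z_manual, z_sklearn))  # True
print(z_manual.mean(), z_manual.std())   # mean ~0, std ~1 (up to floating-point rounding)
```

Subtracting the mean shifts the center to 0, and dividing by the standard deviation rescales the spread to 1, which is why the result holds for any non-constant column.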


### b) MinMaxScaler

Formula:

$$
x_{scaled} = \frac{x - \min}{\max - \min}
$$

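Scales each feature to \[0, 1]. Works well when a feature has a known, bounded range and no outliers, since a single extreme value becomes the new maximum and compresses all other values toward 0.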

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
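As a quick check of the formula, the sketch below applies it by hand to four invented fare values and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

fares = np.array([[5.0], [10.0], [15.0], [25.0]])  # invented values

# (x - min) / (max - min), applied to the single column
manual = (fares - fares.min()) / (fares.max() - fares.min())
scaled = MinMaxScaler().fit_transform(fares)

print(manual.ravel().tolist())      # [0.0, 0.25, 0.5, 1.0]
print(np.allclose(manual, scaled))  # True
```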

### c) RobustScaler

Uses **median and IQR** — great for features with **outliers**.

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
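With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile). A minimal sketch on invented fares with one extreme value shows that the outlier does not distort the scaling of the ordinary values:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

fares = np.array([[10.0], [12.0], [14.0], [16.0], [500.0]])  # 500 is an outlier

median = np.median(fares)                 # 14.0
q1, q3 = np.percentile(fares, [25, 75])   # 12.0 and 16.0
manual = (fares - median) / (q3 - q1)

scaled = RobustScaler().fit_transform(fares)

print(manual.ravel().tolist())      # [-1.0, -0.5, 0.0, 0.5, 121.5]
print(np.allclose(manual, scaled))  # True
```

The four ordinary values stay within \[-1, 0.5], and the outlier remains clearly visible instead of squeezing everything else into a narrow band.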

## Final Preprocessing Workflow

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
import pandas as pd

# import seaborn as sns
# df = sns.load_dataset('titanic');

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


| Column        | Data Type | Description                                                         |
| ------------- | --------- | ------------------------------------------------------------------- |
| `PassengerId` | int       | Unique ID for each passenger (not useful for prediction)            |
| `Survived`    | int (0/1) | **Target variable** — 1 = Survived, 0 = Did not survive             |
| `Pclass`      | int (1-3) | Passenger class: 1 = 1st (upper), 2 = 2nd (middle), 3 = 3rd (lower) |
| `Name`        | object    | Full name (often contains titles like Mr., Mrs.)                    |
| `Sex`         | object    | Gender: `male` or `female`                                          |
| `Age`         | float     | Age in years (can have missing values)                              |
| `SibSp`       | int       | Siblings/Spouses aboard                                             |
| `Parch`       | int       | Parents/Children aboard                                             |
| `Ticket`      | object    | Ticket number (can be alphanumeric, not very useful)                |
| `Fare`        | float     | Ticket fare (price paid)                                            |
| `Cabin`       | object    | Cabin number (many missing values)                                  |
| `Embarked`    | object    | Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |


### Drop unused

In [None]:
df.drop(columns=['Cabin'], inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


### Fill missing

In [None]:
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


### Encode

In [None]:
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,False,False
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,False,True
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,False,True


### Scale

In [None]:
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",1,-0.565736,1,0,A/5 21171,-0.502445,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,0.663861,1,0,PC 17599,0.786845,False,False
2,3,1,3,"Heikkinen, Miss. Laina",0,-0.258337,0,0,STON/O2. 3101282,-0.488854,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,0.433312,1,0,113803,0.420730,False,True
4,5,0,3,"Allen, Mr. William Henry",1,0.433312,0,0,373450,-0.486337,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,-0.181487,0,0,211536,-0.386671,False,True
887,888,1,1,"Graham, Miss. Margaret Edith",0,-0.796286,0,0,112053,-0.044381,False,True
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,-0.104637,1,2,W./C. 6607,-0.176263,False,True
889,890,1,1,"Behr, Mr. Karl Howell",1,-0.258337,0,0,111369,-0.044381,False,False


#### Summary Table

| Technique          | When to Use               | What It Does          |
| ------------------ | ------------------------- | --------------------- |
| `dropna()`         | Too many nulls            | Remove rows/cols      |
| `fillna(0)`        | Non-critical missing      | Fill with static      |
| `fillna(mean())`   | Numeric features          | Fill with average     |
| `LabelEncoder()`   | Ordinal / binary nominal  | Convert to numeric    |
| `get_dummies()`    | Nominal (unordered)       | Create binary columns |
| `StandardScaler()` | Normally distributed data | Mean = 0, Std = 1     |
| `MinMaxScaler()`   | Range-based data          | Scales to \[0, 1]     |
| `RobustScaler()`   | Outliers present          | Median, IQR based     |
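The techniques in the table can be chained into one reusable function. The sketch below is one possible arrangement, not the only one: `preprocess` is a hypothetical helper, the `toy` frame is invented, and `Sex` is encoded with an explicit `map` rather than `LabelEncoder` so that the 0/1 assignment is visible in the code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: applies the notebook's steps to a copy of `frame`."""
    out = frame.copy()
    out = out.drop(columns=["Cabin"], errors="ignore")       # drop the sparse column
    out["Age"] = out["Age"].fillna(out["Age"].median())      # numeric: fill with median
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])  # categorical: mode
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})    # binary encoding
    out = pd.get_dummies(out, columns=["Embarked"], drop_first=True)     # one-hot encoding
    out[["Age", "Fare"]] = StandardScaler().fit_transform(out[["Age", "Fare"]])
    return out

# Invented rows with the same column names as the Titanic file
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "Q"],
})

clean = preprocess(toy)
print(clean.columns.tolist())   # ['Sex', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']
print(clean.isnull().sum().sum())  # 0
```

Working on a copy keeps the original frame untouched, so the function can be rerun without the errors that come from dropping or encoding a column twice.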

