# Part 1 - Data Preprocessing in Python
This notebook explains the **first step of any Machine Learning project: Data Preprocessing**.  
We will follow the workflow:

1. Importing the dataset  
2. Handling missing data  
3. Encoding categorical variables  
4. Splitting the dataset  
5. Feature scaling  

Each step includes:  
- **What?** (what the step means)  
- **Why?** (why we need it in ML)  
- **How?** (how we apply it in Python, with code & outputs)  

---


## Where We Are in the ML Workflow

The **Machine Learning workflow** has three main phases:  
1. Data Preprocessing  
2. Modeling  
3. Evaluation  

In this notebook, we will focus on **Phase 1: Data Preprocessing**.  
This is arguably the most critical step: if the data is not clean and well-prepared, even the most powerful algorithms will produce poor results.  

💡 Think of it like cooking:  
- If your ingredients are fresh and well-prepared, the dish will turn out delicious.  
- If the ingredients are spoiled or messy, no recipe can save it.  

That’s why **the success of your ML model starts with successful preprocessing**.

---

## ⭐ Importing the Libraries

Before starting, let’s understand what a **library** means in programming:  

A **library** is a collection of **modules**.  
Each module contains **functions** and **classes** that let us perform specific actions without writing everything from scratch.  

👉 Think of it like a toolbox:  
- The **library** = the whole toolbox.  
- The **module** = a set of related tools inside the box.  
- A **function/class** = an individual tool you pick and use.  
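
As a rough illustration of the toolbox analogy, the sketch below uses NumPy. The choice of the `linalg` module and its `norm` function is only an example picked for this note.

```python
import numpy as np            # the library: the whole toolbox
from numpy import linalg      # a module: a set of related tools inside it

# A function is one individual tool: norm() computes the length of a vector.
length = linalg.norm(np.array([3.0, 4.0]))
print(length)  # 5.0
```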

In Machine Learning projects, the most common libraries are:  

- **NumPy** → work with arrays (most ML models expect inputs as arrays).  
- **Pandas** → load, clean, and manipulate datasets (especially for preprocessing).  
- **Matplotlib** → visualize data with plots and charts.  


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## ⭐ Importing the Dataset

<center>
  <img src="../../docs/Dataset-1.png" alt="Dataset Example" width="400"/>
</center>

**What:** We load the file `Data.csv` into a **DataFrame** (a table-like structure in Pandas).  
**Why:** Machine Learning models expect two entities:  
- **X** = independent variables (features) → the inputs  
- **y** = dependent variable (target) → the output we want to predict  

**How:** We use `iloc` for position-based indexing:  
- `:` → all rows  
- `:-1` → all columns except the last one  
- `-1` → only the last column  

Finally, we convert them into **NumPy arrays** (with `.values`). scikit-learn also accepts DataFrames, but arrays keep the position-based slicing used in the later steps (for example `X[:, 1:3]`) simple.


In [3]:
# 1) Load the dataset into a Pandas DataFrame (table with column names)
dataset = pd.read_csv('Data.csv')

# 2) Select features (X) and target (y) using iloc (position-based indexing)
# iloc syntax: df.iloc[row_slice, column_slice]
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [4]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [5]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


**Notes**
- `.iloc` selects columns/rows by **position** (0, 1, 2, …).  
- `.values` (or `.to_numpy()`) converts DataFrame/Series into NumPy arrays.  
- If your target column is not the last one, you can use its name instead:  
  `y = dataset['Purchased'].values`
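
The selection rules in these notes can be tried on a small table. The sketch below builds a hypothetical three-row DataFrame with the same column layout as `Data.csv` (the rows are illustrative only) and compares position-based and name-based selection.

```python
import pandas as pd

# Hypothetical three-row table with the same column layout as Data.csv.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_by_position = df.iloc[:, :-1].to_numpy()   # all rows, every column except the last
y_by_position = df.iloc[:, -1].to_numpy()    # all rows, last column only
y_by_name = df["Purchased"].to_numpy()       # same column, selected by name

print(X_by_position.shape)                   # (3, 3)
print((y_by_position == y_by_name).all())    # True
```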


## ⭐ Taking care of missing data
Real-world datasets usually contain **missing values**.  

Most ML models cannot work with missing data directly. If we don’t handle it, training will usually fail with an error.

Some options:  
1. Delete the affected rows (reasonable only when few values are missing).  
2. Replace missing values with a statistical measure: mean, median, or mode.  

We will use **SimpleImputer** from `sklearn` to replace missing values with the column mean.
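
Before applying the mean to our dataset, here is a minimal sketch of both options on a hypothetical four-row table (the values are made up for illustration). It uses `strategy="median"` only to show that `SimpleImputer` supports statistics other than the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric table with one missing value per column.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Option 1: drop every row that contains at least one NaN.
dropped = df.dropna()
print(len(dropped))  # 2 rows survive

# Option 2: fill each NaN with a column statistic (median here instead of mean).
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = median_imputer.fit_transform(df)
print(filled)
```

In this toy table, deleting rows throws away half of the data, which is why deletion is usually reserved for cases with very few missing values.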




In [6]:
# Identify missing data (assumes that missing data is represented as NaN)
print(dataset.isnull())
# Print the number of missing entries in each column
print(dataset.isnull().sum())

   Country    Age  Salary  Purchased
0    False  False   False      False
1    False  False   False      False
2    False  False   False      False
3    False  False   False      False
4    False  False    True      False
5    False  False   False      False
6    False   True   False      False
7    False  False   False      False
8    False  False   False      False
9    False  False   False      False
Country      0
Age          1
Salary       1
Purchased    0
dtype: int64


In [7]:
from sklearn.impute import SimpleImputer
# Create an imputer object that will replace missing values (NaN) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on columns 1 and 2 (indexing in Python starts at 0, so [:, 1:3] means 2nd and 3rd columns)
imputer.fit(X[:, 1:3])

# Transform the data: replace NaN with the calculated mean values
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [8]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


🔎 **Result Explanation**

- `fit()` → calculates the mean of each selected column (does not change the data yet).  
- `transform()` → replaces the missing values (`NaN`) with those calculated means.  
- `X[:, 1:3]` → selects only the **2nd and 3rd columns** (Python slicing: start at index 1, stop before index 3).  
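
To see the `fit()` / `transform()` split more concretely, the sketch below rebuilds only the Age and Salary columns printed earlier and inspects `statistics_`, the attribute where a fitted `SimpleImputer` stores what it learned. The variable name `numeric` is introduced here just for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns as printed earlier in this notebook.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)                   # learns the column means, data unchanged
print(imputer.statistics_)             # means learned by fit()
print(np.nanmean(numeric, axis=0))     # same means computed directly, ignoring NaN

filled = imputer.transform(numeric)    # returns a copy with NaN replaced
print(filled[4, 1], filled[6, 0])      # the two values that were missing
```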


## ⭐ Encoding categorical data

**Problem:**  
Most Machine Learning models work only with numbers, not text.  
But in our dataset, some columns contain text (like country names, or Yes/No answers).  

**Solution:**  
We convert text → numbers using special encoders.  





### Encoding the Independent Variable
#### 1. Encoding X (features)  
Suppose we have a column with countries:  
`France`, `Spain`, `Germany`  

❌ Bad idea: France=0, Spain=1, Germany=2  
- This makes it look like *Germany (2)* > *Spain (1)* > *France (0)*, which is not true.  
- The model would **assume an order or ranking** between the countries.  
- This false ordering can confuse the algorithm and **reduce the model’s quality and accuracy**.  


✅ Better idea: **One Hot Encoding**  
- Create a new column for each country.  
- Example:  

| Country   | France | Spain | Germany |  
|-----------|--------|-------|---------|  
| France    |   1    |   0   |   0     |  
| Spain     |   0    |   1   |   0     |  
| Germany   |   0    |   0   |   1     |  

This way, the model sees only "yes/no" (1 or 0), no fake ordering.  


In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One Hot Encoding for the first column (Country)
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough',  # keep other columns as they are
)

# result will be a NumPy array with new encoded columns
X = np.array(ct.fit_transform(X))

In [10]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

#### 2. Encoding y (target)  
Our target column is binary (Yes / No).  
We can simply replace it with numbers:  

- Yes → 1  
- No → 0  

This works fine because there are only two categories, so the 0/1 codes do not introduce a misleading ranking.  

In [11]:
from sklearn.preprocessing import LabelEncoder

# Label Encoding for the target (Yes/No)
le = LabelEncoder()
y = le.fit_transform(y)

In [12]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


🔎 **Result Explanation**

- **OneHotEncoder** → turns text categories into multiple binary columns (0/1).  
- **remainder='passthrough'** → keep the other columns as they are.  
- **LabelEncoder** → converts Yes/No into numbers. It numbers the classes in sorted order, so No=0 and Yes=1.  

👉 Now, both X and y are numeric, ready for ML models.  
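
Both encoders order their categories alphabetically, which explains why `Spain` shows up as `0 0 1` in the printed `X` even though the illustration table lists it second. A small sketch on hypothetical inputs, assuming default encoder settings:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])
enc = OneHotEncoder()
one_hot = enc.fit_transform(countries).toarray()  # default output is sparse
print(enc.categories_[0])   # ['France' 'Germany' 'Spain']
print(one_hot[1])           # Spain -> [0. 0. 1.]

le = LabelEncoder()
codes = le.fit_transform(np.array(["No", "Yes", "No"]))
print(le.classes_)          # ['No' 'Yes']
print(codes)                # [0 1 0]
```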


## ⭐ Splitting the Dataset  

We divide the dataset into two parts:  
- **Training set (80%)** → used to *teach* the model.  
- **Test set (20%)** → used to *evaluate* the model on new, unseen data.  

Why do we do this?  
- To make sure the model can **generalize** (perform well on new data), not just memorize the training data.  

<center>
  <img src="../../docs/Train & Test sets Spliting.png"  width="400"/>
</center>

---

❓ **Important Question:**  
Should we split before or after scaling?  

👉 **Answer:** Always split **before scaling**.  

---

### 🔎 Why? (Data Leakage)  

If we scale *before* splitting:  
- The scaler would calculate the mean/standard deviation using **all the data (including the test set)**.  
- This means some information from the test set “leaks” into the training process.  
- As a result, the test set is no longer *truly unseen*.  
- The model may look like it performs better than it really does (fake accuracy).  

✅ Correct way:  
1. Split into train and test sets.  
2. Fit the scaler **only on the training set** (calculate statistics).  
3. Apply the same transformation to the test set **(we will see this in the next step: Feature Scaling)**.  
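
The leakage can be made visible with a small experiment. The sketch below uses a hypothetical single feature with one unusually large value and compares the statistics a scaler learns from all rows with those it learns from the training rows only. The names `leaky` and `clean` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with one unusually large value.
X = np.array([[20.0], [25.0], [30.0], [35.0], [40.0],
              [45.0], [50.0], [55.0], [60.0], [200.0]])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

leaky = StandardScaler().fit(X)          # statistics computed from train + test rows
clean = StandardScaler().fit(X_train)    # statistics computed from training rows only

print(leaky.mean_, clean.mean_)          # the two means differ
X_test_scaled = clean.transform(X_test)  # test rows reuse the training statistics
```

The gap between the two means is exactly the test-set information that would otherwise slip into training.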



In [13]:
from sklearn.model_selection import train_test_split

# Split the dataset: 80% training, 20% testing
# This 80/20 split is not mandatory, but it's the most common practice.
# Why? → We want to give the model as much data as possible for training,
# while still keeping a separate portion to test how well it generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

# X_train, y_train → used to build the model
# X_test, y_test → used to check if the model generalizes well

In [14]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [15]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [16]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [17]:
print(y_test)

[0 1]
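
A short sketch of what `test_size` and `random_state` control, on hypothetical data with the same number of rows as our dataset: `test_size=0.2` sends 2 of the 10 rows to the test set, and a fixed `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

a_train, a_test, ya_train, ya_test = train_test_split(X, y, test_size=0.2, random_state=1)
b_train, b_test, yb_train, yb_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)       # (8, 2) (2, 2)
print(np.array_equal(a_test, b_test))    # True: same seed, same split
```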


## ⭐ Feature Scaling  

**The Problem:**  
In datasets, different features can have very different ranges.  
- Example:  
  - Age → values like 20, 35, 50, 100  
  - Salary → values like 30,000, 50,000, 70,000  

If we leave them like this, many ML algorithms (especially those based on distances or gradient descent) will give **more weight** to the feature with larger values (Salary) and effectively **ignore** the smaller one (Age).  
👉 The feature's units, not its usefulness, end up driving the model, which leads to poor results.  

---

**The Goal of Scaling:**  
- Put all features on a similar scale.  
- Prevent one feature from *dominating* the others just because of its units.  
- Make training faster and more stable.  

---

**Two Main Techniques:**  

1. **Normalization**  
   - Scales all values into the range [0, 1].  
   - Works best when the feature has clear bounds and no extreme outliers, because a single outlier squeezes all other values into a narrow band.  
  
   
   **Formula :**
   $$
   x' = \frac{x - \min(x)}{\max(x) - \min(x)}
   $$


2. **Standardization**  
   - Scales values so that the mean = 0 and standard deviation = 1.  
   - Has no fixed output range, so it copes better with outliers and skewed data than normalization.  

   **Formula :**
   $$
   z = \frac{x - \mu}{\sigma}
   $$

   - This is the more common default in practice and the one used in this notebook.  
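
Both formulas can be checked by hand with plain NumPy. The sketch below applies them to the training-set ages printed in the splitting step. It uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses, so the first standardized value matches the first Age entry of the scaled `X_train` printed in the feature scaling output.

```python
import numpy as np

# Training-set ages as printed in the split step of this notebook.
age = np.array([349 / 9, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])

normalized = (age - age.min()) / (age.max() - age.min())   # values in [0, 1]
standardized = (age - age.mean()) / age.std()              # mean 0, std 1

print(normalized.min(), normalized.max())   # 0.0 1.0
print(round(standardized[0], 4))            # -0.1916
```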


---

The figure below shows what the data appears to look like before scaling.
<center>
  <img src="../../docs/Feature Scaling.png"  width="400"/>
</center>

After scaling, the picture changes: the impression given by the raw values was wrong, and the correct result looks like this.
<center>
  <img src="../../docs/Feature Scaling - after.png"  width="400"/>
</center>

Scaling keeps our model from making the same mistake.

---

⚠️ **Important Note:**  
- Do **not** scale dummy variables (e.g., 0/1 values from OneHotEncoding).  
- Scaling them would destroy their meaning and make them hard to interpret.  


In [18]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Fit on training set → calculate mean & std (statistics) from training data
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# Transform test set using the same statistics (no new fit here!)
X_test[:, 3:] = sc.transform(X_test[:, 3:])

# Now both train and test features are on the same scale

In [19]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [20]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


🔎 **Result Explanation**  
- `fit_transform` on training data → learns scaling parameters and applies them.  
- `transform` on test data → applies the same scaling, without leaking test info.  
- Only scale **numeric continuous features**, not dummy variables.  
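
As an alternative to slicing with `[:, 3:]`, the same "scale only the numeric columns" rule can be expressed with a `ColumnTransformer`. This is a sketch on hypothetical rows, not the approach used in the cells above, and it has one caveat: the transformed columns are placed first in the output, followed by the passthrough columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical encoded rows: three dummy columns, then Age and Salary.
X_train = np.array([[1.0, 0.0, 0.0, 44.0, 72000.0],
                    [0.0, 0.0, 1.0, 27.0, 48000.0],
                    [0.0, 1.0, 0.0, 50.0, 83000.0]])
X_test = np.array([[0.0, 1.0, 0.0, 30.0, 54000.0]])

scale_numeric = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), [3, 4])],
    remainder="passthrough",   # dummy columns are left untouched
)

# Output column order: scaled Age, scaled Salary, then the three dummy columns.
X_train_scaled = scale_numeric.fit_transform(X_train)
X_test_scaled = scale_numeric.transform(X_test)   # reuses training statistics
print(X_train_scaled[:, 2:])   # the three dummy columns, unchanged
```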


## 💡 Conclusion
In this notebook, we learned how to:  
1. Import & explore a dataset.  
2. Handle missing values.  
3. Encode categorical features.  
4. Split the dataset into training and test sets.  
5. Apply feature scaling.  

✅ Data preprocessing ensures that our dataset is **clean, consistent, and ready for ML models**.
