<a href="https://colab.research.google.com/github/Bilal-Moussaoui/Data_Science_Basics_PY/blob/main/Lesson_2_Data_Preprocessing_%26_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing & Cleaning


---


## Objectives for today's lesson
1. Understand the importance of preprocessing for model performance
2. Learn to **detect, analyze, and handle missing data**
3. Understand how to decide **what to impute, drop, or transform**
4. Implement a first, clean **data preprocessing workflow** in Python.

This work forms the foundation of feature engineering and is an essential step in every data science project.

## 1. The context: why preprocessing matters
Raw datasets usually arrive with problems that have to be addressed before modeling:
1. Missing values
2. Inconsistent formats (euro, €, Euro, ...)
3. Extreme outliers
4. Categorical text variables
5. Skewed numerical features
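
As a rough illustration of how each of these issues can be spotted, the sketch below runs one quick check per item on a small toy DataFrame. The data and the column names `price`, `currency`, and `city` are invented for this example and are not part of the lesson's dataset; the 1.5 × IQR outlier rule is one common convention, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "price": [100.0, 120.0, np.nan, 110.0, 5000.0],
    "currency": ["euro", "€", "Euro", "EUR", "euro"],
    "city": ["Paris", "Lyon", "Paris", None, "Nice"],
})

# 1. Missing values: count nulls per column
missing_counts = toy.isnull().sum()

# 2. Inconsistent formats: map every spelling to one canonical label
currency_map = {"euro": "EUR", "€": "EUR", "eur": "EUR"}
toy["currency"] = toy["currency"].str.strip().str.lower().map(currency_map)

# 3. Extreme outliers: flag values outside 1.5 * IQR (assumed convention)
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

# 4. Categorical text variables: list the non-numeric columns
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

# 5. Skewed numerical features: a large positive value suggests a long right tail
price_skew = toy["price"].skew()

print(missing_counts)
print(outliers)
print(text_columns, price_skew)
```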

The first step is to use pandas to display basic information about the dataset:
```
# Basic info
print(data.shape)
data.info()  # info() prints its report itself and returns None

# Quick look at missing values
print(data.isnull().sum().sort_values(ascending=False).head(10))
```


### Step 1: Load and inspect the data

```
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv("/content/sample_data/house_price_train.csv")

# Basic info (uncomment to run)
# print(data.shape)  # (number of rows, number of columns)
# data.info()        # per column: index, name, non-null count, dtype

# Quick look at missing values: the 10 columns with the most nulls
print(data.isnull().sum().sort_values(ascending=False).head(10))
```

```
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageQual        81
GarageFinish      81
GarageType        81
dtype: int64
```


### Step 2: Quantify missing data

We need to measure how problematic each column is. This shows which features have many nulls and which have only a few.

> **If an attribute has many nulls => consider dropping it**
>
> **If an attribute has few nulls => impute them**
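
The rule of thumb above can be sketched as a small helper. The function name `drop_or_impute`, the 50% cutoff, and the fill strategies (median for numeric columns, most frequent value for text columns) are assumptions made for illustration, not values prescribed by this lesson. The toy DataFrame is also invented.

```python
import numpy as np
import pandas as pd

def drop_or_impute(df, drop_threshold=0.5):
    """Drop columns whose share of nulls exceeds drop_threshold; impute the rest.

    Hypothetical helper: the threshold and fill strategies are illustrative.
    """
    missing_share = df.isnull().mean()  # fraction of nulls per column
    to_drop = missing_share[missing_share > drop_threshold].index
    cleaned = df.drop(columns=to_drop)  # returns a new DataFrame

    for col in cleaned.columns:
        if not cleaned[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            fill_value = cleaned[col].median()        # numeric: median
        else:
            fill_value = cleaned[col].mode().iloc[0]  # text: most frequent value
        cleaned[col] = cleaned[col].fillna(fill_value)
    return cleaned

# Toy data, invented for this example
toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "area": [50.0, np.nan, 70.0, 90.0],
    "quality": ["good", "good", None, "fair"],
})
cleaned = drop_or_impute(toy)
print(cleaned)
```

A high null count is only a hint. Whether dropping is appropriate still depends on what a missing value means for that particular column.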


```
missing = data.isnull().sum()
# Keep only the columns that have at least one missing value
missing = missing[missing > 0]
missing_percent = ((missing / len(data)) * 100).sort_values(ascending=False)
# Note: the two Series are aligned on their index labels here, so the rows
# come out in alphabetical order rather than in the sorted order above
missing_df = pd.DataFrame({'Missing Values': missing, 'Percent': missing_percent})
print(missing_df)
```

```
              Missing Values    Percent
Alley                   1369  93.767123
BsmtCond                  37   2.534247
BsmtExposure              38   2.602740
BsmtFinType1              37   2.534247
BsmtFinType2              38   2.602740
BsmtQual                  37   2.534247
Electrical                 1   0.068493
Fence                   1179  80.753425
FireplaceQu              690  47.260274
GarageCond                81   5.547945
GarageFinish              81   5.547945
GarageQual                81   5.547945
GarageType                81   5.547945
GarageYrBlt               81   5.547945
LotFrontage              259  17.739726
MasVnrArea                 8   0.547945
MasVnrType               872  59.726027
MiscFeature             1406  96.301370
PoolQC                  1453  99.520548
```
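
The table above is listed alphabetically, not by percentage, because building a DataFrame from two differently ordered Series aligns them on their index labels. One possible way to get the summary ordered by percentage is to sort after the DataFrame is built. The sketch below shows this on an invented toy DataFrame; the helper name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing counts and percentages, sorted by percentage (descending)."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]  # keep only columns with missing values
    summary = pd.DataFrame({
        "Missing Values": counts,
        "Percent": counts / len(df) * 100,
    })
    # Sort after building the frame so the order is not lost to index alignment
    return summary.sort_values("Percent", ascending=False)

# Toy data, invented for this example
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
    "d": [np.nan, np.nan, 3.0, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

On the lesson's dataset, the same helper could be called as `missing_summary(data)`.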
