# **Data PreProcessing**
Author: **Siddik Barbhuiya**

**Email**:*siddikbarbhuiya@gmail.com*

# **Why is Data Preprocessing Important?**
Data preprocessing is a fundamental step in the machine learning pipeline. Its significance is rooted in the following reasons:


> **Quality of Data:** Real-world data is often messy. It can be incomplete, inconsistent, or even misleading. These anomalies can distort predictions and affect the accuracy of a model.


> **Irrelevant and Redundant Features:** Datasets often come with many features, not all of which are useful. Some might be redundant, and some might not have any correlation with the output variable. Removing such features can enhance the performance of the model.


> **Optimal Performance:** Many machine learning algorithms expect complete, numeric, and comparably scaled inputs. Preprocessing puts the data into that form, which helps models train reliably and perform well.







# **Description of the Libraries**


> `numpy (np):` A library for numerical operations in Python. It provides support for arrays (including multidimensional arrays) and offers a variety of mathematical operations to operate on these arrays efficiently.


> `pandas (pd):` A data manipulation and analysis library for Python. It provides structures like DataFrames and Series for handling and analyzing structured data.


> `SimpleImputer`: A class from `sklearn` used to handle missing data by imputing or filling in the missing values using methods like mean, median, mode, etc.


> `StandardScaler`: A class from `sklearn` used to scale features so that they have a mean of 0 and a standard deviation of 1. It's a part of preprocessing to ensure data is standardized before feeding into certain algorithms.


> `OneHotEncoder`: A class from `sklearn` that converts each categorical variable into a set of binary (0/1) columns, one per category, so that ML algorithms can use it as numeric input.













> `LabelEncoder`: A class from `sklearn` used to convert categorical labels into integer codes. It is intended for the target variable rather than for input features.


> `ColumnTransformer`: A class from `sklearn` that allows different columns or column subsets of the input to be transformed separately and the results to be concatenated.


> `train_test_split:` A function from `sklearn` used to split datasets into random train and test subsets.
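As a minimal sketch of how `train_test_split` is typically called, the example below splits a small made-up array. The data, the 80/20 ratio, and the `random_state` value are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, and a binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` is optional, but it keeps the split identical between runs, which makes results easier to compare.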








In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

In [None]:
# Load the dataset
data = pd.read_csv('Data.csv')
data.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


# **Handling The Missing Data**

In [None]:
# Identify missing values
missing_data = data.isnull().sum()
missing_data

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

The code is used to identify and count missing values in a Pandas DataFrame.

**Here's a breakdown:**

`data.isnull()`: This method checks each element in the DataFrame (data in this case) and returns a DataFrame of the same shape as data, but with True where the values are missing (i.e., NaN) and False where they are not.

`.sum()`: This method, when applied to a DataFrame, returns the sum of values for each column. When used in conjunction with `isnull()`, it sums up the True values, effectively counting the number of missing values in each column.

So, `missing_data` is a pandas Series where the index holds the column names of the `data` DataFrame, and the values are the counts of missing values in each respective column.

For example, the entry `Age    1` in the output above means there is 1 missing value in the `Age` column of the `data` DataFrame.
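The two steps can be seen separately on a small frame. This is a sketch on made-up values (the `toy` DataFrame is hypothetical), showing the boolean mask first and the per-column count second.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Age and one missing Salary.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
})

mask = toy.isnull()          # boolean DataFrame, True where a value is missing
missing_counts = mask.sum()  # True counts as 1, so this is a per-column count

print(missing_counts)
print(toy[mask.any(axis=1)])  # rows that contain at least one missing value
```

Filtering with `mask.any(axis=1)` is a convenient way to inspect the affected rows before deciding how to fill them.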

In [None]:
# Handling missing data using mean imputation
imputer = SimpleImputer(strategy="mean")
data[['Age', 'Salary']] = imputer.fit_transform(data[['Age', 'Salary']])
data.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes


**Here's a step-by-step breakdown:**

`SimpleImputer(strategy="mean")`:

This initializes an instance of the `SimpleImputer` class from the `sklearn.impute` module. The `strategy` parameter is set to `"mean"`, so the imputer fills each missing value with the mean of the column it belongs to. The resulting object, `imputer`, is now configured to perform mean imputation.

`imputer.fit_transform(data[['Age', 'Salary']])`:

The `fit_transform` method combines two steps, fit and transform.

`fit`: Computes the mean of the `Age` and `Salary` columns of `data`, ignoring the missing entries.

`transform`: Replaces the missing values in the `Age` and `Salary` columns with their respective computed means.

The result is a NumPy array in which the missing values of both columns have been replaced by the column means.

`data[['Age', 'Salary']] = ...`:

This assigns the result of the previous step back to the original `data` DataFrame, so the `Age` and `Salary` columns now hold the mean-imputed values wherever entries were missing.


`data.head()`:
This simply displays the top few rows of the data DataFrame, allowing you to verify that the missing values in the Age and Salary columns have been replaced with their respective mean values
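To make the fit and transform steps visible, the sketch below runs them separately on made-up numbers (the `toy` DataFrame is hypothetical) and prints the means the imputer learned via its `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Salary": [50000.0, np.nan, 60000.0, 70000.0],
})

imputer = SimpleImputer(strategy="mean")
imputer.fit(toy[["Age", "Salary"]])   # learns the column means, ignoring NaN
print(imputer.statistics_)            # learned means: Age 30.0, Salary 60000.0

filled = imputer.transform(toy[["Age", "Salary"]])  # NumPy array, NaN replaced
toy[["Age", "Salary"]] = filled
print(toy)
```

Keeping `fit` and `transform` separate matters once the data is split: the means would normally be learned on the training rows only and then reused to fill the test rows.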

# **Handling the Categorical Data**

In [None]:
# Convert "Country" using one-hot encoding and "Purchased" using label encoding

# One-hot encoding for 'Country'
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
data_encoded = np.array(ct.fit_transform(data))

# Convert back to dataframe
data_encoded_df = pd.DataFrame(data_encoded, columns=['France', 'Germany', 'Spain', 'Age', 'Salary', 'Purchased'])

# Label encoding for 'Purchased'
le = LabelEncoder()
data_encoded_df['Purchased'] = le.fit_transform(data_encoded_df['Purchased'])

data_encoded_df.head()

Unnamed: 0,France,Germany,Spain,Age,Salary,Purchased
0,1.0,0.0,0.0,44.0,72000.0,0
1,0.0,0.0,1.0,27.0,48000.0,1
2,0.0,1.0,0.0,30.0,54000.0,0
3,0.0,0.0,1.0,38.0,61000.0,0
4,0.0,1.0,0.0,40.0,63777.777778,1


**One-hot encoding for 'Country':**

`ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough'):`

Initializes an instance of the `ColumnTransformer` class from `sklearn.compose`. The `transformers` parameter takes a list of `(name, transformer, columns)` tuples. Here, only one is defined:

`'encoder'`: This is just a name for this transformer.

`OneHotEncoder()`: This is the transformer instance, which will perform one-hot encoding on categorical data.

`[0]`: This specifies that the transformer should be applied to the first column of the data (which is the Country column in this case).

`remainder='passthrough'`: This means that columns not listed in `transformers` are left untouched and passed through as they are.

The result, `ct`, is an object ready to perform one-hot encoding on the specified column(s).

`data_encoded = np.array(ct.fit_transform(data))`:

Applies the previously defined column transformer (`ct`) to the dataset `data`.

`fit_transform`: Learns the unique categories in the `Country` column and then transforms this column into one-hot encoded format. The remaining columns are passed through unchanged and placed after the encoded columns.

`np.array(...)`: Wraps the result in a NumPy array so it can be handed to `pd.DataFrame` in the next step. Because the passed-through `Purchased` column still contains strings, the array has `object` dtype.

**Convert back to dataframe:**


`pd.DataFrame(data_encoded, columns=['France', 'Germany', 'Spain', 'Age', 'Salary', 'Purchased'])`:

Converts the numpy array data_encoded back to a pandas DataFrame.
The columns parameter specifies the column names, with France, Germany, and Spain being the one-hot encoded columns for the Country column.

**Label encoding for 'Purchased':**


`LabelEncoder()`: Initializes an instance of the `LabelEncoder` class from `sklearn.preprocessing`.

`data_encoded_df['Purchased'] = le.fit_transform(data_encoded_df['Purchased'])`:

Applies label encoding to the `Purchased` column. The `fit_transform` method learns the unique categories in the column and assigns each one an integer in sorted order, so here `No` becomes 0 and `Yes` becomes 1.

The `Purchased` column in `data_encoded_df` is then updated with these integers.

`data_encoded_df.head()`:

Displays the top few rows of the data_encoded_df DataFrame, allowing you to verify that the transformations have been applied correctly.
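The mapping that `LabelEncoder` learns can be inspected directly. The sketch below uses a plain list holding the same `Purchased` values as the five rows displayed above, then reads the learned categories from `classes_` and reverses the encoding with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Same values as the Purchased column in the five rows displayed above.
purchased = ["No", "Yes", "No", "No", "Yes"]

le = LabelEncoder()
codes = le.fit_transform(purchased)

print(le.classes_)                  # categories in sorted order
print(codes)                        # integer code for each entry
print(le.inverse_transform(codes))  # map the codes back to the labels
```

The position of a label in `classes_` is its integer code, which is useful when predictions later need to be translated back into `Yes` and `No`.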

# **Feature Engineering**

In [None]:
# Apply standardization on 'Age' and 'Salary'
scaler = StandardScaler()
data_encoded_df[['Age', 'Salary']] = scaler.fit_transform(data_encoded_df[['Age', 'Salary']])
data_encoded_df.head()

Unnamed: 0,France,Germany,Spain,Age,Salary,Purchased
0,1.0,0.0,0.0,0.758874,0.7494733,0
1,0.0,0.0,1.0,-1.711504,-1.438178,1
2,0.0,1.0,0.0,-1.275555,-0.8912655,0
3,0.0,0.0,1.0,-0.113024,-0.2532004,0
4,0.0,1.0,0.0,0.177609,6.632192e-16,1


`StandardScaler()`:

> Initializes an instance of the StandardScaler class from sklearn.preprocessing.
> The StandardScaler performs standardization on the data.
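As a quick check of what standardization does, the sketch below scales a single made-up column (the `ages` array is hypothetical) and compares the result with subtracting the column mean and dividing by the column standard deviation by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature column.
ages = np.array([[20.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

print(scaler.mean_, scaler.scale_)  # learned per-column mean and standard deviation
print(scaled.ravel())

# Same computation by hand; NumPy's default std is the population std (ddof=0),
# which is also what StandardScaler uses.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)
print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, which keeps features measured in large units such as `Salary` from dominating features measured in small units such as `Age`.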

`data_encoded_df[['Age', 'Salary']] = scaler.fit_transform(data_encoded_df[['Age', 'Salary']])`:
> The `fit_transform` method is applied to the `Age` and `Salary` columns of the `data_encoded_df` DataFrame.
>
> `fit`: Computes the mean (μ) and standard deviation (σ) of the provided columns.
>
> `transform`: Applies the standardization to the columns using the formula: