# Machine Learning Pipeline :
![574-5745035_machine-learning-workflow-machine-learning-data-pipeline.png](attachment:574-5745035_machine-learning-workflow-machine-learning-data-pipeline.png)

Data preprocessing is a step in the data mining and data analysis process that takes raw data and transforms it into a format that computers and machine learning models can understand and analyze.

Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it contain errors, but it is also often incomplete and lacks a regular, uniform structure.

# Data Preprocessing Importance :

When using data sets to train machine learning models, you'll often hear the phrase **"garbage in, garbage out"**. This means that if you use bad or "dirty" data to train your model, you'll end up with a bad, improperly trained model that won't actually be relevant to your analysis.

Good, preprocessed data is even more important than the most powerful algorithms, to the point that machine learning models trained with bad data could actually be harmful to the analysis you're trying to do – giving you "garbage" results.

![gi.PNG](attachment:gi.PNG)

# Understanding Machine Learning Data Features :

Wikipedia describes a machine learning data feature as **"an individual measurable property or characteristic of a phenomenon being observed"**.

It’s important to understand what "features" are when preprocessing your data because you’ll need to choose which ones to focus on depending on what your business goals are.

First, let's go over the two types of features that are used to describe data, **categorical** and **numerical** :

   * **Categorical features**: Features whose values are taken from a defined set of possible values. Categorical values can be the colors of a house, True/False, or positive/negative/neutral.
   * **Numerical features**: Features whose values are numbers, either continuous on a scale or integer counts. Numerical values are represented by whole numbers, fractions, or percentages. Numerical features can be house prices or word counts in a document.
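
As a rough illustration of the two feature types, the sketch below separates the columns of a small DataFrame by dtype. The `houses` data is invented for this example, and treating every non-numeric column as categorical is a simplifying assumption (an integer-coded category, for instance, would be misclassified).

```python
import pandas as pd

# Hypothetical housing data, invented for illustration only
houses = pd.DataFrame({
    "color": ["red", "blue", "white"],        # categorical
    "has_garage": [True, False, True],        # categorical (True/False)
    "price": [250000.0, 180000.0, 320000.0],  # numerical
    "rooms": [4, 3, 5],                       # numerical
})

# Numeric dtypes are treated as numerical features, everything else as categorical
numerical_cols = houses.select_dtypes(include="number").columns.tolist()
categorical_cols = houses.select_dtypes(exclude="number").columns.tolist()

print("numerical  :", numerical_cols)
print("categorical:", categorical_cols)
```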

The diagram below shows how features are used to train a machine learning model :
![Training%20Phase%20.png](attachment:Training%20Phase%20.png)

# Data Preprocessing Steps :

## 1. Data Quality Assessment :

Take a first look at your data to get an idea of its overall quality :

   * **Duplicated values** : Duplicates are an extreme case of nonrandom sampling, and they bias your fitted model. Including them will essentially lead to the model overfitting this subset of points. **How do we check whether there are any duplicates ?**

In [1]:
#pandas.DataFrame.duplicated
#Example

import pandas as pd

data = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

data

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


In [2]:
data.duplicated()

0    False
1     True
2    False
3    False
4    False
dtype: bool

In [3]:
data.drop_duplicates()

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


 * **Data outliers**: By definition, an outlier is a data value that is numerically distant from the rest of the data set. Outliers can have a huge impact on data analysis results. **Let's see an example**

![Temperature%20of%20Tunisia%20on%201st%20July-2.png](attachment:Temperature%20of%20Tunisia%20on%201st%20July-2.png)

![hhhh.png](attachment:hhhh.png)

We can handle outliers by :
 * Dropping them
 * Transforming their values
 
We will see that we can visualize outliers using Box plots.
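
A box plot marks as outliers the points that fall beyond its whiskers, which are conventionally placed 1.5 times the interquartile range (IQR) away from the quartiles. The sketch below applies that rule directly to the `rating` values from the duplicates example and then shows both handling options. The 1.5 factor is the usual convention, not something fixed by this notebook, and clipping is only one possible way of transforming the values.

```python
import pandas as pd

# Ratings from the noodle example above; 15 looks suspicious on a 0-5 scale
ratings = pd.Series([4, 4, 3.5, 15, 5], name="rating")

# Quartiles and interquartile range, as drawn by a box plot
q1 = ratings.quantile(0.25)
q3 = ratings.quantile(0.75)
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ratings < lower) | (ratings > upper)
print(ratings[is_outlier])

# Option 1: drop the outliers
dropped = ratings[~is_outlier]

# Option 2: transform them by clipping to the whisker limits
clipped = ratings.clip(lower=lower, upper=upper)
```

If matplotlib is installed, `ratings.plot.box()` draws the corresponding box plot.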

* **Missing data**: Look for missing data fields, blank spaces in text, or unanswered survey questions. These can be due to human error or incomplete data. To take care of missing data, you'll have to perform data cleaning.

![image.png](attachment:image.png)
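
One quick way to assess how much data is missing is to count the missing entries per column. The sketch below does this with `isna()` on a small frame invented for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a few missing entries
survey = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
    "salary": [np.nan, 1500, 2000],
})

# Number of missing values in each column
missing_per_column = survey.isna().sum()

# Share of missing values in each column (between 0 and 1)
missing_share = survey.isna().mean()

print(missing_per_column)
print(missing_share)
```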

## 2. Data Cleaning :

Data cleaning is the process of filling in missing data and correcting, repairing, or removing incorrect or irrelevant data from a data set. Data cleaning is the most important step of preprocessing because it ensures that your data is ready for the modeling step.

## Handle Missing Values :

### 1. Drop the Missing Values :

This data preprocessing method is commonly used to handle null values. Here, we either delete a particular row if it has a null value for a particular feature, or delete a particular column if more than 75% of its values are missing.

To drop the missing values we use the following method :

![image.png](attachment:image.png)

In [4]:
#Example

import numpy as np

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT],
                  "salary": [np.nan, 1500, 2000],})
df

Unnamed: 0,name,toy,born,salary
0,Alfred,,NaT,
1,Batman,Batmobile,1940-04-25,1500.0
2,Catwoman,Bullwhip,NaT,2000.0


In [24]:
df.dropna(inplace=True)

In [26]:
df 

Unnamed: 0,name,toy,born,salary
1,Batman,Batmobile,1940-04-25,1500.0
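
The example above only drops rows. The column rule mentioned earlier (drop a column when more than 75% of its values are missing) can be sketched as follows. The `frame` data and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column is almost entirely empty
frame = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan],  # 100% missing
    "score": [0.5, np.nan, 0.7, 0.9],                    # 25% missing
})

# Share of missing values per column
missing_share = frame.isna().mean()

# Drop the columns where more than 75% of the values are missing
cols_to_drop = missing_share[missing_share > 0.75].index
reduced = frame.drop(columns=cols_to_drop)

# Then drop the remaining rows that still contain a missing value
cleaned = reduced.dropna()
print(cleaned)
```

Dropping the sparse column first keeps more rows than calling `dropna()` on the original frame, which would remove every row here.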


### 2. Replace the Missing Values :

This strategy can be applied to a feature that holds numeric data, such as a year column or a home team goals column. We can calculate the mean, median or mode of the feature and use it to replace the missing values.

In [27]:
# we will handle the missing values in the salary column

median = df.salary.median()

median

1500.0

In [28]:
# fill the missing values with the median (assign the result back instead of using a chained inplace=True)
df["salary"] = df["salary"].fillna(median)


In [29]:
df

Unnamed: 0,name,toy,born,salary
1,Batman,Batmobile,1940-04-25,1500.0
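
Note that `df` had already lost its incomplete rows to `dropna` in the previous section, so the fill above has nothing left to replace. The sketch below repeats the idea on a fresh frame (named `staff` here purely for illustration) that still contains a missing salary.

```python
import numpy as np
import pandas as pd

# Fresh frame that still has a missing salary
staff = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman"],
    "salary": [np.nan, 1500, 2000],
})

# median() skips NaN by default, so this is the median of 1500 and 2000
median_salary = staff["salary"].median()

# Replace the missing salary with the median and assign the result back
staff["salary"] = staff["salary"].fillna(median_salary)
print(staff)
```

The median is usually preferred over the mean when the column contains outliers, since it is less affected by extreme values.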


## Transform Categorical Values :

Machine learning models are based on mathematical equations, so keeping categorical data in those equations would cause problems: the equations can only work with numbers. Categorical values therefore have to be converted to numeric ones first.

![categ.PNG](attachment:categ.PNG)

### Ordinal Encoding :

What is ordinal encoding?
In ordinal encoding, each unique category value is assigned an integer value.

![1_NUzgzszTdpLPZpeKPPf0kQ.png](attachment:1_NUzgzszTdpLPZpeKPPf0kQ.png)

In [10]:
#sklearn.preprocessing.OrdinalEncoder

# Creating a pandas DataFrame with ordinal data
data = {'Employee Id' : [112, 113, 114, 115], 'Income Range' : ['Low', 'High', 'Medium', 'High']}
df_ordinal = pd.DataFrame(data)
# Viewing the first few rows of the created DataFrame
df_ordinal.head()

Unnamed: 0,Employee Id,Income Range
0,112,Low
1,113,High
2,114,Medium
3,115,High


In [11]:
from sklearn.preprocessing import OrdinalEncoder

ordinalencoder = OrdinalEncoder()
ordinalencoder.fit_transform(df_ordinal[['Income Range']])

array([[1.],
       [0.],
       [2.],
       [0.]])
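
In the output above, the integers follow the alphabetical order of the labels (High = 0, Low = 1, Medium = 2), not their meaning. When the order matters, it can be passed explicitly through the `categories` parameter of `OrdinalEncoder`. A minimal sketch, assuming the intended order is Low < Medium < High:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

income = pd.DataFrame({"Income Range": ["Low", "High", "Medium", "High"]})

# One list of categories per encoded column, given in the desired order
ordered_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
codes = ordered_encoder.fit_transform(income[["Income Range"]])

# Low -> 0, Medium -> 1, High -> 2
print(codes.ravel())
```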

### One Hot Encoding :

scikit-learn's OneHotEncoder encodes categorical data by creating one dummy (0/1) variable for each distinct label of the feature passed to it.

![ob_5e7622_mtimfxh.png](attachment:ob_5e7622_mtimfxh.png)

* **sklearn.preprocessing.OneHotEncoder** :

In [12]:
#sklearn.preprocessing.OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
result = enc.fit_transform(df_ordinal[['Income Range']]).toarray()

In [13]:
result

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [14]:
df_ordinal[['Income Range']]

Unnamed: 0,Income Range
0,Low
1,High
2,Medium
3,High


In [15]:
feat = enc.get_feature_names_out(['Income Range'])
feat

array(['Income Range_High', 'Income Range_Low', 'Income Range_Medium'],
      dtype=object)

In [16]:
result = pd.DataFrame(result , columns=feat)

In [17]:
result

Unnamed: 0,Income Range_High,Income Range_Low,Income Range_Medium
0,0.0,1.0,0.0
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,1.0,0.0,0.0


* **pandas.get_dummies** :
Convert categorical variable into dummy/indicator variables.

In [18]:
pd.get_dummies(df_ordinal)

Unnamed: 0,Employee Id,Income Range_High,Income Range_Low,Income Range_Medium
0,112,0,1,0
1,113,1,0,0
2,114,0,0,1
3,115,1,0,0


### Frequency Encoding :

Frequency encoding uses the frequency of each category as its label. When the frequency is related to the target variable, this helps the model assign weights in direct or inverse proportion to it, depending on the nature of the data. It takes three steps :
* Select a categorical variable you would like to transform
* Group by the categorical variable and obtain counts of each category
* Join it back with the training dataset
![1_l0mPlpqFEK_DSu4OqSnvLg.jpeg](attachment:1_l0mPlpqFEK_DSu4OqSnvLg.jpeg)

In [30]:
# Relative frequency of each category (count / total number of rows)
fe = df_ordinal.groupby("Income Range").size() / len(df_ordinal)

In [31]:
fe

Income Range
High      0.50
Low       0.25
Medium    0.25
dtype: float64

In [21]:
df_ordinal.loc[:,"Income_range_enc"] = df_ordinal["Income Range"].map(fe)

In [22]:
df_ordinal

Unnamed: 0,Employee Id,Income Range,Income_range_enc
0,112,Low,0.25
1,113,High,0.5
2,114,Medium,0.25
3,115,High,0.5
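
The same encoding can also be obtained without an explicit group-by. The sketch below uses `value_counts(normalize=True)`, which returns the relative frequency of each category, and maps it back onto the column. The `employees` frame simply rebuilds the example data so the snippet runs on its own.

```python
import pandas as pd

# Same example data as above, rebuilt so the snippet is self-contained
employees = pd.DataFrame({
    "Employee Id": [112, 113, 114, 115],
    "Income Range": ["Low", "High", "Medium", "High"],
})

# Steps 1 and 2: relative frequency of each category
freq = employees["Income Range"].value_counts(normalize=True)

# Step 3: join the frequencies back onto the original rows
employees["Income_range_enc"] = employees["Income Range"].map(freq)
print(employees)
```

One caveat: categories that share the same frequency (Low and Medium here) become indistinguishable after encoding.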


## Feature Scaling :
Feature scaling is the method to limit the range of variables so that they can be compared on common grounds.

![image.png](attachment:image.png)

Look at the Age and Salary columns. You can easily notice that the Salary and Age variables don't have the same scale, and this will cause issues in your machine learning model.

This matters because many machine learning models, such as k-nearest neighbors and k-means, are based on **Euclidean Distance**.

![image.png](attachment:image.png)

Euclidean Formula


Let's say we take two values from each of the Age and Salary columns

Age- **40 and 27**

Salary- **72000 and 48000**


One can easily compute the distance and see that the Salary column will dominate the Euclidean distance, which is something we don't want.
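
The computation can be sketched as follows, treating the two rows as the points (40, 72000) and (27, 48000). The variable names are chosen for this illustration only.

```python
import math

# Two rows from the example: (Age, Salary)
age_a, age_b = 40, 27
salary_a, salary_b = 72000, 48000

age_diff = age_a - age_b            # 13
salary_diff = salary_a - salary_b   # 24000

# Euclidean distance between the two rows
distance = math.sqrt(age_diff**2 + salary_diff**2)

# Fraction of the squared distance that comes from Salary alone
salary_share = salary_diff**2 / (age_diff**2 + salary_diff**2)

print(round(distance, 2))
print(salary_share)
```

The distance is about 24000, almost exactly the salary difference: the 13-year age gap contributes practically nothing.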

* **Standard Scaler** :
StandardScaler removes the mean and scales each feature/variable to unit variance.

![image.png](attachment:image.png)

* **Min Max Scaler** :
Transform features by scaling each feature to a given range.

![aaa.png](attachment:aaa.png)
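
Both scalers are available in scikit-learn. `StandardScaler` computes z = (x - mean) / std for each column, and `MinMaxScaler` computes (x - min) / (max - min), mapping each column to the range [0, 1] by default. The sketch below applies both to the Age and Salary values used above. The third row is a made-up value added so that each column has more than two points.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: Age, Salary. The last row is hypothetical.
X = np.array([
    [40, 72000],
    [27, 48000],
    [35, 60000],
], dtype=float)

# Standardization: each column ends up with mean 0 and unit variance
standardized = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped to the range [0, 1]
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(standardized)
print(rescaled)
```

In practice, the scaler should be fitted on the training data only and then reused to transform the test data, so that no information from the test set leaks into training.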