In the real world, the data we work with is raw: it is not clean and needs processing before it can be passed to a machine learning model. You may have heard that roughly 80% of a data scientist’s time goes into data preprocessing and only 20% into model building. The exact split varies from project to project, but in practice preprocessing does take up most of the effort.

## 1. What Is Meant by Data Preprocessing in Machine Learning?
The machine learning workflow is shown below. As you can see, once the data has been collected and the different data sources combined, data preprocessing is the first step in the pipeline. Let’s understand what exactly data preprocessing means.

![Data%20Preprocessing.png](attachment:Data%20Preprocessing.png)

## 2. Why Do We Need Data Preprocessing in Machine Learning?

![image.png](attachment:image.png)

## 3. Which Are the Data Preprocessing Techniques?

The data preprocessing techniques in machine learning can be broadly divided into two parts: Data Cleaning and Data Transformation. The following flow chart illustrates these techniques and the steps within each.

![Data%20Preprocessing1.png](attachment:Data%20Preprocessing1.png)

## 3.1. Data Cleaning/Cleansing

As we have seen, real-world data is rarely complete, accurate, consistent, and relevant, so the first and primary step is to clean it. This stage involves several steps:

- Making the data consistent across its values, which can mean:
  - Correcting data types: attributes may have incorrect data types that do not match the data dictionary. These must be fixed before proceeding with any other data cleaning.
  - Replacing special characters: for example, removing the `$` and comma signs in a Sales/Income/Profit column, so that `$10,000` becomes `10000`.
  - Making the format of the date column consistent with the format expected by the tool used for data analysis.
- Checking for null or missing values, and also for negative values. Whether a negative value is valid depends on the data: in an income column a negative value is spurious, whereas in a profit column the same negative value represents a loss.
- Smoothing the noise present in the data by identifying and treating outliers.
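
The consistency checks above could look roughly like this in pandas. This is a minimal sketch on a made-up table: the column names (`order_date`, `sales`, `income`), the values, and the date format are assumptions chosen for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical raw data in which every column arrives as text.
raw = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-02-05", "2023-03-05"],
    "sales": ["$10,000", "$2,500", "$730"],
    "income": ["52000", "-100", "61000"],
})

clean = raw.copy()

# Strip the "$" and "," characters, then convert the text to numbers.
clean["sales"] = (
    clean["sales"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Fix the data type of a numeric column that was read as text.
clean["income"] = pd.to_numeric(clean["income"])

# Parse dates with an explicit format so every row is read the same way.
clean["order_date"] = pd.to_datetime(clean["order_date"], format="%Y-%m-%d")

# Count missing values per column and flag negative incomes for review.
missing_per_column = clean.isna().sum()
negative_income = clean[clean["income"] < 0]

print(clean.dtypes)
print(negative_income)
```

Whether a flagged negative value should be dropped, corrected, or kept is a judgment that depends on what the column means, as noted above.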

Please note that the above steps are not comprehensive; the data cleaning steps vary with the nature of the data. For instance, text data such as reviews or tweets would have to be cleaned by converting all words to the same case, removing punctuation marks and special characters, removing common words (stop words), and distinguishing words by their part of speech. Now, let’s understand how to handle the missing values and outliers in the data.
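
Before moving on, here is a small sketch of the text-cleaning steps mentioned above, using only the Python standard library. The stop-word list and the `clean_text` helper are hypothetical and deliberately tiny, and part-of-speech tagging is left out because it needs a dedicated NLP library.

```python
import string

# A tiny, hypothetical stop-word list; real projects use a much fuller one.
STOP_WORDS = {"the", "is", "a", "an", "and", "this", "it", "was"}


def clean_text(text):
    """Lowercase the text, strip punctuation, and drop common words."""
    lowered = text.lower()
    no_punctuation = lowered.translate(
        str.maketrans("", "", string.punctuation)
    )
    return [word for word in no_punctuation.split() if word not in STOP_WORDS]


tokens = clean_text("The battery life is GREAT, and the screen is sharp!")
print(tokens)  # ['battery', 'life', 'great', 'screen', 'sharp']
```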

### 3.1.1 Handling the Null/Missing Values

Null values in the dataset are imputed using the mean, median, or mode, depending on the type of data that is missing:

- Numerical Data: If a numerical value is missing, replace the NaN with the mean or the median. The median is usually preferred, because the mean is pulled towards outliers and in the direction of any skew in the data.

- Categorical Data: When a categorical value is missing, replace it with the most frequently occurring value, i.e. the mode.

Now, if a column has, let’s say, 50% of its values missing, do we replace all of those missing values with the respective median or mode? Usually, we don’t; we delete that column instead. Imputing it would leave half of the column holding a single value, which biases the column heavily towards the median/mode and distorts its relationship with the dependent variable. This is summarized in the chart below:

![Data%20Preprocessing2.png](attachment:Data%20Preprocessing2.png)
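
A minimal pandas sketch of this decision flow follows. The table, the column names, and the `MAX_MISSING_SHARE` constant are assumptions for illustration; the 50% cut-off mirrors the example above and is a rule of thumb rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in every column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, 120],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
    "notes": [None, None, None, "ok", None],  # 80% missing
})

# Drop columns in which at least half of the values are missing.
MAX_MISSING_SHARE = 0.5
missing_share = df.isna().mean()
df = df.loc[:, missing_share < MAX_MISSING_SHARE].copy()

# Impute numerical columns with the median and the rest with the mode.
for column in df.columns:
    if pd.api.types.is_numeric_dtype(df[column]):
        df[column] = df[column].fillna(df[column].median())
    else:
        df[column] = df[column].fillna(df[column].mode().iloc[0])

print(df)
```

Here the missing age is filled with the median of the observed ages (35.5), which the extreme value 120 barely moves, whereas the mean (54) would have been pulled upwards by it.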

### 3.1.2 Outlier Treatment

To check for the presence of outliers, we can draw a box plot. To treat the outliers, we can either cap the data or transform it:

#### Capping the data: 

We can place cap limits on the data using three approaches. Oh yes, there are a lot of ways to deal with data in machine learning 😀 So, we can cap via:

#### Z-Score approach: 

All values that lie more than 3 standard deviations above or below the mean are treated as outliers and can be removed or capped at those limits.
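
As a rough illustration, the Z-score rule can be written in a few lines of NumPy. The sample values are made up, and the threshold of 3 is the conventional choice described above.

```python
import numpy as np

# Hypothetical measurements: twenty values near 50 and one extreme value.
values = np.array([48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
                   50, 52, 48, 50, 51, 49, 50, 52, 48, 50, 500], dtype=float)

mean = values.mean()
std = values.std()

# Z-score: how many standard deviations each value lies from the mean.
z_scores = (values - mean) / std
is_outlier = np.abs(z_scores) > 3

# Option 1: remove the outliers.
removed = values[~is_outlier]

# Option 2: cap the values at mean +/- 3 standard deviations.
capped = np.clip(values, mean - 3 * std, mean + 3 * std)

print(values[is_outlier])
print(capped.max())
```

One caveat: the mean and standard deviation are themselves inflated by the outliers, so in very small samples no point may ever reach a Z-score of 3.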

#### Transforming the data:

There are numerous techniques available to transform the data. Some of the most commonly used are:

- Logarithmic transformation
- Exponential transformation
- Square Root transformation
- Reciprocal transformation
- Box-cox transformation
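
The sketch below applies each of these transformations with NumPy to a small, made-up array of positive values. The `box_cox` helper is a hypothetical hand-written version of the Box-Cox formula for a fixed lambda; in practice lambda is usually estimated from the data, for example by `scipy.stats.boxcox`.

```python
import numpy as np

# Hypothetical right-skewed, strictly positive data.
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

log_x = np.log(x)        # logarithmic: requires x > 0
exp_x = np.exp(x)        # exponential
sqrt_x = np.sqrt(x)      # square root: requires x >= 0
reciprocal_x = 1.0 / x   # reciprocal: requires x != 0


def box_cox(values, lmbda):
    """Box-Cox transform for a fixed lambda; defined for positive values."""
    if lmbda == 0:
        return np.log(values)
    return (values ** lmbda - 1) / lmbda


# lambda = 0.5 is an arbitrary choice for illustration.
box_cox_x = box_cox(x, 0.5)

print(log_x)
print(box_cox_x)
```

Note that Box-Cox with lambda = 0 reduces to the logarithmic transformation, which is why the two are often discussed together.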

## 3.2. Data Transformation

Data transformation is different from feature transformation; the latter replaces the existing attributes with a mathematical function of those attributes. The transformation we focus on here makes the numerical and the categorical data machine-ready. This is done in the following manner:

### 3.2.1 Numerical data

Numerical data is scaled, meaning we bring all the numerical variables onto the same scale. For example, predicting how large a loan to give to a customer depends on variables such as age, salary, and number of working years. If we build a linear regression model for this problem, we cannot meaningfully compare the beta coefficients of these variables, because each variable is measured on a different scale. Hence, scaling the variables is essential. The two ways to scale data are Standardization and Normalization.

- Standardization: The numerical data is scaled using the Z-score formula, z = (x - mean) / standard deviation. The scaled data has a mean of 0 and a standard deviation of 1. The values are not bounded, but for roughly normally distributed data almost all of them fall between -3 and 3.
- Normalization: Here, the scaling uses the formula (x - min) / (max - min), which maps the data onto the range 0 to 1. This is also known as min-max scaling.
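
Both scalers are available in scikit-learn as `StandardScaler` and `MinMaxScaler`. The following sketch applies them to a made-up feature matrix whose columns stand for age, salary, and years of work experience; the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features: age (years), salary, years of work experience.
X = np.array([
    [25, 30000, 2],
    [35, 60000, 10],
    [45, 90000, 20],
    [55, 120000, 30],
], dtype=float)

# Standardization: (x - mean) / standard deviation, column by column.
standardized = StandardScaler().fit_transform(X)

# Normalization: (x - min) / (max - min), column by column.
normalized = MinMaxScaler().fit_transform(X)

print(standardized.mean(axis=0), standardized.std(axis=0))
print(normalized.min(axis=0), normalized.max(axis=0))
```

After standardization every column has mean 0 and standard deviation 1, and after normalization every column runs from 0 to 1, so the three variables become directly comparable.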

### 3.2.2 Categorical Data

Categorical data cannot be fed directly into the model. Machines are black and white: they work with numbers, ultimately 1s and 0s. So, to use categorical data in our model building process, we need to create dummy variables. Dummy variables are binary; they take the value 1 or 0. If a categorical column has n sub-categories, we use n-1 dummy variables, because the remaining category is already implied when all the dummies are 0. There are two ways to create dummy variables:

- pandas’ function `pd.get_dummies`, and
- scikit-learn’s built-in `OneHotEncoder` class
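
Both options are sketched below on a made-up `city` column. Passing `drop_first=True` to `pd.get_dummies` and `drop="first"` to `OneHotEncoder` is what yields n-1 columns instead of n; the column name and values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three sub-categories.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Hyderabad", "Delhi"]})

# Option 1: pandas. drop_first=True keeps n-1 = 2 dummy columns.
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)

# Option 2: scikit-learn. drop="first" also keeps n-1 columns.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(df[["city"]]).toarray()

print(dummies)
print(encoded)
```

In both cases the alphabetically first category, Delhi, is dropped and is represented by a row in which both remaining dummies are 0.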

There is one more way of dealing with categorical data, which is label encoding. The label encoder does not create dummy variables; instead, it labels each category with a number, as below:

- Delhi -> 1
- Mumbai -> 2
- Hyderabad -> 3

Label encoding has a limitation: it converts nominal data (categorical data without any order) into ordinal data that has an order. In the above example, the three cities have no order, yet after applying the label encoder they have the values 1, 2, and 3. The model will treat these numbers as magnitudes, so 3 > 2 > 1 implies Hyderabad > Mumbai > Delhi. Because of this limitation, nominal categorical data is usually handled by creating dummy variables instead.
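
To see this in practice, here is a short sketch using scikit-learn's `LabelEncoder`. Note that this particular encoder assigns 0-based codes in alphabetical order, so the numbers differ from the illustrative 1, 2, 3 above, but the artificial ordering problem is the same.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["Delhi", "Mumbai", "Hyderabad", "Delhi"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ lists the categories in the order of their integer codes.
mapping = {city: index for index, city in enumerate(encoder.classes_)}

print(mapping)  # {'Delhi': 0, 'Hyderabad': 1, 'Mumbai': 2}
print(codes)    # [0 2 1 0]
```

A model that reads these codes as numbers would see Mumbai (2) as "greater than" Hyderabad (1), an ordering that exists only because of the alphabet.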

## 4. Data Preprocessing Steps in Machine Learning

The steps in data preprocessing in machine learning are:

1. Consolidate the data after acquisition
2. Clean the data:
   - Convert the data types wherever they do not match the expected types of the variables
   - Change the format of the date variables to the required format
   - Replace special characters and constants with appropriate values
3. Detect and treat missing values
4. Treat negative values, if the data contains any that are not valid
5. Detect and treat outliers
6. Transform the variables
7. Create new derived variables
8. Scale the numerical variables
9. Encode the categorical variables
10. Split the data into training, validation, and test sets
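
The final step, splitting, can be sketched with scikit-learn's `train_test_split` called twice. The data is randomly generated and the 60/20/20 proportions are an assumption for illustration. The sketch also fits the scaler on the training set only; that refinement is not part of the list above, but it is a common way to keep information from the validation and test sets out of the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 100 rows, 3 numerical features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First split off 20% as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remaining 80% into 60% training and 20% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then apply it everywhere.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```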