## Handling missing values in a Dataset:

Missing data refers to values that are not stored (or not present) for one or more variables in a dataset. Below is a sample of the missing data from the Titanic dataset. You can see the columns ‘Age’ and ‘Cabin’ have some missing values.

In Pandas, missing values are usually represented by NaN, which stands for Not a Number.


**Introduction:**

If you are aiming for a job as a data scientist, you must know how to handle the problem of missing values, which is quite common in many real-life datasets. Incomplete data can bias the results of machine learning models and/or reduce their accuracy. This article describes missing data, how it is represented, and the different reasons data values go missing. Along with the different categories of missing data, it also details different ways of handling missing values, with dataset examples.

**Learning Objectives:**

In this tutorial, we will learn about missing values and the benefits of missing data analysis in data science.
You will learn about the different types of missing data and how to handle them correctly.
You will also learn about the most widely used imputation methods to handle incomplete data.

**Table of Contents:**

What Is a Missing Value?

How Is a Missing Value Represented in a Dataset?

Why Is Data Missing From the Dataset?

Types of Missing Values?

Why Do We Need to Care About Handling Missing Data?

How to Impute Missing Values for Categorical Features?

How to Impute Missing Values Using Sci-kit Learn Library?

How to Use “Missingness” as a Feature?



**Why Is Data Missing From the Dataset?**

There can be multiple reasons why certain values are missing from the data. The reason the data is missing affects how it should be handled, so it’s necessary to understand why the data could be missing.

**Some of the reasons are listed below:**

Past data might get corrupted due to improper maintenance.
Observations are not recorded for certain fields. There might be a failure in recording the values due to human error.
The user has intentionally not provided the values.
Item nonresponse: the participant refused to respond.

**Types of Missing Values:**

Formally, missing values are categorized as follows:

**Missing Completely At Random (MCAR):**

In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no relationship between the missing data and any other values observed or unobserved (the data which is not recorded) within the given dataset. That is, missing values are completely independent of other data. There is no pattern.

In the case of MCAR data, the value could be missing due to human error, some system/equipment failure, loss of sample, or some unsatisfactory technicalities while recording the values. For example, suppose in a library there are some overdue books. Some values of overdue books in the computer system are missing. The reason might be a human error, like the librarian forgetting to type in the values. So, the missing values of overdue books are not related to any other variable/data in the system. MCAR should not be assumed by default, as it is a rare case in practice. The advantage of such data is that the statistical analysis remains unbiased.

**Missing At Random (MAR):**

MAR data means that the reason for missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and other values/data. In this case, the data is not missing for all the observations. It is missing only within sub-samples of the data, and there is some pattern in the missing values.

For example, if you check the survey data, you may find that all the people have answered their ‘Gender,’ but ‘Age’ values are mostly missing for people who have answered their ‘Gender’ as ‘female.’ (In this hypothetical survey, female respondents were less willing to reveal their age.)

So, the probability of data being missing depends only on the observed value or data. In this case, the variables ‘Gender’ and ‘Age’ are related. The reason for missing values of the ‘Age’ variable can be explained by the ‘Gender’ variable, but you can not predict the missing value itself.

Suppose a poll is taken for overdue books in a library. Gender and the number of overdue books are asked in the poll. Assume that most of the females answer the poll and men are less likely to answer. So why the data is missing can be explained by another factor, that is gender. In this case, the statistical analysis might result in bias. Getting an unbiased estimate of the parameters can be done only by modeling the missing data.

**Missing Not At Random (MNAR):**

Missing values depend on the unobserved data. If there is some structure/pattern in missing data and other observed data can not explain it, then it is considered to be Missing Not At Random (MNAR).

If the missing data does not fall under the MCAR or MAR, it can be categorized as MNAR. It can happen due to the reluctance of people to provide the required information. A specific group of respondents may not answer some questions in a survey.

For example, suppose the name and the number of overdue books are asked in the poll for a library. Most of the people having no overdue books are likely to answer the poll. People having more overdue books are less likely to answer the poll. So, in this case, whether the number of overdue books is missing depends on the number of overdue books itself, which is exactly the value that was not observed.

Another example is that people having less income may refuse to share some information in a survey or questionnaire.

In the case of MNAR as well, the statistical analysis might result in bias.
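
The difference between the mechanisms can be made concrete with a small simulation. The sketch below uses made-up data (random counts of overdue books) and arbitrary missingness probabilities chosen only for illustration; it compares the mean of the values that remain under MCAR and under MNAR with the true mean.

```python
import random
import statistics

random.seed(0)

# Hypothetical "number of overdue books" for 10,000 library members.
books = [random.randint(0, 10) for _ in range(10_000)]
true_mean = statistics.mean(books)

# MCAR: every answer has the same 30% chance of being lost.
mcar = [b for b in books if random.random() > 0.30]

# MNAR: the chance of a missing answer grows with the unobserved value itself.
mnar = [b for b in books if random.random() > b / 12]

print(f"true mean: {true_mean:.2f}")
print(f"MCAR mean: {statistics.mean(mcar):.2f}")
print(f"MNAR mean: {statistics.mean(mnar):.2f}")
```

Under these assumptions the MCAR estimate should stay close to the true mean, while the MNAR estimate is pulled downward because members with many overdue books are under-represented.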

**Why Do We Need to Care About Handling Missing Data?**

It is important to handle the missing values appropriately.

Many machine learning algorithms fail if the dataset contains missing values. Some algorithms can tolerate them; K-nearest neighbors and Naive Bayes are often cited as examples, but whether missing values are actually accepted depends on the implementation.
If the missing values are not handled properly, you may end up building a biased machine learning model, leading to incorrect results.
Missing data can lead to a lack of precision in the statistical analysis.

**Checking for Missing Values in Python:**

The first step in handling missing values is to look carefully at the complete data and find all the missing values. The following code shows the total number of missing values in each column.

train_df.isnull().sum()

On this loan dataset, the output shows that 7 columns (Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History) have missing values.


***Find the total number of missing values from the entire dataset :***

train_df.isnull().sum().sum() 
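
Since ‘train_df’ itself is not loaded in this section, here is a small self-contained sketch of the same checks on a toy frame. The column values are invented for illustration; only the pandas calls mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, np.nan, np.nan, 66.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

per_column = toy.isnull().sum()        # missing count per column
total = int(toy.isnull().sum().sum())  # missing count in the whole frame
share = toy.isnull().mean()            # fraction of missing values per column

print(per_column)
print("total missing:", total)
print(share)
```

The per-column share is often more useful than the raw count when deciding whether a column is worth keeping.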


**Handling Missing Values:**

Now that you have found the missing data, how do you handle the missing values?

Analyze each column with missing values carefully to understand why those values are missing, as this information is crucial for choosing the strategy for handling them.

There are 2 primary ways of handling missing values:

1. Deleting the Missing Values
2. Imputing the Missing Values

**Deleting the Missing value:**

Generally, this approach is not recommended. It is one of the quick and dirty techniques one can use to deal with missing values. If the missing value is of the type Missing Not At Random (MNAR), then it should not be deleted.

If the missing value is of type Missing Completely At Random (MCAR), deletion leaves the analysis unbiased, although it still reduces the sample size. With Missing At Random (MAR) data, deletion is sometimes used as well, but it can introduce bias because the remaining rows are no longer representative. (Pairwise deletion, which uses all cases with available data for each individual analysis, likewise assumes the data is MCAR.)

The disadvantage of this method is that one might end up deleting some useful data from the dataset.

**There are 2 ways one can delete the missing data values:**

**Deleting the entire row (listwise deletion):**

If a row has many missing values, you can drop the entire row. Note that ‘dropna(axis=0)’ drops every row that has at least one missing value, so if every row has some (column) value missing, you might end up deleting the whole dataset. The code to drop the entire row is as follows:

IN:
df = train_df.dropna(axis=0)
df.isnull().sum()
OUT:
Loan_ID  0
Gender  0
Married  0
Dependents  0
Education  0
Self_Employed 0
ApplicantIncome  0
CoapplicantIncome  0
LoanAmount  0
Loan_Amount_Term  0
Credit_History  0
Property_Area  0
Loan_Status  0
dtype: int64
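
Listwise deletion does not have to be all-or-nothing. The sketch below, on an invented toy frame, shows two less aggressive variants that ‘dropna’ supports: keeping rows that have at least a minimum number of non-missing values (‘thresh’) and dropping rows only when an essential column is missing (‘subset’).

```python
import numpy as np
import pandas as pd

# Made-up rows; None / NaN mark missing entries.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, np.nan, 80.0, np.nan],
    "Credit_History": [1.0, np.nan, np.nan, 1.0],
})

# Default behaviour: drop every row with at least one missing value.
complete_only = toy.dropna()

# Keep a row if it has at least 2 non-missing values.
mostly_complete = toy.dropna(thresh=2)

# Drop a row only if a specific, essential column is missing.
has_loan_amount = toy.dropna(subset=["LoanAmount"])

print(len(complete_only), len(mostly_complete), len(has_loan_amount))
```

Which variant is appropriate depends on how much data you can afford to lose and on which columns the analysis cannot do without.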

**Deleting the entire column:**

If a certain column has many missing values, then you can choose to drop the entire column. The code to drop the entire column is as follows:

IN:
df = train_df.drop(['Dependents'],axis=1)
df.isnull().sum()

OUT:
Loan_ID  0
Gender  13
Married  3
Education  0
Self_Employed 32
ApplicantIncome  0
CoapplicantIncome  0
LoanAmount  22
Loan_Amount_Term  14
Credit_History  50
Property_Area  0
Loan_Status  0
dtype: int64

**Imputing the Missing Value:**
There are many imputation methods for replacing the missing values. You can use different Python libraries, such as Pandas and scikit-learn, to do this. Let’s go through some of the ways of replacing the missing values.

**Replacing with an arbitrary value:**

If you can make an educated guess about the missing value, then you can replace it with some arbitrary value. E.g., in the following code, we are replacing the missing values of the ‘Dependents’ column with ‘0’.

**Replace the missing value with '0' using the 'fillna' method:**

train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()

OUT:
0

**Replacing with the mean:**

This is the most common method of imputing missing values of numeric columns. If there are outliers, then the mean will not be appropriate. In such cases, outliers need to be treated first. You can use the ‘fillna’ method for imputing the columns ‘LoanAmount’ and ‘Credit_History’ with the mean of the respective column values.


**Replace the missing values for numerical columns with mean:**

train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
train_df.isnull().sum()

OUT:
Loan_ID  0
Gender  13
Married  3
Dependents  15
Education  0
Self_Employed 32
ApplicantIncome  0
CoapplicantIncome  0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area  0
Loan_Status  0
dtype: int64

**Replacing with the mode:**

Mode is the most frequently occurring value. It is used in the case of categorical features. You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and ‘Self_Employed.’

**Replace the missing values for categorical columns with mode:**

IN:
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()

OUT:
Loan_ID 0
Gender  0
Married 0
Dependents  0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount  0
Loan_Amount_Term  0
Credit_History  0
Property_Area 0
Loan_Status 0
dtype: int64

**Replacing with the median:**

The median is the middlemost value. It’s better to use the median value for imputation in the case of outliers. You can use the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’ with the median value.

train_df['Loan_Amount_Term']= train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
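
To see why the median is preferred when outliers are present, consider the following sketch. The loan amounts are invented, with one deliberately extreme value; the point is only to compare the two fill values.

```python
import statistics

# Made-up loan amounts with one extreme value and two missing entries (None).
loan_amounts = [100, 110, 120, 130, 5000, None, None]
observed = [x for x in loan_amounts if x is not None]

mean_value = statistics.mean(observed)      # pulled upward by the outlier
median_value = statistics.median(observed)  # stays near the typical value

mean_imputed = [mean_value if x is None else x for x in loan_amounts]
median_imputed = [median_value if x is None else x for x in loan_amounts]

print("mean fill:  ", mean_value)
print("median fill:", median_value)
```

Here the mean is far larger than four of the five observed values, so filling with it would make the imputed rows look like unusually large loans, whereas the median stays representative.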

**Replacing with the previous value (forward fill):**

In some cases, imputing the values with the previous value instead of the mean, mode, or median is more appropriate. This is called forward fill. It is mostly used in time series data. You can use the ‘ffill’ method for this. Older code often calls ‘fillna’ with the parameter ‘method = ffill’, which is deprecated in recent pandas releases.

IN:
import pandas as pd
import numpy as np
test = pd.Series(range(6))
test.loc[2:4] = np.nan
test
OUT:
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
dtype: float64
**Forward-Fill:**

IN:
test.ffill()  # same result as test.fillna(method='ffill'), which recent pandas deprecates
OUT:
0 0.0
1 1.0
2 1.0
3 1.0
4 1.0
5 5.0
dtype: float64

**Replacing with the next value (backward fill):**

In backward fill, the missing value is imputed using the next value.

**Backward-Fill:**

IN:
test.bfill()  # same result as test.fillna(method='bfill'), which recent pandas deprecates
OUT:
0 0.0
1 1.0
2 5.0
3 5.0
4 5.0
5 5.0
dtype: float64

**Interpolation:**

Missing values can also be imputed using interpolation. Pandas’ interpolate method can be used to replace the missing values with different interpolation methods like ‘polynomial,’ ‘linear,’ and ‘quadratic.’ The default method is ‘linear.’

IN:
test.interpolate()
OUT:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
dtype: float64

**How to Impute Missing Values for Categorical Features?**

There are two ways to impute missing values for categorical features as follows:

**Impute the Most Frequent Value:**

We will use ‘SimpleImputer’ in this case. As this is a non-numeric column, we can’t use the mean or median, but we can use the most frequent value or a constant.

IN:
import pandas as pd
import numpy as np
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle', np.nan]})
X
OUT:
  Shape
0 square
1 square
2 oval
3 circle
4 NaN
IN:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)
OUT:
array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['square']], dtype=object)

As you can see, the missing value is imputed with the most frequent value, ‘square.’

**Impute the Value “Missing”:**

We can impute the value “missing,” which treats it as a separate category.

IN:
imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputer.fit_transform(X)
OUT:
array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['missing']], dtype=object)

In either of the above approaches, you will still need to one-hot encode the data (or you can use another encoder of your choice). After one-hot encoding, in case 1, instead of the values ‘square,’ ‘oval,’ and ‘circle,’ you will get three feature columns. In case 2, you will get four feature columns (the fourth one for the ‘missing’ category). So it’s like adding a missing indicator column to the data. There is another way to add a missing indicator column, which we will discuss further.

**How to Impute Missing Values Using Sci-kit Learn Library?**

We can impute missing values using the scikit-learn library. Its imputers range from simple per-column statistics to regression imputation, in which a model predicts the missing values of one variable from the other variables.

**Univariate Approach:**

In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median, or some constant value.

Let’s see an example:

IN:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
OUT: SimpleImputer()
IN:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
OUT:
[[4.          2.        ]
 [6.          3.666...]
 [7.          6.        ]]
**Multivariate Approach:**

In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values with a multivariate approach: the KNNImputer class and the IterativeImputer class.

Let’s take an example from the Titanic dataset.

Suppose the feature ‘Age’ is well correlated with the feature ‘Fare’ such that people with lower fares are also younger and people with higher fares are also older. In that case, it would make sense to impute a low age for low fare values and a high age for high fare values. So here, we are taking multiple features into account by following a multivariate approach.

IN:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]
X
OUT:
   SibSp     Fare   Age
0      1   7.2500  22.0
1      1  71.2833  38.0
2      0   7.9250  26.0
3      1  53.1000  35.0
4      0   8.0500  35.0
5      0   8.4583   NaN
IN:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
impute_it = IterativeImputer()
impute_it.fit_transform(X)
OUT:
array([[ 1.        ,  7.25      , 22.        ],
       [ 1.        , 71.2833    , 38.        ],
       [ 0.        ,  7.925     , 26.        ],
       [ 1.        , 53.1       , 35.        ],
       [ 0.        ,  8.05      , 35.        ],
       [ 0.        ,  8.4583    , 28.50639495]])

Let’s see how IterativeImputer works. For all rows in which ‘Age’ is not missing, scikit-learn trains a regression model. It uses ‘SibSp’ and ‘Fare’ as the features and ‘Age’ as the target. Then, for all rows in which ‘Age’ is missing, it makes predictions for ‘Age’ by passing ‘SibSp’ and ‘Fare’ to the trained model. So it builds a regression model with two features and one target and then makes predictions wherever there are missing values. Those predictions are the imputed values.
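
The idea of regression imputation can be sketched by hand with NumPy. This is a simplified illustration under stated assumptions: it fits ordinary least squares in a single pass, whereas IterativeImputer uses a different default estimator (BayesianRidge) and iterates, so the number produced here is not expected to match the 28.5 shown above.

```python
import numpy as np

# Same six rows as above: columns are SibSp, Fare, Age (last Age is missing).
X = np.array([
    [1, 7.2500, 22.0],
    [1, 71.2833, 38.0],
    [0, 7.9250, 26.0],
    [1, 53.1000, 35.0],
    [0, 8.0500, 35.0],
    [0, 8.4583, np.nan],
])

observed = ~np.isnan(X[:, 2])

# Design matrix with an intercept column: [1, SibSp, Fare].
A = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1]])

# Fit Age ~ SibSp + Fare on the rows where Age is known.
coef, *_ = np.linalg.lstsq(A[observed], X[observed, 2], rcond=None)

# Predict Age for the rows where it is missing and fill it in.
X_imputed = X.copy()
X_imputed[~observed, 2] = A[~observed] @ coef

print(X_imputed[-1])
```

With only five complete rows the fitted coefficients are very unstable, which is one reason regularized estimators are a sensible default for this task.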

**Nearest Neighbors Imputations (KNNImputer):**

Missing values are imputed using the k-Nearest Neighbors approach, where a Euclidean distance is used to find the nearest neighbors. Let’s take the above example from the Titanic dataset to see how it works.

IN:
from sklearn.impute import KNNImputer
impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)
OUT:
array([[ 1.    ,  7.25  , 22.    ],
       [ 1.    , 71.2833, 38.    ],
       [ 0.    ,  7.925 , 26.    ],
       [ 1.    , 53.1   , 35.    ],
       [ 0.    ,  8.05  , 35.    ],
       [ 0.    ,  8.4583, 30.5   ]])

In the above example, n_neighbors=2. So scikit-learn finds the two most similar rows, measured by how close their ‘SibSp’ and ‘Fare’ values are to those of the row which has the missing value. In this case, the last row has a missing value, and the third row and the fifth row have the closest values for the other two features. So the average of the ‘Age’ feature from these two rows is taken as the imputed value.
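
The neighbor search described above can be reproduced by hand with the standard library. This is a minimal sketch, not scikit-learn's implementation: it uses plain Euclidean distance on ‘SibSp’ and ‘Fare’, which gives the same neighbor ranking here because those two features have no missing values.

```python
import math

# (SibSp, Fare, Age) for the six rows above; None marks the missing Age.
rows = [
    (1, 7.2500, 22.0),
    (1, 71.2833, 38.0),
    (0, 7.9250, 26.0),
    (1, 53.1000, 35.0),
    (0, 8.0500, 35.0),
    (0, 8.4583, None),
]

target = rows[-1]
candidates = [r for r in rows if r[2] is not None]

def distance(a, b):
    """Euclidean distance on the two observed features, SibSp and Fare."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Two closest rows, then the average of their ages.
nearest = sorted(candidates, key=lambda r: distance(r, target))[:2]
imputed_age = sum(r[2] for r in nearest) / len(nearest)

print(nearest)
print(imputed_age)  # 30.5
```

The two nearest rows have ages 35 and 26, whose average is the 30.5 reported by KNNImputer.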

**How to Use “Missingness” as a Feature?**

In some cases, while imputing missing values, you can preserve information about which values were missing and use that as a feature. This is because sometimes, there may be a relationship between the reason for missing values (also called the “missingness”) and the target variable you are trying to predict. In such cases, you can add a missing indicator to encode the “missingness” as a feature in the imputed data set.

Where can we use this?

Suppose you are predicting the presence of a disease. Now, imagine a scenario where a missing age is a good predictor of the disease because we don’t have records for people in poverty. The age values are not missing at random. They are missing for people in poverty, and poverty is a good predictor of disease. Thus, missing age or “missingness” is a good predictor of disease.

IN:
import pandas as pd
import numpy as np
X = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})
X
OUT:
    Age
0  20.0
1  30.0
2  10.0
3   NaN
4  10.0
IN:
from sklearn.impute import SimpleImputer

**impute the mean**

imputer = SimpleImputer()
imputer.fit_transform(X)
OUT:
array([[20. ],
       [30. ],
       [10. ],
       [17.5],
       [10. ]])

IN:
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)
OUT:
array([[20. ,  0. ],
       [30. ,  0. ],
       [10. ,  0. ],
       [17.5,  1. ],
       [10. ,  0. ]])
In the above example, the second column indicates whether the corresponding value in the first column was missing or not. ‘1’ indicates that the corresponding value was missing, and ‘0’ indicates that the corresponding value was not missing.

If you don’t want to impute missing values but only want to have the indicator matrix, then you can use the ‘MissingIndicator’ class from scikit-learn.
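
If you are working directly in pandas, an equivalent indicator column can be built without scikit-learn. The sketch below is a pandas-only alternative, not the ‘MissingIndicator’ class itself; the column name ‘Age_missing’ is a hypothetical choice.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"Age": [20, 30, 10, np.nan, 10]})

# 1 where Age was missing, 0 otherwise; computed before any imputation.
X["Age_missing"] = X["Age"].isna().astype(int)

# Optional: impute afterwards. The indicator keeps the information.
X["Age"] = X["Age"].fillna(X["Age"].mean())

print(X)
```

The order matters: the indicator has to be computed before the missing values are filled in.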

**Conclusion:**

Missing data is a problem everyone faces while dealing with real-life data. It can impact the quality and accuracy of our results. Understanding the different types of missing data values and their potential impact on the analysis is crucial for researchers to select an appropriate method for handling the missing data. Each method has its advantages and disadvantages and is appropriate for different types of missing data values.



## Handling Duplicates in a data set:



**Duplicates in a structured dataset:**

Duplicates in a structured data set can cause issues with data quality, model performance, and interpretation of results. Here are some common methods to deal with duplicates:

**Drop Duplicates:**

One straightforward approach is to drop the duplicate rows in the data set, keeping one copy of each. This can be done using the "drop_duplicates" function in Python's pandas library or "SELECT DISTINCT" in SQL. However, this approach can lead to loss of valuable data if rows treated as duplicates (for example, when matching on only a subset of columns) carry unique information in other columns.
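
A minimal pandas sketch of this approach follows. The customer records are invented; the example contrasts exact-duplicate removal with removal based on a subset of columns.

```python
import pandas as pd

# Made-up customer records: rows 0 and 1 are exact duplicates,
# rows 2 and 3 share an email but differ in city.
df = pd.DataFrame({
    "name":  ["Asha", "Asha", "Ravi", "Ravi"],
    "email": ["asha@example.com", "asha@example.com",
              "ravi@example.com", "ravi@example.com"],
    "city":  ["Delhi", "Delhi", "Mumbai", "Pune"],
})

n_exact = int(df.duplicated().sum())   # number of exact duplicate rows
exact = df.drop_duplicates()           # keeps the first copy of each row

# One row per email, keeping the last occurrence.
by_email = df.drop_duplicates(subset=["email"], keep="last")

print(n_exact)
print(exact)
print(by_email)
```

Note that deduplicating on ‘email’ silently discards one of the two cities recorded for the second customer, which is the kind of information loss mentioned above.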

**Aggregate Duplicates:**

 Another approach is to aggregate the data by combining the duplicate rows into a single row. This can be done using functions such as "groupby" and "agg" in pandas or using SQL commands such as "GROUP BY" and "SELECT ... COUNT()". Aggregation can be performed by computing the mean, median, mode, or sum of the values in each column, depending on the data type and the research question.
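
As a sketch of aggregation with "groupby" and "agg", the example below merges repeated customer rows into one row per customer. The data and the output column names (‘total_amount’, ‘mean_amount’, ‘n_records’) are hypothetical.

```python
import pandas as pd

# Made-up orders where the same customer appears more than once.
orders = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", "Ravi", "Meera"],
    "amount":   [100.0, 300.0, 50.0, 150.0, 80.0],
    "city":     ["Delhi", "Delhi", "Mumbai", "Pune", "Chennai"],
})

# One row per customer: numeric columns are summed/averaged,
# and the first recorded city is kept.
merged = orders.groupby("customer", as_index=False).agg(
    total_amount=("amount", "sum"),
    mean_amount=("amount", "mean"),
    n_records=("amount", "count"),
    city=("city", "first"),
)

print(merged)
```

Choosing "first" for the city is an arbitrary rule here; in practice the aggregation for each column should follow from what the column means.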

**Identify and Resolve Duplicates:**

 Sometimes, duplicates may be the result of errors or inconsistencies in the data collection process. In this case, it may be necessary to identify the root cause of the duplicates and resolve them manually. This can involve data cleaning, data validation, or merging data from different sources.

**Record Linkage:**

 Record linkage is a process that involves identifying and linking records that refer to the same entity across different data sources. Record linkage techniques can be used to identify and remove duplicates in a structured data set by matching records based on common attributes such as name, address, or phone number.

**Data Fusion:**

 Data fusion is a process that involves combining data from multiple sources to create a more complete and accurate representation of the underlying phenomenon. Data fusion techniques can be used to merge duplicate records from different data sets, while also resolving any inconsistencies or errors in the data.


**Handling Duplicates In An Unstructured Dataset:**

Dealing with duplicates in an unstructured data set can be more challenging than in a structured data set because unstructured data typically lacks a consistent structure or format. Here are some common methods to deal with duplicates in an unstructured data set:

**Text Similarity:**

 One approach to identify duplicates in an unstructured data set is to compute the similarity between texts using natural language processing (NLP) techniques. Similarity measures such as cosine similarity, Jaccard similarity, or Levenshtein distance can be used to identify texts that have a high degree of similarity. Once the duplicates are identified, they can be removed or merged.
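
As an illustration of one of these measures, here is a minimal sketch of Jaccard similarity on word tokens using only the standard library. The example texts and the 0.7 threshold are assumptions made for the demonstration.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lowercase word tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",
    "Missing values bias machine learning models",
]

THRESHOLD = 0.7  # hypothetical cut-off; tune on real data

# Score every pair of documents once.
pairs = [
    (i, j, jaccard(docs[i], docs[j]))
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
]
near_duplicates = [(i, j) for i, j, score in pairs if score >= THRESHOLD]

print(near_duplicates)  # [(0, 1)]
```

Comparing every pair grows quadratically with the number of texts, so for larger collections this kind of scoring is usually combined with a cheaper pre-filter.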

**Hashing:**

Hashing is a technique that involves generating a fixed-size digital fingerprint, or hash, for each text. Texts that have the same hash value are, in practice, duplicates of each other. Hashing can be performed using algorithms such as MD5, SHA1, or SHA256. Because even a one-character difference changes the hash, this method finds exact duplicates only.
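
A minimal sketch of hash-based duplicate detection with Python's ‘hashlib’ is shown below. The normalization step (lowercasing and collapsing whitespace) is an assumption added here so that trivially different copies still collide; whether it is appropriate depends on the data.

```python
import hashlib

def fingerprint(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

texts = [
    "Handling missing values",
    "handling   missing values ",
    "Handling duplicate values",
]

seen = {}        # fingerprint -> index of first occurrence
duplicates = []  # (duplicate index, index of the original)
for i, text in enumerate(texts):
    key = fingerprint(text)
    if key in seen:
        duplicates.append((i, seen[key]))
    else:
        seen[key] = i

print(duplicates)  # [(1, 0)]
```

Each text is hashed once and looked up in a dictionary, so this scales linearly with the number of texts.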

**Clustering:**

 Clustering is a machine learning technique that involves grouping similar texts into clusters. Clustering algorithms such as k-means, hierarchical clustering, or density-based clustering can be used to group texts that are similar based on their content, structure, or metadata. Once the clusters are formed, the duplicates can be identified and removed or merged.

**Record Linkage:**

 Record linkage techniques can also be applied to unstructured data sets to identify and link records that refer to the same entity. Record linkage can be performed using probabilistic matching algorithms that take into account the similarity between texts, as well as other metadata such as date, time, or location.

**Manual Review:**

 In some cases, it may be necessary to manually review the unstructured data to identify and resolve duplicates. This can involve reading through the texts and identifying duplicates based on their content, structure, or metadata.





## Here’s All you Need to Know About Encoding Categorical Data :


**Overview:**

Understand what is Categorical Data Encoding
Learn different encoding techniques and when to use them
 

**Introduction:**

The performance of a machine learning model not only depends on the model and the hyperparameters but also on how we process and feed different types of variables to the model. Since most machine learning models only accept numerical variables, preprocessing the categorical variables becomes a necessary step. We need to convert these categorical variables to numbers such that the model is able to understand and extract valuable information.

**Categorical Data Encoding:**

It is often said that data scientists spend 70 – 80% of their time cleaning and preparing data, and converting categorical data is an unavoidable part of that work. Done well, it improves model quality and helps with feature engineering. Now the question is, how do we proceed?

**Which Categorical data encoding method should we use?**

In this article, I will be explaining various types of categorical data encoding methods with implementation in Python.



**Table of content:**

What is Categorical Data?

1.Label Encoding or Ordinal Encoding

2.One hot Encoding

3.Dummy Encoding

4.Effect Encoding

5.Binary Encoding

6.BaseN Encoding

7.Hash Encoding

8.Target Encoding
 


**What is categorical data?**

Since we are going to be working on categorical variables in this article, here is a quick refresher on the same with a couple of examples. Categorical variables are usually represented as ‘strings’ or ‘categories’ and are finite in number. 

**Here are a few examples:**

The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
The department a person works in: Finance, Human resources, IT, Production.
The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
The grades of a student: A+, A, B+, B, B-, etc.

In the above examples, the variables only have a definite set of possible values.

Further, we can see there are two kinds of categorical data:

**Ordinal Data**: The categories have an inherent order.
**Nominal Data**: The categories do not have an inherent order.

In Ordinal data, while encoding, one should retain the information regarding the order in which the categories are provided. In the above example, the highest degree a person possesses gives vital information about their qualification. The degree is an important feature to decide whether a person is suitable for a post or not.

While encoding Nominal data, we only have to consider the presence or absence of a category. In such a case, no notion of order is present. For example, take the city a person lives in. It is important to retain where a person lives, but the cities have no order or sequence: living in Delhi is neither more nor less than living in Bangalore.

For encoding categorical data, we have a Python package, category_encoders. The following command installs it.

pip install category_encoders
 


**Label Encoding or Ordinal Encoding:**

We use this categorical data encoding technique when the categorical feature is ordinal. In this case, retaining the order is important. Hence encoding should reflect the sequence.

In Label encoding, each label is converted into an integer value. We will create a variable that contains the categories representing the education qualification of a person.

Python Code:


Fit and transform train data

df_train_transformed = encoder.fit_transform(train_df)
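
The encoder setup for this step is not shown above. As a stand-in, here is a minimal pandas-only sketch of ordinal encoding with an explicit mapping. It does not use category_encoders; the column name ‘Degree’, the sample values, and the integer codes are assumptions for illustration.

```python
import pandas as pd

# Hypothetical training data with one ordinal feature.
train_df = pd.DataFrame({
    "Degree": ["High school", "Masters", "Diploma", "Bachelors",
               "Bachelors", "Masters", "PhD", "High school", "High school"]
})

# Explicit mapping so the integers follow the natural order of the degrees.
degree_order = {
    "High school": 1,
    "Diploma": 2,
    "Bachelors": 3,
    "Masters": 4,
    "PhD": 5,
}

df_train_transformed = train_df.assign(
    Degree=train_df["Degree"].map(degree_order)
)

print(df_train_transformed)
```

Writing the mapping out by hand guarantees that the codes respect the intended order, which an automatic alphabetical or first-seen assignment would not.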

 

**One Hot Encoding:**

We use this categorical data encoding technique when the features are nominal(do not have any order). In one hot encoding, for each level of a categorical feature, we create a new variable. Each category is mapped with a binary variable containing either 0 or 1. Here, 0 represents the absence, and 1 represents the presence of that category.

These newly created binary features are known as Dummy variables. The number of dummy variables depends on the levels present in the categorical variable. This might sound complicated. Let us take an example to understand this better. Suppose we have a dataset with a category animal, having different animals like Dog, Cat, Sheep, Cow, Lion. Now we have to one-hot encode this data.

[Figure: one-hot encoding of the Animal feature]

After encoding, in the second table, we have dummy variables each representing a category in the feature Animal. Now for each category that is present, we have 1 in the column of that category and 0 for the others. Let’s see how to implement a one-hot encoding in python.

 

import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':[
'Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Bangalore','Delhi'
]})

**Create object for one-hot encoding**

encoder=ce.OneHotEncoder(cols='City',handle_unknown='return_nan',return_df=True,use_cat_names=True)

**Original Data**

data
 


**Fit and transform Data**

data_encoded = encoder.fit_transform(data)
data_encoded

Now let’s move to another very interesting and widely used encoding technique i.e Dummy encoding.

 

**Dummy Encoding:**

Dummy coding scheme is similar to one-hot encoding. This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables). In the case of one-hot encoding, for N categories in a variable, it uses N binary variables. The dummy encoding is a small improvement over one-hot-encoding. Dummy encoding uses N-1 features to represent N labels/categories.

To understand this better, let’s see the image below. Here we are coding the same data using both one-hot encoding and dummy encoding techniques. One-hot encoding uses 3 variables to represent the data, whereas dummy encoding uses 2 variables to code 3 categories.

[Figure: the same data coded with one-hot encoding and with dummy encoding]

 

Let us implement it in python.

import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})

**Original Data:**

data

**Encode the data:**
data_encoded=pd.get_dummies(data=data,drop_first=True)
data_encoded


Here, the drop_first argument drops the column for the first category in sorted order, Bangalore, so Bangalore is represented by a row of all 0s.

**Drawbacks of  One-Hot and Dummy Encoding:**

One hot encoder and dummy encoder are two powerful and effective encoding schemes. They are also very popular among the data scientists, But may not be as effective when-

A large number of levels are present in data. If there are multiple categories in a feature variable in such a case we need a similar number of dummy variables to encode the data. For example, a column with 30 different values will require 30 new variables for coding.
If we have multiple categorical features in the dataset similar situation will occur and again we will end to have several binary features each representing the categorical feature and their multiple categories e.g a dataset having 10 or more categorical columns.
In both the above cases, these two encoding schemes introduce sparsity in the dataset i.e several columns having 0s and a few of them having 1s. In other words, it creates multiple dummy features in the dataset without adding much information.

Also, they might lead to a Dummy variable trap. It is a phenomenon where features are highly correlated. That means using the other variables, we can easily predict the value of a variable.

Because of the massive increase in the number of columns, these encodings slow down the learning of the model, make it computationally expensive, and can deteriorate its overall performance. Further, these encodings are often not an optimal choice for tree-based models.

 

**Effect Encoding:**

This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect encoding is very similar to dummy encoding, with one difference: dummy coding uses 0 and 1 to represent the data, whereas effect encoding uses three values, i.e. 1, 0, and -1.

The row containing only 0s in dummy encoding is encoded as -1 in effect encoding. In the dummy encoding example, the city Bangalore at index 4 was encoded as 0 0 0 0, whereas in effect encoding it is represented by -1 -1 -1 -1.

Let us see how we implement it in Python:

import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})
encoder=ce.sum_coding.SumEncoder(cols=['City'],verbose=False)

**Original Data**

data

 

encoder.fit_transform(data)
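To make the -1 row concrete, here is a hand-rolled sketch of effect coding in plain Python. It is not the library's implementation: the helper `sum_encode` is hypothetical, and it assumes that categories are ordered by first appearance and that the last one is the reference category, which matches the Bangalore example above.

```python
def sum_encode(values):
    """Effect (sum) coding: N categories -> N-1 columns with values 1, 0, -1."""
    # Order categories by first appearance; the last one is the reference.
    categories = list(dict.fromkeys(values))
    reference = categories[-1]
    kept = categories[:-1]
    rows = []
    for v in values:
        if v == reference:
            rows.append([-1] * len(kept))  # reference row: all -1
        else:
            rows.append([1 if v == c else 0 for c in kept])
    return kept, rows

cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']
columns, encoded = sum_encode(cities)
for city, row in zip(cities, encoded):
    print(city, row)
```

Every city except Bangalore gets a single 1 in its own column, while Bangalore is the row of four -1 values.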


Effect encoding is an advanced technique. In case you are interested to know more about effect encoding, refer to this interesting paper.

 

**Hash Encoder:**

To understand hash encoding, it is necessary to know about hashing. Hashing is the transformation of an arbitrary-size input into a fixed-size value. We use hashing algorithms to perform hashing operations, i.e. to generate the hash value of an input. Further, hashing is a one-way process; in other words, one cannot generate the original input from the hash representation.

Hashing has several applications, such as data retrieval, checking for data corruption, and cryptography. Multiple hash functions are available, for example Message Digest (MD2, MD4, MD5), Secure Hash Algorithm (SHA-0, SHA-1, SHA-2), and many more.

Just like one-hot encoding, the hash encoder represents categorical features using new dimensions. Here, the user can fix the number of dimensions after transformation using the n_components argument. In other words, a feature with 5 categories can be represented using N new features, and a feature with 100 categories can also be transformed using the same N new features. Doesn’t this sound amazing?

By default, the hashing encoder uses the md5 hashing algorithm, but a user can pass any algorithm of their choice. If you want to explore the md5 algorithm, I suggest this paper.

import category_encoders as ce
import pandas as pd

**Create the dataframe**

data=pd.DataFrame({'Month':['January','April','March','April','February','June','July','June','September']})

**Create object for hash encoder**

encoder=ce.HashingEncoder(cols='Month',n_components=6)

**Fit and Transform Data**

encoder.fit_transform(data)

 

Since hashing transforms the data into fewer dimensions, it may lead to loss of information. Another issue faced by the hashing encoder is collision: because a large number of categories is mapped into a smaller number of dimensions, multiple values can be represented by the same hash value.
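The idea of hashing into a fixed number of columns, and why collisions follow, can be sketched with Python's standard `hashlib`. This is an illustration of the hashing trick only, not the exact algorithm used by the library; the helpers `hash_bucket` and `hash_encode` are hypothetical names.

```python
import hashlib

def hash_bucket(category, n_components):
    """Map a category string to a column index in [0, n_components)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components):
    """One row per value; a 1 marks the bucket the value hashes to."""
    rows = []
    for v in values:
        row = [0] * n_components
        row[hash_bucket(v, n_components)] = 1
        rows.append(row)
    return rows

months = ['January', 'April', 'March', 'February', 'June', 'July', 'September']
n_components = 6
buckets = {m: hash_bucket(m, n_components) for m in months}
print(buckets)

# 7 distinct months but only 6 buckets: at least two months must share a
# bucket (pigeonhole principle), which is exactly a collision.
print(len(set(buckets.values())) < len(months))  # True
```

The number of output columns stays at `n_components` no matter how many categories appear, which is the trade-off: a fixed width in exchange for possible collisions.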

Even so, hashing encoders have been very successful in some Kaggle competitions. They are well worth trying if the dataset has high-cardinality features.

 

**Binary Encoding:**

Binary encoding is a combination of hash encoding and one-hot encoding. In this encoding scheme, the categorical feature is first converted into numerical form using an ordinal encoder. Then the numbers are transformed into binary numbers. After that, the binary value is split into different columns, one per digit.

Binary encoding works really well when there is a high number of categories, for example the cities in a country where a company supplies its products.

**Import the libraries**

import category_encoders as ce
import pandas as pd

**Create the Dataframe**

data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

**Create object for binary encoding**

encoder= ce.BinaryEncoder(cols=['City'],return_df=True)

**Original Data**

data

**Fit and Transform Data**

data_encoded=encoder.fit_transform(data) 
data_encoded


Binary encoding is a memory-efficient encoding scheme, as it uses fewer features than one-hot encoding. Further, it reduces the curse of dimensionality for data with high cardinality.

 

**Base N Encoding:**

Before diving into BaseN encoding, let’s first understand what Base means here.

In a numeral system, the base (or radix) is the number of digits, or combination of digits and letters, used to represent numbers. The most common base in everyday life is 10, the decimal system, which uses 10 unique digits, i.e. 0 to 9, to represent all numbers. Another widely used system is binary, i.e. base 2, which uses 2 digits, 0 and 1, to express all numbers.

For binary encoding, the base is 2, which means it converts the numerical value of a category into its binary form. If you want to change the base of the encoding scheme, you can use the Base N encoder. When there are many categories and binary encoding still produces too many dimensions, we can use a larger base such as 4 or 8.
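The ordinal-then-digits idea can be sketched in plain Python. This is a simplified illustration, not the library's code: the helpers `to_base` and `base_n_encode` are hypothetical, categories are numbered from 1 in order of first appearance (an assumption), and only the minimum number of digit columns is produced.

```python
def to_base(number, base, width):
    """Return the digits of `number` in the given base, left-padded to `width`."""
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(remainder)
    digits = digits[::-1]
    return [0] * (width - len(digits)) + digits

def base_n_encode(values, base):
    # Ordinal step: number categories 1, 2, 3, ... in order of first appearance.
    ordinal = {c: i for i, c in enumerate(dict.fromkeys(values), start=1)}
    # Width: digits needed to write the largest ordinal in this base.
    width, largest = 1, len(ordinal)
    while largest >= base ** width:
        width += 1
    return [to_base(ordinal[v], base, width) for v in values]

cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
          'Delhi', 'Hyderabad', 'Mumbai', 'Agra']
binary = base_n_encode(cities, base=2)   # 6 categories -> 3 digits each
quinary = base_n_encode(cities, base=5)  # 6 categories -> 2 digits each
print(binary[0], quinary[0])    # Delhi (ordinal 1): [0, 0, 1] [0, 1]
print(binary[-1], quinary[-1])  # Agra (ordinal 6):  [1, 1, 0] [1, 1]
```

With `base=2` this is binary encoding; a larger base needs fewer digit columns for the same number of categories. Library implementations may emit more columns than this minimum, so the column counts in their output can differ.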

**Import the libraries**

import category_encoders as ce
import pandas as pd

**Create the dataframe**

data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

**Create an object for Base N Encoding**

encoder= ce.BaseNEncoder(cols=['City'],return_df=True,base=5)

**Original Data**

data

**Fit and Transform Data**

data_encoded=encoder.fit_transform(data)
data_encoded


In the above example, I have used base 5, also known as the quinary system. It is similar to the binary encoding example: while binary encoding represents the same data with 4 new features, BaseN encoding uses only 3 new variables.

Hence the BaseN encoding technique further reduces the number of features required to represent the data efficiently, improving memory usage. The default base for Base N is 2, which is equivalent to binary encoding.

 

**Target Encoding:**

Target encoding is a Bayesian encoding technique.

Bayesian encoders use information from the dependent/target variable to encode the categorical data.

In target encoding, we calculate the mean of the target variable for each category and replace the category with that mean value. In the case of a categorical target variable, the posterior probability of the target replaces each category.
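The per-category mean can be computed by hand to see what the encoding produces. Below is a minimal sketch in plain Python using the class/marks data from this section (with the first label read as 'A'). It shows the plain, unsmoothed mean; library encoders typically add smoothing, so their output may differ from these values.

```python
from collections import defaultdict

classes = ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A']
marks = [50, 30, 70, 80, 45, 97, 80, 68]

# Collect the target values observed for each category.
targets_by_class = defaultdict(list)
for c, m in zip(classes, marks):
    targets_by_class[c].append(m)

# Plain (unsmoothed) target encoding: category -> mean of its targets.
class_mean = {c: sum(v) / len(v) for c, v in targets_by_class.items()}
encoded = [class_mean[c] for c in classes]
print(class_mean)  # {'A': 73.75, 'B': 55.0, 'C': 57.5}
```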

**Import the libraries**

import pandas as pd
import category_encoders as ce

**Create the Dataframe**

data=pd.DataFrame({'class':['A','B','C','B','C','A','A','A'],'Marks':[50,30,70,80,45,97,80,68]})

**Create target encoding object**

encoder=ce.TargetEncoder(cols='class') 

**Original Data**

data

**Fit and Transform Train Data**

encoder.fit_transform(data['class'],data['Marks'])

We fit target encoding on the training data only and encode the test data using the statistics obtained from the training dataset. Although it is a very efficient coding system, it has the following issues that can deteriorate model performance:

It can lead to target leakage or overfitting. To address overfitting, we can use different techniques.
In leave-one-out encoding, the current row's target value is excluded when computing the category's target mean, to avoid leakage.
In another method, we introduce some Gaussian noise into the target statistics. The amount of this noise is a hyperparameter of the model.
The second issue we may face is an improper distribution of categories between the train and test data. In such a case, the categories may assume extreme values. Therefore, the target mean for each category is blended with the marginal (overall) mean of the target.
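Two of these remedies, leave-one-out means and blending with the overall mean, can be sketched by hand. The code below is an illustration under stated assumptions rather than any library's formula: the helpers `leave_one_out` and `smoothed_mean`, the fallback for single-row categories, and the `weight` value are all hypothetical choices.

```python
classes = ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A']
marks = [50, 30, 70, 80, 45, 97, 80, 68]

global_mean = sum(marks) / len(marks)
totals, counts = {}, {}
for c, m in zip(classes, marks):
    totals[c] = totals.get(c, 0) + m
    counts[c] = counts.get(c, 0) + 1

def leave_one_out(c, m):
    """Mean target of category c, excluding the current row's own target m."""
    if counts[c] == 1:
        return global_mean  # nothing left to average; fall back (assumption)
    return (totals[c] - m) / (counts[c] - 1)

def smoothed_mean(c, weight=2.0):
    """Blend the category mean with the global mean; `weight` is hypothetical."""
    n = counts[c]
    return (totals[c] + weight * global_mean) / (n + weight)

loo = [leave_one_out(c, m) for c, m in zip(classes, marks)]
print(loo[0])              # first 'A' row: (97 + 80 + 68) / 3
print(smoothed_mean('B'))  # pulled from 55.0 toward the global mean of 65.0
```

A larger `weight` pulls rare categories more strongly toward the overall mean, which limits the extreme values mentioned above.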


## Handling Outliers in a Dataset:


**Introduction:**

One of the most important steps as part of data preprocessing is detecting and treating the outliers as they can negatively affect the statistical analysis and the training process of a machine learning algorithm resulting in lower accuracy.

**1. What are Outliers?**

We have all heard the idiom ‘odd one out’, which means something unusual in comparison to the others in a group.

Similarly, an Outlier is an observation in a given dataset that lies far from the rest of the observations. That means an outlier is vastly larger or smaller than the remaining values in the set.

**2. Why do they occur?**

An outlier may occur due to the variability in the data, or due to experimental error/human error.

They may indicate an experimental error or heavy skewness in the data (heavy-tailed distribution).

**3. What do they affect?**

In statistics, we have three measures of central tendency, namely the mean, median, and mode. They help us describe the data.

The mean is an accurate measure to describe the data when no outliers are present.

The median is used if there is an outlier in the dataset.

The mode is used if there is an outlier AND about ½ or more of the data is the same.

The mean is the measure of central tendency most affected by outliers, which in turn impacts the standard deviation.

Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other values.

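Computation with and without the outlier (image by author)

With the outlier, the mean of this sample is about 20.08 and the median is 14; without it, the mean is about 12.73 and the median is 13. From these calculations, we can clearly say the mean is more affected than the median.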
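The effect of the outlier on these statistics can be checked directly. Here is a small sketch using Python's standard `statistics` module on the sample above; the variable name `trimmed` is just an illustrative choice.

```python
import statistics

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
trimmed = [v for v in sample if v != 101]  # same data without the outlier

for name, data in (("with outlier", sample), ("without outlier", trimmed)):
    print(name,
          round(statistics.mean(data), 2),    # mean
          statistics.median(data),            # median
          round(statistics.pstdev(data), 2))  # population standard deviation
```

Running it shows the mean and the standard deviation shifting far more than the median when the single value 101 is removed.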

**4. Detecting Outliers:**

If our dataset is small, we can detect the outlier by just looking at the dataset. But what if we have a huge dataset, how do we identify the outliers then? We need to use visualization and mathematical techniques.

Below are some of the techniques for detecting outliers:

1. Boxplots
2. Z-score
3. Interquartile Range (IQR)

**4.1 Detecting outliers using Boxplot:**

Python code for boxplot is:
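The code for this step is not shown here, so the following is only a minimal sketch, assuming matplotlib and the `sample` list defined earlier. It draws a boxplot and reads back the points that fall beyond the whiskers, which by default extend 1.5 times the IQR from the box.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

fig, ax = plt.subplots()
result = ax.boxplot(sample)  # whiskers default to 1.5 * IQR
ax.set_title("Boxplot of the sample")
ax.set_ylabel("Sample")

# Points drawn beyond the whiskers ("fliers") are the candidate outliers.
fliers = result["fliers"][0].get_ydata()
print(list(fliers))
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to see the plot; the isolated point far from the box is the outlier.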


**4.2 Detecting outliers using the Z-scores:**

Criteria:

Any data point whose Z-score lies beyond 3 standard deviations from the mean is an outlier.

Steps:
Loop through all the data points and compute the Z-score using the formula (Xi-mean)/std.
Define a threshold value of 3 and mark the data points whose absolute Z-score is greater than the threshold as outliers.
import numpy as np

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

def detect_outliers_zscore(data):
    outliers = []  # local list, so repeated calls do not accumulate results
    thres = 3
    mean = np.mean(data)
    std = np.std(data)
    # print(mean, std)
    for i in data:
        z_score = (i-mean)/std
        if (np.abs(z_score) > thres):
            outliers.append(i)
    return outliers

# Driver code
sample_outliers = detect_outliers_zscore(sample)
print("Outliers from Z-scores method: ", sample_outliers)

The above code outputs: Outliers from Z-scores method: [101]

**4.3 Detecting outliers using the Interquartile Range (IQR):**

Steps:
Sort the dataset in ascending order.
Calculate the 1st and 3rd quartiles (Q1, Q3).
Compute IQR = Q3 - Q1.
Compute the lower bound = (Q1 - 1.5*IQR) and the upper bound = (Q3 + 1.5*IQR).
Loop through the values of the dataset, check for those that fall below the lower bound or above the upper bound, and mark them as outliers.
Python Code:

def detect_outliers_iqr(data):
    outliers = []  # local list, so repeated calls do not accumulate results
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    # print(q1, q3)
    IQR = q3-q1
    lwr_bound = q1-(1.5*IQR)
    upr_bound = q3+(1.5*IQR)
    # print(lwr_bound, upr_bound)
    for i in data:
        if (i<lwr_bound or i>upr_bound):
            outliers.append(i)
    return outliers

# Driver code
sample_outliers = detect_outliers_iqr(sample)
print("Outliers from IQR method: ", sample_outliers)

The above code outputs: Outliers from IQR method: [101]

**5. Handling Outliers:**

So far, we have learned about detecting outliers. The main question is: how do we deal with them?

Below are some of the methods of treating outliers:

1. Trimming/removing the outlier

2. Quantile-based flooring and capping

3. Mean/median imputation

**5.1 Trimming/Removing the outliers:**

In this technique, we remove the outliers from the dataset. Removing data is not always good practice, though, because it discards observations that may be genuine.

Python code to delete the outlier and copy the rest of the elements to another array:

**Trimming**

sample = np.asarray(sample)
a = sample[~np.isin(sample, sample_outliers)]  # keep only the non-outlier values
print(a)

print(len(sample), len(a))

The outlier ‘101’ is deleted and the rest of the data points are copied to another array ‘a’.

**5.2 Quantile based flooring and capping:**

In this technique, values above the 90th percentile are capped at the 90th percentile value, and values below the 10th percentile are floored at the 10th percentile value.

Python code:

**Computing 10th, 90th percentiles and replacing the outliers**

sample = np.asarray(sample)  # element-wise comparisons need a NumPy array
tenth_percentile = np.percentile(sample, 10)
ninetieth_percentile = np.percentile(sample, 90)

print(tenth_percentile, ninetieth_percentile)

b = np.where(sample<tenth_percentile, tenth_percentile, sample)

b = np.where(b>ninetieth_percentile, ninetieth_percentile, b)
print("Sample:", sample)

print("New array:",b)

The above code outputs: New array: [15, 20.7, 18, 7.2, 13, 16, 11, 20.7, 7.2, 15, 10, 9]

The data points that are lower than the 10th percentile are replaced with the 10th percentile value, and the data points that are greater than the 90th percentile are replaced with the 90th percentile value.
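The same flooring and capping can also be written with `np.clip`, which applies both bounds in one call. This is an equivalent sketch, not a replacement for the approach above.

```python
import numpy as np

sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])

# Floor at the 10th percentile and cap at the 90th in a single call.
low, high = np.percentile(sample, [10, 90])
capped = np.clip(sample, low, high)
print(low, high)
print(capped)
```

Note that capping changes more than the obvious outlier: here 21, 7, and 5 are also moved to the percentile bounds.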

**5.3 Mean/Median imputation:**

As the mean value is highly influenced by the outliers, it is advised to replace the outliers with the median value.

Python Code:

median = np.median(sample)
sample = np.asarray(sample)
# Replace every detected outlier with the median
c = np.where(np.isin(sample, sample_outliers), median, sample)
print("Sample: ", sample)
print("New array: ",c)

print(c.dtype)

**Visualizing the data after treating the outlier**

import matplotlib.pyplot as plt
plt.boxplot(c, vert=False)
plt.title("Boxplot of the sample after treating the outliers")
plt.xlabel("Sample")
plt.show()


## Feature scaling Techniques:


**Introduction:**

In data processing, we try to change the data in such a way that the model can process it without any problems. Feature scaling is one such process: it normalizes the features in the dataset into a finite range.

I will discuss why this is required and which feature scaling techniques are commonly used.

**Feature scaling techniques in python:**

1. Absolute Maximum Scaling

2. Min-Max Scaling

3. Normalization

4. Standardization

5. Robust Scaling


**Why Feature Scaling?**

Real-life datasets have many features with a wide range of values. For example, consider a house price prediction dataset. It will have many features, such as the number of bedrooms, the area of the house in square feet, etc.

As you can guess, the number of bedrooms will vary between 1 and 5, but the area will range from 500 to 2000 square feet. This is a huge difference in the range of the two features.

Many machine learning algorithms that use Euclidean distance as a metric to calculate similarity will fail to give reasonable weight to the feature with the smaller range, in this case the number of bedrooms, which in reality can turn out to be an important predictor.

Eg: KNN and other distance-based methods. Linear Regression and Logistic Regression can also be sensitive to feature scale when they are trained with gradient descent or regularization.

There are several ways to do feature scaling. I will discuss five of the most commonly used techniques.


**Absolute Maximum Scaling:**


Find the absolute maximum value of the feature in the dataset.

Divide all the values in the column by that maximum value.

If we do this for all the numerical columns, all their values will lie between -1 and 1. The main disadvantage is that the technique is sensitive to outliers. Consider the feature *square feet*: if 99% of the houses have an area of less than 1000 square feet and just 1 house has an area of 20,000 square feet, then all the other values will be scaled down to less than 0.05.
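The square-feet example can be reproduced numerically. The following sketch uses made-up areas drawn at random (the seed and the uniform range are assumptions for illustration only).

```python
import numpy as np

# Hypothetical areas: 99 houses between 500 and 1000 sq ft, one of 20,000.
rng = np.random.default_rng(0)
area = np.append(rng.uniform(500, 1000, size=99), 20_000.0)

scaled = area / np.max(np.abs(area))
print(scaled[-1])         # the outlier maps to 1.0
print(scaled[:-1].max())  # every other house ends up below 0.05
```

One extreme value therefore compresses the other 99 houses into a very narrow band near zero.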

I will be working with the sine and cosine functions throughout the article and show you how the scaling techniques affect their magnitude. sin() will be ranging between -1 and +1, and 50*cos() will be ranging between -50 and +50.
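The scaling snippets in this section use the arrays `x`, `y1`, and `y2`. Their exact definition is not given in the text, so the setup below is an assumption: the grid range and the number of points are illustrative choices.

```python
import numpy as np

# Assumed setup: the exact grid is not given in the text, so the range and
# number of points here are illustrative choices.
x = np.linspace(0, 4 * np.pi, 200)
y1 = np.sin(x)        # ranges between -1 and +1
y2 = 50 * np.cos(x)   # ranges between -50 and +50

print(y1.min(), y1.max())
print(y2.min(), y2.max())
```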


This is how they actually look. You will not even be able to see that the red one is a sine graph; it basically looks like a straight squiggly line when compared to the big blue graph.

y1_new = y1/np.max(np.abs(y1))
y2_new = y2/np.max(np.abs(y2))
See from the graph that both datasets now range from -1 to +1 after the scaling.

With even a single big outlier, however, most of the scaled values can become very small, with many data points falling below 0.01.

**Min Max Scaling:**

In min-max scaling, you subtract the minimum value in the dataset from all the values and then divide the result by the range of the dataset (maximum - minimum). In this case, your dataset will always lie between 0 and 1, whereas in the previous case it was between -1 and +1. Again, this technique is also sensitive to outliers.

y1_new = (y1-min(y1))/(max(y1)-min(y1))
y2_new = (y2-min(y2))/(max(y2)-min(y2))
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
 

**Normalization:**

Instead of subtracting the min() value as in the previous case, here we subtract the mean() value.

Like min-max scaling, this is a linear rescaling: the values end up centred on zero and span a range of width 1, but the shape of the distribution of your data does not change.

y1_new = (y1-np.mean(y1))/(max(y1)-min(y1))
y2_new = (y2-np.mean(y2))/(max(y2)-min(y2))
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
 

**Standardization:**

In standardization, we calculate the z-value for each of the data points and replace the data points with these values.

This makes sure that every feature is centred around zero (its mean after scaling) with a standard deviation of 1. It works best if your feature is approximately normally distributed.

y1_new = (y1-np.mean(y1))/np.std(y1)
y2_new = (y2-np.mean(y2))/np.std(y2)
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')

 

**Robust Scaling:**

In this method, you subtract the median value from all the data points and then divide the result by the Interquartile Range (IQR).

The IQR is the distance between the 25th percentile point and the 75th percentile point.

This method centres the median value at zero and is robust to outliers.

from scipy import stats 
IQR1 = stats.iqr(y1, interpolation = 'midpoint') 
y1_new = (y1-np.median(y1))/IQR1
IQR2 = stats.iqr(y2, interpolation = 'midpoint') 
y2_new = (y2-np.median(y2))/IQR2
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
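To compare the five techniques side by side, here is a compact NumPy sketch that applies each formula from this section to one small array containing an outlier. The function names and the toy values are illustrative choices, not part of any library.

```python
import numpy as np

def abs_max_scale(v):
    return v / np.max(np.abs(v))

def min_max_scale(v):
    return (v - v.min()) / (v.max() - v.min())

def mean_normalize(v):
    return (v - v.mean()) / (v.max() - v.min())

def standardize(v):
    return (v - v.mean()) / v.std()

def robust_scale(v):
    q1, q3 = np.percentile(v, [25, 75])  # IQR = Q3 - Q1
    return (v - np.median(v)) / (q3 - q1)

# Hypothetical feature with one large outlier (the last value).
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
for scale in (abs_max_scale, min_max_scale, mean_normalize, standardize, robust_scale):
    print(f"{scale.__name__:>15}: {np.round(scale(v), 3)}")
```

With min-max scaling the five ordinary values are squeezed into roughly 0 to 0.04, while robust scaling keeps them spread between -1 and 0.6 and pushes only the outlier far away.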
 

**Is Feature Scaling actually helpful?**

Let’s look at an example of a College Admission dataset, in which your goal is to predict the chance of admission for each student based on the other features given.

You can download the dataset from the link below.

https://www.kaggle.com/mohansacharya/graduate-admissions

import pandas as pd
df = pd.read_csv("Admission_Predict.csv")
df.head()
The dataset has a wide variety of features with different ranges. The first column, Serial No., is not important, so I am going to delete it. Then I split the dataset into training and test sets.

df.drop("Serial No.",axis=1,inplace=True)
y = df['Chance of Admit ']
df.drop("Chance of Admit ",axis=1,inplace=True)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(df,y,test_size=0.2)
I am going to build a linear regression model, first without normalization and then with normalization. Let’s check whether there is any improvement in the accuracy.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
from sklearn import metrics
rmse = np.sqrt(metrics.mean_squared_error(y_test,pred))
rmse
0.06845052747026953
Without normalization, the root mean squared error comes out to be 0.0684. The value is small in absolute terms because the target `y` is a chance of admission between 0 and 1.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(df)
df = sc.transform(df)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(df,y,test_size=0.2)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
from sklearn import metrics
rmse = np.sqrt(metrics.mean_squared_error(y_test,pred))
rmse
0.05674870151306346
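Here the error is lower after standardization. Note, however, that the two runs use different random train/test splits (no `random_state` is fixed), and the predictions of ordinary least squares linear regression do not depend on feature scale, so much of this difference may come from the split rather than from the scaling. Scaling matters more for distance-based, regularized, or gradient-descent-trained models.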
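When repeating this kind of comparison, it helps to reuse one fixed split and to compute the scaling statistics from the training rows only. The sketch below is an illustration on synthetic data (not the admissions dataset), with NumPy least squares standing in for `LinearRegression`; the helper `ols_rmse` and all data-generating values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in data (not the admissions dataset): two features on very
# different scales and a noisy linear target.
X = np.column_stack([rng.uniform(1, 5, 200), rng.uniform(500, 2000, 200)])
y = 0.1 * X[:, 0] + 0.0002 * X[:, 1] + rng.normal(0, 0.05, 200)

# One fixed split, reused for both runs.
idx = rng.permutation(len(X))
train, test = idx[:160], idx[160:]

def ols_rmse(X_train, X_test):
    """Fit least squares with an intercept and return the test RMSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    return np.sqrt(np.mean((A_test @ coef - y[test]) ** 2))

# Scaling statistics come from the training rows only, to avoid leakage.
mean, std = X[train].mean(axis=0), X[train].std(axis=0)
rmse_raw = ols_rmse(X[train], X[test])
rmse_scaled = ols_rmse((X[train] - mean) / std, (X[test] - mean) / std)
print(rmse_raw, rmse_scaled)  # the two values agree up to rounding error
```

For plain least squares the predictions do not depend on feature scale, so in this sketch the two errors match; a scale-sensitive model such as KNN would behave differently.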



## Feature Selection And Extraction Techniques:


### **Feature Selection:**

Feature selection involves selecting a subset of features from the original dataset that are most relevant to the prediction task. The goal of feature selection is to improve the performance of the machine learning model by reducing the dimensionality of the dataset, removing irrelevant or redundant features, and improving interpretability. Feature selection can be performed using various statistical and machine learning techniques, such as mutual information, correlation, or feature importance scores from a machine learning model.

**There are several techniques used for feature selection:**

**Filter methods:**

 Filter methods evaluate the relevance of features based on statistical measures, such as correlation or mutual information. They select features independently of the machine learning model and are often computationally efficient.

**Wrapper methods:**

 Wrapper methods evaluate the performance of the machine learning model with different subsets of features. They are computationally expensive but can provide better results than filter methods.

**Embedded methods:**

 Embedded methods perform feature selection as part of the machine learning model training process. These methods are often used with models that have built-in feature selection, such as Lasso regression (Ridge regression, by contrast, shrinks coefficients but does not set them exactly to zero).

**Some common techniques used in feature selection include:**

Correlation-based feature selection:

 Correlation-based feature selection measures the correlation between each feature and the target variable. Features with a high correlation are selected.

Mutual information-based feature selection: 

Mutual information-based feature selection measures the dependence between each feature and the target variable. Features with a high mutual information score are selected.

Recursive feature elimination: 

Recursive feature elimination is a wrapper method that recursively removes features and evaluates the performance of the machine learning model on the reduced feature set.

Lasso regression: 

Lasso regression is an embedded method that performs feature selection by imposing a penalty on the absolute value of the coefficients. The coefficients of less useful features are shrunk to exactly zero, which removes those features from the model.
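As an illustration of the simplest of these techniques, here is a sketch of correlation-based filtering with NumPy. The data is synthetic and the `threshold` value is a hypothetical cut-off chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic data: features 0 and 2 drive the target, feature 1 is pure noise.
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Absolute Pearson correlation between each feature and the target.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

threshold = 0.3  # hypothetical cut-off; tune it for real data
selected = np.flatnonzero(scores > threshold)
print(np.round(scores, 2), selected)
```

Because each feature is scored independently of any model, this is a filter method; it is cheap, but it can miss features that are only useful in combination.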

### **Feature Extraction:**


**What is Feature Extraction?**

Feature extraction is a part of the dimensionality reduction process, in which an initial set of raw data is divided and reduced to more manageable groups, so that it is easier to process. The most important characteristic of these large datasets is that they have a large number of variables, which require a lot of computing resources to process. Feature extraction helps to get the best features from those big datasets by selecting and combining variables into features, thus effectively reducing the amount of data. These features are easy to process, but are still able to describe the actual dataset accurately.


**Why Feature Extraction is Useful?**

The technique of extracting the features is useful when you have a large data set and need to reduce the number of resources without losing any important or relevant information. Feature extraction helps to reduce the amount of redundant data from the data set.

In the end, the reduction of the data helps to build the model with less machine effort and also increases the speed of learning and generalization steps in the machine learning process.


**There are several techniques used for feature extraction:**

**Principal Component Analysis (PCA):**

 PCA is a linear transformation technique that identifies the directions in which the data varies the most and projects the data onto those directions. The new features are orthogonal and uncorrelated, and the first few principal components capture the most important information in the original dataset.

**Linear Discriminant Analysis (LDA):**

 LDA is a supervised technique that maximizes the separation between the classes in the dataset. LDA identifies a set of features that maximize the ratio of between-class variance to within-class variance.

**Non-negative matrix factorization (NMF):**

 NMF is a technique that factorizes a non-negative matrix into two non-negative matrices. NMF is often used for image and text data.

**Autoencoder neural networks:**

 Autoencoder neural networks are neural networks that are trained to reconstruct the input data. The encoder part of the network learns a compressed representation of the input data, and the decoder part of the network reconstructs the original data.
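To make the first of these techniques concrete, here is a minimal sketch of PCA built on NumPy's singular value decomposition. The helper `pca` and the synthetic data are illustrative assumptions; in practice a library implementation would normally be used.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

rng = np.random.default_rng(1)
# Synthetic data: 3 features, where the third is nearly a copy of the first.
base = rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.05 * rng.normal(size=300)])

Z, components, variance = pca(X, n_components=2)
print(Z.shape)                # (300, 2)
print(np.round(variance, 3))  # variances in decreasing order
```

The two new features are orthogonal directions and are uncorrelated with each other, and because the third original feature was almost redundant, little information is lost by keeping only two components.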


  











