# Imputing missing values

Imagine our dataset saying to us, 'No one is perfect? Well then, I am no one!' In practice, a dataset is never clean when we first get it. Not every column has every entry present; some rows contain null or missing values. Such a dataset is called a raw dataset. Most machine learning algorithms cannot be applied to it directly, so we need to clean it and make it suitable for our analysis. Filling in the missing entries with substitute values is called **imputing missing values**.

There are various ways of imputing missing values. Let's discuss them one by one.

### Imperfect dataset 

Let's import a dataset that has missing values in a few of its columns: the 'Big Mart Dataset'. It contains 2013 sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store.

In [29]:
#importing necessary libraries
import pandas as pd

#Loading the dataset into pandas dataframe
df = pd.read_csv('BigmartData.csv')

#check the head and info of the dataframe
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              8523 non-null object
Item_Weight                  7060 non-null float64
Item_Fat_Content             8523 non-null object
Item_Visibility              8523 non-null float64
Item_Type                    8523 non-null object
Item_MRP                     8523 non-null float64
Outlet_Identifier            8523 non-null object
Outlet_Establishment_Year    8523 non-null int64
Outlet_Size                  6113 non-null object
Outlet_Location_Type         8523 non-null object
Outlet_Type                  8523 non-null object
Item_Outlet_Sales            8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.1+ KB


There are 8523 entries in total, but some columns have fewer non-null values than that. Most machine learning models, and many visualization functions, cannot handle such gaps directly, so we need to impute the missing values.

In [31]:
#count the null or missing values in each column
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

Two columns need imputing: 'Item_Weight' and 'Outlet_Size'. The former is of type float and the latter is of type object (i.e. categorical).
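Raw counts are easier to judge as fractions of the whole column. The sketch below shows one way to get them; the `toy` frame and its values are made up for illustration and only borrow the column names from the Big Mart data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Big Mart data (values are invented for illustration)
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan],
    "Outlet_Size": ["Medium", "Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values
missing_fraction = toy.isnull().mean()

# Keep only the columns that have at least one missing value
columns_to_impute = missing_fraction[missing_fraction > 0].index.tolist()

print(missing_fraction)
print(columns_to_impute)
```

The same two lines should work on `df` unchanged, since they do not depend on the column names.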

### Replacing with Mean

One of the most common methods is to compute the mean of the column that has missing values and replace the missing entries with it. This method applies only to numerical columns. The mean is simply the average value of the column. In our case Item_Weight is a numerical column, so we will impute its missing values with its mean.

In [32]:
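#assign the result back; fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())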

Here we call the fillna() method on the column and pass, as its first argument, the value that will replace the missing entries. The mean is sensitive to outliers, so if the column contains outliers, median() is usually a better replacement value than mean().
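To see why outliers matter, here is a minimal sketch on a made-up series of weights (the numbers are invented for illustration and are not from the Big Mart data). A single extreme value pulls the mean far from the typical entries, while the median stays put.

```python
import numpy as np
import pandas as pd

# Four typical weights, one extreme outlier and two missing entries (invented values)
weights = pd.Series([9.0, 10.0, 11.0, 12.0, 500.0, np.nan, np.nan])

# Both mean() and median() skip NaN values by default
mean_value = weights.mean()      # dragged upward by the outlier
median_value = weights.median()  # middle of the sorted non-missing values

filled_with_mean = weights.fillna(mean_value)
filled_with_median = weights.fillna(median_value)

print(mean_value, median_value)
```

With the mean, the two missing entries would be filled with a value larger than every typical weight in the series; with the median, they are filled with a value that actually resembles the bulk of the data.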

### Replacing with Mode

If the missing values are in a categorical column, we use mode() to replace them. The mode is the most frequent value in the column. Here Outlet_Size is of type object. Let's check the counts of its unique values first.

In [33]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

Here the mode is 'Medium', as it occurs the highest number of times. Let's replace the missing values in this column with the mode.

In [34]:
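#assign the result back; fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])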

Here we take the first value by indexing with [0] after calling mode(). This is because mode() returns a Series, since a column can have more than one mode, and we need to select a single value from it.
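The tie case is easy to reproduce on a small made-up series (the values below are invented for illustration). When two categories are equally frequent, mode() returns both, and [0] picks the first one in sorted order.

```python
import pandas as pd

# 'Small' and 'Medium' each appear twice, so the series has two modes (invented values)
sizes = pd.Series(["Small", "Medium", "Medium", "Small", "High", None])

# mode() ignores missing values by default and returns every tied value
modes = sizes.mode()

# Taking [0] selects one of the tied values to use as the replacement
filled = sizes.fillna(modes[0])

print(modes.tolist())
print(filled.tolist())
```

Because the choice between tied modes is arbitrary from the data's point of view, it is worth looking at value_counts() before relying on [0].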

Now let's check our dataset again for missing or null values.

In [35]:
df.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Our dataset is now ready for visual analysis as well as for training a machine learning algorithm, with no column containing a missing entry.
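The two steps used in this section can be combined into one small helper. This is only a sketch: `impute_simple` is a hypothetical name, not a pandas function, and it assumes every column with missing values has at least one non-missing entry. The toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def impute_simple(frame):
    """Return a copy with numeric gaps filled by the column mean
    and all other gaps filled by the column mode."""
    result = frame.copy()
    for column in result.columns:
        # Skip columns that are already complete
        if not result[column].isnull().any():
            continue
        if is_numeric_dtype(result[column]):
            fill_value = result[column].mean()
        else:
            # Assumes at least one non-missing value, otherwise mode() is empty
            fill_value = result[column].mode()[0]
        result[column] = result[column].fillna(fill_value)
    return result

# Toy frame with one numeric and one categorical gap (invented values)
toy = pd.DataFrame({
    "Item_Weight": [9.0, np.nan, 15.0],
    "Outlet_Size": ["Medium", None, "Medium"],
})

clean = impute_simple(toy)
print(clean)
```

The helper returns a new frame rather than modifying its argument, so the raw data stays available for comparison.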