# Imputer Function
An imputer handles missing data in a dataset.
scikit-learn's SimpleImputer replaces missing entries (NaN, which is how pandas reads blank cells) with a statistic computed from each column, such as the mean or the median.
Before it can replace anything, it first has to calculate that replacement value from the data.
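
As a minimal sketch of the two strategies mentioned above, the snippet below fills one missing value with the column mean and with the column median. The `ages` array is a made-up toy column, not data from Data.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing entry (not taken from Data.csv).
ages = np.array([[20.0], [30.0], [np.nan], [100.0]])

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# Each imputer computes its statistic from the non-missing values,
# then writes it into the missing slot.
mean_filled = mean_imputer.fit_transform(ages)
median_filled = median_imputer.fit_transform(ages)

# statistics_ holds the learned fill value for each column.
print(mean_imputer.statistics_)
print(median_imputer.statistics_)
```

The outlier (100) pulls the mean well above the median, which is one reason to prefer the median for skewed columns.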

### What do fit() and fit_transform() mean?

The fit() function calculates the replacement values (for example, the mean of each column) and stores them in the imputer. The transform() function then applies the stored values to the actual data and returns a copy with the missing entries filled in. The fit_transform() function performs both in a single step. The result is the same whether we do it in two steps or in one, as long as both are run on the same data.
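
The equivalence described above can be checked with a small sketch. The array `X` is a hypothetical example with one NaN per column; it is not part of the notebook's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, one NaN in each.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Two steps: fit() learns the column means, transform() applies them.
two_step = SimpleImputer(missing_values=np.nan, strategy="mean")
two_step.fit(X)
X_two_step = two_step.transform(X)

# One step: fit_transform() learns and applies in a single call.
one_step = SimpleImputer(missing_values=np.nan, strategy="mean")
X_one_step = one_step.fit_transform(X)

print(two_step.statistics_)                    # column means learned by fit()
print(np.array_equal(X_two_step, X_one_step))  # both routes give the same array
```

Keeping fit() and transform() separate matters when the statistics should come from one dataset (for example, a training split) and be applied to another.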

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

In [2]:
df = pd.read_csv('Data.csv')

In [3]:
df

Unnamed: 0,Country,Age,Salary,Gender,Employement Status,Occupation,Purchased
0,France,44,72000,M,Y,Salaried,No
1,Spain,27,48000,F,N,Salaried,Yes
2,Germany,30,54000,F,N,Business,No
3,Spain,38,61000,F,Y,Salaried,No
4,Germany,40,63000,M,Y,Salaried,Yes
5,France,35,58000,M,Y,Business,Yes
6,Spain,34,52000,M,N,Business,No
7,France,48,79000,M,N,Salaried,Yes
8,Germany,50,83000,F,Y,Business,No
9,France,37,67000,F,N,Business,Yes


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Country             10 non-null     object
 1   Age                 10 non-null     int64 
 2   Salary              10 non-null     int64 
 3   Gender              10 non-null     object
 4   Employement Status  10 non-null     object
 5   Occupation          10 non-null     object
 6   Purchased           10 non-null     object
dtypes: int64(2), object(5)
memory usage: 688.0+ bytes


In [5]:
# Country, Age, Salary, Gender, Employement Status and Occupation are the independent variables
features = df.iloc[:,:-1].values

In [6]:
features

array([['France', 44, 72000, 'M', 'Y', 'Salaried'],
       ['Spain', 27, 48000, 'F', 'N', 'Salaried'],
       ['Germany', 30, 54000, 'F', 'N', 'Business'],
       ['Spain', 38, 61000, 'F', 'Y', 'Salaried'],
       ['Germany', 40, 63000, 'M', 'Y', 'Salaried'],
       ['France', 35, 58000, 'M', 'Y', 'Business'],
       ['Spain', 34, 52000, 'M', 'N', 'Business'],
       ['France', 48, 79000, 'M', 'N', 'Salaried'],
       ['Germany', 50, 83000, 'F', 'Y', 'Business'],
       ['France', 37, 67000, 'F', 'N', 'Business']], dtype=object)

In [7]:
# Purchased is a dependent variable
labels = df.iloc[:,-1].values

In [8]:
labels

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [9]:
labels.shape

(10,)

In [10]:
imputer = SimpleImputer(missing_values=np.nan,strategy='mean')

In [11]:
# fit only: learn the mean of Age (column 1) and Salary (column 2)
imputer.fit(features[:,[1,2]])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [12]:
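# Fill the missing values in Age and Salary, i.e. the numeric columns.
# fit_transform() refits and transforms in one step, so the separate fit() call above is not strictly needed.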
features[:,[1,2]] = imputer.fit_transform(features[:,[1,2]])

In [13]:
features

array([['France', 44.0, 72000.0, 'M', 'Y', 'Salaried'],
       ['Spain', 27.0, 48000.0, 'F', 'N', 'Salaried'],
       ['Germany', 30.0, 54000.0, 'F', 'N', 'Business'],
       ['Spain', 38.0, 61000.0, 'F', 'Y', 'Salaried'],
       ['Germany', 40.0, 63000.0, 'M', 'Y', 'Salaried'],
       ['France', 35.0, 58000.0, 'M', 'Y', 'Business'],
       ['Spain', 34.0, 52000.0, 'M', 'N', 'Business'],
       ['France', 48.0, 79000.0, 'M', 'N', 'Salaried'],
       ['Germany', 50.0, 83000.0, 'F', 'Y', 'Business'],
       ['France', 37.0, 67000.0, 'F', 'N', 'Business']], dtype=object)

In [14]:
df1 = pd.DataFrame(features)

In [15]:
df1

Unnamed: 0,0,1,2,3,4,5
0,France,44,72000,M,Y,Salaried
1,Spain,27,48000,F,N,Salaried
2,Germany,30,54000,F,N,Business
3,Spain,38,61000,F,Y,Salaried
4,Germany,40,63000,M,Y,Salaried
5,France,35,58000,M,Y,Business
6,Spain,34,52000,M,N,Business
7,France,48,79000,M,N,Salaried
8,Germany,50,83000,F,Y,Business
9,France,37,67000,F,N,Business
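
Rebuilding a DataFrame from the raw array loses the column names, as the 0 to 5 headers above show. One possible alternative, sketched here under the assumption that the numeric columns are named Age and Salary as in Data.csv, is to impute directly on the DataFrame columns. The `toy` frame is a hypothetical stand-in with missing values added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for Data.csv with one missing Age and one missing Salary.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
})

numeric_cols = ["Age", "Salary"]
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Assigning the result back to the same columns keeps names and dtypes intact.
toy[numeric_cols] = imputer.fit_transform(toy[numeric_cols])

print(toy)
```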


In [16]:
# String columns: fill missing values with the most frequent value (mode) of each column
cols = ['Employement Status','Occupation']

In [17]:
df[cols] = df[cols].fillna(df[cols].mode().iloc[0])

In [18]:
df

Unnamed: 0,Country,Age,Salary,Gender,Employement Status,Occupation,Purchased
0,France,44,72000,M,Y,Salaried,No
1,Spain,27,48000,F,N,Salaried,Yes
2,Germany,30,54000,F,N,Business,No
3,Spain,38,61000,F,Y,Salaried,No
4,Germany,40,63000,M,Y,Salaried,Yes
5,France,35,58000,M,Y,Business,Yes
6,Spain,34,52000,M,N,Business,No
7,France,48,79000,M,N,Salaried,Yes
8,Germany,50,83000,F,Y,Business,No
9,France,37,67000,F,N,Business,Yes


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Country             10 non-null     object
 1   Age                 10 non-null     int64 
 2   Salary              10 non-null     int64 
 3   Gender              10 non-null     object
 4   Employement Status  10 non-null     object
 5   Occupation          10 non-null     object
 6   Purchased           10 non-null     object
dtypes: int64(2), object(5)
memory usage: 688.0+ bytes
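
The mode-based fill used for the string columns can also be written with SimpleImputer and strategy='most_frequent'. The sketch below compares the two on a hypothetical `toy` frame that reuses the notebook's column names but has missing values added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; dtype=object keeps np.nan as the missing marker.
toy = pd.DataFrame({
    "Employement Status": ["Y", "N", np.nan, "Y"],
    "Occupation": ["Salaried", np.nan, "Business", "Business"],
}, dtype=object)
cols = ["Employement Status", "Occupation"]

# pandas route: fill each column with its own mode.
filled_pandas = toy[cols].fillna(toy[cols].mode().iloc[0])

# scikit-learn route: 'most_frequent' also works on string columns.
cat_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled_sklearn = pd.DataFrame(cat_imputer.fit_transform(toy[cols]), columns=cols)

print(filled_pandas)
print(filled_sklearn)
```

The scikit-learn version stores the learned modes in the imputer, so the same fill values can later be applied to new data with transform().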
