# Q25 : 29/Apr/2022
John has been given the insurance dataset shared below (a GitHub link reference is also provided). His manager has asked him to encode the categorical features so that the data can be prepared for modeling. How would you suggest John perform categorical feature encoding in Python?

a) Perform the encoding using Python in an ipynb file.
b) Explain which encoding method is used and why.

In [1]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [5]:
data_path = "./"
df = pd.read_csv(data_path + "insurance.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             1338 non-null   int64  
 1   gender          1338 non-null   object 
 2   bmi             1338 non-null   float64
 3   children        1338 non-null   int64  
 4   smoker          1338 non-null   object 
 5   geography       1338 non-null   object 
 6   charges         1338 non-null   float64
 7   EducationLevel  1338 non-null   object 
dtypes: float64(2), int64(2), object(4)
memory usage: 83.8+ KB


Unnamed: 0,age,gender,bmi,children,smoker,geography,charges,EducationLevel
0,54,female,47.41,0,yes,East,63770.42801,PhD
1,45,male,30.36,0,yes,East,62592.87309,PhD
2,52,male,34.485,3,yes,West,60021.39897,Master
3,31,female,38.095,1,yes,North,58571.07448,PhD
4,33,female,35.53,0,yes,West,55135.40209,PhD


## One hot encoding
The nominal categorical variables ('gender', 'smoker', 'geography') have no natural ordering, so one hot encoding is used for them.

In [6]:
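# One hot encoding: create one 0/1 indicator column per category.
one_hot_feature_names = ['gender', 'smoker', 'geography']
# dtype=int keeps the indicators as 0/1 (pandas >= 2.0 defaults to bool).
df1 = pd.get_dummies(df, columns=one_hot_feature_names, dtype=int)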

In [7]:
df1.head()

Unnamed: 0,age,bmi,children,charges,EducationLevel,gender_female,gender_male,smoker_no,smoker_yes,geography_East,geography_North,geography_South,geography_West
0,54,47.41,0,63770.42801,PhD,1,0,0,1,1,0,0,0
1,45,30.36,0,62592.87309,PhD,0,1,0,1,1,0,0,0
2,52,34.485,3,60021.39897,Master,0,1,0,1,0,0,0,1
3,31,38.095,1,58571.07448,PhD,1,0,0,1,0,1,0,0
4,33,35.53,0,55135.40209,PhD,1,0,0,1,0,0,0,1
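
The output above keeps one indicator column per category, so some columns are redundant (`gender_male` is always `1 - gender_female`). For linear models it can help to drop one level per feature. The following is a minimal sketch of that option, not part of the original solution; the `toy` frame is a hypothetical three-row stand-in for the insurance data, and `drop_first=True` is the only change to the call.

```python
import pandas as pd

# Hypothetical three-row stand-in for the insurance data (illustration only).
toy = pd.DataFrame({
    "gender": ["female", "male", "male"],
    "smoker": ["yes", "yes", "no"],
    "geography": ["East", "West", "North"],
})
nominal_columns = ["gender", "smoker", "geography"]

# Full one hot encoding: one indicator column per observed category.
full = pd.get_dummies(toy, columns=nominal_columns, dtype=int)

# Reduced encoding: the first (alphabetically smallest) level of each
# feature is dropped and becomes the implicit baseline.
reduced = pd.get_dummies(toy, columns=nominal_columns, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Tree-based models are generally unaffected by the redundant columns, so whether to drop a level depends on the model chosen later.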


## Encoding education as ordered levels

In [10]:
df1['EducationLevel'].value_counts()

Bachelor      662
HighSchool    380
Master        222
PhD            74
Name: EducationLevel, dtype: int64

In [13]:
# Map each education level to its rank, from lowest to highest.
dict_label = {'HighSchool': 0,
              'Bachelor'  : 1,
              'Master'    : 2,
              'PhD'       : 3
             }
df1['EducationLevel_encoded'] = df1['EducationLevel'].map(dict_label)
df1.head()

Unnamed: 0,age,bmi,children,charges,EducationLevel,gender_female,gender_male,smoker_no,smoker_yes,geography_East,geography_North,geography_South,geography_West,EducationLevel_encoded
0,54,47.41,0,63770.42801,PhD,1,0,0,1,1,0,0,0,3
1,45,30.36,0,62592.87309,PhD,0,1,0,1,1,0,0,0,3
2,52,34.485,3,60021.39897,Master,0,1,0,1,0,0,0,1,2
3,31,38.095,1,58571.07448,PhD,1,0,0,1,0,1,0,0,3
4,33,35.53,0,55135.40209,PhD,1,0,0,1,0,0,0,1,3
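
One caveat with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes `NaN`. The sketch below shows a quick check for that; it is an illustration only, and the `"Diploma"` label and the `toy` series are hypothetical values that do not appear in the dataset.

```python
import pandas as pd

# Order of the levels, lowest to highest, as used in the mapping above.
education_order = ["HighSchool", "Bachelor", "Master", "PhD"]
mapping = {level: rank for rank, level in enumerate(education_order)}

# Hypothetical series; "Diploma" is an invented label used to show the failure mode.
toy = pd.Series(["PhD", "Master", "Bachelor", "HighSchool", "Diploma"],
                name="EducationLevel")

encoded = toy.map(mapping)

# Labels that were not in the mapping come back as NaN; list them explicitly.
unmapped = toy[encoded.isna()].unique()
print(encoded.tolist())
print(list(unmapped))
```

For the insurance data the `value_counts()` output shows only the four mapped levels, so no `NaN` values are expected there.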


## Conclusions:
1. One hot encoding: used for the three variables 'gender', 'smoker', and 'geography', since a numerical order does not make sense for them.
2. Level encoding: used for 'EducationLevel', since education levels have a natural order (HighSchool < Bachelor < Master < PhD).
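
The two conclusions can also be expressed as a single reusable preprocessing step with scikit-learn. This is a sketch of one alternative, not the method used in the notebook above: the `toy` frame is hypothetical, the transformer names `"nominal"` and `"ordinal"` are arbitrary labels, and `sparse_threshold=0` is set only to force a dense array for easy inspection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical three-row stand-in for the insurance data (illustration only).
toy = pd.DataFrame({
    "age": [54, 45, 52],
    "gender": ["female", "male", "male"],
    "smoker": ["yes", "yes", "no"],
    "geography": ["East", "East", "West"],
    "EducationLevel": ["PhD", "Bachelor", "Master"],
})

# Explicit order so that HighSchool=0, Bachelor=1, Master=2, PhD=3.
education_order = ["HighSchool", "Bachelor", "Master", "PhD"]

preprocessor = ColumnTransformer(
    transformers=[
        # Nominal features: one indicator column per category.
        ("nominal", OneHotEncoder(handle_unknown="ignore"),
         ["gender", "smoker", "geography"]),
        # Ordinal feature: a single integer-valued column following education_order.
        ("ordinal", OrdinalEncoder(categories=[education_order]),
         ["EducationLevel"]),
    ],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)

features = preprocessor.fit_transform(toy)
print(features.shape)
print(features)
```

A fitted transformer like this can be applied to new rows with `preprocessor.transform(...)`, which keeps the column layout identical between training and prediction data.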