### One Hot Encoding

sklearn also provides a class to perform one-hot encoding of a categorical variable. Let us use 'OneHotEncoder' from sklearn to encode the variable 'sex'.

In [1]:
import pandas as pd
df_suicide = pd.read_csv('suicide_data_1.csv')

# import the OneHotEncoder
from sklearn.preprocessing import OneHotEncoder as ohe

# instantiate the encoder
enc = ohe(handle_unknown='ignore')

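# fit the encoder on 'sex' and transform it
# the encoder expects 2D input (hence the reshape) and returns a sparse matrix,
# which .toarray() converts to a dense array
encoded_var = enc.fit_transform(df_suicide['sex'].values.reshape(-1, 1)).toarray()

# 'encoded_var' is an array with one column per category
encoded_var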

# create a dataframe of encoded columns
# the columns follow the order of enc.categories_, so sex_0 is 'female' and sex_1 is 'male'
df_encoded = pd.DataFrame(encoded_var, columns=[f"sex_{i}" for i in range(encoded_var.shape[1])])
df_encoded.head()

Unnamed: 0,sex_0,sex_1
0,0.0,1.0
1,0.0,1.0
2,1.0,0.0
3,0.0,1.0
4,0.0,1.0
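
To see which category each encoded column stands for, and what 'handle_unknown' does, the fitted encoder can be inspected. The following is a minimal sketch, not the notebook's own cell: it assumes a small hypothetical dataframe `toy` in place of 'suicide_data_1.csv', and the value 'other' is invented purely to show how an unseen category is treated.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data standing in for the 'sex' column
toy = pd.DataFrame({'sex': ['male', 'male', 'female', 'male']})

enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(toy[['sex']]).toarray()

# categories_ lists the categories in the same order as the encoded columns
categories = list(enc.categories_[0])
print(categories)

# with handle_unknown='ignore', a category not seen during fit
# is encoded as a row of zeros instead of raising an error
unknown = enc.transform(pd.DataFrame({'sex': ['other']})).toarray()
print(unknown)
```

Selecting the column with double brackets (`toy[['sex']]`) passes a 2D dataframe to the encoder, which avoids the explicit reshape.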


In [2]:
# append the encoded columns to the original dataframe
df_suicide_copy = pd.concat([df_suicide, df_encoded], axis=1)
df_suicide_copy.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation,sex_0,sex_1
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X,0.0,1.0
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent,0.0,1.0
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X,1.0,0.0
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation,0.0,1.0
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers,0.0,1.0


In [3]:
# drop the original 'sex' column now that it is encoded (drop returns a new dataframe)
df_suicide_copy = df_suicide_copy.drop(columns=['sex'])
df_suicide_copy.head()

Unnamed: 0,country,year,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation,sex_0,sex_1
0,Albania,1987,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X,0.0,1.0
1,Albania,1987,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent,0.0,1.0
2,Albania,1987,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X,1.0,0.0
3,Albania,1987,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation,0.0,1.0
4,Albania,1987,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers,0.0,1.0


### Label Encoding

This technique assigns each category of a variable an integer label between 0 and (n-1), where 'n' is the number of distinct categories in the variable. Every occurrence of a category receives the same label.

Use 'LabelEncoder' from sklearn to encode the variable 'generation'.

In [4]:
# check the categories in 'generation'
df_suicide['generation'].astype('category').dtypes

CategoricalDtype(categories=['Boomers', 'G.I. Generation', 'Generation X', 'Generation Z',
                  'Millenials', 'Silent'],
                 ordered=False)

In [5]:
# import the LabelEncoder
from sklearn.preprocessing import LabelEncoder as LE

# instantiate the encoder
le = LE()

# fit the encoder on 'generation' and replace the column with the integer labels
df_suicide['generation'] = le.fit_transform(df_suicide['generation'])

# display first 5 observations
df_suicide['generation'].head(5)

0    2
1    5
2    2
3    1
4    0
Name: generation, dtype: int32

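LabelEncoder has encoded the six generations as the integers 0 to 5. The labels are assigned in alphabetical order of the category names, so 'Boomers' becomes 0 and 'Silent' becomes 5. This method is not always useful, because it imposes an order on the labels that is not present in the original variable, and a model that treats the column as numeric may interpret that order as meaningful.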
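
The alphabetical assignment can be checked from the fitted encoder itself. The following is a minimal sketch, assuming a hypothetical list `generations` that contains the six category names in place of the real column; the names `mapping` and `decoded` are introduced here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical sample containing all six generation categories
generations = ['Generation X', 'Silent', 'Generation X', 'G.I. Generation',
               'Boomers', 'Generation Z', 'Millenials']

le = LabelEncoder()
codes = le.fit_transform(generations)

# classes_ holds the sorted category names; the position of a name is its label
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original category names from the labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around matters, since `classes_` and `inverse_transform` are the only record of which integer stands for which generation once the column has been overwritten.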