# Machine Learning Models: One Hot Encoding

## One Hot Encoding is a method for converting categorical variables into a binary format. It creates one new column per category, where 1 (or True) means the category is present and 0 (or False) means it is not. Its primary purpose is to let categorical data be used by machine learning models, most of which expect numeric input.
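
To make the idea concrete before loading real data, here is a minimal sketch on a made-up `color` column (the `toy` dataframe and its values are hypothetical, not part of adult.data). It passes `dtype=int` to `pd.get_dummies` so the new columns show 1/0 rather than True/False.

```python
import pandas as pd

# Hypothetical toy data: a single categorical column.
toy = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One new column per distinct category, named <column>_<category>.
# dtype=int makes the indicator columns 1/0 instead of True/False.
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)

print(encoded)
```

Each row ends up with exactly one 1 across the new columns, which is where the name "one hot" comes from.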

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.datasets import make_blobs

## We will be investigating text data using One Hot Encoding!

In [29]:
## First let's load our dataset from the adult.data file, which is a CSV file
## A companion file, adult.names, gives more information about adult.data and what the columns represent

## As described in adult.names, the data is meant for building a machine learning model that predicts whether a person makes above or below 50k per year
## The file includes whether each person does or does not make 50k, so the correct answers are available as labels for training and checking a model

## For our purposes, the categorical columns are what we will One Hot Encode to prep the data for a model

X = pd.read_csv('adult.data')

In [19]:
## Let's check the shape of the data! 15 columns and about 32,000 rows

X.shape

(32560, 15)

In [23]:
## Now notice that this data has no header row. The first record is being used as the column headers, which is bad. We'll need to fix that first!

X.head()

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [31]:
## So first we'll re-read the CSV with header=None so the first record is not treated as the column headers

X = pd.read_csv('adult.data',header=None)

In [35]:
## Now notice the columns are labeled with just numbers (a default integer column index)

X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [54]:
## Now let's rename the columns

X = X.rename(columns={0:'age',1:'workclass',2:'fnlwgt',3:'education',4:'education-num',5:'marital-status',6:'occupation',7:'relationship',8:'race',9:'sex',10:'capital-gain',11:'capital-loss',12:'hours-per-week',13:'native-country',14:'salary',})
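
As an alternative sketch, the column names can be supplied at read time through the `names` parameter of `pd.read_csv`, which avoids the separate rename step. The example below reads two records from an in-memory string instead of the file; the ", " separator in `sample` is an assumption based on the leading spaces visible later in values such as ' <=50K'. `skipinitialspace=True` strips those spaces, which would also change later column names and labels (for example `workclass_Private` instead of `workclass_ Private`).

```python
import io
import pandas as pd

column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary',
]

# Two records from the output above, written with an assumed ", " separator.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None: the first line is data; names= supplies the headers directly.
# skipinitialspace=True drops the space that follows each comma.
df = pd.read_csv(sample, header=None, names=column_names, skipinitialspace=True)

print(df[['age', 'workclass', 'salary']])
```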

In [59]:
## Great! Now the column names are correct!

X

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [67]:
## Now let's dive into setting up the One Hot Encoding
## First we create some binary columns
## We will store the result in a new X2 dataframe so we don't alter our original dataframe.

## We choose a categorical (string) column, and for every distinct string in that column there will be a new binary column
## where True means the row held that value in the original column and False means it did not

## We will use the 'workclass' column, as it is a categorical column describing what sector someone works in

X2 = pd.get_dummies(X,columns=['workclass'])

In [71]:
## Now notice at the end of the dataframe there are new binary columns
## Each of those columns represents one sector, and is True for the rows that had that sector
## Also notice that the original 'workclass' column is gone

X2.head()

Unnamed: 0,age,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,...,salary,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay
0,39,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,...,<=50K,False,False,False,False,False,False,False,True,False
1,50,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,...,<=50K,False,False,False,False,False,False,True,False,False
2,38,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,...,<=50K,False,False,False,False,True,False,False,False,False
3,53,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,...,<=50K,False,False,False,False,True,False,False,False,False
4,28,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,...,<=50K,False,False,False,False,True,False,False,False,False
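
A quick way to sanity-check an encoding like this is to confirm that every row has exactly one True among the new columns. The sketch below does that on a small hypothetical dataframe (`demo`) built from the five workclass values shown in the output above, leading spaces included.

```python
import pandas as pd

# Five workclass values copied from the head() output (leading spaces kept).
demo = pd.DataFrame({
    "workclass": [" State-gov", " Self-emp-not-inc", " Private", " Private", " Private"]
})
dummies = pd.get_dummies(demo, columns=["workclass"])

# Count how many indicator columns are "hot" in each row; it should always be 1.
hot_per_row = dummies.astype(int).sum(axis=1)

# idxmax returns the name of the column holding that single 1,
# which recovers the original category for each row.
recovered = dummies.astype(int).idxmax(axis=1)

print(hot_per_row.tolist())
print(recovered.tolist())
```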


In [75]:
## Now that we understand what get_dummies is capable of, let's do the same for more columns
## The columns we are interested in are: workclass, education, marital-status, occupation, relationship, race, sex, native-country

X2 = pd.get_dummies(X,columns=['workclass','education','marital-status','occupation','relationship','race','sex','native-country'])

In [79]:
## We now have 109 columns!! Wow! That's a lot!

X2.shape

(32561, 109)
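
The column count follows from simple arithmetic: the columns that were not encoded stay as they are, and each encoded column is replaced by one column per distinct value. In the notebook, 15 original columns minus the 8 encoded ones leaves 7, so the other 102 of the 109 columns are dummy columns. A minimal sketch of the same calculation on a hypothetical toy dataframe:

```python
import pandas as pd

# Hypothetical toy data: one numeric column and two categorical columns.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
    "sex": ["Male", "Male", "Female", "Female"],
})
cat_cols = ["workclass", "sex"]

# Untouched columns + one new column per distinct value in each encoded column.
expected = (toy.shape[1] - len(cat_cols)) + toy[cat_cols].nunique().sum()

encoded = pd.get_dummies(toy, columns=cat_cols)

print(expected, encoded.shape[1])
```

Running `X[cols].nunique()` on the real dataframe, with `cols` being the list of encoded columns, shows which columns contribute the most new columns.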

In [81]:
X2.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,salary,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,77516,13,2174,0,40,<=50K,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,50,83311,13,0,0,13,<=50K,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,38,215646,9,0,0,40,<=50K,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,53,234721,7,0,0,40,<=50K,False,False,False,...,False,False,False,False,False,False,False,True,False,False
4,28,338409,13,0,0,40,<=50K,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [83]:
## Now remember that initially we wanted to determine if someone makes above or below 50k per year
## And remember we have the answer in the 'salary' column. That column represents our y
## Let's separate that y out from our X2

y = X2.salary

In [87]:
## Great! We now have y set to the salary

y

0         <=50K
1         <=50K
2         <=50K
3         <=50K
4         <=50K
          ...  
32556     <=50K
32557      >50K
32558     <=50K
32559     <=50K
32560      >50K
Name: salary, Length: 32561, dtype: object

In [89]:
## Let's see the unique values in y

y.unique()

array([' <=50K', ' >50K'], dtype=object)

In [91]:
## Ok, now we can see this only contains the strings ' <=50K' and ' >50K' (note the leading space)
## Let's replace these strings with a binary value (0 or 1) and set that in our X2 'salary' column
## First let's get the unique values and store them in a vals variable

vals = y.unique()
vals

array([' <=50K', ' >50K'], dtype=object)

In [93]:
## Great, now let's build a boolean mask by comparing the salary column in X2 against our vals variable

idx = X2.loc[:,'salary']==vals[0] ## .loc selects the full salary column; comparing it to the first value in vals gives a boolean Series: True where salary is ' <=50K'
idx

0         True
1         True
2         True
3         True
4         True
         ...  
32556     True
32557    False
32558     True
32559     True
32560    False
Name: salary, Length: 32561, dtype: bool
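
A boolean mask such as `idx` picks rows; the second argument to `.loc` picks columns. The sketch below, on hypothetical toy data, shows the difference between passing `:` (every column) and passing a single column label when assigning through a mask.

```python
import pandas as pd

# Hypothetical toy data; salary is kept as an object column, as in the notebook.
toy = pd.DataFrame({
    "age": [39, 40],
    "salary": pd.Series([" <=50K", " >50K"], dtype=object),
})
mask = toy["salary"] == " <=50K"

# ':' as the column selector overwrites EVERY column in the masked rows.
all_columns = toy.copy()
all_columns.loc[mask, :] = 0

# A column label restricts the assignment to that one column.
one_column = toy.copy()
one_column.loc[mask, "salary"] = 0

print(all_columns)
print(one_column)
```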

In [95]:
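## Now let's change the values in the X2 'salary' column to 0's or 1's based on idx
## Note: the column must be named in .loc; using ':' would overwrite every column in those rows

X2.loc[idx,'salary'] = 0
X2.loc[~idx,'salary'] = 1 ## the ~ means the complement (logical NOT) of idx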


In [99]:
## Great! Now we can see our salary column in X2 holds 0's and 1's (the dtype is still object, so cast with astype(int) if a numeric dtype is needed)

X2.salary

0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: salary, Length: 32561, dtype: object
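
As a possible alternative to assigning through a mask, the labels can be converted in one vectorized step by comparing against the positive class and casting the result to int. This is only a sketch on a hypothetical `labels` Series; it strips the leading space first and yields an integer dtype rather than object.

```python
import pandas as pd

# Hypothetical sample of salary labels, with the leading space seen in the data.
labels = pd.Series([" <=50K", " <=50K", " >50K", " <=50K", " >50K"], name="salary")

# Strip whitespace, test for the positive class, then cast True/False to 1/0.
target = (labels.str.strip() == ">50K").astype(int)

print(target.tolist())
print(target.dtype)
```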

## Now that we have prepared all of the categorical data in the dataframe we can save the data for further use as 'abc.csv'

In [102]:
X2.to_csv('abc.csv')
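
One thing to be aware of: by default `to_csv` also writes the row index, and reading the file back then produces an extra `Unnamed: 0` column. A minimal sketch of the difference, using in-memory buffers and a hypothetical two-row dataframe rather than the real file:

```python
import io
import pandas as pd

# Hypothetical small frame standing in for X2.
frame = pd.DataFrame({"age": [39, 50], "salary": [0, 1]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the round trip keeps the same columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

If the index carries no information, `X2.to_csv('abc.csv', index=False)` would avoid the extra column when the file is loaded again.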