<h2>Encoding</h2>
<h4>We use three forms of encoding, or categorical data transformations, to convert our 
categorical data into numbers suitable for machine learning.  One form converts labels, 
another handles nominal data, and the third handles ordinal data.
Nominal data is categorical values that have no natural order.  Examples are colors and
brands of products (unless we are ranking them by sales, quality, etc.).
Ordinal data is categorical values that are naturally ranked, such as shirt sizes or ratings like low, medium, and high.</h4>
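
<h4>Label encoding is the one form not demonstrated on the dataset below.  As a minimal sketch, 
assuming a small, made-up column of target labels (the values here are hypothetical and do not 
come from categorySamples.csv), pandas can assign each distinct label an integer code:</h4>

```python
import pandas as pd

# Hypothetical target labels, used only for illustration.
labels = pd.Series(['cat', 'dog', 'cat', 'bird'])

# factorize assigns integer codes in order of first appearance
# and returns the distinct labels alongside them.
codes, uniques = pd.factorize(labels)

print(codes)          # integer code for each row
print(list(uniques))  # label that each code stands for
```

<h4>The codes carry no ranking; they only identify the label, which is why this form suits 
targets rather than ordinal features.</h4>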

In [3]:
# Load the packages we will need.

print("One Hot Encoding example with the Category Samples Dataset.\nLoading packages...", end = '')

import numpy as np
import pandas as pd 
import os

print("Finished.")

One Hot Encoding example with the Category Samples Dataset.
Loading packages...Finished.


In [7]:
# Show all .csv files (data files) in the data directory.

data_dir = "C:/Users/weyen/OneDrive/Desktop/deep"

print('.csv files in the data directory:\n')
for file in os.listdir(data_dir):
    if file.endswith(".csv"):
        print(file)


.csv files in the data directory:

categorySamples.csv
fashion-mnist_test.csv
fashion-mnist_train.csv
nyc-dropped-columns.csv
nyc_collision_factors.csv


In [8]:
# Load the category samples data.  The data is in the categorySamples.csv file, which should be in 
# the Data directory (folder).  The data is loaded into a pandas DataFrame we will call df.  

print("Load the category samples dataset...")

df=pd.read_csv('categorySamples.csv')

print("Finished.")

Load the category samples dataset...
Finished.


In [9]:
# Show some statistics of the df data frame.

# dataframe.size
size = df.size
  
# dataframe.shape
shape = df.shape

# Print the size (rows x columns) and the shape.
print("Size = {}\nShape = {}".format(size, shape))

Size = 400
Shape = (100, 4)


In [10]:
#  Print the first few rows of the df data frame.

df.head()

Unnamed: 0,colors,cars,shirts,weight
0,Blue,Buick,med,167
1,Red,Toyota,med,182
2,White,Chevrolet,small,138
3,Green,Nissan,large,189
4,Green,Ford,x-large,166


<h4>Let's first look at the ordinal data; that is, the categorical data that is naturally ranked.
In this case, it is the shirts column, which lists various shirt sizes.  A model cannot work with 
the sizes as text, so we must convert them to numbers that preserve their order.</h4>

In [11]:
#  To start, let's get the unique values in a column.

dclass = df['shirts']
dclass = dclass.unique()
print(dclass)

['med' 'small' 'large' 'x-large' 'x-small']


In [12]:
# We must change these to numeric values.  We map x-small to 0, small to 1, and so on, so that
# the integers preserve the ranking of the sizes.  We do that by listing the unique sizes from 
# above in their proper order and pairing each with an integer (normally beginning with 0).
# Assigning the result back avoids calling replace(..., inplace=True) on a column, which 
# recent pandas versions warn about.  Any size missing from the mapping would become NaN.

size_codes = {'x-small': 0, 'small': 1, 'med': 2, 'large': 3, 'x-large': 4}
df['shirts'] = df['shirts'].map(size_codes)

# And let's see what it gives us.

df.head(len(dclass)*2)

Unnamed: 0,colors,cars,shirts,weight
0,Blue,Buick,2,167
1,Red,Toyota,2,182
2,White,Chevrolet,1,138
3,Green,Nissan,3,189
4,Green,Ford,4,166
5,Red,Land Rover,1,143
6,Yellow,Toyota,2,130
7,Blue,Buick,0,118
8,White,Nissan,3,121
9,Blue,Cadillac,2,201
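
<h4>An alternative way to rank the sizes, shown here only as a sketch, is an ordered pandas 
Categorical.  The small series below is a stand-in for the shirts column, not the actual data; 
the size names and their order are taken from the cell above.</h4>

```python
import pandas as pd

# Stand-in for the shirts column (same size names as above).
sizes = pd.Series(['med', 'small', 'large', 'x-large', 'x-small'])

# The order of this list defines the ranking, smallest first.
size_order = ['x-small', 'small', 'med', 'large', 'x-large']

ordered = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives each value's position in size_order;
# a value that is not in size_order would get the code -1.
codes = pd.Series(ordered.codes, index=sizes.index)
print(codes.tolist())
```

<h4>Keeping the order in one list makes the ranking explicit and easy to review.</h4>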


In [13]:
# Now that we've integer-encoded our ordinal data, let's save the file to a new name so we 
# preserve the original data while giving us a new file with which to work.

print("Saving the category samples dataset...")

df.to_csv('categorySamples_1.csv', index = False)

print("Finished.")

Saving the category samples dataset...
Finished.


<h4>Now we turn our attention to the nominal categorical data.  Nominal data is that which 
has no natural order.  Examples are names, colors, etc.  Integer codes would imply a ranking 
that does not exist, so we one-hot encode nominal data instead: each category becomes its own 0/1 column.</h4>

In [15]:
# Load the new category samples data.  The data is in the categorySamples_1.csv file, which 
# should have been saved in the Data directory (folder).  Let's now load this new data set
# into a pandas DataFrame we will again call df.  

print("Load the category samples dataset...")

df=pd.read_csv('categorySamples_1.csv')

print("Finished.")

Load the category samples dataset...
Finished.


In [16]:
#  To start, let's get the unique values in a column.  Let's look at the colors column.

dclass = df['colors']
dclass = dclass.unique()
print(dclass)

['Blue' 'Red' 'White' 'Green' 'Yellow' 'Orange' 'Black']


In [17]:
# One Hot Encode the colors.  Note that we load the OHE columns into a new
# data frame we're calling ohe.  Print the ohe data frame.

ohe = pd.get_dummies(df['colors'])
ohe.head(len(dclass)*2)

Unnamed: 0,Black,Blue,Green,Orange,Red,White,Yellow
0,0,1,0,0,0,0,0
1,0,0,0,0,1,0,0
2,0,0,0,0,0,1,0
3,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0
6,0,0,0,0,0,0,1
7,0,1,0,0,0,0,0
8,0,0,0,0,0,1,0
9,0,1,0,0,0,0,0
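
<h4>The output above shows 0/1 columns.  Since pandas 2.0, get_dummies returns True/False 
columns by default, so the sketch below passes dtype=int to get integers.  The four-row series 
is a made-up stand-in for the colors column.</h4>

```python
import pandas as pd

# Stand-in for the colors column (illustrative values only).
colors = pd.Series(['Blue', 'Red', 'White', 'Green'], name='colors')

# dtype=int requests 0/1 integers instead of booleans.
ohe_colors = pd.get_dummies(colors, dtype=int)

print(ohe_colors)

# Each row has exactly one 1, in the column matching its color.
print(ohe_colors.sum(axis=1).tolist())
```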


In [18]:
# Let's get the unique values in the cars column now and determine how many 
# different cars are in our data set.  Note that unique() also counts missing values (nan).

dclass = df['cars']
dclass = dclass.unique()
print(dclass)
print(len(dclass))

['Buick' 'Toyota' 'Chevrolet' 'Nissan' 'Ford' 'Land Rover' 'Cadillac' nan]
8


In [19]:
# One Hot Encode the cars column and prefix each new column with the original column name, cars.
# By default get_dummies skips missing values (nan), so we get 7 columns, not 8.

ohe = pd.get_dummies(df['cars'], prefix = "cars")

ohe.head(len(dclass)*2)

Unnamed: 0,cars_Buick,cars_Cadillac,cars_Chevrolet,cars_Ford,cars_Land Rover,cars_Nissan,cars_Toyota
0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,1
2,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0
4,0,0,0,1,0,0,0
5,0,0,0,0,1,0,0
6,0,0,0,0,0,0,1
7,1,0,0,0,0,0,0
8,0,0,0,0,0,1,0
9,0,1,0,0,0,0,0
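
<h4>The cars column contains a missing value (nan), and by default its row gets all zeros.  
As a sketch on a small made-up series, the dummy_na=True option of get_dummies adds an extra 
indicator column for missing values instead:</h4>

```python
import numpy as np
import pandas as pd

# Stand-in for the cars column, with one missing value.
cars = pd.Series(['Buick', np.nan, 'Ford'], name='cars')

# Default: the missing value is skipped, so its row is all zeros.
default = pd.get_dummies(cars, prefix='cars', dtype=int)

# dummy_na=True: one more column flags the missing value.
with_na = pd.get_dummies(cars, prefix='cars', dummy_na=True, dtype=int)

print(default)
print(with_na)
```

<h4>Whether a missing car should be its own category or be filled in beforehand depends on 
the data; this only shows the two behaviors side by side.</h4>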


In [20]:
# One Hot Encode multiple columns and preserve the original colors and cars column names.

ohe = pd.get_dummies(data = df, columns = ['colors', 'cars'])
ohe.head(len(dclass)*2)

Unnamed: 0,shirts,weight,colors_Black,colors_Blue,colors_Green,colors_Orange,colors_Red,colors_White,colors_Yellow,cars_Buick,cars_Cadillac,cars_Chevrolet,cars_Ford,cars_Land Rover,cars_Nissan,cars_Toyota
0,2,167,0,1,0,0,0,0,0,1,0,0,0,0,0,0
1,2,182,0,0,0,0,1,0,0,0,0,0,0,0,0,1
2,1,138,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,3,189,0,0,1,0,0,0,0,0,0,0,0,0,1,0
4,4,166,0,0,1,0,0,0,0,0,0,0,1,0,0,0
5,1,143,0,0,0,0,1,0,0,0,0,0,0,1,0,0
6,2,130,0,0,0,0,0,0,1,0,0,0,0,0,0,1
7,0,118,0,1,0,0,0,0,0,1,0,0,0,0,0,0
8,3,121,0,0,0,0,0,1,0,0,0,0,0,0,1,0
9,2,201,0,1,0,0,0,0,0,0,1,0,0,0,0,0


In [21]:
# One Hot Encode the whole data frame without naming columns.  get_dummies then encodes
# only the text (object or category) columns, colors and cars.  The numeric columns pass 
# through unchanged: weight, and shirts, which we already converted to integers above.

ohe = pd.get_dummies(data = df)

ohe.head(len(dclass)*2)

Unnamed: 0,shirts,weight,colors_Black,colors_Blue,colors_Green,colors_Orange,colors_Red,colors_White,colors_Yellow,cars_Buick,cars_Cadillac,cars_Chevrolet,cars_Ford,cars_Land Rover,cars_Nissan,cars_Toyota
0,2,167,0,1,0,0,0,0,0,1,0,0,0,0,0,0
1,2,182,0,0,0,0,1,0,0,0,0,0,0,0,0,1
2,1,138,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,3,189,0,0,1,0,0,0,0,0,0,0,0,0,1,0
4,4,166,0,0,1,0,0,0,0,0,0,0,1,0,0,0
5,1,143,0,0,0,0,1,0,0,0,0,0,0,1,0,0
6,2,130,0,0,0,0,0,0,1,0,0,0,0,0,0,1
7,0,118,0,1,0,0,0,0,0,1,0,0,0,0,0,0
8,3,121,0,0,0,0,0,1,0,0,0,0,0,0,1,0
9,2,201,0,1,0,0,0,0,0,0,1,0,0,0,0,0
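
<h4>To see why shirts and weight come through untouched, here is a small sketch on a made-up 
two-row frame.  When no columns argument is given, get_dummies converts only columns with 
object, string, or category dtype.</h4>

```python
import pandas as pd

# Tiny illustrative frame: one text column and two numeric columns.
demo = pd.DataFrame({
    'colors': ['Blue', 'Red'],
    'shirts': [2, 1],        # already integer-encoded
    'weight': [167, 182],
})

# No columns argument: only the text column is expanded.
encoded = pd.get_dummies(demo, dtype=int)

print(list(encoded.columns))
```

<h4>If shirts had still held size names, it would have been expanded as well, which is one 
reason to encode the ordinal column first.</h4>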


In [22]:
# We've now encoded all of our categorical data: the ordinal column as integers and the nominal 
# columns as one-hot columns.  Let's save the file to a new name so we, as before, preserve the 
# original data while giving us a new file with which to work.  This new file will be the one we 
# will load and use to train and test our models.  Here I've added the abbreviation "dev" so it is 
# clear that we will be using it for our model development work.

print("Saving the category samples dataset...")

ohe.to_csv('categorySamples_dev.csv', index = False)

print("Finished.")

Saving the category samples dataset...
Finished.


In [23]:
# Load our development version of the category samples data.  It's in the categorySamples_dev.csv file, 
# which we just saved, and it should be in the Data directory (folder).  The data is loaded into a 
# pandas DataFrame we will again call df.  

print("Load the category samples development dataset...")

df=pd.read_csv('categorySamples_dev.csv')

print("Finished.")

Load the category samples development dataset...
Finished.


In [24]:
# Show the statistics of our development df data frame.

# dataframe.size
size = df.size
  
# dataframe.shape
shape = df.shape

# Print the size (rows x columns) and the shape.
print("Size = {}\nShape = {}".format(size, shape))

Size = 1600
Shape = (100, 16)
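
<h4>The new shape follows from the encoding: the two numeric columns (shirts and weight) plus 
seven color columns and seven car columns.  A quick arithmetic check, with the counts copied 
from the outputs above:</h4>

```python
# Counts taken from the outputs shown earlier in this notebook.
n_rows = 100
n_numeric = 2   # shirts, weight
n_colors = 7    # one column per color
n_cars = 7      # one column per car make (nan is skipped)

n_cols = n_numeric + n_colors + n_cars
size = n_rows * n_cols

print("Columns = {}\nSize = {}".format(n_cols, size))
```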


In [25]:
# And here's what it looks like:

df.head()

Unnamed: 0,shirts,weight,colors_Black,colors_Blue,colors_Green,colors_Orange,colors_Red,colors_White,colors_Yellow,cars_Buick,cars_Cadillac,cars_Chevrolet,cars_Ford,cars_Land Rover,cars_Nissan,cars_Toyota
0,2,167,0,1,0,0,0,0,0,1,0,0,0,0,0,0
1,2,182,0,0,0,0,1,0,0,0,0,0,0,0,0,1
2,1,138,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,3,189,0,0,1,0,0,0,0,0,0,0,0,0,1,0
4,4,166,0,0,1,0,0,0,0,0,0,0,1,0,0,0
