# One-Hot Encoding 

One-hot encoding is a widely used feature transformation technique that turns categorical values into binary representations. Most machine learning algorithms operate on numerical inputs, so categorical data must first be transformed into some form of numerical representation. One-hot encoding is a good candidate for this task because it is easy to understand and straightforward to implement.

When we apply one-hot encoding to a categorical column in our dataset, we create a new binary indicator column for every unique value in that column.

Let's consider an example. Let's say we have a feature named `animal` that can have one of three possible values: `Dog`, `Cat` and `Dinosaur`. We would replace the column `animal` with three new columns, one for every potential value of `animal`: `Dog`, `Cat` and `Dinosaur`. Each new column would contain binary values. For example, for every row in which the original column `animal` had the value of `Dog`, the new column `Dog` would have the value of 1. For every row in which the original column `animal` did NOT have the value of `Dog`, the new column `Dog` would have the value of 0. Compare the original dataset below with the resulting dataset after one-hot encoding has been performed on the `animal` column.

<img src='onehotencoding3.png' width=600 height=600 align="left"/>
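
The transformation described above can be sketched on a toy DataFrame. The four rows here are made up for illustration and are not taken from the data set used in this demo; the loop simply builds one indicator column per value of `animal`.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration; not the data set used in this demo)
toy = pd.DataFrame({"animal": ["Dog", "Cat", "Dinosaur", "Dog"]})

# Create one binary indicator column per possible value of `animal`
for value in ["Dog", "Cat", "Dinosaur"]:
    toy[value] = np.where(toy["animal"] == value, 1, 0)

# Replace the original categorical column with the indicator columns
toy = toy.drop(columns="animal")
print(toy)
```

Each row ends up with exactly one 1, in the column that matches its original value.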

Recall that a k-Nearest Neighbors (KNN) model relies on computing distances between examples, so it cannot use features for which a distance cannot be computed, such as string-valued categorical features.
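
As a rough illustration of why encoding helps here, the sketch below uses two hand-written one-hot vectors (toy values, not rows from the data set). Once categories are represented this way, a Euclidean distance is well defined: identical categories are at distance 0, and two different categories are at distance √2.

```python
import numpy as np

# One-hot vectors in the column order [Dog, Cat, Dinosaur] (toy values)
dog = np.array([1, 0, 0])
cat = np.array([0, 1, 0])

# Euclidean distance between two examples with the same category
same = np.linalg.norm(dog - dog)

# Euclidean distance between two examples with different categories
different = np.linalg.norm(dog - cat)

print(same, different)
```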

In the previous exercise, you removed these features from the dataset before implementing the KNN model. In this demo, you will see how to use one-hot encoding to transform them instead.

### Import Packages

Let's begin by loading the required packages:

In [1]:
import pandas as pd
import numpy as np
import os 

### Load the Data Set

We will once again work with the "cell2celltrain" data set. This data set is already preprocessed, with the proper formatting, outliers and missing values taken care of, and all numerical columns scaled to the [0, 1] interval.

In [2]:
filename = os.path.join("data", "cell2celltrain.csv")
df = pd.read_csv(filename, header=0)

In [None]:
df.shape

In [None]:
df.head()

### Find the Columns Containing String Values

In [None]:
df.dtypes

The code cell below finds all columns of type `object`.

In [None]:
to_encode = list(df.select_dtypes(include=['object']).columns)
print(to_encode)


Below you will one-hot encode the columns using three different approaches.

## One-Hot Encode the Data Using NumPy

In the last exercise, you removed these columns from DataFrame `df`. This time, we will transform them using one-hot encoding.

There are 5 object-type columns in our DataFrame `df`. Let's inspect the number of unique values each column (feature) has.

In [None]:
df[to_encode].nunique()

Notice that column `ServiceArea` has 747 potential values. This means we would have to create 747 new binary indicator columns - one column per unique value. That is too many!

Let's handle the special case of column `ServiceArea` first. Creating this many indicator columns would slow down later computations. Instead, we will one-hot encode only the 10 most frequent values in column `ServiceArea`.

In [None]:
top_10_SA = list(df['ServiceArea'].value_counts().head(10).index)

top_10_SA

Now that we have obtained the ten most frequent values for `ServiceArea`, let's transform DataFrame `df` to represent these values numerically.

1. Create new columns to represent `ServiceArea`.

    * Instead of the original `ServiceArea` column, `df` must contain ten one-hot encoded columns: one column for every value in the top 10 most frequent service areas.

    * For example, there will be one column for `NYCBRO917`, one column for `HOUHOU281`, one column for `DALDAL214` and so on. We will name each column `ServiceArea_<service area value>`. For example, there will be a column named `ServiceArea_NYCBRO917`.


2. Create values for each column.

    * Each column will have a value of either 0 or 1. 

    * 1 means that the row in question had that corresponding value present in the original `ServiceArea` column. 

    * For example, row 47 in DataFrame `df` originally had the value `DALDAL214` in column `ServiceArea`. After one-hot encoding, row 47 will have the value of 1 in new column `ServiceArea_DALDAL214`.
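
The two steps above can be sketched on a toy column. The values `A` through `D` and the column name `area` are hypothetical, chosen only for illustration, and the sketch keeps the top 2 values rather than the top 10. Note that a row whose value is not among the most frequent ones gets 0 in every new column.

```python
import numpy as np
import pandas as pd

# Toy data with made-up area codes (hypothetical values for illustration)
toy = pd.DataFrame({"area": ["A", "B", "A", "C", "A", "B", "D"]})

# Keep only the 2 most frequent values: "A" appears 3 times, "B" twice
top_2 = list(toy["area"].value_counts().head(2).index)

# Create one indicator column per retained value
for value in top_2:
    toy["area_" + value] = np.where(toy["area"] == value, 1, 0)

# Rows holding "C" or "D" have 0 in both indicator columns
print(toy)
```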
    
The code cell below accomplishes the task of creating ten one-hot encoding columns. 


In [None]:
for value in top_10_SA:
    
    # Create the new column and fill it with 1 where the row matches, else 0
    df['ServiceArea_' + value] = np.where(df['ServiceArea'] == value, 1, 0)
    
    
# Remove the original column from your DataFrame df
df.drop(columns = 'ServiceArea', inplace=True)

# Remove from list to_encode
to_encode.remove('ServiceArea')

Inspect DataFrame `df` to see the new columns and their values.

In [None]:
df.head()

In [None]:
df.columns

Let's inspect column `ServiceArea_DALDAL214` in row 47. Remember, it should have a value of 1.

In [None]:
df.loc[47, 'ServiceArea_DALDAL214']

## One-Hot Encode the Data Using Pandas

Now that we have successfully transformed the `ServiceArea` column, let us transform the `Married` column. Let's inspect the values in `Married`.

In [None]:
df['Married']

We will perform the same transformation as above, but using a simpler approach: the Pandas `pd.get_dummies()` function. Recall that we often refer to a binary variable that represents a categorical value as a "dummy" variable.
For more information, consult the online [documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html). 
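
A minimal sketch of how `pd.get_dummies()` behaves, using a toy column whose values are made up for illustration and may differ from the actual values of `Married` in the data set. The `prefix` is joined to each value with an underscore by default, and `dtype=int` is passed so the indicators are 0/1 integers rather than booleans.

```python
import pandas as pd

# Toy column (made-up values; the real column may contain different ones)
toy = pd.DataFrame({"Married": ["Yes", "No", "Unknown", "Yes"]})

# One indicator column per unique value, named "<prefix>_<value>"
dummies = pd.get_dummies(toy["Married"], prefix="Married", dtype=int)

print(dummies.columns.tolist())
print(dummies)
```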


In the code cell below, we are specifying which column to encode and the prefix for the new columns. Note that `pd.get_dummies()` returns a new DataFrame with the new one-hot encoded values.

In [None]:
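# get_dummies inserts the "_" separator itself, so the prefix is just 'Married';
# dtype=int keeps the indicator values as 0/1 instead of booleans
df_Married = pd.get_dummies(df['Married'], prefix='Married', dtype=int)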
df_Married


Since the `pd.get_dummies()` function didn't make the changes to the original DataFrame `df`, let us add the new DataFrame `df_Married` to DataFrame `df`, and delete the original `Married` column.


In [None]:
# Concatenate with the encoded dataframe:
df = df.join(df_Married)

# Remove the original column from your DataFrame df
df.drop(columns = 'Married', inplace=True)

# Remove from list to_encode
to_encode.remove('Married')


Let's inspect DataFrame `df`.

In [None]:
df.columns

## One-Hot Encode the Data Using Scikit-Learn

Instead of transforming each column using the NumPy `np.where()` or Pandas `pd.get_dummies()` functions, we can use the more robust `OneHotEncoder` transformation class from `sklearn`. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Note that you may have to handle missing values in your data prior to using  `OneHotEncoder`.

In [None]:
from sklearn.preprocessing import OneHotEncoder

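# Create the encoder
# (sparse_output replaced the sparse parameter in scikit-learn 1.2):
encoder = OneHotEncoder(handle_unknown="error", sparse_output=False)

# Apply the encoder, keeping the row index of df so the later join aligns:
df_enc = pd.DataFrame(encoder.fit_transform(df[to_encode]), index=df.index)


# Reinstate the original column names
# (get_feature_names_out replaced get_feature_names in scikit-learn 1.0):
df_enc.columns = encoder.get_feature_names_out(to_encode)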



Let's glance at the one-hot encoded columns.

In [None]:
df_enc.head()

We can now merge the transformed categorical features into DataFrame `df` and remove the original columns that we have just transformed.

In [None]:
# Concatenate with the encoded dataframe:
df = df.join(df_enc)

# Remove the original categorical features from DataFrame df:
df.drop(columns=to_encode, inplace=True)



Let's now inspect DataFrame `df` and see how it has been transformed. Notice the new dimensions of `df`.

In [None]:
df.columns

In [None]:
df.shape
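
The change in dimensions can be reasoned about with simple arithmetic: each encoded column is removed and replaced by one indicator column per unique value. The sketch below checks this on a toy DataFrame (made-up column names and values). The count only holds when every unique value is encoded, so it does not apply to a column like `ServiceArea`, where only the 10 most frequent values were kept.

```python
import pandas as pd

# Toy data: one numeric column and two categorical columns (made-up values)
toy = pd.DataFrame({
    "x": [0.1, 0.5, 0.9],
    "plan": ["a", "b", "a"],
    "region": ["n", "s", "e"],
})
cat_cols = ["plan", "region"]

# Expected width: original columns, minus the encoded ones,
# plus one indicator column per unique value
expected = toy.shape[1] - len(cat_cols) + int(toy[cat_cols].nunique().sum())

# Encode both categorical columns in one call and compare
encoded = pd.get_dummies(toy, columns=cat_cols, dtype=int)
print(expected, encoded.shape[1])
```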