# Encoding categorical data to numerical data

In this exercise, I show how to convert data from categorical string format to numerical format.  Most machine learning models only work with numeric data.

I chose an arbitrary dataset called **police.csv**, taken from an exercise presented by **Kevin Markham** on his YouTube channel Data School.  Thanks Kevin.

The dataset records over 91,000 traffic stops and includes the race, gender, age, etc. of each driver.

My aim is to train a logistic regression model on the data in order to predict who will get arrested.  Before I can do that, I need to encode the categorical features.

## 1. Import the data

In [88]:
import pandas as pd
data = pd.read_csv('police.csv')  ## Import the csv file
data.head()


Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


## 2. Subset the data

As we can see above, the dataset has a number of columns.  I plan to use only a subset of them in training the model, so I will select the following:  driver_gender, driver_age, driver_race, violation, search_conducted & is_arrested.

This is shown below.


In [89]:
data2 = data[['driver_gender','driver_age','driver_race','violation','search_conducted','is_arrested']]
data2.head()

Unnamed: 0,driver_gender,driver_age,driver_race,violation,search_conducted,is_arrested
0,M,20.0,White,Speeding,False,False
1,M,40.0,White,Speeding,False,False
2,M,33.0,White,Speeding,False,False
3,M,19.0,White,Other,False,True
4,F,21.0,White,Speeding,False,False


## 3.  Encode the categorical features using "get_dummies."

As we can see above, the dataset now only includes the chosen subset of features.  Note that every column except "driver_age" holds categorical data, so these columns need to be encoded.

We will do so by using the pandas get_dummies function shown below, which carries out one-hot encoding.

In [90]:
data2_dummies =  pd.get_dummies(data2)

data2_dummies.head()

Unnamed: 0,driver_age,search_conducted,driver_gender_F,driver_gender_M,driver_race_Asian,driver_race_Black,driver_race_Hispanic,driver_race_Other,driver_race_White,violation_Equipment,violation_Moving violation,violation_Other,violation_Registration/plates,violation_Seat belt,violation_Speeding,is_arrested_False,is_arrested_True
0,20.0,False,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0
1,40.0,False,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0
2,33.0,False,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0
3,19.0,False,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1
4,21.0,False,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0


We can see above that all the original columns were one-hot encoded, except "driver_age" and "search_conducted".

To be honest, as I write this, I am not sure why it didn't apply it to "search_conducted."
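
A likely explanation, which can be checked with `data2.dtypes`: by default `pd.get_dummies` only encodes columns whose dtype is object, string or category. A column of clean True/False values has dtype bool and is passed through unchanged, while a True/False column that also contains NaNs is stored as object and so gets encoded. The sketch below uses a small made-up frame (the values are hypothetical, not taken from police.csv) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, mimicking the dtypes discussed above
toy = pd.DataFrame({
    "search_conducted": [False, True, False],  # clean booleans -> dtype bool
    "is_arrested": [False, True, np.nan],      # booleans plus a NaN -> dtype object
    "driver_gender": ["M", "F", "M"],          # strings
})

print(toy.dtypes)

# Only the non-bool, non-numeric columns are expanded into dummy columns
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

If this is what happened with the real data, "search_conducted" was skipped simply because it was already a boolean column.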

Below I print the original features and their count, followed by the features after encoding and their count.
The number of features (columns) increased from 6 to 17.

In [94]:
print('Original features:', list(data2.columns), '\n')
print('Number of original features:', data2.shape[1] , '\n')
print('New features after encoding:', list(data2_dummies.columns), '\n')
print('Number of features after encoding:', data2_dummies.shape[1])

Original features: ['driver_gender', 'driver_age', 'driver_race', 'violation', 'search_conducted', 'is_arrested'] 

Number of original features: 6 

New features after encoding: ['driver_age', 'search_conducted', 'driver_gender_F', 'driver_gender_M', 'driver_race_Asian', 'driver_race_Black', 'driver_race_Hispanic', 'driver_race_Other', 'driver_race_White', 'violation_Equipment', 'violation_Moving violation', 'violation_Other', 'violation_Registration/plates', 'violation_Seat belt', 'violation_Speeding', 'is_arrested_False', 'is_arrested_True'] 

Number of features after encoding: 17


Because the "search_conducted" column was not encoded, I will make use of LabelEncoder to do this.  I need to read up more on why get_dummies did not apply to this column.

Below I use sklearn LabelEncoder and apply it to the column.

## 4.  Encode remaining column with LabelEncoder

In [48]:
from sklearn.preprocessing import LabelEncoder as LE

l_encode = LE()
## Map False/True in search_conducted to 0/1
data2_dummies['search_conducted'] = l_encode.fit_transform(data2['search_conducted'])
data2_dummies.shape

(91741, 17)
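
As an aside, if "search_conducted" is a plain boolean column (an assumption worth confirming with `data2.dtypes`), the same 0/1 result can be obtained without scikit-learn by casting the column to integers. The scikit-learn documentation also describes LabelEncoder as intended for target labels rather than input features. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical boolean column standing in for search_conducted
toy = pd.DataFrame({"search_conducted": [False, True, False, True]})

# Casting bool to int maps False -> 0 and True -> 1
toy["search_conducted"] = toy["search_conducted"].astype(int)
print(toy["search_conducted"].tolist())  # [0, 1, 0, 1]
```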

## 5.  Clean the data

The next step is to clean up the data.  A simple approach is to find the rows that contain NaNs and delete them.
First we count the NaNs in each column with the pandas call df.isnull().sum()

In [95]:
## Clean up the data.  Count the NaNs in each column
data2_dummies.isnull().sum()

driver_age                       5621
search_conducted                    0
driver_gender_F                     0
driver_gender_M                     0
driver_race_Asian                   0
driver_race_Black                   0
driver_race_Hispanic                0
driver_race_Other                   0
driver_race_White                   0
violation_Equipment                 0
violation_Moving violation          0
violation_Other                     0
violation_Registration/plates       0
violation_Seat belt                 0
violation_Speeding                  0
is_arrested_False                   0
is_arrested_True                    0
dtype: int64

Above we can see that "driver_age" has 5621 NaNs, which could be due to the police officer not recording the age during the traffic stop.
I will delete all the rows in the dataset that include NaNs.

Below we use .dropna(inplace = True)

We can see that the number of rows has decreased by 5621, from 91741 to 86120.

In [96]:
data2_dummies.dropna(inplace = True)
data2_dummies.shape

(86120, 17)
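
One caveat worth checking, stated here as an assumption rather than something verified against police.csv: if a categorical column had NaNs before encoding, `get_dummies` with its default `dummy_na=False` turns each of those rows into all-zero dummy columns. The NaNs then no longer show up in `isnull().sum()` on the encoded frame, and `dropna` will not remove them. The sketch below uses hypothetical values to show the effect, and how dropping incomplete rows before encoding avoids it.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: one missing age, one missing label
toy = pd.DataFrame({
    "driver_age": [20.0, np.nan, 33.0, 19.0],
    "is_arrested": [False, True, np.nan, True],
})

# Encode first: the missing label becomes a row of all-zero dummies
encoded = pd.get_dummies(toy, columns=["is_arrested"])
dummy_cols = ["is_arrested_False", "is_arrested_True"]
print(encoded[dummy_cols].isnull().sum())                   # no NaNs reported
print(int((encoded[dummy_cols].sum(axis=1) == 0).sum()))    # rows with no label set

# Drop incomplete rows first, then encode: both kinds of gap are removed
clean = pd.get_dummies(toy.dropna(), columns=["is_arrested"])
print(clean.shape)
```

Comparing `data2.isnull().sum()` with the counts above would show whether this applies to the real dataset.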

## 6. Train a logistic regression model

The next step is to train the logistic regression model.
First we need to identify the features and target data.

In [86]:
## machine learning
X = data2_dummies.loc[:,'driver_age':'violation_Speeding'].values  # convert to numpy array
y = data2_dummies.loc[:,'is_arrested_True'].values   # convert to numpy array

I will import the LogisticRegression model and the train_test_split function for splitting the data into training and test sets.

In [97]:
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0)

logreg = LR()
logreg.fit(X_train, y_train)

print('Logistic regression score on test set :{}'.format(logreg.score(X_test, y_test)))

Logistic regression score on test set :0.9665582907570831
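
The score reported by `logreg.score` is plain accuracy, which can be hard to interpret when one class is much rarer than the other. One way to put it in context is to compare it with a baseline that always predicts the most frequent class. The sketch below does this on synthetic data (the 5% positive rate and the random features are assumptions for illustration, not properties of police.csv).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: roughly 5% positives, features unrelated to the label
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline that always predicts the majority class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
logreg = LogisticRegression().fit(X_train, y_train)

print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Logistic regression accuracy:", logreg.score(X_test, y_test))
```

Running the same comparison on the real `X_train`/`y_train` would show how much of the test accuracy comes from the model rather than from the class balance.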
