# Is One-Hot Encoding Really Necessary?

![title](http://oi65.tinypic.com/2nvf248.jpg)

For one particular project, I can only integer-encode a categorical feature; one-hot encoding is not an option.

Many Machine Learning experts, such as Sebastian Raschka ("Python Machine Learning"), urge coders to one-hot encode. The main argument is that for categorical values without a natural order, integer encoding can screw up your model, because it suggests an ordering that does not exist. One-hot encoding has disadvantages of its own, though (introduction of correlation, overfitting...).

I did not want to look stupid in front of my customer, so I did what any reasonable ML coder does: I created my own dataset to test it.

SYNTHETIC DATASET:
- 50 salespeople
- 8 features, A to H (random values ranging from 1 to 100)
- 20 examples per salesperson (i.e. each salesperson appears 20 times, 1,000 rows in total)
- Exactly 1 salesperson always has label 1.
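
A dataset with this shape could be generated roughly as follows. This is only a sketch, not the code that produced `One_Hot_Test.xlsx`: the `sales_00` style names, the seed, and the choice of which salesperson gets label 1 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12)  # seed chosen arbitrarily for this sketch

n_people, n_repeats = 50, 20
names = [f"sales_{i:02d}" for i in range(n_people)]  # hypothetical names

# Each salesperson appears 20 times, giving 1000 rows
synthetic = pd.DataFrame({"Name": np.repeat(names, n_repeats)})

# Features A to H: random integers from 1 to 100 (upper bound is exclusive)
for col in "ABCDEFGH":
    synthetic[col] = rng.integers(1, 101, size=len(synthetic))

# One salesperson (picked arbitrarily here) always has label 1, everyone else 0
synthetic["Label"] = (synthetic["Name"] == "sales_25").astype(int)

print(synthetic.shape)           # (1000, 10)
print(synthetic["Label"].sum())  # 20
```

Because the features are pure noise, the name is the only column that carries any signal about the label in this sketch.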

The hypothesis is that if I only integer-encode the salespeople (salesperson 1 = 1, salesperson 2 = 2, ...), the model should struggle, because salespeople are not ordinal (i.e. salesperson 40 is not better than salesperson 4).
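
To make the two encodings concrete, here is a minimal sketch on a four-row toy column (the toy frame and the `Name_int` column name are made up for illustration; the names are borrowed from the dataset).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"Name": ["Perry", "Deadra", "Terry", "Deadra"]})

# Integer encoding: a single column; codes follow the alphabetical order of the names
toy["Name_int"] = LabelEncoder().fit_transform(toy["Name"])

# One-hot encoding: one 0/1 column per distinct name
one_hot = pd.get_dummies(toy["Name"], dtype=int)

print(toy["Name_int"].tolist())  # [1, 0, 2, 0]
print(one_hot.columns.tolist())  # ['Deadra', 'Perry', 'Terry']
print(one_hot.values.tolist())   # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

With the integer codes, a model sees "Terry" (2) as twice "Perry" (1), although nothing in the data justifies that. The one-hot columns carry no such ordering.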

# Difference between Integer Encoding and One-Hot Encoding

All figures are for label 1 on the test set.

Logistic Regression:
- Integer Encoding: precision 67%, recall 100%
- One-Hot Encoding: precision 67%, recall 100%

Decision Tree Classifier:
- Integer Encoding: precision 80%, recall 100%
- One-Hot Encoding: precision 100% (!), recall 100%

# Conclusion:
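Logistic Regression does not seem to care about one-hot encoding here, but the Decision Tree clearly does: its precision for label 1 rises from 80% to 100%. With only 4 positive examples in the test set, that gap corresponds to a single false positive, so treat it as an indication rather than proof.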
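
One plausible explanation for the tree's behaviour, which I have not verified on the dataset itself, is that isolating a single integer code takes two threshold splits, while a one-hot column needs only one. The sketch below shows this on a stripped-down toy problem with 50 codes and no other features; the positive code 25 is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

ids = np.arange(50)          # integer codes 0..49, one row per salesperson
y = (ids == 25).astype(int)  # only code 25 is positive (arbitrary choice)

X_int = ids.reshape(-1, 1)   # integer encoding: a single column
X_hot = np.eye(50, dtype=int)  # one-hot encoding: row i has a 1 in column i

tree_int = DecisionTreeClassifier(random_state=12).fit(X_int, y)
tree_hot = DecisionTreeClassifier(random_state=12).fit(X_hot, y)

print(tree_int.get_depth())  # 2: two thresholds are needed to fence off code 25
print(tree_hot.get_depth())  # 1: a single split on the column for code 25
```

With a depth limit such as `max_depth=3`, every level spent fencing off one code is a level that cannot be used for anything else.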

In [246]:
import pandas as pd
import numpy as np

In [247]:
# Necessary libraries from Sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [248]:
# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [249]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [250]:
data = pd.read_excel("One_Hot_Test.xlsx")

In [251]:
data.head()

      Name    A   B   C   D   E   F   G   H  Label
0   Deadra   90  56  47  72  80  64  10  24      0
1  Frances   37  66  66  47  84  70  22  33      0
2    Perry   45  64  70  38  15  91  21  35      0
3    Ronny  100  50  48  76  22   8  11  69      0
4    Terry   37  59  79  25  38  30  54  27      0


In [252]:
data.shape

(1000, 10)

# Label Encoder (not one-hot encoding!)

In [253]:
from sklearn import preprocessing

In [254]:
le = preprocessing.LabelEncoder()

In [255]:
le.fit(data['Name'])

LabelEncoder()

In [256]:
Name_ = le.transform(data['Name'])

In [257]:
df = pd.DataFrame(Name_)

In [258]:
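# Set the column label itself. `df.columns.names` only names the column index and
# leaves the column labelled 0, which recent scikit-learn versions reject next to string labels.
# The label differs from 'Name' so that dropping 'Name' below keeps the encoded column.
df.columns = ['Name_encoded']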

In [259]:
X = pd.concat([data, df], axis=1)

In [260]:
X.drop(['Name', 'Label'], axis=1, inplace=True)

In [261]:
y = data['Label']

In [262]:
validation_size = 0.2
seed = 12
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed)

In [263]:
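# KFold was never imported above, so import it here
from sklearn.model_selection import KFold

num_fold = 10
seed = 12
# Recent scikit-learn versions raise an error if random_state is set without shuffle=True
kfold = KFold(n_splits=num_fold, shuffle=True, random_state=seed)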
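
The notebook defines `kfold` and imports `cross_val_score` but never uses them. A minimal sketch of how the two fit together is shown below. It runs on stand-in data from `make_classification`, and the names `X_demo`, `y_demo` and `kfold_demo` are hypothetical; in the notebook, `X_train` and `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=12)

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=12)

# One accuracy score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=kfold_demo, scoring="accuracy")
print(scores.mean(), scores.std())
```

With a label as rare as the one in this dataset, a plain `KFold` can produce folds without any positive example, so `StratifiedKFold` is probably the safer choice.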

In [264]:
# LOGISTIC REGRESSION
# Arguments that only restated defaults are dropped. solver='liblinear' is what the
# placeholder solver='warn' resolved to in the scikit-learn version used for this run.
model = LogisticRegression(class_weight="balanced", solver='liblinear', random_state=12)

# DECISION TREE
# Uncomment to run the Decision Tree instead. With both assignments active, the tree
# would silently overwrite the Logistic Regression model defined above.
# model = DecisionTreeClassifier(criterion='gini', max_depth=3,
#                                class_weight="balanced", random_state=12)

In [265]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=12,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

In [266]:
# Predict labels for the held-out test data (already integer class labels, no rounding needed)
y_pred = model.predict(X_test)

In [267]:
# classification_report is already imported at the top; reuse y_pred from the previous cell
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       196
           1       0.67      1.00      0.80         4

   micro avg       0.99      0.99      0.99       200
   macro avg       0.83      0.99      0.90       200
weighted avg       0.99      0.99      0.99       200
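
The report can be translated back into raw counts. With 4 positives and a recall of 1.00, all 4 were found, and a precision of 0.67 then means 4 of 6 positive predictions were right, i.e. 2 false positives. The sketch below rebuilds label vectors with exactly those counts (the ordering of the rows is made up) and checks that they reproduce the reported numbers.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Reconstructed from the report: 196 negatives and 4 positives in the test set
y_true = np.array([0] * 196 + [1] * 4)

# Assumed predictions: all 4 positives found, plus 2 negatives flagged as positive
y_hat = np.array([1] * 2 + [0] * 194 + [1] * 4)

# Rows are the true class, columns the predicted class
print(confusion_matrix(y_true, y_hat))
print(round(precision_score(y_true, y_hat), 2))  # 0.67
print(recall_score(y_true, y_hat))               # 1.0
```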



# One-Hot Encoding

In [268]:
X2 = pd.concat([data.drop('Name', axis=1), pd.get_dummies(data['Name'])], axis=1)

In [269]:
X2.drop(['Label'], axis=1, inplace=True)
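
`pd.get_dummies` on the full table works here because every name is known before the split. As a hedged alternative, scikit-learn's `OneHotEncoder` can be fitted on the training rows only and then applied to new rows, including names it has never seen. The two tiny frames below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Name": ["Deadra", "Perry", "Terry"]})
test = pd.DataFrame({"Name": ["Perry", "Ronny"]})  # "Ronny" never appears in train

# handle_unknown="ignore" turns unseen names into all-zero rows instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Name"]])

encoded_test = encoder.transform(test[["Name"]]).toarray()
print(encoder.categories_[0].tolist())  # ['Deadra', 'Perry', 'Terry']
print(encoded_test.tolist())            # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```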

In [270]:
y2 = data['Label']

In [271]:
validation_size = 0.2
seed = 12
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=validation_size, random_state=seed)

In [272]:
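from sklearn.model_selection import KFold  # needed for the KFold call below

num_fold = 10
seed = 12
# Recent scikit-learn versions raise an error if random_state is set without shuffle=True
kfold = KFold(n_splits=num_fold, shuffle=True, random_state=seed)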

In [273]:
# LOGISTIC REGRESSION
# Arguments that only restated defaults are dropped. solver='liblinear' is what the
# placeholder solver='warn' resolved to in the scikit-learn version used for this run.
model_2 = LogisticRegression(class_weight="balanced", solver='liblinear', random_state=12)

# DECISION TREE
# Uncomment to run the Decision Tree instead. With both assignments active, the tree
# would silently overwrite the Logistic Regression model defined above.
# model_2 = DecisionTreeClassifier(criterion='gini', max_depth=3,
#                                  class_weight="balanced", random_state=12)

In [274]:
model_2.fit(X2_train, y2_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=12,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

In [275]:
# Predict labels for the held-out test data (already integer class labels, no rounding needed)
y2_pred = model_2.predict(X2_test)

In [276]:
# classification_report is already imported at the top; reuse y2_pred from the previous cell
report = classification_report(y2_test, y2_pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       196
           1       0.67      1.00      0.80         4

   micro avg       0.99      0.99      0.99       200
   macro avg       0.83      0.99      0.90       200
weighted avg       0.99      0.99      0.99       200

