# [Using categorical data in machine learning with python](https://blog.myyellowroad.com/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512)

## Intro
Categorical data is very common in business datasets. For example, users are typically described by country, gender, age group etc., products are often described by product type, manufacturer, seller etc., and so on.

Categorical data is very convenient for people but very hard for most machine learning algorithms, for several reasons:

- High cardinality: categorical variables may have a very large number of levels (e.g., city or URL), where most of the levels appear in a relatively small number of instances.
- Many machine learning models (e.g., SVM) are algebraic, so their input must be numerical. To use these models, categories must first be transformed into numbers before we can apply the learning algorithm.
- While some ML packages transform categorical data to numeric automatically using some default embedding method, many other ML packages (like our beloved scikit-learn) don’t support such inputs.
- For the machine (and we are in machine learning :)), categorical data doesn’t carry the context or information that we humans can easily associate and understand. For example, when looking at a feature called “City”, we humans understand that for many business purposes New York is a similar concept to New Jersey, while New York and Tehran are very different. But for the machine, New York, New Jersey and Tehran are just three different levels (possible values) of the same feature “City”. If we don’t represent this additional contextual information, the machine cannot generalize from it.

In this post I will share some basic strategies for using categorical data that worked for us (at YellowRoad) in recent projects; in part 2, I will share some more advanced methods. I will discuss why they work and why different methods suit different algorithms in different scenarios, and share my Python implementation along the way.

We will cover different methods, from simple dummy variables to complex methods like leveraging deep learning for category embedding.
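As a small preview of the simplest of these methods, here is a minimal sketch of dummy variables on toy data (the three city names are just the example from the intro, and using `pandas.get_dummies` is my assumption for illustration, not necessarily the encoding used later in the post). It also shows the context problem in numbers: after encoding, every pair of cities is exactly the same distance apart.

```python
import numpy as np
import pandas as pd

# Toy data: three levels of the "City" feature from the intro example.
cities = pd.DataFrame({"City": ["New York", "New Jersey", "Tehran"]})

# One dummy (0/1) column per level; dtype=float keeps the result numeric.
dummies = pd.get_dummies(cities["City"], dtype=float)

# Pairwise Euclidean distances between the encoded rows.
vectors = dummies.to_numpy()
distances = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)

print(dummies)
print(distances.round(3))
```

Nothing in this representation says that New York is closer to New Jersey than to Tehran; that information has to come from somewhere else.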

## Evaluation
We will evaluate every method on a sample of 2M rows from the Avazu CTR prediction Kaggle challenge dataset, which has many categorical features. Our evaluation metric will be logarithmic loss (as in the contest). For every feature representation I will apply Logistic Regression and Random Forest algorithms to evaluate its performance. I will review 6 different embedding strategies in a series of two posts (the more advanced methods will be in a following post).
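The evaluation loop can be sketched roughly as follows. This is an illustration on synthetic data, not the actual experiment: the `device` feature, its click rates, the sample size and the model hyperparameters are all made-up assumptions so that the snippet runs without the contest files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one categorical feature, binary click label.
n = 2000
X = pd.DataFrame({"device": rng.choice(["phone", "tablet", "desktop"], size=n)})
click_rate = X["device"].map({"phone": 0.3, "tablet": 0.2, "desktop": 0.1})
y = (rng.random(n) < click_rate).astype(int)

# The feature representation under test: here, plain dummy variables.
X_encoded = pd.get_dummies(X, dtype=float)

x_train, x_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # log_loss needs predicted probabilities, not hard class labels.
    scores[name] = log_loss(y_test, model.predict_proba(x_test)[:, 1])
    print(name, round(scores[name], 4))
```

Swapping in a different representation only changes the line that builds `X_encoded`; the split, the two models and the metric stay the same, which keeps the comparison between representations fair.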

As with any ML solution, we will start by creating a simple baseline method and assessing its performance. Our baseline will predict a constant CTR for every row, equal to the mean CTR observed in the training set:

In [1]:
import numpy as np
import pandas
from sklearn.metrics import log_loss

train_file = "train.csv"
train = pandas.read_csv(train_file)

# Random ~80/20 split of the rows into training and test sets.
msk = np.random.rand(len(train)) < 0.8

# Column positions of the features; column 1 is used as the target.
features = [3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23]
x_train = train[msk].iloc[:,features]
x_test = train[~msk].iloc[:,features]
y_train = train[msk].iloc[:,1]
y_test = train[~msk].iloc[:,1]

# Baseline: predict the training-set mean CTR for every test row.
print(log_loss(y_test,np.ones(len(y_test))*y_train.mean()))

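The cell above needs the contest's `train.csv` to run. To see what the constant baseline measures without the data, here is a self-contained sketch on synthetic labels (the 17% click rate and the sample size are arbitrary assumptions, not figures from the dataset). It checks that the log loss of a constant mean prediction equals the binary entropy of that mean, and that a different constant scores worse.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)

# Synthetic click labels; the 0.17 click rate is an arbitrary assumption.
y = (rng.random(10_000) < 0.17).astype(int)
p = y.mean()

# Log loss of predicting the constant p for every row.
baseline = log_loss(y, np.full(len(y), p))

# The same quantity in closed form: the binary entropy of p.
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Any other constant prediction scores worse on these labels.
worse = log_loss(y, np.full(len(y), p + 0.05))

print(baseline, entropy, worse)
```

In other words, the baseline score depends only on how imbalanced the labels are, so any feature representation worth keeping should push the log loss below it.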