

In general, features can be numerical (e.g. price, length, width) or categorical (e.g. color, size).
Categorical features are further split into nominal and ordinal features.

Ordinal features can be sorted and ordered. For example, sizes (small, medium, large) can be ordered as 
large > medium > small. Nominal features have no inherent order. For color, for example, it doesn't make any sense 
to say that red is larger than blue.

Most machine learning algorithms require that you convert categorical features into numerical values. One solution 
is to assign each value a different number starting from zero (e.g. small → 0, medium → 1, large → 2).
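
As a minimal sketch of this integer mapping for an ordinal feature, a plain dictionary is enough. The names `size_order` and `sizes` are illustrative and not part of any library.

```python
# Hypothetical ordinal feature: the order small < medium < large is meaningful.
size_order = {"small": 0, "medium": 1, "large": 2}

sizes = ["medium", "small", "large", "small"]

# Replace each category with its integer code.
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)  # [1, 0, 2, 0]

# Because the codes follow the real order, comparisons on them stay meaningful.
print(size_order["large"] > size_order["medium"] > size_order["small"])  # True
```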

This works well for ordinal features but can cause problems with nominal features (e.g. blue → 0, white → 1, 
yellow → 2): even though colors are not ordered, the learning algorithm will assume that white is larger than
blue and yellow is larger than white, which is not correct.
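
The problem can be illustrated with a small sketch. The integer codes below are arbitrary assumptions, and the "distances" are simply what a model would see if it treated the codes as ordinary numbers.

```python
# Hypothetical integer codes for a nominal feature (colors have no order).
color_codes = {"blue": 0, "white": 1, "yellow": 2}

# A model that treats the codes as numbers sees these "distances":
dist_blue_white = abs(color_codes["blue"] - color_codes["white"])
dist_blue_yellow = abs(color_codes["blue"] - color_codes["yellow"])
print(dist_blue_white)   # 1
print(dist_blue_yellow)  # 2

# Relabeling the same colors changes the distances, so they carry no real meaning.
relabeled = {"yellow": 0, "blue": 1, "white": 2}
dist_after_relabel = abs(relabeled["blue"] - relabeled["yellow"])
print(dist_after_relabel)  # 1
```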

One way around this problem is one-hot encoding: the idea is to create a new binary feature for each unique value of 
the nominal feature.

# Color          #   Red  Blue  Green
1 Red            1   1    0      0
2 Blue           2   0    1      0
3 Green          3   0    0      1

One-Hot Encoding
One-Hot Encoding transforms each categorical feature with n possible values into n binary features, of which exactly
one is active (set to 1) for each sample.
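
The definition can be sketched in plain Python. The helper `one_hot_encode` is a hypothetical name introduced here for illustration, and sorting the categories is an assumption made only to keep the column order deterministic.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row contains exactly one 1."""
    # Sort the unique values so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # activate the column for this value
        rows.append(row)
    return categories, rows

categories, rows = one_hot_encode(["Red", "Blue", "Green"])
print(categories)  # ['Blue', 'Green', 'Red']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Each row sums to one, which is the "only one active" property described above.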



In [6]:
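from sklearn.feature_extraction import DictVectorizer

# Each instance is a dict mapping a feature name to its categorical value.
instances = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]

# DictVectorizer one-hot encodes string-valued features; columns follow the sorted feature names.
onehot_encoder = DictVectorizer()
onehot_encoder.fit_transform(instances).toarray()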


array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.]])

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['UNC played Duke in basketball',
        'Duke lost the basketball game','I ate a sandwich']
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

# The vocabulary maps each lowercased word to a column index, assigned in alphabetical order.
# 'unc' maps to index 9, and the first document contains it, so the last element of its vector is 1.
# 'game' maps to index 3; the first document does not contain it, so the fourth element of its vector is 0.

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'in': 4, 'unc': 9, 'played': 6, 'sandwich': 7, 'basketball': 1, 'lost': 5, 'duke': 2, 'game': 3, 'ate': 0, 'the': 8}
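
The counts above can be reproduced with a rough plain-Python sketch of the bag-of-words idea. It assumes lowercasing and tokens of two or more word characters, which matches the output shown here; `tokenize`, `terms`, and `vectors` are illustrative names, not scikit-learn API.

```python
import re
from collections import Counter

corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game', 'I ate a sandwich']

def tokenize(text):
    # Lowercase, then keep tokens with at least two word characters,
    # so single-letter words such as "I" and "a" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, indexed in alphabetical order.
terms = sorted({token for doc in corpus for token in tokenize(doc)})
vocabulary = {term: i for i, term in enumerate(terms)}

# One count vector per document, one column per vocabulary term.
vectors = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    vectors.append([counts[term] for term in terms])

for vector in vectors:
    print(vector)
print(vocabulary["unc"], vocabulary["game"])  # 9 3
```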


In [9]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train = ["paris", "paris", "tokyo", "amsterdam", "amsterdam"]
test = ["tokyo", "tokyo", "paris"]
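# Equivalent one-liner for the test labels: le.fit(train).transform(test)
# Fit on the training labels, then reuse the same mapping for the test labels.
print(le.fit_transform(train))
le.transform(test)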

[1 1 2 0 0]


array([2, 2, 1])
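
The mapping behind these numbers can be sketched in plain Python, assuming the classes are indexed in sorted order as the output above suggests. The names `classes` and `class_to_index` are illustrative only.

```python
train = ["paris", "paris", "tokyo", "amsterdam", "amsterdam"]
test = ["tokyo", "tokyo", "paris"]

# "Fit": collect the distinct training labels in sorted order.
classes = sorted(set(train))
class_to_index = {label: i for i, label in enumerate(classes)}

# "Transform": look up each label's integer code.
encoded_train = [class_to_index[label] for label in train]
encoded_test = [class_to_index[label] for label in test]
print(classes)        # ['amsterdam', 'paris', 'tokyo']
print(encoded_train)  # [1, 1, 2, 0, 0]
print(encoded_test)   # [2, 2, 1]

# A label that never appeared in the training data has no code in this sketch.
print("london" in class_to_index)  # False
```

Because the codes are learned from the training labels only, the test labels must be drawn from the same set of classes.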