<H2>Feature extraction is an important topic in machine learning</H2>

How features are extracted from real-world data.

Example: extracting numerical features from textual data.

In scikit-learn, data are expected as a two-dimensional array of shape n_samples x n_features.
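To make the expected layout concrete, here is a minimal sketch with made-up numbers (the values and variable names are assumptions for illustration only): each row is one sample and each column is one feature. In practice scikit-learn usually receives a NumPy array or a sparse matrix of that shape; plain lists are used here only to keep the example dependency-free.

```python
# Hypothetical toy data: 3 samples (rows), 2 features (columns).
X = [
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
]

n_samples = len(X)
n_features = len(X[0])

# Every sample must provide the same number of features.
is_rectangular = all(len(row) == n_features for row in X)

print('shape:', (n_samples, n_features), 'rectangular:', is_rectangular)
```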

Numerical features != categorical (nominal) features.

Numerical features can be ordered and compared, whereas nominal ones cannot in any meaningful way. For the Iris dataset, petal length is numerical, while a color feature would be nominal. The algorithms assume that features are numerical, so categorical features need a workaround to keep them from being treated as ordered numbers: one-hot encoding. Each possible color (purple, blue, red) gets its own column, with value 1.0 if the sample has exactly this color and 0 otherwise. (NB: this leads to a sparse matrix.)
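The one-hot idea can be sketched by hand in a few lines. This is only an illustration, not how scikit-learn or pandas implement it; the helper name `one_hot` and the sample colors are assumptions.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal feature with three possible values.
colors = ['purple', 'blue', 'red', 'blue']
categories, encoded = one_hot(colors)

print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1.0, so most entries are zero, which is why such data is usually stored as a sparse matrix.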

DictVectorizer encodes categorical (string-valued) features as one-hot columns and passes numerical features through unchanged.


<h3>DictVectorizer</h3>


In [2]:
measurements = [
    {'city': 'Dubaï', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.}
]

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
print('vectorizer parameters :', vec)

print(vec.fit_transform(measurements).toarray())

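# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('labels:', vec.get_feature_names_out().tolist())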



vectorizer parameters : DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=True)
[[  1.   0.   0.  33.]
 [  0.   1.   0.  12.]
 [  0.   0.   1.  18.]]
labels: ['city=Dubaï', 'city=London', 'city=San Francisco', 'temperature']
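To see where the columns above come from, the following is a simplified pure-Python sketch of the same idea. It is not scikit-learn's implementation, and the helper name `vectorize` is an assumption: string values become `key=value` indicator columns, numeric values keep their own column, and column names are sorted.

```python
def vectorize(records):
    """Simplified sketch: strings become 'key=value' indicators, numbers are kept."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0   # one-hot indicator
            else:
                row[index[key]] = float(value)       # numeric value kept as-is
        rows.append(row)
    return names, rows


measurements = [
    {'city': 'Dubaï', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

names, rows = vectorize(measurements)
print(names)
for row in rows:
    print(row)
```

Unlike this sketch, DictVectorizer returns a sparse matrix by default, which is why the cell above calls `.toarray()` before printing.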


<h3>Derive features</h3>


In [1]:
import os
import pandas as pd

#titanic = pd.read_csv(os.path.join('datasets', 'titanic3.csv'))
titanic = pd.read_csv('files/titanic3.csv')
print(titanic.columns)



Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')


We don't want to use boat and body when classifying survived/not survived, because they already contain this information: a lifeboat is largely recorded for survivors, and a body identification number only for victims.


As domain experts, we must know and understand whether integer-coded values are numerical or categorical. Here the passenger class (pclass) is a typical categorical feature and needs to be treated as such: its values 1, 2 and 3 will be translated into three different columns.


In [3]:
labels = titanic.survived.values
features =  titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

features.head()

# 'sex' and 'embarked' hold strings, so get_dummies one-hot encodes them automatically
pd.get_dummies(features).head()

# pclass is already numeric, but it is also categorical, so we must name it explicitly for the encoder
features_dummies = pd.get_dummies(features, columns=['pclass', 'sex', 'embarked'])
print(features_dummies.head(n=16))

data = features_dummies.values

print(data)
print('data shape ', data.shape)

      age  sibsp  parch      fare  pclass_1  pclass_2  pclass_3  sex_female  \
0   29.00      0      0  211.3375       1.0       0.0       0.0         1.0   
1    0.92      1      2  151.5500       1.0       0.0       0.0         0.0   
2    2.00      1      2  151.5500       1.0       0.0       0.0         1.0   
3   30.00      1      2  151.5500       1.0       0.0       0.0         0.0   
4   25.00      1      2  151.5500       1.0       0.0       0.0         1.0   
5   48.00      0      0   26.5500       1.0       0.0       0.0         0.0   
6   63.00      1      0   77.9583       1.0       0.0       0.0         1.0   
7   39.00      0      0    0.0000       1.0       0.0       0.0         0.0   
8   53.00      2      0   51.4792       1.0       0.0       0.0         1.0   
9   71.00      0      0   49.5042       1.0       0.0       0.0         0.0   
10  47.00      1      0  227.5250       1.0       0.0       0.0         0.0   
11  18.00      1      0  227.5250       1.0       0.

If we have missing values, we can simply drop the affected column or, if we want to keep it, assign the column's mean value to the unknown entries.
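Mean imputation can be sketched by hand as follows. The helper names and the toy numbers are assumptions for illustration. The sketch also assumes every column has at least one known value. The means are computed on the training rows only and then reused for the test rows, so no information from the test set leaks into the preprocessing.

```python
import math


def column_means(rows):
    """Mean of each column, ignoring NaN entries."""
    means = []
    for j in range(len(rows[0])):
        known = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(known) / len(known))
    return means


def impute(rows, means):
    """Replace every NaN with the mean of its column."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]


nan = float('nan')
train = [[29.0, 211.0], [nan, 151.0], [31.0, nan]]   # hypothetical (age, fare) rows
test = [[nan, 26.0]]

means = column_means(train)           # learned from the training rows only
train_filled = impute(train, means)
test_filled = impute(test, means)     # test rows reuse the training means

print(means)
print(train_filled)
print(test_filled)
```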

In [4]:
import numpy as np
np.isnan(data).any()


True

In [5]:
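# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
# Imputer was removed in scikit-learn 0.22; SimpleImputer (mean strategy by default) replaces it
from sklearn.impute import SimpleImputer as Imputer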

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 0)

imp = Imputer()
imp.fit(train_data)
train_data_finite = imp.transform(train_data)
test_data_finite = imp.transform(test_data)


from sklearn.dummy import DummyClassifier

# Baseline: always predict the most frequent class seen during training
clf = DummyClassifier(strategy='most_frequent')
clf.fit(train_data_finite, train_labels)
print('Prediction accuracy: ', clf.score(test_data_finite, test_labels))


Prediction accuracy:  0.634146341463
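The baseline score is the number any real model has to beat. A minimal sketch of what a most-frequent baseline computes is shown below; the function name and the toy labels are assumptions, and this is not scikit-learn's code. It picks the majority class of the training labels and scores the fraction of test labels equal to that class.

```python
from collections import Counter


def most_frequent_baseline(train_labels, test_labels):
    """Predict the majority training class for every test sample."""
    majority = Counter(train_labels).most_common(1)[0][0]
    accuracy = sum(1 for y in test_labels if y == majority) / len(test_labels)
    return majority, accuracy


# Hypothetical labels: 0 = did not survive, 1 = survived.
toy_train = [0, 0, 0, 1, 1]
toy_test = [0, 1, 0, 0]

majority, accuracy = most_frequent_baseline(toy_train, toy_test)
print('majority class:', majority, 'accuracy:', accuracy)
```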


Now let's use better classifiers, such as logistic regression and a random forest.

In [11]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(train_data_finite, train_labels)
print('logistic regression score : ', lr.score(test_data_finite, test_labels))

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, random_state = 0).fit(train_data_finite, train_labels)
print('RandomForest classifier score : ', rf.score(test_data_finite, test_labels))


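# Same experiment without 'embarked' and 'parch'.
# Note: no columns=... here, so only 'sex' is one-hot encoded and pclass stays a single integer column
features_dummies_sub = pd.get_dummies(features[['pclass', 'sex', 'age', 'sibsp', 'fare']])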
data_sub = features_dummies_sub.values

train_data_sub, test_data_sub, train_labels, test_labels = train_test_split(data_sub, labels, random_state = 0)

imp = Imputer()
imp.fit(train_data_sub)
train_data_finite_sub = imp.transform(train_data_sub)
test_data_finite_sub = imp.transform(test_data_sub)

lr = LogisticRegression().fit(train_data_finite_sub, train_labels)
print('logistic regression score w/o embark&parch: ', lr.score(test_data_finite_sub, test_labels))

rf = RandomForestClassifier(n_estimators=500, random_state = 0).fit(train_data_finite_sub, train_labels)
print('RandomForest classifier score w/o embark&parch: ', rf.score(test_data_finite_sub, test_labels))



logistic regression score :  0.792682926829
RandomForest classifier score :  0.77743902439
logistic regression score w/o embark&parch:  0.789634146341
RandomForest classifier score w/o embark&parch:  0.80487804878
