## Feature Extraction  

For categorical data such as color (e.g. red, blue, purple, ...), it is best to use one-hot encoding, which creates one new feature per category that is 1 for samples in that category and 0 otherwise. This avoids the artificial ordering introduced by assigning each color a number, since a model would treat those numerical values as meaningful magnitudes.  
The DictVectorizer can also encode categorical features in this way:

In [1]:
data = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

In [2]:
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
vec

DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
               sparse=True)

In [3]:
vec.fit_transform(data).toarray()

array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

In [4]:
# Note: newer scikit-learn releases replace this method with get_feature_names_out()
vec.get_feature_names()

['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
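
To tie this back to the color example, here is a minimal sketch contrasting integer codes with one-hot encoding. The `colors` list and the `int_codes` mapping are made up for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: one categorical feature per sample
colors = [{'color': 'red'}, {'color': 'blue'}, {'color': 'purple'}, {'color': 'red'}]

# Integer encoding (arbitrary, hypothetical mapping): implies blue < purple < red,
# an ordering that has no meaning for colors
int_codes = {'blue': 0, 'purple': 1, 'red': 2}
as_ints = [int_codes[d['color']] for d in colors]

# One-hot encoding: one column per color, no implied ordering
vec = DictVectorizer(sparse=False)
one_hot = vec.fit_transform(colors)

print(as_ints)             # integer codes, one number per sample
print(vec.feature_names_)  # column names, sorted alphabetically
print(one_hot)             # one row per sample, a single 1 in each row
```

With integer codes, a linear model would treat red as "twice" purple. With one-hot columns, each color gets its own independent weight.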

## Titanic

In [7]:
import pandas as pd

titanic = pd.read_csv('scipy-2018-sklearn-master/notebooks/datasets/titanic3.csv')
print(titanic.columns)

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')


In [8]:
titanic.head(10)

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1,0,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,,,"Belfast, NI"
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


In [9]:
labels = titanic.survived.values
features = titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

In [11]:
features.head()

,pclass,sex,age,sibsp,parch,fare,embarked
0,1,female,29.0,0,0,211.3375,S
1,1,male,0.9167,1,2,151.55,S
2,1,female,2.0,1,2,151.55,S
3,1,male,30.0,1,2,151.55,S
4,1,female,25.0,1,2,151.55,S


In [20]:
features_dummies = pd.get_dummies(features)  # one-hot encodes the categorical columns (sex, embarked)
features_dummies.head(n=10)

,pclass,age,sibsp,parch,fare,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
0,1,29.0,0,0,211.3375,1,0,0,0,1
1,1,0.9167,1,2,151.55,0,1,0,0,1
2,1,2.0,1,2,151.55,1,0,0,0,1
3,1,30.0,1,2,151.55,0,1,0,0,1
4,1,25.0,1,2,151.55,1,0,0,0,1
5,1,48.0,0,0,26.55,0,1,0,0,1
6,1,63.0,1,0,77.9583,1,0,0,0,1
7,1,39.0,0,0,0.0,0,1,0,0,1
8,1,53.0,2,0,51.4792,1,0,0,0,1
9,1,71.0,0,0,49.5042,0,1,1,0,0


In [21]:
data = features_dummies.values

In [22]:
import numpy as np

np.isnan(data).any()


True

In [26]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # to fill in the NaNs (default strategy: column mean)

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=0)
imp = SimpleImputer()
imp.fit(train_data)
train_data_finite = imp.transform(train_data)
test_data_finite = imp.transform(test_data)

In [28]:
np.isnan(train_data_finite).any()

False

In [30]:
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='most_frequent')  # baseline: always predicts the majority class
clf.fit(train_data_finite, train_labels)
clf.score(test_data_finite, test_labels)

0.6341463414634146

In [31]:
from sklearn.linear_model import LogisticRegression

lin_clf = LogisticRegression()

In [32]:
lin_clf.fit(train_data_finite, train_labels)
lin_clf.score(test_data_finite, test_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.7926829268292683
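
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows, chaining imputation, standardization and the classifier in one pipeline. It runs on synthetic data (the arrays `X` and `y` are made up here, with deliberately mismatched feature scales and some NaNs), not on the Titanic features, so the score it prints is not comparable to the one above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three features on very different scales
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3)) * [1.0, 50.0, 500.0]
y = (X[:, 0] + X[:, 1] / 50.0 > 0).astype(int)
X[rng.rand(200, 3) < 0.1] = np.nan  # knock out roughly 10% of the entries

# Impute missing values, standardize each column, then fit the linear model.
# Fitting the pipeline on the training split only keeps test data out of the
# imputer and scaler statistics.
scaled_clf = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
scaled_clf.fit(X[:150], y[:150])
accuracy = scaled_clf.score(X[150:], y[150:])
print(accuracy)
```

On the notebook's data, the same pipeline could be fit on `train_data` and scored on `test_data` directly, since the imputer is then part of the model.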

In [35]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=500, random_state=0)

In [36]:
forest_clf.fit(train_data_finite, train_labels)
forest_clf.score(test_data_finite, test_labels)

0.7804878048780488