# Ensemble Ghouls, i.e. Majority Voting
This is the last thing I tried as part of this competition. This notebook compares a handful of different models and then implements a Majority Vote Classifier. Here's what we want to try:

- SVM
- KNN
- RandomForest
- Logistic Regression (labeled OLS in this notebook)

It is important to note that the MajorityVoteClassifier used here is from Python Machine Learning and is available as a built-in in sklearn as VotingClassifier. This notebook is my first attempt at this kind of classifier (outside of a RandomForest). **Although it is not implemented, this would have been a good place to try out the bias / variance method of testing our Classifier**.
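
For comparison, a minimal sketch of the built-in route with `sklearn.ensemble.VotingClassifier` might look like the following. The synthetic data from `make_classification`, the choice of estimators, and the weights are all assumptions for illustration; none of it is the competition setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the real training set (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=1)

# voting='hard' counts predicted labels (like vote='classlabel' in
# MajorityVoteClassifier); voting='soft' averages predict_proba
# (like vote='probability').
voter = VotingClassifier(
    estimators=[('lrc', LogisticRegression(max_iter=1000)),
                ('knn', KNeighborsClassifier(n_neighbors=8)),
                ('rfc', RandomForestClassifier(n_estimators=20, random_state=1))],
    voting='hard',
    weights=[0.4, 0.3, 0.3])

demo_scores = cross_val_score(voter, X_demo, y_demo, cv=5, scoring='accuracy')
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(demo_scores.mean(), demo_scores.std()))
```

The `weights` list lines up positionally with `estimators`, the same way it does in the book's class.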

### A note on Pipelines
I also tried building out big pipelines here as a way to make all of this easier. **Ultimately I wanted to build a single pipeline that fed all my individual models into a single Majority Voting classifier at the end of the pipe, but I didn't get the implementation right. I would also have liked to use predict for the weights in the Majority Voting Classifier, but I didn't get that to work either.**
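
One way the "single pipe" idea could be sketched is a shared feature stage followed by a voting ensemble as the final step. This is not what the notebook ended up doing: `ColumnTransformer`, `OneHotEncoder` and `VotingClassifier` are swapped in for the custom classes, and the `toy` frame below is random placeholder data that only borrows the column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.svm import SVC

# Random placeholder data with the same column names (assumption).
rng = np.random.RandomState(1)
n = 120
toy = pd.DataFrame({
    'bone_length': rng.rand(n),
    'rotting_flesh': rng.rand(n),
    'hair_length': rng.rand(n),
    'has_soul': rng.rand(n),
    'color': rng.choice(['white', 'black', 'clear'], size=n),
    'type': rng.choice(['Ghoul', 'Goblin', 'Ghost'], size=n),
})
cont_cols = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']
cat_cols = ['color']

# One feature stage shared by every model in the ensemble.
features = ColumnTransformer([
    ('cont', PolynomialFeatures(degree=2), cont_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

# The voting ensemble is simply the last step of the pipe.
pipe_vote = Pipeline([
    ('features', features),
    ('vote', VotingClassifier(
        estimators=[('svm', SVC(probability=True, random_state=1)),
                    ('rfc', RandomForestClassifier(n_estimators=20, random_state=1)),
                    ('lrc', LogisticRegression(max_iter=1000))],
        voting='soft',  # averages predict_proba across the models
        weights=[0.4, 0.3, 0.3])),
])

pipe_vote.fit(toy.drop(columns=['type']), toy['type'])
toy_pred = pipe_vote.predict(toy.drop(columns=['type']))
print(toy_pred[:5])
```

The trade-off of this layout is that every model sees the same features, so per-model choices such as a different polynomial degree for each classifier are no longer possible.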

In [2]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.feature_selection import SelectKBest

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.metrics import classification_report

In [102]:
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

class ColumnExtractor(TransformerMixin,BaseEstimator):
    """takes in a dataframe, parses it by columns and returns an np array"""
    def __init__(self, columns=[]):
        self.columns = columns

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def transform(self, X, **transform_params):
        return X[self.columns]

    def fit(self, X, y=None, **fit_params):
        return self

In [103]:
class GetDummies(TransformerMixin,BaseEstimator):
    """I hate LabelEncoder and OneHotEncoder this is my workaround"""
    def __init__(self):
        pass
#         self.columns = columns
        
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
    
    def transform(self, X, **transform_params):
#         return pd.get_dummies(X[self.columns]).values #this assumed we were passing in a df, X is a np array
        return pd.get_dummies(X).values
    
    def fit(self, X, y=None, **fit_params):
        return self

In [261]:
# p.205 from Python Machine Learning
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
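try:
    from sklearn.externals import six  # removed from newer scikit-learn releases
except ImportError:
    import six  # assumes the standalone six package is installed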
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator

class MajorityVoteClassifier(BaseEstimator,ClassifierMixin):
    """ A majority vote ensemble classifier
    
    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers] 
      Different classifiers for the ensemble
    
    vote : str, {'classlabel', 'probability'}
      Default: 'classlabel'
      If 'classlabel' the prediction is based on 
      the argmax of class labels. Else if
      'probability', the argmax of the sum of
      probabilities is used to predict the class label
      (recommended for calibrated classifiers)
    
    weights : array-like, shape = [n_classifiers]
      Optional, default: None
      If a list of 'int' or 'float' values are
      provided, the classifiers are weighted by
      importance; uses uniform weights if 'weights=None'
    """
    def __init__(self, classifiers, vote='classlabel',weights=None):
        
        self.classifiers = classifiers
        self.named_classifiers = {key: value 
                                  for key, value in 
                                  _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights
        
    def fit(self, X, y):
        """Fit classifiers.
        
        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.
            
        y : array-like, shape = [n_samples]
            Vector of target class labels.
            
        Returns
        -------
        self : object
        """
        # Use LabelEncoder to ensure class labels start
        # with 0, which is important for np.argmax
        # call in self.predict
        self.labelenc_ = LabelEncoder()
        self.labelenc_.fit(y)
        self.classes_ = self.labelenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X,
                                       self.labelenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self
    
    def predict(self, X):
        """Predict class labels for X.
        
        Parameters
        ----------
        X : {array-like, sparse matrix},
            Shape = [n_samples, n_features]
            Matrix of training samples.
            
        Returns
        -------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.
            
        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X),axis=1)
            
        else: # 'classlabel' vote
            # Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                    for clf in 
                                    self.classifiers_]).T
            maj_vote = np.apply_along_axis(
                            lambda x:
                            np.argmax(np.bincount(x,weights=self.weights)),
                                                 axis=1,
                                                 arr=predictions)
        maj_vote = self.labelenc_.inverse_transform(maj_vote)
        return maj_vote
    
    def predict_proba(self, X):
        """ Predict class probabilities for X
        
        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Training vectors, where n_samples is
            the number of samples and
            n_features is the number of features.
            
        Returns
        -------
        avg_proba : array-like
            shape = [n_samples, n_classes]
            Weighted average probability for
            each class per sample.
        """
        probas = np.asarray([clf.predict_proba(X) for clf in self.classifiers_])
        
        avg_proba = np.average(probas, axis=0, weights=self.weights)

        return avg_proba
    
    def get_params(self, deep=True):
        """Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier, 
                         self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out
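
The weighted class-label vote inside `predict` can be hard to picture, so here is a small worked example of the same `np.bincount` call. The predictions and weights are made up for illustration.

```python
import numpy as np

# Rows are samples, columns are the label-encoded predictions of
# three hypothetical classifiers.
predictions = np.array([[0, 0, 1],
                        [2, 1, 1],
                        [0, 2, 2]])
weights = [0.2, 0.2, 0.6]

# For each row, sum the weights per class and take the heaviest class.
maj_vote = np.apply_along_axis(
    lambda x: np.argmax(np.bincount(x, weights=weights)),
    axis=1, arr=predictions)
print(maj_vote)  # [1 1 2]
```

In the first row two of the three classifiers vote for class 0, but their combined weight (0.4) is less than the single 0.6 vote for class 1, so the weighted vote picks class 1.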
                

In [104]:
# a quick test of the above
x = GetDummies()
x.fit_transform(train.color.values)

array([[ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])

In [105]:
# we see that our little categorical pipeline works as intended
x = ColumnExtractor(['color'])
y = GetDummies()

xx = x.fit_transform(train)
y.fit_transform(xx)

array([[ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])

In [106]:
z = SVC()
z.fit(y.fit_transform(xx), train.type)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [55]:
train.columns

Index(['id', 'bone_length', 'rotting_flesh', 'hair_length', 'has_soul',
       'color', 'type'],
      dtype='object')

In [54]:
# # a quick refresh of OneHotEncoder
# from sklearn.preprocessing import LabelEncoder
# from sklearn.preprocessing import OneHotEncoder

# # first we have to encode our categorical data
# color_le = LabelEncoder()
# X = color_le.fit_transform(train.color.values)

# # # then we encode dummy columns
# ohe = OneHotEncoder(sparse=False)
# XX =  ohe.fit_transform(X).tolist()
# XX

## SVM
This is our pilot for a fancy pipeline. We want the following:
- continuous features
    - extract continuous columns
    - make a polynomial
- categorical features
    - extract categorical columns
    - convert to binary indicators in separate columns (i.e. pd.get_dummies)
- call an estimator

In [157]:
# a fancy pipeline
"""note since the first thing we do is extract features we pass in X as a dataframe and Y as a np array when fitting"""

cont_cols = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']
cat_cols =  ['color']

pipe_svm = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures())])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
                     ('clf',SVC(random_state=1))
                    ])

# pipe_svm.fit(train, train.type.values)

In [158]:
# pipe_svm.predict(test)

### If we can train this using GridSearchCV we are basically where we want to be

In [159]:
# GridSearch to determine best parameters
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [{'features__f_cont__poly__degree':[2,3],
               'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'features__f_cont__poly__degree':[2,3],
               'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svm,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=4,
                  verbose=True)
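
As a quick sanity check on the size of this search, the number of candidates can be counted with `ParameterGrid` before fitting anything. The grid is repeated here only so the sketch runs on its own.

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [{'features__f_cont__poly__degree': [2, 3],
               'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'features__f_cont__poly__degree': [2, 3],
               'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

# Linear grid: 2 * 7 = 14; rbf grid: 2 * 7 * 7 = 98.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, n_candidates * 4)  # 112 448 (candidates, fits at cv=4)
```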

In [160]:
gs.fit(train.drop(['id'],axis=1),train.type.values)
print(gs.best_score_)
print(gs.best_params_)

# Error: AttributeError: 'ColumnExtractor' object has no attribute 'get_params'
# we can fix this with the BaseEstimator class
# http://stackoverflow.com/questions/27810855/python-sklearn-how-to-pass-parameters-to-the-customize-modeltransformer-clas

# ANOTHER, LESS OBVIOUS ERROR: "x.shape[1] = 20 is not equal to 21, the number of features at training time"
# this is the result of get_dummies: if a color isn't in a given sample, its column is dropped and there is 1 less feature
# this is clearly an issue - it also raises the question of whether or not our GetDummies is generating columns
# reliably and repeatably

Fitting 4 folds for each of 112 candidates, totalling 448 fits
0.765498652291
{'features__f_cont__poly__degree': 2, 'clf__C': 1.0, 'clf__gamma': 0.1, 'clf__kernel': 'rbf'}


[Parallel(n_jobs=1)]: Done 448 out of 448 | elapsed:   10.0s finished
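
The column-count mismatch described in the comments above can be reproduced on a tiny example. The sketch below contrasts `pd.get_dummies`, which only sees the data it is handed, with `OneHotEncoder`, which remembers the categories from `fit`. The color values are placeholders, and the encoder is a suggested alternative rather than what the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_colors = pd.DataFrame({'color': ['white', 'black', 'clear', 'white']})
fold_colors = pd.DataFrame({'color': ['white', 'white', 'black']})  # no 'clear'

# get_dummies builds columns from whatever values are present: widths differ.
print(pd.get_dummies(train_colors).shape[1],
      pd.get_dummies(fold_colors).shape[1])  # 3 2

# OneHotEncoder learns the categories once in fit: widths stay the same.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train_colors)
train_width = ohe.transform(train_colors).shape[1]
fold_width = ohe.transform(fold_colors).shape[1]
print(train_width, fold_width)  # 3 3
```

With `handle_unknown='ignore'`, a category that shows up only at predict time is encoded as all zeros instead of raising an error.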


In [161]:
scores = cross_val_score(estimator=gs.best_estimator_, X=train, y=train.type.values, cv=4, scoring='accuracy')
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(np.mean(scores),np.std(scores)))

Accuracy: 0.765 +/- 0.015


## KNN

In [150]:
cont_cols = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']
cat_cols =  ['color']

pipe_knn = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures())])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
                     ('clf',KNeighborsClassifier())
                    ])

In [154]:
# GridSearch to determine best parameters
param_range = [2,3,4,5,6,7,8,9]
param_grid = [{'features__f_cont__poly__degree':[2,3],
               'clf__n_neighbors': param_range,
               'clf__weights': ['uniform','distance']}]

gs = GridSearchCV(estimator=pipe_knn,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=4,
                  verbose=True)

In [155]:
gs.fit(train.drop(['id'],axis=1),train.type.values)
print(gs.best_score_)
print(gs.best_params_)

Fitting 4 folds for each of 32 candidates, totalling 128 fits
0.746630727763
{'clf__weights': 'uniform', 'features__f_cont__poly__degree': 3, 'clf__n_neighbors': 8}


[Parallel(n_jobs=1)]: Done 128 out of 128 | elapsed:    2.3s finished


In [156]:
scores = cross_val_score(estimator=gs.best_estimator_, X=train, y=train.type.values, cv=4, scoring='accuracy')
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(np.mean(scores),np.std(scores)))

Accuracy: 0.746 +/- 0.029


## RandomForest

In [169]:
cont_cols = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']
cat_cols =  ['color']

pipe_rfc = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures())])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
                     ('clf',RandomForestClassifier(random_state=1))
                    ])

In [170]:
# GridSearch to determine best parameters

param_grid = [{'features__f_cont__poly__degree':[2,3],
               'clf__n_estimators': [5,10,15,20,25],
               'clf__criterion': ['gini','entropy'],
               'clf__min_samples_split': [2,4,6,8,10],
               'clf__max_features': [2,4,6,8,10,'log2','sqrt']}]

gs = GridSearchCV(estimator=pipe_rfc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=4,
                  verbose=True)

In [171]:
# scores without the color features
gs.fit(train.drop(['id'],axis=1),train.type.values)
print(gs.best_score_)
print(gs.best_params_)

Fitting 4 folds for each of 700 candidates, totalling 2800 fits
0.743935309973
{'clf__min_samples_split': 10, 'clf__criterion': 'entropy', 'features__f_cont__poly__degree': 3, 'clf__max_features': 4, 'clf__n_estimators': 20}


[Parallel(n_jobs=1)]: Done 2800 out of 2800 | elapsed:  2.8min finished


In [172]:
scores = cross_val_score(estimator=gs.best_estimator_, X=train, y=train.type.values, cv=4, scoring='accuracy')
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(np.mean(scores),np.std(scores)))

Accuracy: 0.744 +/- 0.013


In [167]:
# scores with the color features
gs.fit(train.drop(['id'],axis=1),train.type.values)
print(gs.best_score_)
print(gs.best_params_)

Fitting 4 folds for each of 700 candidates, totalling 2800 fits
0.746630727763
{'clf__min_samples_split': 6, 'clf__criterion': 'gini', 'features__f_cont__poly__degree': 3, 'clf__max_features': 2, 'clf__n_estimators': 20}


[Parallel(n_jobs=1)]: Done 2800 out of 2800 | elapsed:  3.1min finished


In [168]:
scores = cross_val_score(estimator=gs.best_estimator_, X=train, y=train.type.values, cv=4, scoring='accuracy')
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(np.mean(scores),np.std(scores)))

Accuracy: 0.747 +/- 0.010


## OLS (Logistic Regression)

In [180]:
cont_cols = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']
cat_cols =  ['color']

pipe_lrc = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures())])),
                          ('f_cat',Pipeline([
                                ('extract',ColumnExtractor(cat_cols)),
                                ('dummies',GetDummies())]))
                           ])),
                     ('clf',LogisticRegression(random_state=1))
                    ])

In [187]:
# GridSearch to determine best parameters

param_grid = [{'features__f_cont__poly__degree':[2,3],
               'clf__penalty': ['l2'],
               'clf__C': [0.001,0.01,0.1,1.0,10.0,100.0,1000.0],
               'clf__class_weight': [None,'balanced'],
               'clf__solver': ['liblinear','newton-cg','lbfgs']}]

gs = GridSearchCV(estimator=pipe_lrc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=4,
                  verbose=True)

In [188]:
gs.fit(train.drop(['id'],axis=1),train.type.values)
print(gs.best_score_)
print(gs.best_params_)

Fitting 4 folds for each of 84 candidates, totalling 336 fits
0.749326145553
{'features__f_cont__poly__degree': 3, 'clf__solver': 'newton-cg', 'clf__penalty': 'l2', 'clf__C': 100.0, 'clf__class_weight': None}


[Parallel(n_jobs=1)]: Done 336 out of 336 | elapsed:   18.2s finished


In [189]:
scores = cross_val_score(estimator=gs.best_estimator_, X=train, y=train.type.values, cv=4, scoring='accuracy')
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(np.mean(scores),np.std(scores)))

Accuracy: 0.749 +/- 0.020


## Majority Vote Classifier
We are going to implement a simple majority vote classifier - straight from the pages of Python Machine Learning

In [None]:
cont_cols = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']
cat_cols =  ['color']

In [197]:
# we will implement a series of classifiers
# {'features__f_cont__poly__degree': 2, 'clf__C': 1.0, 'clf__gamma': 0.1, 'clf__kernel': 'rbf'}
pipe_svm = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures(degree=2))])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
                     ('clf',SVC(C=1.0,gamma=0.1,kernel='rbf',random_state=1))
                    ])
scores = cross_val_score(pipe_svm,train,train.type.values,cv=10)
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(scores.mean(),scores.std()))

Accuracy: 0.749 +/- 0.063


In [198]:
# {'clf__weights': 'uniform', 'features__f_cont__poly__degree': 3, 'clf__n_neighbors': 8}
pipe_knn = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures(degree=3))])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
                     ('clf',KNeighborsClassifier(weights='uniform',
                                                 n_neighbors=8))
                    ])
scores = cross_val_score(pipe_knn,train,train.type.values,cv=10)
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(scores.mean(),scores.std()))

Accuracy: 0.719 +/- 0.045


In [199]:
# {'clf__min_samples_split': 10, 'clf__criterion': 'entropy', 
#  'features__f_cont__poly__degree': 3, 'clf__max_features': 4, 'clf__n_estimators': 20}
pipe_rfc = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures(degree=3))])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
                     ('clf',RandomForestClassifier(min_samples_split=10,
                                                   criterion='entropy',
                                                   max_features=4,
                                                   n_estimators=20,
                                                   random_state=1))
                    ])
scores = cross_val_score(pipe_rfc,train,train.type.values,cv=10)
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(scores.mean(),scores.std()))

Accuracy: 0.711 +/- 0.054


In [213]:
# {'features__f_cont__poly__degree': 3, 'clf__solver': 'newton-cg', 
#  'clf__penalty': 'l2', 'clf__C': 100.0, 'clf__class_weight': None}
pipe_lrc = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures(degree=3))])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
                     ('clf',LogisticRegression(solver = 'newton-cg',
                                               class_weight = None,
                                               penalty = 'l2',
                                               C = 100.0,
                                               random_state=1))
                    ])

scores = cross_val_score(pipe_lrc,train, train.type.values, cv=10)
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(scores.mean(),scores.std()))

Accuracy: 0.735 +/- 0.053


#### The Majority Vote Classifier

In [279]:
mv_clf = MajorityVoteClassifier(classifiers=[pipe_svm,pipe_knn,pipe_rfc,pipe_lrc],
                               weights=[.4,.2,.1,.3],
                               vote='classlabel')

clf_labels = ['SVM','KNN','RFC','LRC','Majority Voting']
all_clf = [pipe_svm,pipe_knn,pipe_rfc,pipe_lrc,mv_clf]
for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                            X=train.drop(['id'],axis=1),
                            y=train.type.values,
                            cv=5,
                            scoring='accuracy')
    print('Accuracy: {0:.3f} +/- {1:.3f} {2}'.format(scores.mean(),scores.std(),label))

Accuracy: 0.757 +/- 0.032 SVM
Accuracy: 0.720 +/- 0.062 KNN
Accuracy: 0.690 +/- 0.041 RFC
Accuracy: 0.747 +/- 0.031 LRC
Accuracy: 0.752 +/- 0.035 Majority Voting


In [280]:
scores = cross_val_score(mv_clf,train.drop(['id'],axis=1),train.type.values,cv=4)
print('Accuracy: {0:.3f} +/- {1:.3f}'.format(scores.mean(),scores.std()))

Accuracy: 0.757 +/- 0.009


In [281]:
# let's see how this does predicting our Ghosts
mv_clf.fit(X=train,y=train.type.values)

mv_pred = mv_clf.predict(test)
mv_pred[:10]

array(['Ghoul', 'Goblin', 'Ghoul', 'Ghost', 'Ghost', 'Ghost', 'Ghoul',
       'Ghoul', 'Goblin', 'Ghoul'], dtype=object)

In [257]:
submission = pd.DataFrame({'id':test.id,'type':mv_pred})
submission.head()

   id    type
0   3   Ghoul
1   6  Goblin
2   9   Ghoul
3  10   Ghost
4  13   Ghost


In [258]:
# submission.to_csv('ghouls_mvc.csv',index=False)
# .737157 not an improvement

## A Big Pipe
Because fuck it - why not (apparently I just need to say Fuck it more because those are my best 2 submissions)

In [271]:
class ModelTransformer(TransformerMixin,BaseEstimator):

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict(X))
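
Feeding model predictions forward as features for a final classifier is usually called stacking, and scikit-learn ships a `StackingClassifier` for it. The following is a rough sketch of that swapped-in approach on synthetic data; it is not the `ModelTransformer` route above, and the data and final estimator are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the real features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=1)

# Base models produce out-of-fold predictions (cv=4), and the final
# estimator is trained on those predictions.
stack = StackingClassifier(
    estimators=[('svm', SVC(C=1.0, gamma=0.1, kernel='rbf', random_state=1)),
                ('knn', KNeighborsClassifier(n_neighbors=8)),
                ('rfc', RandomForestClassifier(n_estimators=20, random_state=1))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=4)

stack.fit(X_demo, y_demo)
stack_pred = stack.predict(X_demo)
print(stack_pred[:10])
```

The internal cross-validation matters: a transformer that fits each model and then predicts on the same rows hands the final classifier overly optimistic inputs.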

In [309]:
pipe_big = Pipeline([('features',FeatureUnion([
                          ('f_cont',Pipeline([
                                ('extract',ColumnExtractor(cont_cols)),
                                ('poly',PolynomialFeatures(degree=3)),
                                ('kpca',KernelPCA())#,
#                                 ('kbest',SelectKBest(10))
                                            ])),
#                           ('f_cat',Pipeline([
#                                 ('extract',ColumnExtractor(cat_cols)),
#                                 ('dummies',GetDummies())]))
                           ])),
#                      ('estimators',FeatureUnion([
#                             ('svm',ModelTransformer(SVC(C=1.0,gamma=0.1,kernel='rbf',random_state=1))),
#                             ('knn',ModelTransformer(KNeighborsClassifier(weights='uniform',
#                                                            n_neighbors=8))),
#                             ('rfc',ModelTransformer(RandomForestClassifier(min_samples_split=10,
#                                                            criterion='entropy',
#                                                            max_features=4,
#                                                            n_estimators=20,
#                                                            random_state=1))),
#                             ('lrc',ModelTransformer(LogisticRegression(solver = 'newton-cg',
#                                                            class_weight = None,
#                                                            penalty = 'l2',
#                                                            C = 100.0,
#                                                            random_state=1)))
#                             ])),
                     ('clf',SVC())#C=1.0,gamma=0.1,kernel='rbf',random_state=1))
                    ])

In [326]:
KernelPCA?

In [335]:
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [{'features__f_cont__kpca__kernel': ["linear","poly","rbf"],
               'features__f_cont__kpca__degree': [2,3,4],
               'features__f_cont__poly__degree':[2,3],
               'clf__C': param_range,
               'clf__kernel': ['linear','rbf']}]
# ,
#               {'features__f_cont__poly__degree':[2,3],
#                'clf__C': param_range,
#                'clf__gamma': param_range,
#                'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_big,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=4,
                  verbose=True)

In [336]:
gs.fit(train.drop(['id'],axis=1),train.type.values)
print(gs.best_score_)
print(gs.best_params_)

Fitting 4 folds for each of 252 candidates, totalling 1008 fits
0.765498652291
{'features__f_cont__poly__degree': 2, 'features__f_cont__kpca__kernel': 'rbf', 'clf__C': 10.0, 'features__f_cont__kpca__degree': 2, 'clf__kernel': 'linear'}


[Parallel(n_jobs=1)]: Done 1008 out of 1008 | elapsed:  2.1min finished


In [312]:
scores = cross_val_score(gs.best_estimator_, train, train.type.values,cv=4)
print(scores.mean(), scores.std())

0.765303203661 0.0146433813717


In [337]:
pred = gs.best_estimator_.predict(test)

In [338]:
submission3 = pd.DataFrame({'id':test.id.values,'type':pred})

In [339]:
(submission2.type == submission3.type).value_counts()

True     507
False     22
Name: type, dtype: int64

In [325]:
# this is essentially the same as submission3 except with KPCA set to default values
submission2.to_csv('ghouls_big_pipe.csv',index=False)

In [340]:
# this is the big pipe with KPCA as part of the grid search
submission3.to_csv('ghouls_big_pipe02.csv',index=False)

# although 22 predictions differ between the two submissions, both received the same score: 0.74291