# Custom Classes: Estimators

We'll be introducing some new tools to implement what we did last session. Custom classes (regressors, classifiers, clusterers, transformers, feature unions, and pipelines) can be powerful additions to your tool belt.

This introduction is modeled after Adam Rogers's titanic_finished-ish.py script we worked through last time.

We start by pulling in the datasets and importing our libraries. Data available at https://www.kaggle.com/c/titanic/data.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import sklearn as sk
import pandas as pd
import numpy as np
from scikitDemoHelpers import genericLevelsToDummiesTransformer

In [3]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

# combining early to apply transformations uniformly
combinedSet = pd.concat([train , test], axis = 0)
combinedSet = combinedSet.reset_index(drop = True)

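As a reminder, this set includes the following variables. In the CSV files the column names are capitalized (e.g. `Pclass`, `Survived`), which is how the code below refers to them: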

| Variable      | Description  |  Values  |
| ------------- |:-------------:| -----:|
| survived      | Survival | (0 = No; 1 = Yes) |
| pclass     | Passenger Class     |   (1 = 1st; 2 = 2nd; 3 = 3rd) |
| name  | Name     |    String |
| sex | Sex      |    ('male' or 'female') |
| age | Age     |    Float 0-80  |
| sibsp | Number of Siblings/Spouses Aboard      |    Int |
| parch | Number of Parents/Children Aboard      |    Int |
| ticket | Ticket Number      |    String  |
| fare | Passenger Fare      |    Float |
| cabin| Cabin     |    String (e.g. C134) |
| embarked| Port of Embarkation      |    ('C' = Cherbourg; 'Q' = Queenstown; 'S' = Southampton) |


<a id='estBegin'></a>

## Estimators
We are familiar with estimators like LogisticRegression, NearestNeighbors, and DecisionTreeClassifier.

We instantiate an estimator, then call `fit`, `predict`, and possibly `score`.

In [4]:
from sklearn import tree
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix

In [5]:
combinedSet.loc[1:10,['Pclass', 'Age', 'Fare', 'Survived']]

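|   | Pclass | Age | Fare | Survived |
| ------------- |:-------------:|:-------------:|:-------------:| -----:|
| 1 | 1 | 38.0 | 71.2833 | 1.0 |
| 2 | 3 | 26.0 | 7.925 | 1.0 |
| 3 | 1 | 35.0 | 53.1 | 1.0 |
| 4 | 3 | 35.0 | 8.05 | 0.0 |
| 5 | 3 | NaN | 8.4583 | 0.0 |
| 6 | 1 | 54.0 | 51.8625 | 0.0 |
| 7 | 3 | 2.0 | 21.075 | 0.0 |
| 8 | 3 | 27.0 | 11.1333 | 1.0 |
| 9 | 2 | 14.0 | 30.0708 | 1.0 |
| 10 | 3 | 4.0 | 16.7 | 1.0 |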


In [12]:
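# note: combinedSet includes the Kaggle test rows, which have no 'Survived'
# label; fillna(0) below therefore marks all of them as non-survivors
X_tree = combinedSet.loc[:,['Age', 'Fare']]\
        .fillna(0)
y_tree = combinedSet['Survived']\
        .fillna(0)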

In [16]:
X_tree_train, X_tree_test, y_tree_train, y_tree_test = \
    cross_validation.train_test_split(X_tree, y_tree, random_state=13)

In [17]:
treeClf= tree.DecisionTreeClassifier()
treeClf.fit(X_tree_train, y_tree_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [18]:
tree_predictions = treeClf.predict(X_tree_test)

In [20]:
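print 'Mean Accuracy Score: ', treeClf.score(X_tree_test, y_tree_test)
# predictions are passed first here, so rows are predicted labels and
# columns are true labels (scikit-learn's usual order is y_true, y_pred)
print 'Confusion Matrix: \n', \
pd.DataFrame(confusion_matrix(tree_predictions, y_tree_test))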

Mean Accuracy Score:  0.682926829268
Confusion Matrix: 
     0   1
0  200  65
1   39  24


Let's create a custom estimator based on the majority survival outcome within each passenger class: e.g. if most of the people in 1st class survived, predict that any test observation from 1st class survived.
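The majority-vote idea can be sketched on a handful of made-up rows before wrapping it in a class. The `toy` frame and its values below are invented for illustration (they are not Titanic data); only the `Pclass` and `Survived` column names come from the dataset.

```python
import pandas as pd

# hypothetical toy data, not taken from the Titanic files
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# mean survival per class, rounded to the majority outcome (0 or 1)
majority = toy.groupby('Pclass')['Survived'].mean().round().astype(int)

# "predict" for new passengers by looking up their class
new_classes = pd.Series([3, 1, 2, 1])
predicted = new_classes.map(majority).tolist()
```

Here `majority` holds one 0/1 value per class, and `predicted` is simply that value looked up for each new passenger. A class whose survival rate is exactly 0.5 would need an explicit tie-breaking rule, which this sketch does not address.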



You will want to include the fit, predict, and score methods:
``` python
from sklearn import base

class Estimator(base.BaseEstimator, base.ClassifierMixin):
    def __init__(self):
        # initialization code (store any hyperparameters here)
        pass

    def fit(self, X, y):
        # fit the model ...
        return self

    def predict(self, X):
        # return one prediction per row of X
        raise NotImplementedError

    def score(self, X, y):
        # custom score implementation
        raise NotImplementedError
```

### Example to show customization of inputs compared to base estimators:

In [89]:
class PClassEstDFonly(sk.base.BaseEstimator, sk.base.ClassifierMixin):
    def __init__(self):
        # initialization code
        self.modelDF=pd.DataFrame()

    def fit(self, train_DF):
        #fit the model to the majority vote
        self.modelDF=train_DF.loc[:,['Pclass', 'Survived']]\
                        .groupby('Pclass')\
                        .mean()\
                        .round()\
                        .astype(int)    
        return self

    def predict(self, test_DF):
        return self.modelDF.loc[test_DF['Pclass'], 'Survived'].values

    def score(self, X, y):
        # custom score implementation
        #F1 score : 2 * precision * recall/(precision + recall)
        predictions = self.predict(X)

        # true positives: predicted 1, actually 1
        tp = sum(predictions * y) * 1.0
        # false positives: predicted 1, actually 0
        fp = sum(predictions * (1-y)) * 1.0
        # false negatives: predicted 0, actually 1
        fn = sum((1-predictions) * y) * 1.0
        
        precision =  tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall/(precision + recall)


In [100]:
pClassClfDFonly= PClassEstDFonly()
# note: train[1:700] covers rows 1-699, so row 0 is left out of the fit
pClassClfDFonly.fit(train[1:700])

PClassEstDFonly()

In [101]:
pClassClfDFonly.score(train[701:], train.Survived[701:])

0.56896551724137923
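One way to sanity-check a hand-rolled F1 computation is to compare it with `sklearn.metrics.f1_score` on a small example. The arrays below are made up for illustration; the sketch assumes binary 0/1 labels and at least one true positive (otherwise the divisions would fail).

```python
import numpy as np
from sklearn.metrics import f1_score

# hypothetical labels and predictions, chosen only for illustration
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])

tp = float(np.sum(y_pred * y_true))        # predicted 1, actually 1
fp = float(np.sum(y_pred * (1 - y_true)))  # predicted 1, actually 0
fn = float(np.sum((1 - y_pred) * y_true))  # predicted 0, actually 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_by_hand = 2 * precision * recall / (precision + recall)

# scikit-learn's implementation takes the true labels first
f1_sklearn = f1_score(y_true, y_pred)
```

The two values should agree up to floating-point error. Because F1 is symmetric in false positives and false negatives, swapping those two counts changes precision and recall but not the F1 value itself.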

### Example to follow fit(X, y), predict(X) pattern:

In [93]:
class PClassEst2(sk.base.BaseEstimator, sk.base.ClassifierMixin):
    def __init__(self):
        # initialization code
        self.modelDF=pd.DataFrame()

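    def fit(self, train_DF, train_labels):
        # fit the model to the majority vote per passenger class,
        # taking the labels from train_labels rather than from train_DF
        self.modelDF=pd.DataFrame({'Pclass': train_DF['Pclass'],
                                   'Survived': train_labels})\
                        .groupby('Pclass')\
                        .mean()\
                        .round()\
                        .astype(int)
    
        return self

    def predict(self, test_DF):
        return self.modelDF.loc[test_DF['Pclass'], 'Survived'].values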

    def score(self, X, y):
        # custom score implementation
        # F1 score : 2 * precision * recall/(precision + recall)
        predictions = self.predict(X)
        # let's use scikit learn's implementation
        return sk.metrics.f1_score(y, predictions)  

In [97]:
pClassClfDFonly2= PClassEst2()
pClassClfDFonly2.fit(train[1:700], train.Survived[1:700])

PClassEst2()

In [99]:
pClassClfDFonly2.score(train[701:], train.Survived[701:])

0.56896551724137923
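One practical payoff of the `fit(X, y)` / `predict(X)` pattern is that scikit-learn's own utilities can drive the estimator. The sketch below is an illustration under assumptions, not part of the original session: `MajorityByClass` and the toy data are hypothetical, it assumes a scikit-learn version that provides `sklearn.model_selection`, and it lists the mixin before `BaseEstimator`, which is the ordering newer scikit-learn documentation recommends. It relies on the default accuracy `score` from `ClassifierMixin` rather than a custom F1.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import KFold, cross_val_score


class MajorityByClass(ClassifierMixin, BaseEstimator):
    """Hypothetical toy estimator: predict the majority label per Pclass."""

    def fit(self, X, y):
        labels = pd.Series(np.asarray(y), index=X.index)
        # mean label per passenger class, rounded to 0 or 1
        self.majority_ = labels.groupby(X['Pclass']).mean().round().astype(int)
        self.classes_ = np.unique(labels)
        return self

    def predict(self, X):
        # look up the stored majority label for each row's class
        return X['Pclass'].map(self.majority_).values


# made-up data: every fold's training part contains all three classes
X = pd.DataFrame({'Pclass': [1, 2, 3] * 4})
y = pd.Series([1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0])

# three consecutive folds; each call clones, fits, and scores the estimator
scores = cross_val_score(MajorityByClass(), X, y, cv=KFold(n_splits=3))
```

`scores` holds one accuracy value per fold. An estimator whose `fit` takes only a single DataFrame, as in the first example, could not be passed to `cross_val_score` this way.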

## On to [Pipelines](./pipelines.ipynb#pipeBegin)