# Custom Classes: [Estimators](#estBegin)

We'll introduce some new tools to implement what we did last session. Custom classes (regressors, classifiers, clusterers, transformers, feature unions, and pipelines) can be powerful additions to your tool belt.

This introduction is modeled after Adam Rogers's titanic_finished-ish.py script we worked through last time.

We start by pulling in the datasets and importing our libraries. Data available at https://www.kaggle.com/c/titanic/data.

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import sklearn as sk
import pandas as pd
import numpy as np
from scikitDemoHelpers import genericLevelsToDummiesTransformer

In [149]:
train = pd.read_csv('../titanic/data/train.csv')
test = pd.read_csv('../titanic/data/test.csv')

# combining early to apply transformations uniformly
combinedSet = pd.concat([train , test], axis = 0)
combinedSet = combinedSet.reset_index(drop = True)
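
To see what combining early does to the labels, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` and their values are made up for illustration; the point is that rows coming from the test set have no `Survived` value after the concat.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values, not the real data)
toy_train = pd.DataFrame({'Pclass': [1, 3], 'Survived': [1, 0]})
toy_test = pd.DataFrame({'Pclass': [2, 3]})  # the test set has no 'Survived' column

# Stack the rows and rebuild a clean 0..n-1 index, as done above
toy_combined = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# Rows that came from the test set carry NaN in 'Survived'
n_unlabeled = toy_combined['Survived'].isna().sum()
print(toy_combined)
print('unlabeled rows:', n_unlabeled)
```

Any step that needs labels (fitting, scoring) therefore has to handle or exclude those unlabeled rows.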

As a reminder, this set includes the following variables (the column names in the CSV are capitalized, e.g. `Pclass`, `Survived`):

| Variable      | Description  |  Values  |
| ------------- |:-------------:| -----:|
| survived      | Survival | (0 = No; 1 = Yes) |
| pclass     | Passenger Class     |   (1 = 1st; 2 = 2nd; 3 = 3rd) |
| name  | Name     |    String |
| sex | Sex      |    ('male' or 'female') |
| age | Age     |    Float 0-80  |
| sibsp | Number of Siblings/Spouses Aboard      |    Int |
| parch | Number of Parents/Children Aboard      |    Int |
| ticket | Ticket Number      |    String  |
| fare | Passenger Fare      |    Float |
| cabin| Cabin     |    String (e.g. C134) |
| embarked| Port of Embarkation      |    ('C' = Cherbourg; 'Q' = Queenstown; 'S' = Southampton) |


<a id='estBegin'></a>

## Estimators
We are familiar with estimators like LogisticRegression, NearestNeighbors, and DecisionTreeClassifier.

We instantiate an estimator, then call `fit`, `predict`, and possibly `score`.

In [None]:
from sklearn import tree
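# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides train_test_split, so we alias it to keep the name used below
from sklearn import model_selection as cross_validation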
from sklearn.metrics import confusion_matrix

In [None]:
combinedSet.loc[1:10,['Pclass', 'Age', 'Fare', 'Survived']]

In [None]:
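# ASSUMPTION: `passengers` is never defined in this notebook; we take it to be
# the combined set built above (its unlabeled test rows explain the fillna on Survived)
passengers = combinedSet

X_tree = passengers.loc[:, ['Age', 'Fare']]\
        .fillna(0)
y_tree = passengers['Survived']\
        .fillna(0)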

In [None]:
X_tree

In [None]:
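# X_tree_train / y_tree_train come from the train_test_split cell further down; run that cell first
treeClf = tree.DecisionTreeClassifier()
treeClf.fit(X_tree_train, y_tree_train)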

In [None]:
tree_predictions = treeClf.predict(X_tree_test)

In [None]:
print('Mean Accuracy Score: ', treeClf.score(X_tree_test, y_tree_test))
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted
print('Confusion Matrix: \n',
      pd.DataFrame(confusion_matrix(y_tree_test, tree_predictions)))
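
To make the confusion matrix easier to read, here is a small sketch on hand-made labels. The lists `y_true` and `y_pred` are hypothetical. scikit-learn's `confusion_matrix` takes the true labels first, so rows correspond to actual classes and columns to predicted classes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]  # hypothetical actual labels
y_pred = [0, 1, 1, 1, 0]  # hypothetical predictions

# Argument order is (y_true, y_pred): rows = actual, columns = predicted
cm = confusion_matrix(y_true, y_pred)

cm_df = pd.DataFrame(cm,
                     index=['actual 0', 'actual 1'],
                     columns=['predicted 0', 'predicted 1'])
print(cm_df)

# Mean accuracy is the diagonal (correct predictions) over the total count
accuracy = cm.trace() / cm.sum()
print('accuracy:', accuracy)
```

Swapping the two arguments transposes the matrix, which silently exchanges false positives and false negatives.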



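To write your own estimator, subclass `BaseEstimator` and the appropriate mixin, then implement the `fit`, `predict`, and (optionally) `score` methods:
``` python
from sklearn import base

class Estimator(base.BaseEstimator, base.ClassifierMixin):
    def __init__(self, ...):
        # initialization code

    def fit(self, X, y):
        # fit the model ...
        return self

    def predict(self, X):
        return # prediction

    # optional: ClassifierMixin already provides mean accuracy as the default score
    def score(self, X, y):
        return # custom score implementation
```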

Let's create a custom estimator based on the majority survival outcome within each passenger class, e.g. if most of the people in 1st class survived, predict that any test observation from 1st class survived.
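
Before wrapping this in a class, the majority-vote idea itself can be sketched with a plain `groupby` on toy data. The frame `toy` and its values are invented for illustration: the mean of a 0/1 column is the survival rate, and rounding it gives the majority outcome.

```python
import pandas as pd

# Hypothetical toy passengers: three per class
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 1, 0, 0, 1],
})

# Survival rate per class: the mean of a 0/1 column
rates = toy.groupby('Pclass')['Survived'].mean()

# Rounding the rate turns it into the majority vote (0 or 1)
majority = rates.round().astype(int)
print(majority)
```

One caveat with this shortcut: a rate of exactly 0.5 rounds to 0, because NumPy-style rounding sends halves to the nearest even number.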

### Example to show customization of inputs compared to base estimators:

In [None]:
class PClassEstDFonly(sk.base.BaseEstimator, sk.base.ClassifierMixin):
    def __init__(self):
        # initialization code
        self.modelDF=pd.DataFrame()

    def fit(self, train_DF):
        #fit the model to the majority vote
        self.modelDF=train_DF.loc[:,['Pclass', 'Survived']]\
                        .groupby('Pclass')\
                        .mean()\
                        .round()\
                        .astype(int)
    
        return self

    def predict(self, test_DF):
        return self.modelDF.loc[test_DF['Pclass'], 'Survived']

    def score(self, X, y):
        # custom score implementation
        return 0


In [None]:
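pClassClfDFonly = PClassEstDFonly()
# first 700 rows for fitting (slicing from 1 would silently skip row 0)
pClassClfDFonly.fit(passengers[:700])

In [None]:
# remaining rows for prediction (starting at 700 so that no row is skipped)
pClassClfDFonly.predict(passengers[700:]).head(10)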

### Example to follow fit(X, y), predict(X) pattern:

In [None]:
class PClassEst2(sk.base.BaseEstimator, sk.base.ClassifierMixin):
    def __init__(self):
        # initialization code
        self.modelDF=pd.DataFrame()

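    def fit(self, train_DF, train_labels):
        # fit the model to the majority vote within each passenger class;
        # the labels arrive separately, so pair them with Pclass before grouping
        self.modelDF = pd.DataFrame({'Pclass': train_DF['Pclass'],
                                     'Survived': train_labels})\
                        .groupby('Pclass')\
                        .mean()\
                        .round()\
                        .astype(int)

        return self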

    def predict(self, test_DF):
        return self.modelDF.loc[test_DF['Pclass'], 'Survived']

    def score(self, X, y):
        # custom score implementation
        return 0
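
As a self-contained illustration of the `fit(X, y)` / `predict(X)` pattern, here is a minimal sketch on toy data. The class name `MajorityByClassSketch`, its `group_col` parameter, and the toy values are hypothetical and not part of the original script. It lists the mixin first (the order the current scikit-learn developer guide recommends, as far as I know) and relies on the `score` inherited from `ClassifierMixin` instead of overriding it.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityByClassSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical example: predict the majority label within each group."""

    def __init__(self, group_col='Pclass'):
        # __init__ only stores parameters; all learning happens in fit
        self.group_col = group_col

    def fit(self, X, y):
        groups = X[self.group_col].to_numpy()
        labels = pd.Series(np.asarray(y))
        # Rounded mean of the 0/1 labels per group = majority vote
        self.majority_ = labels.groupby(groups).mean().round().astype(int)
        return self

    def predict(self, X):
        # Look up the learned vote for each row's group; return a plain array
        keys = X[self.group_col].to_numpy()
        return self.majority_.loc[keys].to_numpy()


# Toy data (made-up values)
X_toy = pd.DataFrame({'Pclass': [1, 1, 1, 3, 3, 3]})
y_toy = pd.Series([1, 1, 0, 0, 0, 1])

clf = MajorityByClassSketch().fit(X_toy, y_toy)
preds = clf.predict(pd.DataFrame({'Pclass': [3, 1]}))
acc = clf.score(X_toy, y_toy)  # inherited from ClassifierMixin: mean accuracy
print(preds, acc)
```

Because the features and labels are passed separately and `predict` returns an array aligned with the rows of `X`, an estimator shaped like this should also work with helpers such as `train_test_split` output and pipelines.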

In [None]:
# split the numeric tree features and labels built earlier
# (DecisionTreeClassifier cannot fit the raw string columns such as Name)
X_tree_train, X_tree_test, y_tree_train, y_tree_test = \
    cross_validation.train_test_split(X_tree, y_tree)

In [None]:
X_tree_train

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(
        passengers.drop(['PassengerId', 'Survived'], axis=1),
        passengers['Survived'],
        test_size=0.25,
        random_state=13)
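
To check what `test_size` and `random_state` do, here is a short sketch on toy data, assuming scikit-learn 0.18 or later, where `train_test_split` lives in `sklearn.model_selection`. The frame `toy` is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 8 rows
toy = pd.DataFrame({'Pclass':   [1, 2, 3, 1, 2, 3, 1, 2],
                    'Survived': [1, 0, 0, 1, 1, 0, 0, 1]})
features = toy.drop('Survived', axis=1)
labels = toy['Survived']

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=13)

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    features, labels, test_size=0.25, random_state=13)

print(len(X_tr), len(X_te))
print(X_tr.index.equals(X_tr2.index))
```

Features and labels are shuffled together, so each row of `X_tr` keeps the same index as its label in `y_tr`.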