In [173]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Math

%matplotlib inline

In [281]:
train = pd.read_csv('data/titanic_train_model.csv')
X = train.drop('Survived', axis=1)
y = train.Survived

In [323]:
train.sort_values(by='Survived')

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,PersonType,Title,Survived
0,-0.820037,-0.737695,-0.595254,0.432793,-0.473674,-0.502445,-0.571933,-0.742596,-0.712190,0
519,-0.820037,-0.737695,0.174278,-0.474545,-0.473674,-0.489442,-0.571933,-0.742596,-0.712190,0
521,-0.820037,-0.737695,-0.595254,-0.474545,-0.473674,-0.489442,-0.571933,-0.742596,-0.712190,0
522,-0.820037,-0.737695,0.011224,-0.474545,-0.473674,-0.502949,1.000883,-0.742596,-0.712190,0
524,-0.820037,-0.737695,0.011224,-0.474545,-0.473674,-0.502864,1.000883,-0.742596,-0.712190,0
525,-0.820037,-0.737695,0.828379,-0.474545,-0.473674,-0.492378,2.573699,-0.742596,-0.712190,0
527,0.431081,-0.737695,0.011224,-0.474545,-0.473674,3.817033,-0.571933,-0.742596,-0.712190,0
528,-0.820037,-0.737695,0.712950,-0.474545,-0.473674,-0.488854,-0.571933,-0.742596,-0.712190,0
529,1.682199,-0.737695,-0.518301,1.340132,0.767630,-0.416873,-0.571933,-0.742596,-0.712190,0
531,-0.820037,-0.737695,0.011224,-0.474545,-0.473674,-0.502864,1.000883,-0.742596,-0.712190,0


In [283]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

## Linear Models

![validation](images/linear_regression.png)

$$ J(m, t) = \frac{1}{2n} \sum_{i=1}^{n} (h(x_i) - y_i)^2 \quad \text{with} \quad h(x) = m \cdot x + t $$
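As a minimal sketch, the squared-error cost for the line h(x) = m*x + t can be computed with NumPy as follows. The function name `squared_error_cost` and the toy data are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def squared_error_cost(x, y, m, t):
    """Half the mean squared error of the line h(x) = m*x + t on the samples (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = (m * x + t) - y          # h(x_i) - y_i for every sample
    return np.sum(residuals ** 2) / (2 * len(x))

# Toy data (hypothetical), generated exactly by y = 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

cost_perfect = squared_error_cost(x, y, m=2.0, t=1.0)  # the line that generated the data
cost_flat = squared_error_cost(x, y, m=0.0, t=1.0)     # a flat line that ignores x
print(cost_perfect, cost_flat)
```

The perfect line has zero cost, while the flat line is penalized for every point it misses; fitting a linear model means searching for the m and t with the smallest cost.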

In [243]:
from sklearn import linear_model
model_lr = linear_model.LogisticRegression()

In [244]:
model_lr.fit(X,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [245]:
model_lr.score(X,y)

0.81818181818181823

sklearn offers many estimators, and they all share a common interface. This makes it easy to use an estimator even when you don't fully understand how it works. Still, read at least the docstring of an estimator before using it.

## Estimator API  

### 1. estimator.fit(X_train, y_train)  
- Trains the model using the training data X_train and the training labels y_train

### 2. estimator.predict(X_test)
- Uses the previously trained model to predict labels (classification) or values (regression) for the test data

### 3. estimator.score(X_val, y_val)
- First uses estimator.predict(X_val) to predict the labels/values, then compares them with the given labels y_val to score the result (mean accuracy for classifiers, R² for regressors)
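The three calls can be sketched end to end as follows. The data here is synthetic (generated with `make_classification`) and the variable names are assumptions for illustration; this is not the Titanic data used elsewhere in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

estimator = LogisticRegression()
estimator.fit(X_tr, y_tr)               # 1. learn the parameters from the training data
predicted = estimator.predict(X_va)     # 2. predict labels for data not used in training
accuracy = estimator.score(X_va, y_va)  # 3. predict internally and compare with y_va

# For a classifier, score() is the fraction of correctly predicted labels
manual_accuracy = (predicted == y_va).mean()
print(accuracy, manual_accuracy)
```

Swapping `LogisticRegression` for another sklearn classifier should leave the remaining lines unchanged, which is the point of the shared interface.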

## New concept: Validation
Every estimator has its advantages and drawbacks. Its generalization error can be decomposed into bias, variance and noise. The bias of an estimator is its average error across different training sets. The variance of an estimator indicates how sensitive it is to changes in the training set. Noise is a property of the data. The estimator has no influence on the noise, so we can only try to lower its bias and its variance.

error = bias² + variance + noise (for squared-error loss)

Our goal is to find an estimator that generalizes well to new, unseen data sets.
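To make bias and variance concrete, the following sketch estimates both by simulation for polynomial regression on a known function. Everything in it is an assumption chosen for illustration: the target function sin(2πx), the noise level, the polynomial degrees, and the helper name `bias_variance`.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(2 * np.pi * x)

noise_std = 0.3                        # noise is a property of the data
x_eval = np.linspace(0.05, 0.95, 50)   # points at which predictions are compared
n_datasets, n_points = 200, 30

def bias_variance(degree):
    """Fit one polynomial per simulated training set; return (squared bias, variance)."""
    predictions = np.empty((n_datasets, x_eval.size))
    for k in range(n_datasets):
        x_train = rng.uniform(0, 1, n_points)
        y_train = true_function(x_train) + rng.normal(0, noise_std, n_points)
        coefficients = np.polyfit(x_train, y_train, degree)
        predictions[k] = np.polyval(coefficients, x_eval)
    mean_prediction = predictions.mean(axis=0)
    # Squared bias: how far the average prediction is from the true function
    bias_squared = np.mean((mean_prediction - true_function(x_eval)) ** 2)
    # Variance: how much the prediction changes from one training set to the next
    variance = np.mean(predictions.var(axis=0))
    return bias_squared, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (b2, var) in results.items():
    print(f"degree={degree}  bias^2={b2:.4f}  variance={var:.4f}  noise={noise_std**2:.4f}")
```

In this setup the straight line (degree 1) is too rigid and shows the largest bias, while the degree-9 polynomial follows each training set closely and shows the largest variance.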

### How do we know if our estimator has high bias/variance?
We need data that the estimator has not seen during training and for which we know the correct labels. We can't use our test data because we don't know its correct labels. So there is no other option but to set aside part of our valuable training data. This held-out part is called the validation data set.
#### 1. Calculation based
We could use some math and compute the bias and the variance of our estimator directly. Unfortunately, this is not easy to do for classification problems.
#### 2. Graph based
Another way is to compare the curves of the training error and the validation error to find out whether our estimator is overfitting or underfitting. This gives us an idea of how to change our model to reduce the error.

**Overfitting:** Small training error + Large validation error => Reduce model complexity + Regularization  
**Underfitting:** Large training error + Large validation error => Increase model complexity
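The comparison of training and validation error can be sketched by varying the model complexity and recording both errors. The example below is an illustration under assumptions: it uses synthetic data with deliberately noisy labels and a decision tree whose `max_depth` serves as the complexity knob; neither appears in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly assigned labels (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

depths = [1, 2, 4, 8, 16]            # increasing model complexity
train_errors, val_errors = [], []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_errors.append(1 - tree.score(X_tr, y_tr))   # error = 1 - accuracy
    val_errors.append(1 - tree.score(X_va, y_va))

for depth, tr, va in zip(depths, train_errors, val_errors):
    print(f"max_depth={depth:2d}  train error={tr:.3f}  validation error={va:.3f}")
```

Plotting `train_errors` and `val_errors` against `depths` (for example with `plt.plot`) gives the two curves: a widening gap at large depths points to overfitting, while two high errors at small depths point to underfitting.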

In [328]:
# Sort by label so that all non-survivors (0) come before all survivors (1)
train.sort_values(by='Survived', inplace=True)
X = train.drop('Survived', axis=1)
y = train.Survived

In [355]:
from sklearn.model_selection import train_test_split
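# Without shuffling, the validation set is simply the last 30% of the sorted rows,
# so it contains no non-survivors (see the empty result below); random_state is ignored here
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=3, stratify=None, shuffle=False)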

In [356]:
X_val[y_val == 0]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,PersonType,Title


In [359]:
from sklearn.model_selection import train_test_split
# Shuffle (the default) and stratify on y so both classes keep their proportions in each split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=3, stratify=y)

In [358]:
X_val[y_val == 0]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,PersonType,Title
672,1.682199,-0.737695,3.098497,-0.474545,-0.473674,-0.437007,-0.571933,-0.742596,-0.712190
83,0.431081,-0.737695,-0.133535,-0.474545,-0.473674,0.299922,-0.571933,-0.742596,-0.712190
888,-0.820037,1.355574,0.011224,0.432793,2.008933,-0.176263,-0.571933,0.771484,1.038323
782,0.431081,-0.737695,-0.056582,-0.474545,-0.473674,-0.044381,-0.571933,-0.742596,-0.712190
711,0.431081,-0.737695,0.011224,-0.474545,-0.473674,-0.113846,-0.571933,-0.742596,-0.712190
714,1.682199,-0.737695,1.713341,-0.474545,-0.473674,-0.386671,-0.571933,-0.742596,-0.712190
48,-0.820037,-0.737695,0.011224,1.340132,-0.473674,-0.211918,1.000883,-0.742596,-0.712190
225,-0.820037,-0.737695,-0.595254,-0.474545,-0.473674,-0.460162,-0.571933,-0.742596,-0.712190
592,-0.820037,-0.737695,1.328575,-0.474545,-0.473674,-0.502445,-0.571933,-0.742596,-0.712190
789,0.431081,-0.737695,1.251622,-0.474545,-0.473674,0.946246,1.000883,-0.742596,-0.712190
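The effect of the two splits can be reproduced on a tiny example. The sketch below uses hypothetical labels (60 zeros followed by 40 ones) in place of the sorted Titanic labels and compares the share of class 1 in the validation set for both variants.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical sorted labels: all zeros first, then all ones
y_sorted = np.array([0] * 60 + [1] * 40)
X_sorted = np.arange(100).reshape(-1, 1)   # dummy feature column

# Variant 1: no shuffling, the validation set is just the tail of the sorted data
_, _, _, y_val_unshuffled = train_test_split(
    X_sorted, y_sorted, test_size=0.3, shuffle=False)

# Variant 2: shuffled and stratified, class proportions are preserved
_, _, _, y_val_stratified = train_test_split(
    X_sorted, y_sorted, test_size=0.3, random_state=3, stratify=y_sorted)

share_unshuffled = y_val_unshuffled.mean()   # fraction of class 1 in the validation set
share_stratified = y_val_stratified.mean()
print(share_unshuffled, share_stratified)
```

Without shuffling, the validation set consists of class 1 only, so a score on it says nothing about class 0. With `stratify`, the validation set keeps roughly the 40% share of class 1 found in the full data.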


In [259]:
model_lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [260]:
model_lr.score(X_val, y_val)

0.81343283582089554

In [255]:
model_ridge = linear_model.RidgeClassifier(alpha=800)  # large alpha = strong L2 regularization

In [257]:
model_ridge.fit(X_train,y_train)
model_ridge.score(X_val,y_val)

0.80223880597014929

In [360]:
from sklearn.ensemble import GradientBoostingClassifier

In [390]:
gbc = GradientBoostingClassifier(n_estimators=80, learning_rate=0.1)

In [391]:
gbc.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=80,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [392]:
gbc.score(X_train, y_train)

0.884430176565008

In [393]:
gbc.score(X_val, y_val)

0.85074626865671643