# Introduction to Kaggle Machine Learning Competitions
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

In this tutorial, we are going to build several machine learning models to predict which passengers survived the tragedy.

* **Dataset**: Titanic: Machine Learning from Disaster URL: https://www.kaggle.com/c/titanic

* **Purpose**: Apply the tools of machine learning to predict which passengers survived the tragedy.

* **Tools**: Python/Numpy, Pandas, Scikit-Learn and IPython notebook

---
### Loading Training and Testing Datasets
* Pandas is one of the most popular Python libraries for reading and manipulating tabular data.
* Its central data structure, the DataFrame, is similar to R's data frame.

For more details about the Pandas library, please read "10 Minutes to pandas" [http://pandas.pydata.org/pandas-docs/stable/10min.html].

In [1]:
import pandas as pd
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

* Pandas comes with a lot of tools for exploring your datasets.
* We are going to use a few of them in this tutorial.

In [2]:
# print first n number of rows
train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


VARIABLE DESCRIPTIONS:

**survival**:        Survival
                (0 = No; 1 = Yes)
                
**pclass**:          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
                
**name**:            Name

**sex**:             Sex

**age**:            Age

**sibsp**:           Number of Siblings/Spouses Aboard

**parch**:           Number of Parents/Children Aboard

**ticket**:          Ticket Number

**fare**:            Passenger Fare

**cabin**:           Cabin

**embarked**:        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

In [3]:
# print last n number of rows
train.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [4]:
# print first n number of rows
test.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [5]:
# print last n number of rows
test.tail(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
417,1309,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C


In [6]:
# returns non-NA/null observations 
train.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [7]:
# returns non-NA/null observations 
test.count()

PassengerId    418
Pclass         418
Name           418
Sex            418
Age            332
SibSp          418
Parch          418
Ticket         418
Fare           417
Cabin           91
Embarked       418
dtype: int64

* In order to work with scikit-learn we need numeric (float or integer) NumPy arrays. Columns stored with the object type, such as **Name** or **Sex** above, will not work with scikit-learn as they are.
* We also need to separate the response variable (i.e. the **Survived** column) from the features.
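
As a minimal sketch of the two points above, the snippet below inspects column types and splits off the response. It uses a tiny hand-made frame called `toy` as a stand-in for `train` (the values are illustrative only), and `select_dtypes` is just one possible way to pick the numeric columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are illustrative only).
toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Fare': [7.25, 71.2833, 7.925],
})

# Text columns such as Sex are stored with a non-numeric dtype.
print(toy.dtypes)

# Keep only the numeric columns, then separate the response from the features.
numeric = toy.select_dtypes(include=[np.number])
y = numeric['Survived'].values
X = numeric.drop(columns=['Survived']).values

print(X.shape, y.shape)
```

Here `X` holds the remaining numeric columns (**Pclass** and **Fare**) and `y` holds the response, which is the shape of input scikit-learn estimators expect.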

In [8]:
train.groupby('Survived').count()['PassengerId']

Survived
0    549
1    342
Name: PassengerId, dtype: int64

* The target variable is not balanced. If you always predict the majority class (did not survive), you already have a classifier with 61.62% (549/(549 + 342)) accuracy.
* So a simple baseline can be created without any machine learning, and a useful model has to beat it.
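
The baseline arithmetic can be checked with a few lines of NumPy. This is only a sketch: the label vector is rebuilt from the class counts shown above rather than read from `train`, and the variable names are made up for the illustration.

```python
import numpy as np

# Rebuild a label vector from the class counts in the groupby output above.
y = np.array([0] * 549 + [1] * 342)

# Predict the most frequent class for every passenger.
majority_class = np.bincount(y).argmax()
baseline_predictions = np.full_like(y, majority_class)
baseline_accuracy = (baseline_predictions == y).mean()

print(majority_class, round(baseline_accuracy, 4))
```

Any model we train later should be compared against this figure of roughly 61.6% rather than against 50%.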

### Let's Create Our First Machine Learning Model
* Our first model will be based on numerical features only. 
* Working with numerical variables is much easier than categorical variables.
* Later we will consider categorical variables as well. 

In [9]:
# Use following method to extract a sub-set of columns from original DataFrame
numerical_features = train[['Age', 'Fare', 'Pclass']]

In [10]:
#returns non-NA/null observations
numerical_features.count()

Age       714
Fare      891
Pclass    891
dtype: int64

* Now we have a little issue: the **Age** column has some missing values.
* First of all, we have to impute those missing values.
* Our strategy for handling missing values is called **mean imputation**: calculate the mean of each numerical column and replace its missing values with that mean.

In [11]:
# dropna() keeps only the rows in which every column is present
numerical_features_without_na = numerical_features.dropna()
mean = numerical_features_without_na.mean()
print(mean)

Age       29.699118
Fare      34.694514
Pclass     2.236695
dtype: float64
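
One detail worth noting about the cell above: `dropna()` discards every row that has at least one missing value, so the **Fare** and **Pclass** means printed above are taken only over the rows where **Age** is known, not over all 891 rows. The sketch below illustrates the difference on made-up numbers (the `toy` frame is hypothetical, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Made-up values: Age is missing in the last row.
toy = pd.DataFrame({
    'Age': [20.0, 40.0, np.nan],
    'Fare': [10.0, 20.0, 60.0],
})

# Mean over complete rows only (what dropna().mean() computes).
complete_row_mean = toy.dropna().mean()

# Per-column mean; pandas skips NaN within each column by default.
column_mean = toy.mean()

print(complete_row_mean['Fare'], column_mean['Fare'])
```

For **Age** the two approaches agree, because the rows removed are exactly the ones where **Age** is missing. They differ only for the other columns, and either choice is a reasonable imputation value as long as it is used consistently.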


In [12]:
# Now you can impute all missing values with the mean vector we just calculated
imputed_features_training = numerical_features.fillna(mean)

In [13]:
imputed_features_training.count()

Age       891
Fare      891
Pclass    891
dtype: int64

In [14]:
# Now we can convert our DataFrame to numpy arrays as shown below
X_train = imputed_features_training.values
y_train = train['Survived'].values

* You can use the same method to impute missing values in the testing dataset, reusing the mean computed on the training data.

In [15]:
numerical_features_testing = test[['Age', 'Fare', 'Pclass']]

In [16]:
numerical_features_testing.count()

Age       332
Fare      417
Pclass    418
dtype: int64

In [17]:
imputed_features_testing = numerical_features_testing.fillna(mean)

In [18]:
imputed_features_testing.count()

Age       418
Fare      418
Pclass    418
dtype: int64

### Let's Build a Simple Model Using Logistic Regression

In [19]:
%%time 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import scale

# sklearn.model_selection replaces the removed sklearn.cross_validation module
folds = KFold(n_splits=5, shuffle=True)
cv_accuracies = []
for training_idx, testing_idx in folds.split(X_train):
    X_train_cv = X_train[training_idx]
    y_train_cv = y_train[training_idx]
    
    X_test_cv = X_train[testing_idx]
    y_test_cv = y_train[testing_idx]
    
    logistic_regression = LogisticRegression()
    logistic_regression.fit(scale(X_train_cv), y_train_cv)
    y_predict_cv = logistic_regression.predict(scale(X_test_cv))
    current_accuracy = accuracy_score(y_test_cv, y_predict_cv)
    cv_accuracies.append(current_accuracy)
    print('cross validation accuracy: %f' % current_accuracy)
    
print('---------------------------------------')
print('average cross validation accuracy: %f' % (sum(cv_accuracies) / len(cv_accuracies)))
print('---------------------------------------')

cross validation accuracy: 0.698324
cross validation accuracy: 0.724719
cross validation accuracy: 0.713483
cross validation accuracy: 0.679775
cross validation accuracy: 0.691011
---------------------------------------
average cross validation accuracy: 0.701463
---------------------------------------
CPU times: user 156 ms, sys: 28.3 ms, total: 184 ms
Wall time: 692 ms


In [20]:
from sklearn.linear_model import LogisticRegression

?LogisticRegression

* The performance of the above model was measured by 5-fold cross validation. For more details about cross validation, please read Chapter 5 of [1] and Chapter 1 of [2].
* The above model has a small issue, though. The mean used to impute the missing values was calculated over the entire training dataset, before the data was split into folds. As a result, each held-out fold has already influenced the values the model was trained on, so the cross validation accuracy is slightly optimistic.
* You can read details about **LogisticRegression** in Chapter 4 of [1] and Chapter 4 of [2].
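
One common way to avoid the leakage described above is to put imputation and scaling inside a scikit-learn pipeline, so that both are refit on the training part of every fold. The following is a minimal sketch, not the method used in this notebook: it assumes scikit-learn 0.20 or later (for `SimpleImputer`) and runs on synthetic data standing in for the **Age**/**Fare**/**Pclass** matrix.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix, with some values missing.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.rand(200) < 0.2, 0] = np.nan  # knock out roughly 20% of the first column

# Imputation and scaling are refit on the training part of every fold,
# so the held-out fold never contributes to the mean or the scale.
model = make_pipeline(SimpleImputer(strategy='mean'),
                      StandardScaler(),
                      LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(scores.mean())
```

With the real data, `X` would be the un-imputed feature array (for example `numerical_features.values`) and `y` the **Survived** column.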

## Let's Submit Our Embarrassingly Bad Solution to Kaggle

In [34]:
# First train Logistic Regression using the full training dataset
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
X_test = imputed_features_testing.values
y_test = logistic_regression.predict(X_test)

# save data
test_result = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':y_test})
test_result.to_csv('../data/submission.csv', index=False)

## Adding More Features
Let's add a few more features: the numerical **SibSp** and **Parch** columns and the categorical **Sex** and **Embarked** columns.

In [70]:
train[['Sex']].head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [22]:
# categorical features should be encoded before feeding them to scikit-learn algorithms
pd.get_dummies(train['Sex'], prefix='Sex').head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


* One-hot (a.k.a. one-of-K) encoding is one of the common techniques for encoding categorical features.
* For instance, if you want to represent the days of the week (Sunday, Monday, etc.) as features:
    * Create 7 dummy features and make only one active for each row

| Day | D_1 | D_2 | D_3 | D_4 | D_5 | D_6 | D_7 |
| :-- | --: | --: | --: | --: | --: | --: | --: |
| Sunday | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Friday | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

* Scikit-learn also comes with one hot encoding support. Please read http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html for details. 
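
The days-of-the-week example can be reproduced with `pd.get_dummies`. This is a small sketch: the column names come out as `D_Sunday`, `D_Monday`, and so on rather than `D_1` to `D_7`, and declaring the values as a `Categorical` with a fixed category list is an assumption made here so that days missing from the data still get a column.

```python
import pandas as pd

# Fix the category order so every day gets a column, even if it is unseen.
weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
            'Thursday', 'Friday', 'Saturday']
days = pd.Series(pd.Categorical(['Sunday', 'Friday'], categories=weekdays))

# One column per category; exactly one of them is 1 in each row.
encoded = pd.get_dummies(days, prefix='D').astype(int)
print(encoded)
```

Fixing the category list up front also keeps the training and testing frames aligned, since both end up with the same set of dummy columns.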
    

In [23]:
more_features_training = pd.concat([train[['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']], 
                                  pd.get_dummies(train[['Sex']]),
                                  pd.get_dummies(train[['Embarked']])], axis=1)

In [24]:
more_features_training.head(5)

Unnamed: 0,Age,Fare,Pclass,SibSp,Parch,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,22,7.25,3,1,0,0,1,0,0,1
1,38,71.2833,1,1,0,1,0,1,0,0
2,26,7.925,3,0,0,1,0,0,0,1
3,35,53.1,1,1,0,1,0,0,0,1
4,35,8.05,3,0,0,0,1,0,0,1


In [25]:
more_features_training.count()

Age           714
Fare          891
Pclass        891
SibSp         891
Parch         891
Sex_female    891
Sex_male      891
Embarked_C    891
Embarked_Q    891
Embarked_S    891
dtype: int64

In [26]:
mean = more_features_training.dropna().mean()
more_features_training_without_nan = more_features_training.fillna(mean)

In [27]:
more_features_training_without_nan.count()

Age           891
Fare          891
Pclass        891
SibSp         891
Parch         891
Sex_female    891
Sex_male      891
Embarked_C    891
Embarked_Q    891
Embarked_S    891
dtype: int64

In [30]:
# Now we can convert our DataFrame to numpy arrays as shown below
X_train = more_features_training_without_nan.values
y_train = train['Survived'].values

In [28]:
%%time 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import scale

X_train = more_features_training_without_nan.values
y_train = train['Survived'].values

# sklearn.model_selection replaces the removed sklearn.cross_validation module
folds = KFold(n_splits=5, shuffle=True)
cv_accuracies = []
for training_idx, testing_idx in folds.split(X_train):
    X_train_cv = X_train[training_idx]
    y_train_cv = y_train[training_idx]
    
    X_test_cv = X_train[testing_idx]
    y_test_cv = y_train[testing_idx]
    
    logistic_regression = LogisticRegression()
    logistic_regression.fit(scale(X_train_cv), y_train_cv)
    y_predict_cv = logistic_regression.predict(scale(X_test_cv))
    current_accuracy = accuracy_score(y_test_cv, y_predict_cv)
    cv_accuracies.append(current_accuracy)
    print('cross validation accuracy: %f' % current_accuracy)
    
print('---------------------------------------')
print('average cross validation accuracy: %f' % (sum(cv_accuracies) / len(cv_accuracies)))

cross validation accuracy: 0.787709
cross validation accuracy: 0.775281
cross validation accuracy: 0.786517
cross validation accuracy: 0.825843
cross validation accuracy: 0.808989
---------------------------------------
average cross validation accuracy: 0.796868
CPU times: user 10.3 ms, sys: 0 ns, total: 10.3 ms
Wall time: 11.3 ms


### Let's Train a RandomForestClassifier.

In [31]:
%%time 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X_train = more_features_training_without_nan.values
y_train = train['Survived'].values

# accuracy_score and scale were imported in the cells above
folds = KFold(n_splits=5, shuffle=True)
cv_accuracies = []
for training_idx, testing_idx in folds.split(X_train):
    X_train_cv = X_train[training_idx]
    y_train_cv = y_train[training_idx]
    
    X_test_cv = X_train[testing_idx]
    y_test_cv = y_train[testing_idx]
    
    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(scale(X_train_cv), y_train_cv)
    y_predict_cv = random_forest.predict(scale(X_test_cv))
    current_accuracy = accuracy_score(y_test_cv, y_predict_cv)
    cv_accuracies.append(current_accuracy)
    print('cross validation accuracy: %f' % current_accuracy)

    
print('---------------------------------------')
print('average cross validation accuracy: %f' % (sum(cv_accuracies) / len(cv_accuracies)))
print('---------------------------------------\n')

cross validation accuracy: 0.810056
cross validation accuracy: 0.820225
cross validation accuracy: 0.786517
cross validation accuracy: 0.764045
cross validation accuracy: 0.780899
---------------------------------------
average cross validation accuracy: 0.792348
---------------------------------------

CPU times: user 356 ms, sys: 2.15 ms, total: 359 ms
Wall time: 362 ms


In [32]:
%%time 
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X_train = more_features_training_without_nan.values
y_train = train['Survived'].values

# accuracy_score and scale were imported in the cells above
folds = KFold(n_splits=5, shuffle=True)
cv_accuracies = []
for training_idx, testing_idx in folds.split(X_train):
    X_train_cv = X_train[training_idx]
    y_train_cv = y_train[training_idx]
    
    X_test_cv = X_train[testing_idx]
    y_test_cv = y_train[testing_idx]
    
    svc = SVC(C=1.4)
    svc.fit(scale(X_train_cv), y_train_cv)
    y_predict_cv = svc.predict(scale(X_test_cv))
    current_accuracy = accuracy_score(y_test_cv, y_predict_cv)
    cv_accuracies.append(current_accuracy)
    print('cross validation accuracy: %f' % current_accuracy)

    
print('---------------------------------------')
print('average cross validation accuracy: %f' % (sum(cv_accuracies) / len(cv_accuracies)))

cross validation accuracy: 0.821229
cross validation accuracy: 0.792135
cross validation accuracy: 0.786517
cross validation accuracy: 0.820225
cross validation accuracy: 0.842697
---------------------------------------
average cross validation accuracy: 0.812560
CPU times: user 84.3 ms, sys: 0 ns, total: 84.3 ms
Wall time: 82.4 ms


### Hyperparameter Optimization
* The above machine learning algorithms have some hyperparameters.
* So far we just picked arbitrary values for these hyperparameters. For the best performance, we need to search for good values.
* Scikit-learn comes with:
    * GridSearch http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV
    * RandomizedSearch http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV

In [34]:
%%time 
from sklearn.ensemble import RandomForestClassifier
# sklearn.model_selection replaces the removed sklearn.grid_search module
from sklearn.model_selection import GridSearchCV

X_train = more_features_training_without_nan.values
y_train = train['Survived'].values

cls = RandomForestClassifier()
parameters = {
    'n_estimators' : [10, 20, 40, 80, 160, 320, 640],
    'max_depth' : [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40],
    'criterion' : ['gini', 'entropy']
}
# cv=5 runs 5-fold cross validation for every parameter combination
gs = GridSearchCV(cls, parameters, cv=5, n_jobs=8, scoring="accuracy")
gs.fit(X_train, y_train)

print(gs.best_score_)

0.832772166105
CPU times: user 1.4 s, sys: 62.3 ms, total: 1.46 s
Wall time: 22 s
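
Grid search tries every combination in the grid, which gets expensive as the grid grows. Randomized search samples a fixed number of combinations instead. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the Titanic features, a deliberately small parameter space, and `n_iter=5`, all chosen here to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the Titanic feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Sample 5 random combinations instead of trying the full grid.
param_distributions = {
    'n_estimators': [10, 20, 40],
    'max_depth': [2, 4, 8, None],
    'criterion': ['gini', 'entropy'],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=5, cv=3,
                            scoring='accuracy', random_state=0)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

As with `GridSearchCV`, the fitted object exposes `best_params_` and `best_score_`, so the two searches are easy to swap.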


### You Can Further Increase Performance
* Create more features (a.k.a. feature engineering).
* Ensemble the above classifiers and create a meta-classifier.
* You can try other libraries such as **XGBoost** [https://github.com/dmlc/xgboost], a gradient boosting (GBDT, GBRT or GBM) library, or deep learning libraries (such as **Theano** and **Keras**).
* You can also use the **R language** for building your models.
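
As one possible starting point for the ensembling idea, the sketch below combines the three kinds of classifier used in this tutorial by majority vote. Note that this is plain hard voting via scikit-learn's `VotingClassifier`, which is simpler than training a true meta-classifier (stacking), and that it runs on synthetic data with arbitrarily chosen settings rather than on the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the Titanic feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each member votes for a class; the most common vote wins.
ensemble = VotingClassifier(estimators=[
    ('lr', make_pipeline(StandardScaler(), LogisticRegression())),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('svc', make_pipeline(StandardScaler(), SVC(C=1.4))),
], voting='hard')

scores = cross_val_score(ensemble, X, y, cv=5, scoring='accuracy')
print(scores.mean())
```

Voting tends to help most when the individual models make different mistakes, so it is worth comparing the ensemble's cross validation accuracy against each member on its own.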

## References
[1]. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, **An Introduction to Statistical Learning**, freely available at [http://www-bcf.usc.edu/~gareth/ISL/]

[2]. Christopher M. Bishop, **Pattern Recognition and Machine Learning** (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN: 0387310738

#### Also well worth reading:
* Level-Up Your Machine Learning https://www.metacademy.org/roadmaps/cjrd/level-up-your-ml
* How do I start doing Kaggle competitions? https://www.quora.com/How-do-I-start-doing-Kaggle-competitions
* What do top Kaggle competitors focus on? https://www.quora.com/What-do-top-Kaggle-competitors-focus-on
* How do I learn machine learning? https://www.quora.com/How-do-I-learn-machine-learning-1 