## About dataset
### Kaggle Competition | Titanic Machine Learning from Disaster

>The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.  This sensational tragedy shocked the international community and led to better safety regulations for ships.

>One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.  Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

>In this contest, we ask you to complete the analysis of what sorts of people were likely to survive.  In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

>This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning.

From the competition [homepage](http://www.kaggle.com/c/titanic-gettingStarted).



In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Uncomment this if you are using Google Colab
#!wget https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/Titanic/train.csv

## Loading dataframe
Let's read our data in using pandas:

In [None]:
df = pd.read_csv("train.csv") 
print (df.shape)
print (df.columns)
df.head()

### Data description

The file we read in the previous step is available on the data page for the Titanic competition on Kaggle. That page also has a data dictionary, which explains the columns that make up the data set. Below are the descriptions contained in that data dictionary:

- PassengerId - A column added by Kaggle to identify each row and make submissions easier
- Survived - Whether the passenger survived or not and the value we are predicting (0=No, 1=Yes)
- Pclass - The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd)
- Sex - The passenger's sex
- Age - The passenger's age in years
- SibSp - The number of siblings or spouses the passenger had aboard the Titanic
- Parch - The number of parents or children the passenger had aboard the Titanic
- Ticket - The passenger's ticket number
- Fare - The fare the passenger paid
- Cabin - The passenger's cabin number
- Embarked - The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

## Dealing with Null values

Let's count the missing values for all columns.

In [None]:
df.isnull().sum()
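
Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic data) to show that the mean of the `isnull()` mask gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real dataframe the same expression would be `df.isnull().mean()`.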

### Take care of missing values:
The `Cabin` feature has many missing values, so it cannot add much to our analysis. We will drop it from the dataframe. It is also a good idea to delete the passenger's name and the ticket number, since this data is unlikely to improve the quality of our predictions. Let's do it with pandas.

    df = df.drop(['Ticket', 'Cabin', 'Name'], axis=1)

The next step is to handle the remaining missing values. Note that we have 2 missing values in the `Embarked` column and 177 in the `Age` column. There are several possible solutions; one of them is simply to delete all the rows with missing values. You can do it using the following pandas function.
   
    df = df.dropna()


In [None]:
df = df.drop(['Ticket','Cabin', 'Name'], axis=1)
# Remove NaN values
df = df.dropna() 
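
Dropping rows is only one of the possible solutions. As an alternative, here is a minimal sketch of simple imputation on made-up data: the missing `Age` values are filled with the median and the missing `Embarked` value with the most frequent port. The frame `toy` and the choice of median/mode are assumptions for illustration, not the approach this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

filled = toy.copy()
# Numeric column: replace NaN with the median of the known ages.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: replace missing values with the most frequent category.
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
print(filled)
```

Imputation keeps every row, at the cost of introducing values that were never observed.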

In [None]:
df.shape

## Task 1 -  Let's take a look at our data 

### Task 1 a) Data analysis
Let's check:
- how many people survived, and how many didn't *(hint - google for pandas value_counts)*
- what is the Passenger Class (`Pclass` column) distribution of our data
- what is the age distribution of our data *(hint - use pandas histogram)*

In [None]:
df.Survived.value_counts()

In [None]:
df.Pclass.value_counts()

In [None]:
df['Age'].plot.hist()
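
Counts can also be expressed as proportions and broken down by group. The sketch below uses a tiny made-up frame (`toy` is hypothetical) to show `value_counts(normalize=True)` and a per-group survival rate computed with `groupby`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
})

# normalize=True turns counts into proportions that sum to 1.
survived_share = toy["Survived"].value_counts(normalize=True)
print(survived_share)

# The mean of a 0/1 column within each group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

On the real dataframe, `df.groupby("Pclass")["Survived"].mean()` would give the same kind of breakdown by passenger class.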

### Task 1b -  Correlation analysis 
Prepare a correlation analysis of the data. The analysis should display the correlation between each pair of variables.

*Hint: You may google for `pandas correlation matrix heatmap` to find the solution*

In [None]:
import seaborn as sns
corr = df.select_dtypes(include=np.number).corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

In [None]:
cmap = sns.diverging_palette(5, 250, as_cmap=True)

def magnify():
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
]

corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format("{:.2f}")\
    .set_table_styles(magnify())
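
When the goal is prediction, the column of the correlation matrix that belongs to the label is often the most useful part. Here is a minimal sketch on made-up data (`toy` is hypothetical) that ranks the other columns by the absolute value of their correlation with `Survived`.

```python
import pandas as pd

# Toy numeric frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 1],
    "Fare":     [7.3, 71.3, 13.0, 8.1, 53.1, 8.5, 21.1, 51.9],
})

corr = toy.corr()
# Take the label's column, drop its correlation with itself,
# and sort by strength regardless of sign.
target_corr = (
    corr["Survived"]
    .drop("Survived")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.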

## Task 2 - Select features and labels
Create the new dataframes/series:
- `X` - with the features
- `y` - with the labels

These dataframes should be ready to use train_test_split, and then to perform machine learning.

**Note: This task will probably be done iteratively. You may come back to it every time you want to improve your results.**

In [None]:
y = df.Survived
y.head()

In [None]:
# Candidate feature sets tried in turn; only the last assignment takes effect
X = df[['Pclass', 'Fare']] # Doesn't work well
X = df[['Pclass', 'Fare', 'Sex', 'SibSp', 'Parch']] # Works well
X = df[['Pclass', 'Fare', 'Sex']] # Also works well
X.head()

## Task 3 - One-hot Encoding
Perform one-hot encoding if necessary. 

*Hint: Use `pd.get_dummies`*

In [None]:
X = pd.get_dummies(X, columns=['Pclass', 'Sex']) 

In [None]:
X.head()
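
To see exactly what `pd.get_dummies` produces, here is a small sketch on a made-up frame (`toy` is hypothetical). It also shows the optional `drop_first=True` argument, which drops one level per encoded feature; that variant is an illustration, not something the cell above uses.

```python
import pandas as pd

# Toy frame with one numeric and two categorical features (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "female"],
    "Fare": [71.3, 7.9, 13.0],
})

# One indicator column per category; untouched columns come first.
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"])
print(encoded.columns.tolist())

# drop_first=True removes the first level of each feature, because it is
# fully determined by the remaining indicator columns.
reduced = pd.get_dummies(toy, columns=["Pclass", "Sex"], drop_first=True)
print(reduced.columns.tolist())
```

Dropping a level mainly matters for unregularized linear models, where fully redundant columns make the coefficients ambiguous.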

## Task 3 - Train test split
<img src="./images/TrainTestSplit.jpg">
Split the data into training and testing sets. Use the following parameters:
- Size of testing set = 25% of the entire dataset
- Your training / testing sets should contain approximately the same ratio of survivors (use `stratify`)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify = y, random_state = 1)

print ('X train shape:', X_train.shape)
print ('X test shape:', X_test.shape)
print ('y train shape:', y_train.shape)
print ('y test shape:', y_test.shape)
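
A quick way to confirm that `stratify` did its job is to compare the share of positive labels in both parts. The sketch below does this on made-up labels (`X_toy` and `y_toy` are hypothetical stand-ins).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40% positive labels (values are made up).
y_toy = pd.Series([1] * 40 + [0] * 60)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# With stratification both shares should be close to the overall 40%.
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

On the real split the same check would be `y_train.mean()` versus `y_test.mean()`.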

## Task 4 - Train your model
Create and train your model. You may use any model you wish.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

print ('Model trained!')
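
Features on very different scales (such as a fare next to 0/1 indicator columns) can slow down the solver behind `LogisticRegression`. One common option, sketched below on synthetic data, is to put a `StandardScaler` in front of the model with `make_pipeline`. The data, the names `X_toy`, `y_toy` and `pipe`, and the decision to scale are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales.
X_toy = np.column_stack([rng.normal(0, 1, 200), rng.normal(30, 50, 200)])
# Labels follow a linear rule, so a linear model can learn them.
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 50 > 0.6).astype(int)

# The scaler is fitted on the training data only, then the model is trained
# on the standardized features.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_toy, y_toy)
print("Training accuracy:", pipe.score(X_toy, y_toy))
```

A pipeline exposes the same `fit` / `score` interface as a plain model, so it can be dropped into the cells that follow.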

## Task 5 - Test your model
Test your model. If necessary, change something and train your model again.
Your goal is to prepare a model with accuracy >= 75%.

In [None]:
print ("Train set accuracy:", model.score(X_train, y_train))
print ("Test set accuracy:", model.score(X_test, y_test))
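
Accuracy alone does not show which kind of mistakes a model makes. The sketch below uses hand-written labels (`y_true` and `y_pred` are made up, not output of the model above) to show how a confusion matrix breaks the same accuracy down into correct and incorrect predictions per class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written example labels (made up for illustration).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
print("accuracy:", acc)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

For a trained model, the predictions would come from `model.predict(X_test)`.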

## Let's test different models now

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(min_samples_leaf = 1)
model.fit(X_train, y_train)

print ('Train accuracy:', model.score(X_train, y_train))
print ('Test accuracy:', model.score(X_test, y_test))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print ('Train accuracy:', model.score(X_train, y_train))
print ('Test accuracy:', model.score(X_test, y_test))
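
A single train/test split can favor one model by chance. A hedged sketch of a more robust comparison is to score each candidate with k-fold cross-validation. The data here is synthetic (generated with `make_classification`), and the names `models` and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the real features.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, candidate in models.items():
    # 5-fold cross-validation; for classifiers the folds are stratified
    # and the default score is accuracy.
    scores[name] = cross_val_score(candidate, X_toy, y_toy, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

A large gap between train and test accuracy, as is typical for a random forest with `min_samples_leaf = 1`, is a sign of overfitting, so the cross-validated score is usually the fairer number to compare.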