# Setup

First, we install the Kaggle command-line tool. It lets us download datasets and work with competitions without visiting the Kaggle website.

In [1]:
!pip install kaggle



**Connecting Colab with the Kaggle command-line tool.**

To do this, you need to create an API key on Kaggle and put the key file in your Google Drive.

Go to "My Account" on Kaggle. You can find it by clicking your profile icon in the upper-right corner of the site. Scroll down to the API section, click the "Create New API Token" button, and download the resulting file. Upload it to your Google Drive. The code below searches your Drive for a file named `kaggle.json`, so keep that file name.

Then you can continue by running the code below:

In [2]:
# This authenticates with Kaggle using the API key you put in your Google Drive
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth

auth.authenticate_user()

drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])

filename = "/root/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)

request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
# Restrict the key file to owner read/write; the mode must be octal (0o600), not decimal 600
os.chmod(filename, 0o600)

Download 100%.


Now you are ready to download the dataset and start building your submission. The code below downloads the dataset for you.

Tips:
1. Read more about the Titanic competition on Kaggle: https://www.kaggle.com/c/titanic. A good data scientist always gets familiar with the data and the story behind it.
2. Note where the files are saved; you will need those paths to access them.

In [3]:
!kaggle competitions download --force -c titanic

Downloading test.csv to /content
  0% 0.00/28.0k [00:00<?, ?B/s]
100% 28.0k/28.0k [00:00<00:00, 10.3MB/s]
Downloading gender_submission.csv to /content
  0% 0.00/3.18k [00:00<?, ?B/s]
100% 3.18k/3.18k [00:00<00:00, 3.16MB/s]
Downloading train.csv to /content
  0% 0.00/59.8k [00:00<?, ?B/s]
100% 59.8k/59.8k [00:00<00:00, 52.7MB/s]


# Exploratory Data Analysis

Load the data as pandas DataFrames and do some basic exploratory analysis to get a feel for what it looks like and what needs to be cleaned.

Tips:
1. Check which columns are not useful as features.
2. Check for missing values and replace them.
3. Remember that some classifiers do not handle categorical features, so convert categorical features to binary ones.
4. Remember to apply the same transformations to the test dataset!
  * I recommend doing all transformations in one function that takes a DataFrame as a parameter and returns the transformed DataFrame. In this notebook the function also takes the training frame as its first argument, so that missing values in both sets are filled with training-set means: `train_df = do_transformations(train_df_raw, train_df_raw)` and `test_df = do_transformations(train_df_raw, test_df_raw)`.
5. You may also want to do some feature engineering. Again, make sure the same engineering is performed on the test dataset.
  * Plot some graphs and look at the distributions.
  * You can leave this step until after you have submitted initial results.
6. Since the submission file needs the `PassengerId` column, do not drop it in the `do_transformations` function!
  * But don't use it for the model: it is not a useful feature but rather an index.
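
Tips 2 and 3 can be checked with a couple of pandas calls. The sketch below is only an illustration: `sample_df` is a small made-up stand-in for `train_df_raw`, so the snippet runs on its own; in the notebook you would call the same methods on `train_df_raw`.

```python
import pandas as pd

# Hypothetical miniature of train.csv, used only so this sketch is self-contained.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, None],
})

# Tip 2: count the missing values in each column.
missing_counts = sample_df.isnull().sum()

# Tip 3: list the non-numeric columns, which need encoding (or dropping).
categorical_columns = sample_df.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print(categorical_columns)
```

A column that is mostly missing, like `Cabin` in this sample, is a candidate for dropping rather than filling.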

In [4]:
import pandas as pd
train_df_raw = pd.read_csv('/content/train.csv')
test_df_raw = pd.read_csv('/content/test.csv')

In [5]:
train_df_raw


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [6]:
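def do_transformations(t_df, df):
  # t_df: training frame used to compute fill values; df: frame to transform
  # Drop columns that are not used as features
  df = df.drop(["Ticket", "Cabin", "Name"], axis=1)
  # One-hot encode the categorical columns (passed explicitly as `columns`)
  df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
  # Fill missing values with training-set means so train and test are treated alike
  df.Age = df.Age.fillna(t_df.Age.mean())
  df.Fare = df.Fare.fillna(t_df.Fare.mean())

  return df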

In [7]:
train_df = do_transformations(train_df_raw, train_df_raw)
test_df = do_transformations(train_df_raw, test_df_raw)

train_df = train_df.drop("PassengerId", axis=1)

In [8]:
train_df

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.000000,1,0,7.2500,0,1,0,0,1
1,1,1,38.000000,1,0,71.2833,1,0,1,0,0
2,1,3,26.000000,0,0,7.9250,1,0,0,0,1
3,1,1,35.000000,1,0,53.1000,1,0,0,0,1
4,0,3,35.000000,0,0,8.0500,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,27.000000,0,0,13.0000,0,1,0,0,1
887,1,1,19.000000,0,0,30.0000,1,0,0,0,1
888,0,3,29.699118,1,2,23.4500,1,0,0,0,1
889,1,1,26.000000,0,0,30.0000,0,1,1,0,0


# Machine Learning

As you know, there are multiple ways to build a good model. In competitive ML you have to find the best algorithm for a given task. You are given a training dataset and a test dataset. However, the test dataset does not have labels. You still make predictions on the test dataset, but the predicted labels are submitted to Kaggle, which assigns a score based on how good your predictions were.

All you can work with to build a model is `train.csv`. There are multiple approaches to using it, but the most important thing when building a model is to make sure it is robust, that is, not overfitted to the training data.

Tips:
1. Check which cost function (evaluation metric) is used for the competition and use it to evaluate your models.
2. Try a few different algorithms and parameters.
3. To build a robust model (not overfitted), use cross-validation to find the best algorithm and parameters. Then you can build the final model using the best algorithm and parameters on the whole training dataset.
  * Instead of building a single model on the whole dataset, try submitting a prediction that combines the predictions of the models built during cross-validation (for 5-fold CV you will have 5 models, each corresponding to one training fold).
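
Tip 3 and its sub-point can be sketched with scikit-learn's `cross_validate`. This is a minimal illustration under stated assumptions, not a tuned solution: `X_demo`, `y_demo`, and `X_new` are synthetic stand-ins for the notebook's `X`, `y`, and test features, the `max_depth=3` setting is arbitrary, and accuracy is assumed as the metric (check the competition page for the one actually used).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); in the notebook, use X and y from train_df.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_new = X_demo[:10]  # pretend these are unlabeled test rows

# 5-fold cross-validation; return_estimator=True keeps the model fitted on each fold.
cv_results = cross_validate(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
    return_estimator=True,
)
mean_accuracy = cv_results["test_score"].mean()
print("Mean CV accuracy:", mean_accuracy)

# Combine the fold models by majority vote over their 0/1 predictions.
fold_models = cv_results["estimator"]
fold_predictions = np.array([m.predict(X_new) for m in fold_models])
ensemble_prediction = (fold_predictions.sum(axis=0) > len(fold_models) / 2).astype(int)
print(ensemble_prediction)
```

Comparing `mean_accuracy` across algorithms and parameter settings is one way to pick `best_model`; the majority vote is an optional alternative to refitting a single model on all the training data.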

In [9]:
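from sklearn.tree import DecisionTreeClassifier

# Features are every column except the label; the label is "Survived"
X = train_df.drop("Survived", axis=1)
y = train_df["Survived"]

# Baseline: a decision tree with default parameters, fitted on the whole training set
model = DecisionTreeClassifier()
model.fit(X, y)

best_model = model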


# Submission

When you are done selecting the best model, generate a submission file. Kaggle expects a .csv file whose first column is the ID and whose second column is the predicted value. The code below defines a function that saves the submission to `/content/submission.csv`.

In [10]:
def generate_submission_file(best_model, test_df):
  # PassengerId is an index, not a feature, so drop it before predicting
  test_df2 = test_df.drop('PassengerId', axis=1)
  Y = best_model.predict(test_df2)

  submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y
  })
  submission.to_csv('/content/submission.csv', index=False)

You have to actually run that function to generate the submission file.
Assumptions for the code below:
1. Your best model is stored in the `best_model` variable.
2. You have applied all necessary transformations to `test.csv` and the result is stored in the `test_df` variable.
  * necessary = the same as on the training dataset
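
Assumption 2 can be checked before predicting. The helper below is a hypothetical sketch (the name `check_feature_columns` and the demo frames are made up for illustration): it compares the feature columns used for training with those of the transformed test frame, which can differ when one-hot encoding is applied to each dataset separately.

```python
import pandas as pd

def check_feature_columns(train_features, test_df, id_column="PassengerId"):
    """Hypothetical helper: return (missing, extra) test columns relative to training."""
    test_features = test_df.drop(id_column, axis=1)
    missing = [c for c in train_features.columns if c not in test_features.columns]
    extra = [c for c in test_features.columns if c not in train_features.columns]
    return missing, extra

# Made-up demo frames; in the notebook, pass X and test_df instead.
train_features_demo = pd.DataFrame(
    {"Pclass": [3, 1], "Sex_female": [0, 1], "Sex_male": [1, 0]}
)
test_demo = pd.DataFrame(
    {"PassengerId": [892, 893], "Pclass": [3, 3], "Sex_male": [1, 1]}
)

missing, extra = check_feature_columns(train_features_demo, test_demo)
print("Missing in test:", missing)
print("Extra in test:", extra)
```

If both lists are empty, the test frame has the same feature columns as the training data; the column order should match too.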

In [11]:
generate_submission_file(best_model, test_df)

Once the file is generated, you can submit it using the command below.

Notes:
1. You have to accept the competition's rules on the website first!
2. You can modify the message to give your submissions more meaningful descriptions.

In [12]:
!kaggle competitions submit -c titanic -f /content/submission.csv -m "A test submission"

100% 2.77k/2.77k [00:04<00:00, 613B/s]
Successfully submitted to Titanic: Machine Learning from Disaster