<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/04_classification/04_classification_project/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Classification Project

In this project you will apply what you have learned about classification and TensorFlow to complete a project from Kaggle. The challenge is to achieve a high accuracy score when predicting which passengers survived the sinking of the Titanic. After building your model, you will upload your predictions to Kaggle and submit the score that you get.

## The Titanic Dataset

[Kaggle](https://www.kaggle.com) has a [dataset](https://www.kaggle.com/c/titanic/data) containing the Titanic's passenger list. The data contains passenger features such as age, sex, and ticket class, as well as whether or not each passenger survived.

Your job is to create a binary classifier using TensorFlow to determine if a passenger survived or not. The `Survived` column lets you know if the person survived. Then, upload your predictions to Kaggle and submit your accuracy score at the end of this Colab, along with a brief conclusion.


To get the dataset, you'll need to accept the competition's rules by clicking the "I understand and accept" button on the [competition rules page](https://www.kaggle.com/c/titanic/rules). Then upload your `kaggle.json` file and run the code below.

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && cp kaggle.json ~/.kaggle/ && echo 'Done'
! kaggle competitions download -c titanic
! ls

**Note: If you see a "403 - Forbidden" error above, you still need to click "I understand and accept" on the [competition rules page](https://www.kaggle.com/c/titanic/rules).**

Three files are downloaded:

1. `train.csv`: training data (contains features and targets)
1. `test.csv`: feature data used to make predictions to send to Kaggle
1. `gender_submission.csv`: an example competition submission file
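
Depending on the version of the Kaggle CLI, the download may arrive as a single archive instead of three loose CSV files. The sketch below assumes the archive is named `titanic.zip` (an assumption, check the output of `! ls`) and extracts it only if it is present. The helper `extract_if_zipped` is a hypothetical name, not part of the Kaggle tooling.

```python
import zipfile
from pathlib import Path


def extract_if_zipped(archive="titanic.zip", dest="."):
    """Extract `archive` into `dest` if it exists and return the member names."""
    archive_path = Path(archive)
    if not archive_path.exists():
        # Nothing to extract: the CSV files were probably downloaded directly.
        return []
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


# Assumed archive name; prints an empty list when no archive is present.
print(extract_if_zipped())
```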

## Step 1: Exploratory Data Analysis

Perform exploratory data analysis and data preprocessing. Use as many text and code blocks as you need to explore the data. Note any findings. Repair any data issues you find.

**Student Solution**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

test_df.describe()

---
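
`describe()` only hints at missing data through its `count` row. A direct per-column tally of missing values can make the repair step easier to plan. The sketch below runs on a tiny made-up frame (not the real data), and `missing_summary` is a hypothetical helper. In the notebook it could be called on `train_df` and `test_df` instead.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (NOT the real Titanic data) so the sketch runs on its own.
demo_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})


def missing_summary(df):
    """Return count and fraction of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "fraction": counts / len(df)})
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


print(missing_summary(demo_df))
```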

In [None]:
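# Repair the Age column by filling missing values with the column mean.
# Assign the result back rather than using inplace=True on a column
# selection, which newer pandas versions warn about and may not apply.
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())

# Repair the single missing cell in the Fare column of the test data
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())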

In [None]:
# Create a binary indicator column for sex (1 = male, 0 = female)

train_df['is_male'] = (train_df['Sex'] == 'male').astype(int)
test_df['is_male'] = (test_df['Sex'] == 'male').astype(int)

In [None]:
# Defining target and feature columns
TARGET = 'Survived'
FEATURES = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'is_male']
NUMERIC_FEATURES = ['Age', 'SibSp', 'Parch', 'Fare']

X_train = train_df[FEATURES]
y_train = train_df[TARGET]
X_test = test_df[FEATURES]
test_df.columns
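
The repair cell above fills each file with its own column mean. A common alternative, sketched here as an assumption rather than as what this solution does, is to compute the fill values on the training data only and reuse them for the test data, so both sets are treated identically. The frames and the helper `impute_with_train_means` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frames (NOT the real data) so the sketch is self-contained.
demo_train = pd.DataFrame({"Age": [20.0, 40.0, np.nan], "Fare": [10.0, 30.0, 20.0]})
demo_test = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 5.0]})


def impute_with_train_means(train, test, columns):
    """Fill NaNs in `columns` of both frames using means from `train` only."""
    means = train[columns].mean()  # NaNs are skipped by default
    train_filled = train.copy()
    test_filled = test.copy()
    train_filled[columns] = train_filled[columns].fillna(means)
    test_filled[columns] = test_filled[columns].fillna(means)
    return train_filled, test_filled


filled_train, filled_test = impute_with_train_means(
    demo_train, demo_test, ["Age", "Fare"]
)
print(filled_test)
```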

## Step 2: The Model

Build, fit, and evaluate a classification model. Perform any model-specific data processing you need. If the toolkit you use supports it, create visualizations for loss and accuracy improvements. Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

In [None]:
import tensorflow as tf
import numpy as np
from sklearn.metrics import accuracy_score

In [None]:
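# Min-max normalize the numeric features. The statistics come from the
# training set only, so train and test are scaled the same way.
feature_min = train_df[NUMERIC_FEATURES].min()
feature_range = train_df[NUMERIC_FEATURES].max() - feature_min
train_df[NUMERIC_FEATURES] = (train_df[NUMERIC_FEATURES] - feature_min) / feature_range
test_df[NUMERIC_FEATURES] = (test_df[NUMERIC_FEATURES] - feature_min) / feature_range

# X_train and X_test were copied before scaling, so rebuild them here;
# otherwise the model would train on the unnormalized values.
X_train = train_df[FEATURES]
X_test = test_df[FEATURES]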

# define network
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation=tf.nn.relu, 
                          input_shape=(len(FEATURES),)),
    tf.keras.layers.Dense(32, activation=tf.nn.sigmoid),
    tf.keras.layers.Dense(16, activation=tf.nn.sigmoid),
    tf.keras.layers.Dense(8, activation=tf.nn.sigmoid),
    tf.keras.layers.Dense(4, activation=tf.nn.sigmoid),
    tf.keras.layers.Dense(2, activation=tf.nn.sigmoid),
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

# compile the model
opt = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(
    loss='binary_crossentropy',
    optimizer=opt,
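    # BinaryAccuracy thresholds the sigmoid output at 0.5 before comparing;
    # Accuracy() would require predictions to equal the labels exactly.
    metrics=[tf.keras.metrics.BinaryAccuracy()]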
)

# training
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=200)
history = model.fit(X_train, y_train, epochs=7000, verbose=1, callbacks = [callback])

In [None]:
import matplotlib.pyplot as plt

# A single plot, so no subplot grid is needed
plt.figure(figsize=(8, 5))

plt.plot(history.history['loss'])
plt.title('Training Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss'], loc='best')

---
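
The training cell relies on `EarlyStopping` with `monitor='loss'` and `patience=200`, so training ends once the training loss has gone that many epochs without improving. The function below is a simplified sketch of that patience rule, not Keras's actual implementation (it ignores options such as `min_delta` and `restore_best_weights`). `early_stop_epoch` is a hypothetical name.

```python
def early_stop_epoch(losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # the loss kept improving often enough


# Made-up loss values for illustration.
print(early_stop_epoch([1.0, 0.9, 0.95, 0.92, 0.91], patience=2))  # 3
```

Because the monitored quantity here is the training loss, this rule says nothing about overfitting. Monitoring a validation loss would require holding out part of the training data.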

## Step 3: Make Predictions and Upload To Kaggle

In this step you will make predictions on the features found in the `test.csv` file and upload them to Kaggle using the [Kaggle API](https://github.com/Kaggle/kaggle-api). Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

In [None]:
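# Make predictions: model.predict returns sigmoid probabilities with
# shape (n, 1), so flatten them and threshold at 0.5
probabilities = model.predict(X_test).ravel()
predictions = (probabilities >= 0.5).astype(int)

# Reuse the example submission file for its PassengerId column
submission = pd.read_csv('gender_submission.csv')
submission['Survived'] = predictions

submission.to_csv('submission.csv', index=False)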
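
Before uploading, it can help to confirm that the submission has the expected shape: a `PassengerId` column and an integer `Survived` column containing only 0 and 1. The sketch below builds such a frame from made-up IDs and probabilities. `build_submission` is a hypothetical helper, and the 0.5 threshold mirrors the one used above.

```python
import numpy as np
import pandas as pd


def build_submission(passenger_ids, probabilities, threshold=0.5):
    """Turn predicted probabilities into a PassengerId/Survived frame."""
    # Flatten (n, 1) model output to (n,) before thresholding.
    probs = np.asarray(probabilities, dtype=float).ravel()
    labels = (probs >= threshold).astype(int)
    return pd.DataFrame({"PassengerId": list(passenger_ids), "Survived": labels})


# Made-up IDs and probabilities, shaped like model.predict output.
demo = build_submission([892, 893, 894], [[0.1], [0.5], [0.93]])
print(demo)

# Basic format checks before writing the CSV.
assert list(demo.columns) == ["PassengerId", "Survived"]
assert set(demo["Survived"].unique()) <= {0, 1}
```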

What was your Kaggle score?

> *0.76315*

---

## Step 4: Iterate on Your Model

In this step you're encouraged to play around with your model settings and to even try different models. See if you can get a better score. Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

We ended up iterating on the network above rather than trying a different model. We experimented with the activation functions, added two more hidden layers, and adjusted the learning rate to arrive at our final score.

---
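
When iterating on model settings, a local hold-out set gives feedback without spending a Kaggle submission on every change. The sketch below only produces shuffled row indices for such a split. The 20% validation fraction, the fixed seed, and the name `holdout_indices` are all assumptions chosen for illustration. The resulting index arrays could be used with `.iloc` on `X_train` and `y_train`.

```python
import numpy as np


def holdout_indices(n_rows, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) as disjoint shuffled row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_val = int(round(n_rows * val_fraction))
    return order[n_val:], order[:n_val]


# Small example with 10 rows: 8 for fitting, 2 for validation.
train_idx, val_idx = holdout_indices(10)
print(len(train_idx), len(val_idx))  # 8 2
```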