<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/04_classification/04_classification_project/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Classification Project

In this project you will apply what you have learned about classification and TensorFlow to complete a project from Kaggle. The challenge is to achieve a high accuracy score while predicting which passengers survived the sinking of the Titanic. After building your model, you will upload your predictions to Kaggle and submit the score that you get.

## The Titanic Dataset

[Kaggle](https://www.kaggle.com) has a [dataset](https://www.kaggle.com/c/titanic/data) containing the passenger list on the Titanic. The data contains passenger features such as age, gender, ticket class, as well as whether or not they survived.

Your job is to create a binary classifier using TensorFlow to determine if a passenger survived or not. The `Survived` column lets you know if the person survived. Then, upload your predictions to Kaggle and submit your accuracy score at the end of this Colab, along with a brief conclusion.


To get the dataset, you'll need to accept the competition's rules by clicking the "I understand and accept" button on the [competition rules page](https://www.kaggle.com/c/titanic/rules). Then upload your `kaggle.json` file and run the code below.

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && cp kaggle.json ~/.kaggle/ && echo 'Done'
! kaggle competitions download -c titanic
! ls

**Note: If you see a "403 - Forbidden" error above, you still need to click "I understand and accept" on the [competition rules page](https://www.kaggle.com/c/titanic/rules).**

Three files are downloaded:

1. `train.csv`: training data (contains features and targets)
1. `test.csv`: feature data used to make predictions to send to Kaggle
1. `gender_submission.csv`: an example competition submission file

## Step 1: Exploratory Data Analysis

Perform exploratory data analysis and data preprocessing. Use as many text and code blocks as you need to explore the data. Note any findings. Repair any data issues you find.

**Student Solution**

#### Imports and Data

In [None]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
from seaborn.axisgrid import FacetGrid
import re
from sklearn import linear_model
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, GridSearchCV

In [None]:
# Getting Data

test_df = pd.read_csv('test.csv')

train_df = pd.read_csv('train.csv')

### Exploring Training Data

In [None]:
train_df.info()

In [None]:
train_df

In [None]:
train_df.columns

From the dataset we know that there are 891 examples. There are 12 columns: the `Survived` target plus 11 features, including name, cabin, ticket, passenger ID, sex, and age.

In [None]:
train_df.describe()

We can see from the mean of `Survived` that about 38% of the listed passengers survived. Next we explore each column.

In [None]:
train_df.columns = ['Passenger ID', 'Survived', 'P Class', 'Name', 'Sex',
                    'Age', 'SibSp', 'Parch', 'Ticket',
                    'Fare', 'Cabin', 'Embarked']


In [None]:
train_df.columns

In [None]:
print(train_df['Passenger ID'].isna().any())
print(train_df['Passenger ID'].unique().shape)
for location in sorted(train_df['Passenger ID'].unique()):
  print(location)

Passenger ID column looks clean
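
The per-column checks in this section all follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that: `summarize_columns` is a hypothetical name, and `toy_df` is made-up data standing in for `train_df`.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and distinct count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(dropna=True),
    })


# Made-up rows for illustration only; in the notebook you would pass train_df.
toy_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, None, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

summary = summarize_columns(toy_df)
print(summary)
```

Calling `summarize_columns(train_df)` would give the missing-value and distinct-value counts for every column in a single table.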

In [None]:
print(train_df['Survived'].isna().any())

print(train_df['Survived'].unique().shape)

for location in sorted(train_df['Survived'].unique()):
  
  print(location)

In [None]:
print(train_df['P Class'].isna().any())

print(train_df['P Class'].unique().shape)

for location in sorted(train_df['P Class'].unique()):
  
  print(location)

In [None]:
print(train_df['Name'].isna().any())

print(train_df['Name'].unique().shape)

for location in sorted(train_df['Name'].unique()):

  print(location)

In [None]:
print(train_df['Sex'].isna().any())

print(train_df['Sex'].unique().shape)

for location in sorted(train_df['Sex'].unique()):

  print(location)

In [None]:
print(train_df['Age'].isna().any())

print(train_df['Age'].unique().shape)

for location in sorted(train_df['Age'].unique()):

  print(location)

In [None]:
print(train_df['SibSp'].isna().any())

print(train_df['SibSp'].unique().shape)

for location in sorted(train_df['SibSp'].unique()):

  print(location)


In [None]:
print(train_df['Parch'].isna().any())

print(train_df['Parch'].unique().shape)

for location in sorted(train_df['Parch'].unique()):

  print(location)

In [None]:
print(train_df['Ticket'].isna().any())

print(train_df['Ticket'].unique().shape)

for location in sorted(train_df['Ticket'].unique()):

  print(location)

In [None]:
print(train_df['Fare'].isna().any())

print(train_df['Fare'].unique().shape)

for location in sorted(train_df['Fare'].unique()):

  print(location)

In [None]:
print(train_df['Cabin'].isna().any())

print(train_df['Cabin'].unique().shape)



In [None]:
print(train_df['Embarked'].isna().any())

print(train_df['Embarked'].unique().shape)

---

### Analysis

In [None]:
train_df.head(10)

In [None]:
train_df.columns.values

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

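plt.figure(figsize=(10,10))
# numeric_only=True skips text columns such as Name, which newer pandas versions refuse to correlate
_ = sns.heatmap(train_df.corr(numeric_only=True), cmap='coolwarm', annot=True)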

In [None]:
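# Use a lowercase name so the imported FacetGrid class is not overwritten
grid = sns.FacetGrid(train_df, row='Embarked')
# Fix the category order so every facet uses the same x positions and hue colors
grid.map(sns.pointplot, 'P Class', 'Survived', 'Sex', order=[1, 2, 3], hue_order=['male', 'female'])
grid.add_legend()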

`Embarked` is the port where a passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton). Reading the plot, women who embarked at Q and S appear to have a higher probability of survival than women who embarked at C, while men who embarked at C appear to have a higher chance of survival than men who embarked at Q and S. P Class doesn't seem to correlate with survival as clearly as Embarked.
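
Because the plot is read by eye, the same comparison could also be checked numerically. The sketch below groups by port and sex and takes the mean of `Survived`. The `toy_df` rows are made up for illustration, so the printed rates say nothing about the real passengers; in the notebook the same `groupby` would run on `train_df`.

```python
import pandas as pd

# Made-up rows for illustration only; replace toy_df with train_df in the notebook.
toy_df = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'C', 'S', 'S', 'S', 'S'],
    'Sex': ['female', 'female', 'male', 'male', 'female', 'female', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate per group.
rates = toy_df.groupby(['Embarked', 'Sex'])['Survived'].mean().unstack('Sex')
print(rates)
```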

In [None]:
sns.barplot(x='Sex', y='Survived', data=train_df)


This graph shows that a higher proportion of women survived than men.

In [None]:
train_df['Sex'].value_counts()

There were more male passengers than female passengers.

In [None]:
train_df['Survived'].value_counts()

We see that 549 people died in this dataset. 
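
These counts also give a baseline to compare any model against. The arithmetic below uses the 891 rows and 549 deaths reported in this notebook; `majority_baseline` is just an illustrative variable name.

```python
# Counts taken from the notebook text: 891 passengers in train.csv, 549 of whom died.
total = 891
died = 549
survived = total - died

survival_rate = survived / total
# A model that always predicts "did not survive" is right for every passenger who died.
majority_baseline = died / total

print(f'Survived: {survived} ({survival_rate:.1%})')
print(f'Always-predict-died accuracy on the training data: {majority_baseline:.1%}')
```

A classifier is only useful here if it does better than this always-predict-died baseline.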

In [None]:
train_df.pivot_table('Survived', index = 'Sex', columns = 'P Class')

In [None]:
train_df.pivot_table('Survived', index = 'Sex', columns = 'P Class').plot()

In [None]:
age = pd.cut(train_df['Age'], [0,18,80])
train_df.pivot_table('Survived',['Sex', age], 'P Class')

**Summary:** The analysis shows that there were more men than women on the Titanic, and that women had a higher survival rate than men. In the plot, women who embarked at Q and S appear more likely to survive than women who embarked at C. In total, 549 people died and 342 survived. The remaining step is to drop 'Passenger ID' and 'Cabin', since they appear to have less impact on the survival rate than the other columns.
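
One way to carry out this cleanup so that the training and test files are treated identically is to put it in a single function. The sketch below is an assumption about how that could look, not the notebook's own code: `prepare_features`, `SEX_CODES`, and `PORT_CODES` are hypothetical names, the toy frame is made-up data, and the column names are the original Kaggle ones rather than the renamed ones used in this section.

```python
import pandas as pd

# Explicit mappings keep the codes identical for train and test data.
SEX_CODES = {'female': 0, 'male': 1}
PORT_CODES = {'C': 0, 'Q': 1, 'S': 2}


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier-like columns and encode the categorical ones."""
    out = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], errors='ignore')
    out['Sex'] = out['Sex'].map(SEX_CODES)
    out['Embarked'] = out['Embarked'].map(PORT_CODES)
    return out


# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    'PassengerId': [1, 2],
    'Pclass': [3, 1],
    'Name': ['A', 'B'],
    'Sex': ['male', 'female'],
    'Embarked': ['S', 'C'],
    'Cabin': [None, 'C85'],
})

features = prepare_features(toy_df)
print(features)
```

The codes chosen here match what `LabelEncoder` assigns, since it sorts the labels alphabetically.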

### Cleaning Data

In [None]:
# Counting the empty values in the columns

train_df.isna().sum()

A large portion of Cabin is missing, so we don't need that column, but we still need Age and Embarked.
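
For columns that are kept, dropping the rows with gaps is one option; filling the gaps is another that keeps more training examples. The sketch below shows median and most-frequent-value imputation with pandas on made-up data. Whether it helps on the real dataset is something to test rather than assume.

```python
import pandas as pd

# Made-up rows for illustration only; replace toy_df with train_df in the notebook.
toy_df = pd.DataFrame({
    'Age': [22.0, None, 30.0, 40.0],
    'Embarked': ['S', 'S', None, 'C'],
})

age_median = toy_df['Age'].median()            # median ignores missing values
port_mode = toy_df['Embarked'].mode().iloc[0]  # most frequent port

filled = toy_df.assign(
    Age=toy_df['Age'].fillna(age_median),
    Embarked=toy_df['Embarked'].fillna(port_mode),
)
print(filled)
```

The `age_median` and `port_mode` values computed from the training data could then be reused when filling `test.csv`, so both files are treated the same way.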

In [None]:
# First we need to see which values occur in each column

for x in train_df:
  print(train_df[x].value_counts())
  print()

In [None]:
train_df.columns

In [None]:
# Drop the rows with missing Embarked or Age values

train_df = train_df.dropna(subset = ['Embarked', 'Age'])


In [None]:
train_df.shape

In [None]:
#Dropping the Cabin Column in train data
train_df = train_df.drop(['Cabin'], axis=1)

In [None]:
#Dropping the Name and Ticket columns in train data (Passenger ID is kept)
train_df = train_df.drop(['Name', 'Ticket'], axis=1)


In [None]:
#Checking for the different value types in train data
train_df.dtypes

In [None]:
print(train_df['Sex'].unique())
print(train_df['Embarked'].unique())

In [None]:
train_df.dtypes

In [None]:
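#Using sklearn to encode columns, selected by name rather than by position
labelencoder = LabelEncoder()
#Sex: female -> 0, male -> 1 (LabelEncoder sorts the labels alphabetically)
train_df['Sex'] = labelencoder.fit_transform(train_df['Sex'])


#Embarked: C -> 0, Q -> 1, S -> 2
train_df['Embarked'] = labelencoder.fit_transform(train_df['Embarked'])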


In [None]:
train_df.dtypes

In [None]:
train_df

## Step 2: The Model

Build, fit, and evaluate a classification model. Perform any model-specific data processing that you need to perform. If the toolkit you use supports it, create visualizations for loss and accuracy improvements. Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

In [None]:
X_train = train_df.drop("Survived", axis=1)
Y_train  = train_df["Survived"]


In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train= sc.fit_transform(X_train)


In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10, criterion='entropy',
                                random_state = 0)
forest.fit(X_train, Y_train)
print('[0] Forest Training Accuracy: ', forest.score(X_train, Y_train))

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest1 = RandomForestClassifier(n_estimators=50, criterion='entropy',
                                random_state = 0)
forest1.fit(X_train, Y_train)
print('[1] Forest Training Accuracy: ', forest1.score(X_train, Y_train))

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest2 = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                random_state = 0)
forest2.fit(X_train, Y_train)
print('[2] Forest Training Accuracy: ', forest2.score(X_train, Y_train))

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=1000, criterion='entropy',
                                random_state = 0)
model.fit(X_train, Y_train)
print('[3] Forest Training Accuracy: ', model.score(X_train, Y_train))

The model we decided to use was a random forest classifier. Increasing the number of estimators raised the accuracy on the training data, although training accuracy alone does not show how well the model generalizes.
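
A hold-out split is one common way to check whether more estimators also help on passengers the model has not seen. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, so the numbers it prints say nothing about the real dataset; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook X and y would be X_train and Y_train.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

scores = {}
for n in (10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n, criterion='entropy', random_state=0)
    forest.fit(X_fit, y_fit)
    # Score on both splits: a large gap between them suggests overfitting.
    scores[n] = (forest.score(X_fit, y_fit), forest.score(X_val, y_val))
    print(f'n_estimators={n}: train={scores[n][0]:.3f}, validation={scores[n][1]:.3f}')
```

Choosing `n_estimators` by the validation column rather than the training column gives a better idea of how the model might score on Kaggle.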

---

## Step 3: Make Predictions and Upload To Kaggle

In this step you will make predictions on the features found in the `test.csv` file and upload them to Kaggle using the [Kaggle API](https://github.com/Kaggle/kaggle-api). Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

### Extracting Data

In [None]:
df = pd.read_csv('test.csv')


### Cleaning Data

In [None]:
df.columns

In [None]:
df = df.drop([ 'Name', 'Ticket', 'Cabin'], axis=1)

In [None]:
df.head(7)

In [None]:
print(df['Sex'].unique())
print(df['Embarked'].unique())

In [None]:
df.dtypes

In [None]:
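#Sex
# Encode the columns of df by name. test_df still holds Name and Ticket, so
# its positions 2 and 7 are Name and Ticket rather than Sex and Embarked.
df['Sex'] = labelencoder.fit_transform(df['Sex'])


#Embarked
df['Embarked'] = labelencoder.fit_transform(df['Embarked'])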

In [None]:
df = df.fillna(0)


In [None]:
df.isna().sum()

In [None]:
df

In [None]:
df.dtypes

### Preprocessing

In [None]:
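# test.csv has no Survived column; this marks first class as 1 and others as 0, a stand-in rather than true labels.
df['Survived'] = np.where(df['Pclass'] >= 2, 0, 1)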


In [None]:
X_test = df.drop("Survived", axis=1)
Y_test  = df["Survived"]

In [None]:
X_test

In [None]:
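# Reuse the scaler fitted on the training features instead of refitting on the test set.
# The column names differ ('P Class' vs 'Pclass'), so pass plain values in the same column order.
X_test = sc.transform(X_test.to_numpy())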

In [None]:
model = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                random_state = 0)
model.fit(X_train, Y_train)
print('[3] Forest Training Accuracy: ', model.score(X_train, Y_train))

### Predictions

In [None]:
pred =  model.predict(X_test)

print(pred)

print(Y_test)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# scikit-learn metrics take (y_true, y_pred) in that order
print('Accuracy: ', round(accuracy_score(Y_test, pred), 3))
print('Precision: ', round(precision_score(Y_test, pred), 3))
print('Recall: ', round(recall_score(Y_test, pred), 3))
print('F1: ', round(f1_score(Y_test, pred), 3))

In [None]:
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(Y_test, pred).ravel()

print(f'True Positive: {tp}\nTrue Negative: {tn}\nFalse Positive: {fp}\nFalse Negative: {fn}')

### Uploading

In [None]:
#Submitting to Kaggle:
results = pd.DataFrame({
  'PassengerId': test_df['PassengerId'],
  'Survived': pred,
})

results.to_csv('titanic_predictions.csv', index=False)

! head titanic_predictions.csv
!kaggle competitions submit -f titanic_predictions.csv -m 'Random forest submission' titanic
!kaggle competitions submissions titanic


What was your Kaggle score?

**.674**

---

## Step 4: Iterate on Your Model

In this step you're encouraged to play around with your model settings and to even try different models. See if you can get a better score. Use as many text and code blocks as you need to explore the data. Note any findings.

**Student Solution**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
# Fit and score on the training data; Y_test is only a stand-in label derived from Pclass
knn.fit(X_train, Y_train)
print('[1] K Neighbors Training Accuracy: ', knn.score(X_train, Y_train))

In [None]:
from sklearn.svm import SVC

svc_lin = SVC(kernel='linear', random_state=0)
svc_lin.fit(X_train, Y_train)

print('[2] SVC Training Accuracy: ', svc_lin.score(X_train, Y_train))

In [None]:
svc_rbf = SVC(kernel='rbf', random_state=0)
svc_rbf.fit(X_train, Y_train)
print('[3] RBF Training Accuracy: ', svc_rbf.score(X_train, Y_train))

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X_train, Y_train)
print('[4] Tree Training Accuracy: ', tree.score(X_train, Y_train))

In [None]:

from sklearn.linear_model import LogisticRegression
log = LogisticRegression(random_state= 0)
log.fit(X_train, Y_train)


print('[5] Logistic Regression Training Accuracy: ', log.score(X_train, Y_train))

---

Of the models we tried in Step 4, the decision tree gave the best score, nearly 99% accuracy. This is accuracy on the same data the model was fit on, so it is an optimistic estimate.
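
A score near 100% on the data a model was fit on is typical for an unpruned decision tree, so cross-validation gives a fairer comparison between models. The sketch below compares a few of the classifiers from this step with `cross_val_score`. It runs on synthetic data from `make_classification`, and `candidates` is a hypothetical name, so treat the printed numbers as illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be X_train and Y_train.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'knn': KNeighborsClassifier(n_neighbors=5),
    'tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'logistic': LogisticRegression(random_state=0, max_iter=1000),
}

cv_means = {}
for name, estimator in candidates.items():
    # Each fold is scored on rows the model did not see while fitting.
    fold_scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
    cv_means[name] = fold_scores.mean()
    print(f'{name}: mean CV accuracy = {cv_means[name]:.3f}')
```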