
# Machine Learning: Final Project

### Predicting Survival on the *Titanic*

The final project is intended to simulate participation in a Kaggle competition. Your challenge is to build the most accurate model for predicting which passengers would survive the sinking of the *Titanic*. The ***Titanic Machine Learning Final Project.ipynb*** Colab notebook provides some guidance for tackling the project and suggests some things to think about as you get started. However, many of the model-building decisions are left up to you. 
**Note**: Use comments in your code and text blocks to explain your decisions and results.

### Build a Pipeline for a Kaggle Competition!

Kaggle was started in 2010 as a platform for machine learning competitions, which aim to identify how best to optimize supervised learning problems. These initiatives offer a two-way benefit. They help companies improve their internal algorithms and they provide prospective data professionals opportunities to prove their worth.

Although a Kaggle competition usually has the single aim of maximizing a specific metric, finding the best possible algorithm and then optimizing its hyperparameters is the daily task of a data scientist. Moreover, success on Kaggle can strengthen a future resume (since your information is saved on their site).

Obviously, the timeframe for this lesson is not realistic in terms of a typical Kaggle workflow, as competitors spend weeks or even months optimizing every piece of an algorithm they can. However, you can get started with preliminary testing and use these principles to enter your own Kaggle competitions in the future!

# Step 1: Importing Libraries

It is best practice to import all libraries and packages early in the process.

You'll probably want to import Pandas plus some packages from scikit-learn.

| Type | Path | Regression | Classification |
| --- | --- | --- | --- |
| **Linear Models** | `sklearn.linear_model` | `LinearRegression` | `LogisticRegression` |
|  |  |`Ridge` | `RidgeClassifier` |
|  |  |`Lasso` |  |
| **K Nearest Neighbors** | `sklearn.neighbors` | `KNeighborsRegressor` | `KNeighborsClassifier` |
| **Support Vector Machines** | `sklearn.svm` | `SVR` | `SVC` |
| **Naive Bayes** |  `sklearn.naive_bayes` |  |`CategoricalNB` (Categorical) |
|  |  |  | `MultinomialNB` (Sentiment Analysis) |
| **Decision Trees** | `sklearn.tree` | `DecisionTreeRegressor` | `DecisionTreeClassifier` |
| **Ensemble - Random Forests** | `sklearn.ensemble` | `RandomForestRegressor` | `RandomForestClassifier` |
| **Ensemble - Boosting** | `sklearn.ensemble` | `AdaBoostRegressor` | `AdaBoostClassifier` |
|  | `sklearn.ensemble` | `GradientBoostingRegressor` | `GradientBoostingClassifier` |



| Type | Path | Package |
| --- | --- | --- |
| Preprocessing | `sklearn.preprocessing` | `StandardScaler` |
| |`sklearn.preprocessing` | `MinMaxScaler` |
| |`sklearn.preprocessing` | `MaxAbsScaler` |
| Model Selection - Splitting| `sklearn.model_selection` | `train_test_split` |
| Model Selection - Grid Search | `sklearn.model_selection` | `GridSearchCV` |
| Model Selection - Scoring | `sklearn.model_selection` | `cross_val_score` |
| Metrics | `sklearn.metrics` | `confusion_matrix` |


**Note**: Use comments in your code and text blocks to explain your decisions and results.




In [20]:
# Step 1: Import libraries

# Linear algebra
import numpy as np

# Data processing
import pandas as pd

# Data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Model selection and metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error as MSE

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB


# Step 2: Load the `Titanic.csv` Data
You may want to refer back to one of your previous Colab notebooks to copy the Google Import code.

**Note**: Use comments in your code and text blocks to explain your decisions and results.

In [21]:
from google.colab import files
titanic1 = files.upload()

Saving titanic1.csv to titanic1 (1).csv


In [22]:
# Create a Pandas DataFrame from the CSV file and call it "df"
df = pd.read_csv('titanic1.csv')

In [23]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [24]:
# Encode the sex column as numeric: 1 for female and 0 for male.
# Assigning through .loc writes to the DataFrame itself and avoids the
# SettingWithCopyWarning that chained indexing (df.sex[...] = ...) triggers.
df.loc[df['sex'] == 'male', 'sex'] = 0
df.loc[df['sex'] == 'female', 'sex'] = 1
print(df)

      pclass  survived                                             name sex  \
0          1         1                    Allen, Miss. Elisabeth Walton   1   
1          1         1                   Allison, Master. Hudson Trevor   0   
2          1         0                     Allison, Miss. Helen Loraine   1   
3          1         0             Allison, Mr. Hudson Joshua Creighton   0   
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   1   
...      ...       ...                                              ...  ..   
1304       3         0                             Zabour, Miss. Hileni   1   
1305       3         0                            Zabour, Miss. Thamine   1   
1306       3         0                        Zakarian, Mr. Mapriededer   0   
1307       3         0                              Zakarian, Mr. Ortin   0   
1308       3         0                               Zimmerman, Mr. Leo   0   

         age  sibsp  parch  ticket      fare    cab
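As a minimal sketch, the same kind of label encoding can also be written with `Series.map`, which builds the new column in one step instead of assigning into a filtered selection. The toy frame and its values below are made up for illustration and are not taken from `titanic1.csv`.

```python
import pandas as pd

# Toy data standing in for the real file (illustrative values only)
toy = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Map each label to an integer; any label missing from the dict would become NaN
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

print(toy["sex"].tolist())
```

Because `map` returns a complete new column, the result is assigned back to the DataFrame directly and no chained indexing is involved.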


In [25]:
# Encode the embarked column as numeric: S = 1, C = 0, Q = 2
df.loc[df['embarked'] == 'S', 'embarked'] = 1
df.loc[df['embarked'] == 'C', 'embarked'] = 0
df.loc[df['embarked'] == 'Q', 'embarked'] = 2

In [26]:
print(df)

      pclass  survived                                             name sex  \
0          1         1                    Allen, Miss. Elisabeth Walton   1   
1          1         1                   Allison, Master. Hudson Trevor   0   
2          1         0                     Allison, Miss. Helen Loraine   1   
3          1         0             Allison, Mr. Hudson Joshua Creighton   0   
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   1   
...      ...       ...                                              ...  ..   
1304       3         0                             Zabour, Miss. Hileni   1   
1305       3         0                            Zabour, Miss. Thamine   1   
1306       3         0                        Zakarian, Mr. Mapriededer   0   
1307       3         0                              Zakarian, Mr. Ortin   0   
1308       3         0                               Zimmerman, Mr. Leo   0   

         age  sibsp  parch  ticket      fare    cab

# Step 3: Split the Data

The next step is to separate the target column from the feature matrix and perform a train/test split. 

*   What is the target and what are the features in the data?
*   Are there any features that you want to drop?
*   Is there any feature engineering that you need to do?

**Note**: Use comments in your code and text blocks to explain your decisions and results.
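
One possible way to organize the split is sketched below on synthetic data: hold out a test set first, then divide the remainder into training and validation sets. The names `X_demo` and `y_demo`, the split sizes, and the use of `stratify` are assumptions for illustration, not requirements of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, binary target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.integers(0, 2, size=100), name="target")

# First hold out a test set (25%), stratifying to preserve the class balance
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Then split the remaining rows into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=42, stratify=y_train_val
)

print(X_train.shape, X_val.shape, X_test.shape)
```

The second call splits `X_train_val` rather than the full data, so the test rows never leak into training or validation.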

 

In [27]:
#dropping columns boat and body
df = df.drop(['boat', 'body'], axis = 1)

In [28]:
# Step 3: Separate the target variable "survived" from the feature columns

y = df[['survived']]
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'embarked', 'home.dest']]

In [29]:
# Split the data into a training/validation set and a test set

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42) 


In [30]:
print(X_train_val.shape)
print(X_test.shape)

(981, 9)
(328, 9)


In [31]:
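# Split the training/validation data again into training and validation sets,
# using a random state of 42 and a validation size of 0.333.
# Splitting X_train_val (not X) leaves the test set from the previous split untouched.

X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.333, random_state=42)

In [32]:
print(X_train.shape)
print(X_val.shape)

(654, 9)
(327, 9)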


In [33]:
X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,ticket,fare,embarked,home.dest
0,1,1,29.0,0,0,24160,211.3375,1,"St Louis, MO"
1,1,0,0.9167,1,2,113781,151.55,1,"Montreal, PQ / Chesterville, ON"
2,1,1,2.0,1,2,113781,151.55,1,"Montreal, PQ / Chesterville, ON"
3,1,0,30.0,1,2,113781,151.55,1,"Montreal, PQ / Chesterville, ON"
4,1,1,25.0,1,2,113781,151.55,1,"Montreal, PQ / Chesterville, ON"


In [34]:
y.head()

Unnamed: 0,survived
0,1
1,1
2,0
3,0
4,0


In [35]:
# Redefine the features without the text columns (ticket and home.dest)
X = df[['pclass', 'sex', 'age', 'fare', 'sibsp', 'parch', 'embarked']]
y = df['survived']


In [36]:
# Before moving to encoding, fill missing "embarked" values with the most common port from which people boarded the Titanic
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Step 4: Clean and Preprocess the Data

Use the code block below to clean and preprocess your data. Some considerations you may want to think about include the following:  
*  Are there any missing values that need to be imputed?
*  Do you need to encode any categorical features?
*  Do you need to standardize any quantitative features?
 
**Note**: Use comments in your code and text blocks to explain your decisions and results.
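
The three considerations above (imputing, encoding, standardizing) can be combined in one preprocessing object. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer`. The toy rows, the column groupings, and the choice of median and most-frequent imputation are assumptions for illustration, not decisions taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with the same kinds of gaps as the Titanic data (values are made up)
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3, 151.6, np.nan, 7.9],
    "embarked": ["S", "C", np.nan, "Q"],
})

numeric_cols = ["age", "fare"]
categorical_cols = ["embarked"]

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent label, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

# Two scaled numeric columns plus one indicator column per port
X_clean = preprocess.fit_transform(toy)
print(X_clean.shape)
```

Placing these steps inside a pipeline means the imputation values and scaling statistics are learned from the training rows only when the pipeline is later cross-validated.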

 

# Step 5: Build the Baseline Model

Ideally, you will want to set a baseline algorithm to build off of. The most logical start is *linear regression* for *regression* and *logistic regression* for *classification*, as they are the basis for their respective algorithms.

Once you have the baseline set, you will want to choose an algorithm that surpasses the baseline.

Select a baseline model and fit it to your data.

**Note**: Use comments in your code and text blocks to explain your decisions and results.



In [37]:
# Step 5: Baseline model using logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Instantiate the baseline classifier (it is not fitted in this cell)
logreg = LogisticRegression()


# Step 6: Evaluate the Baseline Model

Use cross-validation to calculate the appropriate model evaluation metric. 

Is your model doing a good job fitting the data?  

If you have ideas for how to improve your model fit, go back and make those changes to earlier steps.

**Note**: Use comments in your code and text blocks to explain your decisions and results.
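
As a rough sketch of what cross-validated evaluation can look like, the block below scores a logistic regression baseline with 5-fold accuracy and compares it with a majority-class classifier. The synthetic data from `make_classification`, the `DummyClassifier` reference point, and `max_iter=1000` are assumptions for illustration. On the real data, a preprocessed feature matrix would take the place of `X_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the prepared training set
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# A majority-class classifier gives the accuracy any useful model has to beat
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=5, scoring="accuracy"
)

# Logistic regression baseline, scored with the same 5-fold scheme
logreg_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print(f"Majority class:      {dummy_scores.mean():.3f}")
print(f"Logistic regression: {logreg_scores.mean():.3f} +/- {logreg_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean helps judge whether a later improvement is larger than fold-to-fold noise.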


In [51]:
# Inspect column dtypes and non-null counts before evaluating the baseline model
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1051 non-null   object 
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      335 non-null    object 
 10  embarked   1309 non-null   int64  
 11  home.dest  747 non-null    object 
dtypes: float64(1), int64(5), object(6)
memory usage: 122.8+ KB


In [52]:
df = df.drop(['fare'], axis = 1)

In [58]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
pclass,1309.0,,,,2.294882,0.837836,1.0,2.0,3.0,3.0,3.0
survived,1309.0,,,,0.381971,0.486055,0.0,0.0,0.0,1.0,1.0
name,1309.0,1307.0,"Connolly, Miss. Kate",2.0,,,,,,,
sex,1309.0,2.0,0.0,843.0,,,,,,,
age,1051.0,99.0,24,47.0,,,,,,,
sibsp,1309.0,,,,0.498854,1.041658,0.0,0.0,0.0,1.0,8.0
parch,1309.0,,,,0.385027,0.86556,0.0,0.0,0.0,0.0,9.0
ticket,1309.0,929.0,CA. 2343,11.0,,,,,,,
cabin,335.0,186.0,?,41.0,,,,,,,
embarked,1309.0,,,,0.887701,0.536505,0.0,1.0,1.0,1.0,2.0


In [62]:
print(f"How many 'S' on embarked column : {df[df['embarked'] == 'S'].shape[0]}")
print(f"How many 'C' on embarked column : {df[df['embarked'] == 'C'].shape[0]}")
print(f"How many 'Q' on embarked column : {df[df['embarked'] == 'Q'].shape[0]}")

How many 'S' on embarked column : 0
How many 'C' on embarked column : 0
How many 'Q' on embarked column : 0


# Step 7: Fit the Data to at Least One Other Model

Select one (or more) other appropriate model and use it to model the data. Calculate the cross-validation accuracy of each model. 

**Note**: Use comments in your code and text blocks to explain your decisions and results.
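
A minimal sketch of comparing candidate models by cross-validation accuracy is shown below. The synthetic data, the choice of a random forest as the second model, and the dictionary names `candidates` and `cv_accuracy` are assumptions for illustration. Any classifier from the Step 1 table could be placed in the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared Titanic features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Candidate models, keyed by a readable name
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every candidate with the same 5-fold cross-validation
cv_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")

# Pick the model with the highest mean cross-validation accuracy
best_name = max(cv_accuracy, key=cv_accuracy.get)
print("Best by CV accuracy:", best_name)
```

Using the same folds and metric for every candidate keeps the comparison fair.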

In [39]:
# Step 7
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, confusion_matrix, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, cross_validate

In [40]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",1,29.0,0,0,24160,211.3375,B5,1,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",0,0.9167,1,2,113781,151.55,C22 C26,1,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",1,2.0,1,2,113781,151.55,C22 C26,1,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",0,30.0,1,2,113781,151.55,C22 C26,1,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",1,25.0,1,2,113781,151.55,C22 C26,1,"Montreal, PQ / Chesterville, ON"


In [41]:
df.loc[df['embarked'] == 'S', 'embarked'] = 1
df.loc[df['embarked'] == 'C', 'embarked'] = 0
df.loc[df['embarked'] == 'Q', 'embarked'] = 2

In [42]:
# The target is "survived"; the features are the numeric columns selected in Step 3
y = df['survived']
X = df[['pclass', 'sex', 'age', 'fare', 'sibsp', 'parch', 'embarked']]

In [43]:
X.head()

Unnamed: 0,pclass,sex,age,fare,sibsp,parch,embarked
0,1,1,29.0,211.3375,0,0,1
1,1,0,0.9167,151.55,1,2,1
2,1,1,2.0,151.55,1,2,1
3,1,0,30.0,151.55,1,2,1
4,1,1,25.0,151.55,1,2,1


In [44]:
y.head()

0    1
1    1
2    0
3    0
4    0
Name: survived, dtype: int64

In [45]:
#Step 7

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [46]:
df.describe()

Unnamed: 0,pclass,survived,sibsp,parch,fare,embarked
count,1309.0,1309.0,1309.0,1309.0,1308.0,1309.0
mean,2.294882,0.381971,0.498854,0.385027,33.295479,0.887701
std,0.837836,0.486055,1.041658,0.86556,51.758668,0.536505
min,1.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0,7.8958,1.0
50%,3.0,0.0,0.0,0.0,14.4542,1.0
75%,3.0,1.0,1.0,0.0,31.275,1.0
max,3.0,1.0,8.0,9.0,512.3292,2.0


In [49]:

# Pipeline: impute missing values with the column mean, standardize, then fit logistic regression
pipe = Pipeline([('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
                 ('scaler', StandardScaler()),
                 ('log_reg', LogisticRegression(random_state=0))])

pipe.fit(X_train, y_train)

# Step 8: Evaluate Your Best Model

Evaluate your best model using the test set. 

*   Which model fit the data best?
*   What was the best accuracy you were able to achieve?  

**Note**: Use comments in your code and text blocks to explain your decisions and results.
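
The final check on held-out data might look like the sketch below, which fits a model on the training rows and reports accuracy and a confusion matrix on the test rows. The synthetic data and the random forest standing in as the "best" model are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared Titanic features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Keep 25% of the rows aside; they are used only for this final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Refit the chosen model on all training rows
best_model = RandomForestClassifier(n_estimators=100, random_state=0)
best_model.fit(X_train, y_train)

# Score once on the untouched test set
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Test accuracy: {test_accuracy:.3f}")
print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)
```

The test set is scored only once, after model selection is finished, so the reported accuracy is not inflated by repeated tuning against it.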

In [None]:
# Step 8

# Step 9: Final Reporting

Summarize your model building process:  
* How did you identify the model target and features?  
* What steps did you take to prepare the data for modeling?  
* Which baseline model did you choose and why? How did you evaluate the model's performance?  
* Which other model(s) did you choose and why? How did you evaluate the model's performance?  
* What was the best model you developed? How well did the model perform on the test data?

#Step 9: