# Machine Learning: Final Project

### Predicting Survival on the *Titanic*

The final project is intended to simulate participation in a Kaggle competition. Your challenge is to build the most accurate model for predicting which passengers would survive the sinking of the *Titanic*. The ***Titanic Machine Learning Final Project.ipynb*** Colab notebook provides some guidance for tackling the project and suggests some things to think about as you get started. However, many of the model-building decisions are left up to you. 
**Note**: Use comments in your code and text blocks to explain your decisions and results.

### Build a Pipeline for a Kaggle Competition!

Kaggle was started in 2010 as a platform for machine learning competitions, which aim to identify how best to optimize supervised learning problems. These initiatives offer a two-way benefit. They help companies improve their internal algorithms and they provide prospective data professionals opportunities to prove their worth.

Although a Kaggle competition usually has the single aim of maximizing a specific metric, finding the best possible algorithm and then tuning its hyperparameters is part of a data scientist's daily work. Moreover, success on Kaggle can strengthen a future resume (since your information is saved on their site).

Obviously, the timeframe for this lesson is not realistic in terms of a typical Kaggle workflow, as competitors spend weeks or even months optimizing every piece of an algorithm they can. However, you can get started with preliminary testing and use these principles to enter your own Kaggle competitions in the future!

# Step 1: Importing Libraries

It is best practice to import all libraries and packages early in the process.

You'll probably want to import Pandas plus some packages from scikit-learn.

| Type | Path | Regression | Classification |
| --- | --- | --- | --- |
| **Linear Models** | `sklearn.linear_model` | `LinearRegression` | `LogisticRegression` |
|  |  |`Ridge` | `RidgeClassifier` |
|  |  |`Lasso` |  |
| **K Nearest Neighbors** | `sklearn.neighbors` | `KNeighborsRegressor` | `KNeighborsClassifier` |
| **Support Vector Machines** | `sklearn.svm` | `SVR` | `SVC` |
| **Naive Bayes** | `sklearn.naive_bayes` |  | `CategoricalNB` (Categorical) |
|  |  |  | `MultinomialNB` (Sentiment Analysis) |
| **Decision Trees** | `sklearn.tree` | `DecisionTreeRegressor` | `DecisionTreeClassifier` |
| **Ensemble - Random Forests** | `sklearn.ensemble` | `RandomForestRegressor` | `RandomForestClassifier` |
| **Ensemble - Boosting** | `sklearn.ensemble` | `AdaBoostRegressor` | `AdaBoostClassifier` |
|  | `sklearn.ensemble` | `GradientBoostingRegressor` | `GradientBoostingClassifier` |



| Type | Path | Class or Function |
| --- | --- | --- |
| Preprocessing | `sklearn.preprocessing` | `StandardScaler` |
| |`sklearn.preprocessing` | `MinMaxScaler` |
| |`sklearn.preprocessing` | `MaxAbsScaler` |
| Model Selection - Splitting| `sklearn.model_selection` | `train_test_split` |
| Model Selection - Grid Search | `sklearn.model_selection` | `GridSearchCV` |
| Model Selection - Scoring | `sklearn.model_selection` | `cross_val_score` |
| Metrics | `sklearn.metrics` | `confusion_matrix` |


**Note**: Use comments in your code and text blocks to explain your decisions and results.




In [169]:
# Step 1
# Data handling
import numpy as np
import pandas as pd

# Preprocessing, pipelines, and model selection
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Candidate classifiers
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Step 2: Load the `Titanic.csv` Data
You may want to refer back to one of your previous Colab notebooks to copy the Google Import code.

**Note**: Use comments in your code and text blocks to explain your decisions and results.

In [170]:
# Step 2
# Load the Titanic dataset and preview the first five rows
titanic = pd.read_csv("/content/Titanic.csv")
titanic.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [171]:
titanic.shape
# The dataset has 891 rows and 12 columns

(891, 12)

# Step 3: Split the Data

The next step is to separate the target column from the feature matrix and perform a train/test split. 

*   What is the target and what are the features in the data?
*   Are there any features that you want to drop?
*   Is there any feature engineering that you need to do?

**Note**: Use comments in your code and text blocks to explain your decisions and results.
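
As a rough illustration of separating the target from the features and splitting the rows, the sketch below uses a small made-up frame (`df` and its values are hypothetical; only the column names follow the dataset). The `stratify=y` argument is an optional choice assumed here, not something the assignment requires.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic frame; values are illustrative only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 2],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 30.07],
})

y = df["Survived"]               # target: the column to predict
X = df.drop(columns="Survived")  # features: the remaining columns

# stratify=y keeps the class ratio similar in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=55
)
print(X_train.shape, X_test.shape)
```

Splitting before any imputation or scaling keeps information from the test rows out of the preprocessing statistics.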

 

In [172]:
# Step 3
# Separate the target from the features, dropping the columns that will not be used
y = titanic["Survived"]
X = titanic.drop(columns=["PassengerId", "Survived", "Name", "Ticket", "Cabin"])
X.head()

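|  | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |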


In [None]:
# The target is Survived.
# The candidate features are PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
# Yes: I dropped PassengerId, Name, Ticket, and Cabin from the features, and removed Survived because it is the target.

In [173]:
categ = ["Embarked", "Sex", "Pclass", "SibSp", "Parch"]


In [174]:
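# Note: pd.get_dummies(["Sex"]) encodes the one-element list ["Sex"], not the column X["Sex"].
# The concat and drop that follow therefore remove Sex without adding indicator columns
# (see the printed columns). The same pattern repeats for Embarked, Pclass, SibSp, and Parch.
cat = pd.get_dummies(["Sex"])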

In [175]:
X1=pd.concat([X,cat],axis=1)

In [176]:
X=X1.drop(columns =["Sex"],axis=1)
print(X.columns)

Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], dtype='object')


In [177]:
cat = pd.get_dummies(["Embarked"])

In [178]:
X1=pd.concat([X,cat],axis=1)

In [179]:
X=X1.drop(columns =["Embarked"],axis=1)
print(X.columns)

Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')


In [180]:
cat = pd.get_dummies(["Pclass"])

In [181]:
X1=pd.concat([X,cat],axis=1)

In [182]:
X=X1.drop(columns =["Pclass"],axis=1)
X = X.rename(columns={1:"1stclass", 2:"2ndclass",3:"3rdclass"})
print(X.columns)

Index(['Age', 'SibSp', 'Parch', 'Fare'], dtype='object')


In [183]:
cat = pd.get_dummies(["SibSp"])

In [184]:
X1=pd.concat([X,cat],axis=1)

In [185]:
X=X1.drop(columns =["SibSp"],axis=1)
print(X.columns)

Index(['Age', 'Parch', 'Fare'], dtype='object')


In [186]:
cat = pd.get_dummies(["Parch"])

In [187]:
X1=pd.concat([X,cat],axis=1)

In [188]:
X=X1.drop(columns =["Parch"],axis=1)
print(X.columns)

Index(['Age', 'Fare'], dtype='object')
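
The printed columns show that only `Age` and `Fare` remain, because `pd.get_dummies` was given a list holding each column name rather than the column itself. A minimal sketch of encoding the columns directly is shown here on a small made-up frame (`toy` and its values are hypothetical; the column names follow the dataset). It is one possible approach, not the notebook's recorded run.

```python
import pandas as pd

# Hypothetical miniature of the feature matrix; values are illustrative only.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", "S", "Q"],
})

# Pass the DataFrame and name the columns to encode. Each listed column is
# replaced by prefixed indicator columns (for example Sex_female, Sex_male).
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked", "Pclass"])

print(encoded.columns.tolist())
```

Because the encoding happens in one call, there is no need for a separate `concat` and `drop` per column, and untouched columns such as `Age` are carried through unchanged.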


# Step 4: Clean and Preprocess the Data

Use the code block below to clean and preprocess your data. Some considerations you may want to think about include the following:  
*  Are there any missing values that need to be imputed?
*  Do you need to encode any categorical features?
*  Do you need to standardize any quantitative features?
 
**Note**: Use comments in your code and text blocks to explain your decisions and results.

 

In [189]:
# Step 4
# Check for missing values. They are imputed with SimpleImputer, and the features
# are standardized, inside the model pipelines built in the later steps.
X.isna().sum()
X["Age"].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: Age, Length: 88, dtype: int64

In [190]:
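# Note: without parentheses this only displays the bound method and drops nothing;
# the missing ages are instead filled by SimpleImputer inside the pipelines.
X.dropna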

<bound method DataFrame.dropna of       Age     Fare
0    22.0   7.2500
1    38.0  71.2833
2    26.0   7.9250
3    35.0  53.1000
4    35.0   8.0500
..    ...      ...
886  27.0  13.0000
887  19.0  30.0000
888   NaN  23.4500
889  26.0  30.0000
890  32.0   7.7500

[891 rows x 2 columns]>

In [191]:

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25, random_state=55)

In [192]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
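
When both numeric and categorical columns are kept, imputation, encoding, and scaling can be grouped per column type. The sketch below is one way to do that with `ColumnTransformer`, under the assumption that `Age` and `Fare` are treated as numeric and `Sex` and `Embarked` as categorical; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

numeric = ["Age", "Fare"]
categorical = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the column mean, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    # Categorical columns: fill gaps with the most frequent value, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

Placing a transformer like this as the first step of a model pipeline means the imputation and scaling statistics are learned from the training folds only during cross-validation.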

# Step 5: Build the Baseline Model

Ideally, you will want to set a baseline algorithm to build off of. The most logical start is *linear regression* for *regression* and *logistic regression* for *classification*, as they are the basis for their respective algorithms.

Once you have the baseline set, you will want to choose an algorithm that surpasses the baseline.

Select a baseline model and fit it to your data.

**Note**: Use comments in your code and text blocks to explain your decisions and results.
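
For comparison with whichever baseline you pick, a logistic regression baseline of the kind described above might look like the following sketch. It runs on synthetic data from `make_classification`, which stands in for the prepared Titanic features; the names `X_demo`, `y_demo`, and `baseline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the prepared features.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=55)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=55
)

# Scale first, then fit the linear classifier.
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_tr, y_tr)

print("Baseline training accuracy:", baseline.score(X_tr, y_tr))
```

Any later model can then be judged by whether it beats this score under the same evaluation procedure.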



In [193]:

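# Step 5
# Baseline pipeline: impute missing values with the column mean, standardize,
# then fit a decision tree classifier.
dtree = Pipeline([('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
                  ('scaler', StandardScaler()),
                  ('model', DecisionTreeClassifier(random_state=55))])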

# Step 6: Evaluate the Baseline Model

Use cross-validation to calculate the appropriate model evaluation metric. 

Is your model doing a good job fitting the data?  

If you have ideas for how to improve your model fit, go back and make those changes to earlier steps.

**Note**: Use comments in your code and text blocks to explain your decisions and results.
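
One way to judge fit is to compare cross-validated accuracy with training accuracy. The sketch below does this for an unpruned decision tree on synthetic, deliberately noisy data (`X_demo`, `y_demo`, and the `flip_y=0.2` noise level are assumptions for illustration, not the Titanic data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that memorizing it does not generalize.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, flip_y=0.2, random_state=55
)

tree_model = DecisionTreeClassifier(random_state=55)

# cross_val_score fits a fresh clone on each of the 10 training folds.
cv_scores = cross_val_score(tree_model, X_demo, y_demo, cv=10, scoring="accuracy")

# Fit on all rows and score on the same rows to get the training accuracy.
train_score = tree_model.fit(X_demo, y_demo).score(X_demo, y_demo)

print("CV accuracy: mean", cv_scores.mean(), "std", cv_scores.std())
print("Training accuracy:", train_score)
```

A training score well above the cross-validated mean suggests overfitting, while low scores on both suggest underfitting.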


In [194]:
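# Step 6
# The decision tree baseline overfits: the training accuracy below (about 0.95) is far
# higher than the cross-validated (about 0.65) and test (about 0.56) accuracy.
dtree.fit(X_train,y_train)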

Pipeline(steps=[('imp_mean', SimpleImputer()), ('scaler', StandardScaler()),
                ('model', DecisionTreeClassifier(random_state=55))])

In [196]:
scorescv = cross_val_score(dtree, X_train, y_train, cv=10)


In [197]:
print('DecisionTreeaccuracy',scorescv.mean())

DecisionTreeaccuracy 0.6497286295793758


In [198]:
print("Train",dtree.score(X_train,y_train))

Train 0.9520958083832335


In [199]:
print("Test",dtree.score(X_test,y_test))

Test 0.5560538116591929


# Step 7: Fit the Data to at Least One Other Model

Select one (or more) other appropriate model and use it to model the data. Calculate the cross-validation accuracy of each model. 

**Note**: Use comments in your code and text blocks to explain your decisions and results.
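
When several models are compared, a loop over named estimators helps ensure that each reported score belongs to the model named beside it. This is a sketch on synthetic data; `candidates`, `results`, and the `n_estimators=50` setting are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared training set.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=55)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=55),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=55),
    "gradient_boosting": GradientBoostingClassifier(random_state=55),
}

# Score each estimator under the same 5-fold split and keep mean and std.
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(name, results[name])
```

Fixing `random_state` on each estimator makes the comparison repeatable from run to run.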

In [200]:
# Using Random Forest Classifier
Random= Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('model', RandomForestClassifier())])

In [201]:
Random.fit(X_train,y_train)

Pipeline(steps=[('imp_mean', SimpleImputer()), ('scaler', StandardScaler()),
                ('model', RandomForestClassifier())])

In [204]:
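# Note: this call scores dtree, not the Random pipeline, so the value printed below
# repeats the decision tree's cross-validated accuracy; pass Random to score the forest.
scores = cross_val_score(dtree, X_train, y_train, cv=10)
print('Random',scores.mean())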

Random 0.6497286295793758


In [205]:
print("Train", Random.score(X_train,y_train))

Train 0.9520958083832335


In [165]:
print("Test",Random.score(X_test,y_test))

Test 0.5964125560538116


In [214]:
# Using gradient boosting classifier
gradboost = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('model', GradientBoostingClassifier(random_state=55))])

In [208]:
Pipeline(steps=[('imp_mean', SimpleImputer()), ('scaler', StandardScaler()),
                ('model', GradientBoostingClassifier(random_state=55))])

In [209]:
scores = cross_val_score(gradboost, X_train, y_train, cv=10)

In [215]:
print('GradientBoostClassifieraccuracy',scores.mean(),scores.std())

GradientBoostClassifieraccuracy 0.6990954319312528 0.04972339996306941


In [216]:
gradboost = gradboost.fit(X_train,y_train)

In [217]:
print('Train', gradboost.score(X_train,y_train))

Train 0.812874251497006


In [218]:
print('Test', gradboost.score(X_test,y_test))

Test 0.600896860986547
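
The gap between training and test accuracy suggests that tuning may be worth trying. The tables in Step 1 list `GridSearchCV`; a minimal sketch of using it on a pipeline follows, run on synthetic data. The grid values and the names `pipe_demo`, `param_grid`, and `search` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data standing in for the prepared training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=55)

pipe_demo = Pipeline([
    ("imp_mean", SimpleImputer(strategy="mean")),
    ("model", GradientBoostingClassifier(random_state=55)),
])

# "<step name>__<parameter>" addresses a parameter of a step inside the pipeline.
param_grid = {
    "model__n_estimators": [25, 50],
    "model__max_depth": [2, 3],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(pipe_demo, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

The search should be fit on the training split only, so that the test set remains an untouched final check.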


# Step 8: Evaluate Your Best Model

Evaluate your best model using the test set. 

*   Which model fit the data best?
*   What was the best accuracy you were able to achieve?  

**Note**: Use comments in your code and text blocks to explain your decisions and results.
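
A held-out evaluation could be sketched as follows, using `accuracy_score` together with the `confusion_matrix` function listed in Step 1. The data is synthetic and `best_model` is a hypothetical stand-in for whichever model you select.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features and target.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=55)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=55
)

# Fit on the training split only, then predict the held-out rows.
best_model = GradientBoostingClassifier(random_state=55).fit(X_tr, y_tr)
y_pred = best_model.predict(X_te)

test_accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class

print("Test accuracy:", test_accuracy)
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy; the off-diagonal cells show which kind of error is more common.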

In [None]:
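# Step 8
# Of the three models, gradient boosting fit the data best.
# It reached about 81% accuracy on the training set, about 70% in 10-fold
# cross-validation, and about 60% on the test set.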

# Step 9: Final Reporting

Summarize your model building process:  
* How did you identify the model target and features?  
* What steps did you take to prepare the data for modeling?  
* Which baseline model did you choose and why? How did you evaluate the model's performance?  
* Which other model(s) did you choose and why? How did you evaluate the model's performance?  
* What was the best model you developed? How well did the model perform on the test data?

**Step 9**

1. The target is `Survived`, since we are building a model to predict which passengers would survive the sinking of the *Titanic*. The remaining columns are the candidate features.

2. The data was prepared and modeled in these steps:
   *   Step 1: Import the relevant libraries.
   *   Step 2: Load the dataset.
   *   Step 3: Split the data into the target and the other features (`y` and `X`), and look for missing values.
   *   Step 4: Clean and preprocess the data using `SimpleImputer` and standardizing with `StandardScaler()`.
   *   Step 5: Build the models: name each pipeline, fit it on `X_train` and `y_train`, calculate the cross-validated accuracy on the training data with `cross_val_score` (mean and standard deviation), and score each fitted model on the train and test sets.

3. I chose the decision tree classifier as the baseline because we are dealing with categorical data. I evaluated the model's performance using `cross_val_score` and the mean and standard deviation of the scores.

4. The other models were random forest and gradient boosting classifiers, evaluated as above.

5. Gradient boosting fit best, with about 60% accuracy on the test set.



