# Titanic - Machine Learning from Disaster

## Importing Packages
Python packages are collections of modules that can be installed into your Python environment. [Pandas](https://pandas.pydata.org/) and [Scikit-Learn](https://scikit-learn.org/stable/index.html) are popular packages for machine learning and data science.

### Pandas
Pandas is a well-known Python library for creating, analyzing, and modifying DataFrames. We will use it for all of our data manipulation as we build our model.
We use the 'as' keyword to give the module a shorter alias, `pd`. You will see this alias later whenever we call functions from the pandas module.

In [1]:
import pandas as pd

### Scikit-Learn
Scikit-learn (imported as `sklearn`) is a free machine learning library that provides models and functions to help us train, modify, and evaluate our model.
We use the 'from' keyword to import only the names we need. Instead of importing the entire scikit-learn library, we import just [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). We will cover this model, and how we measure it with roc_auc_score, later.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

### Python Standard Library - os
This module provides a portable way of using operating-system-dependent functionality. We will use it to build file paths for file I/O.

In [3]:
import os

## Loading the Data
We will demonstrate how to locate a data file with os and read it into Python with pandas.

### Path
We start by using os to get the absolute path of our data file. This tells Python exactly where the file is located. The absolute path to the data folder will differ between your computer and your teammates' computers, so we build it at run time instead of hard-coding it. `os.path.abspath("")` returns the current working directory, which for a notebook is normally the folder it lives in. Since "train.csv" is in a different directory from our Jupyter notebook, we step up to the parent directory with `os.path.dirname` and then append the location of "train.csv" with `os.path.join`.

In [4]:
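root_dir = os.path.abspath("")        # current working directory (the notebook's folder)
root_dir = os.path.dirname(root_dir)  # step up to the parent directory
data_dir = os.path.join("data", "train.csv")  # relative path built with the OS's own separator
data_dir = os.path.join(root_dir, data_dir)
print(data_dir)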

c:\Users\jmccutcheon\Boot Camp Github\bootcamp\titanic\data\train.csv


### DataFrame
Using the path above, we can call the [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function from pandas. It reads the data in "train.csv" and returns a DataFrame with labeled axes. You can see the first five rows of the DataFrame by calling the `head()` function. This is the data we will use to train our model.


In [5]:
train_data = pd.read_csv(data_dir)
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
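
The `Unnamed: 0` column in the preview is the name pandas gives to a column whose header is empty, which typically happens when a file was saved together with its row index. As a minimal sketch (using an in-memory CSV built from two rows of the preview rather than the real file), the `index_col` parameter of `read_csv` turns that column into the DataFrame index instead:

```python
import io

import pandas as pd

# Two rows from the preview above, with an empty header for the first column.
csv_text = """,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0
"""

# Default behaviour: the header-less column is kept and named "Unnamed: 0".
with_default = pd.read_csv(io.StringIO(csv_text))
print(list(with_default.columns))

# index_col=0 uses the first column as the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(with_index.columns))
print(with_index.head())
```

Whether to keep or drop the column is a matter of taste here, since it is not one of the features used later.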


## Setting up the model
We will now walk through the basic procedure for creating a model with pandas and sklearn.


### Y variable and features
Before we can create our model, we must first set up our data for training and testing. We start by identifying our y variable, which is the variable we want our model to predict. In this data set, our y variable is the "Survived" column.



In [6]:
y = train_data["Survived"]

Now that we have our y variable, we must decide which features our model will use to make its predictions. Features are the columns of our dataset that the model takes as input. We could use every column, but to keep this model simple we'll use just four ("Pclass", "Sex", "SibSp", "Parch"). We will store the names of these four features in a list.

In [7]:
features = ["Pclass", "Sex", "SibSp", "Parch"]
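
To make the difference between the y variable and the features concrete, here is a small sketch on a toy DataFrame. The name `toy` and the three rows (copied from the training preview) are only for illustration. Indexing with a single column name returns a Series, while indexing with a list of names returns a DataFrame holding just those columns:

```python
import pandas as pd

# Toy stand-in for train_data, using values from the first three preview rows.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
})

features = ["Pclass", "Sex", "SibSp", "Parch"]

y_toy = toy["Survived"]  # one column name -> Series (the label)
X_toy = toy[features]    # list of names   -> DataFrame (the inputs)

print(type(y_toy).__name__, y_toy.shape)
print(type(X_toy).__name__, X_toy.shape)
print(list(X_toy.columns))  # "Fare" is left out because it is not in the list
```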

### Load Testing Data
Earlier we loaded our training data from "train.csv"; we will follow the same procedure to load our testing data from "test.csv". This is the data we will ask our trained model to make predictions on. Note that, unlike the training data, it has no "Survived" column.

In [8]:
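root_dir = os.path.abspath("")        # current working directory (the notebook's folder)
root_dir = os.path.dirname(root_dir)  # step up to the parent directory
data_dir = os.path.join("data", "test.csv")  # relative path built with the OS's own separator
data_dir = os.path.join(root_dir, data_dir)
print(data_dir)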

test_data = pd.read_csv(data_dir)
test_data.head()

c:\Users\jmccutcheon\Boot Camp Github\bootcamp\titanic\data\test.csv


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


### Data Conversion
Now that we have loaded our training and testing data, we will filter each into a new DataFrame containing only the features we chose earlier. These are the DataFrames we will train and test our model on. However, our features are a mix of categorical and numerical data ("Sex" holds text values), and our model needs numerical data only. To fix this, we convert the categorical data to numeric with the pandas function `get_dummies()`. This function creates one new column for each unique value of a categorical column and marks each row with a 1 or 0 (displayed as True or False in pandas 2.0 and later). For more, see [get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).

`pd.get_dummies()`

In [9]:
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
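
The sketch below shows what `get_dummies` does on two toy frames (hypothetical data, not the Titanic files). Because the two calls run independently, a category missing from one frame produces a missing column there. One possible way to guard against that is to reindex the test columns to match the training columns. This would only matter if a category appeared in one file but not the other.

```python
import pandas as pd

# Toy frames; the test frame deliberately contains no "female" rows.
train_toy = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]})
test_toy = pd.DataFrame({"Pclass": [3, 2], "Sex": ["male", "male"]})

# Numeric columns pass through; each text category becomes its own column.
X_train_toy = pd.get_dummies(train_toy)
X_test_raw = pd.get_dummies(test_toy)
print(list(X_train_toy.columns))
print(list(X_test_raw.columns))

# Give the test frame the same columns, in the same order, as the training frame.
# Columns that were missing are filled with 0.
X_test_toy = X_test_raw.reindex(columns=X_train_toy.columns, fill_value=0)
print(list(X_test_toy.columns))
```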

## RandomForestClassifier Model
We will now finally create our model. As mentioned before, we will use a RandomForestClassifier. This type of model builds many decision trees and combines their predictions. Each decision tree uses the features we pass in to make a prediction for our y variable, in this case "Survived". Once all the decision trees have made their decision, the prediction with the most votes across the trees is the result returned by the RandomForestClassifier.\
 **Note** - the RandomForestClassifier has many adjustable parameters, all of which are documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [10]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

## Training and Testing 
We will now train the model with our training features and our y variable. This is easily done using sklearn's `fit()` function.

In [11]:
model.fit(X,y)

RandomForestClassifier(max_depth=5, random_state=1)

Finally, we run the model on the testing data and save the output as our predictions variable.

In [12]:
predictions = model.predict(X_test)
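
As a rough illustration of how the forest combines its trees, the sketch below fits the same kind of model on synthetic data (an assumption for the example, not the Titanic data). `predict()` returns hard 0/1 labels, while `predict_proba()` returns class probabilities. In scikit-learn the forest's probabilities are the average of the individual trees' probabilities, which is a soft version of the majority vote described earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 4 numeric features, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
forest.fit(X_demo, y_demo)

# One fitted decision tree per estimator.
print(len(forest.estimators_))

# Hard labels versus class probabilities for the first three rows.
print(forest.predict(X_demo[:3]))
print(forest.predict_proba(X_demo[:3]))

# Averaging the trees' probabilities reproduces the forest's probabilities.
tree_probs = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)
forest_probs = forest.predict_proba(X_demo)
print(np.allclose(tree_probs, forest_probs))
```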

## Results
Now that we have trained our model and made predictions, we can evaluate it using the roc_auc_score. The roc_auc_score measures the quality of our model by taking the area under its Receiver Operating Characteristic curve (ROC curve). The ROC curve is drawn by plotting the True Positive Rate against the False Positive Rate of the model at different classification thresholds. A score of 1.0 is perfect, while 0.5 is no better than random guessing. More on how the ROC curve is calculated can be found [here](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc). We check the roc_auc_score like so.


In [13]:
# test.csv has no "Survived" labels, so we score the model on the training data (an optimistic estimate).
result = roc_auc_score(y, model.predict_proba(X)[:, 1])  # arguments: true labels, then predicted P(Survived = 1)
print("ROC AUC Score: " + str(result))
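
`roc_auc_score` takes the true labels first and the model's scores second, and both must cover the same rows. Since the test preview earlier has no "Survived" column, one common way to get a score on passengers the model has not seen is to hold out part of the labeled training data. The sketch below shows the idea on synthetic data. The variable names and the 25% split are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features and labels (not the real data).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(400, 4))
y_all = (X_all[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Hold out 25% of the labeled rows; the model never sees them during training.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_all, y_all, test_size=0.25, random_state=1, stratify=y_all
)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
clf.fit(X_fit, y_fit)

# ROC AUC ranks examples, so pass the probability of class 1 rather than hard 0/1 labels.
val_scores = clf.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_scores)  # argument order: true labels, then scores
print("Validation ROC AUC:", auc)
```

With the real data, the same pattern would apply to `X` and `y` from the earlier cells in place of the synthetic arrays.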