# Workshop PR02: Intro to modelling

## Learning objectives
* Practice `pandas` for manipulating data
* Cursory understanding of using `sklearn` for modelling

## Prepping the data
This is the part we'll leave up to you:

* Read in the titanic data
* Read the training labels ("Survived") into a variable called `train_y` (calling the labels "y" is a convention)
* Generate a new feature "SexEncoded", which encodes the column "Sex" as 0 if "female" and 1 if "male". This is necessary because most of our classifiers need to be fed numbers (they can't understand raw strings)
* Read "Age", "SexEncoded", "Pclass", and "Fare" into a variable called `train_X` (also a conventional name)

Go back to Workshop PR01 if you need to double check how to do this.

In [1]:
import pandas as pd

In [2]:
# read in the data
df = pd.read_csv("titanic.csv")

In [3]:
# read the training labels into train_y
train_y = df["Survived"]

In [4]:
# generate the new feature "SexEncoded"
df["SexEncoded"] = df["Sex"].apply(lambda sex: 0 if sex == "female" else 1)
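
It's worth double checking an encoding like this, because a mistake in the condition won't raise an error. Here is a minimal sketch of one way to do that, using a tiny made-up frame (the `toy` name and its rows are assumptions for illustration, not the real titanic data). It uses `Series.map` with a dictionary as an alternative to `apply`, then cross-tabulates the original column against the encoded one:

```python
import pandas as pd

# made-up rows, only for illustration (not the real titanic data)
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# map each string to its code; a value missing from the dict would become NaN
toy["SexEncoded"] = toy["Sex"].map({"female": 0, "male": 1})

# cross-tabulate the original column against the encoded one:
# every "female" row should sit under 0 and every "male" row under 1
check = pd.crosstab(toy["Sex"], toy["SexEncoded"])
print(check)
```

If one of the encoded columns in the table is missing entirely, every row got the same code and the encoding needs another look.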

In [5]:
# read the training features into train_X
train_X = df[["Age", "SexEncoded", "Pclass", "Fare"]]

## Modelling with `sklearn`

`sklearn` makes it super easy to build models - it comes with a fat mixed bag of pre-implemented models (https://scikit-learn.org/stable/supervised_learning.html). But the sequence of steps you need to go through to run them is always the same:
1. Import the classifier from sklearn (nobody ever imports the whole `sklearn` library, it's just too large)
2. Instantiate a classifier object
3. Train the classifier object
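
To show that the three steps really are the same whichever model you pick, here is a small sketch using a different classifier, `DecisionTreeClassifier`, on made-up data (`toy_X` and `toy_y` are assumptions for illustration only):

```python
# (1) import just the classifier you need
from sklearn.tree import DecisionTreeClassifier

# made-up toy data: 6 rows, 2 numeric features, binary labels
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]]
toy_y = [0, 0, 0, 1, 1, 1]

# (2) instantiate a classifier object
tree = DecisionTreeClassifier()

# (3) train the classifier object
tree.fit(toy_X, toy_y)

# once trained, the classifier can predict labels for rows of features
predictions = tree.predict([[0, 0], [3, 3]])
print(predictions)
```

Swapping in another model usually only changes the import line and the class name in step (2).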

In [6]:
# (1) let's import sklearn's RandomForestClassifier
# side note: as far as algorithms go, the random forest algorithm is actually pretty straightforward
#   you can look it up here (https://en.wikipedia.org/wiki/Random_forest#Algorithm), but you might have
#   to do a bit of detour reading on decision trees and bootstrap aggregating
from sklearn.ensemble import RandomForestClassifier

In [7]:
# (2) instantiate a RandomForestClassifier object
random_forest = RandomForestClassifier()

In [8]:
# (3) and train using "train_X" and "train_y"
random_forest.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [9]:
# and if you want to see how it performs (for a classifier, score returns the mean accuracy on the data you pass in)
random_forest.score(train_X, train_y)

0.9447576099210823
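
If you're wondering what that number means: for a classifier, `score` is the fraction of rows whose predicted label matches the true label. A minimal sketch on made-up data (the `toy_X`, `toy_y` and `forest` names are assumptions for illustration) that computes the same thing by hand:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# made-up data: 50 rows, 3 features, label depends on the first feature
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 3))
toy_y = (toy_X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(toy_X, toy_y)

# accuracy by hand: fraction of predictions that equal the true labels
manual_accuracy = (forest.predict(toy_X) == toy_y).mean()

# the two numbers should agree
print(forest.score(toy_X, toy_y), manual_accuracy)
```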

In [10]:
# actually, it's a bit dodgy for us to be scoring our classifier using the same data that we used to train it
# what we should really be doing is setting aside some portion of the data to be used as "test" data, not using
# that portion in training, and then using it for evaluation
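
Here is a rough sketch of what setting aside test data could look like, using `train_test_split` from `sklearn.model_selection`. The data is made up so the example runs on its own, and the 25% test size and `random_state=0` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# made-up data: 200 rows, 4 features, label depends on the first two features
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 4))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)

# hold out 25% of the rows; the classifier never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

# compare accuracy on the rows used for training with accuracy on the held-out rows
train_score = forest.score(X_train, y_train)
test_score = forest.score(X_test, y_test)
print(train_score, test_score)
```

With the workshop's data you would pass `train_X` and `train_y` to `train_test_split` in place of the toy arrays. The score on the held-out rows is the more honest estimate of how the classifier does on data it hasn't seen.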