First, let's import a few common modules and make sure Matplotlib plots figures inline. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend Python 3), as well as Scikit-Learn ≥0.20.

In [708]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
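# compare numerically: as plain strings, "0.9" >= "0.20" would wrongly pass
assert tuple(map(int, sklearn.__version__.split(".")[:2])) >= (0, 20)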

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# ignore convergence warning for now
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning,
                            module="sklearn")
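
Version strings compare character by character, so a plain string comparison can pass for a release that is actually too old. Below is a minimal sketch of a numeric comparison. `version_tuple` is a hypothetical helper introduced here for illustration; it is not part of Scikit-Learn.

```python
import re

def version_tuple(version, parts=2):
    """Return the leading numeric components of a version string as ints."""
    numbers = []
    for piece in version.split(".")[:parts]:
        match = re.match(r"\d+", piece)  # tolerate suffixes such as "20rc1"
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)

print("0.9" >= "0.20")                  # True: lexicographic comparison
print(version_tuple("0.9") >= (0, 20))  # False: numeric comparison
```

Comparing tuples of integers orders 0.9 before 0.20, which is the behaviour the check is meant to have.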


# Tackle the Titanic dataset

The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

First, let's load the data:

In [709]:
import pandas as pd
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

The data is already split into a training set and a test set. However, the test data does *not* contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and upload them to WTClass to see your score.

Let's take a peek at the top few rows of the training set:

In [710]:
train_data.head();

The attributes have the following meaning:
* **Survived**: the target. 0 means the passenger did not survive, 1 means they survived.
* **Pclass**: passenger class.
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: number of the passenger's siblings and spouses aboard the Titanic.
* **Parch**: number of the passenger's children and parents aboard the Titanic.
* **Ticket**: ticket id.
* **Fare**: price paid (in pounds).
* **Cabin**: passenger's cabin number.
* **Embarked**: port where the passenger boarded the Titanic.

Let's get more info to see how much data is missing:

In [711]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          572 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Cabin        159 non-null    object 
 11  Embarked     710 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 66.9+ KB


Okay, the **Age**, **Cabin** and **Embarked** attributes are sometimes null (fewer than 712 non-null values), especially **Cabin** (about 78% null). We will ignore **Cabin** for now and focus on the rest. The **Age** attribute has about 20% null values, so we will need to decide what to do with them. Replacing null values with the median age seems reasonable.
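
The null percentages follow directly from the non-null counts in the `info()` output. Here is a small sketch of that arithmetic, followed by median imputation on a toy `Age` column (the toy values are made up for illustration and are not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Non-null counts copied from the info() output above (712 rows in total).
total_rows = 712
non_null = {"Age": 572, "Cabin": 159, "Embarked": 710}
null_share = {col: round(1 - count / total_rows, 3) for col, count in non_null.items()}
print(null_share)  # {'Age': 0.197, 'Cabin': 0.777, 'Embarked': 0.003}

# Median imputation on a made-up Age column: NaN entries take the median (30.0).
toy_age = pd.Series([22.0, np.nan, 30.0, 40.0, np.nan])
age_filled = toy_age.fillna(toy_age.median())
print(age_filled.tolist())  # [22.0, 30.0, 30.0, 40.0, 30.0]
```

On the real data, `train_data.isna().mean()` should give the same shares without typing the counts by hand.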

The **Name** and **Ticket** attributes may have some value, but they will be a bit tricky to convert into useful numbers that a model can consume. So for now, we will ignore them.

Let's take a look at the numerical attributes:

In [712]:
train_data.describe();

* Yikes, only 37.6% **Survived**. :(  That's close enough to 40%, so **accuracy** will be a reasonable metric to evaluate our model.
* The mean **Fare** was £32.60, which does not seem so expensive (but it was probably a lot of money back then).
* The mean **Age** was less than 30 years old.
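
One way to put accuracy numbers in context is the majority-class baseline: a model that always predicts "did not survive" is already right for every non-survivor. A minimal sketch, using the survival rate quoted above:

```python
survival_rate = 0.376  # share of survivors quoted above

# Always predicting the more common class is correct for that class's share.
baseline_accuracy = max(survival_rate, 1 - survival_rate)
print(round(baseline_accuracy, 3))  # 0.624
```

Any trained model should therefore be judged against roughly 62%, not against 50%.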

Let's check that the target is indeed 0 or 1:

In [713]:
train_data["Survived"].value_counts();

Now let's take a quick look at all the categorical attributes:

In [714]:
train_data["Pclass"].value_counts();

In [715]:
train_data["Sex"].value_counts();

In [716]:
train_data["Embarked"].value_counts();

In [717]:
#Anecia's solutions just to see
train_data["Parch"].value_counts();

In [718]:
#Anecia's solutions just to see
train_data["SibSp"].value_counts();

In [719]:
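#I need to see the correlations to determine which categories are best to group together
#PassengerId is essentially uncorrelated with everything and I will not be using it
#Survived correlates most with Pclass (-0.32) and Fare (0.25)
#Fare has a weaker positive correlation with SibSp (0.15) and Parch (0.22)
#Fare has a fairly strong negative correlation with Pclass (-0.55)
#select_dtypes keeps only the numeric columns, which newer pandas versions require for corr()
train_data.select_dtypes(include="number").corr()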

             PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
PassengerId          1.0   -0.030183    0.005104   -0.003894    0.027698   -0.057984   -0.019042
Survived       -0.030183         1.0    -0.32175   -0.059695   -0.047602    0.078311    0.246641
Pclass          0.005104    -0.32175         1.0    -0.35595    0.086933    0.012679   -0.546794
Age            -0.003894   -0.059695    -0.35595         1.0   -0.320916    -0.20704    0.088103
SibSp           0.027698   -0.047602    0.086933   -0.320916         1.0    0.440355    0.153011
Parch          -0.057984    0.078311    0.012679    -0.20704    0.440355         1.0     0.22218
Fare           -0.019042    0.246641   -0.546794    0.088103    0.153011     0.22218         1.0
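
To read a correlation matrix quickly, it can help to rank one row by absolute value. The sketch below does this for the `Survived` row, with the values copied from the table above:

```python
import pandas as pd

# Correlations with Survived, copied from the matrix above.
survived_corr = pd.Series({
    "PassengerId": -0.030183,
    "Pclass": -0.32175,
    "Age": -0.059695,
    "SibSp": -0.047602,
    "Parch": 0.078311,
    "Fare": 0.246641,
})

# Order by strength of the linear relationship, ignoring the sign.
order = survived_corr.abs().sort_values(ascending=False).index
ranked = survived_corr.reindex(order)
print(ranked)
```

Among the original numeric attributes, **Pclass** and **Fare** come first. Keep in mind that this measures linear association only, so an attribute such as **Age** can still matter in a non-linear way.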


In [720]:
train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()
train_data;
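
A notebook cell only displays the value of its last expression, so the `groupby` result in a cell like the one above is computed but never shown. Here is a minimal sketch that assigns the result and prints it. It uses eight rows copied from the table output printed further down, so the rates are illustrative and do not describe the full training set.

```python
import pandas as pd

# Age and Survived for eight rows taken from the table output in this notebook.
sample = pd.DataFrame({
    "Age": [45.5, 23.0, 32.0, 26.0, 6.0, 21.0, 41.0, 14.0],
    "Survived": [0, 0, 0, 0, 0, 1, 0, 1],
})

# Floor division by 15, then multiplying back, maps each age to the lower
# edge of its 15-year bucket: 0, 15, 30, 45, ...
sample["AgeBucket"] = sample["Age"] // 15 * 15

survival_by_bucket = sample.groupby("AgeBucket")["Survived"].mean()
print(survival_by_bucket)
```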

In [721]:
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()
train_data;

In [722]:
#Anecia's solutions
#I created a new column called Title to separate each person's title from their name
#(raw string so that \. reaches the regex engine without an invalid-escape warning)
train_data['Title']=train_data.Name.str.extract(r', ([A-Z][^ ]*\.)',expand=False)
train_data


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBucket,RelativesOnboard,Title
0,1,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5000,C124,S,45.0,0,Mr.
1,2,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0000,,S,15.0,0,Mr.
2,3,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.9250,,S,30.0,0,Mr.
3,4,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S,15.0,1,Mr.
4,5,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.2750,,S,0.0,6,Miss.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,708,1,3,"Salkjelsvik, Miss. Anna Kristine",female,21.0,0,0,343120,7.6500,,S,15.0,0,Miss.
708,709,0,1,"Cairns, Mr. Alexander",male,,0,0,113798,31.0000,,S,,0,Mr.
709,710,0,3,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,,S,30.0,2,Mr.
710,711,1,1,"Carter, Miss. Lucile Polk",female,14.0,1,2,113760,120.0000,B96 B98,S,0.0,3,Miss.


In [723]:
#Anecia's solution
#I need to see the different number of titles
train_data["Title"].value_counts()

Mr.        419
Miss.      143
Mrs.        96
Master.     33
Rev.         5
Dr.          5
Major.       2
Col.         2
Mlle.        2
Capt.        1
Mme.         1
Ms.          1
Lady.        1
Name: Title, dtype: int64
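
The counts above sum to 711, one fewer than the 712 rows, which suggests that one name did not match the pattern. Here is a sketch of how that can happen. The first three names come from the table output above; the last one is a hypothetical name, made up to show a title that starts with a lowercase letter.

```python
import pandas as pd

names = pd.Series([
    "Partner, Mr. Austen",
    "Butt, Major. Archibald Willingham",
    "Andersson, Miss. Ebba Iris Alfrida",
    "Example, the Countess. of Somewhere",  # hypothetical, for illustration
])

# After the comma and space: one capital letter, then non-space characters,
# ending at a literal dot.
titles = names.str.extract(r", ([A-Z][^ ]*\.)", expand=False)
print(titles.tolist())
```

The last entry comes back as `NaN` because the text after the comma starts with a lowercase letter, so the pattern cannot match.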

In [724]:
#Anecia's solutions
#Flag the rarer titles (Dr. and Rev. are left out); dots are escaped so they match literally
train_data['FancyN']=1*train_data['Title'].str.contains(r'Mlle\.|Col\.|Lady\.|Sir\.|Mme\.|Capt\.|Major\.|Ms',regex=True)
train_data.head(15)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBucket,RelativesOnboard,Title,FancyN
0,1,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S,45.0,0,Mr.,0
1,2,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S,15.0,0,Mr.,0
2,3,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S,30.0,0,Mr.,0
3,4,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S,15.0,1,Mr.,0
4,5,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S,0.0,6,Miss.,0
5,6,0,1,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C,15.0,1,Mr.,0
6,7,0,1,"Butt, Major. Archibald Willingham",male,45.0,0,0,113050,26.55,B38,S,45.0,0,Major.,1
7,8,0,2,"del Carlo, Mr. Sebastiano",male,29.0,1,0,SC/PARIS 2167,27.7208,,C,15.0,1,Mr.,0
8,9,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S,,0,Mr.,0
9,10,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5,C52,S,,0,Mr.,0


In [725]:
train_data["FancyN"].value_counts()

0    701
1     10
Name: FancyN, dtype: int64
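
The two counts add up to 711 rather than 712: a missing title makes `str.contains` return `NaN`, and `1 * NaN` stays `NaN`. As a possible alternative (not what the cell above does), exact matching with `isin` treats a missing title as "not rare" and avoids regular expressions altogether. The `rare_titles` set below mirrors the pattern used above.

```python
import pandas as pd

rare_titles = {"Mlle.", "Col.", "Lady.", "Sir.", "Mme.", "Capt.", "Major.", "Ms."}

# Toy titles, including one missing value.
titles = pd.Series(["Mr.", "Major.", "Ms.", None, "Miss."])

# isin returns False for missing values, so the flag is always 0 or 1.
fancy = titles.isin(rare_titles).astype(int)
print(fancy.tolist())  # [0, 1, 1, 0, 0]
```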

In [726]:
#Anecia's solution
#Just to make sure I kept the title column the same
train_data["Title"].value_counts()

Mr.        419
Miss.      143
Mrs.        96
Master.     33
Rev.         5
Dr.          5
Major.       2
Col.         2
Mlle.        2
Capt.        1
Mme.         1
Ms.          1
Lady.        1
Name: Title, dtype: int64

In [727]:
#Anecia's solution
#Having done a correlation I decided to match these together
train_data["RichS"] = train_data["Sex"] + train_data["Title"] +train_data["Embarked"]
train_data[["RichS", "Survived"]].groupby(['RichS']).mean()
train_data;

In [728]:
#Anecia's solution
#I did this to add it with Age bucket
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
train_data['embarked_encoded'] = encoder.fit_transform(train_data.Embarked)
train_data;
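
`LabelEncoder` sorts the distinct values and numbers them from 0, which is why Southampton (`S`) appears as 2 in the tables. A minimal sketch on a toy list of ports:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["S", "C", "Q", "S"])

print(list(encoder.classes_))  # ['C', 'Q', 'S']
print(codes.tolist())          # [2, 0, 1, 2]
```

The resulting numbers are only labels: Q is not "between" C and S in any meaningful sense, so arithmetic on these codes should be interpreted with care. The Scikit-Learn documentation also describes `LabelEncoder` as intended for target values. For input features, `OneHotEncoder` (used later in this notebook) avoids implying an order.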

In [729]:
#Anecia's solution
train_data["Richie"] = train_data["embarked_encoded"] + train_data["AgeBucket"] 
train_data[["Richie", "Survived"]].groupby(['Richie']).mean()
train_data;

In [730]:
#Anecia's solution
#There is definitely a correlation between Age and Pclass, which should give me better results. Younger people in a higher class are more likely to have survived.
train_data["RichY"] = train_data["Pclass"] + train_data["Age"] +train_data["FancyN"]
train_data[["RichY", "Survived"]].groupby(['RichY']).mean()
train_data;

In [731]:
#Anecia's solution
#Single people who paid a higher fare may be of higher status and are more likely to have survived.
train_data["UltraRich"] = train_data["Fare"] + train_data["RelativesOnboard"] 
train_data[["UltraRich", "Survived"]].groupby(['UltraRich']).mean()
train_data;

In [732]:
#Anecia's solution
#I need sex to match with an integer/float category
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
train_data['sex_encoded'] = encoder.fit_transform(train_data.Sex)
train_data;

In [733]:
#Anecia's solution
#Females in a higher class are more likely to have survived.
train_data["UC"] = train_data["sex_encoded"] + train_data["Pclass"] 
train_data[["UC", "Survived"]].groupby(['UC']).mean()
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,RelativesOnboard,Title,FancyN,RichS,embarked_encoded,Richie,RichY,UltraRich,sex_encoded,UC
0,1,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5000,...,0,Mr.,0,maleMr.S,2,47.0,46.5,28.5000,1,2
1,2,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0000,...,0,Mr.,0,maleMr.S,2,17.0,25.0,13.0000,1,3
2,3,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.9250,...,0,Mr.,0,maleMr.S,2,32.0,35.0,7.9250,1,4
3,4,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,...,1,Mr.,0,maleMr.S,2,17.0,29.0,8.8542,1,4
4,5,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.2750,...,6,Miss.,0,femaleMiss.S,2,2.0,9.0,37.2750,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,708,1,3,"Salkjelsvik, Miss. Anna Kristine",female,21.0,0,0,343120,7.6500,...,0,Miss.,0,femaleMiss.S,2,17.0,24.0,7.6500,0,3
708,709,0,1,"Cairns, Mr. Alexander",male,,0,0,113798,31.0000,...,0,Mr.,0,maleMr.S,2,,,31.0000,1,2
709,710,0,3,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,...,2,Mr.,0,maleMr.S,2,32.0,44.0,16.1083,1,4
710,711,1,1,"Carter, Miss. Lucile Polk",female,14.0,1,2,113760,120.0000,...,3,Miss.,0,femaleMiss.S,2,2.0,15.0,123.0000,0,1
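
One thing to watch when adding two encoded columns is that different combinations can produce the same sum. The sketch below shows this for `UC` on toy rows, together with one possible alternative: a string key, which keeps every combination distinct. `SexClass` is a hypothetical column name used only for this illustration, not a column from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex_encoded": [1, 0, 1, 0],  # 1 = male, 0 = female, as encoded above
    "Pclass": [1, 2, 3, 1],
})

# Sum, as in the cell above: male/1st class and female/2nd class both give 2.
toy["UC"] = toy["sex_encoded"] + toy["Pclass"]

# Alternative: a string key that keeps each (sex, class) pair distinct.
toy["SexClass"] = toy["sex_encoded"].astype(str) + "_" + toy["Pclass"].astype(str)
print(toy)
```

A string key like this would go through the categorical (one-hot) branch of a preprocessing pipeline rather than the numeric one.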


In [734]:
train_data["Plclass"] = train_data["embarked_encoded"] + train_data["Pclass"]
train_data[["Plclass", "Survived"]].groupby(['Plclass']).mean()
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Title,FancyN,RichS,embarked_encoded,Richie,RichY,UltraRich,sex_encoded,UC,Plclass
0,1,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5000,...,Mr.,0,maleMr.S,2,47.0,46.5,28.5000,1,2,3
1,2,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0000,...,Mr.,0,maleMr.S,2,17.0,25.0,13.0000,1,3,4
2,3,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.9250,...,Mr.,0,maleMr.S,2,32.0,35.0,7.9250,1,4,5
3,4,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,...,Mr.,0,maleMr.S,2,17.0,29.0,8.8542,1,4,5
4,5,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.2750,...,Miss.,0,femaleMiss.S,2,2.0,9.0,37.2750,0,3,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,708,1,3,"Salkjelsvik, Miss. Anna Kristine",female,21.0,0,0,343120,7.6500,...,Miss.,0,femaleMiss.S,2,17.0,24.0,7.6500,0,3,5
708,709,0,1,"Cairns, Mr. Alexander",male,,0,0,113798,31.0000,...,Mr.,0,maleMr.S,2,,,31.0000,1,2,3
709,710,0,3,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,...,Mr.,0,maleMr.S,2,32.0,44.0,16.1083,1,4,5
710,711,1,1,"Carter, Miss. Lucile Polk",female,14.0,1,2,113760,120.0000,...,Miss.,0,femaleMiss.S,2,2.0,15.0,123.0000,0,1,3


In [735]:
#Anecia's solutions just to see
train_data["Fare"].value_counts()

8.0500     35
13.0000    33
7.8958     32
7.7500     26
26.0000    25
           ..
40.1250     1
15.1000     1
61.1750     1
59.4000     1
14.1083     1
Name: Fare, Length: 220, dtype: int64

In [736]:
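#Anecia's solutions to check the correlation
#Survived has the strongest correlations (both negative) with sex_encoded, then with UC.
#select_dtypes keeps only the numeric columns, which newer pandas versions require for corr()
trainCorr = train_data.select_dtypes(include="number").corr()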
trainCorr

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,AgeBucket,RelativesOnboard,embarked_encoded,Richie,UltraRich,sex_encoded,UC,Plclass
PassengerId,1.0,-0.030183,0.005104,-0.003894,0.027698,-0.057984,-0.019042,0.007289,-0.007917,0.030591,0.008519,-0.019158,0.018778,0.013091,0.023285
Survived,-0.030183,1.0,-0.32175,-0.059695,-0.047602,0.078311,0.246641,-0.047526,0.003565,-0.14968,-0.056505,0.244957,-0.54175,-0.521085,-0.317695
Pclass,0.005104,-0.32175,1.0,-0.35595,0.086933,0.012679,-0.546794,-0.361229,0.066748,0.124445,-0.349582,-0.540662,0.128672,0.882745,0.767058
Age,-0.003894,-0.059695,-0.35595,1.0,-0.320916,-0.20704,0.088103,0.95701,-0.319651,0.015986,0.95544,0.079059,0.091598,-0.245294,-0.229125
SibSp,0.027698,-0.047602,0.086933,-0.320916,1.0,0.440355,0.153011,-0.305981,0.906387,0.079193,-0.303413,0.181026,-0.104174,0.022083,0.110894
Parch,-0.057984,0.078311,0.012679,-0.20704,0.440355,1.0,0.22218,-0.206335,0.778416,0.049868,-0.204779,0.245578,-0.250724,-0.108371,0.04095
Fare,-0.019042,0.246641,-0.546794,0.088103,0.153011,0.22218,1.0,0.089143,0.211525,-0.203755,0.075011,0.999507,-0.171665,-0.530678,-0.507173
AgeBucket,0.007289,-0.047526,-0.361229,0.95701,-0.305981,-0.206335,0.089143,1.0,-0.309688,0.021881,0.998631,0.080361,0.091919,-0.249418,-0.229037
RelativesOnboard,-0.007917,0.003565,0.066748,-0.319651,0.906387,0.778416,0.211525,-0.309688,1.0,0.078835,-0.307185,0.242122,-0.190809,-0.035551,0.096804
embarked_encoded,0.030591,-0.14968,0.124445,0.015986,0.079193,0.049868,-0.203755,0.021881,0.078835,1.0,0.07415,-0.199735,0.080068,0.140202,0.732046


In [737]:
train_data.describe();

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.

Now let's build our preprocessing pipelines. We will use the `ColumnTransformer` to apply different preprocessing and feature extraction pipelines to different subsets of features. Here the numeric data is median-imputed, while the categorical data is one-hot encoded after imputing missing values with the most frequent value in each column.

In [738]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [739]:
train_data["UltraRich"] = train_data["Fare"] + train_data["RelativesOnboard"]
train_data["RichY"] = train_data["Pclass"] + train_data["Age"] +train_data["FancyN"]
train_data["UC"] = train_data["sex_encoded"] + train_data["Pclass"] 
train_data["AgeBucket"] = train_data["Age"] // 15 * 15


numeric_features = ["UltraRich","RichY","UC","Fare","AgeBucket"] #,"FancyN"
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))])


categorical_features = ["RichS","Plclass"] 
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(sparse=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Cool! Now we have a nice preprocessing pipeline that takes the raw data and outputs numerical input features that we can feed to any Machine Learning model we want.
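
To see what such a pipeline does to individual values, here is a minimal sketch on toy data (made-up rows, not the Titanic files). It adds `handle_unknown="ignore"`, which the pipeline above does not set, so that a category seen only at prediction time is encoded as all zeros instead of raising an error. `to_dense` is a hypothetical helper used only to print the result uniformly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({"Age": [22.0, np.nan, 40.0, 30.0],
                          "Embarked": ["S", np.nan, "C", "S"]})
toy_test = pd.DataFrame({"Age": [np.nan], "Embarked": ["Q"]})  # "Q" is unseen

toy_preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Embarked"]),
])

def to_dense(matrix):
    """Return a dense NumPy array whether the input is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

X_toy_train = to_dense(toy_preprocessor.fit_transform(toy_train))
X_toy_test = to_dense(toy_preprocessor.transform(toy_test))
print(X_toy_train)  # columns: Age, Embarked=C, Embarked=S
print(X_toy_test)   # missing Age -> training median; unseen "Q" -> zeros
```

The imputation values (median age, most frequent port) are learned from the training frame and reused on the test frame, which is why the preprocessor is fitted once and then only `transform` is called on new data.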

In [740]:
X_train = preprocessor.fit_transform(train_data.drop(columns=["Survived"]))
X_train

array([[ 28.5   ,  46.5   ,   2.    , ...,   1.    ,   0.    ,   0.    ],
       [ 13.    ,  25.    ,   3.    , ...,   0.    ,   1.    ,   0.    ],
       [  7.925 ,  35.    ,   4.    , ...,   0.    ,   0.    ,   1.    ],
       ...,
       [ 16.1083,  44.    ,   4.    , ...,   0.    ,   0.    ,   1.    ],
       [123.    ,  15.    ,   1.    , ...,   1.    ,   0.    ,   0.    ],
       [ 78.2875,  22.    ,   2.    , ...,   1.    ,   0.    ,   0.    ]])

Note: We drop the "Survived" column from train_data before fitting the preprocessor, because there is no "Survived" column in test_data.

Let's not forget to get the labels:

In [741]:
y_train = train_data["Survived"]

We are now ready to train a classifier.

In [742]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

LogisticRegression()

Great, our model is trained, let's use it to make predictions on the test set:

In [743]:
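#NB This fails with a KeyError because the engineered columns (UltraRich, RichY, UC, AgeBucket, RichS, Plclass)
#were added to train_data only; test_data needs the same feature engineering steps before it can be transformed.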
X_test = preprocessor.transform(test_data);
y_pred= log_reg.predict(X_test);

KeyError: "['UltraRich', 'RichY', 'UC', 'AgeBucket'] not in index"
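
The `KeyError` lists columns that exist in `train_data` but not in `test_data`: the feature engineering cells above only ever modified `train_data`. One way to avoid this is to collect those steps in a single function and apply it to both frames. The sketch below is one possible version, not the notebook's own code. `add_features` is a hypothetical helper, the fixed code dictionaries stand in for the fitted `LabelEncoder` objects, and `na=False` is an added choice so that a missing title counts as 0. The rows are copied from the table output above, with the last one standing in for a test row because the contents of `test.csv` are not shown.

```python
import pandas as pd

SEX_CODES = {"female": 0, "male": 1}        # alphabetical, as LabelEncoder assigns
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}
RARE_TITLES = r"Mlle\.|Col\.|Lady\.|Sir\.|Mme\.|Capt\.|Major\.|Ms"

def add_features(df):
    """Return a copy of df with the engineered columns used by the preprocessor."""
    out = df.copy()
    out["AgeBucket"] = out["Age"] // 15 * 15
    out["RelativesOnboard"] = out["SibSp"] + out["Parch"]
    out["Title"] = out["Name"].str.extract(r", ([A-Z][^ ]*\.)", expand=False)
    out["FancyN"] = out["Title"].str.contains(RARE_TITLES, regex=True, na=False).astype(int)
    out["sex_encoded"] = out["Sex"].map(SEX_CODES)
    out["embarked_encoded"] = out["Embarked"].map(EMBARKED_CODES)
    out["RichS"] = out["Sex"] + out["Title"] + out["Embarked"]
    out["RichY"] = out["Pclass"] + out["Age"] + out["FancyN"]
    out["UltraRich"] = out["Fare"] + out["RelativesOnboard"]
    out["UC"] = out["sex_encoded"] + out["Pclass"]
    out["Plclass"] = out["embarked_encoded"] + out["Pclass"]
    return out

toy_train = pd.DataFrame({
    "Survived": [0, 0], "Pclass": [1, 3],
    "Name": ["Partner, Mr. Austen", "Andersson, Miss. Ebba Iris Alfrida"],
    "Sex": ["male", "female"], "Age": [45.5, 6.0], "SibSp": [0, 4],
    "Parch": [0, 2], "Fare": [28.5, 31.275], "Embarked": ["S", "S"],
})
toy_test = pd.DataFrame({  # stand-in test row: same columns, no Survived
    "Pclass": [1], "Name": ["Butt, Major. Archibald Willingham"],
    "Sex": ["male"], "Age": [45.0], "SibSp": [0], "Parch": [0],
    "Fare": [26.55], "Embarked": ["S"],
})

train_features = add_features(toy_train)
test_features = add_features(toy_test)

needed = ["UltraRich", "RichY", "UC", "Fare", "AgeBucket", "RichS", "Plclass"]
missing = [col for col in needed if col not in test_features.columns]
print(missing)  # [] once both frames go through the same function
```

In the notebook this would mean calling the function on both `train_data` and `test_data` before `preprocessor.fit_transform` and `preprocessor.transform`. Because **RichS** can take values in the test set that never occur in the training set, the one-hot encoder would probably also need `handle_unknown="ignore"`.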

In [None]:
test_data["Survived"]=y_pred
test_data.to_csv("my_solution_McMurrinBala.csv",index=False, columns=["PassengerId", "Survived"])

And now we just build a CSV file with these predictions, then upload it and hope for the best. But wait! We can do better than hope. Why don't we use cross-validation to have an idea of how good our model is?

In [None]:
#Anecia's solution
#I changed cv to 17...
#I did not get to 85 but I tried many different things. I also could have used cabin, lastname but for me there is no logical correlation
#I therefore decided only to use the categories above and I have gotten to 84 so fingers crossed.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(log_reg, X_train, y_train, cv=17)
scores.mean()
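
With `cv=17`, each fold holds out only 41 or 42 of the 712 training rows, so individual fold scores can vary quite a bit. Reporting the spread next to the mean makes that visible. A minimal sketch on synthetic data follows. The dataset comes from `make_classification`, not from the Titanic files, so the scores it prints say nothing about this model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same number of rows as the training set.
X_demo, y_demo = make_classification(n_samples=712, n_features=8, random_state=42)

demo_scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=17)
print(len(demo_scores))                        # one accuracy value per fold
print(demo_scores.mean(), demo_scores.std())   # mean and spread across folds
```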


Okay, over 80% accuracy, clearly better than random chance, but it's not a great score. 

**Your Goal**: try to build a model that reaches higher accuracy, for example, 85%.

To improve this result further, you could:
* Tune hyperparameters using cross validation,
* Do more feature engineering, for example:
  * replace **SibSp** and **Parch** with their sum,
  * try to identify parts of names that correlate well with the **Survived** attribute (e.g. if the name contains "Countess", then survival seems more likely),
* Try to convert numerical attributes to categorical attributes: for example, different age groups had very different survival rates (see below), so it may help to create an age bucket category and use it instead of the age. Similarly, it may be useful to have a special category for people traveling alone since only 30% of them survived (see below).
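
As an illustration of the first suggestion, here is a minimal sketch of hyperparameter tuning with `GridSearchCV`. It runs on synthetic data rather than the Titanic features, and the grid of `C` values is an arbitrary example, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with X_train and y_train in the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate regularization strengths (smaller C means stronger regularization).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best setting, so it can be compared directly with the `cross_val_score` mean computed earlier.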

In [None]:
#train_data["AgeBucket"] = train_data["Age"] // 15 * 15
#train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()

In [None]:
#train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
#train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()