# Predicting survival on the Titanic
## An introduction to Machine Learning with Python
*By Alexander I.R Jackson, University of Southampton*

In this notebook we will work with data about passengers on the RMS Titanic's maiden voyage, which famously ended in disaster.

We'll use Python and machine learning libraries to try to understand and predict survival on this fateful voyage.

## Load Packages
Here we will load all the packages needed for our analysis. You can modify this list and add to it if you want.


In [38]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split


## Import the data
We will use Pandas to import the data from the `titanic.csv` file and store it in a DataFrame.

In [57]:
# This line reads the data and stores the information in the titanic_df variable
titanic_df = pd.read_csv('data/titanic.csv')
titanic_df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S
...,...,...,...,...,...,...,...,...,...,...,...
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,,C
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,,C
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0000,0.0,0.0,315082,7.8750,,S


## Explore the data




### Getting an overview
Pandas DataFrames have a number of useful methods for exploring the data they hold.
We use these by writing the variable name (`titanic_df`), then a dot (`.`), and then the method, such as `head(x)`, which shows the first x rows.
Some other methods you can use include:
* `tail()`
* `sample()`
* `shape` - notice this isn't actually a method: there are no brackets `()` because it's an attribute (or property).
* `info()`
* `describe()`

Try these out and see if you can work out what they do. We've done an example with `head()`.

In [59]:
# Head
titanic_df.head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S
5,1.0,1.0,"Anderson, Mr. Harry",male,48.0,0.0,0.0,19952,26.55,E12,S
6,1.0,1.0,"Andrews, Miss. Kornelia Theodosia",female,63.0,1.0,0.0,13502,77.9583,D7,S
7,1.0,0.0,"Andrews, Mr. Thomas Jr",male,39.0,0.0,0.0,112050,0.0,A36,S
8,1.0,1.0,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2.0,0.0,11769,51.4792,C101,S
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C


In [17]:
# Tail

In [61]:
# Sample


In [None]:
# Shape

In [115]:
# Info

In [30]:
# Describe


### Data understanding
Some of these columns don't make a lot of sense! Luckily we've done the hard work and found out what the different numbers and letters mean. In the real world this can be much harder.
```markdown
Variable    Definition      Key

survived    Survival        0 = No, 1 = Yes
pclass      Ticket class    1 = 1st, 2 = 2nd, 3 = 3rd
name        Passenger Name
sex         Sex
age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
```

### Select a column
If you just want to select one column, there are a few ways to do it.
The simplest is to use square brackets `[]` with the column's name in quotation marks (`''` or `""`).

For example:
`titanic_df['name']`

Try this out below!

P.S. If you are really getting into this then it's worth reading [the Pandas documentation on this](https://pandas.pydata.org/docs/user_guide/indexing.html) as there are different ways to select and index data, which should be used at different times!

In [None]:
# Select a column by name
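
As a rough illustration of the square-bracket syntax, the sketch below uses a made-up `toy_df` (hypothetical data, not the Titanic file). It also shows that passing a list of names returns a DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df
toy_df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],
    "age": [29.0, 2.0, 48.0],
    "fare": [211.34, 151.55, 26.55],
})

names = toy_df["name"]                   # one name in brackets -> a Series
names_frame = toy_df[["name"]]           # a list with one name -> a DataFrame
name_and_age = toy_df[["name", "age"]]   # a list of names -> several columns

print(type(names).__name__)
print(type(names_frame).__name__)
print(name_and_age.shape)
```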

### Getting some insights

Now that you can select a column, let's try to understand something about the data...

These methods can be used on individual columns (called Series in Pandas) and can be useful in lots of ways:

* `mean()`
* `median()`
* `min()`, `max()`
* `value_counts()`

Try some out below. Remember, it's hard to calculate the mean of a name (or any string)! Also, `value_counts()` probably isn't the best way to summarise the fare.

In [33]:
# Try it out below

In [None]:
# Where did most passengers board the Titanic?
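
Here is a small sketch of how these Series methods behave, again on a made-up `toy_df` (the values are invented for illustration). Numeric summaries suit a column like `fare`, while `value_counts()` suits a column of categories like `embarked`.

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df
toy_df = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 100.0],
    "embarked": ["S", "C", "S", "Q"],
})

mean_fare = toy_df["fare"].mean()        # sum of the values divided by their number
median_fare = toy_df["fare"].median()    # the middle value, less affected by outliers
max_fare = toy_df["fare"].max()

port_counts = toy_df["embarked"].value_counts()   # how often each value appears
most_common_port = port_counts.idxmax()           # the label with the highest count

print(mean_fare, median_fare, max_fare)
print(port_counts)
print(most_common_port)
```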

In [None]:
# This section is a bit more complicated... can you figure it out?

titanic_df.groupby('pclass').agg(mean_fare=('fare', 'mean'),
                                 mean_age=('age', 'mean'),
                                 count=('pclass', 'count'))
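
If the cell above is hard to read, this sketch applies the same pattern to a made-up `toy_df` (hypothetical values). `groupby('pclass')` splits the rows by ticket class, and each keyword argument to `agg` names a new output column and gives a `(column, function)` pair to compute it from.

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df
toy_df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "fare": [100.0, 200.0, 30.0, 8.0, 7.0, 9.0],
    "age": [40.0, 50.0, 30.0, 20.0, None, 24.0],
})

by_class = toy_df.groupby("pclass").agg(
    mean_fare=("fare", "mean"),    # average fare within each class
    mean_age=("age", "mean"),      # average age, missing values are skipped
    count=("pclass", "count"),     # number of rows in each class
)

print(by_class)
```

The result has one row per ticket class, which makes it easy to compare classes side by side.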

## Tidy the data

This is probably the most time-consuming and most important part. Don't worry though... not today!

The Titanic dataset is already pretty clean and well structured, but there are still a few things to think about.


In [None]:
# The code below will 'drop' the columns that you specify.
# Which columns should we leave out? (Remember, if you want to drop more than one, it's a list of strings)

titanic_df_clean = titanic_df.drop(columns=['COL NAME HERE'])

# Now print that new DataFrame out. Does it look right?
titanic_df_clean
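
As a sketch of how `drop` behaves, the example below removes two columns from a made-up `toy_df`. The column choice here is only for illustration and is not the answer to the exercise. Note that `drop` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df
toy_df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],
    "ticket": ["24160", "113781", "2665"],
    "age": [29.0, 2.0, 48.0],
    "fare": [211.34, 151.55, 14.45],
})

# Pass a list of column names to drop more than one column
toy_clean = toy_df.drop(columns=["name", "ticket"])

print(list(toy_clean.columns))
print(list(toy_df.columns))   # the original still has all four columns
```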

### The Cabin Column

Let's take a closer look at this column.


In [114]:
titanic_df['cabin'].sample(10, random_state=42)

701       NaN
994     F E46
350       NaN
986       NaN
409       NaN
917       NaN
905       NaN
1117      NaN
1168      NaN
344       NaN
Name: cabin, dtype: object

There seem to be a lot of NaN (missing) values. Let's try to work out why.

* What does the column mean?
* Why are there so many missing values?
* What could we do to make it useful?

<details>
    <summary markdown='span'>💡 Hint</summary>

Maybe only passengers who could afford a cabin will have one? Or maybe the data are just poor?
</details>

In [None]:
cabin_bool = []
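
One possible way to fill this list is with a `for` loop, sketched below on a made-up Series called `cabins` (hypothetical values, not the real column). The idea is to record `True` when a passenger has a cabin entry and `False` when the value is missing.

```python
import pandas as pd

# Hypothetical stand-in for titanic_df_clean['cabin']
cabins = pd.Series(["B5", None, "C22 C26", None], name="cabin")

cabin_bool = []
for cabin in cabins:
    if pd.isnull(cabin):
        cabin_bool.append(False)   # no cabin recorded
    else:
        cabin_bool.append(True)    # a cabin is recorded

# Pandas can also do this without a Python loop
cabin_bool_vectorised = cabins.notna().tolist()

print(cabin_bool)
print(cabin_bool == cabin_bool_vectorised)
```

The loop is the easiest version to read, while `notna()` is usually faster on large columns.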


In [99]:
# The line below does exactly the same thing but in one line

cabin_bool_quick = [not pd.isnull(cabin) for cabin in titanic_df_clean['cabin']]

# Let's check they are the same (should print True if they are the same)
cabin_bool == cabin_bool_quick


True

In [100]:
# Create a new 'cabin_bool' column in the DataFrame
titanic_df_clean['cabin_bool'] = cabin_bool

# Drop the old cabin column
titanic_df_clean = titanic_df_clean.drop(columns=['cabin'])

# Check the new DataFrame
titanic_df_clean

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,embarked,cabin_bool
0,1.0,1.0,female,29.0000,0.0,0.0,211.3375,S,True
1,1.0,1.0,male,0.9167,1.0,2.0,151.5500,S,True
2,1.0,0.0,female,2.0000,1.0,2.0,151.5500,S,True
3,1.0,0.0,male,30.0000,1.0,2.0,151.5500,S,True
4,1.0,0.0,female,25.0000,1.0,2.0,151.5500,S,True
...,...,...,...,...,...,...,...,...,...
1305,3.0,0.0,female,,1.0,0.0,14.4542,C,False
1306,3.0,0.0,male,26.5000,0.0,0.0,7.2250,C,False
1307,3.0,0.0,male,27.0000,0.0,0.0,7.2250,C,False
1308,3.0,0.0,male,29.0000,0.0,0.0,7.8750,S,False


## Let's start modelling!

We'll now go on and try some simple statistical modelling before building up to more complex machine learning.

Remember our goal is to predict which passengers will survive.



### Splitting the data

But first we need a training set and a test set. We also need to separate our X (inputs) from our y (outcome).

Luckily there are packages that will do a lot of the legwork for you. Below we use the `train_test_split` function from sklearn ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

In [52]:
# Select our X
X = titanic_df.drop(columns='survived', axis=1)

# Select our y
y = # YOUR ANSWER HERE

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

In [None]:
# Try comparing the size of the training and testing sets (remember the shape attribute)
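
To see what the split produces, here is a minimal sketch on a made-up `toy_df` with ten rows (hypothetical values). With `test_size=0.2`, roughly 20% of the rows go into the test set and the rest into the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Titanic DataFrame
toy_df = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 27, 14, 58, 20],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07, 26.55, 8.05],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

X = toy_df.drop(columns="survived")   # inputs: everything except the outcome
y = toy_df["survived"]                # outcome: the column we want to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Setting `random_state` makes the shuffle reproducible, so you get the same split every time you run the cell.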



## Logistic Regression


In [55]:
titanic_df.drop(columns=['home.dest', 'cabin', 'boat', 'body', 'name', 'ticket'], axis=1).dropna()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,ticket,fare,embarked
0,1.0,1.0,female,29.0000,0.0,0.0,24160,211.3375,S
1,1.0,1.0,male,0.9167,1.0,2.0,113781,151.5500,S
2,1.0,0.0,female,2.0000,1.0,2.0,113781,151.5500,S
3,1.0,0.0,male,30.0000,1.0,2.0,113781,151.5500,S
4,1.0,0.0,female,25.0000,1.0,2.0,113781,151.5500,S
...,...,...,...,...,...,...,...,...,...
1301,3.0,0.0,male,45.5000,0.0,0.0,2628,7.2250,C
1304,3.0,0.0,female,14.5000,1.0,0.0,2665,14.4542,C
1306,3.0,0.0,male,26.5000,0.0,0.0,2656,7.2250,C
1307,3.0,0.0,male,27.0000,0.0,0.0,2670,7.2250,C
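
As a rough sketch of where this could go next, the example below fits a logistic regression using the classes imported at the top of the notebook. Everything about the data is an assumption: `toy_df` is randomly generated, its survival rule is invented, and it has no missing values, so the `dropna()` step shown above isn't needed here. Text columns such as `sex` must be turned into numbers first, which `pd.get_dummies` does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical synthetic data, NOT the real Titanic passengers
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "age": rng.uniform(1, 70, size=n).round(1),
    "fare": rng.uniform(5, 200, size=n).round(2),
})
# Invented rule: survival follows sex, flipped for about 10% of rows as noise
noise = rng.random(n) < 0.1
toy_df["survived"] = ((toy_df["sex"] == "female") ^ noise).astype(int)

# Encode the text column as a 0/1 column called sex_male
X = pd.get_dummies(toy_df.drop(columns="survived"),
                   columns=["sex"], drop_first=True, dtype=float)
y = toy_df["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)                  # accuracy on unseen rows
cv_scores = cross_val_score(model, X_train, y_train, cv=5)   # 5-fold cross-validation

print(test_accuracy)
print(cv_scores.mean())
```

Cross-validation on the training set gives a more stable estimate than a single split, while the test set is kept back for a final check.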
