# Machine Learning with Python to solve the Kaggle Titanic case

## Getting and Cleaning both the test and train datasets

In these first steps we go through the essentials of getting and cleaning the data. We need to do this before we can start building predictive models for the Kaggle Titanic competition using Python and machine learning.

First we run the necessary Jupyter magic so that plots are displayed inline.

In [3]:
%matplotlib inline

Now we import the pandas library and read both the train and test datasets into two DataFrames.

In [4]:
import pandas as pd

train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

Now we print the `head` of the train and test DataFrames.

In [10]:
print(train.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [11]:
print(test.head())

   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S  


Before we start with the actual analysis, it's important that we understand the structure of our data. Both the test and train datasets are DataFrame objects, which is how pandas represents datasets. We can easily explore a DataFrame using the .describe() method. It summarizes the numeric columns/features of the DataFrame, including the count of non-missing observations, the mean, the maximum and so on.

In [8]:
train.describe()

      PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count       891.0       891.0       891.0       714.0       891.0       891.0       891.0
mean        446.0    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std    257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min           1.0         0.0         1.0        0.42         0.0         0.0         0.0
25%         223.5         0.0         2.0      20.125         0.0         0.0      7.9104
50%         446.0         0.0         3.0        28.0         0.0         0.0     14.4542
75%         668.5         1.0         3.0        38.0         1.0         0.0        31.0
max         891.0         1.0         3.0        80.0         8.0         6.0    512.3292


In [9]:
test.describe()

      PassengerId      Pclass         Age       SibSp       Parch        Fare
count       418.0       418.0       332.0       418.0       418.0       417.0
mean       1100.5     2.26555    30.27259    0.447368    0.392344   35.627188
std    120.810458    0.841838   14.181209     0.89676    0.981429   55.907576
min         892.0         1.0        0.17         0.0         0.0         0.0
25%        996.25         1.0        21.0         0.0         0.0      7.8958
50%        1100.5         3.0        27.0         0.0         0.0     14.4542
75%       1204.75         3.0        39.0         1.0         0.0        31.5
max        1309.0         3.0        76.0         8.0         9.0    512.3292


Another useful trick is to look at the dimensions of the DataFrame. Here we call numpy's np.shape() function on our DataFrame objects. It returns the same (rows, columns) tuple as the DataFrame's own .shape attribute.

In [18]:
import numpy as np
np.shape(train)

(891, 12)

In [19]:
import numpy as np
np.shape(test)

(418, 11)
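The count row of .describe() above already hints at missing values (714 of 891 ages in train, 417 of 418 fares in test). As a minimal sketch, the cell below shows the .shape attribute next to np.shape() and counts missing values per column. The small `toy` DataFrame is a made-up stand-in, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the train DataFrame (not the real data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# .shape is a plain attribute of the DataFrame; np.shape() returns the same tuple.
print(toy.shape)
print(np.shape(toy))

# Count the missing values in every column.
missing = toy.isna().sum()
print(missing)
```

On the real data the same `isna().sum()` call would be applied to `train` and `test` directly.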

## Analyse the datasets

How many people in the training set survived the Titanic disaster? To find out, we can use the value_counts() method together with standard bracket notation, which selects a single column of a DataFrame.

In [22]:
train["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

Now we see that 549 individuals died (62%) and 342 survived (38%). A simple heuristic prediction would be "majority wins": we would predict that every unseen passenger did not survive. To dive a little deeper, we can perform similar counts and percentage calculations on subsets of the Survived column. For example, could gender play a role as well? We can explore this by applying the .value_counts() method to the males and the females separately, which gives a two-way comparison of survival by sex.

Passengers that survived vs passengers that passed away in total:

In [25]:
print(train["Survived"].value_counts())

0    549
1    342
Name: Survived, dtype: int64


Passengers that survived vs passengers that passed away as proportions. To get proportions, we pass in the argument normalize = True.

In [27]:
print(train["Survived"].value_counts(normalize = True))

0    0.616162
1    0.383838
Name: Survived, dtype: float64
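The "majority wins" heuristic can be made concrete with a short sketch. It rebuilds a Survived column from the two counts printed above (549 and 342) instead of loading the data, and computes how accurate the rule would be on the training set. The variable names are illustrative only.

```python
import pandas as pd

# Rebuild a Survived column from the counts shown above: 549 zeros and 342 ones.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# The majority class is the most frequent value.
majority_class = survived.value_counts().idxmax()

# Predicting the majority class for everyone is right exactly as often
# as that class occurs.
baseline_accuracy = (survived == majority_class).mean()

print(majority_class)
print(round(baseline_accuracy, 6))
```

Any model we build later should beat this baseline to be worth the extra complexity.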


To explore whether gender played a role, we apply the .value_counts() method to the males and the females separately.

In [28]:
print(train["Survived"][train["Sex"] == 'male'].value_counts())

0    468
1    109
Name: Survived, dtype: int64


print(train["Survived"][train["Sex"] == 'female'].value_counts())

Now we will do the same exploration, only this time by looking at the proportions instead of the totals.

In [23]:
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))

0    0.811092
1    0.188908
Name: Survived, dtype: float64


In [24]:
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))

1    0.742038
0    0.257962
Name: Survived, dtype: float64
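The same two-way comparison can also be produced in a single step. The sketch below uses pd.crosstab and groupby on a small made-up DataFrame (`toy` is hypothetical, not the Titanic data). Because Survived is coded 0/1, its mean per group is the survival rate of that group.

```python
import pandas as pd

# Hypothetical miniature data set: four men and two women.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Two-way table of counts: one row per sex, one column per outcome.
counts = pd.crosstab(toy["Sex"], toy["Survived"])

# The same table as proportions within each row.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()

print(counts)
print(rates)
print(survival_rate)
```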


Another variable that could influence survival is age, since it's probable that children were saved first. We can test this by creating a new column with a categorical variable Child. The new variable Child will take the value 1 in cases where age is less than 18, and a value of 0 in cases where age is greater than or equal to 18. To add this new variable we need to do two things: (i) create a new column, and (ii) provide the values for each observation (i.e., row) based on the age of the passenger.

First we create a new column called "Child" and fill it with NaN, so that passengers whose age is missing keep a missing value.

In [32]:
train["Child"] = float('NaN')

Then we assign 1 to passengers under 18 and 0 to passengers aged 18 and older. Passengers with a missing age stay NaN.

In [38]:
train.loc[train["Age"] < 18, "Child"] = 1

In [39]:
train.loc[train["Age"] >= 18, "Child"] = 0

Now we print the new column.

In [34]:
print(train["Child"])

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      NaN
6      0.0
7      1.0
8      0.0
9      1.0
10     1.0
11     0.0
12     0.0
13     0.0
14     1.0
15     0.0
16     1.0
17     NaN
18     0.0
19     NaN
20     0.0
21     0.0
22     1.0
23     0.0
24     1.0
25     0.0
26     NaN
27     0.0
28     NaN
29     NaN
      ... 
861    0.0
862    0.0
863    NaN
864    0.0
865    0.0
866    0.0
867    0.0
868    NaN
869    1.0
870    0.0
871    0.0
872    0.0
873    0.0
874    0.0
875    1.0
876    0.0
877    0.0
878    NaN
879    0.0
880    0.0
881    0.0
882    0.0
883    0.0
884    0.0
885    0.0
886    0.0
887    0.0
888    NaN
889    0.0
890    0.0
Name: Child, dtype: float64


We print the normalized survival rates for passengers under 18.

In [36]:
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

1    0.539823
0    0.460177
Name: Survived, dtype: float64


We print the normalized survival rates for passengers aged 18 and older.

In [40]:
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))

0    0.618968
1    0.381032
Name: Survived, dtype: float64
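As an alternative sketch, the Child column can be derived in one expression that keeps missing ages as NaN, after which both survival rates come out of a single groupby. The `toy` DataFrame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set with one missing age.
toy = pd.DataFrame({
    "Age": [4.0, 17.0, 18.0, 40.0, np.nan],
    "Survived": [1, 1, 0, 1, 0],
})

# A comparison with NaN is False, so we first turn the boolean into 1.0/0.0
# and then blank out the rows where Age is unknown.
toy["Child"] = (toy["Age"] < 18).astype(float).where(toy["Age"].notna())

# groupby skips NaN keys by default, so only known ages are compared.
rates = toy.groupby("Child")["Survived"].mean()

print(toy)
print(rates)
```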


## First Prediction

Earlier we discovered that in the training set, females had more than a 50% chance of surviving and males had less than a 50% chance. We can use this information for our first prediction: all females in the test set survive and all males in the test set die. Unlike the training set, the test set has no Survived column, so we add one that holds our predicted values. When we upload our results, Kaggle uses this column (our predictions) to score our performance.

First we create a copy of the test DataFrame called test_one. We then initialize a Survived column to 0 in test_one and set Survived to 1 wherever Sex equals "female".

In [46]:
# .copy() gives an independent DataFrame, so test itself stays unchanged
test_one = test.copy()
test_one["Survived"] = 0
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1

Now we print the result of our prediction.

In [50]:
print(test_one.Survived)

0      0
1      1
2      0
3      0
4      1
5      0
6      1
7      0
8      1
9      0
10     0
11     0
12     1
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     0
26     1
27     0
28     0
29     0
      ..
388    0
389    0
390    0
391    1
392    0
393    0
394    0
395    1
396    0
397    1
398    0
399    0
400    1
401    0
402    1
403    0
404    0
405    0
406    0
407    0
408    1
409    1
410    1
411    1
412    1
413    0
414    1
415    0
416    0
417    0
Name: Survived, dtype: int64
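The gender rule can also be written without any row-wise assignment. The sketch below builds a two-column frame with PassengerId and Survived from a made-up three-row test set. The name `toy_test` and the file name in the final comment are hypothetical.

```python
import pandas as pd

# Hypothetical three-row stand-in for the test DataFrame.
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# The comparison gives True/False; astype(int) turns that into 1/0.
submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})

print(submission)

# Writing the file would look like this (file name chosen for illustration):
# submission.to_csv("gender_baseline.csv", index=False)
```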


## Decision trees to automate slicing and deciding

Until now, we did all the slicing and dicing ourselves to find subsets that have a higher chance of surviving. A decision tree automates this process for us and outputs a classification model, or classifier. Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, it does the split, goes down one level (or one node) and repeats. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in a terminal node determines the prediction for new observations that end up in that node.
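To make "the best variable to split on" more tangible, here is a simplified sketch of how a candidate split can be scored with Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The helper functions and the six-row data are our own illustration, not scikit-learn's actual implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical data: sex (0 = male, 1 = female), a made-up fare band, outcome.
sex = [0, 0, 0, 1, 1, 1]
fare_band = [0, 1, 0, 1, 0, 1]
survived = [0, 0, 0, 1, 1, 0]

print(gini(survived))
print(split_impurity(sex, survived, 0.5))
print(split_impurity(fare_band, survived, 0.5))
```

In this toy example the split on sex leaves purer groups than the split on the fare band, so a tree would pick sex first.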

First, let's import the necessary libraries:

In [51]:
import numpy as np
from sklearn import tree

Before we can begin constructing our trees we need to get our hands dirty and clean the data so that we can use all the features available to us. Earlier we saw that the Age variable has some missing values. Missingness is a whole subject in itself, but we will use a simple imputation technique where we substitute each missing value with the median of all the present values. Another problem is that the Sex and Embarked variables are categorical but in a non-numeric format. Thus, we need to assign each class a unique integer so that scikit-learn can handle the information. Embarked also has some missing values, which we impute with the most common class of embarkation, "S".

First we convert the male and female groups to integer form.

In [53]:
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1

Now we impute the Embarked variable.

In [55]:
train["Embarked"] = train["Embarked"].fillna("S")

Then we convert the Embarked classes to integer form.

In [57]:
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2

We print the Sex and Embarked columns.

In [58]:
print(train["Sex"])

0      0
1      1
2      1
3      1
4      0
5      0
6      0
7      0
8      1
9      1
10     1
11     1
12     0
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     1
26     0
27     0
28     1
29     0
      ..
861    0
862    1
863    1
864    0
865    1
866    1
867    0
868    0
869    0
870    0
871    1
872    0
873    0
874    1
875    1
876    0
877    0
878    0
879    1
880    1
881    0
882    1
883    0
884    0
885    1
886    0
887    1
888    1
889    0
890    0
Name: Sex, dtype: object


In [59]:
print(train["Embarked"])

0      0
1      1
2      0
3      0
4      0
5      2
6      0
7      0
8      0
9      1
10     0
11     0
12     0
13     0
14     0
15     0
16     2
17     0
18     0
19     1
20     0
21     0
22     2
23     0
24     0
25     0
26     1
27     0
28     2
29     0
      ..
861    0
862    0
863    0
864    0
865    0
866    1
867    0
868    0
869    0
870    0
871    0
872    0
873    0
874    1
875    1
876    0
877    0
878    0
879    1
880    0
881    0
882    0
883    0
884    0
885    2
886    0
887    0
888    0
889    1
890    2
Name: Embarked, dtype: object
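The printed columns above still have dtype object, because the integers were written into columns that originally held strings. A possible alternative, sketched here on a made-up three-row DataFrame, is to impute first and then encode with .map(), which returns a proper integer column. The name `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set with one missing Age and one missing Embarked.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", np.nan, "Q"],
    "Age": [22.0, np.nan, 30.0],
})

# Impute: median for Age, most common port "S" for Embarked.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna("S")

# Encode: .map() looks every value up in the dictionary.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(toy)
print(toy.dtypes)
```

Note that .map() turns any value missing from the dictionary into NaN, so imputing before encoding matters here.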


We will use the scikit-learn and numpy libraries to build our first decision tree. scikit-learn lets us create tree objects from the DecisionTreeClassifier class. The methods that we will use take numpy arrays as inputs, so we need to create those from the DataFrame that we already have. To build a decision tree we need two things: (i) a target, which is a one-dimensional numpy array containing the target/response from the train data (Survived in our case), and (ii) features, which is a multidimensional numpy array containing the features/predictors from the train data (e.g. Sex, Age).

We now create the target and features numpy arrays, called: target and features_one. Then we fit our first decision tree, called: my_tree_one. 

In [109]:
import numpy as np
from sklearn import tree

print(train)

target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Fare"]].values

my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)



     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          

One way to quickly see the result of our decision tree is to see the importance of the features that are included. This is done by requesting the .feature_importances_ attribute of our tree object.

In [76]:
print(my_tree_one.feature_importances_)

[ 0.12379776  0.41123936  0.46496288]


The feature_importances_ attribute makes it simple to interpret the significance of the predictors we included. The values follow the order of the columns in features_one: Pclass, Sex, Fare. Based on our decision tree, the variable "Fare" plays the most important role in determining whether or not a passenger survived, since it has the highest importance of all (46.5%).
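Since the importances come back as a bare array, it can help to attach the feature names. The sketch below does this on synthetic data where, by construction, the outcome depends on Sex only. The data, the `clf` name and the seed are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Synthetic data (not the Titanic set): the target is simply the Sex column.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
target = toy["Sex"].values

names = ["Pclass", "Sex", "Fare"]
clf = tree.DecisionTreeClassifier(random_state=0).fit(toy[names].values, target)

# Pair each importance with its column name and sort from high to low.
importances = pd.Series(clf.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```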

Another quick metric is the mean accuracy, which we compute using the .score() method with features_one and target as arguments. Note that this is the accuracy on the same data the tree was trained on.

In [78]:
print(my_tree_one.score(features_one, target))

0.904601571268
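A score computed on the training data tends to be optimistic. One common way to get a more honest estimate is to hold back part of the labelled data, as in this sketch. It uses synthetic noisy data and scikit-learn's train_test_split. The split size, the seed and the variable names are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on the first feature plus random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Keep 25% of the rows aside; the tree never sees them while fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

train_acc = clf.score(X_fit, y_fit)  # accuracy on data the tree has seen
val_acc = clf.score(X_val, y_val)    # accuracy on held-back data

print(train_acc, val_acc)
```

An unrestricted tree can memorize the rows it was fitted on, so the gap between the two numbers is a rough indicator of overfitting.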


To send a submission to Kaggle we need to predict the survival outcome for the observations in the test set. Until now we created simple predictions based on a single subset. Luckily, with our decision tree, we can use a few simple methods to "generate" our answer without having to perform the subsetting manually.

First, we use the .predict() method of our model (my_tree_one), passing it the feature values of the dataset for which we need predictions (test). To extract the features we create a numpy array in the same way as we did when training the model, with the same columns in the same order and with Sex encoded as integers. However, we need to take care of a small but important problem first: there is a missing value in the Fare feature that needs to be imputed.

Next, we need to make sure that our output is in line with Kaggle's submission requirements: a csv file with exactly 418 entries and two columns, PassengerId and Survived. We build a new data frame using DataFrame() and write it to a csv file using the to_csv() method from pandas.

In [106]:
print(test)

     PassengerId  Pclass                                               Name  \
0            892       3                                   Kelly, Mr. James   
1            893       3                   Wilkes, Mrs. James (Ellen Needs)   
2            894       2                          Myles, Mr. Thomas Francis   
3            895       3                                   Wirz, Mr. Albert   
4            896       3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)   
5            897       3                         Svensson, Mr. Johan Cervin   
6            898       3                               Connolly, Miss. Kate   
7            899       2                       Caldwell, Mr. Albert Francis   
8            900       3          Abrahim, Mrs. Joseph (Sophie Halaut Easu)   
9            901       3                            Davies, Mr. John Samuel   
10           902       3                                   Ilieff, Mr. Ylio   
11           903       1                         Jon

In [None]:
# Impute the single missing Fare value (row 152) with the median
test.loc[152, "Fare"] = test["Fare"].median()

# Encode Sex as integers, exactly as we did for the train set
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1

# Extract the same features the tree was trained on: Pclass, Sex and Fare
test_features = test[["Pclass", "Sex", "Fare"]].values

# Make the prediction using the test set and print it
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

# Create a data frame with two columns: PassengerId & Survived. Survived contains the predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that the data frame has 418 entries
print(my_solution.shape)

# Write the solution to a csv file with the name my_solution_two.csv
my_solution.to_csv("my_solution_two.csv", index_label = ["PassengerId"])

We just created our first decision tree, wrote our predictions to a csv file and submitted that file to Kaggle. The result of our effort: we were placed very low and have to do better, so let's see what went wrong before we submit again. When we created our first decision tree we left max_depth at its default of None and min_samples_split at its default of 2. This means that no limit was set on the depth of our tree and that any node with at least two observations could be split further. That's a good thing, right? Not so fast. We are likely overfitting. This means that while our model describes the training data extremely well, it doesn't generalize to new data, which is frankly the point of prediction. Just look at the Kaggle submission results for the simple model based on gender and for the complex decision tree. Which one does better?
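One way to act on this is to limit how far the tree may grow. The sketch below fits an unrestricted tree and a restricted one on synthetic noisy data and compares their size. The values max_depth=3 and min_samples_split=10 are arbitrary illustrations, not tuned settings, and whether the smaller tree scores better on Kaggle would have to be checked with a new submission.

```python
import numpy as np
from sklearn import tree

# Synthetic noisy data (not the Titanic set).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Default settings: the tree keeps splitting until every leaf is pure.
full_tree = tree.DecisionTreeClassifier(random_state=1).fit(X, y)

# Restricted tree: at most 3 levels, and a node needs 10 rows to be split.
small_tree = tree.DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=1
).fit(X, y)

print("full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("small tree: depth", small_tree.get_depth(), "leaves", small_tree.get_n_leaves())
print("training accuracy:", full_tree.score(X, y), small_tree.score(X, y))
```

The restricted tree fits the training data less closely, which is the price paid for a simpler model that may generalize better.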