# Introduction to Data Science using Pandas

We'll start by importing the Pandas library.

This library provides useful data structures and analysis tools for Python.

In [1]:
import pandas as pd

## Titanic dataset

In this notebook we will explore the passenger dataset from the RMS Titanic disaster, which records whether each passenger survived.

The dataset is split into training and testing subsets.

### Read the data from CSV files into a dataframe structure

In [2]:
train = pd.read_csv('data/titanic/train.csv')
test = pd.read_csv('data/titanic/test.csv')

### Display the column names

Let's print out the column names. 

These are specified in the first row of the CSV.

In [3]:
train_columns = list(train)
test_columns = list(test)

print(train_columns)
print(test_columns)

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


#### The test set does not contain the 'Survived' column

In [4]:
list(set(train_columns) - set(test_columns))

['Survived']
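
The same comparison can also be made directly on the column indexes. The following is a minimal sketch, assuming two small hypothetical dataframes (`toy_train` and `toy_test`) in place of the real files:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test dataframes
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})

# Index.difference returns the labels present in the first index but not in the second
missing_in_test = toy_train.columns.difference(toy_test.columns).tolist()
print(missing_in_test)
```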

### Display the data from the dataset

Let's look at example values from the head of the datasets.

In [5]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Creating a new categorical feature

Using the values in the Age column, we can create a new categorical column, Child, which contains 1 if the passenger is under 18 years old and 0 otherwise. Passengers whose age is unknown keep the value NaN.

Create a new column in the dataframe and set all of its values to NaN.

In [7]:
train["Child"] = float('NaN')

Using the Age column, we define the conditions that fill in the values of the Child column.

In [8]:
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
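
The same column could also be built from a single comparison. This is only a sketch on a hypothetical `toy` dataframe, not the approach used in this notebook; it shows how a missing age can be kept as NaN rather than being counted as an adult.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
toy = pd.DataFrame({"Age": [4.0, 17.0, 18.0, 40.0, np.nan]})

# Compare once, then cast the boolean result to float (True -> 1.0, False -> 0.0)
toy["Child"] = (toy["Age"] < 18).astype(float)

# Comparisons with NaN evaluate to False, so restore NaN where Age is unknown
toy.loc[toy["Age"].isna(), "Child"] = np.nan
print(toy)
```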

## Survival rates

### Children and Adults

Normalized Survival Rates for passengers under 18

In [9]:
train["Survived"][train["Child"] == 1].value_counts(normalize = True)

1    0.539823
0    0.460177
Name: Survived, dtype: float64

Normalized Survival Rates for passengers 18 or older

In [10]:
train["Survived"][train["Child"] == 0].value_counts(normalize = True)

0    0.618968
1    0.381032
Name: Survived, dtype: float64

### Men and Women

Let's first convert the categorical values ("male" and "female") to integer form.

In [11]:
train.loc[train["Sex"] == "female", "Sex"] = 1
train.loc[train["Sex"] == "male", "Sex"] = 0

Survival rate of women.

In [12]:
train["Survived"][train["Sex"] == 1].value_counts(normalize = True)

1    0.742038
0    0.257962
Name: Survived, dtype: float64

Survival rate of men.

In [13]:
train["Survived"][train["Sex"] == 0].value_counts(normalize = True)

0    0.811092
1    0.188908
Name: Survived, dtype: float64

As you can see, the survival rate is 54% for children and 38% for adults.

The survival rate for women is 74%, while for men it is 19%.

These results suggest that women and children were given priority in the rescue and were more likely to survive.
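
Because Survived only takes the values 0 and 1, a survival rate is simply the mean of that column within a group. The sketch below uses a small hypothetical `toy` dataframe, not the Titanic data, to show how such rates could be computed for every group at once with `groupby`:

```python
import pandas as pd

# Hypothetical passengers; Sex is already encoded as 1 = female, 0 = male
toy = pd.DataFrame({
    "Sex":      [1, 1, 1, 1, 0, 0, 0, 0],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Survived is 0/1, so its mean within each group is the survival rate
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```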

## Decision trees

Let's build our first survival prediction model on the principle of Decision Trees.

A decision tree is a flowchart-like structure in which each internal node represents a test of an attribute, each branch represents an outcome of that test, and each leaf holds a prediction.
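
To make the flowchart idea concrete, here is a hand-written two-level tree expressed as nested `if` statements. The function name `toy_survival_tree` and its rules are hypothetical and chosen only for illustration; a real tree learns its tests from the data.

```python
def toy_survival_tree(sex, pclass):
    """Hand-written two-level tree; the rules are hypothetical, not learned."""
    # Root node: test the Sex attribute (1 = female, 0 = male)
    if sex == 1:
        # Internal node: test the passenger class
        if pclass <= 2:
            return 1  # leaf: predict survived
        return 0      # leaf: predict did not survive
    return 0          # leaf: predict did not survive


print(toy_survival_tree(sex=1, pclass=1))
print(toy_survival_tree(sex=0, pclass=3))
```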

We will import the numerical library NumPy and the tree module from scikit-learn.

In [14]:
#Import the Numpy library
import numpy as np

#Import 'tree' from scikit-learn library
from sklearn import tree

We need to fill in the missing values in the Embarked column and then convert its values to integer form.

In [15]:
train["Embarked"] = train["Embarked"].fillna("S")

#Convert the Embarked classes to integer form
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
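
As an alternative sketch, the fill and the conversion could be chained with a single lookup table passed to `Series.map`. The `toy` dataframe and the `port_codes` dictionary are assumptions made for this example; the codes match the ones used above.

```python
import pandas as pd

# Hypothetical port values, including a missing one
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q"]})

# Fill the gap with "S", then translate every label through one lookup table
port_codes = {"S": 0, "C": 1, "Q": 2}
toy["Embarked"] = toy["Embarked"].fillna("S").map(port_codes)
print(toy["Embarked"].tolist())
```

One side effect of this form is that the resulting column has an integer dtype instead of holding integers inside an object column.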

Let's also fill in the missing Age values with the median of all the Age values.

In [16]:
train["Age"] = train["Age"].fillna(train["Age"].median())

### Defining our target and features

We are building a survival prediction model so the values of the Survived column are our target values.

As our features we will pick:
- Passenger class
- Sex
- Age
- Fare

In [17]:
target = train["Survived"].values

features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

### Fitting our first Decision tree

In [18]:
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

Once the tree is trained, we can display the importance of each individual feature.

In [19]:
importances = my_tree_one.feature_importances_

for idx, val in enumerate(["Pclass", "Sex", "Age", "Fare"]):
    print("Name: %s, Importance: %.2f%%" % (val, importances[idx] * 100))    

Name: Pclass, Importance: 12.32%
Name: Sex, Importance: 31.27%
Name: Age, Importance: 23.23%
Name: Fare, Importance: 33.18%


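Let's display the score of our decision tree. For a classifier, score returns the mean accuracy, here measured on the same data the tree was trained on.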

In [20]:
print("Score: %.2f%%" % (my_tree_one.score(features_one, target) * 100))

Score: 97.76%
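
A score measured on the training rows says little about how the tree behaves on unseen passengers. One common way to estimate that is to hold out part of the data before fitting. The sketch below uses synthetic data (the arrays `X` and `y` are assumptions, not the Titanic features) and scikit-learn's `train_test_split`:

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 4 continuous features, noisy binary target
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

# Hold out a quarter of the rows that the tree never sees during fitting
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
fit_score = clf.score(X_fit, y_fit)   # accuracy on the rows used for fitting
val_score = clf.score(X_val, y_val)   # accuracy on the held-out rows
print("Fit score: %.2f, validation score: %.2f" % (fit_score, val_score))
```

A large gap between the two numbers is a typical sign that the tree has memorized its training rows.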


### Predicting Survival on the test set

We will prepare the test set in the same way as the train set.

In [21]:
# Impute the missing value with the median
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

# Convert Sex to integers
test.loc[test["Sex"] == "female", "Sex"] = 1
test.loc[test["Sex"] == "male", "Sex"] = 0

#Impute the Embarked variable
test["Embarked"] = test["Embarked"].fillna("S")

#Convert the Embarked classes to integer form
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

#Fill missing Age values with median
test["Age"] = test["Age"].fillna(test["Age"].median())
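
The cell above fills each gap with a median computed on the test set itself. A frequently used alternative, sketched here on hypothetical `toy_train` and `toy_test` dataframes, is to compute the statistic once on the training data and reuse it, so that both sets are preprocessed with the same value:

```python
import numpy as np
import pandas as pd

# Hypothetical train and test ages with gaps
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 60.0]})

# Compute the statistic once, on the training data only
train_age_median = toy_train["Age"].median()

# Apply the same value to both sets
toy_train["Age"] = toy_train["Age"].fillna(train_age_median)
toy_test["Age"] = toy_test["Age"].fillna(train_age_median)
print(toy_test["Age"].tolist())
```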

Let's define our test features in the same way as the train features.

In [22]:
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

Making a prediction using the test set

In [23]:
my_prediction = my_tree_one.predict(test_features)

We will create a dataframe indexed by PassengerId with a single Survived column containing our predictions. Written to CSV, this gives the two columns PassengerId and Survived.

In [24]:
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
my_solution.head()

Unnamed: 0,Survived
892,0
893,0
894,1
895,1
896,1


Let's check that our dataframe has 418 entries.

In [25]:
my_solution.shape

(418, 1)

### Writing the solution into a CSV file

Pandas dataframes can be exported to CSV files using the to_csv method.

In [26]:
my_solution.to_csv("data/titanic/my_solution_one.csv", index_label = ["PassengerId"])
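
To see what `index_label` does without touching the disk, the sketch below writes a small hypothetical dataframe (`toy_solution`) to an in-memory buffer and reads it back:

```python
import io
import pandas as pd

# Hypothetical predictions indexed by passenger id
toy_solution = pd.DataFrame({"Survived": [0, 1, 1]}, index=[892, 893, 894])

# Write to an in-memory buffer instead of a file; index_label names the index column
buffer = io.StringIO()
toy_solution.to_csv(buffer, index_label="PassengerId")
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers both columns
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))
```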

### Adding more features to the model

We will create a new array with added features and name it features_two.

In [27]:
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

### Overfitting

Overfitting means that our model fits the training data too closely, including its noise, so it fails to generalize and has poor predictive performance on new data. This tends to occur as the model becomes more complex, for example when a tree is allowed to grow very deep.

We can control the overfitting of our decision tree classifier by specifying the maximum depth and the minimum sample split (the minimum number of samples required to split an internal node).

In [28]:
#Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5 : my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth=max_depth,min_samples_split=min_samples_split, random_state = 1)
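
The effect of these two parameters can be inspected on the fitted tree itself. The sketch below uses synthetic data (`X` and `y` are assumptions) and compares an unconstrained tree with one that uses the same limits as above:

```python
import numpy as np
from sklearn import tree

# Synthetic stand-in data with a noisy binary target
rng = np.random.RandomState(1)
X = rng.rand(300, 4)
y = (X[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)

# An unconstrained tree keeps splitting until its leaves are pure
full_tree = tree.DecisionTreeClassifier(random_state=1).fit(X, y)

# The constrained tree stops at depth 10 and never splits nodes with fewer than 5 samples
small_tree = tree.DecisionTreeClassifier(
    max_depth=10, min_samples_split=5, random_state=1
).fit(X, y)

print("Leaves:", full_tree.get_n_leaves(), "vs", small_tree.get_n_leaves())
print("Depth: ", full_tree.get_depth(), "vs", small_tree.get_depth())
```

A smaller tree has fewer opportunities to memorize individual training rows, which is why its training score is usually lower.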

### Training our second tree with more features

Let's now fit our second tree.

In [29]:
my_tree_two = my_tree_two.fit(features_two, target)

Once again we display our feature importances and score.

In [30]:
importances = my_tree_two.feature_importances_

for idx, val in enumerate(["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]):
    print("Name: %s, Importance: %.2f%%" % (val, importances[idx] * 100))    

Name: Pclass, Importance: 14.13%
Name: Age, Importance: 17.91%
Name: Sex, Importance: 41.62%
Name: Fare, Importance: 17.94%
Name: SibSp, Importance: 5.04%
Name: Parch, Importance: 1.92%
Name: Embarked, Importance: 1.44%


In [31]:
print("Score: %.2f%%" % (my_tree_two.score(features_two, target) * 100))

Score: 90.57%


Note: The score is lower than before because we introduced the parameters to counter the overfitting problem. Both scores are measured on the training data.

### Feature engineering

We could assume that larger families need more time to get together and hence have a lower probability of surviving.

Let's try creating a new feature Family size.

We will copy our train dataset and calculate the family size based on:
- SibSp - number of siblings / spouses aboard
- Parch - number of parents / children aboard

In [45]:
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1

Create a new feature set and add the new feature

In [46]:
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

Define the tree classifier, then fit the model

In [47]:
my_tree_three = tree.DecisionTreeClassifier(max_depth=max_depth,min_samples_split=min_samples_split, random_state = 1)
my_tree_three = my_tree_three.fit(features_three, target)

In [48]:
# Print the score of this decision tree
print("Score: %.2f%%" % (my_tree_three.score(features_three, target) * 100))

Score: 89.90%


## Random forest

Random forest is an ensemble learning method that fits multiple randomized decision trees on various subsamples of the data and averages their predictions to improve accuracy and control overfitting.
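
The idea can be sketched by hand with plain decision trees. The example below is a simplification on synthetic data, not scikit-learn's actual implementation: it draws bootstrap samples, fits one randomized tree per sample, and takes a majority vote, whereas `RandomForestClassifier` averages the predicted class probabilities.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

n_trees = 25  # an odd number, so a majority vote can never tie
trees = []
for i in range(n_trees):
    # Bootstrap: draw row indices with replacement
    rows = rng.randint(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(t.fit(X[rows], y[rows]))

# Average the per-tree votes and take the majority class
votes = np.array([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)
print("Accuracy of the averaged vote: %.2f" % (forest_prediction == y).mean())
```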

Let's import the Random forest classifier from scikit-learn.

In [49]:
from sklearn.ensemble import RandomForestClassifier

Select our features for the Random forest

In [50]:
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

Now we can build and fit our model.

In [51]:
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

Print the score of the model.

In [52]:
print("Score: %.2f%%" % (my_forest.score(features_forest, target) * 100))

Score: 93.94%


Compute predictions on our test set features then print the length of the prediction vector

In [53]:
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))

418


Let's display our feature importances and also compare them with the second decision tree.

In [57]:
importance_features = ["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]
importances_tree = my_tree_two.feature_importances_ * 100
importances_forest = my_forest.feature_importances_ * 100

importances_df = pd.DataFrame({
    'Name': importance_features,
    'Decision tree': importances_tree,
    'Random forest': importances_forest
})

importances_df = importances_df[['Name', 'Decision tree', 'Random forest']]
importances_df

Unnamed: 0,Name,Decision tree,Random forest
0,Pclass,14.130255,10.384741
1,Age,17.906027,20.139027
2,Sex,41.616727,31.989322
3,Fare,17.938711,24.602858
4,SibSp,5.039699,5.272693
5,Parch,1.923751,4.159232
6,Embarked,1.44483,3.452128


### Compute and print the mean accuracy score for both models

In [62]:
print("Decision tree score: %.2f%%" % (my_tree_two.score(features_two, target) * 100))
print("Random forest score: %.2f%%" % (my_forest.score(features_forest, target) * 100))

Decision tree score: 90.57%
Random forest score: 93.94%


## Testing other models

### Logistic regression

In [65]:
from sklearn import linear_model

features_lr = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
my_lr = linear_model.LogisticRegression(C=1e5)
my_lr.fit(features_lr, target)

print("Score: %.2f%%" % (my_lr.score(features_lr, target) * 100))

Score: 79.91%


### Gradient boosting classifier

In [68]:
from sklearn.ensemble import GradientBoostingClassifier

features_gb = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
my_gb = GradientBoostingClassifier(n_estimators=200, random_state=1, verbose=False)
my_gb.fit(features_gb, target)

print("Score: %.2f%%" % (my_gb.score(features_gb, target) * 100))


Score: 92.59%


### Decision Tree Regression with AdaBoost

In [73]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

features_ada = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

my_ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=10),
                          n_estimators=200, random_state=1)
my_ada.fit(features_ada, target)

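# Note: for a regressor, score() returns R^2 (the coefficient of determination),
# not accuracy, so this value is not directly comparable to the classifier scores
print("Score: %.2f%%" % (my_ada.score(features_ada, target) * 100))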

Score: 64.44%
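
The score of a regressor measures something different from the accuracy reported for the classifiers. The sketch below, on synthetic data with a plain `DecisionTreeRegressor` standing in for the boosted model, checks that `score` matches `r2_score` and shows one possible way to turn the continuous output into an accuracy by thresholding at 0.5 (the threshold is an assumption made for this example):

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a 0/1 target
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
predicted = reg.predict(X)

# For regressors, score() is the coefficient of determination R^2
print("score():  %.4f" % reg.score(X, y))
print("R^2:      %.4f" % r2_score(y, predicted))

# Thresholding the continuous output at 0.5 gives class labels and an accuracy
labels = (predicted >= 0.5).astype(int)
print("Accuracy: %.4f" % accuracy_score(y, labels))
```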


This notebook was partially inspired by the Datacamp course:
https://www.datacamp.com/community/open-courses/kaggle-python-tutorial-on-machine-learning#gs._n34sMg