# 1. Question or problem definition

Competition sites like Kaggle define the problem to solve, or the questions to ask, and provide the datasets for training a data science model and testing its results against a test dataset. The problem definition for the Titanic Survival competition is described on the Kaggle competition page.

Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, for a test dataset that does not contain the survival information, whether each passenger survived or not?

We may also want to develop some early understanding of the domain of our problem. This is described on the Kaggle competition description page. Here are the highlights to note:

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. That translates to a survival rate of about 32%.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
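
The 32% figure follows directly from the two counts quoted above. A quick check of the arithmetic, using only those two numbers:

```python
# Counts quoted in the competition description
total_aboard = 2224
deaths = 1502

survivors = total_aboard - deaths
survival_rate = survivors / total_aboard

print(f"{survivors} survivors, survival rate {survival_rate:.1%}")
```

Keeping this overall rate in mind is useful later, when we compare it with the survival rate found in the training sample.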



#### VARIABLES OF THE DATA:
    VARIABLE    DESCRIPTION
    survival    Survival
    pclass      Ticket class
    sex         Sex
    age         Age in years
    sibsp       # of siblings / spouses aboard the Titanic
    parch       # of parents / children aboard the Titanic
    ticket      Ticket number
    fare        Passenger fare
    cabin       Cabin number
    embarked    Port of Embarkation

Some notes:
pclass:     A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.





## Importing libraries

In [209]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn import tree

# 2. Acquire data

In [210]:
# We read the values from the csv files and store them in a dataframe
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# We also keep both dataframes in a list, so that certain operations can be run on both of them in a single loop
combine = [train_df, test_df]
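
One detail about this list is easy to miss: it stores references to the two dataframes, not copies. The sketch below illustrates the consequence on two toy frames (`train_toy`, `test_toy` and the `FareRounded` column are made up for the illustration and are not part of the notebook).

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (not the real Titanic data)
train_toy = pd.DataFrame({"Fare": [7.25, 71.28]})
test_toy = pd.DataFrame({"Fare": [7.83, 7.00]})
combine_toy = [train_toy, test_toy]

# Adding a column through the list modifies the original objects in place
for dataset in combine_toy:
    dataset["FareRounded"] = dataset["Fare"].round()

print("FareRounded" in train_toy.columns)  # the change is visible via train_toy

# Rebinding the name creates a new object; the list still holds the old one
train_toy = train_toy.drop("FareRounded", axis=1)
print(combine_toy[0] is train_toy)
```

This is why the list needs to be rebuilt after any cell that reassigns `train_df` or `test_df`.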


# 3. Wrangle, prepare, cleanse the data.

## Explore data

In [211]:
# Dimensions of the train dataframe: 891 samples (rows), 12 columns
train_df.shape

(891, 12)

In [212]:
# Types of each column
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [213]:
# Visualizing the first 10 rows of the dataset
train_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [214]:
# Statistics of the dataframe. Only numerical columns are taken into account
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [215]:
# If we want to rename a column, we can use:
# train_df.rename(columns={'Survived':'survived'}, inplace=True)

In [216]:
# How many unique values does the Sex column have?
train_df.Sex.nunique()

2

In [217]:
# What are the possible values of sex column?
train_df.Sex.unique()

array(['male', 'female'], dtype=object)

In [218]:
# How many people survived?
train_df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [219]:
# Are there any missing values in survived column?
train_df['Survived'].isnull().sum()

0

### Part 3 exercises



In [220]:
# What are all the possible values of sibsp and parch?


In [221]:
# How many people survived in percentage?


Why is the percentage survival different from the one we shared on the slides?

In [222]:
# Print the last 10 rows of the dataset


Which variables do you think will be included in the model in order to predict survival?
Pclass, Sex, Age, SibSp, Parch, Fare and Embarked. It doesn't make sense to include the rest of the variables (PassengerId, Name, Ticket and Cabin), as they act as identifiers of individual passengers.

-- END OF EXERCISES --

In [223]:
# Let's see what fields are included in the test dataframe
list(test_df)

['PassengerId',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [224]:
# There is no Survived column, so we add it and fill it with zeros
test_df['Survived'] = 0


test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0


In [225]:
# Now we take only the columns that we need for the submission
submit_df = test_df[['PassengerId','Survived']]
submit_df.head()

# Store the dataframe into a csv
submit_df.to_csv("1_theyallperish.csv", index = False)

# And we are ready to send our prediction to Kaggle!

# 4. Analyze, identify patterns, and explore the data.

### Gender

The disaster was famous for saving “women and children first”, so let’s take a look at the Sex and Age variables to see if any patterns are evident. We’ll start with the gender of the passengers. 

In [226]:
train_df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

Now we will check the percentage of people that survived depending on their sex.
For this, we will create a new dataframe for each sex.

In [227]:
mask_fem = train_df['Sex'] == 'female'
train_df_fem = train_df[mask_fem]

train_df_fem.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S


This dataframe contains only females, as can be seen in the output of the tail command.

As one last check, we verify that the female dataframe has 314 rows, which is the total number of women we counted in the original training dataset.

In [228]:
train_df_fem.shape

(314, 12)

The last step is to see how many of those women survived (in percentage)

In [229]:
(train_df_fem['Survived'].value_counts())/(train_df_fem.shape[0])

1    0.742038
0    0.257962
Name: Survived, dtype: float64

Therefore we can see that 74.20% of the women aboard survived. Let's check the percentage for men:

In [230]:
mask_male = train_df['Sex'] == 'male'
train_df_male = train_df[mask_male]

(train_df_male['Survived'].value_counts())/(train_df_male.shape[0])

0    0.811092
1    0.188908
Name: Survived, dtype: float64

Only 18.89% of the men survived. We can now see that the majority of females aboard survived, while a very low percentage of males did. In our last prediction we said they all met Davy Jones, so updating our prediction based on this new insight should give us a big gain on the leaderboard! Let’s update our old prediction:

In [231]:
test_df.loc[test_df['Sex'] == 'female', 'Survived'] = 1

In [232]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,1
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1


In [233]:
# Now we take only the columns that we need for the submission
submit2_df = test_df[['PassengerId','Survived']]
submit2_df.head()

# Store the dataframe into a csv
submit2_df.to_csv("2_womensurvive.csv", index = False)

Now let’s write a new submission and send it to Kaggle to see how it’s improved our position!
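
Before uploading, the "women survive, men perish" rule can be sanity-checked against the training labels. The sketch below uses only the counts and rates printed in the outputs above. Note that it measures accuracy on the training set, which is not the same thing as the leaderboard score.

```python
# Counts and rates taken from the outputs above
n_female, n_male = 314, 577
female_survival_rate = 0.742038   # share of women who survived
male_death_rate = 0.811092        # share of men who did not survive

# The rule predicts "survived" for every woman and "died" for every man,
# so it is right for the surviving women and for the men who died
correct_female = round(n_female * female_survival_rate)
correct_male = round(n_male * male_death_rate)

rule_accuracy = (correct_female + correct_male) / (n_female + n_male)
print(f"{correct_female + correct_male} of {n_female + n_male} correct: {rule_accuracy:.2%}")
```

For comparison, predicting that everybody perished is right for the 549 passengers out of 891 who did not survive.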

### Age

Let’s start digging into the age variable now, print the statistics related to this column:

Are there any missing values? 

Missing values are common in real-world data and can be quite difficult to deal with. For now we could assume that the 177 passengers with a missing age have the average age of the rest of the passengers, i.e. late twenties.
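
A minimal sketch of how missing values can be counted and replaced by the mean, shown on a toy series (the ages below are made-up values, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy ages with gaps (hypothetical values)
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 2.0], name="Age")

print(ages.describe())            # count excludes the missing entries
n_missing = ages.isnull().sum()   # number of missing entries
mean_age = ages.mean()            # mean() skips NaN by default

filled = ages.fillna(mean_age)
print(n_missing, mean_age)
```

On the real data, the same calls would be applied to `train_df['Age']`.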

Our last few tables were on categorical variables, i.e. they only had a few values. Now we have a continuous variable, which makes drawing proportion tables almost useless, as there may only be one or two passengers for each age! So, let’s create a new variable, “Child”, to indicate whether the passenger is below the age of 18:

In [234]:
train_df['Child'] = 0
train_df.loc[train_df['Age'] < 18, 'Child'] = 1

You will see that any passenger with a missing (NA) age has been assigned a zero in the Child column, because a comparison against NA always evaluates to False. This is what we wanted though, since we had decided to use the average age, which is that of an adult. Nevertheless, these null values can be troublesome later when we feed the data into the models, so let's replace all the NA values:

In [235]:
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())
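
The behaviour of missing ages in the `Child` flag can be checked on a toy frame. This is only an illustration with made-up ages:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [4.0, 30.0, np.nan]})

# Any ordering comparison with NaN is False
print(np.nan < 18, np.nan >= 18)

# So the mask below is False for the row with a missing age
toy["Child"] = 0
toy.loc[toy["Age"] < 18, "Child"] = 1
print(toy)
```

The row with the missing age keeps the default value of 0, i.e. it is treated as an adult.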

Did children have more chances to survive than adults? Let's look at the statistics:

In [236]:
# Create the male and female dataframes again, so they include the new column child


In [237]:
# Now calculate the percentage of male kids that survived


print("Percentage of male kids that survived: ")


Percentage of male kids that survived: 


In [238]:
# Now calculate the percentage of male adults that survived
print("Percentage of male adults that survived: ")


Percentage of male adults that survived: 


In [239]:
# Now calculate the percentage of males that survived
print("Percentage of males that survived: ")


Percentage of males that survived: 


Let's see now for women:

In [240]:
# Calculate the percentage of female kids that survived

print("Percentage of female kids that survived: ")


Percentage of female kids that survived: 


In [241]:
# Calculate the percentage of female adults that survived
print("Percentage of female adults that survived: ")



Percentage of female adults that survived: 


In [242]:
# Calculate the percentage of females that survived
print("Percentage of females that survived: ")


Percentage of females that survived: 


Well, it still appears that most female passengers survive and most male passengers don’t, regardless of whether they were children or not. So we haven’t found anything here that would change our predictions.
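
A comparison across two variables at once, like the sex and child breakdown above, can also be laid out as a single table. The sketch below uses a made-up sample (the `toy` rows are hypothetical) and `pivot_table`. It is one possible approach, not necessarily the one intended for the exercises.

```python
import pandas as pd

# Hypothetical rows, not the real passenger list
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "female", "female", "female"],
    "Child":    [1, 0, 0, 1, 0, 0],
    "Survived": [1, 0, 0, 1, 1, 0],
})

# Survival rate for each (Sex, Child) combination
table = toy.pivot_table(index="Sex", columns="Child", values="Survived", aggfunc="mean")
# Group sizes, to see how many passengers each rate is based on
counts = toy.pivot_table(index="Sex", columns="Child", values="Survived", aggfunc="count")

print(table)
print(counts)
```

Looking at the group sizes next to the rates helps to avoid reading too much into groups with very few passengers.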


### Ticket fare

Let’s take a look at a couple of other potentially interesting variables to see if we can find anything more: the class that they were riding in, and what they paid for their ticket.

While the class variable is limited to a manageable 3 values, the fare is again a continuous variable that needs to be reduced to something that can be easily tabulated. Let’s bin the fares into less than 10 dollars, between 10 and 20, 20 to 30 and more than 30, and store the result in a new column:

In [243]:
train_df['Fare2'] = 'unknown'  # placeholder, overwritten below for every non-missing fare
train_df.loc[train_df['Fare'] < 10, 'Fare2'] = '<10'
train_df.loc[(train_df['Fare'] >= 10) & (train_df['Fare'] < 20), 'Fare2'] = '10-20'
# Use >= 20 so that a fare of exactly 20 does not fall outside every bin
train_df.loc[(train_df['Fare'] >= 20) & (train_df['Fare'] <= 30), 'Fare2'] = '20-30'
train_df.loc[train_df['Fare'] > 30, 'Fare2'] = '>30'
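
As an alternative to writing one `.loc` assignment per bin, pandas offers `pd.cut`. The sketch below is a possible variant under stated assumptions: the sample fares are made up, the label `30+` is chosen for the illustration, and with `right=False` a fare of exactly 30 lands in the top bin, unlike in the cell above.

```python
import numpy as np
import pandas as pd

fares = pd.Series([7.25, 10.0, 20.0, 29.99, 30.0, 512.33])  # hypothetical sample fares

# right=False makes every bin closed on the left: [10, 20), [20, 30), ...
fare_bins = pd.cut(
    fares,
    bins=[-np.inf, 10, 20, 30, np.inf],
    labels=["<10", "10-20", "20-30", "30+"],
    right=False,
)
print(fare_bins.tolist())
```

Since the bin edges are listed once and in order, it is harder to leave a gap between two neighbouring bins by accident.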

And now let's present this information in a more visual way than before:

In [244]:
train_df[['Sex', 'Fare2', 'Pclass', 'Survived']].groupby(['Sex', 'Pclass', 'Fare2'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Sex,Pclass,Fare2,Survived
4,female,2,>30,1.0
1,female,1,>30,0.977011
2,female,2,10-20,0.914286
3,female,2,20-30,0.903226
0,female,1,20-30,0.857143
7,female,3,<10,0.59375
5,female,3,10-20,0.581395
9,male,1,20-30,0.441176
11,male,1,>30,0.365854
6,female,3,20-30,0.333333


While the majority of males, regardless of class or fare, still don’t do so well, we notice that most of the class 3 women who paid more than $20 for their ticket also miss out on a lifeboat.

It’s a little hard to imagine why someone in third class with an expensive ticket would be worse off in the accident, but perhaps those more expensive cabins were located close to the iceberg impact site, or further from exit stairs? Whatever the cause, let’s make a new prediction based on the new insights.

In [245]:
test_df.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Survived
count,418.0,418.0,418.0,418.0,418.0,417.0,418.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188,0.363636
std,120.810458,0.841838,12.634534,0.89676,0.981429,55.907576,0.481622
min,892.0,1.0,0.17,0.0,0.0,0.0,0.0
25%,996.25,1.0,23.0,0.0,0.0,7.8958,0.0
50%,1100.5,3.0,30.27259,0.0,0.0,14.4542,0.0
75%,1204.75,3.0,35.75,1.0,0.0,31.5,1.0
max,1309.0,3.0,76.0,8.0,9.0,512.3292,1.0


In [246]:
test_df.loc[(test_df['Sex'] == 'female') & (test_df['Pclass'] == 3) & (test_df['Fare'] > 20), 'Survived'] = 0

We can see that after this tweak the predicted survival rate for the test set went down from 36.4% to 33.7%:

In [247]:
test_df.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Survived
count,418.0,418.0,418.0,418.0,418.0,417.0,418.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188,0.337321
std,120.810458,0.841838,12.634534,0.89676,0.981429,55.907576,0.473362
min,892.0,1.0,0.17,0.0,0.0,0.0,0.0
25%,996.25,1.0,23.0,0.0,0.0,7.8958,0.0
50%,1100.5,3.0,30.27259,0.0,0.0,14.4542,0.0
75%,1204.75,3.0,35.75,1.0,0.0,31.5,1.0
max,1309.0,3.0,76.0,8.0,9.0,512.3292,1.0


In [248]:
# Now we take only the columns that we need for the submission
submit3_df = test_df[['PassengerId','Survived']]
submit3_df.head()

# Store the dataframe into a csv
submit3_df.to_csv("3_lesswomensurvive.csv", index = False)

# 5. Model, predict and solve the problem.
## Decision trees



There are two parts to the data. The input data (sex, age, fare, passenger class, etc.) is all the information we know about a passenger. The other part is whether they survived or not: this is the "label", the result that we want to find out with our prediction algorithms.

In order to train the model, we will need to separate the input data and the result (survival).

In [249]:
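# The following variables are identifiers, therefore we will remove them:
# passenger id, passenger name, ticket number and cabin.
# Fare2 and Child are helper columns derived from Fare and Age, so we drop them from the training data too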
X_train = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Fare2', 'Child'], axis = 1)
X_test = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)


X_train = X_train.drop("Survived", axis=1)
X_test = X_test.drop("Survived", axis=1)
Y_train = train_df["Survived"]


X_train.shape, Y_train.shape, X_test.shape 

((891, 7), (891,), (418, 7))

In [250]:
# Let's check if the columns are really removed
X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
3,1,female,35.0,1,0,53.1,S
4,3,male,35.0,0,0,8.05,S


In [251]:
X_test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,34.5,0,0,7.8292,Q
1,3,female,47.0,1,0,7.0,S
2,2,male,62.0,0,0,9.6875,Q
3,3,male,27.0,0,0,8.6625,S
4,3,female,22.0,1,1,12.2875,S


Now we will choose the model and train it with the data

In [252]:
decision_tree = tree.DecisionTreeClassifier()
#decision_tree = decision_tree.fit(X_train, Y_train)

Fitting the model at this point fails because we are feeding it categorical (text) values, which is not allowed. Let's convert all categorical values to numerical ones. For this, we will use one-hot encoding:

In [253]:
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

In [254]:
X_train.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,3,22.0,1,0,7.25,0,1,0,0,1
1,1,38.0,1,0,71.2833,1,0,1,0,0
2,3,26.0,0,0,7.925,1,0,0,0,1
3,1,35.0,1,0,53.1,1,0,0,0,1
4,3,35.0,0,0,8.05,0,1,0,0,1
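
One thing to watch when encoding the training and test frames separately is that a category may be absent from one of them, which would leave the two frames with different columns. That does not happen in this notebook, where both frames end up with the same ten columns, but the following sketch shows the situation on toy data (hypothetical frames) together with one possible way to align them:

```python
import pandas as pd

train_toy = pd.DataFrame({"Pclass": [3, 1, 2], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "C"]})  # no "Q" here

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)
print(list(train_enc.columns))
print(list(test_enc.columns))   # Embarked_Q is missing

# Give the test frame exactly the training columns, filling absent dummies with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```
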


Now let's train the model again:

In [255]:
decision_tree = decision_tree.fit(X_train, Y_train)

In [256]:
#Y_pred = decision_tree.predict(X_test)

Why is the model failing now?


In [257]:
X_test.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
count,418.0,418.0,418.0,418.0,417.0,418.0,418.0,418.0,418.0,418.0
mean,2.26555,30.27259,0.447368,0.392344,35.627188,0.363636,0.636364,0.244019,0.110048,0.645933
std,0.841838,12.634534,0.89676,0.981429,55.907576,0.481622,0.481622,0.430019,0.313324,0.478803
min,1.0,0.17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,23.0,0.0,0.0,7.8958,0.0,0.0,0.0,0.0,0.0
50%,3.0,30.27259,0.0,0.0,14.4542,0.0,1.0,0.0,0.0,1.0
75%,3.0,35.75,1.0,0.0,31.5,1.0,1.0,0.0,0.0,1.0
max,3.0,76.0,8.0,9.0,512.3292,1.0,1.0,1.0,1.0,1.0


In [258]:
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())
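
The `describe()` output above reports a count of 417 for Fare against 418 for every other column, so one fare is missing, and that is what the failing `predict` call points to. Instead of reading it off the counts, the columns with gaps can be listed directly. A minimal sketch on a toy frame (the missing value is placed by hand):

```python
import numpy as np
import pandas as pd

# Toy feature frame with one missing fare
X_toy = pd.DataFrame({
    "Pclass": [3, 3, 2],
    "Age": [34.5, 47.0, 62.0],
    "Fare": [7.8292, np.nan, 9.6875],
})

missing_per_column = X_toy.isnull().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```
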

In [259]:
Y_pred = decision_tree.predict(X_test)

This time the model runs without errors and produces a survival prediction for each passenger in the test dataset.

In [260]:
submit_df = pd.DataFrame({"PassengerId": test_df["PassengerId"], "Survived": Y_pred})
submit_df.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,1
3,895,1
4,896,1


In [261]:
# Store the dataframe into a csv
submit_df.to_csv("4_decisiontree.csv", index = False)

Let's check the accuracy on our training set

In [262]:
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

98.2

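Why is the accuracy of our model so high? Because we are scoring it on the same data it was trained on, and a tree with no depth limit can memorize that data almost perfectly. Welcome to overfitting!!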
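
A common way to make overfitting visible is to hold back part of the labelled data and score the model on it. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the split ratio and noise level are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features; flip_y adds label noise
X, y = make_classification(n_samples=891, n_features=10, flip_y=0.2, random_state=0)

# Keep 25% of the labelled rows aside; the tree never sees them during fit
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

unrestricted_tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

fit_accuracy = unrestricted_tree.score(X_fit, y_fit)   # data the tree has seen
val_accuracy = unrestricted_tree.score(X_val, y_val)   # data the tree has not seen
print(f"seen: {fit_accuracy:.2%}  unseen: {val_accuracy:.2%}")
```

The gap between the two numbers is the symptom to look for. With the real data, the same split could be applied to `X_train` and `Y_train`.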

Now we can play with other parameters in the tree:

In [263]:
#class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
decision_tree = tree.DecisionTreeClassifier(max_depth = 5, min_samples_split = 5)
decision_tree = decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

83.95
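
The training accuracy dropped from 98.2 to 83.95, but that number alone does not say which of the two trees generalizes better. One way to compare parameter settings is cross-validation. The following is a sketch on synthetic stand-in data (generated with `make_classification`, not the Titanic features), so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
X, y = make_classification(n_samples=891, n_features=10, flip_y=0.2, random_state=0)

candidates = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "max_depth=5": DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on each held-out fold
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Each score is computed on a fold that was left out of training, so the averages are a fairer basis for choosing `max_depth` than the training accuracy.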

## Other models
Now implement several models. 
Logistic regression:


Support Vector Machines

In [264]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
Pclass        891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Sex_female    891 non-null uint8
Sex_male      891 non-null uint8
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 39.2 KB


In [266]:
# Convert the one-hot Sex_female column from uint8 to int8 in a single vectorized step
# (an element-by-element loop calling int8() fails, because int8 is a NumPy type: np.int8)
X_train['Sex_female'] = X_train['Sex_female'].astype(np.int8)


In [205]:
#type(int(X_train['Sex_female'])

## Feature engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. This can involve combining different columns on the dataset and modifying them, adding new ones based on the data, etc. 

In this case, we are going to focus on the title that we can find in the name of the passenger.

In [None]:
for dataset in combine:
    # Raw string, so that \. reaches the regex engine as an escaped dot
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])
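
To see what the regular expression picks up, here is a small sketch that applies the same pattern to four names from the `head()` output earlier in the notebook:

```python
import pandas as pd

# Names taken from the head() output earlier in the notebook
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# A space, then letters (captured), then a literal dot
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

Only the first match in each name is returned, which is the word directly followed by the first dot after the surname.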

We can replace many titles with a more common name or classify them as Rare.

In [None]:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

Now we will convert the categorical titles to ordinal.

In [None]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()
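
The replace, map and fill steps above can be traced on a handful of titles. In this sketch, "Xyz" is a made-up title used only to show what happens to a value that is missing from the mapping:

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Dr", "Mme", "Xyz"])  # "Xyz" is a made-up unknown title

# Group uncommon titles and normalize the French forms
titles = titles.replace(["Dr", "Rev"], "Rare")
titles = titles.replace({"Mlle": "Miss", "Mme": "Mrs"})

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
codes = titles.map(title_mapping)     # titles absent from the mapping become NaN
codes = codes.fillna(0).astype(int)   # 0 marks "no known title"
print(codes.tolist())
```

Keep in mind that the integer codes suggest an order (Mr < Miss < Mrs ...) that has no real meaning. One-hot encoding the title, as done for Sex and Embarked, would be a possible alternative.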

Now we can safely drop the Name feature from training and testing datasets. 

In [None]:
train_df = train_df.drop('Name', axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape

We can go ahead and repeat the modifications that we made before: handling NA values and getting dummies. (This is why feature engineering is usually done in the "identify patterns" step of the process, but we do it here in order to follow the steps in chronological order.)

In [None]:
# Fill missing ages before building the feature matrices, so the imputed values carry over
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())

X_train = train_df.drop(['PassengerId', 'Ticket', 'Cabin', 'Fare2', 'Child'], axis = 1)
X_test = test_df.drop(['PassengerId', 'Ticket', 'Cabin'], axis = 1)

X_train = X_train.drop("Survived", axis=1)
X_test = X_test.drop("Survived", axis=1)
Y_train = train_df["Survived"]

X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

# The test set has one missing Fare value
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())


X_train.shape, Y_train.shape, X_test.shape 

Now we can apply one of the models that we used before, for example logistic regression:

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

The training accuracy has increased from 80.36 to 81.59! Let's submit this new model:

In [None]:
submit_df = pd.DataFrame({"PassengerId": test_df["PassengerId"], "Survived": Y_pred})
submit_df.head()
submit_df.to_csv("5_featureengineering.csv", index = False)