Hi all. In this notebook I build on the work of **@alexisbcook** ([Titanic Tutorial](https://www.kaggle.com/alexisbcook/titanic-tutorial "Alexis Cook’s Titanic Tutorial")) and **@dansbecker** ([Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning "Dan Becker’s course")). My goal is to use as many columns of the training data as possible in my model by imputing missing values as reasonably as I can. I will also impute missing values in the test data.

At the end of the notebook there is a references section for some of the material I copied. The other sources are cited next to the code they apply to.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestClassifier

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


Get the data...

In [2]:
# For displaying all the rows of the data frame
pd.set_option('display.max_rows', None) # Thanks to @ https://dev.to/chanduthedev/how-to-display-all-rows-from-data-frame-using-pandas-dha

# get training data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# get test data
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


First let's search for some new patterns in both datasets. So far we have only used the "Pclass", "Sex", "SibSp" and "Parch" columns ([Titanic Tutorial](https://www.kaggle.com/alexisbcook/titanic-tutorial "Alexis Cook’s Titanic Tutorial")).

In [4]:
# Describe training data
train_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
# Describe test data
test_data.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [6]:
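# Check correlation in training data (numeric columns only; recent pandas versions
# raise an error on text columns such as 'Name' unless numeric_only=True is passed)
train_data.corr(method='pearson', numeric_only=True)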

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


Let's search for null values...

In [7]:
# Count null values per column in the training data
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [8]:
# Count null values per column in the test data
test_data.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

The **'Fare'** and **'Embarked'** columns look promising for new patterns, since they have very few null values (one 'Fare' value in the test data and two 'Embarked' values in the training data) and these can be filled in.

Moreover, the correlation of 'Fare' with 'Survived' is relatively **high** (0.257, second in absolute value after **'Pclass'** at -0.338). This is as expected: how much passengers paid for their tickets probably determined where they slept on that unholy night.
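To make the "second after Pclass" ranking explicit, the correlations with 'Survived' can be sorted by absolute value. The sketch below does not reload the CSV; it simply re-enters the 'Survived' row of the correlation table printed above, and the variable names are my own.

```python
import pandas as pd

# Correlations with 'Survived', copied from the train_data.corr() output above
survived_corr = pd.Series({
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Rank by strength of association, ignoring the sign
ranked = survived_corr.abs().sort_values(ascending=False)
print(ranked)
```

With the real data, the same ranking could probably be obtained by applying `.abs().sort_values(ascending=False)` to the 'Survived' column of the correlation frame.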

Let's see which passengers have these missing values.

In [9]:
# There is 1 null value for 'Fare' column in test data
# https://dzone.com/articles/pandas-find-rows-where-columnfield-is-null
test_data[test_data["Fare"].isnull()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S


In [10]:
# There are 2 null values for 'Embarked' column in train data
# https://dzone.com/articles/pandas-find-rows-where-columnfield-is-null
train_data[train_data["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


If we can replace the missing 'Fare' value of **"Storey, Mr. Thomas"** with a meaningful value, we can use the '*Fare*' column in our model (we do not want to drop any null value, since our purpose is to use as much of the data as possible).

Similarly, if we can replace the missing 'Embarked' values of **"Icard, Miss. Amelie"** and **"Stone, Mrs. George Nelson (Martha Evelyn)"** with meaningful values, we can also use the "*Embarked*" column in our model.

Let's start with **Mr. Storey**:

**Mr. Storey** embarked from Southampton and was traveling in third class. Let's see who else in the datasets shares these attributes.

In [11]:
# Passengers who embarked from Southampton and traveled in third class (training dataset)
# Both conditions are combined into one mask; chaining two [] filters triggers a reindexing warning
South_3C_train_data = train_data[(train_data["Embarked"] == "S") & (train_data["Pclass"] == 3)]
South_3C_train_data.describe()

  


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,353.0,353.0,353.0,290.0,353.0,353.0,353.0
mean,440.685552,0.189802,3.0,25.696552,0.705382,0.439093,14.644083
std,263.352933,0.392701,0.0,12.110906,1.529408,0.954863,13.276609
min,1.0,0.0,3.0,1.0,0.0,0.0,0.0
25%,201.0,0.0,3.0,19.0,0.0,0.0,7.8542
50%,432.0,0.0,3.0,25.0,0.0,0.0,8.05
75%,668.0,0.0,3.0,32.0,1.0,0.0,16.1
max,889.0,1.0,3.0,74.0,8.0,6.0,69.55


In [12]:
# Passengers who embarked from Southampton and traveled in third class (test dataset)
# Both conditions are combined into one mask; chaining two [] filters triggers a reindexing warning
South_3C_test_data = test_data[(test_data["Embarked"] == "S") & (test_data["Pclass"] == 3)]
South_3C_test_data.describe()

  


Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,142.0,142.0,109.0,142.0,142.0,141.0
mean,1095.56338,3.0,24.051193,0.549296,0.556338,13.91303
std,118.465332,0.0,11.000006,1.274838,1.391589,12.744667
min,893.0,3.0,0.17,0.0,0.0,3.1708
25%,995.5,3.0,19.0,0.0,0.0,7.8542
50%,1092.0,3.0,24.0,0.0,0.0,8.05
75%,1190.0,3.0,29.0,1.0,0.0,14.5
max,1308.0,3.0,60.5,8.0,9.0,69.55


According to the training dataset, approximately **39%** of the passengers (353 of 891) embarked from Southampton and were traveling in third class, with an average fare of **£ 14.644083**.

According to the test dataset, approximately **34%** of the passengers (142 of 418) embarked from Southampton and were traveling in third class, with an average fare of **£ 13.913030**.

Note that these percentages count third class passengers only. There may also be first and second class passengers who embarked from Southampton.

As a first option, we can use some number between these amounts (£ 13.913030 - £ 14.644083) for Mr Storey's fare.
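A group statistic like this can be computed directly with a combined boolean mask. The sketch below is only an illustration on five rows copied from `train_data.head()`, not the full dataset, so its numbers differ from the averages quoted above. It also prints the median: in both `describe()` outputs above the median fare of this group is £ 8.05, which is less affected by outliers such as the £ 69.55 maximum and might be worth considering as another candidate.

```python
import pandas as pd

# Five rows copied from train_data.head(); NOT the full dataset
sample = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle", "Allen"],
    "Pclass": [3, 1, 3, 1, 3],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Third class passengers who embarked from Southampton
mask = (sample["Embarked"] == "S") & (sample["Pclass"] == 3)
group_fares = sample.loc[mask, "Fare"]

group_mean = group_fares.mean()
group_median = group_fares.median()
print(group_mean, group_median)
```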

But before that let's check **"Icard, Miss. Amelie"** and **"Stone, Mrs. George Nelson (Martha Evelyn)"** again. Although their port of embarkation is not known, they were in the same cabin (B28) and paid the same fare (£ 80.0). Their ticket number is also the same (113572).

So is there any other person with the same ticket number as Mr Storey? If so, we can conclude that they paid the same fare.
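The same-ticket idea can also be checked across both datasets at once. This is a minimal sketch on a few rows copied from the outputs in this notebook; `combined`, `TicketHolders` and `FareFilled` are names I made up for the illustration.

```python
import pandas as pd

# Rows copied from outputs in this notebook; 'Fare' is missing for ticket 3701
train_sample = pd.DataFrame({
    "Name": ["Icard, Miss. Amelie", "Stone, Mrs. George Nelson (Martha Evelyn)"],
    "Ticket": ["113572", "113572"],
    "Fare": [80.0, 80.0],
})
test_sample = pd.DataFrame({
    "Name": ["Storey, Mr. Thomas", "Kelly, Mr. James"],
    "Ticket": ["3701", "330911"],
    "Fare": [float("nan"), 7.8292],
})

combined = pd.concat([train_sample, test_sample], ignore_index=True)

# How many passengers share each ticket number
combined["TicketHolders"] = combined.groupby("Ticket")["Name"].transform("count")

# First known fare within each ticket group (stays NaN if none is known)
ticket_fare = combined.groupby("Ticket")["Fare"].transform("first")
combined["FareFilled"] = combined["Fare"].fillna(ticket_fare)
print(combined[["Name", "Ticket", "TicketHolders", "FareFilled"]])
```

If a ticket group contains at least one known fare, the missing ones in that group are filled from it; a passenger who is alone on a ticket stays unfilled.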

In [13]:
# Check whether anyone else has the same ticket number as Mr Storey (training dataset)
train_data[train_data["Ticket"] == "3701"]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


In [14]:
# Check whether anyone else has the same ticket number as Mr Storey (test dataset)
test_data[test_data["Ticket"] == "3701"]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S


Unfortunately no one but Mr Storey himself has this ticket number. Maybe this is because he doesn't have a cabin. Let's dive into fares a little bit more...

In [15]:
# Who paid £ 7.8542 and embarked from Southampton (training dataset)
train_data[(train_data["Embarked"] == "S") & (train_data["Fare"] == 7.8542)]

  


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
91,92,0,3,"Andreasson, Mr. Paul Edvin",male,20.0,0,0,347466,7.8542,,S
175,176,0,3,"Klasen, Mr. Klas Albin",male,18.0,1,1,350404,7.8542,,S
192,193,1,3,"Andersen-Jensen, Miss. Carla Christine Nielsine",female,19.0,1,0,350046,7.8542,,S
281,282,0,3,"Olsson, Mr. Nils Johan Goransson",male,28.0,0,0,347464,7.8542,,S
315,316,1,3,"Nilsson, Miss. Helmina Josefina",female,26.0,0,0,347470,7.8542,,S
396,397,0,3,"Olsson, Miss. Elina",female,31.0,0,0,350407,7.8542,,S
569,570,1,3,"Jonsson, Mr. Carl",male,32.0,0,0,350417,7.8542,,S
623,624,0,3,"Hansen, Mr. Henry Damsgaard",male,21.0,0,0,350029,7.8542,,S
640,641,0,3,"Jensen, Mr. Hans Peder",male,20.0,0,0,350050,7.8542,,S


In [16]:
# Who paid £ 7.8542 and embarked from Southampton (test dataset)
test_data[(test_data["Embarked"] == "S") & (test_data["Fare"] == 7.8542)]

  


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
98,990,3,"Braf, Miss. Elin Ester Maria",female,20.0,0,0,347471,7.8542,,S
135,1027,3,"Carlsson, Mr. Carl Robert",male,24.0,0,0,350409,7.8542,,S
157,1049,3,"Lundin, Miss. Olga Elida",female,23.0,0,0,347469,7.8542,,S
195,1087,3,"Karlsson, Mr. Julius Konrad Eugen",male,33.0,0,0,347465,7.8542,,S
235,1127,3,"Vendel, Mr. Olof Edvin",male,20.0,0,0,350416,7.8542,,S
261,1153,3,"Nilsson, Mr. August Ferdinand",male,21.0,0,0,350410,7.8542,,S
299,1191,3,"Johansson, Mr. Nils",male,29.0,0,0,347467,7.8542,,S
318,1210,3,"Jonsson, Mr. Nils Hilding",male,27.0,0,0,350408,7.8542,,S


As can be seen, people who paid the same fare have similar ticket numbers. The tickets that cost £ 7.8542 are 6-digit numbers that start with 34 or 35. Probably these passengers were in the same area, next to each other, while traveling and as a result paid the same amount.
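The prefix pattern can be counted rather than read off by eye. A small sketch, using a subset of the ticket numbers from the two £ 7.8542 outputs above; `TicketPrefix` is a hypothetical helper column, not something in the dataset.

```python
import pandas as pd

# Tickets copied from the £ 7.8542 outputs above (a subset of the rows)
same_fare = pd.DataFrame({
    "Ticket": ["350406", "347466", "350404", "350046", "347464", "347471", "350409"],
    "Fare": [7.8542] * 7,
})

# Hypothetical helper column: the first two characters of the ticket number
same_fare["TicketPrefix"] = same_fare["Ticket"].str[:2]
prefix_counts = same_fare["TicketPrefix"].value_counts()
print(prefix_counts)
```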

Can we find some ticket number similar to Mr Storey's?

In [17]:
# Who else has a 4-character ticket, embarked from Southampton and traveled in third class? (training dataset)
train_data[(train_data["Ticket"].str.len() == 4) & (train_data["Embarked"] == "S") & (train_data["Pclass"] == 3)]

  


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
40,41,0,3,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",female,40.0,1,0,7546,9.475,,S
74,75,1,3,"Bing, Mr. Lee",male,32.0,0,0,1601,56.4958,,S
103,104,0,3,"Johansson, Mr. Gustaf Joel",male,33.0,0,0,7540,8.6542,,S
113,114,0,3,"Jussila, Miss. Katriina",female,20.0,1,0,4136,9.825,,S
138,139,0,3,"Osen, Mr. Olaf Elon",male,16.0,0,0,7534,9.2167,,S
169,170,0,3,"Ling, Mr. Lee",male,28.0,0,0,1601,56.4958,,S
176,177,0,3,"Lefebre, Master. Henry Forbes",male,,3,1,4133,25.4667,,S
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S
197,198,0,3,"Olsen, Mr. Karl Siegwart Andreas",male,42.0,0,1,4579,8.4042,,S
229,230,0,3,"Lefebre, Miss. Mathilde",female,,3,1,4133,25.4667,,S


In [18]:
# Who else has a 4-character ticket, embarked from Southampton and traveled in third class? (test dataset)
test_data[(test_data["Ticket"].str.len() == 4) & (test_data["Embarked"] == "S") & (test_data["Pclass"] == 3)]

  


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
39,931,3,"Hee, Mr. Ling",male,,0,0,1601,56.4958,,S
108,1000,3,"Willer, Mr. Aaron (Abi Weller"")""",male,,0,0,3410,8.7125,,S
132,1024,3,"Lefebre, Mrs. Frank (Frances)",female,,0,4,4133,25.4667,,S
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S
169,1061,3,"Hellstrom, Miss. Hilda Maria",female,22.0,0,0,7548,8.9625,,S
243,1135,3,"Hyman, Mr. Abraham",male,,0,0,3470,7.8875,,S
253,1145,3,"Salander, Mr. Karl Johan",male,24.0,0,0,7266,9.325,,S
357,1249,3,"Lockyer, Mr. Edward",male,,0,0,1222,7.8792,,S


4-digit tickets do not reveal much. The only resemblance that can be found is among these tickets:
* 3470 (£ 7.8875),
* 3410 (£ 8.7125),
* 3474 (£ 7.8875),
* 3460 (£ 7.0458)

Obviously there is not much similarity other than that they all start with 3. But maybe we can use £ 7.8875 for Mr Storey. Let's try another option.
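One caveat about the length check: it matches any 4-character ticket, which is why the non-numeric ticket 'LINE' shows up in the training output above. If only numeric tickets are wanted, a digits-only match is one option. A sketch on ticket values copied from this notebook:

```python
import pandas as pd

# Ticket values copied from the outputs above
tickets = pd.Series(["7546", "1601", "LINE", "4133", "3701", "370129"])

four_chars = tickets.str.len() == 4            # what the length check matches
four_digits = tickets.str.fullmatch(r"\d{4}")  # exactly four digits, nothing else

print(tickets[four_chars].tolist())
print(tickets[four_digits].tolist())
```

Here the first list still contains 'LINE' while the second does not.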

In [19]:
# Are there any tickets that contain the string 3701? Maybe some digits of Mr Storey's ticket are missing.
train_data[train_data["Ticket"].str.contains("3701", regex=False)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
254,255,0,3,"Rosblom, Mrs. Viktor (Helena Wilhelmina)",female,41.0,0,2,370129,20.2125,,S
424,425,0,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,20.2125,,S


In [20]:
# Are there any tickets that contain the string 3701? Maybe some digits of Mr Storey's ticket are missing.
test_data[test_data["Ticket"].str.contains("3701", regex=False)]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S
284,1176,3,"Rosblom, Miss. Salli Helena",female,2.0,1,1,370129,20.2125,,S


Four people in the datasets have the string 3701 in their ticket numbers. One of them is Mr Storey, whose fare we do not know. The others are the Rosblom family. Their ticket number is '370129' and each of them is recorded with a fare of £ 20.2125. Maybe Mr Storey's ticket was something like '370128' and was torn somehow, so that we lost the last 2 digits and the fare amount together. This is only a guess, but it would also explain why Mr Storey's fare is missing.

Which one shall we use for Mr Storey's fare?
* £ 14.644083 (the average training data fare)?
* £ 13.913030 (the average test data fare)?
* £ 7.8875 (4-digit tickets)?
* £ 20.2125 (tickets that contain '3701')?

Let's use £ 20.2125 for Mr Storey's fare, since the torn ticket theory also explains why his fare is missing.
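The value can be assigned either by index label or by PassengerId. Selecting by PassengerId keeps working even if the frame is later re-sorted or re-indexed. The sketch below shows that variant on two rows copied from the test outputs above; `is_storey` is just an illustrative name.

```python
import pandas as pd

# Two rows copied from the test_data outputs above
test_sample = pd.DataFrame({
    "PassengerId": [1044, 1176],
    "Name": ["Storey, Mr. Thomas", "Rosblom, Miss. Salli Helena"],
    "Ticket": ["3701", "370129"],
    "Fare": [float("nan"), 20.2125],
})

# Select the row by PassengerId instead of by index label
is_storey = test_sample["PassengerId"] == 1044
test_sample.loc[is_storey, "Fare"] = 20.2125

remaining_nulls = test_sample["Fare"].isnull().sum()
print(remaining_nulls)
```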

In [21]:
# Insert £ 20.2125 for Mr Storey's fare
test_data.loc[152,'Fare'] = 20.2125

In [22]:
# Check Mr Storey
test_data.loc[152]

PassengerId                  1044
Pclass                          3
Name           Storey, Mr. Thomas
Sex                          male
Age                          60.5
SibSp                           0
Parch                           0
Ticket                       3701
Fare                      20.2125
Cabin                         NaN
Embarked                        S
Name: 152, dtype: object

In [23]:
# Check again
test_data.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64

OK. Now we can use the **'Fare'** column in our model. Next is the **'Embarked'** column. Let's recall the missing values...

In [24]:
# There are 2 null values for 'Embarked' column in train data
# https://dzone.com/articles/pandas-find-rows-where-columnfield-is-null
train_data[train_data["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


**Miss Icard's** and **Mrs Stone's** port of embarkation is missing. Let's start with **'Fare'**. Did anybody else pay **£ 80.0** for their ticket?

In [25]:
# Who paid £ 80 in training data 
train_data[train_data["Fare"] == 80]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [26]:
# Who paid £ 80 in test data
test_data[test_data["Fare"] == 80]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


No one else paid that amount. Let's check who else paid more than **£ 70.0**.

In [27]:
# People who paid more than £ 70 in training data
train_data[train_data["Fare"] > 70]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
34,35,0,1,"Meyer, Mr. Edgar Joseph",male,28.0,1,0,PC 17604,82.1708,,C
52,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
62,63,0,1,"Harris, Mr. Henry Birkhardt",male,45.0,1,0,36973,83.475,C83,S
72,73,0,2,"Hood, Mr. Ambrose Jr",male,21.0,0,0,S.O.C. 14879,73.5,,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
102,103,0,1,"White, Mr. Richard Frasar",male,21.0,0,1,35281,77.2875,D26,S


In [28]:
# People who paid more than £ 70 in test data
test_data[test_data["Fare"] > 70]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,904,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
24,916,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
48,940,1,"Bucknell, Mrs. William Robert (Emma Eliza Ward)",female,60.0,0,0,11813,76.2917,D15,C
53,945,1,"Fortune, Miss. Ethel Flora",female,28.0,3,2,19950,263.0,C23 C25 C27,S
59,951,1,"Chaudanson, Miss. Victorine",female,36.0,0,0,PC 17608,262.375,B61,C
64,956,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
69,961,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60.0,1,4,19950,263.0,C23 C25 C27,S
74,966,1,"Geiger, Miss. Amalie",female,35.0,0,0,113503,211.5,C130,C
75,967,1,"Keeping, Mr. Edwin",male,32.5,0,0,113503,211.5,C132,C
81,973,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S


**'Fare'** does not reveal much. Let's try a different approach. **Miss Icard's** and **Mrs Stone's** cabin number is **B28**. Let's find the cabins with **'B'**...

In [29]:
# People whose Cabin value contains 'B' (training data); missing cabins count as no match
train_data[train_data["Cabin"].str.contains("B", regex=False, na=False)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
118,119,0,1,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C
139,140,0,1,"Giglio, Mr. Victor",male,24.0,0,0,PC 17593,79.2,B86,C
170,171,0,1,"Van der hoef, Mr. Wyckoff",male,61.0,0,0,111240,33.5,B19,S
194,195,1,1,"Brown, Mrs. James Joseph (Margaret Tobin)",female,44.0,0,0,PC 17610,27.7208,B4,C
195,196,1,1,"Lurette, Miss. Elise",female,58.0,0,0,PC 17569,146.5208,B80,C
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.5,B77,S
263,264,0,1,"Harrison, Mr. William",male,40.0,0,0,112059,0.0,B94,S


In [30]:
# People whose Cabin value contains 'B' (test data); missing cabins count as no match
test_data[test_data["Cabin"].str.contains("B", regex=False, na=False)]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,904,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
24,916,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
26,918,1,"Ostby, Miss. Helene Ragnhild",female,22.0,0,1,113509,61.9792,B36,C
59,951,1,"Chaudanson, Miss. Victorine",female,36.0,0,0,PC 17608,262.375,B61,C
64,956,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
92,984,1,"Davidson, Mrs. Thornton (Orian Hays)",female,27.0,1,2,F.C. 12750,52.0,B71,S
142,1034,1,"Ryerson, Mr. Arthur Larned",male,61.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
166,1058,1,"Brandeis, Mr. Emil",male,48.0,0,0,PC 17591,50.4958,B10,C
184,1076,1,"Douglas, Mrs. Frederick Charles (Mary Helene B...",female,27.0,1,1,PC 17558,247.5208,B58 B60,C
215,1107,1,"Head, Mr. Christopher",male,42.0,0,0,113038,42.5,B11,S


The passengers in **B** cabins shown here embarked from either **'Cherbourg'** or **'Southampton'**. What about **'B2'**? Remember that their cabin is **B28**.

In [31]:
# People whose Cabin value contains 'B2' (training data); missing cabins count as no match
train_data[train_data["Cabin"].str.contains("B2", regex=False, na=False)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
540,541,1,1,"Crosby, Miss. Harriet R",female,36.0,0,2,WE/P 5735,71.0,B22,S
690,691,1,1,"Dick, Mr. Albert Adrian",male,31.0,1,0,17474,57.0,B20,S
745,746,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0,B22,S
781,782,1,1,"Dick, Mrs. Albert Adrian (Vera Gillespie)",female,17.0,1,0,17474,57.0,B20,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [32]:
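# People whose Cabin value contains 'B2' (test data); missing cabins count as no match
test_data[test_data["Cabin"].str.contains("B2", regex=False, na=False)]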

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
305,1197,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabe...",female,64.0,1,1,112901,26.55,B26,S
390,1282,1,"Payne, Mr. Vivian Ponsonby",male,23.0,0,0,12749,93.5,B24,S


All the other passengers in cabins starting with **'B2'** embarked from **'Southampton'**. So let's use **'S'** for **Miss Icard** and **Mrs Stone**.
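The same conclusion can be read from a count of embarkation ports among the 'B2' cabins. This is a rough sketch on cabin and port pairs copied from the outputs above (a subset, plus one passenger without a cabin). Note that it uses a strict starts-with test, which is slightly narrower than the substring check in the cells above.

```python
import pandas as pd

# Cabin / Embarked pairs copied from the outputs above (a subset of the rows)
cabins = pd.DataFrame({
    "Cabin": ["B28", "B22", "B20", "B22", "B20", "B28", "B26", "B24", "B78", "B30", None],
    "Embarked": [None, "S", "S", "S", "S", None, "S", "S", "C", "C", "S"],
})

# Cabins that start with 'B2'; rows without a cabin are treated as no match
starts_b2 = cabins["Cabin"].str.startswith("B2", na=False)

# Ports among those cabins, keeping the unknown ones visible
port_counts = cabins.loc[starts_b2, "Embarked"].value_counts(dropna=False)
known_ports = sorted(cabins.loc[starts_b2, "Embarked"].dropna().unique())
print(port_counts)
print(known_ports)
```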

In [33]:
# Miss Icard
train_data.loc[61, 'Embarked'] = 'S'
# Mrs Stone
train_data.loc[829, 'Embarked'] = 'S'

In [34]:
# Check Miss Icard
train_data.loc[61]

PassengerId                     62
Survived                         1
Pclass                           1
Name           Icard, Miss. Amelie
Sex                         female
Age                             38
SibSp                            0
Parch                            0
Ticket                      113572
Fare                            80
Cabin                          B28
Embarked                         S
Name: 61, dtype: object

In [35]:
# Check Mrs Stone
train_data.loc[829]

PassengerId                                          830
Survived                                               1
Pclass                                                 1
Name           Stone, Mrs. George Nelson (Martha Evelyn)
Sex                                               female
Age                                                   62
SibSp                                                  0
Parch                                                  0
Ticket                                            113572
Fare                                                  80
Cabin                                                B28
Embarked                                               S
Name: 829, dtype: object

In [36]:
# Check again
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

OK. Now we can use the **'Embarked'** column in our model.

In [37]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
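One thing worth guarding against in the model cell: `pd.get_dummies` is called on the training and test features separately, so the two frames only line up if both contain the same categories. That appears to be the case here, but a defensive step is to reindex the test columns to the training columns. A minimal sketch on a few rows copied from the two `head()` outputs, where the test rows happen to lack 'C' and the training rows lack 'Q':

```python
import pandas as pd

# Rows copied from train_data.head() and test_data.head(); tiny on purpose
train_features = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})
test_features = pd.DataFrame({
    "Pclass": [3, 3],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

X = pd.get_dummies(train_features)
X_test = pd.get_dummies(test_features)

# Give X_test exactly the training columns, in the same order.
# Categories missing from the test rows become all-zero columns;
# categories seen only in the test rows are dropped.
X_test = X_test.reindex(columns=X.columns, fill_value=0)
print(list(X_test.columns))
```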


**References**

Thanks to **@alexisbcook** and **@dansbecker**

* https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#links
* https://stackoverflow.com/questions/13842088/set-value-for-particular-cell-in-pandas-dataframe-using-index (thanks to @Yash)