# Introduction: Machine Learning Project Part 1

In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. For this project I will use the well-known Titanic dataset, which can be found on Kaggle. It is a beginner-level machine learning task, well suited to those who are just starting their career as a machine learning engineer or data scientist.

Use the provided Titanic data to develop a model that can predict whether a passenger survived, and then interpret the results to find the variables that are most predictive of survival.

This is a supervised, classification machine learning task: given a set of data with targets included (in this case the `Survived` column, `{1: survived, 0: died}`), we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.

* **Supervised problem**: we are given both the features and the target
* **Classification problem**: the target is a discrete variable, in this case 0 or 1.

During training, we want the model to learn the relationship between the features and the survival so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!

## The Titanic Dataset
In 1912, the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this introductory project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether someone survived or did not survive. 

# 1. Data Cleaning and Formatting
## Load in the Data and Examine
We will be loading our data into a pandas dataframe, one of the most useful data structures for data science. You can think of it as an Excel spreadsheet within Python that we can easily manipulate, clean, and visualize. Pandas has many methods to help make the data science/machine learning pipeline as smooth as possible.
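The `utils.load_titanic_data` helper called in the cells below is not shown in this notebook. As a rough idea of what such a loader might do, here is a minimal sketch that reads a training and a testing CSV into two dataframes. The function name `load_csv_pair` and the tiny in-memory CSV contents are assumptions made for illustration, not the actual helper or data.

```python
import io

import pandas as pd


def load_csv_pair(train_source, test_source):
    """Hypothetical stand-in for a loader: read two CSV sources into dataframes."""
    train = pd.read_csv(train_source)
    test = pd.read_csv(test_source)
    return train, test


# In-memory CSV text keeps the sketch self-contained; a real loader would use file paths.
train_csv = io.StringIO("PassengerId,Survived,Pclass,Age\n1,0,3,22.0\n2,1,1,38.0\n")
test_csv = io.StringIO("PassengerId,Pclass,Age\n892,3,34.5\n")

demo_train, demo_test = load_csv_pair(train_csv, test_csv)
print(demo_train.shape, demo_test.shape)
```

Note that the testing frame has no `Survived` column, which mirrors the structure of the Titanic train/test split.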

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
sns.set(font_scale = 2)

import warnings
warnings.filterwarnings('ignore')

import utils

In [2]:
train_data, test_data = utils.load_titanic_data()

In [3]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


From raw data we can see the various features present for each passenger:

| **Feature** | **Description** | **Key** |
|-------------|-----------------|---------|
| PassengerId | A unique identifier for each passenger | - |
| Survived | The target: 0 means the passenger did not survive, 1 means the passenger survived | 0 = No, 1 = Yes |
| Pclass | Passenger's ticket class | 1 = 1st (upper class), 2 = 2nd (middle class), 3 = 3rd (lower class) |
| Name | Name of the passenger | - |
| Sex | Sex of the passenger | male/female |
| Age | Age of the passenger | - |
| SibSp | Number of siblings and spouses of the passenger aboard the Titanic | - |
| Parch | Number of children and parents of the passenger aboard the Titanic | - |
| Ticket | Ticket number | - |
| Fare | Price paid (in pounds) | - |
| Cabin | Passenger's cabin number | - |
| Embarked | Port where the passenger boarded the Titanic | C = Cherbourg, Q = Queenstown, S = Southampton |


**The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.**

### Data types and missing values

The `DataFrame.info()` method is a quick way to assess the data: it displays the data type of each column and the number of non-missing values.

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The training data frame has 891 rows (one per passenger) and 12 columns: 11 features plus the `Survived` target. Most columns are numeric (`PassengerId`, `Survived`, `Pclass`, `Age`, `SibSp`, `Parch`, `Fare`), stored as `int64` or `float64`, and a few contain strings (`Name`, `Sex`, `Ticket`, `Cabin`, `Embarked`), marked as `object`.
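The numeric/string split described above can also be obtained programmatically. The sketch below uses `DataFrame.select_dtypes` on a small made-up frame (the values are illustrative, not the full dataset) to list numeric and non-numeric columns.

```python
import pandas as pd

# Toy frame with one integer, one float, and one string column
demo = pd.DataFrame({
    "Pclass": [3, 1],
    "Age": [22.0, 38.0],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
})

numeric_columns = demo.select_dtypes(include="number").columns.tolist()
other_columns = demo.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_columns)
print("Non-numeric:", other_columns)
```

Knowing which columns are non-numeric is useful later on, because most models need them encoded as numbers.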


In [6]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


### 1.1 Missing Values
Now that we know the size of the data and its data types, we can start the analysis by looking at the percentage of missing values in each column. Missing values are fine during Exploratory Data Analysis, but they will have to be filled in before applying machine learning methods.
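The `utils.missing_values_table` helper used in the next cells is not included in this notebook. A minimal sketch of how such a table could be computed is given here, assuming it simply counts nulls per column and expresses them as a percentage of the rows. The function name `missing_summary` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    missing = df.isnull().sum()
    percent = (100 * missing / len(df)).round(1)
    table = pd.DataFrame({"Missing Values": missing, "% of Total Values": percent})
    # Keep only columns that actually have missing values, worst first
    table = table[table["Missing Values"] > 0]
    return table.sort_values("% of Total Values", ascending=False)


demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

summary = missing_summary(demo)
print(summary)
```

Columns without missing values (here `Sex`) are left out, which matches the shape of the tables shown in the outputs.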

In [7]:
utils.missing_values_table(train_data)

Your selected dataframe has 12 columns.
There are 3 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Cabin,687,77.1
Age,177,19.9
Embarked,2,0.2


In [8]:
utils.missing_values_table(test_data)

Your selected dataframe has 11 columns.
There are 3 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Cabin,327,78.2
Age,86,20.6
Fare,1,0.2


We should be careful when dropping columns so that we do not discard information, but a column with a high percentage of missing values will probably not be of much use.

For this project, we will remove any columns with more than 50% missing values (only `Cabin`). In general, be careful about dropping information: even if it is not there for all observations, it may still be useful for predicting the target value.
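The 50% rule can be expressed directly in pandas. The following is a small sketch on made-up data, not the notebook's `utils` code: `isnull().mean()` gives the fraction of missing values per column, and columns above the threshold are dropped.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "female"],
})

threshold = 0.5  # drop columns with more than 50% missing values
missing_fraction = demo.isnull().mean()
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = demo.drop(columns=to_drop)
print("Dropped:", to_drop)
print("Remaining:", reduced.columns.tolist())
```

The threshold is a judgment call; a lower or higher value may suit other datasets better.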


In [9]:
# Get the columns with > 50% missing
missing_df = utils.missing_values_table(train_data)

missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

Your selected dataframe has 12 columns.
There are 3 columns that have missing values.
We will remove 1 columns.


In [10]:
# Get the columns with > 50% missing
missing_df = utils.missing_values_table(test_data)

missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

Your selected dataframe has 11 columns.
There are 3 columns that have missing values.
We will remove 1 columns.


In [11]:
# Drop the columns
[utils.fill_null_values(df_ = df, column_ = list(missing_columns), method = 'REMOVE') for df in [train_data, test_data]]

[None, None]

The rest of the missing values will have to be imputed (filled-in) using an appropriate strategy before doing machine learning.
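The `utils.fill_null_values` helper called above is not shown in this notebook. Its `[None, None]` output suggests that it modifies each dataframe in place, so a minimal sketch under that assumption could look like the code below. The function name `fill_nulls_in_place` and its parameters are hypothetical, and the real helper may behave differently.

```python
import pandas as pd


def fill_nulls_in_place(df, columns, method, value=None):
    """Hypothetical helper: drop columns or fill their nulls, modifying df in place."""
    if method == "REMOVE":
        df.drop(columns=columns, inplace=True)
    elif method == "FILL":
        for col in columns:
            df[col] = df[col].fillna(value)
    else:
        raise ValueError(f"Unknown method: {method}")


demo = pd.DataFrame({"Cabin": [None, "C85"], "Embarked": ["S", None]})

result = fill_nulls_in_place(demo, ["Cabin"], method="REMOVE")
fill_nulls_in_place(demo, ["Embarked"], method="FILL", value="C")
print(demo)
```

Because the function returns nothing, calling it inside a list comprehension yields a list of `None` values, one per dataframe.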

#### 1.1.1 Age

The simplest approach is to fill all the missing ages with the mean or median. The mean age is around 30 (see `train_data.describe()`), and the median is similar (28 in the combined data below). However, we can impute more accurately by using the mean age of a group of similar passengers. For example, we can calculate the mean age for each Pclass+Title combination and assign that value to the missing ages in that combination.
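To make the difference between the two strategies concrete, here is a small sketch on made-up ages. It uses `groupby(...).transform("mean")`, which is one possible way to do group-wise imputation and not necessarily the approach taken later in this notebook.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Title": ["Master", "Master", "Master", "Mr", "Mr", "Mr"],
    "Age": [4.0, 6.0, np.nan, 40.0, 44.0, np.nan],
})

# Strategy 1: a single global mean for every missing age
global_filled = demo["Age"].fillna(demo["Age"].mean())

# Strategy 2: the mean of each Pclass+Title group (NaN values are skipped)
group_mean = demo.groupby(["Pclass", "Title"])["Age"].transform("mean")
group_filled = demo["Age"].fillna(group_mean)

print(pd.DataFrame({"global": global_filled, "by_group": group_filled}))
```

In this toy data the global mean assigns 23.5 to both missing passengers, while the group means assign 5 to the boy and 42 to the adult man.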

One question you may ask: how do we determine the title? The answer is to extract it from the `Name` column with the following line of code:
```python
df['Title'] = df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
```

* **Master** - boys and young men - helps identify boy children
* **Mrs** - Married women - indicates a female passenger who is married
* **Mr** - Adult men - male passenger who is an adult
* **Miss** - Unmarried women
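As a quick check of the extraction pattern, the sketch below applies it to a few passenger names taken from the tables in this notebook. The regular expression captures the first run of letters that is immediately followed by a period.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Moubarek, Master. Gerios",
    "Oliva y Ocana, Dona. Fermina",
])

# Raw string avoids Python treating "\." as an invalid escape sequence
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

Rare titles such as `Dona` are extracted just like the common ones, so they need to be handled explicitly in any later grouping.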

The overall mean age is around 30, but some of the passengers with missing ages are young boys (title `Master`), and for them it would be wrong to fill in the mean or median. Let's first analyse this case!

In [12]:
combined = pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

combined.describe()

Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
count,1046.0,1308.0,1309.0,1309.0,1309.0,1309.0,891.0
mean,29.881135,33.295479,0.385027,655.0,2.294882,0.498854,0.383838
std,14.4135,51.758668,0.86556,378.020061,0.837836,1.041658,0.486592
min,0.1667,0.0,0.0,1.0,1.0,0.0,0.0
25%,21.0,7.8958,0.0,328.0,2.0,0.0,0.0
50%,28.0,14.4542,0.0,655.0,3.0,0.0,0.0
75%,39.0,31.275,0.0,982.0,3.0,1.0,1.0
max,80.0,512.3292,9.0,1309.0,3.0,8.0,1.0


In [13]:
combined[(combined['Age'].isnull()) & (combined.Name.str.contains('Master'))]

Unnamed: 0,Age,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
65,,C,15.2458,"Moubarek, Master. Gerios",1,66,3,male,1,1.0,2661
159,,S,69.55,"Sage, Master. Thomas Henry",2,160,3,male,8,0.0,CA. 2343
176,,S,25.4667,"Lefebre, Master. Henry Forbes",1,177,3,male,3,0.0,4133
709,,C,15.2458,"Moubarek, Master. Halim Gonios ('William George')",1,710,3,male,1,1.0,2661
1135,,S,23.45,"Johnston, Master. William Arthur 'Willie'",2,1136,3,male,1,,W./C. 6607
1230,,C,7.2292,"Betros, Master. Seman",0,1231,3,male,0,,2622
1235,,S,14.5,"van Billiard, Master. James William",1,1236,3,male,1,,A/5. 851
1308,,C,22.3583,"Peter, Master. Michael J",1,1309,3,male,1,,2668


As we can see, across both datasets there are 8 boys with a missing age, and all of them are male passengers in Pclass 3. If we had taken the simple approach of applying `mean()` to every missing entry, we would have entered their ages as 30. But the `Master` title tells us they are children, so their ages should be under about 14 and definitely not 30. What value should we fill in? One good option is to take the average age of male children and use it as the default for the missing ages of male children in that class.


In [14]:
print(train_data[train_data.Name.str.contains('Master')]['Age'].mean())

4.574075


So 5 is a good average age for these 8 boys, a vast difference from the 30 we would otherwise have used.

Before we update the values for these 8 kids, let us do a quick sanity check. These kids should ideally have a non-zero `Parch`, i.e. such small children should not be travelling alone but with at least one parent!

In [15]:
display((train_data[(train_data.Age.isnull()) & (train_data.Name.str.contains('Master')) & (train_data.Parch==0)]))

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked


In [16]:
display((test_data[(test_data.Age.isnull()) & (test_data.Name.str.contains('Master')) & (test_data.Parch==0)]))

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
339,1231,3,"Betros, Master. Seman",male,,0,0,2622,7.2292,C


In [17]:
# OK, just one boy travelling without his parents. We will assume he is an older boy and consequently cannot be 5, so we assign him an age of 14
test_data.loc[test_data.Name == 'Betros, Master. Seman', 'Age'] = 14


#### 1.1.2 Feature Engineering

In [18]:
train_data['Title'], test_data['Title'], combined['Title'] = [df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False) for df in [train_data, test_data, combined]]

train_data.groupby(['Title', 'Pclass'])['Age'].agg(['mean', 'count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,count
Title,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1
Capt,1,70.0,1
Col,1,58.0,2
Countess,1,33.0,1
Don,1,40.0,1
Dr,1,43.75,4
Dr,2,38.5,2
Jonkheer,1,38.0,1
Lady,1,48.0,1
Major,1,48.5,2
Master,1,5.305567,3



As we can see, the mean age differs across Pclass. For example, the average age of 'Mrs' in Pclass 1 is about 40, versus about 33 in the other classes.

There are also too many titles in the dataset. Let's consolidate them into a few important ones; otherwise they will just add noise. We first create a Python dictionary to map the titles (the value to the right of each ':' is the final title we will go with) and then apply it in one line of code.

* **Officer**: Includes military and civil officers.
* **Royalty**: Includes noble titles; the mapping below also places `Dr` and `Rev` in this group.
* **Mrs**: Married women.
* **Miss**: Young unmarried women.
* **Mr**: Adult men.
* **Master**: Young boys.
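One detail of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes `NaN`. The sketch below shows this on a shortened, illustrative mapping (not the full dictionary used in this notebook).

```python
import pandas as pd

raw_titles = pd.Series(["Capt", "Mlle", "Dona"])

# Shortened mapping for illustration only; "Dona" is deliberately left out
short_map = {"Capt": "Officer", "Mlle": "Miss"}

mapped = raw_titles.map(short_map)
unmapped = raw_titles[mapped.isnull()].tolist()

print(mapped.tolist())
print("Titles missing from the mapping:", unmapped)
```

Checking for null titles after mapping is therefore a cheap way to spot titles that the dictionary does not cover.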

In [19]:
TitleDict = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir": "Royalty",
    "Dr": "Royalty",
    "Rev": "Royalty",
    "Countess": "Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royalty"
}
train_data['Title'], test_data['Title'] = [df.Title.map(TitleDict) for df in [train_data, test_data]]

##Let us now reprint the groups
train_data.groupby(['Title', 'Pclass'])['Age'].agg(['mean', 'count'])


Unnamed: 0_level_0,Unnamed: 1_level_0,mean,count
Title,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1
Master,1,5.305567,3
Master,2,2.259256,9
Master,3,5.350696,24
Miss,1,29.744681,47
Miss,2,22.390625,32
Miss,3,16.123188,69
Mr,1,41.58046,87
Mr,2,32.768293,82
Mr,3,28.724891,229
Mrs,1,40.4,35


In [20]:
display(train_data[train_data.Title.isnull()])
display(test_data[test_data.Title.isnull()])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Title


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Title
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C,


In [21]:
test_data.at[414, 'Title'] = 'Royalty'

Much better!!

Just one question remains: what about female children? If the title is `Miss` and `Parch` is non-zero (the passenger is travelling with a parent), the passenger is probably a child, as the averages below suggest.

In [22]:
print ("Avg age of 'Miss' Title", round(train_data[train_data.Title=="Miss"]['Age'].mean()))

print ("Avg age of 'Miss' Title travelling without Parents", round(train_data[(train_data.Title=="Miss") & (train_data.Parch==0)]['Age'].mean()))

print ("Avg age of 'Miss' Title travelling with Parents", round(train_data[(train_data.Title=="Miss") & (train_data.Parch!=0)]['Age'].mean()), '\n')

Avg age of 'Miss' Title 22
Avg age of 'Miss' Title travelling without Parents 28
Avg age of 'Miss' Title travelling with Parents 12 



#### 1.1.3 Fare

In [23]:
test_data[test_data.Fare.isnull()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Title
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,S,Mr


The `Fare` feature represents the total fare for everyone travelling on the same ticket, not for a single person, so let us compute `FarePerPerson`.
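The idea can be sketched on a toy frame: count how many passengers share each ticket with `value_counts`, map that count back onto every row, and divide the fare by it. The ticket numbers and fares here are illustrative values only.

```python
import pandas as pd

demo = pd.DataFrame({
    "Ticket": ["113572", "113572", "A/5 21171"],
    "Fare": [80.0, 80.0, 7.25],
})

# Number of passengers travelling on each ticket
ticket_counts = demo["Ticket"].value_counts()

demo["PeopleInTicket"] = demo["Ticket"].map(ticket_counts)
demo["FarePerPerson"] = demo["Fare"] / demo["PeopleInTicket"]
print(demo)
```

Counting tickets over the combined train and test data, as the notebook does, matters because members of one group can be split across the two files.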

In [24]:
display(combined[combined.Fare.isnull()])

for df in [train_data, test_data, combined]:
    df['PeopleInTicket']=df['Ticket'].map(combined['Ticket'].value_counts())
    df['FarePerPerson']=df['Fare']/df['PeopleInTicket']


##Take the mean fare for port S and Pclass 3 as the fill value. Remember to consider FarePerPerson and not Fare
print('Mean fare for this category: ', train_data[(train_data.Embarked=='S') & (train_data.Pclass==3)]['FarePerPerson'].mean())

Unnamed: 0,Age,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Title
1043,60.5,S,,"Storey, Mr. Thomas",0,1044,3,male,0,,3701,Mr


Mean fare for this category:  7.382647773383368


In [25]:
# The mean of the raw Fare is almost double the per-person value
print(train_data[(train_data.Embarked=='S') & (train_data.Pclass==3)]['Fare'].mean())

14.644083002832863


In [26]:


test_data.loc[test_data.Fare.isnull(), ['Fare','FarePerPerson']] = round(train_data[(train_data.Embarked=='S') & (train_data.Pclass==3) & (train_data.PeopleInTicket==1)]['Fare'].mean(),1)


In [27]:
test_data[test_data.Fare.isnull()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Title,PeopleInTicket,FarePerPerson


In [28]:
display(combined[combined.Embarked.isnull()])

Unnamed: 0,Age,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Title,PeopleInTicket,FarePerPerson
61,38.0,,80.0,"Icard, Miss. Amelie",0,62,1,female,0,1.0,113572,Miss,2,40.0
829,62.0,,80.0,"Stone, Mrs. George Nelson (Martha Evelyn)",0,830,1,female,0,1.0,113572,Mrs,2,40.0


In [29]:
##Fare is 40 per person (80 for 2 people) for 2 adults in Pclass 1. Where could they have embarked from?
# Compare the mean fares of all Pclass 1 passengers by port
train_data[(train_data.Pclass==1)].groupby('Embarked').agg({'FarePerPerson': 'mean', 'Fare': 'mean', 'PassengerId': 'count'})


Unnamed: 0_level_0,FarePerPerson,Fare,PassengerId
Embarked,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
C,39.179223,104.718529,85
Q,30.0,90.0,2
S,30.211849,70.364862,127


In [30]:


##Only 1 family (2 passengers) got on at Q, and their fare is 30 per person, so Q is definitely not the case
##From the data above, the FarePerPerson of 40 for the 2 missing cases most plausibly maps to port C

##Let us check the same data for groups of 2 adults
train_data[(train_data.Pclass==1) & (train_data.PeopleInTicket==2) & (train_data.Age>18)].groupby('Embarked').agg({'FarePerPerson': 'mean', 'Fare': 'mean', 'PassengerId': 'count'})



Unnamed: 0_level_0,FarePerPerson,Fare,PassengerId
Embarked,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
C,35.073648,70.147296,23
S,31.702411,63.404822,32


In [31]:

##Port C still comes out as the winner in all cases, so we will go ahead with it
#train_data.Embarked.fillna('C', inplace=True)

[utils.fill_null_values(df_ = df, column_ = ['Embarked'], method = 'FILL', value_ = 'C') for df in [train_data, test_data]]


[None, None]

In [32]:


print(train_data.groupby(['Pclass','Sex','Title'])['Age'].agg({'mean', 'median', 'count'}))

for df in [train_data, test_data, combined]:
    df.loc[(df.Title=='Miss') & (df.Parch!=0) & (df.PeopleInTicket>1), 'Title']="FemaleChild"

display(combined[(combined.Age.isnull()) & (combined.Title=='FemaleChild')])



                       median  count       mean
Pclass Sex    Title                            
1      female Miss       30.0     47  29.744681
              Mrs        40.0     35  40.400000
              Royalty    48.0      3  43.333333
       male   Master      4.0      3   5.305567
              Mr         40.0     87  41.580460
              Officer    56.0      5  56.600000
              Royalty    42.0      6  42.166667
2      female Miss       24.0     32  22.390625
              Mrs        31.5     42  33.547619
       male   Master      1.0      9   2.259256
              Mr         31.0     82  32.768293
              Royalty    46.5      8  42.000000
3      female Miss       18.0     69  16.123188
              Mrs        31.0     33  33.515152
       male   Master      4.0     24   5.350696
              Mr         26.0    229  28.724891


Unnamed: 0,Age,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Title,PeopleInTicket,FarePerPerson
128,,C,22.3583,"Peter, Miss. Anna",1,129,3,female,1,1.0,2668,FemaleChild,3,7.452767
180,,S,69.55,"Sage, Miss. Constance Gladys",2,181,3,female,8,0.0,CA. 2343,FemaleChild,11,6.322727
229,,S,25.4667,"Lefebre, Miss. Mathilde",1,230,3,female,3,0.0,4133,FemaleChild,5,5.09334
409,,S,25.4667,"Lefebre, Miss. Ida",1,410,3,female,3,0.0,4133,FemaleChild,5,5.09334
485,,S,25.4667,"Lefebre, Miss. Jeannie",1,486,3,female,3,0.0,4133,FemaleChild,5,5.09334
792,,S,69.55,"Sage, Miss. Stella Anna",2,793,3,female,8,0.0,CA. 2343,FemaleChild,11,6.322727
863,,S,69.55,"Sage, Miss. Dorothy Edith 'Dolly'",2,864,3,female,8,0.0,CA. 2343,FemaleChild,11,6.322727
888,,S,23.45,"Johnston, Miss. Catherine Helen 'Carrie'",2,889,3,female,1,0.0,W./C. 6607,FemaleChild,4,5.8625
1079,,S,69.55,"Sage, Miss. Ada",2,1080,3,female,8,,CA. 2343,FemaleChild,11,6.322727


In [33]:
grp = train_data.groupby(['Pclass','Sex','Title'])['Age'].mean()
print(grp)

Pclass  Sex     Title      
1       female  FemaleChild    21.071429
                Miss           33.424242
                Mrs            40.400000
                Royalty        43.333333
        male    Master          5.305567
                Mr             41.580460
                Officer        56.600000
                Royalty        42.166667
2       female  FemaleChild     9.916667
                Miss           29.875000
                Mrs            33.547619
        male    Master          2.259256
                Mr             32.768293
                Royalty        42.000000
3       female  FemaleChild     6.500000
                Miss           21.590909
                Mrs            33.515152
        male    Master          5.350696
                Mr             28.724891
Name: Age, dtype: float64


In [34]:
grp = train_data.groupby(['Pclass','Sex','Title'])['Age'].mean().reset_index()[['Sex', 'Pclass', 'Title', 'Age']]

print('\n', 'We converted the series object to: ', type(grp))


 We converted the series object to:  <class 'pandas.core.frame.DataFrame'>


In [35]:
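def fill_age(row):
    # Known ages are returned unchanged
    if not np.isnan(row['Age']):
        return row['Age']
    # Look up the mean age of the passenger's Pclass+Sex+Title group
    matches = grp[(grp['Pclass'] == row['Pclass']) &
                  (grp['Sex'] == row['Sex']) &
                  (grp['Title'] == row['Title'])]
    # Leave the age missing (NaN) when the group has no known mean
    if matches.empty:
        return np.nan
    return matches['Age'].values[0]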

Fill in the missing age with the mean age of Pclass+Sex+Title
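The row-by-row lookup can also be written as a join. The sketch below is an alternative formulation on made-up data, not the notebook's own code: group means are computed on a training frame, merged onto a testing frame by Pclass+Sex+Title, and used only where the age is missing. The column name `GroupMeanAge` is an assumption introduced for the example.

```python
import numpy as np
import pandas as pd

keys = ["Pclass", "Sex", "Title"]

train = pd.DataFrame({
    "Pclass": [3, 3, 1],
    "Sex": ["male", "male", "female"],
    "Title": ["Master", "Master", "Mrs"],
    "Age": [4.0, 6.0, 40.0],
})
test = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "male"],
    "Title": ["Master", "Mrs", "Mr"],
    "Age": [np.nan, 35.0, np.nan],
})

# Mean age per group, learned from the training data only
lookup = train.groupby(keys)["Age"].mean().rename("GroupMeanAge").reset_index()

# A left merge keeps every test row, in its original order
merged = test.merge(lookup, on=keys, how="left")
filled_age = merged["Age"].fillna(merged["GroupMeanAge"])
print(filled_age.tolist())
```

A group that never appears in the training data (the last row here) stays missing, so it is worth checking for leftover nulls after the fill.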

In [36]:
train_data['Age'], test_data['Age'] = [df.apply(lambda row: fill_age(row), axis=1) for df in [train_data, test_data]]

In [59]:
# Ceiling the value

[utils.clean_feature(df, 'Age') for df in [train_data, test_data]]

[None, None]

# Conclusion

In this notebook, we carried out the first step of a machine learning problem: we cleaned the raw data and filled in its missing values, making it ready for exploratory data analysis.


In [61]:
file_names = ['datasets/titanic/train_data.csv', 'datasets/titanic/test_data.csv']
data_frames = [train_data, test_data]

# Iterate through data_frames and file_names to save each DataFrame to a CSV file
for data, file_name in zip(data_frames, file_names):
    data.to_csv(file_name, index=False)