# <u>Classification for Titanic Dataset </u>

<b>The Challenge</b>

<u>The sinking of the Titanic is one of the most infamous shipwrecks in history.</u>

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [1]:
test_path = 'D:/Datasets/Titanic/test.csv'
train_path = 'D:/Datasets/Titanic/train.csv'

In [2]:
import pandas as pd 

test_data = pd.read_csv(test_path)
train_data = pd.read_csv(train_path)

train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Explore Data

In [3]:
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [4]:
train_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB


In [6]:
train_data['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64


# <u>Part I </u> 

# <u> Filling in missing Cabins </u>

### There are too many missing cabins, so I'll build a classification model to predict them. First, I'll need to separate the deck letter from the room number. For the missing values (NaN), I'll label those passengers as Deck, since a missing cabin most likely means they didn't have one. Fare seems to be what most likely determines Pclass and cabin section. I'll have two training sets: one stratified using Fare bins, and another that hasn't been stratified.

* I'll also remove the PassengerId, Name, and Ticket columns, since they aren't relevant to what we are trying to figure out.
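The stratified variant mentioned here is not shown in the cells that follow. A minimal sketch of how it might look, assuming quartile bins built with `pd.qcut`; the bin count, the `Fare_bin` column name, and the toy frame are illustrative assumptions, not the notebook's actual choices:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_data; the real notebook reads train.csv instead.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Fare": rng.uniform(0, 100, size=200),
    "Pclass": rng.integers(1, 4, size=200),
})

# Bin Fare into quartiles so each bin holds roughly a quarter of the rows.
toy["Fare_bin"] = pd.qcut(toy["Fare"], q=4, labels=False)

# Stratifying on the bin keeps the fare distribution similar in both splits.
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Fare_bin"]
)

print(strat_train["Fare_bin"].value_counts(normalize=True).sort_index())
print(strat_test["Fare_bin"].value_counts(normalize=True).sort_index())
```

The unstratified split is the same call without the `stratify` argument, which makes the two training sets easy to compare.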

In [3]:
# Split the data 80/20

from sklearn.model_selection import train_test_split

train, test = train_test_split(train_data, test_size=0.2, random_state=42)

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
331,332,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S


In [4]:
train.Cabin.sort_values(ascending=True).unique()

# A, B, C, D, E, F, G, T, nan 

array(['A10', 'A14', 'A16', 'A19', 'A23', 'A24', 'A26', 'A32', 'A34',
       'A36', 'B101', 'B102', 'B18', 'B19', 'B20', 'B22', 'B28', 'B3',
       'B35', 'B37', 'B38', 'B4', 'B41', 'B42', 'B49', 'B5', 'B50',
       'B51 B53 B55', 'B57 B59 B63 B66', 'B58 B60', 'B71', 'B73', 'B77',
       'B79', 'B80', 'B82 B84', 'B94', 'B96 B98', 'C101', 'C103', 'C104',
       'C106', 'C111', 'C118', 'C123', 'C124', 'C125', 'C128', 'C2',
       'C22 C26', 'C23 C25 C27', 'C30', 'C32', 'C45', 'C46', 'C47', 'C49',
       'C50', 'C52', 'C54', 'C62 C64', 'C65', 'C68', 'C7', 'C70', 'C78',
       'C82', 'C83', 'C85', 'C86', 'C87', 'C90', 'C91', 'C92', 'C93',
       'C99', 'D', 'D11', 'D17', 'D20', 'D26', 'D30', 'D33', 'D35', 'D36',
       'D37', 'D45', 'D46', 'D49', 'D56', 'D6', 'D9', 'E10', 'E101',
       'E12', 'E121', 'E17', 'E24', 'E31', 'E33', 'E38', 'E40', 'E44',
       'E46', 'E50', 'E58', 'E67', 'E8', 'F E69', 'F G63', 'F G73', 'F2',
       'F33', 'F38', 'F4', 'G6', 'T', nan], dtype=object)

In [10]:
train.Cabin.isnull().sum()

553

In [11]:
train.shape

(712, 12)

# Part II Missing Cabins (NaN)

I can't simply fill every NaN with Deck, since those passengers still have a Fare associated with their missing Cabin.

In [12]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            140
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          553
Embarked         2
dtype: int64

In [4]:
fare = train[train['Fare']==0.0]
fare[fare['Pclass']==3]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S


In [14]:
train[train['Ticket']=='LINE']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S


In [15]:
fare[fare['Pclass']==2]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
277,278,0,2,"Parkes, Mr. Francis ""Frank""",male,,0,0,239853,0.0,,S
732,733,0,2,"Knight, Mr. Robert J",male,,0,0,239855,0.0,,S
674,675,0,2,"Watson, Mr. Ennis Hastings",male,,0,0,239856,0.0,,S
413,414,0,2,"Cunningham, Mr. Alfred Fleming",male,,0,0,239853,0.0,,S
466,467,0,2,"Campbell, Mr. William",male,,0,0,239853,0.0,,S


In [16]:
train.loc[(train.Pclass==2) & (train.Cabin.notnull())]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
148,149,0,2,"Navratil, Mr. Michel (""Louis M Hoffman"")",male,36.5,0,2,230080,26.0,F2,S
193,194,1,2,"Navratil, Master. Michel M",male,3.0,1,1,230080,26.0,F2,S
473,474,1,2,"Jerwan, Mrs. Amin S (Marie Marthe Thuillard)",female,23.0,0,0,SC/AH Basle 541,13.7917,D,C
340,341,1,2,"Navratil, Master. Edmond Roger",male,2.0,1,1,230080,26.0,F2,S
516,517,1,2,"Lemore, Mrs. (Amelia Milley)",female,34.0,0,0,C.A. 34260,10.5,F33,S
618,619,1,2,"Becker, Miss. Marion Louise",female,4.0,2,1,230136,39.0,F4,S
717,718,1,2,"Troutt, Miss. Edwina Celia ""Winnie""",female,27.0,0,0,34218,10.5,E101,S
303,304,1,2,"Keane, Miss. Nora A",female,,0,0,226593,12.35,E101,Q
123,124,1,2,"Webber, Miss. Susan",female,32.5,0,0,27267,13.0,E101,S
183,184,1,2,"Becker, Master. Richard F",male,1.0,2,1,230136,39.0,F4,S


In [17]:
fare[fare['Pclass']==1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0,,S
263,264,0,1,"Harrison, Mr. William",male,40.0,0,0,112059,0.0,B94,S
815,816,0,1,"Fry, Mr. Richard",male,,0,0,112058,0.0,B102,S
806,807,0,1,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S


## <b>Assumption: </b>

* Ticket number starting with 1 = Pclass 1 (Cabins A, B)
* Ticket number starting with 2 = Pclass 2 (Cabins D, F, E)
* Ticket marked LINE = Deck/no cabin (Cabins Deck, G, T)
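One way to sanity-check this assumption is to cross-tabulate the first character of `Ticket` against `Pclass`. The sketch below uses a handful of (Ticket, Pclass) pairs copied from the outputs earlier in the notebook; `ticket_prefix` is a hypothetical helper column introduced only for this illustration:

```python
import pandas as pd

# A few (Ticket, Pclass) pairs taken from the tables shown earlier.
sample = pd.DataFrame({
    "Ticket": ["112052", "112059", "113043", "239853", "239855",
               "230080", "LINE", "LINE", "350025"],
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# First character of the ticket string; "LINE" tickets keep their own label.
sample["ticket_prefix"] = sample["Ticket"].where(
    sample["Ticket"] == "LINE", sample["Ticket"].str[0]
)

# Rows: ticket prefix, columns: passenger class, cells: passenger counts.
prefix_vs_class = pd.crosstab(sample["ticket_prefix"], sample["Pclass"])
print(prefix_vs_class)
```

On the full training set the table is worth reading closely: tickets such as `34218` (second class) and prefixed tickets like `C.A. 34260` appear in the earlier outputs and do not follow the rule.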

In [5]:
# Update Cabins according to class 

def update_cabins(cabin, pclass, fare):
    # Only fill cabins that are missing for passengers who paid no fare.
    if pd.isnull(cabin) and fare == 0.0:
        if pclass == 1:
            return 'A'
        if pclass == 2:
            return 'D'
        if pclass == 3:
            return 'On_Deck'
    # Every other row keeps its original value (including NaN).
    return cabin

# Use apply method using lambda function 

fare['new_col'] = fare.apply(lambda x: update_cabins(x['Cabin'], x['Pclass'], x['Fare']), axis=1)

fare[fare['new_col']=='On_Deck']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,new_col
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S,On_Deck
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S,On_Deck
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S,On_Deck
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S,On_Deck
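The warning in this cell's output appears because `fare` was created by filtering `train`, so pandas cannot tell whether the new column should also affect the original frame. A minimal sketch of one common way around it, taking an explicit `.copy()` before assigning; the toy data and `update_cabin_sketch` (a simplified stand-in for `update_cabins`) are illustrative assumptions:

```python
import pandas as pd

toy_train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Fare": [0.0, 0.0, 13.0, 7.925],
    "Cabin": [None, None, None, None],
})

# .copy() makes the filtered frame independent of toy_train,
# so adding a column to it is unambiguous.
zero_fare = toy_train[toy_train["Fare"] == 0.0].copy()

def update_cabin_sketch(cabin, pclass):
    # Simplified stand-in: only handles the third-class case.
    if pd.isnull(cabin) and pclass == 3:
        return "On_Deck"
    return cabin

zero_fare["new_col"] = zero_fare.apply(
    lambda row: update_cabin_sketch(row["Cabin"], row["Pclass"]), axis=1
)
print(zero_fare)
```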


# Return Fare DF

In [7]:
copied_train = train.copy()
copied_train.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
331,332,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S


In [8]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
331,332,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S


In [8]:
copied_train['Update_Cabin'] = copied_train.apply(lambda x: update_cabins(x['Cabin'], x['Pclass'], x['Fare']), axis=1)

copied_train[copied_train['Update_Cabin']=='On_Deck']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Update_Cabin
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S,On_Deck
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S,On_Deck
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S,On_Deck
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S,On_Deck


In [10]:
copied_train[copied_train['Update_Cabin']=='A']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Update_Cabin
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0,,S,A


In [11]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
331,332,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S


# Part III 

* Separate the rows where Cabin is NaN from the rows that have a value.
* Strip the room numbers from the Cabin column, keeping only the deck letter.

In [9]:
nan_cabins = copied_train[copied_train['Cabin'].isnull()]
X_train = copied_train.copy()

In [9]:
X_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Update_Cabin
331,332,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S,C124
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S,
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S,
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S,
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S,


In [10]:
X_train[X_train['Update_Cabin']=='On_Deck']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Update_Cabin
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S,On_Deck
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S,On_Deck
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S,On_Deck
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S,On_Deck


In [11]:
nan_cabins.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Update_Cabin
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S,
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S,
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S,
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S,
361,362,0,2,"del Carlo, Mr. Sebastiano",male,29.0,1,0,SC/PARIS 2167,27.7208,,C,


# Part IV 
Strip the room number from each Cabin value, keeping only the deck letter.

In [44]:
A_only.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_Class
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0,,S,
263,264,0,1,"Harrison, Mr. William",male,40.0,0,0,112059,0.0,B94,S,B94
815,816,0,1,"Fry, Mr. Richard",male,,0,0,112058,0.0,B102,S,B102
806,807,0,1,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,A


In [12]:
def cab_letter(cabin):
    # Keep only the first character of each value (the deck letter), then
    # expand the two placeholder values: 'None' (missing) and 'On_Deck'.
    first_char = cabin['Update_Cabin'].str[0]
    cabin['Cabin_Section'] = first_char.replace({'N': 'None', 'O': 'Deck'})
    return cabin

In [13]:
X_train_copy = X_train.copy()
X_train_copy['Update_Cabin'] = X_train_copy['Update_Cabin'].fillna(value='None')

In [14]:
X_train_copy_format = cab_letter(X_train_copy)

In [15]:
X_train_copy_format.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Update_Cabin,Cabin_Section
331,332,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S,C124,C
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S,,
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S,,
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S,,
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S,,


In [18]:
X_train_copy_format[X_train_copy_format['Update_Cabin']=='On_Deck']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Update_Cabin,Cabin_Section
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S,On_Deck,Deck
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S,On_Deck,Deck
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S,On_Deck,Deck
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S,On_Deck,Deck


In [19]:
X_train_copy_format.Cabin_Section.sort_values(ascending=True).unique()

array(['A', 'B', 'C', 'D', 'Deck', 'E', 'F', 'G', 'None', 'T'],
      dtype=object)

## Curious to see if there is a price range for each class 

In [23]:
pclass_1 = X_train[X_train['Pclass']==1]
print(pclass_1['Fare'].min())
print(pclass_1['Fare'].max())

0.0
512.3292


In [24]:
pclass_1['Fare'].median()

71.0

In [25]:
pclass_2 = X_train[X_train['Pclass']==2]
print(pclass_2['Fare'].min())
print(pclass_2['Fare'].max())

10.5
39.0


In [26]:
pclass_2['Fare'].median()

13.39585

In [27]:
pclass_3 = X_train[X_train['Pclass']==3]
print(pclass_3['Fare'].min())
print(pclass_3['Fare'].max())

7.65
22.3583


In [28]:
pclass_3['Fare'].median()

10.4625
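The per-class cells above can be collapsed into a single call. A sketch using `groupby` with `agg`, run here on a few (Pclass, Fare) pairs taken from the earlier outputs; in the notebook the frame would be `X_train`:

```python
import pandas as pd

sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [28.5, 71.2833, 13.0, 26.0, 7.925, 31.275],
})

# One row per class, one column per summary statistic.
fare_by_class = sample.groupby("Pclass")["Fare"].agg(["min", "median", "max"])
print(fare_by_class)
```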

In [29]:
Pclass_corr = train.Cabin.str.get_dummies(sep=' ').corrwith(train.Pclass/train.Pclass.max())

print(Pclass_corr)

A10   -0.060535
A14   -0.060535
A16   -0.060535
A19   -0.060535
A23   -0.060535
         ...   
F4    -0.021259
G6     0.052887
G63    0.030491
G73    0.043151
T     -0.060535
Length: 130, dtype: float64


In [31]:
# Correlate Cabin dummies with Ticket dummies

ticket_corr = train.Cabin.str.get_dummies(sep=' ').corrwith(train.Ticket.str.get_dummies(sep=' ')/train.Ticket.str.get_dummies(sep=' ').max())
print(ticket_corr)

10482    NaN
110152   NaN
110413   NaN
110465   NaN
110564   NaN
          ..
T        NaN
W./C.    NaN
W.E.P.   NaN
W/C      NaN
WE/P     NaN
Length: 729, dtype: float64
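The `NaN` values in this output are consistent with how `DataFrame.corrwith` treats another DataFrame: the two frames are aligned by column label, and any label present in only one of them gets `NaN`. A toy illustration (the column names are made up for the example):

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0, 3.0], "shared": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"y": [3.0, 1.0, 2.0], "shared": [2.0, 4.0, 6.0]})

# Only "shared" exists in both frames, so only it gets a correlation value.
aligned_corr = left.corrwith(right)
print(aligned_corr)
```

Cabin dummies and ticket dummies have almost no column names in common, so comparing them this way says little about how tickets relate to cabins.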


In [32]:
fare_corr = train.Cabin.str.get_dummies(sep=' ').corrwith(train.Fare/train.Fare.max())
print(fare_corr)

A10    0.005444
A14    0.014019
A16    0.005065
A19   -0.004756
A23   -0.001868
         ...   
F4     0.006555
G6    -0.025107
G63   -0.018007
G73   -0.025484
T      0.002104
Length: 130, dtype: float64


In [33]:
# Only specific attributes will be needed to predict/label the cabin, so a few columns are removed.
# `columns=` is required here: by default drop() looks for the labels in the row index.

cabin_train = train.drop(columns=['Name', 'Sex'])
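`DataFrame.drop` looks labels up in the row index unless told otherwise, which is why passing column names on their own raises a `KeyError`. A small sketch of the two equivalent spellings for dropping columns, on a toy frame used only for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["a", "b"],
    "Sex": ["male", "female"],
    "Fare": [7.25, 71.2833],
})

# Both forms drop columns; with neither, pandas searches the row labels.
by_keyword = toy.drop(columns=["Name", "Sex"])
by_axis = toy.drop(["Name", "Sex"], axis=1)

try:
    toy.drop(["Name", "Sex"])
    raised = False
except KeyError:
    raised = True

print(list(by_keyword.columns), raised)
```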

In [12]:
# Create the set whose Cabin needs to be predicted, plus the train and test sets

predict_set = train[train.Cabin.isnull()]
predict_set.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
733,734,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S
361,362,0,2,"del Carlo, Mr. Sebastiano",male,29.0,1,0,SC/PARIS 2167,27.7208,,C


In [13]:
predict_set.shape

(553, 12)

In [12]:
# Train and Test Set 

training_data = train_data[train_data.Cabin.notnull()]
training_data.head()


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
1,1,1,female,38.0,1,0,71.2833,C85,C
3,1,1,female,35.0,1,0,53.1,C123,S
6,0,1,male,54.0,0,0,51.8625,E46,S
10,1,3,female,4.0,1,1,16.7,G6,S
11,1,1,female,58.0,0,0,26.55,C103,S


In [13]:
# Create training and testing sets. 

from sklearn.model_selection import train_test_split

In [14]:
cabin_train, cabin_test = train_test_split(training_data, test_size=0.2, random_state=42)

In [15]:
print(cabin_train.shape)
print(cabin_test.shape)

(163, 9)
(41, 9)


In [16]:
X_cabin_train = cabin_train.drop(['Cabin'], axis=1)
y_cabin_train = cabin_train[['Cabin']]

X_cabin_test = cabin_test.drop(['Cabin'], axis=1)
y_cabin_test = cabin_test[['Cabin']]

In [17]:
print(X_cabin_train.shape)
print(y_cabin_train.shape)
print(X_cabin_test.shape)
print(y_cabin_test.shape)

(163, 8)
(163, 1)
(41, 8)
(41, 1)


## <u>Part II</u>

# <u>Prepare data for classification models </u>

There are two sets of data: one has been stratified and the other has not. The purpose of this section is to predict/label the Cabins correctly. I'll build some pipelines to preprocess the data.

In [19]:
X_cabin_train.isnull().sum()

Survived     0
Pclass       0
Sex          0
Age         17
SibSp        0
Parch        0
Fare         0
Embarked     1
dtype: int64

In [20]:
X_cabin_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 871 to 457
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  163 non-null    int64  
 1   Pclass    163 non-null    int64  
 2   Sex       163 non-null    object 
 3   Age       146 non-null    float64
 4   SibSp     163 non-null    int64  
 5   Parch     163 non-null    int64  
 6   Fare      163 non-null    float64
 7   Embarked  162 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 10.2+ KB


In [21]:
# Build pipeline to fill missing values for Age and encode Sex and Embarked. 

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

age_pipeline = Pipeline([
    ("age_fill", SimpleImputer(strategy='mean'))
])

embark_pipeline = Pipeline([
    ("embark_col", SimpleImputer(strategy='most_frequent')),
    ("encoder", OneHotEncoder())
])

In [22]:
age_df = X_cabin_train[['Age']]

age_tr = age_pipeline.fit_transform(age_df)

In [23]:
age_filled_df = pd.DataFrame(age_tr, columns=age_df.columns, index=age_df.index)
age_filled_df.isnull().any()

Age    False
dtype: bool

In [24]:

preprocess_pipeline = ColumnTransformer([
    ("age_col", SimpleImputer(strategy='mean'), ['Age']),
    ("embarked_col", embark_pipeline, ['Embarked']),
    ("encode_sex", OneHotEncoder(), ['Sex'])
])

X_cabin_train_prepared = preprocess_pipeline.fit_transform(X_cabin_train)
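Note that `ColumnTransformer` drops every column not named in a transformer by default (`remainder='drop'`), so the prepared matrix above contains only the Age, Embarked and Sex features. If the remaining numeric columns are meant to be kept, a sketch with `remainder='passthrough'` could look like this; the toy frame is illustrative, and keeping those columns is an assumption about intent:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [71.2833, 7.925, 13.0, 53.1],
    "Embarked": ["S", "C", np.nan, "S"],
})

embark_pipeline = Pipeline([
    ("embark_col", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),
])

keep_rest = ColumnTransformer(
    [
        ("age_col", SimpleImputer(strategy="mean"), ["Age"]),
        ("embarked_col", embark_pipeline, ["Embarked"]),
        ("encode_sex", OneHotEncoder(), ["Sex"]),
    ],
    remainder="passthrough",  # keep Pclass and Fare instead of dropping them
)

prepared = keep_rest.fit_transform(toy)
print(prepared.shape)
```

The output has one Age column, two one-hot columns each for Embarked and Sex, and the two passed-through columns.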

In [32]:
# y data: remove the room number and just leave the Cabin letter only. 

y_cabin_train['Cabin'].sort_values(ascending=True).unique()

array(['A14', 'A16', 'A19', 'A20', 'A23', 'A24', 'A31', 'A32', 'A36',
       'A6', 'A7', 'B101', 'B102', 'B18', 'B19', 'B20', 'B22', 'B28',
       'B3', 'B30', 'B35', 'B37', 'B38', 'B39', 'B4', 'B41', 'B42', 'B49',
       'B5', 'B50', 'B51 B53 B55', 'B57 B59 B63 B66', 'B58 B60', 'B69',
       'B71', 'B73', 'B77', 'B78', 'B79', 'B80', 'B82 B84', 'B86', 'B94',
       'B96 B98', 'C103', 'C104', 'C106', 'C110', 'C111', 'C123', 'C124',
       'C125', 'C126', 'C148', 'C22 C26', 'C23 C25 C27', 'C30', 'C32',
       'C45', 'C46', 'C47', 'C49', 'C52', 'C62 C64', 'C65', 'C68', 'C70',
       'C78', 'C82', 'C83', 'C85', 'C87', 'C90', 'C91', 'C92', 'C93',
       'C95', 'C99', 'D', 'D10 D12', 'D15', 'D17', 'D19', 'D20', 'D21',
       'D26', 'D28', 'D30', 'D33', 'D35', 'D36', 'D45', 'D46', 'D47',
       'D48', 'D56', 'D6', 'D9', 'E10', 'E101', 'E12', 'E121', 'E17',
       'E24', 'E25', 'E31', 'E33', 'E34', 'E38', 'E40', 'E44', 'E46',
       'E49', 'E58', 'E63', 'E67', 'E68', 'E77', 'E8', 'F E69', 'F G