# Imports and Environment

In [2]:
# Basic pandas and numpy
import pandas as pd
import numpy as np
 
# Basic visualization tools
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
 
# Pandas defaults
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500
 
# Make jupyter bigger
from IPython.display import display, HTML
display(HTML('<style>.container { width:100% !important; }</style>'))

In [3]:
# Importing dataset:
df2 = pd.read_csv('../Data/df2.csv')

In [4]:
# Subsetting for the variables I want
dani_df2 = df2[['Survived', 'Pclass', 'Name', 'Sex', 'Ticket', 'Embarked', 'Cabin']]

# Type of variables and nulls

In [6]:
dani_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Ticket      891 non-null object
Embarked    889 non-null object
Cabin       204 non-null object
dtypes: int64(2), object(5)
memory usage: 48.8+ KB


Embarked has 2 NaNs and Cabin has 687 (only 204 of its 891 values are present). The other columns are complete.
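A per-column null count is a quick way to double-check this summary. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `dani_df2`, and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for dani_df2 (values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Embarked': ['S', np.nan, 'C', 'Q'],
    'Cabin': [np.nan, 'B28', np.nan, np.nan],
})

# Count missing values per column and keep only the columns that have any
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

Applied to the real frame, the same two lines should list every column with missing values, not just the one that stands out in `info()`.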

# Analysis: Target Variable - Survived

In [7]:
dani_df2.Survived.value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

**Variable is numeric and has no NaN**

Approx. 61.62% did not survive (0)

Approx. 38.38% did survive (1)

# Analysis: Rest of Categorical Variables

## Embarked

**Nominal variable**

Besides Cabin, this is the only categorical variable that has nulls, and it has just two. Let's have a look at them:

In [8]:
dani_df2.loc[dani_df2.Embarked.isna()]

| | Survived | Pclass | Name | Sex | Ticket | Embarked | Cabin |
|---|---|---|---|---|---|---|---|
| 61 | 1 | 1 | Icard, Miss. Amelie | female | 113572 | NaN | B28 |
| 829 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 113572 | NaN | B28 |


Based on records looked up manually at [Encyclopedia Titanica](https://www.encyclopedia-titanica.org/), both women embarked in Southampton (S), so we fill in these entries by hand:

In [10]:
dani_df2 = dani_df2.copy()  # explicit copy, so the assignment below is not made on a slice of df2
dani_df2.loc[dani_df2.Embarked.isna(), 'Embarked'] = 'S'


In [11]:
dani_df2.Embarked.value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

## Name

**Nominal variable**

In [12]:
# .head() keeps the output short for the git upload; the full series has been explored.
dani_df2.Name.value_counts().head()

Futrelle, Mrs. Jacques Heath (Lily May Peel)    1
Todoroff, Mr. Lalio                             1
Osman, Mrs. Mara                                1
Andrews, Miss. Kornelia Theodosia               1
Hamalainen, Master. Viljo                       1
Name: Name, dtype: int64

## Cabin

**Nominal variable**

In [21]:
# Full value counts (the output below is cut off).
dani_df2.Cabin.value_counts()

B96 B98            4
C23 C25 C27        4
G6                 4
D                  3
F33                3
C22 C26            3
E101               3
F2                 3
F G73              2
E44                2
D36                2
C65                2
B22                2
C93                2
D33                2
B5                 2
B18                2
D26                2
D35                2
B51 B53 B55        2
B49                2
B58 B60            2
B28                2
C92                2
C123               2
E8                 2
C83                2
C68                2
E67                2
C2                 2
E24                2
D20                2
E33                2
E121               2
B20                2
D17                2
E25                2
C124               2
F4                 2
B77                2
C125               2
B57 B59 B63 B66    2
B35                2
C78                2
C52                2
C126               2
C86                1
B86          

For the cabin variable, what matters for classification is the letter (the deck), so we strip the rest:

In [None]:
['A', 'B', 'C', 'D', 'E']
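One possible way to keep only the letter, sketched on a few cabin strings in the formats listed above. `Deck` is a hypothetical name, and taking the first character is an assumption: for multi-cabin values such as `F G73` it keeps the first letter only.

```python
import numpy as np
import pandas as pd

# A few cabin strings in the formats seen above, plus a missing value
cabins = pd.Series(['B96 B98', 'G6', 'F G73', 'C123', np.nan], name='Cabin')

# The first character of each string is the letter; NaN stays NaN
deck = cabins.str[0].rename('Deck')
print(deck.tolist())
```

Missing cabins are left as NaN here, so whether to fill them with a separate category remains an open choice.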

## Sex

**Nominal variable**

In [14]:
dani_df2.Sex.value_counts(normalize=True)

male      0.647587
female    0.352413
Name: Sex, dtype: float64

Only males and females. No nulls. 

This variable is quite clean.

## Ticket

**Nominal variable**

In [15]:
# .head() keeps the output short for the git upload; the full series has been explored.
dani_df2.Ticket.value_counts(normalize=False).head()

347082      7
CA. 2343    7
1601        7
CA 2144     6
347088      6
Name: Ticket, dtype: int64

There are repeated values. We believe they belong to members of the same family, and we check this on one ticket:

In [16]:
dani_df2.loc[dani_df2.Ticket == '347077']

| | Survived | Pclass | Name | Sex | Ticket | Embarked | Cabin |
|---|---|---|---|---|---|---|---|
| 25 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 347077 | S | NaN |
| 182 | 0 | 3 | Asplund, Master. Clarence Gustaf Hugo | male | 347077 | S | NaN |
| 233 | 1 | 3 | Asplund, Miss. Lillian Gertrud | female | 347077 | S | NaN |
| 261 | 1 | 3 | Asplund, Master. Edvin Rojj Felix | male | 347077 | S | NaN |
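A single ticket is only one data point. A rough way to test the family hypothesis across all shared tickets is to compare surnames within each ticket group. The sketch below uses toy rows: the Asplund names are shortened from the output above, the last two rows are made up for contrast, and `Surname`, `per_ticket` and `same_family_share` are hypothetical names.

```python
import pandas as pd

# Toy rows: three Asplund passengers on one ticket, two invented passengers on another
toy = pd.DataFrame({
    'Name': ['Asplund, Mrs. Carl Oscar', 'Asplund, Master. Clarence Gustaf Hugo',
             'Asplund, Miss. Lillian Gertrud', 'Doe, Mr. John', 'Roe, Miss. Jane'],
    'Ticket': ['347077', '347077', '347077', 'X 1', 'X 1'],
})

# The surname is the text before the first comma
toy['Surname'] = toy['Name'].str.split(',').str[0].str.strip()

# For each ticket: number of passengers and number of distinct surnames
per_ticket = toy.groupby('Ticket')['Surname'].agg(['size', 'nunique'])

# Among tickets shared by several passengers, share where everyone has the same surname
shared = per_ticket[per_ticket['size'] > 1]
same_family_share = (shared['nunique'] == 1).mean()
print(per_ticket)
print(same_family_share)
```

A matching surname is only a proxy for family membership: relatives with different surnames would be counted as unrelated, and unrelated passengers sharing a surname as family.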


This column is quite messy, with many distinct values.
Some patterns can be inferred for 1st class passengers:

In [17]:
dani_df2[dani_df2.Pclass == 1].tail()

| | Survived | Pclass | Name | Sex | Ticket | Embarked | Cabin |
|---|---|---|---|---|---|---|---|
| 871 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 11751 | S | D35 |
| 872 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 695 | S | B51 B53 B55 |
| 879 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 11767 | C | C50 |
| 887 | 1 | 1 | Graham, Miss. Margaret Edith | female | 112053 | S | B42 |
| 889 | 1 | 1 | Behr, Mr. Karl Howell | male | 111369 | C | C148 |


* WEP (WEP - WE/P) 
* PC
* Only number

For 2nd class passengers:

In [18]:
dani_df2[dani_df2.Pclass == 2].tail()

| | Survived | Pclass | Name | Sex | Ticket | Embarked | Cabin |
|---|---|---|---|---|---|---|---|
| 866 | 1 | 2 | Duran y More, Miss. Asuncion | female | SC/PARIS 2149 | C | NaN |
| 874 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | P/PP 3381 | C | NaN |
| 880 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 230433 | S | NaN |
| 883 | 0 | 2 | Banfield, Mr. Frederick James | male | C.A./SOTON 34068 | S | NaN |
| 886 | 0 | 2 | Montvila, Rev. Juozas | male | 211536 | S | NaN |


* CA (multiple formats)
* SC (multiple formats)
* SO (multiple formats)
* Only number

For 3rd class passengers:

In [19]:
dani_df2[dani_df2.Pclass == 3].tail()

| | Survived | Pclass | Name | Sex | Ticket | Embarked | Cabin |
|---|---|---|---|---|---|---|---|
| 882 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 7552 | S | NaN |
| 884 | 0 | 3 | Sutehall, Mr. Henry Jr | male | SOTON/OQ 392076 | S | NaN |
| 885 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 382652 | Q | NaN |
| 888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | W./C. 6607 | S | NaN |
| 890 | 0 | 3 | Dooley, Mr. Patrick | male | 370376 | Q | NaN |


* CA (multiple formats)
* SP (multiple formats)
* AC (multiple formats)
* SO (multiple formats)
* SOTON (multiple formats)
* STON (multiple formats)
* Only number

**We will manage these issues in the feature creation/transformation section, although a clear pattern exists only for 1st class passengers**
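As a rough sketch of how the "multiple formats" could be collapsed later, the function below strips dots and slashes from the non-numeric part of a ticket. The `ticket_prefix` helper, the `TicketPrefix` name, the `'NUM'` label and the normalization rule itself are all assumptions, not the transformation used in this project.

```python
import pandas as pd

# Ticket strings copied from the outputs above
tickets = pd.Series(['CA. 2343', 'CA 2144', '347082', 'SOTON/OQ 392076',
                     'W./C. 6607', 'SC/PARIS 2149'], name='Ticket')

def ticket_prefix(ticket: str) -> str:
    """Return a normalized prefix, or 'NUM' for number-only tickets."""
    parts = ticket.split()
    # Drop the trailing ticket number, if there is one
    if parts and parts[-1].isdigit():
        parts = parts[:-1]
    if not parts:
        return 'NUM'
    # Join what is left and remove the punctuation that varies between formats
    return ''.join(parts).replace('.', '').replace('/', '').upper()

prefixes = tickets.map(ticket_prefix).rename('TicketPrefix')
print(prefixes.tolist())
```

With this rule `CA. 2343` and `CA 2144` fall into the same group, while `SOTON/OQ` and `STON`-style prefixes would still need to be merged by hand if they are meant to be one category.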

## Pclass

**Ordinal variable**

In [20]:
dani_df2.Pclass.value_counts(normalize=True)

3    0.551066
1    0.242424
2    0.206510
Name: Pclass, dtype: float64

There are no nulls. Clean values.


We will try to recode the values later because, counterintuitively for a numeric scale, 3 is the lowest class here and 1 the highest.
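Two possible ways to make the ordering explicit, sketched on toy values. Both are assumptions about what the later transformation might look like, and `pclass_rev` and `pclass_cat` are hypothetical names.

```python
import pandas as pd

# Toy class values (illustrative only)
pclass = pd.Series([3, 1, 2, 3], name='Pclass')

# Option 1: flip the scale so that a higher number means a better class
pclass_rev = (4 - pclass).rename('Pclass_rev')

# Option 2: ordered categorical, listing the categories from worst to best
pclass_cat = pd.Categorical(pclass, categories=[3, 2, 1], ordered=True)

print(pclass_rev.tolist())
print(pclass_cat.min(), pclass_cat.max())
```

The first option keeps a plain integer column, which most models accept directly. The second keeps the original labels and stores the order as metadata.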