# Giulia Solinas - First Assignment: Exploratory Data Analysis

- Use case: derive descriptive statistical measures & generate plots.
- Data source: https://www.kaggle.com/competitions/titanic/data


In [5]:
# Import packages

import numpy as np      # linear algebra and arrays
import pandas as pd     # data wrangling

%matplotlib inline
import matplotlib.pyplot as plt     # data visualization
import seaborn as sns               # data visualization

from scipy.stats import mode
import string


In [6]:
# Enable greedy autocompletion in the Jupyter notebook
%config IPCompleter.greedy=True

## Load data

In [7]:
from pathlib import Path
cwd        = Path.cwd()
dataFolder = Path(cwd.parent , 'data')

In [11]:
# train
df_train = pd.read_csv(Path(dataFolder, 'train.csv'))
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
# test
df_test = pd.read_csv(Path(dataFolder, 'test.csv'))
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [10]:
# gender_submission (Kaggle's sample submission file)
df_gender = pd.read_csv(Path(dataFolder, 'gender_submission.csv'))

# This dataset is not necessary for the analysis.

In [13]:
# Create the full dataset by concatenating train and test.
# NOTE: DataFrame.append is deprecated and was removed in pandas 2.0, so pandas.concat is used instead.
df_full = pd.concat([df_train, df_test], ignore_index=True)
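Because `test.csv` has no `Survived` column, stacking the two frames leaves that column empty for the test rows, and pandas stores it as a float. The sketch below illustrates this with two tiny hand-built frames; `toy_train`, `toy_test` and `toy_full` are hypothetical stand-ins for the real data, and the values are only illustrative.

```python
import pandas as pd

# Toy stand-ins for df_train and df_test (illustrative values only)
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 3]})

# Stack the rows; ignore_index=True rebuilds a clean 0..n-1 index
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# Test rows have no survival label, so the column holds NaN there
# and its dtype is promoted from int64 to float64.
print(toy_full)
print(toy_full["Survived"].dtype)
```

This is consistent with `Survived` showing up as `float64` with fewer non-null values than the other columns once the frames are combined.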


In [14]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


The combined dataset contains 1309 passengers, which is close to the reported passenger count. The exact number of people traveling on the Titanic is not known, but the official total of passengers and crew is 2,229, and the number of passengers appears to be approximately 1316 (for a list of sources, see https://en.wikipedia.org/wiki/Titanic#).

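The passenger list has 1309 entries, each with a ticket number, but the `Embarked` column is filled in for only 1307 of them. The two missing values are gaps in the data and do not by themselves show that those ticket holders failed to board.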
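The non-null counts reported by `info()` can also be turned into explicit missing-value counts per column. A minimal sketch, using a small hand-built frame (`toy`, a hypothetical stand-in for `df_full` with illustrative values):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_full (illustrative values, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

missing = toy.isna().sum()               # number of missing values per column
missing_pct = toy.isna().mean() * 100    # share of missing values, in percent

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```

Running the same two lines on `df_full` should give the counts implied by the `info()` output, for example 1309 - 1307 = 2 missing values for `Embarked`.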

In [15]:
df_full.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Dataset features

The dataset provides information about passengers who were aboard the Titanic, including details about their demographics, ticket information, cabin location, and survival status. Here's an explanation of the columns in the Titanic dataset:

- PassengerId: A unique identifier for each passenger.
- Survived: Indicates whether the passenger survived (1) or did not survive (0). Only the 891 training rows have a value.
- Pclass: The passenger's class of travel:
    - 1 = 1st class.
    - 2 = 2nd class.
    - 3 = 3rd class.
- Name: The name of the passenger.
- Sex: The gender of the passenger.
- Age: The age of the passenger. This column has missing values (1046 of 1309 entries are non-null).
- SibSp: The number of siblings or spouses traveling with the passenger.
- Parch: The number of parents or children traveling with the passenger.
- Ticket: The ticket number.
- Fare: The fare paid by the passenger.
- Cabin: The cabin number where the passenger stayed. Most values are missing (295 of 1309 entries are non-null).
- Embarked: The port at which the passenger boarded the ship:
    - C = Cherbourg, a port in France.
    - Q = Queenstown, a port in Ireland.
    - S = Southampton, a port in England.
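The coded columns described above can be made easier to read by mapping the codes to labels before counting. A minimal sketch on a toy frame; `port_names`, `class_names` and the new column names are assumptions introduced here for illustration, and the values are not the real data:

```python
import pandas as pd

# Toy stand-in for the Pclass and Embarked columns (illustrative values)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# Lookup tables taken from the column descriptions
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
class_names = {1: "1st class", 2: "2nd class", 3: "3rd class"}

toy["EmbarkedPort"] = toy["Embarked"].map(port_names)
toy["PclassLabel"] = toy["Pclass"].map(class_names)

# Frequency of each port and each class
port_counts = toy["EmbarkedPort"].value_counts()
class_counts = toy["PclassLabel"].value_counts()
print(port_counts)
print(class_counts)
```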

In [None]:
# TODO: following this post (https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/), create new features: family size, imputed fares, and age per class (for example, older people in first class).
# TODO: clean the data of missing values (NAs)
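One possible reading of the TODO items is sketched below on a toy frame. This is not necessarily the approach taken in the linked post: the `FamilySize` column name, the choice of the per-class median for fare imputation, and the use of the median age per class are all assumptions made here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_full (illustrative values only)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "SibSp":  [1, 1, 0, 1, 0, 0],
    "Parch":  [0, 0, 0, 0, 0, 2],
    "Age":    [22.0, 38.0, 26.0, 35.0, np.nan, 30.0],
    "Fare":   [7.25, 71.2833, 7.925, 53.1, np.nan, 13.0],
})

# Family size: siblings/spouses + parents/children + the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Impute a missing fare with the median fare of the same passenger class
class_median_fare = toy.groupby("Pclass")["Fare"].transform("median")
toy["Fare"] = toy["Fare"].fillna(class_median_fare)

# Age per class: median age within each class (NaN ages are skipped)
age_by_class = toy.groupby("Pclass")["Age"].median()

print(toy)
print(age_by_class)
```

Imputing within each class, instead of using a single overall median, keeps an expensive first-class fare from being assigned to a third-class passenger. Whether that is the right choice for the real data is something the analysis would still need to check.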