# General instructions
- In this problem set you will work on four datasets related to the sinking of the Titanic. Your task is to prepare and structure the data in a way that allows future analysis. The final dataset (it should be a single, tidy dataset) may look similar to the one that we worked with in previous lectures.
- There may be multiple ways to link and clean up the data. Decide for yourself what makes sense and justify your choices.
- If you want to better understand the meaning of certain information, then visit https://www.encyclopedia-titanica.org/. In particular, you may study the information on [passengers](https://www.encyclopedia-titanica.org/titanic-passenger-list/), [crew members](https://www.encyclopedia-titanica.org/titanic-crew-list/) and the description of [recovered bodies](https://www.encyclopedia-titanica.org/description-of-recovered-titanic-bodies.html#135).


# Exercise 1

Load the four data sets

In [106]:
import pandas as pd

In [107]:
bodies = pd.read_csv('bodies.csv', sep=';')
crew = pd.read_csv('crew.csv', sep=',')
outcomes = pd.read_csv('outcomes.csv', sep=',')
passengers = pd.read_csv('passengers.csv', sep=',')


Study important characteristics of the data (number of rows, column names, first rows, ...)

In [108]:
print(bodies.shape)
bodies.head(3)

(330, 4)


Unnamed: 0,BODY_NO,GENDER,CLOTHING,ESTIMATED_AGE
0,1,male,"Overcoat, grey; one grey coat; one blue coat; ...",
1,2,male,Brown tweed coat; white steward's jacket; brow...,24.0
2,3,female,Grey cloth jacket; red jersey jacket; blue alp...,40.0


In [109]:
print(crew.shape)
crew.head(3)

(1122, 13)


Unnamed: 0,person_id,first_name,family_name,departure,nationality,gender,age,marital_status,birth_date,home_location,occupation,department,works_for
0,1420,Ernest Owen,Abbott,Southampton,English,Male,21 years,Single,1891-01-19,"Southampton, Hampshire, England",Lounge Pantry Steward,Victualling Crew,White Star Line
1,1421,William Thomas,Abrams,Southampton,English,Male,34 years,Married,,"Southampton, Hampshire, England",Fireman,Engineering Crew,White Star Line
2,1422,Robert John,Adams,Southampton,English,Male,26 years,Single,1885-06-07,"Southampton, Hampshire, England",Fireman,Engineering Crew,White Star Line


In [110]:
print(outcomes.shape)
outcomes.head(5)

(2541, 4)


Unnamed: 0,person_id,lifeboat,body_no,death_date
0,1,,,1912-04-15
1,2,,,1912-04-15
2,3,boat A,,1946-02-18
3,4,,190.0,1912-04-15
4,5,boat 16,,1969-07-27


In [111]:
print(passengers.shape)
passengers.head(3)


(1419, 16)


Unnamed: 0,person_id,first_name,family_name,ticket_number,pclass,departure,price,nationality,gender,age,marital_status,birth_date,home_location,destination,occupation,works_for
0,1,Anthony,Abbing,5547.0,3rd Class Passengers,Southampton,7.0,American,Male,41 years,Single,1870-05-11,"Cincinnati, Ohio, United States","Cincinnati, Ohio, United States",Blacksmith,
1,2,Eugene Joseph,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,American,Male,13 years,,1899-03-31,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",Scholar,
2,3,Rhoda,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,English,Female,39 years,Divorced,1873-01-25,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",,


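Briefly describe: How are the four tables related to each other?

`passengers` and `crew` describe two different groups of people and share most of their columns, so they can be stacked row-wise. `outcomes` holds the lifeboat, body number and death date of each person and links to both tables via `person_id`. `bodies` describes the recovered bodies and links to `outcomes` via the body number.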

Briefly describe: Do you observe any potential problems or noteworthy characteristics that need further inspection/handling?
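Potential problems: all four tables contain many missing values, and it is not obvious whether or how to merge them. `body_no` is missing for most rows in `outcomes`, because survivors have no body number and many victims were never recovered. The key is also spelled differently in the two tables (`BODY_NO` in `bodies`, `body_no` in `outcomes`), and `age` is stored as text such as `41 years`.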
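The points above can be checked directly instead of by eye. A minimal sketch on small made-up frames (the toy rows and the helper names `na_share`, `ids_overlap` and `outcomes_match` are assumptions for illustration, not part of the real data):

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
passengers = pd.DataFrame({"person_id": [1, 2, 3], "age": ["41 years", None, "39 years"]})
crew = pd.DataFrame({"person_id": [4, 5], "age": ["21 years", "34 years"]})
outcomes = pd.DataFrame({"person_id": [1, 2, 3, 4, 5],
                         "body_no": [float("nan"), 7.0, float("nan"), float("nan"), float("nan")]})

# Share of missing values per column
na_share = passengers.isna().mean()

# Do passengers and crew use disjoint person_id values?
ids_overlap = set(passengers["person_id"]) & set(crew["person_id"])

# Does outcomes cover exactly the people in passengers + crew?
all_ids = set(passengers["person_id"]) | set(crew["person_id"])
outcomes_match = all_ids == set(outcomes["person_id"])

print(na_share)
print(ids_overlap, outcomes_match)
```

Running the same three checks on the real tables would show whether `person_id` is a safe merge key.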


# Exercise 2

Your task is to create a single, combined dataset suitable for data analysis. When developing your strategy, you may think about the following questions: 
- Should some datasets be merged (column-wise)? 
- If so, on which keys and using which merge types?
- Should some datasets be concatenated (row-wise)?
- Should some datasets be reshaped (pivoted or melted)?

Implement your chosen strategy.

In [112]:
crew['crew'] = 1
df = pd.concat([passengers, crew],axis = 0)
print(df.shape)
df.columns

(2541, 18)


Index(['person_id', 'first_name', 'family_name', 'ticket_number', 'pclass',
       'departure', 'price', 'nationality', 'gender', 'age', 'marital_status',
       'birth_date', 'home_location', 'destination', 'occupation', 'works_for',
       'department', 'crew'],
      dtype='object')
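Row-wise concatenation keeps the union of all columns and fills columns that exist in only one frame with `NaN`. That also applies to a flag set in just one of the frames. A small sketch with made-up rows (the frame name `people` is hypothetical), setting the flag in both frames before stacking:

```python
import pandas as pd

# Toy frames with partly different columns (values are illustrative)
passengers = pd.DataFrame({"person_id": [1, 2],
                           "pclass": ["3rd Class Passengers", "1st Class Passengers"]})
crew = pd.DataFrame({"person_id": [1420], "department": ["Victualling Crew"]})

# Set the indicator in both frames so it has no missing values after stacking
passengers["crew"] = 0
crew["crew"] = 1

# ignore_index=True gives the stacked frame a fresh 0..n-1 index
people = pd.concat([passengers, crew], axis=0, ignore_index=True)
print(people)
```

Without `ignore_index=True`, the original row labels of both frames are kept, so index values can repeat.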

In [113]:
df = pd.merge( left=df, right=outcomes, on = 'person_id', how= 'outer')
print(df.shape)
df.columns

(2541, 21)


Index(['person_id', 'first_name', 'family_name', 'ticket_number', 'pclass',
       'departure', 'price', 'nationality', 'gender', 'age', 'marital_status',
       'birth_date', 'home_location', 'destination', 'occupation', 'works_for',
       'department', 'crew', 'lifeboat', 'body_no', 'death_date'],
      dtype='object')
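The unchanged row count (2541 before and after) suggests that every person has exactly one outcome. `pd.merge` can verify this through its `validate` and `indicator` parameters. A sketch on toy frames (the rows are made up, the parameters are standard pandas):

```python
import pandas as pd

people = pd.DataFrame({"person_id": [1, 2, 3],
                       "family_name": ["Abbing", "Abbott", "Abbott"]})
outcomes = pd.DataFrame({"person_id": [1, 2, 3],
                         "lifeboat": [None, None, "boat A"]})

# validate raises MergeError if person_id is duplicated on either side;
# indicator adds a _merge column telling where each row came from
merged = pd.merge(people, outcomes, on="person_id", how="outer",
                  validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())
```

If every value in `_merge` is `both`, the outer merge added no unmatched rows and behaves like an inner merge.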

In [114]:
bodies.rename(columns={'BODY_NO': 'body_no'}, inplace=True)
df_full = pd.merge(df, bodies, on='body_no', how='outer')
df_full.shape


(2730, 24)
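The row count grows from 2541 to 2730 here. One plausible reading is that the outer merge also keeps recovered bodies that are not linked to any person; duplicated body numbers would be another possible cause. A sketch of the difference between `how='outer'` and `how='left'` on made-up rows:

```python
import pandas as pd

# Toy data: one identified body (7) and one body without a matching person (8)
people = pd.DataFrame({"person_id": [1, 2, 3],
                       "body_no": [float("nan"), 7.0, float("nan")]})
bodies = pd.DataFrame({"body_no": [7.0, 8.0],
                       "CLOTHING": ["Grey coat", "Blue waterproof"]})

outer = pd.merge(people, bodies, on="body_no", how="outer")
left = pd.merge(people, bodies, on="body_no", how="left")

# Rows that exist only because of an unmatched body have no person_id
unmatched = outer["person_id"].isna().sum()
print(len(outer), len(left), unmatched)
```

With `how='left'` the result stays at one row per person, which matters for any later per-person statistic. Keeping the unmatched bodies is a defensible choice too, as long as those rows are excluded when counting people.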

In [115]:
df_full.drop(columns=['GENDER'], inplace=True)

In [116]:
df_full.head()

Unnamed: 0,person_id,first_name,family_name,ticket_number,pclass,departure,price,nationality,gender,age,...,destination,occupation,works_for,department,crew,lifeboat,body_no,death_date,CLOTHING,ESTIMATED_AGE
0,,,,,,,,,,,...,,,,,,,1.0,,"Overcoat, grey; one grey coat; one blue coat; ...",
1,,,,,,,,,,,...,,,,,,,2.0,,Brown tweed coat; white steward's jacket; brow...,24.0
2,,,,,,,,,,,...,,,,,,,3.0,,Grey cloth jacket; red jersey jacket; blue alp...,40.0
3,492.0,Sidney Leslie,Goodwin,2144.0,3rd Class Passengers,Southampton,46.0,English,Male,1 years,...,"Niagara Falls, New York, United States",,,,,,4.0,1912-04-15,Grey coat with fur on collar and cuffs; brown ...,2.0
4,,,,,,,,,,,...,,,,,,,5.0,,Blue waterproof; black jacket and skirt; pink ...,


# Exercise 3

Derive a boolean column that indicates whether a person (crew or passenger) survived the accident or not

In [117]:
df_full.columns

Index(['person_id', 'first_name', 'family_name', 'ticket_number', 'pclass',
       'departure', 'price', 'nationality', 'gender', 'age', 'marital_status',
       'birth_date', 'home_location', 'destination', 'occupation', 'works_for',
       'department', 'crew', 'lifeboat', 'body_no', 'death_date', 'CLOTHING',
       'ESTIMATED_AGE'],
      dtype='object')

In [118]:
df_full.body_no

0       1.0
1       2.0
2       3.0
3       4.0
4       5.0
       ... 
2725    NaN
2726    NaN
2727    NaN
2728    NaN
2729    NaN
Name: body_no, Length: 2730, dtype: float64

In [119]:
df_full.death_date

0              NaN
1              NaN
2              NaN
3       1912-04-15
4              NaN
           ...    
2725    1939-11-18
2726    1912-04-15
2727    1912-04-15
2728    1912-04-15
2729    1912-04-15
Name: death_date, Length: 2730, dtype: object

In [120]:
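# Survived = a death date is recorded and it differs from the day of the sinking (1912-04-15)
df_full['survived_byDD'] = df_full['death_date'].notna() & (df_full['death_date'] != '1912-04-15')


In [121]:
# A missing body number does not imply survival (many victims were never recovered), so reuse the date-based flag
df_full['survived'] = df_full['survived_byDD']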

Derive a new column `pclass_no` with numeric values 1, 2, and 3 corresponding to the three passenger classes given in column `pclass`.

In [122]:
df_full['pclass_no'] = df_full['pclass'].map({'1st Class Passengers': 1, '2nd Class Passengers': 2, '3rd Class Passengers': 3})
df_full[['pclass', 'pclass_no']].head()

Unnamed: 0,pclass,pclass_no
0,,
1,,
2,,
3,3rd Class Passengers,3.0
4,,

Calculate the age of a person at death and store it in a new column

In [124]:
df_full['birth_date'] = pd.to_datetime(df_full['birth_date'])
df_full['death_date'] = pd.to_datetime(df_full['death_date'])


In [125]:
# Age at death in years: time elapsed from birth to death, divided by the average year length
df_full['death_age'] = (df_full['death_date'] - df_full['birth_date']).dt.days / 365.25
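As a worked check of the age calculation, the sketch below uses the birth and death dates of the first and third passenger shown in the tables above, plus one row with a missing birth date. The frame name `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "birth_date": ["1870-05-11", "1873-01-25", None],
    "death_date": ["1912-04-15", "1946-02-18", "1912-04-15"],
})
toy["birth_date"] = pd.to_datetime(toy["birth_date"])
toy["death_date"] = pd.to_datetime(toy["death_date"])

# Subtract birth from death (not the other way round) and convert days to years
toy["death_age"] = (toy["death_date"] - toy["birth_date"]).dt.days / 365.25
print(toy["death_age"].round(1))
```

The first two ages come out at roughly 42 and 73 years; the row without a birth date stays missing rather than turning into zero.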

Derive a new column `home_country` from the column `home_location`, extracting only the country name from the string

In [127]:
df_full['home_location']

0                                   NaN
1                                   NaN
2                                   NaN
3          Melksham, Wiltshire, England
4                                   NaN
                     ...               
2725    Southampton, Hampshire, England
2726    Southampton, Hampshire, England
2727                    London, England
2728    Southampton, Hampshire, England
2729    Southampton, Hampshire, England
Name: home_location, Length: 2730, dtype: object

In [128]:
df_full['home_country'] = df_full['home_location'].str.split(',').str[-1].str.strip()
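The locations do not all have the same number of comma-separated parts (compare `London, England` with `Southampton, Hampshire, England` above), so the position of the country varies. A small sketch comparing a fixed index with taking the last element (the series `locations` is a toy example built from values shown above):

```python
import pandas as pd

locations = pd.Series([
    "Melksham, Wiltshire, England",
    "London, England",
    None,
])

# Fixed position: keeps the leading space and fails for two-part strings
by_position = locations.str.split(",").str[2]

# Last element with surrounding whitespace removed
by_last = locations.str.split(",").str[-1].str.strip()

print(by_position.tolist())
print(by_last.tolist())
```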

Which further data cleaning/processing steps would be needed/desirable?
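One candidate step is converting the `age` strings (for example `41 years`) into numbers. A minimal sketch, assuming ages are written as a number followed by a unit. The handling of `months` is an assumption, since no such value appears in the rows shown above, and the names `parts`, `value` and `age_years` are hypothetical.

```python
import pandas as pd

age = pd.Series(["41 years", "1 years", "10 months", None])

# Split "<number> <unit>" into two named columns
parts = age.str.extract(r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\w+)")
value = pd.to_numeric(parts["value"])

# Convert months to fractional years; leave values given in years unchanged
is_months = parts["unit"].str.startswith("month", na=False)
age_years = value.where(~is_months, value / 12)
print(age_years.tolist())
```

Other steps worth considering include converting `ticket_number` and `body_no` to nullable integer types and giving the columns from `bodies` consistent lower-case names.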