# General instructions
- In this problem set you will work on four datasets related to the Titanic ship accident. Your task is to prepare and structure the data in a way that allows future analysis. The final dataset (it should be a single, tidy dataset) may look similar to the one that we worked with in previous lectures.
- There may be multiple ways to link and clean up the data. Decide for yourself what makes sense and justify your choices.
- If you want to better understand the meaning of certain information, then visit https://www.encyclopedia-titanica.org/. In particular, you may study the information on [passengers](https://www.encyclopedia-titanica.org/titanic-passenger-list/), [crew members](https://www.encyclopedia-titanica.org/titanic-crew-list/) and the description of [recovered bodies](https://www.encyclopedia-titanica.org/description-of-recovered-titanic-bodies.html#135).


In [1]:
import pandas as pd

# Exercise 1

Load the four data sets

In [3]:
df_bodies = pd.read_csv('bodies.csv',sep=';')
df_crew = pd.read_csv('crew.csv')
df_outcomes = pd.read_csv('outcomes.csv')
df_passengers = pd.read_csv('passengers.csv')

In [4]:
df_bodies.head()

Unnamed: 0,BODY_NO,GENDER,CLOTHING,ESTIMATED_AGE
0,1,male,"Overcoat, grey; one grey coat; one blue coat; ...",
1,2,male,Brown tweed coat; white steward's jacket; brow...,24.0
2,3,female,Grey cloth jacket; red jersey jacket; blue alp...,40.0
3,4,male,Grey coat with fur on collar and cuffs; brown ...,2.0
4,5,female,Blue waterproof; black jacket and skirt; pink ...,


In [24]:
df_bodies.columns = df_bodies.columns.str.lower()

In [25]:
df_bodies

Unnamed: 0,body_no,gender,clothing,estimated_age
0,1,male,"Overcoat, grey; one grey coat; one blue coat; ...",
1,2,male,Brown tweed coat; white steward's jacket; brow...,24.0
2,3,female,Grey cloth jacket; red jersey jacket; blue alp...,40.0
3,4,male,Grey coat with fur on collar and cuffs; brown ...,2.0
4,5,female,Blue waterproof; black jacket and skirt; pink ...,
...,...,...,...,...
325,326,male,Steward's white coat; light checked overalls; ...,50.0
326,327,male,,38.0
327,328,female,Lace trimmed red and black overdress; black un...,14.0
328,329,male,,38.0


In [5]:
df_crew.head()

Unnamed: 0,person_id,first_name,family_name,departure,nationality,gender,age,marital_status,birth_date,home_location,occupation,department,works_for
0,1420,Ernest Owen,Abbott,Southampton,English,Male,21 years,Single,1891-01-19,"Southampton, Hampshire, England",Lounge Pantry Steward,Victualling Crew,White Star Line
1,1421,William Thomas,Abrams,Southampton,English,Male,34 years,Married,,"Southampton, Hampshire, England",Fireman,Engineering Crew,White Star Line
2,1422,Robert John,Adams,Southampton,English,Male,26 years,Single,1885-06-07,"Southampton, Hampshire, England",Fireman,Engineering Crew,White Star Line
3,1423,Percy Snowden,Ahier,Southampton,Channel Islander,Male,20 years,Single,1892-01-08,"Southampton, Hampshire, England",Saloon Steward,Victualling Crew,White Star Line
4,1424,Albert Edward,Akerman,Southampton,English,Male,31 years,Single,1880-11-13,"Southampton, Hampshire, England",3rd Class Steward,Victualling Crew,White Star Line


In [6]:
df_outcomes.head()

Unnamed: 0,person_id,lifeboat,body_no,death_date
0,1,,,1912-04-15
1,2,,,1912-04-15
2,3,boat A,,1946-02-18
3,4,,190.0,1912-04-15
4,5,boat 16,,1969-07-27


In [7]:
df_passengers.head()

Unnamed: 0,person_id,first_name,family_name,ticket_number,pclass,departure,price,nationality,gender,age,marital_status,birth_date,home_location,destination,occupation,works_for
0,1,Anthony,Abbing,5547.0,3rd Class Passengers,Southampton,7.0,American,Male,41 years,Single,1870-05-11,"Cincinnati, Ohio, United States","Cincinnati, Ohio, United States",Blacksmith,
1,2,Eugene Joseph,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,American,Male,13 years,,1899-03-31,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",Scholar,
2,3,Rhoda,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,English,Female,39 years,Divorced,1873-01-25,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",,
3,4,Rossmore Edward,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,English,Male,16 years,Single,1896-02-21,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",Jeweller,
4,5,Kalle (Karen) Marie Kristiane,Abelseth,348125.0,3rd Class Passengers,Southampton,7.0,Norwegian,Female,16 years,Single,1895-09-14,"Sondmore, Norway","Los Angeles, California, United States",,


Study important characteristics of the data (number of rows, column names, first rows, ...)

In [9]:
df_bodies.shape, df_crew.shape, df_outcomes.shape, df_passengers.shape


((330, 4), (1122, 13), (2541, 4), (1419, 16))

In [10]:
df_crew.columns, df_bodies.columns, df_outcomes.columns, df_passengers.columns

(Index(['person_id', 'first_name', 'family_name', 'departure', 'nationality',
        'gender', 'age', 'marital_status', 'birth_date', 'home_location',
        'occupation', 'department', 'works_for'],
       dtype='object'),
 Index(['BODY_NO', 'GENDER', 'CLOTHING', 'ESTIMATED_AGE'], dtype='object'),
 Index(['person_id', 'lifeboat', 'body_no', 'death_date'], dtype='object'),
 Index(['person_id', 'first_name', 'family_name', 'ticket_number', 'pclass',
        'departure', 'price', 'nationality', 'gender', 'age', 'marital_status',
        'birth_date', 'home_location', 'destination', 'occupation',
        'works_for'],
       dtype='object'))

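Briefly describe: (1) How are the four tables related to each other?

#### Answer

The four tables hold different information about the same topic and share keys that allow them to be combined into one dataset: `person_id` links `df_passengers` and `df_crew` to `df_outcomes`, and `body_no` links `df_outcomes` to `df_bodies`. Passengers and crew also share most of their columns.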
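
The key relationships can be checked before merging. The following is a minimal sketch on toy frames (the values are made up and only mimic the column names of the real tables): it tests whether `person_id` is disjoint between passengers and crew, whether it is unique in the outcomes, and which body numbers are never referenced by an outcome.

```python
import pandas as pd

# Toy stand-ins for the real tables (illustrative values only)
passengers = pd.DataFrame({"person_id": [1, 2, 3]})
crew = pd.DataFrame({"person_id": [4, 5]})
outcomes = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "body_no": [None, 7.0, None, None, 9.0],
})
bodies = pd.DataFrame({"body_no": [7, 8, 9]})

# person_id should not be shared between passengers and crew
ids_disjoint = set(passengers["person_id"]).isdisjoint(crew["person_id"])

# each person should have at most one outcome row
outcome_ids_unique = outcomes["person_id"].is_unique

# body numbers that no outcome row refers to
known = set(outcomes["body_no"].dropna())
unmatched_bodies = sorted(set(bodies["body_no"]) - known)
```

On the real data the same checks would be run on `df_passengers`, `df_crew`, `df_outcomes` and `df_bodies`.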

Briefly describe: Do you observe any potential problems or noteworthy characteristics that need further inspection/handling?

#### Answer


* Standardize column types: numerical and string columns must be distinguished (for example, `age` is stored as text such as `41 years`)
* Split text columns into more readable, atomic parts
* Remove NaN values and merge the dataframes into one
* Parse and correct the date columns
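
The type issues listed above could be handled roughly as follows. This is a sketch on a toy frame, assuming that ages follow the `41 years` pattern and that dates are ISO-formatted strings as in the rows shown earlier; ages recorded in other units would need separate handling.

```python
import pandas as pd

# Toy frame mimicking two of the text columns (illustrative values only)
df = pd.DataFrame({
    "age": ["41 years", "13 years", None],
    "birth_date": ["1870-05-11", "1899-03-31", None],
})

# Pull the leading number out of strings such as "41 years"
df["age_years"] = pd.to_numeric(df["age"].str.extract(r"(\d+)")[0], errors="coerce")

# Parse date strings; missing or unparseable values become NaT
df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce")
```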

# Exercise 2

Your task is to create a single, combined dataset suitable for data analysis. When developing your strategy, you may think about the following questions: 
- Should some datasets be merged (column-wise)? 
- If so, on which keys and using which merge types?
- Should some datasets be concatenated (row-wise)?
- Should some datasets be reshaped (pivoted or melted)?

Implement your chosen strategy.

In [16]:
# Stack passengers and crew row-wise; ignore_index avoids duplicate index labels
onboard = pd.concat([df_passengers, df_crew], ignore_index=True)
onboard.shape

(2541, 17)

In [18]:
onboard.head()

Unnamed: 0,person_id,first_name,family_name,ticket_number,pclass,departure,price,nationality,gender,age,marital_status,birth_date,home_location,destination,occupation,works_for,department
0,1,Anthony,Abbing,5547.0,3rd Class Passengers,Southampton,7.0,American,Male,41 years,Single,1870-05-11,"Cincinnati, Ohio, United States","Cincinnati, Ohio, United States",Blacksmith,,
1,2,Eugene Joseph,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,American,Male,13 years,,1899-03-31,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",Scholar,,
2,3,Rhoda,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,English,Female,39 years,Divorced,1873-01-25,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",,,
3,4,Rossmore Edward,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,English,Male,16 years,Single,1896-02-21,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",Jeweller,,
4,5,Kalle (Karen) Marie Kristiane,Abelseth,348125.0,3rd Class Passengers,Southampton,7.0,Norwegian,Female,16 years,Single,1895-09-14,"Sondmore, Norway","Los Angeles, California, United States",,,


In [31]:
total_board_info = onboard.merge(df_outcomes, on='person_id', how='left')
total_board_info.head(), total_board_info.shape

(   person_id                     first_name family_name  ticket_number  \
 0          1                        Anthony      Abbing         5547.0   
 1          2                  Eugene Joseph      Abbott         2673.0   
 2          3                          Rhoda      Abbott         2673.0   
 3          4                Rossmore Edward      Abbott         2673.0   
 4          5  Kalle (Karen) Marie Kristiane    Abelseth       348125.0   
 
                  pclass    departure  price nationality  gender       age  \
 0  3rd Class Passengers  Southampton    7.0    American    Male  41 years   
 1  3rd Class Passengers  Southampton   20.0    American    Male  13 years   
 2  3rd Class Passengers  Southampton   20.0     English  Female  39 years   
 3  3rd Class Passengers  Southampton   20.0     English    Male  16 years   
 4  3rd Class Passengers  Southampton    7.0   Norwegian  Female  16 years   
 
   marital_status  birth_date                                 home_location  \

In [32]:
total_board_info

Unnamed: 0,person_id,first_name,family_name,ticket_number,pclass,departure,price,nationality,gender,age,marital_status,birth_date,home_location,destination,occupation,works_for,department,lifeboat,body_no,death_date
0,1,Anthony,Abbing,5547.0,3rd Class Passengers,Southampton,7.0,American,Male,41 years,Single,1870-05-11,"Cincinnati, Ohio, United States","Cincinnati, Ohio, United States",Blacksmith,,,,,1912-04-15
1,2,Eugene Joseph,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,American,Male,13 years,,1899-03-31,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",Scholar,,,,,1912-04-15
2,3,Rhoda,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,English,Female,39 years,Divorced,1873-01-25,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",,,,boat A,,1946-02-18
3,4,Rossmore Edward,Abbott,2673.0,3rd Class Passengers,Southampton,20.0,English,Male,16 years,Single,1896-02-21,"East Providence, Rhode Island, United States","East Providence, Rhode Island, United States",Jeweller,,,,190.0,1912-04-15
4,5,Kalle (Karen) Marie Kristiane,Abelseth,348125.0,3rd Class Passengers,Southampton,7.0,Norwegian,Female,16 years,Single,1895-09-14,"Sondmore, Norway","Los Angeles, California, United States",,,,boat 16,,1969-07-27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2536,2537,Harry,Yearsley,,,Southampton,,English,Male,40 years,Married,1871-12-03,"Southampton, Hampshire, England",,First class saloon steward,White Star Line,Victualling Crew,,,1939-11-18
2537,2538,Francis James,Young,,,Southampton,,English,Male,32 years,Married,1879-11-14,"Southampton, Hampshire, England",,Fireman,White Star Line,Engineering Crew,,,1912-04-15
2538,2539,Mario,Zanetti,,,Southampton,,Swiss,Male,20 years,Single,1892-01-09,"London, England",,Assistant Waiter,White Star Line,,,,1912-04-15
2539,2540,Leopoldo,Zarracchi,,,Southampton,,Italian,Male,24 years,Single,,"Southampton, Hampshire, England",,Wine Butler,White Star Line,,,,1912-04-15


In [42]:
total_board_info_merged = pd.merge(left=total_board_info, right=df_bodies, on='body_no', how='outer')


In [43]:
total_board_info_merged.shape

(2730, 23)
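
The 2730 rows exceed the 2541 people in `total_board_info`. A likely reason, not verified against the data here, is that an outer merge also keeps rows from `df_bodies` whose `body_no` matches no person. The sketch below uses toy frames with made-up values to show how `indicator=True` exposes such rows and how a left merge keeps exactly one row per person.

```python
import pandas as pd

# Toy stand-ins (illustrative values only)
people = pd.DataFrame({"person_id": [1, 2, 3], "body_no": [None, 7.0, None]})
bodies = pd.DataFrame({"body_no": [7.0, 8.0], "clothing": ["grey coat", "blue coat"]})

# indicator=True adds a _merge column that records where each row came from
outer = people.merge(bodies, on="body_no", how="outer", indicator=True)

# a left merge drops bodies that cannot be linked to a person
left = people.merge(bodies, on="body_no", how="left")

n_bodies_only = int((outer["_merge"] == "right_only").sum())
```

Whether unidentified bodies belong in the final dataset depends on the analysis. If each row should represent one person, a left merge may be the better fit.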

In [40]:
total_board_info_merged.isna().sum()

person_id            0
first_name           9
family_name          1
ticket_number     1128
pclass            1123
departure            1
price             1151
nationality         29
gender_x             2
age                 31
marital_status     647
birth_date         675
home_location      360
destination       1586
occupation         652
works_for         1407
department        1567
lifeboat          2006
body_no           2394
death_date         211
gender_y          2398
clothing          2405
estimated_age     2408
dtype: int64

# Exercise 3

Derive a boolean column that indicates whether a person (crew or passenger) survived the accident or not
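
One possible rule, shown on a toy frame with made-up rows: treat a person as not having survived if a body was recovered or if the death date is 1912-04-15, the day of the sinking. This rule is an assumption, not the only reasonable definition; `lifeboat` could be used instead or in addition.

```python
import pandas as pd

# Toy frame mimicking the outcome columns (illustrative values only)
df = pd.DataFrame({
    "lifeboat": [None, "boat A", None, "boat 16"],
    "body_no": [None, None, 190.0, None],
    "death_date": ["1912-04-15", "1946-02-18", "1912-04-15", None],
})
df["death_date"] = pd.to_datetime(df["death_date"])
sinking = pd.Timestamp("1912-04-15")

# Assumed rule: died if a body was recovered or the death date is the day of the sinking
died = df["body_no"].notna() | (df["death_date"] == sinking)
df["survived"] = ~died
```

Under this rule a row with a missing `death_date` and no `body_no` counts as a survivor, so rows with missing dates deserve a separate look.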

Derive a new column `pclass_no` with numeric values 1, 2, and 3 corresponding to the three passenger classes given in column `pclass`.
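
A sketch of one approach on a toy frame: extract the leading digit from the class label. Only `3rd Class Passengers` appears in the rows shown above; the labels for first and second class used here are assumed to follow the same pattern.

```python
import pandas as pd

# Toy frame; the 1st and 2nd class labels are assumed, None stands for a crew member
df = pd.DataFrame({
    "pclass": ["3rd Class Passengers", "1st Class Passengers", "2nd Class Passengers", None]
})

# Take the leading digit; rows without a class stay missing (nullable integer dtype)
digit = df["pclass"].str.extract(r"^(\d)")[0]
df["pclass_no"] = pd.to_numeric(digit, errors="coerce").astype("Int64")
```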

Calculate the age of a person at death and store it in a new column
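
A sketch on a toy frame, assuming `birth_date` and `death_date` are ISO-formatted strings. It computes completed years of life, and the column name `age_at_death` is chosen here for illustration.

```python
import pandas as pd

# Toy frame (illustrative values only)
df = pd.DataFrame({
    "birth_date": ["1870-05-11", "1873-01-25", None],
    "death_date": ["1912-04-15", "1946-02-18", "1912-04-15"],
})
birth = pd.to_datetime(df["birth_date"], errors="coerce")
death = pd.to_datetime(df["death_date"], errors="coerce")

# True where the birthday had not yet occurred in the year of death
before_birthday = (death.dt.month < birth.dt.month) | (
    (death.dt.month == birth.dt.month) & (death.dt.day < birth.dt.day)
)

# Completed years; rows with a missing date stay missing
years = death.dt.year - birth.dt.year - before_birthday.astype(int)
df["age_at_death"] = years.astype("Int64")
```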

Derive a new column `home_country` from the column `home_location`, extracting only the country name from the string
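
A sketch on a toy frame, assuming the country is always the last comma-separated part of `home_location`, as in the rows shown earlier.

```python
import pandas as pd

# Toy frame using location strings of the same shape as the real column
df = pd.DataFrame({
    "home_location": [
        "Cincinnati, Ohio, United States",
        "Sondmore, Norway",
        "London, England",
        None,
    ]
})

# Keep the part after the last comma and trim surrounding whitespace
df["home_country"] = df["home_location"].str.split(",").str[-1].str.strip()
```

Entries without a comma would be returned unchanged, so the distinct values of the new column should be inspected afterwards.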

Which further data cleaning/processing steps would be needed/desirable?