# General instructions
- Install the package `Pandas` in your course-specific virtual environment, if you have not done so already.
- Store the data file `titanic.csv` either in the same directory as the current Jupyter notebook, or in a subdirectory named `data`.
- Read in the data set `titanic.csv` as a Pandas DataFrame.
- Answer the questions below, and other questions that may arise in the process.

Note that the data set was downloaded from [Kaggle](https://www.kaggle.com/datasets/vinicius150987/titanic3). Only minimal preprocessing has been performed on the data set. Careful inspection shows that the likely raw source of the data is [Encyclopedia Titanica](https://www.encyclopedia-titanica.org/). Note that there is also a (less complete) variant of the data set available via the Python package `seaborn`.

# 1. Investigation of lifeboats on the Titanic

- Which were the 5 boats that saved the largest number of people?
- How many boats were there in total?
- Which boats saved the largest number of male passengers?
- Which boats saved the largest proportion of adult male passengers?
- What was the average number of people saved per boat?
- Which boats saved mainly people from the 1st (2nd, 3rd) class?
- What was the average fare paid per boat?
- Per boat: what was (1) the average fare, (2) the number of people with data on the paid fare, and (3) the number of people saved? (Note: please calculate this in a single query.)

# 2. Investigation of fares
- The fares are given in British pounds. One British pound at that time corresponds to 161 US dollars today, and 1 US dollar currently corresponds to 0.90 euros. Convert the original fares to present-day euros and store them in a new column.
- Are the fares provided in the data the prices paid by person, or the prices paid by ticket (potentially covering multiple people)? Carry out a detailed data analysis to answer this question.
- What are other interesting patterns related to the fares paid? Freely explore!

# 3. Questions collected in class
- What are other interesting questions that can be answered with the data? Freely explore!

## Aggregate statistics: mean, ...
- Mean age
- Average size of the families travelling

## Number of / counts of specific attributes
- How many people survived, or died?
- How many people were travelling in different passenger classes?
- How many different embarkation ports are there?
- Where do people come from (`home.dest`)?
- Of the people who died, how many were not found?

## Data cleaning
- Do we have duplicate values?
- Checking the data types
- Refactor data types: e.g. string column sex into boolean (0/1)

## Relationship between two (or more) variables: causality?
- Did more women survive than men (percentage-wise)?
- Did more children survive than adults (percentage-wise)?
- What is the likelihood of survival in the first/second/... class?


## Ranking / Sorting
- Which cabins have the fewest/most survivors?
- Which boat saved the most people?


In [18]:
import pandas as pd

df = pd.read_csv('titanic.csv')
df

Unnamed: 0,name,sex,age,embarked,home.dest,ticket,pclass,cabin,fare,sibsp,parch,survived,boat,body
0,"Allen, Miss. Elisabeth Walton",female,29.0000,Southampton,"St Louis, MO",24160,1,B5,211.34,0,0,1,2,
1,"Allison, Master. Hudson Trevor",male,0.9167,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,1,11,
2,"Allison, Miss. Helen Loraine",female,2.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,
3,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,135.0
4,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,"Zabour, Miss. Hileni",female,14.5000,Cherbourg,,2665,3,,14.45,1,0,0,,328.0
1305,"Zabour, Miss. Thamine",female,,Cherbourg,,2665,3,,14.45,1,0,0,,
1306,"Zakarian, Mr. Mapriededer",male,26.5000,Cherbourg,,2656,3,,7.22,0,0,0,,304.0
1307,"Zakarian, Mr. Ortin",male,27.0000,Cherbourg,,2670,3,,7.22,0,0,0,,


mean age

In [19]:
df.age.mean()

np.float64(29.8811345124283)

average size of the families travelling

In [20]:
df[['sibsp','parch']].mean().sum() + 1

np.float64(1.8838808250572956)

How many people survived, or died?

In [21]:
df.value_counts('survived')

survived
0    809
1    500
Name: count, dtype: int64

How many people were travelling in different passenger classes?


In [22]:
df.value_counts('pclass')

pclass
3    709
1    323
2    277
Name: count, dtype: int64

Are there more embarkation ports?

In [23]:
df.embarked.unique()

array(['Southampton', 'Cherbourg', nan, 'Queenstown'], dtype=object)

Of the people who died, how many were not found?

In [24]:
# body is NaN when no body was recovered, so isna() selects the people who were not found
df[(df.survived == 0) & df.body.isna()].shape[0]

688

Did more women survive than men (percentage-wise)?

In [25]:
# Share of each sex among the survivors (this is not the survival rate within each sex)
df[df.survived==1].value_counts(['sex'], normalize=True, sort=False)

sex   
female    0.678
male      0.322
Name: proportion, dtype: float64
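
The cell above reports how the survivors split by sex. The percentage-wise question is more directly answered by the survival rate within each sex. The following is a minimal sketch of the difference, using a small hypothetical DataFrame `toy` (made-up values, not taken from `titanic.csv`) and assuming `survived` is coded as 0/1:

```python
import pandas as pd

# Hypothetical toy data, NOT taken from titanic.csv
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male", "male"],
    "survived": [1,        1,        0,        1,      0,      0,      0,      0],
})

# Share of each sex among the survivors (same idea as the cell above)
share_among_survivors = toy.loc[toy.survived == 1, "sex"].value_counts(normalize=True)

# Survival rate within each sex: the mean of a 0/1 column is the fraction of ones
survival_rate = toy.groupby("sex")["survived"].mean()

print(share_among_survivors)
print(survival_rate)
```

The two quantities differ whenever the groups have different sizes, so the second one is the safer basis for comparing women and men.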

What is the likelihood of survival in the first/second/... class?

In [26]:
# Share of each class among the survivors (this is not the survival rate within each class)
df[df.survived==1].value_counts("pclass",normalize=True, sort=False)

pclass
1    0.400
2    0.238
3    0.362
Name: proportion, dtype: float64
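
The proportions above describe the composition of the survivors. The likelihood of survival within a class is a different quantity. One possible way to obtain it is a row-normalised cross-tabulation, sketched here on a hypothetical DataFrame `toy` (made-up values, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data, NOT taken from titanic.csv
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# normalize="index" makes every row (one row per class) sum to 1,
# so the column labelled 1 holds the survival rate of that class
rates = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")
print(rates)
```

Applied to the real data, the same call would use `df["pclass"]` and `df["survived"]`.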

# 1. Investigation of lifeboats on the Titanic

- Which were the 5 boats that saved the largest number of people?
- How many boats were there in total?
- Which boats saved the largest number of male passengers?
- Which boats saved the largest proportion of adult male passengers?
- What was the average number of people saved per boat?
- Which boats saved mainly people from the 1st (2nd, 3rd) class?
- What was the average fare paid per boat?
- Per boat: what was (1) the average fare, (2) the number of people with data on the paid fare, and (3) the number of people saved? (Note: please calculate this in a single query.)

In [27]:
df = pd.read_csv('titanic.csv')
df

Unnamed: 0,name,sex,age,embarked,home.dest,ticket,pclass,cabin,fare,sibsp,parch,survived,boat,body
0,"Allen, Miss. Elisabeth Walton",female,29.0000,Southampton,"St Louis, MO",24160,1,B5,211.34,0,0,1,2,
1,"Allison, Master. Hudson Trevor",male,0.9167,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,1,11,
2,"Allison, Miss. Helen Loraine",female,2.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,
3,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,135.0
4,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,"Zabour, Miss. Hileni",female,14.5000,Cherbourg,,2665,3,,14.45,1,0,0,,328.0
1305,"Zabour, Miss. Thamine",female,,Cherbourg,,2665,3,,14.45,1,0,0,,
1306,"Zakarian, Mr. Mapriededer",male,26.5000,Cherbourg,,2656,3,,7.22,0,0,0,,304.0
1307,"Zakarian, Mr. Ortin",male,27.0000,Cherbourg,,2670,3,,7.22,0,0,0,,


Which were the 5 boats that saved the largest number of people?

In [28]:
df[df.survived==1].groupby('boat').size().nlargest(5)

boat
13    39
15    37
C     37
14    32
4     31
dtype: int64

How many boats were there in total?

In [29]:
df.boat.nunique()

27
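
Note that `nunique()` counts distinct strings. The `boat` column in this data set also contains combined entries such as `13 15` or `C D`, and each of those is counted as its own boat. A hedged sketch of one way to count individual boat labels instead, using a hypothetical Series that only imitates that format:

```python
import pandas as pd

# Hypothetical toy column imitating the format of the boat column
# (single boats such as "13" and combined entries such as "13 15" or "C D")
boat = pd.Series(["13", "15", "13 15", "C", "C D", None, "D"], name="boat")

# Split each entry on whitespace, give every boat label its own row, drop missing values
single_boats = boat.dropna().str.split().explode()

print(boat.nunique())          # combined entries count as separate boats
print(single_boats.nunique())  # every individual boat label is counted once
```

Whether a combined entry means that a person changed boats or that the source was uncertain is not stated in the data, so the result should be read with that caveat.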

Which boat saved the largest number of male passengers?

In [30]:
df[(df.sex=='male') & (df.survived == 1)].groupby('boat').size().nlargest(1)

boat
15    24
dtype: int64
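
The related question about the proportion of adult men per boat can be approached by averaging a boolean flag within each boat. This is only a sketch on a hypothetical DataFrame `toy`. The threshold `ADULT_AGE = 18` is an assumption, since the assignment does not define "adult":

```python
import pandas as pd

# Hypothetical toy data, NOT taken from titanic.csv
toy = pd.DataFrame({
    "boat":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "sex":      ["male", "male", "female", "male", "female", "female", "male", "male"],
    "age":      [40.0, 35.0, 28.0, 10.0, 50.0, 22.0, 8.0, 30.0],
    "survived": [1, 1, 1, 1, 1, 1, 1, 1],
})

ADULT_AGE = 18  # assumption: the assignment does not define "adult"

saved = toy[toy.survived == 1]
# Boolean flag per survivor; a missing age compares as False and so counts as "not adult"
is_adult_male = (saved.sex == "male") & (saved.age >= ADULT_AGE)

# Mean of a boolean flag per boat = proportion of adult men among that boat's survivors
proportion = is_adult_male.groupby(saved.boat).mean().sort_values(ascending=False)
print(proportion)
```

Because passengers with a missing age fall into the "not adult" group here, the proportions are lower bounds when ages are incomplete.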

What was the average fare paid by boat?

In [31]:
df.groupby('boat').fare.mean()

boat
1           46.980000
10          62.155172
11          37.648800
12          19.688947
13          16.143590
13 15        7.510000
13 15 B      7.750000
14          32.688788
15          11.716216
15 16        7.750000
16          13.293478
2           92.127692
3          147.133846
4          127.922258
5           60.287407
5 7         52.000000
5 9         26.550000
6           83.375000
7           52.948261
8          100.560000
8 10        26.550000
9           26.464400
A           24.168182
B           24.306667
C           19.375263
C D         20.520000
D           35.993000
Name: fare, dtype: float64
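
The last question of this section asks for the average fare, the number of people with fare data, and the number of people saved, all in a single query. One way to do this is named aggregation with `groupby(...).agg(...)`. The sketch below uses a hypothetical DataFrame `toy`, and the output column names (`mean_fare`, `n_with_fare`, `n_saved`) are illustrative choices. It assumes `survived` is coded as 0/1:

```python
import pandas as pd

# Hypothetical toy data, NOT taken from titanic.csv
toy = pd.DataFrame({
    "boat":     ["1", "1", "1", "2", "2", None],
    "fare":     [10.0, 30.0, None, 50.0, 70.0, 8.0],
    "survived": [1, 1, 1, 1, 1, 0],
})

per_boat = toy.groupby("boat").agg(
    mean_fare=("fare", "mean"),      # (1) average fare, missing fares are skipped
    n_with_fare=("fare", "count"),   # (2) number of non-missing fares
    n_saved=("survived", "sum"),     # (3) number of survivors (sum of the 0/1 column)
)
print(per_boat)
```

Rows without a boat are dropped by `groupby` by default, which matches the intent of a per-boat summary.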

# 2. Investigation of fares
- The fares are given in British pounds. One British pound at that time corresponds to 161 US dollars today, and 1 US dollar currently corresponds to 0.90 euros. Convert the original fares to present-day euros and store them in a new column.
- Are the fares provided in the data the prices paid by person, or the prices paid by ticket (potentially covering multiple people)? Carry out a detailed data analysis to answer this question.
- What are other interesting patterns related to the fares paid? Freely explore!

In [32]:
df = pd.read_csv('titanic.csv')
df

Unnamed: 0,name,sex,age,embarked,home.dest,ticket,pclass,cabin,fare,sibsp,parch,survived,boat,body
0,"Allen, Miss. Elisabeth Walton",female,29.0000,Southampton,"St Louis, MO",24160,1,B5,211.34,0,0,1,2,
1,"Allison, Master. Hudson Trevor",male,0.9167,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,1,11,
2,"Allison, Miss. Helen Loraine",female,2.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,
3,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,135.0
4,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,Southampton,"Montreal, PQ / Chesterville, ON",113781,1,C22 C26,151.55,1,2,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,"Zabour, Miss. Hileni",female,14.5000,Cherbourg,,2665,3,,14.45,1,0,0,,328.0
1305,"Zabour, Miss. Thamine",female,,Cherbourg,,2665,3,,14.45,1,0,0,,
1306,"Zakarian, Mr. Mapriededer",male,26.5000,Cherbourg,,2656,3,,7.22,0,0,0,,304.0
1307,"Zakarian, Mr. Ortin",male,27.0000,Cherbourg,,2670,3,,7.22,0,0,0,,


In [33]:
# 1 GBP (1912) = 161 USD today; 1 USD = 0.90 EUR today
df['fare_eur'] = df.fare * 161 * 0.90
df[['fare_eur']]

Unnamed: 0,fare_eur
0,30623.166
1,21959.595
2,21959.595
3,21959.595
4,21959.595
...,...
1304,2093.805
1305,2093.805
1306,1046.178
1307,1046.178


- Are there any duplicates in the data?

In [34]:
# Check whether every passenger name occurs only once
no_duplicate_names = df['name'].is_unique

print(f"no_duplicate_names {no_duplicate_names}")

# Show every row whose name occurs more than once (keep=False marks all occurrences)
df[df.duplicated('name', keep=False)]

no_duplicate_names False


Unnamed: 0,name,sex,age,embarked,home.dest,ticket,pclass,cabin,fare,sibsp,parch,survived,boat,body,fare_eur
725,"Connolly, Miss. Kate",female,22.0,Queenstown,Ireland,370373,3,,7.75,0,0,1,13.0,,1122.975
726,"Connolly, Miss. Kate",female,30.0,Queenstown,Ireland,330972,3,,7.63,0,0,0,,,1105.587
924,"Kelly, Mr. James",male,34.5,Queenstown,,330911,3,,7.83,0,0,0,,70.0,1134.567
925,"Kelly, Mr. James",male,44.0,Southampton,,363592,3,,8.05,0,0,0,,,1166.445
