# General instructions
- Install the package `pandas` in your course-specific virtual environment, if you have not done so already.
- Store the data file `titanic.csv` either in the same directory as the current Jupyter notebook, or in a subdirectory named `data`.
- Read in the data set `titanic.csv` as a Pandas DataFrame.
- Answer the questions below, and other questions that may arise in the process.
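The two allowed file locations can be handled with a small lookup. The following is only a sketch: `find_data_file` and `load_titanic` are hypothetical helper names (not part of the assignment), and it assumes the notebook's working directory is the notebook's own folder.

```python
from pathlib import Path

import pandas as pd


def find_data_file(filename, base_dir="."):
    """Return the first existing path among <base_dir>/<filename> and <base_dir>/data/<filename>."""
    base = Path(base_dir)
    for candidate in (base / filename, base / "data" / filename):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{filename} not found in {base} or {base / 'data'}")


def load_titanic(base_dir="."):
    """Hypothetical convenience wrapper: locate titanic.csv and read it with pandas."""
    return pd.read_csv(find_data_file("titanic.csv", base_dir))


# Usage inside the notebook (left commented out so the sketch has no side effects):
# df = load_titanic()
```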

The data set comes from [Encyclopedia Titanica](https://www.encyclopedia-titanica.org/) and has gone through several preprocessing steps. A less complete variant of the data set is also available via the Python package `seaborn`.

The data may contain occasional problems or errors (implausible or wrong values, ...). If you find an error or a strange anomaly, please let me know so that I can further curate the dataset for the future.
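A few simple consistency checks can help to spot such anomalies. The sketch below runs on invented toy rows that merely reuse column names from the data set. The two rules are assumptions about what counts as suspicious, and a flagged row is only a candidate for a closer look, not a proven error.

```python
import pandas as pd

# Toy rows (invented for illustration); column names follow the data set
toy = pd.DataFrame({
    "family_name": ["A", "B", "C", "D"],
    "age": [20.0, 34.0, 45.0, 30.0],
    "age_at_death": [20.0, 61.0, 40.0, None],
    "survived": [False, True, False, True],
    "lifeboat": [None, "boat 10", "boat 4", None],
})

# Check 1: the age at death should never be below the age at the time of the voyage.
# Rows with a missing age_at_death compare as False and are therefore not flagged.
died_too_young = toy[toy["age_at_death"] < toy["age"]]

# Check 2: a lifeboat is recorded although the person is marked as not survived.
boat_but_not_survived = toy[toy["lifeboat"].notna() & toy["survived"].eq(False)]

print(died_too_young["family_name"].tolist())
print(boat_but_not_survived["family_name"].tolist())
```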

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('titanic.csv')
df.head()

Unnamed: 0,first_name,family_name,type,ticket_number,pclass,departure,price,survived,lifeboat,body_no,...,gender,age,marital_status,number_relatives_onboard,home_location,destination,age_at_death,occupation,department,works_for
0,Luka,Orešković,Passenger,315094.0,3rd Class Passengers,Southampton,8.0,False,,,...,Male,20.0,Married,3,"Konjsko Brdo, Croatia [Austria-Hungary]","Chicago, Illinois, United States",20.0,Farmer,,
1,Joseph Francis,Akerman,Crew,,,Southampton,,False,,,...,Male,37.0,Married,1,"Southampton, Hampshire, England",,37.0,Assistant Pantryman Steward,Victualling Crew,White Star Line
2,Mary Elizabeth,Davison,Passenger,386525.0,3rd Class Passengers,Southampton,16.0,True,boat 10,,...,Female,34.0,Married,2,"Liverpool, Lancashire, England","Bedford, Ohio, United States",61.0,,,
3,Ernest Edward Samuel,Freeman,Crew,,,Belfast,,False,,,...,Male,45.0,Married,1,"Southampton, Hampshire, England",,,Deck Steward (1st Class),Victualling Crew,White Star Line
4,George Alfred,Levett,Crew,,,Belfast,,False,,,...,Male,25.0,Married,2,"Southampton, Hampshire, England",,25.0,Assistant Pantryman Steward (1st Class),Victualling Crew,White Star Line


# 1. Traveler type

How many crew members and how many passengers are recorded in the dataset?

In [4]:
# Number of crew members and passengers
crew_count = df[df["type"] == "Crew"].shape[0]
passenger_count = df[df["type"] == "Passenger"].shape[0]

crew_count, passenger_count


(1122, 1352)
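Both counts can also be obtained in one call with `value_counts`. A minimal sketch on an invented toy column (the values are not taken from the data set):

```python
import pandas as pd

# Toy column (invented for illustration)
toy = pd.DataFrame({"type": ["Crew", "Passenger", "Passenger", "Crew", "Passenger"]})

# One row per distinct value, sorted by frequency
type_counts = toy["type"].value_counts()
print(type_counts)
```

Compared with two separate filters, this also reveals any unexpected third category in the column.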

Sort the dataset permanently by traveler type (crew vs. passenger) and last name.

In [8]:
# sort_values returns a sorted copy; df itself keeps its original row order
sorted_df = df.sort_values(by=['type', 'family_name'])
sorted_df

Unnamed: 0,first_name,family_name,type,ticket_number,pclass,departure,price,survived,lifeboat,body_no,...,gender,age,marital_status,number_relatives_onboard,home_location,destination,age_at_death,occupation,department,works_for
434,Ernest Owen,Abbott,Crew,,,Southampton,,False,,,...,Male,21.0,Single,1,"Southampton, Hampshire, England",,21.0,Lounge Pantry Steward,Victualling Crew,White Star Line
221,William Thomas,Abrams,Crew,,,Southampton,,False,,,...,Male,34.0,Married,1,"Southampton, Hampshire, England",,,Fireman,Engineering Crew,White Star Line
1132,Robert John,Adams,Crew,,,Southampton,,False,,,...,Male,26.0,Single,0,"Southampton, Hampshire, England",,26.0,Fireman,Engineering Crew,White Star Line
2428,Percy Snowden,Ahier,Crew,,,Southampton,,False,,,...,Male,20.0,Single,0,"Southampton, Hampshire, England",,20.0,Saloon Steward,Victualling Crew,White Star Line
1,Joseph Francis,Akerman,Crew,,,Southampton,,False,,,...,Male,37.0,Married,1,"Southampton, Hampshire, England",,37.0,Assistant Pantryman Steward,Victualling Crew,White Star Line
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,Mubārik Sulaymān Abī Āsī,Ḥannā,Passenger,2663.0,3rd Class Passengers,Cherbourg,7.0,True,boat 15,,...,Male,27.0,Single,0,"Hardīn, Batroun, Syria","Wilkes Barre, Pennsylvania, United States",66.0,,,
505,Mansūr,Ḥannā Al-Hāj,Passenger,2693.0,3rd Class Passengers,Cherbourg,7.0,False,,,...,Male,35.0,Married,1,"Kafr Mishki, Syria","Ottawa, Ontario, Canada",,,,
659,Ṭannūs Ḥannā Mu'awwad,Ṭannūs,Passenger,2684.0,3rd Class Passengers,Cherbourg,7.0,False,,,...,Male,16.0,,0,,"Columbus, Ohio, United States",,Scholar,,
946,Ḥannā,Ṭannūs Mu'awwad,Passenger,2681.0,3rd Class Passengers,Cherbourg,6.0,False,,,...,Male,34.0,,4,,"Columbus, Ohio, United States",34.0,Dealer,,
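Since `sort_values` returns a new DataFrame, a permanent sort requires reassigning the result (or passing `inplace=True`). A minimal sketch on invented toy rows, assuming that "permanently" means the original variable should hold the sorted data:

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "type": ["Passenger", "Crew", "Passenger", "Crew", "Passenger"],
    "family_name": ["Ḥannā", "Freeman", "Davison", "Akerman", "Abbott"],
})

# Without reassignment, the original frame stays unsorted
sorted_copy = toy.sort_values(by=["type", "family_name"])
order_before = toy["family_name"].tolist()

# Reassigning makes the sort permanent; reset_index gives a fresh 0..n-1 index
toy = toy.sort_values(by=["type", "family_name"]).reset_index(drop=True)
order_after = toy["family_name"].tolist()

print(order_before)
print(order_after)
```

The default string comparison is by Unicode code point, which is why names that begin with letters such as Ḥ or Ṭ appear after all names that begin with plain A to Z.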


How many crew members died?

In [10]:
crew_deaths = df[(df["type"] == "Crew") & (df["survived"] == False)].shape[0]
crew_deaths

677

# 2. Lifeboats

Which were the 5 boats that saved the largest number of people?

In [None]:
# Count the number of people saved by each lifeboat
lifeboat_counts = df["lifeboat"].value_counts()

# Get the top 5 lifeboats
top_5_lifeboats = lifeboat_counts.head(5)

print("Top 5 lifeboats that saved the most people:\n", top_5_lifeboats)

Top 5 lifeboats that saved the most people:
 lifeboat
boat 13    46
boat 11    44
boat 15    38
boat 14    35
boat 5     34
Name: count, dtype: int64


How many lifeboats were there in total?

In [12]:
total_lifeboats = df["lifeboat"].nunique()
total_lifeboats

20

Which lifeboats saved the largest number of male passengers?

In [26]:
# Note: type is not restricted here, so surviving male crew are counted along with passengers
df_male_survived = df[(df.gender == 'Male') & (df.survived == True)]

male_survived_lifeboat_counts = df_male_survived['lifeboat'].value_counts()

male_survived_lifeboat_counts.head(1)

lifeboat
boat 15    32
Name: count, dtype: int64

In [13]:
# Filter for males who survived (crew included, since type is not restricted) and group by lifeboat
male_passengers_by_lifeboat = df[(df["gender"] == "Male") & (df["survived"] == True)].groupby("lifeboat").size()

# Sort the results in descending order
largest_male_saving_lifeboats = male_passengers_by_lifeboat.sort_values(ascending=False)

# Display the lifeboats that saved the largest number of male passengers
largest_male_saving_lifeboats

lifeboat
boat 15    32
boat 13    30
boat B     23
boat 9     20
boat 5     18
boat 3     18
boat 7     16
boat 11    16
boat 14    15
boat 4     12
boat A     11
boat 1     10
boat D     10
boat 10     8
boat C      8
boat 2      7
boat 16     5
boat 6      4
boat 12     3
boat 8      3
dtype: int64
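The question asks about passengers, whereas a filter on `gender` and `survived` alone also counts crew. The sketch below uses invented toy rows and assumes that "passengers" means rows with `type == "Passenger"`. It shows how the extra condition can change the counts.

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "type": ["Passenger", "Crew", "Passenger", "Crew", "Passenger"],
    "gender": ["Male", "Male", "Male", "Male", "Female"],
    "survived": [True, True, True, True, True],
    "lifeboat": ["boat 15", "boat 15", "boat 13", "boat 15", "boat 13"],
})

is_male_survivor = (toy["gender"] == "Male") & toy["survived"]

# All rescued men, crew included
all_men = toy[is_male_survivor]["lifeboat"].value_counts()

# Rescued male passengers only
passengers_only = toy[is_male_survivor & (toy["type"] == "Passenger")]["lifeboat"].value_counts()

print(all_men)
print(passengers_only)
```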

Which lifeboats saved the largest proportion of male adult passengers?

In [38]:
df_male_survived = df[(df.gender == 'Male') & (df.survived == True) & (df.age >= 18)]

male_survived_lifeboat_counts = df_male_survived['lifeboat'].value_counts()

df_survived = df[(df.survived == True)]

survived_lifeboat_counts = df_survived['lifeboat'].value_counts()

proportion_male_adults = male_survived_lifeboat_counts / survived_lifeboat_counts

proportion_male_adults.sort_values(ascending=False).head(1)


lifeboat
boat A    0.916667
Name: count, dtype: float64

In [None]:
# Filter for adult males who survived (type is not restricted, so crew are included)
male_adults = df[(df["gender"] == "Male") & (df["age"] >= 18) & (df["survived"] == True)]

# Group by lifeboat and compute each boat's share of all rescued adult males.
# Note: the denominator is the total number of rescued adult males,
# not the number of people in the boat.
male_adults_by_lifeboat = male_adults.groupby("lifeboat").size()
proportion_male_adults = (male_adults_by_lifeboat / male_adults.shape[0]).sort_values(ascending=False)

# Display each lifeboat's share of all rescued adult males
proportion_male_adults

lifeboat
boat 15    0.055556
boat 13    0.050179
boat B     0.037634
boat 9     0.035842
boat 5     0.030466
boat 3     0.030466
boat 7     0.028674
boat 11    0.023297
boat A     0.019713
boat D     0.017921
boat 1     0.017921
boat 4     0.017921
boat 14    0.017921
boat C     0.012545
boat 10    0.010753
boat 2     0.008961
boat 6     0.007168
boat 16    0.007168
boat 12    0.005376
boat 8     0.005376
dtype: float64
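"Proportion" can be read in two ways, and the two readings can rank the boats differently. The sketch below contrasts them on invented toy rows. Both definitions are interpretations of the question, not a statement of which one is intended.

```python
import pandas as pd

# Toy rows (invented for illustration): 2 people in boat A, 6 people in boat 15
toy = pd.DataFrame({
    "lifeboat": ["boat A"] * 2 + ["boat 15"] * 6,
    "gender": ["Male", "Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "age": [30.0, 40.0, 25.0, 35.0, 45.0, 28.0, 5.0, 50.0],
    "survived": [True] * 8,
})

survivors = toy[toy["survived"]]
# Rows with a missing age compare as False here and would count as non-adult
is_adult_male = (survivors["gender"] == "Male") & (survivors["age"] >= 18)

# Reading 1: within each boat, the fraction of occupants who are adult men
share_within_boat = is_adult_male.astype(int).groupby(survivors["lifeboat"]).mean()

# Reading 2: of all rescued adult men with a recorded lifeboat, the fraction in each boat
share_of_all_adult_men = survivors[is_adult_male]["lifeboat"].value_counts(normalize=True)

print(share_within_boat.idxmax())       # boat with the highest share of adult men on board
print(share_of_all_adult_men.idxmax())  # boat that carried the most adult men overall
```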

What was the average number of people saved on lifeboats?

In [49]:
n_lifeboat = df.lifeboat.nunique()
# n_survivor = df.survived.sum()
n_survivor = df.lifeboat.value_counts().sum()

n_survivor / n_lifeboat

np.float64(25.8)
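The same average can be read off the per-boat counts directly. A minimal sketch on an invented toy column; it relies on the fact that `value_counts` and `nunique` both ignore missing lifeboat entries by default.

```python
import pandas as pd

# Toy column (invented for illustration); None marks people without a recorded lifeboat
toy = pd.DataFrame({"lifeboat": ["boat 1", "boat 1", "boat 2", None, "boat 2", "boat 2", None]})

# Number of people per boat; missing values are dropped
per_boat = toy["lifeboat"].value_counts()

# Mean of the per-boat counts
average_saved = per_boat.mean()

# Equivalent: people with a recorded boat divided by the number of distinct boats
average_saved_alt = toy["lifeboat"].count() / toy["lifeboat"].nunique()

print(average_saved, average_saved_alt)
```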

Which boats saved the most people from the 1st (2nd, 3rd) class?

In [60]:
df.pclass.head(20)

0     3rd Class Passengers
1                      NaN
2     3rd Class Passengers
3                      NaN
4                      NaN
5     3rd Class Passengers
6                      NaN
7     1st Class Passengers
8                      NaN
9     3rd Class Passengers
10                     NaN
11                     NaN
12    3rd Class Passengers
13                     NaN
14    3rd Class Passengers
15                     NaN
16    3rd Class Passengers
17                     NaN
18                     NaN
19                     NaN
Name: pclass, dtype: object

In [62]:
df_counts_p1 = df[df.pclass == "1st Class Passengers"].lifeboat.value_counts()
df_counts_p2 = df[df.pclass == "2nd Class Passengers"].lifeboat.value_counts()
df_counts_p3 = df[df.pclass == "3rd Class Passengers"].lifeboat.value_counts()

df_counts_p1.head(1), df_counts_p2.head(1), df_counts_p3.head(1)

(lifeboat
 boat 5    27
 Name: count, dtype: int64,
 lifeboat
 boat 14    18
 Name: count, dtype: int64,
 lifeboat
 boat 13    19
 Name: count, dtype: int64)

In [None]:
# Filter for each class and group by lifeboat
first_class_lifeboats = df[df["pclass"] == "1st Class Passengers"].groupby("lifeboat").size()
second_class_lifeboats = df[df["pclass"] == "2nd Class Passengers"].groupby("lifeboat").size()
third_class_lifeboats = df[df["pclass"] == "3rd Class Passengers"].groupby("lifeboat").size()

# Display the results
first_class_lifeboats, second_class_lifeboats, third_class_lifeboats
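The three separate filters can also be combined into one table of lifeboat against class. The sketch below uses `pd.crosstab` on invented toy rows; picking the top boat per class with `idxmax` is one possible reading of the question.

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "pclass": ["1st Class Passengers", "1st Class Passengers", "2nd Class Passengers",
               "3rd Class Passengers", "3rd Class Passengers", "3rd Class Passengers"],
    "lifeboat": ["boat 5", "boat 5", "boat 14", "boat 13", "boat 13", "boat 5"],
})

# Rows: lifeboats, columns: classes, cells: number of people
boat_by_class = pd.crosstab(toy["lifeboat"], toy["pclass"])

# For each class (column), the lifeboat (row label) with the highest count
top_boat_per_class = boat_by_class.idxmax()

print(boat_by_class)
print(top_boat_per_class)
```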

What was the average age of rescued people, by lifeboat?

In [63]:
df.groupby('lifeboat').age.mean()

lifeboat
boat 1     34.416667
boat 10    28.433333
boat 11    31.048780
boat 12    28.250000
boat 13    26.311111
boat 14    27.971429
boat 15    30.263158
boat 16    27.800000
boat 2     30.111111
boat 3     36.516129
boat 4     34.852941
boat 5     36.058824
boat 6     34.090909
boat 7     31.040000
boat 8     37.200000
boat 9     31.454545
boat A     32.833333
boat B     26.869565
boat C     29.428571
boat D     33.692308
Name: age, dtype: float64

In [None]:
# Filter for rescued people
rescued_people = df[df["survived"] == True]

# Group by lifeboat and calculate the average age
average_age_by_lifeboat = rescued_people.groupby("lifeboat")["age"].mean()

# Display the result
average_age_by_lifeboat

What were (1) the average age and (2) the number of people saved, by lifeboat? (Note: Calculate both in a single query.)

In [71]:
df.groupby('lifeboat').agg(
    age_mean=('age', 'mean'),
    n_people=('lifeboat', 'size')
)

Unnamed: 0_level_0,age_mean,n_people
lifeboat,Unnamed: 1_level_1,Unnamed: 2_level_1
boat 1,34.416667,12
boat 10,28.433333,31
boat 11,31.04878,44
boat 12,28.25,16
boat 13,26.311111,46
boat 14,27.971429,35
boat 15,30.263158,38
boat 16,27.8,10
boat 2,30.111111,18
boat 3,36.516129,31


In [16]:
# Filter for rescued people
rescued_people = df[df["survived"] == True]

# Group by lifeboat and calculate average age and count of people saved
lifeboat_stats = rescued_people.groupby("lifeboat").agg(
    average_age=("age", "mean"),
    people_saved=("lifeboat", "size")
)

# Display the result
lifeboat_stats

Unnamed: 0_level_0,average_age,people_saved
lifeboat,Unnamed: 1_level_1,Unnamed: 2_level_1
boat 1,34.416667,12
boat 10,28.433333,31
boat 11,31.04878,44
boat 12,28.25,16
boat 13,26.311111,46
boat 14,27.971429,35
boat 15,30.263158,38
boat 16,27.8,10
boat 2,30.111111,18
boat 3,36.516129,31


# 2. Investigation of ticket prices
The prices are given in British pounds. One British pound at that time corresponds to 161 US dollars today, and 1 US dollar currently corresponds to 0.90 euros. Convert the original fares to present-day euros and store the result in a new column.

In [72]:
df['current_price'] = df.price * 161 * 0.9

In [17]:
# Conversion rates
pound_to_usd = 161
usd_to_euro = 0.90

# Convert the prices to current day Euros
df['price_in_euros'] = df['price'] * pound_to_usd * usd_to_euro
df[['price', 'price_in_euros']].head()

Unnamed: 0,price,price_in_euros
0,8.0,1159.2
1,,
2,16.0,2318.4
3,,
4,,


Are the prices provided in the data the prices paid per person, or the prices paid per ticket (potentially covering multiple people)? Carry out a detailed data analysis to answer this question.

In [None]:
df.ticket_number.nunique()

939

In [18]:
# Group by ticket number and calculate the total price and number of people per ticket
ticket_analysis = df.groupby("ticket_number").agg(
    total_price=("price", "sum"),
    people_per_ticket=("ticket_number", "size")
)

# Display the result
ticket_analysis.head()

Unnamed: 0_level_0,total_price,people_per_ticket
ticket_number,Unnamed: 1_level_1,Unnamed: 2_level_1
2.0,42.0,2
3.0,20.0,2
7.0,0.0,1
8.0,1.0,1
47.0,4.0,2


What are other interesting patterns related to the prices paid? Freely explore!
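As one possible starting point, prices can be summarized by class and port of departure. The sketch below uses invented toy rows with column names from the data set. The choice of the median as the summary statistic is an assumption.

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({
    "pclass": ["1st Class Passengers", "1st Class Passengers",
               "3rd Class Passengers", "3rd Class Passengers", "3rd Class Passengers"],
    "departure": ["Southampton", "Cherbourg", "Southampton", "Southampton", "Cherbourg"],
    "price": [80.0, 100.0, 8.0, 10.0, 7.0],
})

# Summary of prices per class
price_by_class = toy.groupby("pclass")["price"].agg(["count", "median", "max"])

# Median price per class and port of departure
price_by_class_and_port = toy.pivot_table(
    index="pclass", columns="departure", values="price", aggfunc="median"
)

print(price_by_class)
print(price_by_class_and_port)
```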

# 3. Freely Explore