## Part I: Exploratory analysis (3 points)

Read the dataset correctly using pandas. The dataset is in the file `train.csv`.

After reading it (or [when reading it](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)), **choose a proper column to be used as the index of the dataframe.**

The dataset contains the following columns:

| Variable |                 Definition                 |                       Key                      |
|:--------:|:------------------------------------------:|:----------------------------------------------:|
| PassengerId | Passenger ID | |
| Survived | Survival                                   | 0 = No, 1 = Yes                                |
| Pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Name     | Name of Passenger                          |                                                |
| Sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| SibSp    | # of siblings / spouses aboard the Titanic |                                                |
| Parch    | # of parents / children aboard the Titanic |                                                |
| Ticket   | Ticket number                              |                                                |
| Fare     | Passenger fare                             |                                                |
| Cabin    | Cabin number                               |                                                |
| Embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

In [1]:
import pandas as pd

In [5]:
# using `PassengerId` as the index column when reading
titanic = pd.read_csv("../../../../CSV_FILES/train.csv", index_col="PassengerId")

# using `PassengerId` as the index column after reading
titanic = pd.read_csv("../../../../CSV_FILES/train.csv")
titanic.set_index("PassengerId", inplace=True)
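
The two approaches in the cell above should produce the same DataFrame. A minimal sketch that checks this on a hypothetical three-row CSV held in memory (the sample rows are made up for illustration and are not taken from `train.csv`):

```python
import io

import pandas as pd

# Hypothetical sample that mimics a few columns of train.csv
csv_text = """PassengerId,Survived,Sex
1,0,male
2,1,female
3,1,female
"""

# Option 1: pick the index column while reading
at_read = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

# Option 2: read first, then move the column into the index
after_read = pd.read_csv(io.StringIO(csv_text)).set_index("PassengerId")

# Both routes give the same index, columns and values
print(at_read.equals(after_read))
```

Passing `index_col` avoids the second step, so it is usually the shorter choice when the index column is known in advance.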

In [6]:
pd.set_option('display.max_columns', None)

titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Compute the percentage of survivors out of total number of passengers...

In [22]:
titanic['Survived'].mean() * 100

38.38383838383838

...and the percentage of men and women out of total number of passengers (0.5 points)

In [29]:
# Share of each sex among all passengers, in percent.
# DataFrame.size counts rows x columns, so a boolean mean is used instead.
percentage_men = (titanic['Sex'] == 'male').mean() * 100
percentage_women = (titanic['Sex'] == 'female').mean() * 100

print(f"{percentage_men:.2f} {percentage_women:.2f}")

64.76 35.24


Compute the percentage of survivors by sex (i.e. the percentage of male passengers that survived and female passengers that survived)...

In [26]:
# Survival rate within each sex, in percent: P(Survived | Sex)
print((titanic.groupby('Sex')['Survived'].mean() * 100).round(2))

Sex
female    74.20
male      18.89
Name: Survived, dtype: float64


...and the sex distribution of survivors (i.e. percentage of survivors that were men and percentage of survivors that were women) (0.5 points)

In [27]:
# Sex distribution among survivors only: P(Sex | Survived = 1)
titanic.loc[titanic['Survived'] == 1, 'Sex'].value_counts(normalize = True)

Sex
female    0.681287
male      0.318713
Name: proportion, dtype: float64
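
The two questions above condition in opposite directions: the first asks for the survival rate within each sex, the second for the sex mix within the survivors. One way to see the difference side by side is `pd.crosstab` with its `normalize` argument. This is a sketch on a hypothetical six-row toy frame, not on the real passengers:

```python
import pandas as pd

# Hypothetical toy data, chosen only to make the arithmetic easy to follow
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1],
})

# Each row sums to 1: P(outcome | sex), the survival rate within each sex
by_sex = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

# Each column sums to 1: P(sex | outcome), the sex mix within each outcome
by_outcome = pd.crosstab(toy["Sex"], toy["Survived"], normalize="columns")

print(by_sex)
print(by_outcome)
```

In the toy data one of four men survives, so `by_sex` gives men a 0.25 survival rate, while `by_outcome` shows that men make up one third of the three survivors.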

Display in a 2 x 2 DataFrame the probability of being male/female and surviving/not surviving (0.5 points)

One possible option:

|            | **Survived**      | **Not survived**      |
|------------|-------------------|-----------------------|
| **Male**   | Male & Survived   | Male & Not survived   |
| **Female** | Female & Survived | Female & Not survived |

Notice that the sum of all values in the table above should be 1 (or 100 %).

In [72]:
# Count passengers for every (Sex, Survived) combination
pivot_table = titanic.pivot_table(index='Sex', columns='Survived', aggfunc='size')

# Rename the columns to match the requested layout
pivot_table.columns = ['Not Survived', 'Survived']

# Divide by the grand total so the four cells are joint probabilities that sum to 100 %
pivot_table_normalized = pivot_table / pivot_table.to_numpy().sum() * 100

# Create a new DataFrame for the final table layout
final_table = pd.DataFrame(index=['Male', 'Female'], columns=['Survived', 'Not Survived'])

# Populate the new DataFrame with the calculated probabilities
final_table.loc['Male', 'Survived'] = f"Male & Survived: {pivot_table_normalized.loc['male', 'Survived']:.2f}%"
final_table.loc['Male', 'Not Survived'] = f"Male & Not survived: {pivot_table_normalized.loc['male', 'Not Survived']:.2f}%"
final_table.loc['Female', 'Survived'] = f"Female & Survived: {pivot_table_normalized.loc['female', 'Survived']:.2f}%"
final_table.loc['Female', 'Not Survived'] = f"Female & Not survived: {pivot_table_normalized.loc['female', 'Not Survived']:.2f}%"

# Display the final table
print(final_table)

                         Survived                  Not Survived
Male      Male & Survived: 12.23%   Male & Not survived: 52.53%
Female  Female & Survived: 26.15%  Female & Not survived: 9.09%


Display in a DataFrame the probability of survival/not survival of all combinations of sex and class (0.5 points)

One possible option:

|            |   | **Survived**              | **Not survived** |
|------------|---|---------------------------|------------------|
| **Male**   | 1 | Male & Survived & Class 1 | ...              |
|            | 2 | Male & Survived & Class 2 | ...              |
|            | 3 | Male & Survived & Class 3 | ...              |
| **Female** | 1 | ...                       | ...              |
|            | 2 | ...                       | ...              |
|            | 3 | ...                       | ...              |

Notice that the sum of all values in the table above should be 1 (or 100 %).

In [71]:
grouped = titanic.groupby(['Sex', 'Pclass', 'Survived']).size().unstack()

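# Row-wise normalization: sum(axis = 1) totals each (Sex, Pclass) row,
# and div(..., axis = 0) divides every row by its own total, giving P(outcome | sex, class)
grouped_normalized = grouped.div(grouped.sum(axis = 1), axis = 0) * 100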

grouped_normalized.columns = ['Not Survived', 'Survived']

grouped_normalized

Sex,Pclass,Not Survived,Survived
female,1,3.191489,96.808511
female,2,7.894737,92.105263
female,3,50.0,50.0
male,1,63.114754,36.885246
male,2,84.259259,15.740741
male,3,86.455331,13.544669
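
The table above is normalized within each row, so every sex/class pair sums to 100 % on its own. To obtain joint probabilities whose grand total is 1, as the prompt asks, one option is `pd.crosstab` with `normalize="all"`. The sketch below uses a hypothetical eight-row toy frame; the column names match the dataset, but the values are invented:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic frame
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "female", "female", "female", "female", "male"],
    "Pclass":   [1, 3, 3, 1, 2, 3, 3, 2],
    "Survived": [1, 0, 0, 1, 1, 0, 1, 0],
})

# normalize="all" divides every cell by the total number of rows,
# so each cell is P(sex, class, outcome)
joint = pd.crosstab([toy["Sex"], toy["Pclass"]], toy["Survived"], normalize="all")
joint = joint.rename(columns={0: "Not survived", 1: "Survived"})

print(joint)
print(joint.to_numpy().sum())  # grand total of all cells
```

On the real data the same call would take `titanic["Sex"]`, `titanic["Pclass"]` and `titanic["Survived"]` in place of the toy columns.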


## Part II: More insights on the data (3 points)

Present 3 insights about the dataset, each of them relating at least 3 different variables, and support them by code and numbers. Possible examples:

- "**Men** aged **less than 18** were more/less likely to **survive** than the average passenger" (Sex, Age, Survival)
- "**Women** with **no siblings or spouses** paid on average a cheaper/pricier **ticket** than the average woman" (Sex, SibSp, Fare)
- "**Men** with a **title other than Mr.** were more/less likely to have a known (i.e. non-missing) **cabin** than the average man" (Sex, Name, Cabin)

(Using these exact examples is valid, but awards fewer points than proposing original insights)
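
As an illustration of how an insight of this kind can be backed by numbers, here is a sketch for the third example (Sex, Name, Cabin). It assumes names follow the "Surname, Title. Given names" pattern seen in the data, and it runs on a hypothetical four-row toy frame, so the resulting rates say nothing about the real passengers:

```python
import pandas as pd

# Hypothetical toy rows in the same name format as the dataset
toy = pd.DataFrame({
    "Name":  ["Smith, Mr. John", "Jones, Dr. Alan", "Brown, Master. Tom", "Gray, Mr. Paul"],
    "Sex":   ["male", "male", "male", "male"],
    "Cabin": [None, "C22", "B5", None],
})

# The title is the text between the comma and the first period
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# A cabin is "known" when the value is not missing
toy["CabinKnown"] = toy["Cabin"].notna()

men = toy[toy["Sex"] == "male"]
rate_all_men = men["CabinKnown"].mean()
rate_other_titles = men.loc[men["Title"] != "Mr", "CabinKnown"].mean()

print(rate_all_men, rate_other_titles)
```

Comparing the subgroup rate with the rate for all men is what turns the statement into a claim that the numbers can support or refute.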

_Hint: If you want to work with lists and dictionaries rather than pandas objects, you can do_

```python
recs = df.to_dict(orient="records")
```
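
A short sketch of what working with those records can look like in plain Python. The toy frame is hypothetical. Note that `to_dict(orient="records")` does not include the index, so the example calls `reset_index()` first to keep `PassengerId`:

```python
import math

import pandas as pd

# Hypothetical toy frame indexed by PassengerId, like the notebook's DataFrame
toy = pd.DataFrame(
    {"Sex": ["male", "female", "female"], "Age": [22.0, 38.0, None], "Survived": [0, 1, 1]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# One dictionary per passenger, with PassengerId restored as a regular key
recs = toy.reset_index().to_dict(orient="records")

# Plain list comprehensions replace boolean indexing
survivors = [r["PassengerId"] for r in recs if r["Survived"] == 1]

# Missing ages arrive as float NaN and have to be skipped explicitly
ages = [r["Age"] for r in recs if not math.isnan(r["Age"])]

print(survivors)
print(sum(ages) / len(ages))
```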

In [74]:
titanic.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [None]:
import numpy as np

df = titanic.copy()

df.drop(columns = ['Name', 'Ticket'], inplace = True)

df['Sex'] = np.where(df['Sex'] == 'male', 1, 0)

# df.corr()

In [95]:
titanic.groupby('Cabin')['Survived'].sum().idxmax() # Cabin value with the most survivors (rows with a missing cabin are excluded)

'B96 B98'


In [16]:
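# Totals over passengers who did not survive. These sum the relatives each victim
# reported aboard; they are not counts of bereaved people, since relatives may be
# counted several times and may themselves have died.
print(titanic.loc[titanic['Survived'] == 0, 'SibSp'].sum()) # siblings/spouses aboard, summed over victims
print(titanic.loc[titanic['Survived'] == 0, 'Parch'].sum()) # parents/children aboard, summed over victims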

304
181


In [18]:
titanic['Parch'].describe()

count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: Parch, dtype: float64

In [27]:
grouped = titanic.groupby(['Embarked', 'Survived']).size().unstack()

# Row-wise normalization: each port's row is divided by its own total
grouped_normalized = grouped.div(grouped.sum(axis = 1), axis = 0) * 100

grouped_normalized.columns = ['Not Survived', 'Survived']

print('People with the highest chance of survival embarked from port:')
print(grouped_normalized['Survived'].idxmax())
print('People with the highest probability of not surviving embarked from port:')
print(grouped_normalized['Not Survived'].idxmax())

# You had the highest chance of survival if you embarked from C

People with the highest chance of survival embarked from port:
C
People with the highest probability of not surviving embarked from port:
S

In [23]:
grouped_normalized

Embarked,Not Survived,Survived
C,44.642857,55.357143
Q,61.038961,38.961039
S,66.304348,33.695652
