Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menu bar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menu bar, select Cell$\rightarrow$Run All).

Make sure that in addition to the code, you provide written answers for all questions of the assignment. 

Below, please fill in your name and collaborators:

In [1]:
NAME = "CLAUDE LEWIS NDI MBUA"
COLLABORATORS = ""

## Assignment 2 - Data Analysis using Pandas
**(15 points total)**

For this assignment, we will analyze the open dataset with data on the passengers aboard the Titanic.

The data file for this assignment can be downloaded from the Kaggle website: https://www.kaggle.com/c/titanic/data, file `train.csv`. It is also attached to the assignment page. The definitions of all variables can be found on the same Kaggle page, in the Data Dictionary section.

Read the data from the file into a pandas DataFrame. Analyze, clean, and transform the data to answer the following question:

**What categories of passengers were most likely to survive the Titanic disaster?**

**Question 1.**  _(4 points)_
* The answer to the main question - What categories of passengers were most likely to survive the Titanic disaster? _(2 points)_
* The detailed explanation of the logic of the analysis _(2 points)_

**Question 2.**  _(3 points)_
* What other attributes did you use for the analysis? Explain how you used them and why you decided to use them. 
* Provide a complete list of all attributes used.

**Question 3.**  _(3 points)_
* Did you engineer any attributes (create new attributes)? If yes, explain the rationale and how the new attributes were used in the analysis.
* If you have excluded any attributes from the analysis, explain why you believe they can be excluded.

**Question 4.**  _(5 points)_
* How did you treat missing values for those attributes that you included in the analysis (for example, `age` attribute)? Provide a detailed explanation in the comments.


In [2]:
import pandas as pd

# Load the data into a DataFrame
titanic_df = pd.read_csv('train.csv')

# Preview the first few rows of the DataFrame
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
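
Before computing any survival rates, it can help to see how many values are missing in each column, since the preview above already shows empty `Cabin` entries. The following is a minimal sketch on a hypothetical four-row stand-in for `train.csv` (the values in `sample_df` are illustrative assumptions, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for train.csv; values are illustrative only.
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# isna() marks missing cells; sum() counts them and mean() gives their share.
missing_counts = sample_df.isna().sum()
missing_share = sample_df.isna().mean()

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

Running the same two calls on `titanic_df` would show which attributes need a missing-value treatment before they are used.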


In [3]:
# QUESTION 1

# To find out which categories of passengers were most likely to survive the Titanic disaster,
# we first fill in the missing ages, then look at basic statistics about the data, and finally
# compare survival rates across categories.

# First, calculate the median age of passengers using the median() method:
median_age = titanic_df['Age'].median()

# Then fill in the missing values using the fillna() method:

titanic_df['Age'] = titanic_df['Age'].fillna(median_age)

# We can use the describe() method to get a summary of the numerical columns.
# Note: a notebook cell displays only the value of its last expression, so this call
# must be wrapped in print() or display() for the summary to appear in the output.

titanic_df.describe()

# The summary includes the count, mean, and standard deviation of each numerical column,
# as well as the minimum and maximum values.

# Next, let's look at the survival rate overall and by different categories. We can use the
# value_counts() method to count the number of survivors and non-survivors
# (as with describe(), wrap the call in print() to see the result mid-cell):

# Count the number of survivors and non-survivors
titanic_df['Survived'].value_counts()

# This gives the number of passengers who survived (1) and who did not survive (0).

# To calculate the survival rate, we can divide the number of survivors by the total number of passengers:

# Calculate the overall survival rate

survival_rate = titanic_df['Survived'].sum() / len(titanic_df)
print(f"Overall survival rate: {survival_rate:.2%}")

# This prints the overall survival rate as a percentage.

# To analyze the survival rate by category, we use the groupby() method to group the data by a
# specific column and then take the mean of Survived within each group. Because Survived is
# coded as 0/1, the group mean is the survival rate of that group:


# Calculate the survival rate by gender

gender_survival_rate = titanic_df.groupby('Sex')['Survived'].mean()
print(f"Survival rate by gender:\n{gender_survival_rate}")

# Calculate the survival rate by passenger class

class_survival_rate = titanic_df.groupby('Pclass')['Survived'].mean()
print(f"\nSurvival rate by passenger class:\n{class_survival_rate}")

# Calculate the survival rate by age group

bins = [0, 18, 30, 50, 100]
labels = ['child', 'young adult', 'adult', 'senior']
age_groups = pd.cut(titanic_df['Age'], bins=bins, labels=labels)
age_group_survival_rate = titanic_df.groupby(age_groups)['Survived'].mean()
print(f"\nSurvival rate by age group:\n{age_group_survival_rate}")


# Calculate the survival rate by port of embarkation
embarked_survival_rate = titanic_df.groupby('Embarked')['Survived'].mean()
print(f"Survival rate by port of embarkation:\n{embarked_survival_rate}")

# Calculate the survival rate by number of siblings/spouses
sibsp_survival_rate = titanic_df.groupby('SibSp')['Survived'].mean()
print(f"\nSurvival rate by number of siblings/spouses:\n{sibsp_survival_rate}")

# Calculate the survival rate by number of parents/children
parch_survival_rate = titanic_df.groupby('Parch')['Survived'].mean()
print(f"\nSurvival rate by number of parents/children:\n{parch_survival_rate}")

# Calculate the survival rate by fare
bins = [0, 10, 20, 30, 1000]
labels = ['low', 'medium', 'high', 'very high']
fare_groups = pd.cut(titanic_df['Fare'], bins=bins, labels=labels)
fare_group_survival_rate = titanic_df.groupby(fare_groups)['Survived'].mean()
print(f"\nSurvival rate by fare group:\n{fare_group_survival_rate}")


Overall survival rate: 38.38%
Survival rate by gender:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

Survival rate by passenger class:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

Survival rate by age group:
Age
child          0.503597
young adult    0.331096
adult          0.423237
senior         0.343750
Name: Survived, dtype: float64
Survival rate by port of embarkation:
Embarked
C    0.553571
Q    0.389610
S    0.336957
Name: Survived, dtype: float64

Survival rate by number of siblings/spouses:
SibSp
0    0.345395
1    0.535885
2    0.464286
3    0.250000
4    0.166667
5    0.000000
8    0.000000
Name: Survived, dtype: float64

Survival rate by number of parents/children:
Parch
0    0.343658
1    0.550847
2    0.500000
3    0.600000
4    0.000000
5    0.200000
6    0.000000
Name: Survived, dtype: float64

Survival rate by fare group:
Fare
low          0.205607
medium       0.424581
high         0.443662
very high    0.581197
Name: Survived, dtype: float64
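
The rates above look at one attribute at a time. To name categories of passengers, it can also be useful to group by two attributes at once. The sketch below assumes a hypothetical six-row `sample_df` (illustrative values only) and reports the survival rate together with the group size, since a rate based on very few passengers is not reliable:

```python
import pandas as pd

# Hypothetical toy data; not taken from train.csv.
sample_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Group by both columns, then compute the survival rate and the group size.
combined = (
    sample_df.groupby(["Sex", "Pclass"])["Survived"]
    .agg(rate="mean", passengers="count")
    .sort_values("rate", ascending=False)
)

print(combined)
```

Applied to `titanic_df`, the same pattern would show, for example, how the survival rate of women differs between passenger classes.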

In [4]:
# QUESTION 2

# In addition to gender, passenger class, and age group, we used several other attributes
# from the dataset in our analysis:

# Embarked: the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
# SibSp: the number of siblings/spouses aboard the Titanic
# Parch: the number of parents/children aboard the Titanic
# Fare: the fare paid for the ticket

# Calculate the survival rate by port of embarkation
embarked_survival_rate = titanic_df.groupby('Embarked')['Survived'].mean()
print(f"Survival rate by port of embarkation:\n{embarked_survival_rate}")

# Calculate the survival rate by number of siblings/spouses
sibsp_survival_rate = titanic_df.groupby('SibSp')['Survived'].mean()
print(f"\nSurvival rate by number of siblings/spouses:\n{sibsp_survival_rate}")

# Calculate the survival rate by number of parents/children
parch_survival_rate = titanic_df.groupby('Parch')['Survived'].mean()
print(f"\nSurvival rate by number of parents/children:\n{parch_survival_rate}")

# Calculate the survival rate by fare
bins = [0, 10, 20, 30, 1000]
labels = ['low', 'medium', 'high', 'very high']
fare_groups = pd.cut(titanic_df['Fare'], bins=bins, labels=labels)
fare_group_survival_rate = titanic_df.groupby(fare_groups)['Survived'].mean()
print(f"\nSurvival rate by fare group:\n{fare_group_survival_rate}")


# We used these attributes to explore other factors that may have influenced survival.

# We hypothesized that passengers who paid a higher fare were more likely to survive, because a
# higher fare usually means a higher class and possibly a location on the ship that made escape
# easier. The results are consistent with this: the survival rate rises from about 21% in the
# low fare group to about 58% in the very high fare group.

# We also hypothesized that passengers traveling with family members had more help in escaping
# the sinking ship. The results support this only in part: passengers with one or two
# siblings/spouses, or one to three parents/children, survived more often than passengers
# traveling alone, but the survival rate drops again for larger families.

# The complete list of attributes used for the analysis is:
# Sex (gender)
# Pclass (passenger class)
# Age (as age group)
# Embarked (port of embarkation)
# SibSp (number of siblings/spouses)
# Parch (number of parents/children)
# Fare (as fare group)

Survival rate by port of embarkation:
Embarked
C    0.553571
Q    0.389610
S    0.336957
Name: Survived, dtype: float64

Survival rate by number of siblings/spouses:
SibSp
0    0.345395
1    0.535885
2    0.464286
3    0.250000
4    0.166667
5    0.000000
8    0.000000
Name: Survived, dtype: float64

Survival rate by number of parents/children:
Parch
0    0.343658
1    0.550847
2    0.500000
3    0.600000
4    0.000000
5    0.200000
6    0.000000
Name: Survived, dtype: float64

Survival rate by fare group:
Fare
low          0.205607
medium       0.424581
high         0.443662
very high    0.581197
Name: Survived, dtype: float64
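
One detail of the fare binning is worth checking. By default, `pd.cut` builds intervals that are open on the left and closed on the right, so with `bins = [0, 10, 20, 30, 1000]` the first interval is (0, 10] and a fare of exactly 0 falls into no bin. Such rows become `NaN` and are silently left out of the grouped result. The sketch below uses a hypothetical series of fares (illustrative values, not from `train.csv`) to show the effect and one possible fix with `include_lowest=True`:

```python
import pandas as pd

# Hypothetical fares chosen to hit the bin edges.
fares = pd.Series([0.0, 7.25, 10.0, 25.0, 80.0])
bins = [0, 10, 20, 30, 1000]
labels = ["low", "medium", "high", "very high"]

# Default: intervals are (0, 10], (10, 20], ... so 0.0 is not assigned a label.
default_groups = pd.cut(fares, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 10].
inclusive_groups = pd.cut(fares, bins=bins, labels=labels, include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```

If any passenger in `titanic_df` has a `Fare` of exactly 0, using `include_lowest=True` would change the rate reported for the low fare group, so the numbers above should be read with that in mind.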


In [5]:
# QUESTION 3
# Yes, we created a new attribute called "Age group" by binning passengers into four age
# categories (child, young adult, adult, senior). We did this because age can be an important
# factor in survival. For example, children and elderly passengers may have had a harder time
# escaping the sinking ship on their own than young adults. We used this attribute to compare
# survival rates across age categories; in our results, children had the highest survival rate
# of the four groups (about 50%).

# We also used the "Fare" attribute to create a new attribute called "Fare group". We binned
# passengers into four fare categories (low, medium, high, very high) to explore the relationship
# between the fare paid and the survival rate.

# We excluded the passenger's name, ticket number, and cabin number from the analysis.
# Name and Ticket are unlikely to have a direct impact on a passenger's survival. Ticket also
# contains an inconsistent mix of letters and digits, which makes it hard to group in a useful way.

# We excluded the "Cabin" attribute because it has a large number of missing values.
# While the cabin number could potentially provide information on the passenger's location on the
# ship and their proximity to lifeboats, the missing data make it difficult to draw any conclusions
# from this attribute.
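
The engineered attributes described above can also be stored as columns of the DataFrame, which makes them reusable in later cells. The following is a minimal sketch on hypothetical data; the column name `AgeGroup` is an assumption introduced for illustration, while the bins and labels are the ones used in the analysis:

```python
import pandas as pd

# Hypothetical toy data; not taken from train.csv.
sample_df = pd.DataFrame({
    "Age": [4.0, 18.0, 28.0, 45.0, 71.0],
    "Survived": [1, 1, 0, 1, 0],
})

bins = [0, 18, 30, 50, 100]
labels = ["child", "young adult", "adult", "senior"]

# Store the binned ages as a new column (hypothetical name "AgeGroup").
# Intervals are closed on the right, so an age of exactly 18 counts as "child".
sample_df["AgeGroup"] = pd.cut(sample_df["Age"], bins=bins, labels=labels)

# observed=False keeps every label in the result, even one with no passengers.
rate_by_group = sample_df.groupby("AgeGroup", observed=False)["Survived"].mean()

print(sample_df)
print(rate_by_group)
```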

In [6]:
# QUESTION 4

# Dropping every row with a missing value would discard a substantial part of the data.
# Instead, we can fill in (impute) missing values with an estimate based on the available data,
# for example the mean or median age of passengers.

# In this case, we imputed missing ages using the median age of all passengers.
# To do this, we first calculated the median age using the median() method.
# (This imputation was already applied in the Question 1 cell. It is repeated here for
# reference, and running it again leaves the data unchanged.)

median_age = titanic_df['Age'].median()

# We then filled in the missing values using the fillna() method:

titanic_df['Age'] = titanic_df['Age'].fillna(median_age)

# Imputing missing values in this way can introduce bias into the analysis, because the imputed
# values may not reflect the true ages. In particular, every passenger with a missing age receives
# the same value and therefore lands in the same age group. Using a single median is reasonable
# only under the assumption that age does not vary much between different groups of passengers.
# We chose the median because it is less sensitive to outliers than the mean; for this data set
# the mean and median of Age are close enough that either choice should give similar results.
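
If the assumption that age is similar across passenger groups does not hold, one possible alternative is to impute with the median of each passenger's own group. This is not the method used in this analysis; the sketch below only contrasts the two approaches on hypothetical data, with grouping by `Pclass` and `Sex` chosen as an illustrative assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from train.csv.
sample_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [38.0, 42.0, np.nan, 20.0, 24.0, np.nan],
})

# Approach used in the analysis: one median for all passengers.
global_fill = sample_df["Age"].fillna(sample_df["Age"].median())

# Alternative: median within each (Pclass, Sex) group.
# transform("median") returns a Series aligned with the original rows.
group_median = sample_df.groupby(["Pclass", "Sex"])["Age"].transform("median")
group_fill = sample_df["Age"].fillna(group_median)

print(pd.DataFrame({"global": global_fill, "by_group": group_fill}))
```

In this toy example the global median assigns the same age to both passengers with a missing value, while the group-wise median assigns each of them the median of their own group.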