# __Survival Analysis in Cruise StoneMitch Aug 2024__
***

__Hypothesis__: Does the survival rate differ between passenger classes? Secondly, does age affect the survival rate?  <br>
__Approach__: Use cruise passenger records to analyse how factors such as passenger class and passenger profile relate to the survival rate. <br> 

Analysis required:
1) Number of survivors vs passenger class, filtered by age (bar chart)
2) % survival rate vs passenger class, filtered by age (bar chart)
3) % survival rate vs age (bar chart)
<br>

__Tools__: Pre-processing (i.e. extract, transform, load) and exploratory data analysis in __Python__. Data visualization in __Tableau__. 

##### 1. Download the file from the internet

In [6]:
import gdown


# Download the file from google drive
file_id = '15cFsnPnHc7KlzV0C9QQ5wG5v8PWqJ65C'
download_url = f'https://drive.google.com/uc?id={file_id}'
output_file = 'downloaded_file.csv'
gdown.download(download_url, output_file, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=15cFsnPnHc7KlzV0C9QQ5wG5v8PWqJ65C
To: C:\Digipen\Titanic Analysis\downloaded_file.csv
100%|█████████████████████████████████████████████████████████████████████████████| 61.2k/61.2k [00:00<00:00, 7.69MB/s]


'downloaded_file.csv'

In [92]:
import pandas as pd

# load the CSV into a pandas DataFrame
cruise_data = pd.read_csv('downloaded_file.csv')

In [93]:
cruise_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


From the records: <br>
Total passengers in the records: 891.

PassengerId is a running identifier that starts from 1. <br>
Survived is a 0/1 flag stored as an integer, where 0 means the passenger died and 1 means the passenger survived. <br>
Pclass is an integer between 1 and 3, where 1 refers to 1st class, 2 to 2nd class and 3 to 3rd class. <br>
Name is the name of the passenger on board.<br>
Sex is the gender, either male or female.<br>
Age is a float, and 177 of the 891 records have a NaN value. The smallest value is 0.42, a fraction of a year, so ages are rounded to whole numbers during cleaning. <br>
SibSp refers to the number of siblings / spouses aboard.  <br> 
Parch refers to the number of parents / children aboard the Titanic. <br>
Ticket refers to the ticket booking ID.<br>
Fare refers to the ticket cost for each passenger. Some passengers paid a fare of 0, while the most expensive fare was about 512 USD. <br>
Cabin refers to the cabin number allocated to the passenger. Most cabin numbers were not recorded (only 204 of 891 are present).<br>
Embarked refers to Port of Embarkation.<br>
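
Several of the observations above can be checked directly in pandas. The sketch below is illustrative only: `sample` is a hypothetical frame that rebuilds a few columns of the ten preview rows shown earlier, not the full 891-row file, so the values it prints describe the preview rather than the whole dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: a few columns of the ten preview rows, not the full file.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    "Fare":     [7.25, 71.2833, 7.925, 53.1, 8.05,
                 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    "Cabin":    [None, "C85", None, "C123", None, None, "E46", None, None, None],
})

# Share of missing values per column.
missing_share = sample.isna().mean()

# Range checks on the coded columns.
survived_is_binary = sample["Survived"].isin([0, 1]).all()
pclass_in_range = sample["Pclass"].between(1, 3).all()

# Number of passengers recorded with a zero fare.
zero_fare_count = int((sample["Fare"] == 0).sum())

print(missing_share)
print(survived_is_binary, pclass_in_range, zero_fare_count)
```

Running the same checks on `cruise_data` instead of `sample` would cover the full dataset.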

In [94]:
cruise_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [95]:
cruise_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [96]:
# take an explicit copy so later column assignments do not act on a slice of cruise_data
df = cruise_data[["Survived", "Pclass", "Age", "Sex"]].copy()

In [97]:
# display the number of missing values in each column
# Age has many missing values; one way to handle them is to impute the mean age.
print(df.isnull().sum())

Survived      0
Pclass        0
Age         177
Sex           0
dtype: int64


In [98]:
# Calculate the mean of the Age column
age_mean = df["Age"].mean()

# Fill missing values with the mean age (assign the result back instead of using inplace=True)
df["Age"] = df["Age"].fillna(age_mean)

#to verify that there are no more missing values
print(df.isnull().sum())

Survived    0
Pclass      0
Age         0
Sex         0
dtype: int64
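
Filling a column on a DataFrame that was itself selected from another DataFrame can trigger pandas' chained-assignment warnings. The following is a minimal sketch of one way to avoid them, assuming a small hypothetical `raw` frame in place of `cruise_data`: take an explicit `.copy()` of the selected columns, then assign the filled column back rather than calling `fillna` with `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data; values are made up for illustration.
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 3],
    "Age":      [22.0, 38.0, np.nan, 30.0],
    "Sex":      ["male", "female", "female", "male"],
    "Fare":     [7.25, 71.28, 7.93, 8.05],
})

# An explicit copy makes `subset` independent of `raw`.
subset = raw[["Survived", "Pclass", "Age", "Sex"]].copy()

# Mean of the known ages: (22 + 38 + 30) / 3 = 30.
age_mean = subset["Age"].mean()

# Assign the filled column back to the frame.
subset["Age"] = subset["Age"].fillna(age_mean)

print(subset["Age"].tolist())
print(raw["Age"].isna().sum())  # the original frame is left untouched
```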




In [99]:
# next, round the ages to whole numbers (for example, 0.42 years becomes 0)
df["Age"] = df["Age"].round()

# verify that the ages are now rounded to the nearest whole number
df[["Age"]].describe()



Unnamed: 0,Age
count,891.0
mean,29.754209
std,13.000828
min,0.0
25%,22.0
50%,30.0
75%,35.0
max,80.0
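
One detail worth keeping in mind: `Series.round()` follows NumPy's round-half-to-even rule, so an age ending in .5 goes to the nearest even whole number, and any age below 0.5 becomes 0 (which is why the minimum above is 0.0). The sketch below uses made-up ages, apart from 0.42 and the mean 29.699118 taken from the outputs above. The half-up variant is shown only as a possible alternative, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 0.5, 1.5, 2.5, 29.699118])

# Default behaviour: ties are rounded to the nearest even number.
rounded = ages.round()

# Alternative (assumption, not used in the notebook): always round ties upward.
half_up = np.floor(ages + 0.5)

print(rounded.tolist())
print(half_up.tolist())
```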


In [100]:
# preview the first 10 rows to confirm the Age column looks correct
df[["Age"]].head(10)

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
5,30.0
6,54.0
7,2.0
8,27.0
9,14.0


In [101]:
#checking dtype of Age column before grouping them into categories
df[["Age"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     891 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB


In [102]:
# Group the passengers' ages into categories: Child, Adult, Senior.
# A person under 13 is treated as a Child, as they may lack the strength and knowledge to swim in open water.
# A person aged 50 or above is treated as a Senior, as they may lack the endurance and strength to stay afloat for long.
# Note: Series.case_when requires pandas 2.2 or later.

df['Age_cat'] = df['Age'].case_when([
    (df['Age'] < 13, 'Child'),
    (df['Age'] < 50, 'Adult'),
    (df['Age'] >= 50, 'Senior')
])



In [103]:
#EDA on breakdown of number of people in each age category 
print(df['Age_cat'].value_counts())

Age_cat
Adult     748
Senior     74
Child      69
Name: count, dtype: int64
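
`Series.case_when` is only available from pandas 2.2. On older versions, the same three age bands could be sketched with `pd.cut`. This is an alternative under that assumption, not the method used above, and the `ages` values are made up to exercise the boundaries at 13 and 50.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on either side of each boundary.
ages = pd.Series([2.0, 12.0, 13.0, 49.0, 50.0, 80.0])

# right=False makes each bin include its left edge and exclude its right edge:
# [-inf, 13) -> Child, [13, 50) -> Adult, [50, inf) -> Senior
age_cat = pd.cut(
    ages,
    bins=[-np.inf, 13, 50, np.inf],
    labels=["Child", "Adult", "Senior"],
    right=False,
)

print(age_cat.tolist())
```

Unlike `case_when`, `pd.cut` returns an ordered categorical column, which keeps the bands in Child, Adult, Senior order when sorting or grouping.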


In [107]:
# save the cleaned data to an Excel file for use in Tableau
df.to_excel("cleaned_titanic_data.xlsx", index=False)
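
The numbers behind the planned charts can also be cross-checked in pandas. The sketch below is an illustration only: `sample` is a hypothetical frame holding the ten preview rows shown earlier, with the age categories already applied (the one missing age imputed as 30), so its output reflects ten rows rather than the full dataset. It computes the survivor count and survival rate per passenger class, then the same split by age category.

```python
import pandas as pd

# Hypothetical stand-in built from the ten preview rows, not the full file.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Age_cat":  ["Adult", "Adult", "Adult", "Adult", "Adult",
                 "Adult", "Senior", "Child", "Adult", "Adult"],
})

# Survivors, passengers and survival rate per passenger class.
by_class = sample.groupby("Pclass")["Survived"].agg(
    survivors="sum", passengers="count", survival_rate="mean"
)

# The same measures split further by age category.
by_class_age = sample.groupby(["Pclass", "Age_cat"])["Survived"].agg(
    survivors="sum", passengers="count", survival_rate="mean"
)

print(by_class)
print(by_class_age)
```

Because `Survived` is coded 0/1, its mean within a group is the group's survival rate.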

In [114]:
# the visualization is built in Tableau and embedded here
from IPython.display import HTML

HTML(""" <div class='tableauPlaceholder' id='viz1723552080781' style='position: relative'><noscript><a href='#'><img alt='Dashboard 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Sh&#47;Ship_mishap&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Ship_mishap&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Sh&#47;Ship_mishap&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1723552080781');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1024px';vizElement.style.height='795px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1024px';vizElement.style.height='795px';} else { vizElement.style.width='100%';vizElement.style.height='1827px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
""")

**Findings**: <br>
1) Pclass 1 has the highest survival rate of the 3 passenger classes. This could be because their cabins are in close proximity to lifejackets and evacuation boats. 
   Recommendation: Increase the price of pclass 1 packages and market them as a safer holiday experience.  
2) Pclass 3 has the lowest survival rate, at only about 1/3 of the pclass 1 rate, even though the two classes have roughly the same number of survivors. There may have been insufficient lifejackets for everyone, or some people may have taken more than one lifejacket, resulting in shortages. Crew assistance to pclass 3 may also have been inadequate, leaving pclass 3 passengers unsure where to go for evacuation.
   Recommendation: Re-evaluate safety protocols such as lifejackets, crew assistance and evacuation routes. Where passengers' lives are concerned, safety resources must be available to all. Failure to ensure this will result in poor branding and loss of reputation. 
3) The Child category has the highest survival rate across all 3 passenger classes.
   Recommendation: Market this as a family-friendly holiday experience. Emphasize the child-friendly precautions and safety features that are strictly enforced in our policies to ensure that it is safe even for children. 
