<header>Yarra Prasad</header>

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

In [None]:
import warnings

In [None]:
# Replace warnings.warn with a no-op so library warnings are not printed
def warn(*args, **kwargs):
    pass
warnings.warn = warn

In [None]:
#Loading titanic survival dataset
titanic = pd.read_csv("/kaggle/input/titanic-dataset/Titanic-Dataset.csv")

# Brief description of the data set and a summary of its attributes

The Titanic dataset records information about individual passengers, such as age, sex, number of siblings or spouses aboard, port of embarkation, and whether they survived the disaster. Based on these features, the task is to predict whether an arbitrary passenger on the Titanic would have survived the sinking.

**Summary of attributes**

In [None]:
# features in the data
columns=titanic.columns.to_list()
print(columns)

In [None]:
titanic.describe().T

In [None]:
titanic['Survived'].value_counts()

The counts show that more than 60% of the passengers in the dataset died.
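
The percentage quoted above can be read directly from `value_counts` by asking for proportions instead of counts. A minimal sketch on made-up labels (the toy series is an assumption for illustration, not the real `Survived` column):

```python
import pandas as pd

# Hypothetical toy labels (0 = died, 1 = survived), not the real data
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name="Survived")

counts = survived.value_counts()                # absolute count per class
shares = survived.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(2))
```

On the notebook's data the same call would be `titanic['Survived'].value_counts(normalize=True)`.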

In [None]:
titanic['Pclass'].value_counts()

This feature gives each passenger's ticket class. Passengers traveled in one of three classes: first (1), second (2) or third (3). Third class had the most passengers; first class was the next largest and second class the smallest.

In [None]:
titanic['Sex'].value_counts()

Approximately 65% of the passengers were male and the remaining 35% were female.

<h4>Age</h4>
From the descriptive statistics, the youngest passenger on board was around five months old and the oldest was 80 years old. The average age of passengers was just under 30 years.

In [None]:
titanic['SibSp'].value_counts()

SibSp is the number of siblings or spouses a passenger had on board. The maximum value is 8. More than 90% of passengers traveled with no sibling or spouse, or with exactly one.

In [None]:
titanic['Parch'].value_counts()

Similar to SibSp, this feature holds the number of parents or children each passenger was traveling with. The maximum value is 6.

In [None]:
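# Bin fares into four categories. include_lowest keeps fares of exactly 0 and the
# open upper edge keeps fares above 120; both would otherwise become NaN.
titanic['Fare_Category'] = pd.cut(titanic['Fare'], bins=[0,7.90,14.45,31.28,np.inf], labels=['Low','Mid','High_Mid','High'], include_lowest=True)
titanic['Fare_Category'].value_counts()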
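
A minimal sketch of how `pd.cut` treats values that sit on or outside the listed edges, using a few made-up fares rather than the real column. With the default right-closed bins, a value equal to the first edge or above the last edge receives no category at all.

```python
import numpy as np
import pandas as pd

# Hypothetical fares chosen to hit the edges; not taken from the dataset
fares = pd.Series([0.0, 7.25, 512.33])
labels = ['Low', 'Mid', 'High_Mid', 'High']

# Bins are (0, 7.90], (7.90, 14.45], ... so 0.0 and 512.33 fall outside them
narrow = pd.cut(fares, bins=[0, 7.90, 14.45, 31.28, 120], labels=labels)

# include_lowest closes the first bin on the left; np.inf leaves the top bin open
wide = pd.cut(fares, bins=[0, 7.90, 14.45, 31.28, np.inf], labels=labels,
              include_lowest=True)

print(narrow.isna().sum())  # number of fares left without a category
print(wide.tolist())
```

Counting the missing values after binning is a cheap check that no rows silently dropped out of the categories.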

In [None]:
pd.crosstab(titanic['Fare_Category'],titanic['Survived'])

Splitting the fare into four categories suggests a positive association between fare and survival: the more a passenger paid, the higher their chance of surviving.
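
Raw counts in a cross-tabulation are hard to compare when the categories differ in size. A small sketch, on invented rows, of turning the table into per-category survival rates with `normalize='index'` (the toy frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows, not the real passengers
toy = pd.DataFrame({
    "Fare_Category": ["Low", "Low", "Low", "Low", "High", "High", "High", "High"],
    "Survived":      [0, 0, 0, 1, 1, 1, 1, 0],
})

# Each row of the result sums to 1, so column 1 is the survival rate
rates = pd.crosstab(toy["Fare_Category"], toy["Survived"], normalize="index")
print(rates)
```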

# Initial plan for data exploration

In [None]:
#Age Distribution by Survival:
sns.histplot(x='Age',hue='Survived',data=titanic,kde=True)
plt.show()

**The histogram, accompanied by kernel density estimates (KDE) and color-coded survival status, offers a comprehensive view of age distribution patterns.**

**Observations:**

* Age Distribution: The histogram illustrates a relatively symmetrical distribution of ages among passengers, with a peak in the young adult range (20-30 years). There is also a noticeable presence of children and elderly passengers.

* Survival Patterns: The color-coded bars and KDEs distinguish between passengers who survived and those who did not. Notable variations in survival rates can be observed across different age groups. Children and some elderly passengers appear to have higher survival rates, while there's a dip in survival for young adults.

In [None]:
# Fare Distribution by Survival:
sns.histplot(x='Fare',hue='Survived',data=titanic,kde=True,bins=30)
plt.show()

**The histogram, along with the overlaid kernel density estimates (KDE) and color-coded survival status, offers a comprehensive view of fare distribution patterns.**

**Observations:**

* Fare Distribution: The majority of passengers paid lower fares, with a peak in the lower fare range. However, there is a noticeable spread of higher fares, indicating the presence of passengers who paid premium prices for their tickets.

* Survival Patterns: The color-coded bars and KDEs distinguish between passengers who survived and those who did not. Higher survival rates are evident among passengers who paid higher fares, suggesting a potential correlation between fare and survival.

In [None]:
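#Survival by Pclass and Sex:
# catplot is figure-level and creates its own figure, so size it with height/aspect
# instead of plt.figure (which would only leave an empty extra figure)
sns.catplot(x='Pclass',hue='Sex',col='Survived',kind='count',data=titanic,height=5,aspect=1)
plt.show()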

In [None]:
#Survival by Embarked Location:
sns.countplot(x='Embarked',hue='Survived',data=titanic)
plt.show()

* The count plot reveals variations in survival outcomes among passengers who boarded at different ports. Notably, passengers who embarked at Cherbourg (C) appear to have a higher survival rate compared to those who embarked at Southampton (S) or Queenstown (Q). 

In [None]:
# Boxplot of Age by Pclass:
sns.boxplot(x='Pclass',y='Age',data=titanic)
plt.show()

* The boxplot reveals notable differences in age distributions among the three passenger classes. Passengers in the first class tend to be older on average, with a wider range of ages and potential outliers. In contrast, the second and third classes show relatively younger age distributions. This finding aligns with the expectation that first-class accommodations might have been chosen more frequently by older passengers, while younger individuals might have opted for lower-class accommodations.

# Actions taken for data cleaning and feature engineering

In [None]:
titanic.drop('Fare_Category',axis=1,inplace=True)

In [None]:
titanic.info()

In [None]:
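# Fill missing ports with the most frequent one; assign back rather than using
# inplace on a column selection, which newer pandas versions warn about
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])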

In [None]:
titanic.info()

In [None]:
# Mark missing cabins with the placeholder 'NA'
titanic['Cabin'] = titanic['Cabin'].fillna('NA')

In [None]:
titanic.info()

In [None]:
titanic.Name

In [None]:
# extracting titles from Name
titanic['Salutation']=titanic.Name.apply(lambda name:name.split(',')[1].split('.')[0].strip())
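
The split above relies on names following the pattern `Surname, Title. Given names`. A short sketch on three names written in that style, with a regex-based alternative shown next to it (the helper name `extract_title` is hypothetical):

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" pattern
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

def extract_title(name):
    # Text between the first comma and the following period
    return name.split(",")[1].split(".")[0].strip()

titles = names.apply(extract_title)

# Equivalent vectorized form using a capture group
titles_re = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```

Both forms assume every name contains a comma followed by a period; a name that breaks the pattern would raise an `IndexError` in the split version and yield `NaN` in the regex version.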

In [None]:
# Handling missing values in Age with median
titanic['Age'] = titanic.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))

# If there are any remaining missing values in 'Age', fill them with the overall median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
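
To make the group-wise imputation concrete, here is a sketch on a six-row toy frame (invented values). It uses `transform('median')`, which is an equivalent way to broadcast each group's median back to the rows, rather than the lambda in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; two groups with one missing age each
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# Median age of each (Sex, Pclass) group, aligned to the original rows
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing ages are replaced; known ages stay untouched
filled = toy["Age"].fillna(group_median)
print(filled.tolist())
```

A group whose ages are all missing has no median, which is why a fallback to the overall median is still useful.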

In [None]:
# Creating Age group column for analysis
titanic['Age_group']=pd.cut(titanic['Age'],bins=[0,18,35,50,100],labels=['0-18','19-35','36-50','51+'])

In [None]:
# Creating Family size column for analysis
titanic['Family_Size']=titanic['SibSp']+titanic['Parch']+1

In [None]:
titanic.Family_Size.value_counts()

In [None]:
# creating Fare range column for analysis
titanic['Fare_range']=pd.qcut(titanic['Fare'],q=4,labels=['Low','Medium','High','Veryhigh'])
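
`pd.qcut` places its edges at quantiles, so each range holds roughly the same number of passengers even when fares are heavily skewed. A sketch contrasting it with equal-width bins on made-up fares (values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical skewed fares, not taken from the dataset
fares = pd.Series([0, 7, 8, 10, 15, 30, 80, 500], dtype=float)

# Quantile bins: edges are chosen so the groups are equally populated
quantile_bins = pd.qcut(fares, q=4, labels=['Low', 'Medium', 'High', 'Veryhigh'])

# Equal-width bins: the single large fare stretches the range
equal_width = pd.cut(fares, bins=4)

print(quantile_bins.value_counts(sort=False))
print(equal_width.value_counts(sort=False))
```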

In [None]:
titanic['Fare_range']

In [None]:
titanic.head()

In [None]:
print(titanic['Salutation'].unique().tolist())

In [None]:
# Mapping dictionary for title combinations
title_mapping = {
    'Mme': 'Mrs',
    'Ms': 'Mrs',
    'Mlle': 'Miss',
    'Dr': 'Officer',
    'Rev': 'Officer',
    'Col': 'Officer',
    'Major': 'Officer',
    'Capt': 'Officer',
    'Don': 'Noble',
    'Sir': 'Noble',
    'Lady': 'Noble',
    'the Countess': 'Noble',
    'Jonkheer': 'Noble'
}

In [None]:
titanic['Salutation']=titanic['Salutation'].map(title_mapping).fillna(titanic['Salutation'])
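
`Series.map` with a dictionary returns `NaN` for keys that are not in the dictionary, which is why the original values are used as the fallback. A small sketch with a shortened, illustrative mapping:

```python
import pandas as pd

# A few titles; 'Mr' and 'Master' are deliberately absent from the mapping
titles = pd.Series(["Mr", "Mlle", "Dr", "Master", "Lady"])
mapping = {"Mlle": "Miss", "Dr": "Officer", "Lady": "Noble"}

mapped_only = titles.map(mapping)      # unmapped titles become NaN
combined = mapped_only.fillna(titles)  # restore the original where unmapped

print(combined.tolist())
```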

In [None]:
# Creating Cabin and Nocabin information for analysis
titanic['HasCabin']=titanic['Cabin'].apply(lambda x: 0 if x=='NA' else 1)
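
An alternative sketch that derives the same flag without a placeholder string, assuming the missing cabins are still `NaN` at that point (in this notebook they have already been replaced with 'NA', so this is shown on a toy series only):

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values with two missing entries
cabins = pd.Series(["C85", None, "E46", np.nan])

# notna() is True where a cabin is recorded; cast to 0/1
has_cabin = cabins.notna().astype(int)
print(has_cabin.tolist())
```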

In [None]:
columns=titanic.columns.to_list()
print(columns)

In [None]:
titanic.drop(['PassengerId','Name','Ticket'],axis=1,inplace=True)

# Key Findings and Insights

In [None]:
#Survival Rate by Age Groups:
sns.barplot(x='Age_group',y='Survived',data=titanic)
plt.show()

* Investigating the relationship between age groups and survival rates on the Titanic, it becomes evident that passengers falling within the 0-18 age group exhibited a notably higher survival rate compared to other age brackets. 

In [None]:
#Survival Rate by Embarked Location:
sns.barplot(x='Embarked',y='Survived',data=titanic)
plt.show()

* The bar plot illustrates variations in survival rates based on the embarkation point. Notably, passengers who boarded at Cherbourg (C) appear to have a higher survival rate compared to those embarking at Southampton (S) and Queenstown (Q). While this observation suggests a potential correlation between the port of embarkation and survival outcomes, further statistical analysis is warranted to establish the significance of these differences. Factors such as socio-economic status or cabin locations associated with specific embarkation points could be influencing these variations.
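
One possible form of the further statistical analysis mentioned above is a chi-square test of independence between port and survival. The sketch below computes the statistic by hand with NumPy on invented counts (not the real table) and compares it with 5.991, the 5% critical value for 2 degrees of freedom.

```python
import numpy as np
import pandas as pd

# Hypothetical contingency table: rows are ports, columns are Survived 0/1
observed = pd.DataFrame({0: [40, 60, 140], 1: [60, 40, 60]},
                        index=["C", "Q", "S"])

obs = observed.to_numpy()
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
n = obs.sum()

# Counts expected if port and survival were independent
expected = np.outer(row_totals, col_totals) / n

chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

CRITICAL_5PCT_DOF2 = 5.991  # chi-square critical value, alpha = 0.05, 2 dof
print(chi2, dof, chi2 > CRITICAL_5PCT_DOF2)
```

If SciPy is available, `scipy.stats.chi2_contingency` performs the same computation and also returns a p-value. A significant result would still not separate the effect of the port from that of class or fare.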

In [None]:
#Survival Rate by Family Size (SibSp + Parch + 1):
sns.barplot(x='Family_Size',y='Survived',data=titanic)
plt.show()

* The bar plot shows that passengers with a family size of 4 have a notably higher survival rate than passengers with other family sizes. One possible explanation is that small family groups were able to help each other during the evacuation, but the plot alone cannot confirm this, and some family sizes contain few passengers, so their rates are uncertain.
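
A bar plot of means hides how many passengers stand behind each bar. A minimal sketch, on invented rows, of reporting the survival rate together with the group size:

```python
import pandas as pd

# Hypothetical rows, not the real passengers
toy = pd.DataFrame({
    "Family_Size": [1, 1, 1, 1, 2, 2, 4, 4],
    "Survived":    [0, 0, 1, 0, 1, 0, 1, 1],
})

# mean = survival rate, count = number of passengers in the group
summary = toy.groupby("Family_Size")["Survived"].agg(["mean", "count"])
print(summary)
```

A rate based on only a handful of passengers should be read with more caution than one based on hundreds.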

In [None]:
# Survival Rate by Fare Range:
sns.barplot(x="Fare_range",y="Survived",data=titanic)
plt.show()

* The bar plot illustrates a trend where passengers who paid higher fares experienced higher survival rates, particularly in the 'High' and 'Veryhigh' fare ranges. This finding suggests a potential correlation between the fare paid and the likelihood of survival, reflecting a scenario where passengers with higher-priced tickets might have had access to better accommodations or prioritized evacuation procedures.

In [None]:
#Survival Rate by Title (extracted from Name):
sns.barplot(x='Salutation',y='Survived',data=titanic)
plt.show()

* The bar plot illustrates variations in survival rates based on the title or salutation associated with each passenger. Notably, individuals with titles such as 'Mrs' and 'Miss' tend to exhibit higher survival rates compared to titles like 'Mr' or other titles. This observation suggests that societal norms or perhaps certain characteristics associated with specific titles might have influenced survival outcomes. 

In [None]:
# Survival Rate by Ticket Class and Fare:
sns.scatterplot(x='Fare',y='Pclass',hue='Survived',data=titanic)
plt.show()

* The scatter plot reveals interesting insights into the distribution of fares across different passenger classes and their corresponding survival outcomes. Passengers in lower classes (higher Pclass values) generally paid lower fares, and unfortunately, a considerable number did not survive. On the other hand, passengers in higher classes (lower Pclass values) often paid higher fares and displayed a higher survival rate. This observation aligns with the established notion that higher-class passengers had access to better amenities and potentially received preferential treatment during evacuation. The hue encoding of survival status allows for a clear differentiation between survivors and non-survivors within each class-fare combination.

In [None]:
# Survival Rate by Cabin vs. No Cabin:
sns.barplot(x='HasCabin',y='Survived',data=titanic)
plt.show()

* The bar plot illustrates that passengers who had a recorded cabin have a higher survival rate compared to those without a recorded cabin. This observation suggests that having a cabin might have correlated with a higher chance of survival. Passengers with recorded cabin information may have been located in specific areas of the ship or had certain privileges that contributed to their increased likelihood of survival.