## Data analysis on the Titanic Dataset


<center><img src="https://i.insider.com/5d94e7b0e28ccd0da3290bba?width=1000&format=jpeg&auto=webp" width="700" height="240" style="display:block; margin:auto"/></center>
<p style="text-align: center">
    <b>Sinking of the RMS Titanic</b><br/>
    <b>"Untergang der Titanic" by Willy Stöwer, 1912</b>
</p>

The sinking of the RMS Titanic is one of the most well-known shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after striking an iceberg, and 1502 of the 2224 passengers and crew died. The catastrophe shocked the international community and prompted stricter ship safety regulations. One factor behind the high death toll was the shortage of lifeboats for the passengers and crew. Although luck played a part, some groups, such as women, children, and the upper class, were more likely to survive than others. This notebook analyses what sorts of people were likely to survive.

**Questions to answer include:**
- Who were the passengers on the Titanic?
- What deck were the passengers on?
- Where did the passengers come from?
- Who was alone and who was with family?
- What factors helped someone survive the sinking?
- Did having a family member on board increase the odds of surviving the sinking?
- Did the deck have an effect on the passengers' survival rate?
- Which sex had the higher survival rate, given family members on board and a particular deck?

In [None]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

##visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Reading the dataset

In [None]:
titanic_df = pd.read_csv('../input/d/johnolutoki/titanic/titanic_data.csv')

In [None]:
titanic_df.head() ##print the first 5 rows of the dataframe

In [None]:
titanic_df.describe()

The `.describe()` method provides basic summary statistics, from which we can already draw some useful conclusions. Moving through the columns:

- The mean of the Survived column is 0.38, so only about 38% of the passengers in the dataset survived; more people perished than survived.

- The passengers ranged in age from infants to an 80-year-old, with an average age of about 29.

- The count for the Age column is lower than the counts for the other columns, which indicates missing (null) values that we should be aware of.
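
Both observations can also be checked directly instead of being read off the `.describe()` table. The following is a minimal sketch on a small made-up sample (`sample_df` and its values are hypothetical and are not rows from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only; not taken from the real dataset.
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0],
    'Age': [22.0, 38.0, np.nan, 35.0, np.nan],
})

# The mean of a 0/1 column is the share of rows equal to 1 (the survival rate).
survival_rate = sample_df['Survived'].mean()

# Missing values per column; a non-zero entry explains a lower count in describe().
missing_per_column = sample_df.isna().sum()

print(f"Survival rate: {survival_rate:.2f}")
print(missing_per_column)
```

Running the same two expressions on `titanic_df` should show which columns contain nulls and confirm the share of survivors.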

In [None]:
titanic_df.info()

### Column descriptions provided by Kaggle

- Passenger ID: unique ID that identifies each passenger
- Survived: 0 means the passenger did not survive, 1 means the passenger survived
- Pclass: passenger class (1, 2 or 3)
- SibSp: number of siblings/spouses on board the Titanic
- Parch: number of parents/children on board the Titanic
- Ticket: ticket number
- Cabin: cabin number where the passenger stayed
- Embarked: port where the passenger boarded the Titanic (C = Cherbourg, Q = Queenstown, S = Southampton)

## Visualising the dataset to understand its distributions

In [None]:
# Set the figure size and the transparency levels shared by the plots.
# A moderate dpi keeps the rendered image to a manageable size.
fig = plt.figure(figsize=(30,10), dpi=100)
alpha_scatterplot = 0.2
alpha_bar_chart = 0.55

# subplot2grid lets us lay out differently shaped plots on one grid
ax1 = plt.subplot2grid((2,3),(0,0))
# Bar chart of those who survived vs. those who did not
titanic_df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
# Widen the x-axis limits so the bars are not flush with the plot edges
ax1.set_xlim(-1, 2)
# Add a title to the plot
plt.title("Distribution of Survival, (1 = Survived)")

plt.subplot2grid((2,3),(0,1))
plt.scatter(titanic_df.Survived, titanic_df.Age, alpha=alpha_scatterplot)
# Set the y-axis label
plt.ylabel("Age")
# Show horizontal grid lines at the major y ticks
# (the old b= keyword was renamed to visible= in matplotlib 3.5, so pass it positionally)
plt.grid(True, which='major', axis='y')
plt.title("Survival by Age,  (1 = Survived)")

ax3 = plt.subplot2grid((2,3),(0,2))
titanic_df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart)
ax3.set_ylim(-1, len(titanic_df.Pclass.value_counts()))
plt.title("Class Distribution")

plt.subplot2grid((2,3),(1,0), colspan=2)
# Kernel density estimate of age for the passengers in each class
titanic_df.Age[titanic_df.Pclass == 1].plot(kind='kde')
titanic_df.Age[titanic_df.Pclass == 2].plot(kind='kde')
titanic_df.Age[titanic_df.Pclass == 3].plot(kind='kde')
# Label the x-axis
plt.xlabel("Age")
plt.title("Age Distribution within classes")
# Add a legend for the three curves
plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best')

ax5 = plt.subplot2grid((2,3),(1,2))
titanic_df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
ax5.set_xlim(-1, len(titanic_df.Embarked.value_counts()))
# Add a title to the boarding-location plot
plt.title("Passengers per boarding location")
plt.show()

## 1. Who were the passengers onboard the Titanic?

In [None]:
## draw a count plot of the Sex column (male or female)
sns.catplot(x='Sex',data=titanic_df, kind='count')
plt.title("Sex of the passengers on board")
plt.show()

Dividing the passengers by class and by gender shows that more than 50% of the passengers in class 3 were male.

In [None]:
##define a function that labels a passenger as 'child' (under 16) or otherwise by sex
def male_female_child(passenger):
    age,sex=passenger
    
    if age<16:
        return 'child'
    else:
        return sex

In [None]:
##apply the function and create a new column
titanic_df['Person'] = titanic_df[['Age','Sex']].apply(male_female_child, axis=1)

In [None]:
titanic_df['Person'].value_counts()
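
The row-wise `apply` above works, but the same labelling can be written as a single vectorised step. This is only a sketch of an alternative, not the notebook's method; `sample_df` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only; not taken from the real dataset.
sample_df = pd.DataFrame({
    'Age': [4.0, 30.0, np.nan, 15.0, 16.0],
    'Sex': ['male', 'female', 'male', 'female', 'male'],
})

# True where the passenger is younger than 16. A missing Age compares as False,
# so those rows keep their Sex value, as in the row-wise function.
is_child = sample_df['Age'] < 16

# Series.mask replaces the values where the condition is True.
sample_df['Person'] = sample_df['Sex'].mask(is_child, 'child')

print(sample_df)
```

On a dataset of this size the speed difference is negligible; the vectorised form mainly avoids unpacking each row by position.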

In [None]:
sns.catplot(x='Pclass',data=titanic_df,hue='Person',kind='count')
plt.title("Passengers in their respective classes and by gender")
plt.show()

There were far more male than female passengers on board the Titanic.

In [None]:
## draw a count plot of each sex, split by passenger class
sns.catplot(x='Sex',data=titanic_df,kind ='count',hue='Pclass')
plt.title('Class distribution within each sex')
plt.show()

In [None]:
sns.catplot(x='Pclass',data=titanic_df,kind='count',hue='Sex')
plt.title('Sex distribution within each class')
plt.show()

In [None]:
## plot a KDE of age for each person type (male, female, child)
fig = sns.FacetGrid(titanic_df, hue='Person', aspect=4)
# fill=True replaces shade=True, which was deprecated in seaborn 0.11
fig.map(sns.kdeplot, 'Age',fill=True)

oldest = titanic_df['Age'].max()
fig.set(xlim =(0,oldest))

fig.add_legend()
plt.title('Age density by person type (male, female, child)')
plt.show()

In [None]:
## plot a KDE of age for each passenger class
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
# fill=True replaces shade=True, which was deprecated in seaborn 0.11
fig.map(sns.kdeplot, 'Age',fill=True)

oldest = titanic_df['Age'].max()
fig.set(xlim =(0,oldest))

plt.title('Age density by passenger class')
fig.add_legend()
plt.show()

KDE (kernel density estimate) plots show a smoothed estimate of how a variable is distributed. The area under each curve adds up to 1 (100%), so the curves compare the shape of the distributions across groups regardless of how many passengers each group contains.
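
To make the "area adds up to 1" property concrete, here is a minimal hand-rolled Gaussian KDE in NumPy. It is a sketch for illustration only: the `ages` values and the bandwidth are made up, and seaborn chooses its own bandwidth rather than the fixed one assumed here.

```python
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on the grid."""
    values = np.asarray(values, dtype=float)
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical ages for illustration only; not taken from the real dataset.
ages = np.array([2, 19, 22, 24, 28, 30, 35, 41, 54, 62], dtype=float)

# A grid wide enough that the tails of every bump are effectively zero.
grid = np.linspace(-40, 120, 2001)
density = gaussian_kde_curve(ages, grid, bandwidth=5.0)

# Approximate the area under the curve with a simple Riemann sum.
area = density.sum() * (grid[1] - grid[0])
print(f"Area under the KDE curve: {area:.4f}")
```

Because each curve is normalised in this way, a small group such as the children can be compared with a much larger group on the same axes.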

## 2. What deck were the passengers on?

In [None]:
titanic_df.head()

The `Cabin` column has missing values, so we drop them before counting the number of passengers on each deck.

In [None]:
deck = titanic_df['Cabin'].dropna()
deck.head()

In [None]:
levels = []

#take the first letter of each cabin number, which identifies the deck
for level in deck:
    levels.append(level[0])

#print(levels)

cabin_df = pd.DataFrame(levels)
cabin_df.columns= ['Cabin']


#draw a count plot of the passengers on each deck
sns.catplot(x='Cabin',data=cabin_df.sort_values('Cabin'),palette='winter',kind='count')
plt.title('Number of passengers on each deck')
plt.show()

The first letter of the cabin number identifies the deck the cabin is on, so we can use it to see how the passengers were distributed across the decks.
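
The deck letter can also be extracted without an explicit loop by using the pandas string accessor. A small sketch, assuming made-up cabin values (the `cabins` Series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical cabin numbers for illustration only; NaN marks a missing cabin.
cabins = pd.Series(['A10', np.nan, 'B20', 'B21', np.nan, 'D5'], name='Cabin')

# Drop the missing entries, then take the first character of each cabin number.
deck_letters = cabins.dropna().str[0]

# Count the passengers per deck, sorted alphabetically by deck letter.
deck_counts = deck_letters.value_counts().sort_index()
print(deck_counts)
```

Sorting by index gives the decks in alphabetical order, which matches the order used in the plot.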

In [None]:
cabin_df.value_counts()

In [None]:
titanic_df.head()

## 3. Where did the passengers come from?

In [None]:
#Embarked is the port where each passenger boarded; see Kaggle for more details
# S = Southampton, C = Cherbourg, Q = Queenstown

#plot how many passengers of each class boarded at each port
sns.catplot(x='Embarked', data=titanic_df, hue='Pclass', kind='count')
plt.title('Passengers per port of embarkation, by class')
plt.show()

Next, we plot the breakdown of passengers from each port by person type (male, female, or child).

In [None]:
sns.catplot(x='Embarked', data=titanic_df, hue='Person', kind='count')
plt.title('Passengers per port of embarkation, by person type')
plt.show()

## 4. Who was alone and who was with family?

In [None]:
titanic_df.head()

We are interested in learning which passengers travelled alone and which travelled with family. Since no column in the dataset states this directly, we can use the `SibSp` and `Parch` columns for the analysis.

If either `SibSp` or `Parch` has a value greater than 0, the passenger was accompanied by at least one relative: a sibling, spouse, parent, or child.

- SibSp — Number of siblings/spouse onboard the Titanic
- Parch — Number of parents/children onboard the Titanic

Conversely, if both columns have a value of 0, the passenger had no family members on board.

Thus, we can create a new column from the sum of the two columns: 0 means the passenger was alone, and anything greater means they were with family.

In [None]:
#sum SibSp and Parch: any value above zero means the passenger had family on board (sibling, spouse, parent, child)
titanic_df['Alone'] = titanic_df.SibSp + titanic_df.Parch
titanic_df['Alone']

In [None]:
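#label passengers with no relatives on board as 'Alone' and everyone else as 'With Family'
#np.where avoids chained assignment, which pandas may apply to a temporary copy
relatives = titanic_df['SibSp'] + titanic_df['Parch']
titanic_df['Alone'] = np.where(relatives > 0, 'With Family', 'Alone')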

In [None]:
sns.catplot(x='Alone',data=titanic_df,kind='count',palette='Blues')
plt.title('Passengers travelling alone vs. with family')
plt.show()

Plotting the new column as a count plot shows that far more passengers travelled alone than with family.

## 5. What factors helped someone survive the sinking?

In [None]:
titanic_df.head()

In [None]:
sns.catplot(x='Survived', data= titanic_df, kind='count', palette='winter')
plt.title('Passengers who survived (1) vs. did not survive (0)')
plt.show()

We start with the overall survival picture. Sadly, fewer passengers survived the sinking than died in it.

In [None]:
pd.crosstab(titanic_df['Sex'],titanic_df['Survived'])

In [None]:
sns.set_style('whitegrid')
# factorplot was renamed to catplot in seaborn 0.9
rp = sns.catplot(x='Survived', col='Sex', kind='count', data=titanic_df)
rp.fig.subplots_adjust(top=0.9) # leave room above the plots for the title
rp.fig.suptitle('Survival counts by sex (1 = survived, 0 = did not survive)')
plt.show()

As we know from the film and from the historical record, women were given priority during the rescue. The graph above tells the same story: male passengers died at a much higher rate than female passengers.

Next, let's look at how the variables `Pclass` and `Survived` are related.

The following snippet draws a count plot for this.

In [None]:
sns.countplot(x='Survived', hue='Pclass', data=titanic_df)
plt.title('Survival counts by passenger class')

plt.show()

According to the graph, `Pclass` 1 passengers had the highest chance of surviving. First class was intended for the wealthy, while `Pclass` 3 was the cheapest class, and its passengers made up the largest group of victims.
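
Raw counts in a count plot depend on how many passengers each class carried, so survival rates per class are easier to compare than bar heights. A minimal sketch using `pd.crosstab` with row normalisation, on made-up rows (`sample_df` is hypothetical and its rates say nothing about the real data):

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not taken from the real dataset.
sample_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize='index' divides each row by its total, so every class sums to 1.
rates = pd.crosstab(sample_df['Pclass'], sample_df['Survived'], normalize='index')
print(rates)
```

Applied to `titanic_df`, the column labelled 1 would give the survival rate of each class directly.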

Next, let's take a closer look at the `SibSp` column. We again use seaborn's count plot, with the following snippet.

In [None]:
sns.countplot(x='SibSp', data=titanic_df)
plt.title('Distribution of Number of siblings/spouse onboard the Titanic')
plt.show()

The variable `SibSp` gives the number of siblings or spouses each passenger was travelling with. The plot shows that most passengers travelled without a sibling or spouse.

## 6. Did having a family member increase the odds of surviving the sinking?

In [None]:
pd.crosstab(titanic_df['Alone'],titanic_df['Survived'])

In [None]:
#plot alone column vs survived column 

sns.catplot(x='Survived',hue='Alone',data=titanic_df,kind='count')
plt.title('Survivors and non-survivors, alone vs. with family')
plt.show()
##compare the alone and with-family bars within each survival outcome

Passengers travelling with family had roughly a 50% chance of survival, whereas only about 30% of the passengers travelling alone survived.
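
Percentages like these follow from the crosstab counts: survivors in a group divided by the size of that group. A sketch of the calculation with a grouped aggregation, on made-up rows (`sample_df` is hypothetical; the real figures come from `titanic_df`):

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not taken from the real dataset.
sample_df = pd.DataFrame({
    'Alone': ['Alone', 'Alone', 'Alone', 'With Family',
              'With Family', 'Alone', 'With Family', 'With Family'],
    'Survived': [0, 0, 1, 1, 0, 0, 1, 0],
})

# For each group: how many passengers, how many survivors, and the ratio of the two.
summary = sample_df.groupby('Alone')['Survived'].agg(
    passengers='size',
    survivors='sum',
    survival_rate='mean',
)
print(summary)
```

Keeping the group sizes next to the rates makes it clear how much data each percentage rests on.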

## 7. Did the deck have an effect on the passengers' survival rate?

In [None]:
#drop na values in the cabin column
titanic_df = titanic_df.dropna(subset=['Cabin'])

In [None]:
##need to reset index first
titanic_df=titanic_df.reset_index(drop=True)

In [None]:
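#take the deck letter straight from each row's own Cabin value, so it cannot be misaligned
titanic_df['Deck'] = titanic_df['Cabin'].str[0]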

titanic_df.head()

In [None]:
#draw a count plot of deck vs. survived to see which deck has the higher survival rate
sns.catplot(x='Deck',hue='Survived',data=titanic_df.sort_values('Deck'),kind='count',palette='hot')
plt.title('Survivors and non-survivors on each deck')
plt.show()

In [None]:
#plot the breakdown by sex on each deck
sns.catplot(x='Deck',hue='Sex',data=titanic_df.sort_values('Deck'),kind='count',palette='Blues')
plt.title('Passengers on each deck, by sex')
plt.show()

## Conclusion
This notebook explored the data and analysed it in more depth to determine who was most likely to have survived the sinking of the Titanic.

These are the conclusions reached from the analysis so far:


- Male passengers outnumbered female passengers on board, but more female passengers survived. According to this analysis, the best odds of survival belonged to a female passenger in her 20s who was travelling with family and staying on deck A or B.

- Male passengers travelling alone were unlikely to survive.
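
Conclusions that combine several factors can be checked with a pivot table of survival rates. The sketch below crosses sex with travelling alone on made-up rows (`sample_df` is hypothetical and its rates are not findings); other columns such as `Deck` could be added to `index` or `columns` in the same way.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not taken from the real dataset.
sample_df = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female'],
    'Alone': ['With Family', 'With Family', 'Alone', 'Alone',
              'Alone', 'With Family', 'Alone', 'Alone'],
    'Survived': [1, 1, 1, 0, 0, 1, 0, 0],
})

# The mean of the 0/1 Survived column in each cell is that group's survival rate.
rate_table = sample_df.pivot_table(
    index='Sex', columns='Alone', values='Survived', aggfunc='mean'
)
print(rate_table)
```

With several factors crossed at once, some cells hold very few passengers, so rates in sparse cells should be read with caution.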

