# **J&J Coding Event at Rutgers**

Exploratory Data Analysis of Titanic Dataset with Pandas, Seaborn, and Matplotlib

DATASET: https://www.kaggle.com/competitions/titanic/data



# **How to do the data analysis?**

**1. What is the problem we are trying to solve?**

**2. Is there any trend or any interesting observation in the data?**


Let's start with some basic questions:

1.) Who were the passengers on the Titanic? (Age, gender, class, etc.)

2.) Who was alone and who was with family?

3.) What deck were the passengers on, and how does that relate to their class?

4.) Does the ticket price relate to their survival?

5.) What factors helped someone survive the sinking?



With the questions listed, we can start exploring the data to understand what it contains.


# **Upload Files/Data**

In [None]:
# First, we need to upload the data file into the notebook.

_____________________________

# **Import Packages**

Packages we are using today:

1. Pandas: https://pandas.pydata.org/docs/reference/index.html
2. Numpy: https://numpy.org/
3. Matplotlib: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylim.html
4. Seaborn: https://seaborn.pydata.org/api.html


In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# **Observe Data**

In [None]:
# Print out and observe the data ----> What kind of data do we have?

# setting plotting style we want
%matplotlib inline
sns.set(style="ticks")

# read the CSV data into a DataFrame (titanic_df)
_________________________
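
A minimal sketch of loading and inspecting the data. It assumes the uploaded file is the `train.csv` from the Kaggle page, in which case the real call would be `pd.read_csv("train.csv")`. So that the sketch runs anywhere, a few made-up rows with the same column names stand in for the file.

```python
import io

import pandas as pd

# Assumption: in the workshop you would call pd.read_csv("train.csv").
# The rows below are invented placeholders, not real passengers.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C\n"
    "3,1,3,Passenger C,female,,0,0,T3,7.92,,S\n"
    "4,0,2,Passenger D,male,35,0,0,T4,13.00,,Q\n"
)
titanic_df = pd.read_csv(sample_csv)

print(titanic_df.head())        # first rows of the table
titanic_df.info()               # column types and non-null counts
print(titanic_df.isna().sum())  # number of missing values per column
```

Checking missing values early is useful because columns such as `Age` and `Cabin` may have gaps that affect later plots.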

**Variable Notes**

PassengerId - Unique ID of the passenger

Survived - Survived (1) or died (0)

Pclass - Passenger’s class (1st, 2nd, or 3rd)

Name - Passenger’s name

Sex - Passenger’s sex

Age - Passenger’s age

SibSp - Number of siblings/spouses aboard the Titanic

Parch - Number of parents/children aboard the Titanic

Ticket - Ticket number

Fare - Fare paid for ticket

Cabin - Cabin number

Embarked - Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)

# **Now it's time to dive into the data**

**1.) Who were the passengers on the Titanic? (Age, gender, class, etc.)**

In [None]:
# First, take a look at the survival count
sns.countplot(data=titanic_df, x='Survived')
plt.title('Passenger Survival Count')
plt.show()

# Countplot of survival by gender
___________________

# Countplot of survival by passenger class
____________________

sns.catplot(data=titanic_df, x='Sex', kind='count')
# 'catplot()': Figure-level interface for drawing categorical plots onto a FacetGrid.
# Now let's split each class by gender by passing 'Sex' to the 'hue' parameter
sns.catplot(data=titanic_df, x='Pclass', hue='Sex', kind='count')
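
One possible way to draw the two survival count plots, sketched on a small made-up table (`toy_df` is a hypothetical stand-in for `titanic_df`, and the rows are invented). Passing a column to `hue` splits each `Survived` bar by that column.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows for illustration only; not taken from the Kaggle file
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "male"],
    "Pclass": [3, 1, 3, 2, 1, 3, 3, 1],
})

# One axis per grouping variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(data=toy_df, x="Survived", hue="Sex", ax=axes[0])
axes[0].set_title("Survival Count by Gender")
sns.countplot(data=toy_df, x="Survived", hue="Pclass", ax=axes[1])
axes[1].set_title("Survival Count by Passenger Class")
plt.show()

# The same counts as a table, handy for reading exact numbers
gender_counts = pd.crosstab(toy_df["Sex"], toy_df["Survived"])
print(gender_counts)
```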


**Adult versus Children**

In [None]:
# Create a new column 'Adult_Child' in which every person under 16 is a child.
___________________________________

# Checking the distribution
print(f"Adult_Child categories : {titanic_df.Adult_Child.unique()}\n=================================")
print(f"Distribution of Adult_Child : \n{titanic_df.Adult_Child.value_counts()}\n=================================")
print(f"Mean age : {titanic_df.Age.mean()}\n=================================")

# How does survival compare between adults and children?
sns.catplot(data=titanic_df, x='Survived', hue='Adult_Child', kind='count')
# Which class were most children in?
sns.catplot(data=titanic_df, x='Pclass', hue='Adult_Child', kind='count')
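
The `Adult_Child` column described in the cell above could be built in several ways. This sketch uses `np.where` on a made-up `toy_df`. The labels `'Child'` and `'Adult'` are assumptions, since the notebook does not fix them.

```python
import numpy as np
import pandas as pd

# Invented ages for illustration, including one missing value
toy_df = pd.DataFrame({"Age": [4.0, 15.0, 16.0, 40.0, np.nan]})

# Under 16 -> 'Child', everyone else -> 'Adult'.
# Caveat: NaN < 16 evaluates to False, so a missing age is labeled 'Adult' here.
toy_df["Adult_Child"] = np.where(toy_df["Age"] < 16, "Child", "Adult")

print(toy_df["Adult_Child"].value_counts())
```

If passengers with a missing age should not be counted as adults, they would need separate handling, for example by dropping or flagging those rows first.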


**Age Distribution**

In [None]:
# visualizing age distribution
# Histogram of passenger age
____________________________________
plt.title('Passenger Age Distribution')
plt.show()




**Next, we would like to understand:**

**2.) Who was alone and who was with family?**

In [None]:
# Create a new column 'Company' that counts siblings/spouses and parents/children aboard
__________________________

In [None]:
# Look for > 0 or == 0 to set "Company" status
titanic_df.loc[titanic_df['Company'] > 0, 'Company'] = 'with Family'
___________________________________________
# Let's check to make sure it worked
titanic_df.head()
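
A sketch of one way to build the `Company` column, assuming it is the sum of `SibSp` and `Parch`. The label `'Alone'` is an assumption; the notebook only names `'with Family'`. The rows are invented.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy_df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Number of relatives aboard for each passenger
family_size = toy_df["SibSp"] + toy_df["Parch"]

# Label in one step: more than zero relatives -> 'with Family', otherwise 'Alone'
toy_df["Company"] = np.where(family_size > 0, "with Family", "Alone")

print(toy_df)
```

Keeping the numeric count in its own variable and writing the labels in one step avoids assigning strings into an integer column, which recent pandas versions warn about.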

**Does traveling with or without family influence survival?**

In [None]:
# Plot survival counts by Company
sns.catplot(data=titanic_df, x='Survived', hue='Company', kind='count')

**3.) How does class relate to survival rate?**

In [None]:
# Countplot of survival by passenger class
_________________________________________________
plt.title('Survival Count by Passenger Class')
plt.show()

sns.catplot(data=titanic_df, x='Sex', kind='count')
# Now let's split each class by gender by passing 'Sex' to the 'hue' parameter
sns.catplot(data=titanic_df, x='Pclass', hue='Sex', kind='count')
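
Count plots can be hard to compare when the classes differ in size, so a survival rate per class is a useful complement. A minimal sketch on made-up rows (`toy_df` is hypothetical): because `Survived` is coded 0/1, its mean within a group is the share of survivors.

```python
import pandas as pd

# Invented rows for illustration only
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Raw counts: rows are classes, columns are Survived (0 or 1)
class_counts = pd.crosstab(toy_df["Pclass"], toy_df["Survived"])

# Share of survivors per class (mean of a 0/1 column)
survival_rate = toy_df.groupby("Pclass")["Survived"].mean()

print(class_counts)
print(survival_rate)
```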

**4.) How about the ticket price?**

**Does it influence the survival rate?**

In [None]:
# Boxplot of passenger fare by passenger class
________________________________________________
plt.title('Passenger Fare by Passenger Class')
plt.show()
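
A box plot of fare per class can be sketched with `sns.boxplot`. The fares below are invented. The median per class is printed alongside, since it is the center line of each box.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows for illustration only
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 52.0, 120.0, 13.0, 26.0, 7.25, 8.05, 7.9],
})

# One box per passenger class
ax = sns.boxplot(data=toy_df, x="Pclass", y="Fare")
ax.set_title("Passenger Fare by Passenger Class")
plt.show()

# Median fare per class, matching the center line of each box
median_fare = toy_df.groupby("Pclass")["Fare"].median()
print(median_fare)
```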

In [None]:
# Swarmplot of passenger fare by passenger class and survival
__________________________
plt.title('Passenger Fare by Passenger Class and Survival')
plt.xlabel('Passenger Class')
plt.ylabel('Fare($)')
plt.show()
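
A swarm plot draws one point per passenger, so adding `hue='Survived'` shows whether survivors cluster at higher fares within a class. A sketch on invented rows (`toy_df` is hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows for illustration only
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 52.0, 120.0, 13.0, 26.0, 7.25, 8.05, 7.9],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

# One point per row, colored by survival
ax = sns.swarmplot(data=toy_df, x="Pclass", y="Fare", hue="Survived")
ax.set_title("Passenger Fare by Passenger Class and Survival")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Fare")
plt.show()
```

On the full dataset a swarm plot can become crowded; `sns.stripplot` takes the same arguments and may be easier to read in that case.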

# **Conclusion**

**5.) What factors helped someone survive the sinking?**
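
One way to approach this question is to look at two factors at once. The sketch below builds a table of survival rates by gender and class with `pivot_table`. The rows are made up, so the numbers it prints say nothing about the real passengers; run the same call on `titanic_df` to draw conclusions.

```python
import pandas as pd

# Invented rows for illustration only
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 1, 3, 1, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows: gender, columns: class, cells: share of survivors in that group
rate_table = toy_df.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table)
```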

