# Project 06: Exploratory Data Analysis on the RMS Titanic 

Joanna Farris  
September 30, 2024  

------

## The Disaster

The Titanic disaster is one of history's most infamous maritime tragedies. In this exploratory data analysis (EDA), we examine the Titanic dataset to uncover insights about the passengers aboard the ill-fated ship.

Using various data visualization and statistical methods, we'll explore the dataset for patterns to reveal factors that may have influenced survival during that tragic night.

-----





## The Dataset

The Titanic dataset is a well-known collection of data often used in data analysis and machine learning. It provides information about passengers aboard the RMS Titanic, which tragically sank on its first voyage in April 1912. The dataset includes details like age, gender, class, ticket fare, and survival status. It’s a useful resource for exploring patterns and factors that might have affected survival during this historic event.

#### Key Features:

- **survived**: A binary indicator (0 = No, 1 = Yes) representing whether the passenger survived the disaster.
- **pclass**: The class of the ticket purchased by the passenger (1 = First, 2 = Second, 3 = Third).
- **sex**: The gender of the passenger (male or female).
- **age**: The age of the passenger in years (some entries are missing).
- **sibsp**: The number of siblings or spouses aboard the Titanic.
- **parch**: The number of parents or children aboard the Titanic.
- **fare**: The fare paid for the ticket.
- **embarked**: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
- **class**: The ticket class as text (First, Second, Third), mirroring `pclass`.
- **who**: Whether the passenger is recorded as a man, woman, or child.
- **adult_male**: A boolean flag marking adult male passengers.
- **deck**: The deck letter of the passenger's cabin (most entries are missing).
- **embark_town**: The full name of the port of embarkation, mirroring `embarked`.
- **alive**: The survival status as text (yes or no), mirroring `survived`.
- **alone**: A boolean flag marking passengers with no siblings, spouses, parents, or children aboard.

------




## The Libraries

In [63]:
import requests
import pathlib as path
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np



-----

## The Process

#### Load the Dataset: `df = pd.read_csv()`

The `pd.read_csv()` function loads the Titanic dataset from a CSV file into a Pandas DataFrame, a tabular structure that supports filtering, aggregation, and plotting. After this command runs, the dataset is stored in the variable `df`, which we use for all further exploration and analysis.


In [64]:
df = pd.read_csv("titanic.csv")
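
As a minimal sketch of what `pd.read_csv()` does, the example below parses a small piece of CSV text instead of a file on disk. The three rows use values from the dataset's first rows; the names `csv_text` and `sample_df` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Stand-in for titanic.csv: three rows and six of the dataset's columns.
csv_text = (
    "survived,pclass,sex,age,fare,embarked\n"
    "0,3,male,22.0,7.25,S\n"
    "1,1,female,38.0,71.2833,C\n"
    "1,3,female,26.0,7.925,S\n"
)

# read_csv accepts a file path or any file-like object; StringIO wraps the text.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df).__name__)    # the result is a DataFrame
print(sample_df.columns.tolist())  # column names come from the header row
```

The column types are inferred from the text, so `age` and `fare` arrive as floats while `sex` and `embarked` stay as strings.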

#### Preview the Data: `print(df.head())`

The `df.head()` method returns the first few rows of the dataset, giving us a quick look at the data. By default, it shows the top 5 rows, which helps identify the column names, typical values, and the general structure.


In [65]:
print(df.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

       who  adult_mal
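
A short sketch of how `head()` behaves may help here. It uses a hypothetical ten-row frame named `demo` rather than the Titanic data, and shows the default row count, an explicit count, and the `tail()` counterpart. Note that `head` is a method and only returns rows when it is called with parentheses.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the Titanic DataFrame.
demo = pd.DataFrame({"row": range(10)})

first_five = demo.head()    # default: n=5
first_two = demo.head(2)    # explicit row count
last_three = demo.tail(3)   # same idea, from the end of the frame

print(len(first_five), len(first_two), len(last_three))  # 5 2 3
print(callable(demo.head))  # True: `demo.head` alone is the method, not its result
```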

### The Initial Exploration

Before diving into deeper analysis or visualization, it's important to get a basic understanding of the dataset. The following steps will give us an overview of the Titanic dataset, its structure, and any potential issues we need to address, such as missing values or duplicates. We'll explore the data types, summary statistics, and specific columns to ensure the data is clean and ready for further analysis.

#### Understand Data Shape: `df.shape`

The `df.shape` attribute returns the dataset's dimensions as a `(rows, columns)` tuple. This gives a quick sense of the dataset's size and lets us confirm that the file loaded with the expected number of rows and columns.

In [66]:
df.shape

(891, 15)
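
Because `shape` is a plain tuple, it can be unpacked into separate row and column counts. The sketch below uses a small hypothetical frame (`demo`) rather than the full dataset.

```python
import pandas as pd

# Hypothetical three-row, two-column frame.
demo = pd.DataFrame({"survived": [0, 1, 1], "pclass": [3, 1, 3]})

# shape is (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(f"{n_rows} rows x {n_cols} columns")
```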

#### Summary of the Data: `df.info()`

The `df.info()` function provides an overview of the dataset, including the column names, non-null values, and data types. It’s helpful for understanding the structure of the data and identifying any potential missing values.


In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
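
`df.info()` prints its report rather than returning it. When the non-null counts are needed as data, `DataFrame.count()` gives the same numbers as a Series. The sketch below assumes a small hypothetical frame (`demo`) with some missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns.
demo = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

non_null = demo.count()           # non-null entries per column, as in info()
missing = len(demo) - non_null    # rows minus non-null entries = missing entries

print(non_null)
print(missing)
```

Applied to the output above, the same subtraction gives 891 - 714 = 177 missing `age` values.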


#### Descriptive Statistics: `df.describe()`

The `df.describe()` function generates summary statistics for the numerical columns: the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles (the 50% row is the median). This gives insight into the distribution and spread of the data.


In [68]:
df.describe()

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
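
Two details of `describe()` are easy to miss: the `50%` row is the median, and calling `describe()` on a text column returns a different set of statistics (count, unique, top, freq). The sketch below illustrates both on a five-row excerpt built from the first rows of the dataset; the name `excerpt` is illustrative.

```python
import pandas as pd

# Ages and sexes from the first five rows of the dataset.
excerpt = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "sex": ["male", "female", "female", "female", "male"],
})

num_stats = excerpt.describe()        # numeric columns only by default
sex_stats = excerpt["sex"].describe() # text column: count, unique, top, freq

# The 50% row matches the median of the column.
print(num_stats.loc["50%", "age"], excerpt["age"].median())
print(sex_stats)
```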


#### Check for Missing Data: `df.isnull().sum()`

The `df.isnull().sum()` function checks for missing values in each column of the dataset. It returns a count of null values, helping us identify columns that may require cleaning or imputation.

In [69]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
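
Raw counts are easier to judge as a share of all rows. The sketch below converts the non-zero counts reported above into percentages of the 891 rows; on the full DataFrame the same figures could be obtained with `df.isnull().mean() * 100`. The variable names are illustrative.

```python
import pandas as pd

n_rows = 891  # row count reported by df.shape

# Non-zero missing-value counts taken from the output above.
missing_counts = pd.Series(
    {"age": 177, "embarked": 2, "deck": 688, "embark_town": 2}
)

# Percentage of rows with a missing value, largest first.
missing_pct = (missing_counts / n_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Seen this way, `deck` is missing for roughly three quarters of passengers and `age` for about one in five, while the gaps in `embarked` and `embark_town` are negligible.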

#### Identify Unique Values: `df.nunique()`

The `df.nunique()` function returns the number of unique values in each column, which is useful for understanding the diversity in categorical features and for identifying columns that may have only one or very few unique values.


In [70]:
df.nunique()

survived         2
pclass           3
sex              2
age             88
sibsp            7
parch            7
fare           248
embarked         3
class            3
who              3
adult_male       2
deck             7
embark_town      3
alive            2
alone            2
dtype: int64
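
One way to use these counts is to pick out low-cardinality columns and then look at how often each value occurs with `value_counts()`. The sketch below is an illustration on five rows from the start of the dataset; the threshold of 3 and the names `excerpt` and `low_card` are assumptions made for the example.

```python
import pandas as pd

# Port and fare values from the first five rows of the dataset.
excerpt = pd.DataFrame({
    "embarked": ["S", "C", "S", "S", "S"],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

unique_counts = excerpt.nunique()

# Columns with at most 3 distinct values are candidates for categorical treatment.
low_card = unique_counts[unique_counts <= 3].index.tolist()
print(low_card)

# Frequency of each distinct value in one such column.
port_counts = excerpt["embarked"].value_counts()
print(port_counts)
```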

#### View Data Types: `df.dtypes`

The `df.dtypes` attribute shows the data types of each column in the dataset. This is important for confirming that columns are correctly typed, especially when handling numeric and categorical data.

In [71]:
df.dtypes

survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object
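
Several `object` columns here hold a small, fixed set of labels, so converting them to the `category` dtype is a common follow-up. This is a sketch of one possible approach, not a step taken in the original notebook; it uses a five-row excerpt and an ordered category for `class` so that First < Second < Third.

```python
import pandas as pd

# Sex and class values from the first five rows of the dataset.
excerpt = pd.DataFrame({
    "sex": ["male", "female", "female", "female", "male"],
    "class": ["Third", "First", "Third", "First", "Third"],
})

# Ordered categorical: the listed order defines First < Second < Third.
class_dtype = pd.CategoricalDtype(["First", "Second", "Third"], ordered=True)

converted = excerpt.astype({"sex": "category", "class": class_dtype})
print(converted.dtypes)
print(converted["class"].min())  # smallest category present under the defined order
```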

#### Check for Duplicates: `df.duplicated().sum()`

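The `df.duplicated().sum()` expression flags rows that exactly repeat an earlier row and counts them. It’s a simple way to spot redundant entries that could skew analysis. Note that none of the 15 columns uniquely identifies a passenger, so identical rows may describe different passengers who happen to share the same recorded values; duplicates should be reviewed before any are dropped.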

In [72]:
df.duplicated().sum()

np.int64(107)
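
Before deciding what to do with duplicates, it helps to look at the rows themselves. The sketch below uses hypothetical data (`demo`) to show the difference between counting repeats, listing every row in a duplicated group with `keep=False`, and dropping repeats with `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical rows: the first two are identical in every column.
demo = pd.DataFrame({
    "sex": ["male", "male", "female", "male"],
    "age": [30.0, 30.0, 22.0, 45.0],
    "fare": [8.05, 8.05, 7.25, 8.05],
})

n_dupes = demo.duplicated().sum()               # repeats after the first occurrence
all_copies = demo[demo.duplicated(keep=False)]  # every row that has a twin
deduped = demo.drop_duplicates()                # keeps the first row of each group

print(n_dupes)
print(all_copies)
print(len(deduped))
```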


#### Things to get back to:

**Purpose**

The Titanic dataset serves as an excellent introduction to data science and machine learning concepts, allowing users to explore various data analysis techniques, visualization methods, and predictive modeling algorithms. By analyzing the dataset, one can uncover insights regarding the factors that influenced survival rates among different passenger groups.

**Questions**

- What factors seemed to affect whether someone survived?
- How did things like ticket class and fare relate to survival?
- Were there differences in survival rates based on gender and age?

**Description**

The dataset contains information about various aspects of the passengers, such as age, gender, and class, as well as ticket information and survival status.
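
The questions listed above could be approached with grouped survival rates. The following is only a sketch of the technique on the first five rows of the dataset, which is far too small a sample to support any conclusion; the name `excerpt` is illustrative. Because `survived` is coded 0/1, its mean within a group is that group's survival rate.

```python
import pandas as pd

# survived, pclass, and sex from the first five rows of the dataset.
excerpt = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
})

# Mean of a 0/1 column = share of survivors in each group.
rate_by_sex = excerpt.groupby("sex")["survived"].mean()
rate_by_class = excerpt.groupby("pclass")["survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

On the full DataFrame, the same calls on `df` would give rates based on all 891 rows.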