# The Titanic Dataset
---
**Objective:**

In this notebook, we create visualizations to explore the dataset, then clean it as a next step.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load Dataset

### Data Description
The Titanic dataset is a well-known dataset that contains information about passengers aboard the RMS Titanic, including their demographics, passenger class, ticket information, and survival status. It is often used for data analysis and prediction tasks and consists of the following columns:

- PassengerId: Unique identifier for each passenger
- Survived: Survival status (0 = No, 1 = Yes)
- Pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- Name: Passenger's name
- Sex: Passenger's gender
- Age: Passenger's age in years
- SibSp: Number of siblings/spouses aboard the Titanic
- Parch: Number of parents/children aboard the Titanic
- Ticket: Ticket number
- Fare: Fare paid for the ticket
- Cabin: Cabin number
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


The goal of analyzing the Titanic dataset is often to explore patterns and factors that influenced the survival of passengers during the tragic event.

In [2]:


# Load the Titanic dataset into a DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')




| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# 1. Explore the Dataset

## 1.1 Diagnose data for cleaning

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

- .head() returns the first few rows (the “head” of the DataFrame).
- .tail() returns the last few rows (the “tail” of the DataFrame).
- .info() shows information on each of the columns, such as the data type and the number of non-null values (from which the number of missing values follows).
- .shape returns the number of rows and columns of the DataFrame.
- .columns returns an Index containing the column names.
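
As a minimal sketch of the methods listed above, the cell below rebuilds a few columns of the five preview rows shown earlier by hand (the name `sample` is made up for this illustration) and calls each method on it.

```python
import pandas as pd

# Five-row excerpt copied from the preview earlier in this notebook;
# `sample` is a hypothetical name, and the full dataset has many more rows.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
})

print(sample.head(3))   # first 3 rows
print(sample.tail(2))   # last 2 rows
sample.info()           # dtypes and non-null counts (prints directly)
print(sample.shape)     # (number of rows, number of columns)
print(sample.columns)   # Index of column names
```

Running the same calls on the full `df` reports on every row rather than on this small excerpt.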


Notice that the Age, Cabin, and Embarked columns contain null values.

## 1.2 Exploratory data analysis

### Check for missing Values

Missing data can affect your analysis and models. Determine whether any columns have missing values and decide how to handle them.
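
One way to see why missing values matter is to look at the share of missing entries per column and at how many rows a statistic is actually based on. The sketch below uses made-up rows (the `toy` frame and its values are illustrative assumptions, not Titanic data).

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; they are not taken from the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() marks missing cells as True; the mean of booleans is the missing share.
missing_share = toy.isna().mean()
print(missing_share)

# mean() skips NaN by default, so here it is computed from 2 of the 4 rows.
print(toy["Age"].mean(), toy["Age"].count(), len(toy))
```

A column with a small missing share is usually a candidate for imputation, while a column that is mostly empty may be better dropped; where to draw that line is a judgment call.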

In [None]:
print(df.isnull().sum())   # Count of missing values in each column


### Explore descriptive statistics

In [None]:
print(df.describe())   # Summary statistics for numerical columns


### Data Visualization

In [None]:
import matplotlib.pyplot as plt

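# Example: Histogram of passenger ages (Age is chosen here for illustration)
plt.hist(df['Age'].dropna(), bins=20, edgecolor='black')  # drop NaN before plotting
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.title('Age Distribution')
plt.show()

# Example: Bar plot of passenger counts per class (Pclass is chosen here for illustration)
class_counts = df['Pclass'].value_counts().sort_index()
plt.bar(class_counts.index.astype(str), class_counts.values)
plt.xlabel('Passenger class')
plt.ylabel('Number of passengers')
plt.title('Passengers per Class')
plt.show()
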

## 1.3 Cleaning

### Handling Missing Values

In [None]:
# Create a copy of the DataFrame
df_copy = df.copy()

# Option 1: Remove rows with missing values
df_copy_dropped = df_copy.dropna()

# Option 2: Impute missing values
# Fill missing values with a specific value (0 is an arbitrary placeholder)
df_copy_filled = df_copy.fillna(0)

# Fill missing values with the mean value of the column (numeric columns only)
df_copy_mean = df_copy.fillna(df_copy.mean(numeric_only=True))

# Option 3: Drop columns with excessive missing data
# thresh is the minimum number of non-missing values a column needs to be kept
threshold = 0.5  # Example threshold of 50%
df_copy_dropped_cols = df_copy.dropna(thresh=int(threshold * len(df_copy)), axis=1)

# Print the modified DataFrames
print("DataFrame with dropped rows:\n", df_copy_dropped)
print("DataFrame with filled missing values:\n", df_copy_filled)
print("DataFrame with filled missing values using mean:\n", df_copy_mean)
print("DataFrame with sparse columns dropped:\n", df_copy_dropped_cols)
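
The column mean only exists for numeric columns, so text columns such as Embarked need a different fill value. A minimal sketch, assuming the most frequent value (the mode) is an acceptable stand-in and using made-up rows in a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", "S", np.nan],
})

toy_imputed = toy.copy()

# Numeric column: replace NaN with the column mean.
toy_imputed["Age"] = toy_imputed["Age"].fillna(toy_imputed["Age"].mean())

# Text column: replace NaN with the most frequent value (the mode).
most_common_port = toy_imputed["Embarked"].mode()[0]
toy_imputed["Embarked"] = toy_imputed["Embarked"].fillna(most_common_port)

print(toy_imputed)
```

Choosing the fill strategy per column keeps each column's data type intact, which a single blanket fill value does not.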


### Data Types

In [None]:
# Create a copy of the DataFrame
df_copy = df.copy()

# Convert the "Survived" column to boolean (0 -> False, 1 -> True)
df_copy['Survived'] = df_copy['Survived'].astype(bool)

# Print the modified column
print(df_copy['Survived'])


## 1.4 Feature Correlation

In [None]:
import matplotlib.pyplot as plt

# Select the columns of interest
columns_of_interest = ['Age', 'Fare', 'SibSp', 'Parch']

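# Scatter plots: one panel per pair of columns (pandas' scatter_matrix is one option)
pd.plotting.scatter_matrix(df[columns_of_interest], figsize=(8, 8), alpha=0.5)
plt.suptitle('Pairwise Scatter Plots')
plt.show()
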
# Correlation heatmap
correlation_matrix = df[columns_of_interest].corr()
plt.figure(figsize=(8, 6))
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')
plt.colorbar()
tick_marks = range(len(columns_of_interest))
plt.xticks(tick_marks, columns_of_interest, rotation=45)
plt.yticks(tick_marks, columns_of_interest)
plt.title('Correlation Heatmap')
plt.show()
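
A heatmap shows the overall pattern, but the numbers behind it are often easier to compare as a ranked list. The sketch below is one possible way to do that, using three numeric columns from the five preview rows shown earlier (the names `sample`, `upper`, and `pairs` are made up for this illustration).

```python
import numpy as np
import pandas as pd

# Values copied from the five preview rows earlier in this notebook.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "SibSp": [1, 1, 0, 1, 0],
})

corr = sample.corr()

# Keep each pair once: mask everything except the upper triangle
# (k=1 also excludes the diagonal, where every value is 1).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) -> correlation and sort by absolute strength.
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

With only five rows these correlations are not meaningful; the same steps applied to `df[columns_of_interest]` use the full dataset.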

