
This notebook performs exploratory data analysis on the "Titanic: Machine Learning from Disaster" dataset from Kaggle. <br><br>
**Dataset Overview** <br>
This dataset is a classic example used in machine learning and data analysis, containing information about the passengers of the Titanic, which sank in 1912. <br><br>

**Dataset Description** <br>
The dataset includes the following files:

**train.csv:** The training set, containing features and the target variable.<br>
**test.csv:** The test set, containing only features (no `Survived` column); predictions for these passengers are submitted for evaluation.<br>
**gender_submission.csv:** An example of a submission file in the correct format. <br><br>
**Features in the Dataset**<br>
**PassengerId:** Unique ID for each passenger <br>
**Survived:** Target variable indicating if the passenger survived (1) or not (0)<br>
**Pclass:** Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)<br>
**Name:** Passenger’s name <br>
**Sex:** Passenger’s gender <br>
**Age:** Passenger’s age <br>
**SibSp:** Number of siblings/spouses aboard the Titanic <br>
**Parch:** Number of parents/children aboard the Titanic <br>
**Ticket:** Ticket number <br>
**Fare:** Fare paid for the ticket <br>
**Cabin:** Cabin number <br>
**Embarked:** Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

# **Loading Libraries and Data** <br>
Libraries used: <br>
**Pandas:** For data manipulation <br>
**Numpy:** For numerical operations <br>
**Matplotlib:** For plotting <br>
**Seaborn:** For statistical data visualization

In [1]:
#Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Setting visualization styles
sns.set(style="whitegrid")

#Mounting google drive
from google.colab import drive
drive.mount('/content/drive')

#Changing directory
%cd /content/drive/My Drive/Colab Notebooks/Data Analytics - IBT/Titanic

#Loading the dataset
file_path = "train.csv"
df = pd.read_csv(file_path)

#Displaying the first few rows of the dataset
df.head()

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/Data Analytics - IBT/Titanic


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# **Initial Data Exploration**
In this step, we will conduct an initial exploration of the dataset to understand its structure and basic characteristics. We will:<br>
1. Check the shape of the dataset <br>
2. Display the data types of each column <br>
3. Get a summary of the dataset using descriptive statistics

In [2]:
#Checking the shape of the dataset
print(f"The data set contains {df.shape[0]} rows and {df.shape[1]} columns.")

#Displaying the data types of each column
print("\nData types of each column:")
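print(df.dtypes)

#Getting a summary of the dataset using descriptive statistics
print("\nSummary of the dataset:")
print(df.describe().T)

The data set contains 891 rows and 12 columns.

Data types of each column:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object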

Summary of the dataset:
             count        mean         std   min       25%       50%    75%       max
PassengerId  891.0  446.000000  257.353842  1.00  223.5000  446.0000  668.5  891.0000
Survived     891.0    0.383838    0.486592  0.00    0.0000    0.0000    1.0    1.0000
Pclass       891.0    2.308642    0.836071  1.00    2.0000    3.0000    3.0    3.0000
Age          714.0   29.699118   14.526497  0.42   20.1250   28.0000   38.0   80.0000
SibSp        891.0    0.523008    1.102743  0.00    0.0000    0.0000    1.0    8.0000
Parch        891.0    0.381594    0.806057  0.00    0.0000    0.0000    0.0    6.0000
Fare         891.0   32.204208   49.693429  0.00    7.9104   14.4542   31.0  512.3292


# **Findings**
**1. Dataset Overview:**
*   The dataset contains 891 rows and 12 columns.
*   Features include both **numerical** (e.g., Age, Fare) and **categorical** (e.g., Sex, Embarked) variables. <br>

**2. Key Insights:**
*   **Survival Rate:** Approximately 38.38% of the passengers in the training set survived (mean of `Survived` = 0.3838).
*   **Class Distribution:** The median of `Pclass` is 3, so at least half of the passengers travelled in 3rd class.
*   **Missing Values:** Age is recorded for only 714 of 891 passengers, leaving 177 values (about 20%) missing.
*   **Fare:** Ticket prices varied widely and are right-skewed: the mean fare is 32.20, the median is 14.45, and the maximum is 512.33.
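
The figures above can be re-derived directly from the DataFrame. The following is a minimal sketch that uses a small hypothetical table (`sample`, with made-up rows) in place of `train.csv`, so the numbers it prints are not those of the real dataset. Applied to `df`, the same calls should reproduce the findings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with made-up rows; swap in df loaded from train.csv
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 2, 3],
    "Age":      [22.0, 38.0, 26.0, 35.0, np.nan, 54.0, np.nan, 2.0],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 13.00, 21.08],
})

# Survival rate: the mean of a 0/1 column is the share of ones
survival_rate = sample["Survived"].mean() * 100

# Class distribution: share of passengers per class, most common first
class_share = sample["Pclass"].value_counts(normalize=True)

# Missing values per column
missing = sample.isnull().sum()

print(f"Survival rate: {survival_rate:.2f}%")
print(class_share)
print(missing[missing > 0])
print(sample["Fare"].agg(["mean", "median", "max"]))
```

Comparing the mean and median of `Fare` in the last line is a quick way to spot skew before plotting anything.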





# **Identifying Categorical and Numerical Variables**
**Categorical Variables:** Variables that represent categories or labels. <br>
**Numerical Variables:** Variables that represent quantitative data. <br>

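In this step, we will create two separate lists for categorical and numerical variables based on their data types. Note that a split based purely on data types places integer-coded categories such as Survived and Pclass among the numerical variables.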

In [3]:
#Identifying categorical and numerical variables
categorical_vars = df.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_vars = df.select_dtypes(include=['number']).columns.tolist()

print("Categorical Variables:", categorical_vars)
print("Numerical Variables:", numerical_vars)

Categorical Variables: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical Variables: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
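
The dtype-based split places `Survived` and `Pclass` among the numerical variables even though they encode categories. One possible refinement, sketched below, is to also treat numeric columns with few distinct values as categorical. The helper `split_by_cardinality` and its `max_unique` threshold are hypothetical choices for illustration, not part of the original notebook, and the sample rows are made up.

```python
import pandas as pd

def split_by_cardinality(frame, max_unique=3):
    """Split columns into categorical and numerical lists.

    Non-numeric columns are always categorical; numeric columns with at most
    max_unique distinct values are treated as categorical too.
    The threshold is an arbitrary, illustrative choice.
    """
    categorical, numerical = [], []
    for col in frame.columns:
        is_numeric = pd.api.types.is_numeric_dtype(frame[col])
        if is_numeric and frame[col].nunique() > max_unique:
            numerical.append(col)
        else:
            categorical.append(col)
    return categorical, numerical

# Made-up rows for illustration; pass df instead to apply this to the real data
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 21.08],
})

cat_cols, num_cols = split_by_cardinality(sample)
print("Categorical:", cat_cols)
print("Numerical:", num_cols)
```

On the full dataset, a suitable threshold is best chosen after inspecting `df.nunique()`, since identifier-like columns and count columns such as SibSp and Parch may need to be handled case by case.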


# **Univariate Analysis**
- Involves the examination and analysis of a single variable. <br>
- **Purpose** - describe and summarize the data, understand its distribution, central tendency, and variability. <br>
- Provides the foundational understanding of each variable independently before exploring relationships between multiple variables.<br>
## **Types of Univariate Analysis**
1. **For Numerical Variables**
- **Histogram** - Graphical representation of the distribution of numerical data.
 - Helps in understanding the frequency distribution, identifying skewness, and detecting outliers.
- **Box Plot** - Standardized way of displaying the distribution of data based on a five-number summary: **minimum**, **first quartile (Q1)**, **median**, **third quartile (Q3)**, and **maximum**.
 - Useful in identifying outliers and understanding the spread and symmetry of the data.
- **Summary Statistics** - Includes measures such as mean, median, mode, standard deviation, variance, minimum, maximum, and quartiles which provide a numerical summary of the data's central tendency and dispersion.
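
The numbers behind a histogram and a box plot can be computed directly, which helps when reading the plots. The sketch below uses a made-up list of ages (not taken from the dataset) and assumes the common convention of flagging values more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers, which is also Matplotlib's default whisker rule.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; replace with df["Age"].dropna() on the real data
ages = pd.Series([2, 18, 22, 24, 26, 28, 30, 35, 38, 80], dtype=float)

# Histogram: count how many values fall into each equal-width bin
counts, bin_edges = np.histogram(ages, bins=4)

# Quartiles used by the box plot
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are drawn as outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print("Histogram counts:", counts, "bin edges:", bin_edges)
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
print(f"Skewness: {ages.skew():.2f}")
```

With this rule the whiskers of a box plot stop at the most extreme values inside the fences, so the plotted whisker ends can differ from the raw minimum and maximum when outliers are present.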