# Exploratory Data Analysis on Titanic Dataset

This notebook performs an exploratory data analysis (EDA) on the Titanic dataset to uncover insights about passenger survival using statistical and visual methods. We will use Python libraries such as Pandas, Matplotlib, and Seaborn to analyze the data, identify trends, and visualize relationships. The deliverables include this Jupyter Notebook and a PDF report summarizing the findings.

## 1. Import Libraries and Load Data
We start by importing the necessary libraries and loading the Titanic dataset from a publicly hosted CSV copy on GitHub.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the dataset (using a publicly available Titanic dataset)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)

# Display the first few rows
df.head()

## 2. Initial Data Exploration
We use `.info()`, `.describe()`, and `.value_counts()` to understand the dataset's structure, summary statistics, and distribution of categorical variables.

In [None]:
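# Check dataset info (prints column dtypes and non-null counts)
df.info()

# Summary statistics (printed explicitly, since a cell only auto-displays its last expression)
print(df.describe())

# Check for missing values
print(df.isnull().sum())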

# Value counts for categorical variables
print('Survived:\n', df['Survived'].value_counts())
print('Pclass:\n', df['Pclass'].value_counts())
print('Sex:\n', df['Sex'].value_counts())
print('Embarked:\n', df['Embarked'].value_counts())

**Observations**:
- The dataset has 891 rows and 12 columns.
- Missing values: `Age` (177), `Cabin` (687), `Embarked` (2).
- `Survived` is binary (0 = No, 1 = Yes), with 549 non-survivors and 342 survivors.
- `Pclass` (passenger class) has three categories: 1, 2, 3.
- `Sex` is categorical with 'male' and 'female'.
- `Embarked` has three ports: S (Southampton), C (Cherbourg), Q (Queenstown).
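
The missing-value counts above are easier to compare as percentages of the row count. The following is a minimal sketch rather than part of the original analysis: `missing_summary` is a hypothetical helper name, and the `toy` frame holds made-up values, not Titanic rows.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame (made-up values) used only to show the shape of the output
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})
print(missing_summary(toy))
```

Applied to the loaded `df`, the same helper should list `Cabin`, `Age`, and `Embarked`, in line with the counts reported above.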

## 3. Data Cleaning
Handle missing values and prepare the data for analysis.

In [None]:
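# Fill missing Age with the median; assign the result back, because
# chained `inplace=True` on a column raises a FutureWarning in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Embarked with the mode (the most frequent port)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])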

# Drop Cabin due to excessive missing values
df.drop(columns=['Cabin'], inplace=True)

# Verify missing values
df.isnull().sum()

**Observations**:
- `Age` missing values filled with median (28.0).
- `Embarked` missing values filled with mode ('S').
- `Cabin` dropped due to 77% missing data.
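
One way to keep these cleaning steps repeatable is to wrap them in a function that returns a new frame and leaves the raw data untouched. This is a sketch under that assumption: `clean_titanic` is a hypothetical name, and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def clean_titanic(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a copy of `raw` and return the copy."""
    cleaned = raw.copy()
    # Median imputation for Age
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
    # Mode imputation for Embarked (mode() ignores missing values)
    cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
    # Cabin is mostly missing, so it is dropped
    return cleaned.drop(columns=["Cabin"])


# Toy rows (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})
cleaned = clean_titanic(toy)
print(cleaned)
print("Remaining missing values:", int(cleaned.isnull().sum().sum()))
```

Because the function works on a copy, the raw frame stays available for checks such as comparing the age distribution before and after imputation.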

## 4. Visualizations and Insights
We use histograms, boxplots, scatterplots, pairplots, and heatmaps to identify relationships and trends.

### 4.1 Univariate Analysis
#### Histogram of Age

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Age Distribution of Passengers')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('age_histogram.png')
plt.close()

**Observations**:
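- The age distribution is right-skewed, with a peak around 20-30 years. Part of that peak is an artifact of the cleaning step: the 177 imputed ages all sit at the median (28.0).
- Most passengers are young adults; children and elderly passengers are comparatively few.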

#### Boxplot of Fare

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(y=df['Fare'])
plt.title('Fare Distribution')
plt.ylabel('Fare')
plt.savefig('fare_boxplot.png')
plt.close()

**Observations**:
- The fare distribution is strongly right-skewed, with many high-value outliers.
- Most fares are below 100, but a few passengers paid far more (up to about 512).
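
By default, the boxplot's whiskers reach out to 1.5 times the interquartile range (IQR) from the box, and points beyond them are drawn as outliers. The sketch below reproduces that rule numerically. `iqr_outlier_bounds` is a hypothetical helper, and `toy_fares` is illustrative data rather than the real `Fare` column.

```python
import pandas as pd


def iqr_outlier_bounds(values: pd.Series, whis: float = 1.5) -> tuple:
    """Return the (low, high) limits outside which a value counts as an outlier."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr


# Illustrative fares (made up for this sketch)
toy_fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 512.33])
low, high = iqr_outlier_bounds(toy_fares)
outliers = toy_fares[(toy_fares < low) | (toy_fares > high)]
print(f"Limits: {low:.2f} to {high:.2f}")
print("Outliers:", outliers.tolist())
```

Calling `iqr_outlier_bounds(df['Fare'])` on the real data would give the fare above which points appear as individual markers in the plot.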

### 4.2 Bivariate Analysis
#### Survival Rate by Sex

In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(x='Sex', y='Survived', data=df)
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.savefig('survival_by_sex.png')
plt.close()

**Observations**:
- Females had a much higher survival rate (~74%) than males (~19%).
- This is consistent with a 'women and children first' practice during evacuation, although the data alone cannot confirm the cause.

#### Survival Rate by Passenger Class

In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(x='Pclass', y='Survived', data=df)
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.savefig('survival_by_pclass.png')
plt.close()

**Observations**:
- 1st class passengers had the highest survival rate (~63%), followed by 2nd class (~47%), and 3rd class (~24%).
- Higher socio-economic status likely influenced access to lifeboats.
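
The bar heights in the two plots above are group means of `Survived`, so the quoted rates can also be computed directly with a `groupby`. This is a minimal sketch: `survival_rate` is a hypothetical helper, and the toy frame contains made-up passengers.

```python
import pandas as pd


def survival_rate(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Return the share of survivors and the group size for each category of `by`."""
    return frame.groupby(by)["Survived"].agg(rate="mean", passengers="count")


# Toy passengers (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 2, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})
by_sex = survival_rate(toy, "Sex")
by_class = survival_rate(toy, "Pclass")
print(by_sex)
print(by_class)
```

Reporting the group size next to each rate helps judge how much weight a percentage deserves, which the bar chart only hints at through its error bars.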

#### Scatterplot of Age vs. Fare

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Fare', hue='Survived', data=df)
plt.title('Age vs. Fare by Survival')
plt.savefig('age_vs_fare_scatter.png')
plt.close()

**Observations**:
- No clear linear relationship between age and fare.
- Higher fares are associated with survival, especially for passengers paying >100.
- Survivors are spread across all ages.
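
The visual impression that higher fares go with survival can be checked by grouping fares into bands and comparing survival rates. The sketch below assumes band edges of 0, 50, and 100, which are illustrative choices and not values taken from the analysis. `survival_by_fare_band` is a hypothetical helper, and the toy rows are made up.

```python
import pandas as pd


def survival_by_fare_band(frame: pd.DataFrame,
                          edges=(0, 50, 100, float("inf"))) -> pd.DataFrame:
    """Return the survival rate and passenger count per fare band."""
    # right=False gives left-closed bands such as [0, 50), so a fare of 0 is included
    bands = pd.cut(frame["Fare"], bins=list(edges), right=False).rename("fare_band")
    return frame.groupby(bands, observed=False)["Survived"].agg(
        rate="mean", passengers="count"
    )


# Toy rows (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Fare": [7.25, 8.05, 26.0, 71.28, 83.16, 151.55, 263.0],
    "Survived": [0, 0, 1, 1, 0, 1, 1],
})
by_band = survival_by_fare_band(toy)
print(by_band)
```

On the real data, the top band would contain few passengers, so its rate should be read with that small count in mind.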

### 4.3 Multivariate Analysis
#### Pairplot of Numerical Features

In [None]:
sns.pairplot(df[['Age', 'Fare', 'Pclass', 'Survived']], hue='Survived')
plt.savefig('pairplot.png')
plt.close()

**Observations**:
- `Pclass` and `Fare` show some separation between survivors and non-survivors.
- Lower `Pclass` values (1 = 1st class) and higher `Fare` are associated with survival.
- `Age` does not show a strong pattern with survival.

#### Correlation Heatmap

In [None]:
plt.figure(figsize=(10, 8))
corr = df[['Age', 'Fare', 'Pclass', 'SibSp', 'Parch', 'Survived']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.savefig('correlation_heatmap.png')
plt.close()

**Observations**:
- `Survived` has a moderate negative correlation with `Pclass` (-0.34). Because `Pclass` is coded 1 for 1st class and 3 for 3rd, the negative sign means higher classes had better survival odds.
- `Fare` has a positive correlation with `Survived` (0.26), suggesting wealthier passengers were more likely to survive.
- `Age` has a weak negative correlation with `Survived` (about -0.08 on the observed ages, and somewhat closer to zero after median imputation), indicating age alone is not a strong linear predictor.
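
To read the heatmap's `Survived` column without scanning the whole grid, the correlations with the target can be extracted and ranked by absolute value. This is a sketch, not the notebook's own method: `correlations_with` is a hypothetical helper, and the toy frame is made up. It relies on `DataFrame.corr(numeric_only=True)`, which computes Pearson correlations and skips text columns such as `Sex`.

```python
import pandas as pd


def correlations_with(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return Pearson correlations with `target`, strongest (by magnitude) first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy passengers (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 70.0, 20.0, 8.0, 7.0, 9.0],
    "Survived": [1, 1, 1, 0, 0, 1],
})
ranked = correlations_with(toy, "Survived")
print(ranked)
```

Since `Survived` is binary, each coefficient only measures a linear association with the 0/1 outcome, so a weak value does not rule out a non-linear effect.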

## 5. Summary of Findings
1. **Survival Rates**:
   - Females had a much higher survival rate (about 74%) than males (about 19%), likely due to prioritization during evacuation.
   - 1st class passengers had the highest survival rate (63%), followed by 2nd (47%) and 3rd class (24%), indicating socio-economic status played a role.

2. **Demographics**:
   - The majority of passengers were young adults (20-30 years), with fewer children and elderly.
   - Age alone was not a strong predictor of survival, though the slight negative correlation suggests younger passengers were marginally more likely to survive.

3. **Economic Factors**:
   - Higher fares were associated with survival, likely because they corresponded to 1st class tickets.
   - The fare distribution was highly skewed, with a few passengers paying significantly more.

4. **Relationships**:
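   - Among the numeric features, `Pclass` and `Fare` showed the strongest correlations with survival. `Sex`, which the heatmap omits because it is categorical, showed the largest gap in survival rates.
   - `Age` was not strongly correlated with the other variables, while `Pclass` and `Fare` were moderately negatively correlated (-0.55), as expected given that 1st class tickets cost more.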

5. **Data Quality**:
   - Missing values in `Age` and `Embarked` were imputed with the median and the mode, respectively.
   - `Cabin` was dropped due to excessive missing data, which may limit analysis of cabin location effects.
