# Titanic Survival Prediction: Exploratory Data Analysis (EDA)

This notebook walks through an exploratory analysis of the Titanic dataset. The aim is to uncover patterns and insights that might help explain passenger survival outcomes.

**Dataset:** `tested.csv` (Titanic passenger details)

**Key Questions:**
- What types of people were more likely to survive?
- Do age, gender, or passenger class play important roles?
- Can we find hidden patterns in family size, deck, or titles?

Let's dive in! 🚢

## 1. Setup

We'll start by importing the libraries we need and loading the dataset.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure settings for better visuals
sns.set(style="whitegrid")
pd.set_option('display.max_columns', None)

# Load dataset
df = pd.read_csv("tested.csv")
df.head()

## 2. First Look at the Data

Before diving deeper, let's check the dataset shape, data types, and missing values.

In [None]:
print("Shape of data:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

df.describe(include="all")

**Observation:**
- We have 418 rows and 12 columns.
- `Age` has missing values (~20%).
- `Cabin` is mostly missing, so we might drop it or simplify it into `Deck`.
- `Fare` has only 1 missing value.
- The target column is `Survived` (0 = No, 1 = Yes).
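
To put numbers behind observations like these, the missing-value counts can be turned into percentages per column. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the small `toy` frame is made-up data standing in for `df`.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 40.0],
    "Cabin": [np.nan, np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 8.05, np.nan, 53.1, 13.0],
    "Sex": ["male", "female", "male", "female", "male"],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real data would give the same table for all 12 columns.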

## 3. Feature Engineering (Making Data More Useful)

To help our analysis, we'll create some extra features:
- **Deck**: Extracted from Cabin.
- **FamilySize**: Based on siblings/spouses + parents/children.
- **Title**: Extracted from passenger names.
- **Age_missing**: Flag to indicate missing ages.

In [None]:
# Extract Deck from Cabin (first character); missing cabins become 'U' (unknown)
df['Deck'] = df['Cabin'].str[0].fillna('U')

# Create FamilySize
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Extract Title from Name
def extract_title(name):
    if pd.isna(name): return "Unknown"
    parts = name.split(",")
    if len(parts) > 1:
        title = parts[1].split()[0]
        return title.replace(".", "")
    return "Unknown"

df['Title'] = df['Name'].apply(extract_title)

# Missing Age flag
df['Age_missing'] = df['Age'].isnull()

df[['Name','Title','Deck','FamilySize','Age_missing']].head()
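
As an alternative to the row-wise `extract_title` function, the title can also be pulled out with a vectorized regular expression. This is only a sketch under the assumption that names follow the "Surname, Title. Given names" layout. The names below are invented for illustration. One difference worth noting: the regex keeps multi-word titles whole, whereas splitting on whitespace keeps only the first word.

```python
import pandas as pd

# Invented names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Ann Smith)",
    "Example, the Countess. of (Mary Sample)",
    None,
])

# Capture everything between the comma and the first period
titles = (
    names.str.extract(r",\s*([^.]+)\.", expand=False)
         .str.strip()
         .fillna("Unknown")
)
print(titles.tolist())
```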

## 4. Univariate Analysis

Let's look at single variables to understand distributions and spot any unusual patterns.

In [None]:
# Numeric distributions
for col in ['Age','Fare','FamilySize']:
    plt.figure(figsize=(10,4))
    plt.subplot(1,2,1)
    sns.histplot(df[col], bins=30, kde=True)
    plt.title(f"{col} Distribution")
    
    plt.subplot(1,2,2)
    sns.boxplot(x=df[col])
    plt.title(f"{col} Boxplot")
    plt.show()

**Observations:**
- Most passengers were between 20 and 40 years old, but there are children and elderly passengers too.
- Fare is extremely right-skewed: a few very expensive tickets stretch the tail far beyond the bulk of the data.
- Family sizes are small for most, though a few had very large families.
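
A common way to handle a skewed column like `Fare` before plotting or modeling is a log transform. The sketch below uses made-up fares, not values from `tested.csv`, and compares skewness before and after `np.log1p`. Whether the transform is appropriate for the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up fares with one extreme value, mimicking a right-skewed column
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 512.33])

# log1p computes log(1 + x), so a fare of zero stays defined
fare_log = np.log1p(fare)

skew_raw = fare.skew()
skew_log = fare_log.skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```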

In [None]:
# Categorical distributions
categorical = ['Sex','Pclass','Embarked','Title','Deck']
for col in categorical:
    plt.figure(figsize=(8,4))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index)
    plt.title(f"{col} Countplot")
    plt.xticks(rotation=45)
    plt.show()

**Observations:**
- More males than females were on board.
- The majority of passengers were in 3rd class.
- Most people embarked from port 'S'.
- Titles give hints of social status (Mr, Mrs, Miss, etc.).

## 5. Survival Patterns (Bivariate Analysis)

Now let's see how survival relates to key variables like Sex, Class, Age, and Family Size.

In [None]:
# Survival by Sex and Pclass
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
sns.barplot(data=df, x='Sex', y='Survived', estimator=np.mean)
plt.title("Survival Rate by Sex")

plt.subplot(1,2,2)
sns.barplot(data=df, x='Pclass', y='Survived', estimator=np.mean, order=[1,2,3])
plt.title("Survival Rate by Pclass")
plt.show()

**Observation:** Women clearly had a higher survival rate. Also, being in 1st class gave a much better chance of survival compared to 3rd class.
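
The bar charts can be backed up with the underlying numbers. Here is a minimal sketch of how survival rates by group might be tabulated, using a made-up `toy` frame with the same column names as `df`. The group sizes matter as much as the rates, so the count is shown alongside the mean.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Survival rate and group size per sex
rate_by_sex = toy.groupby("Sex")["Survived"].agg(["mean", "count"])

# Survival rate for every Sex x Pclass combination
rate_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)

print(rate_by_sex)
print(rate_table)
```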

In [None]:
# Age vs Survival
plt.figure(figsize=(10,5))
sns.kdeplot(data=df[df['Survived']==0], x='Age', fill=True, label='Did not survive', color='red')
sns.kdeplot(data=df[df['Survived']==1], x='Age', fill=True, label='Survived', color='green')
plt.title("Age Distribution by Survival")
plt.legend()
plt.show()

**Observation:** Children had higher chances of survival compared to adults. Older passengers (60+) had much lower survival rates.
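
A density plot shows where each group's ages concentrate, but not survival rates directly. One way to check a claim about rates is to bin ages and compute the rate per band. The sketch below is an illustration only: the band edges and labels are assumptions chosen for this example, and the `toy` data is made up.

```python
import numpy as np
import pandas as pd

# Made-up ages and outcomes; one missing age to show it is left out
toy = pd.DataFrame({
    "Age": [4, 10, 25, 33, 47, 62, 70, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Hypothetical band edges; intervals are closed on the left, e.g. [0, 12)
bins = [0, 12, 18, 40, 60, np.inf]
labels = ["Child", "Teen", "Adult", "Middle-aged", "Senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Rate and group size per band; empty bands are kept with a count of 0
band_rates = toy.groupby("AgeBand", observed=False)["Survived"].agg(["mean", "count"])
print(band_rates)
```

Rows with a missing `Age` fall into no band and are excluded from the table, which is worth keeping in mind given how many ages are missing.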

In [None]:
# Survival by Embarked
sns.barplot(data=df, x='Embarked', y='Survived', estimator=np.mean)
plt.title("Survival Rate by Embarkation Port")
plt.show()

**Observation:** Passengers embarking from 'C' had higher survival rates compared to 'S' and 'Q'.

## 6. Correlation Heatmap

A correlation heatmap shows which numeric variables correlate with survival.

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(df[['Survived','Pclass','Age','SibSp','Parch','Fare','FamilySize']].corr(), 
            annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Heatmap")
plt.show()

**Observation:** Survival is negatively correlated with Pclass (a lower class number, i.e. a higher class, goes with better survival), positively correlated with Fare, and only weakly related to Age.
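
The heatmap uses Pearson correlation by default, which can be distorted by a heavily skewed column such as `Fare`. As a rough cross-check, the rank-based Spearman correlation can be listed next to it. This sketch uses made-up values, so the numbers themselves mean nothing. Only the pattern of the comparison is the point.

```python
import pandas as pd

# Made-up numeric columns with the dataset's names
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Pclass":   [1, 3, 1, 3, 2, 2],
    "Fare":     [80.0, 7.25, 512.0, 8.05, 13.0, 26.0],
})

# Correlation of every feature with the target, under both methods
pearson = toy.corr(method="pearson")["Survived"].drop("Survived")
spearman = toy.corr(method="spearman")["Survived"].drop("Survived")

comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(comparison)
```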

## 7. Key Insights

- **Gender mattered a lot:** Women were far more likely to survive.
- **Class mattered:** 1st class passengers had much higher chances of survival.
- **Age mattered:** Children had higher survival chances; elderly passengers had poor outcomes.
- **Fare mattered:** Higher fare passengers tended to survive more (likely tied to class).
- **Embarkation mattered:** Those from Cherbourg ('C') had higher survival rates.

These findings align with historical accounts of the Titanic disaster.

## 8. Next Steps

With these insights, the next steps could be:
- Impute missing Age values intelligently (e.g., by Title).
- Encode categorical variables (Sex, Embarked, Title).
- Train machine learning models (Logistic Regression, RandomForest, etc.) to predict survival.
- Evaluate models using cross-validation.
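
The first step in the list, imputing `Age` by `Title`, could look roughly like the sketch below. It fills each missing age with the median age of passengers sharing the same title and falls back to the overall median when a title has no known ages. The `toy` data and the `Age_filled` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up titles and ages; "Dr" has no known age at all
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master", "Dr"],
    "Age": [30.0, np.nan, 40.0, 20.0, np.nan, 5.0, np.nan],
})

# Median age per title, aligned to the original rows
title_median = toy.groupby("Title")["Age"].transform("median")

# Fill by title first, then fall back to the overall median
toy["Age_filled"] = toy["Age"].fillna(title_median).fillna(toy["Age"].median())
print(toy)
```

Writing to a new column keeps the original `Age` intact, so the `Age_missing` flag created earlier stays meaningful.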

But that's for the modeling stage. For now, our EDA provides a solid understanding of the dataset.