# Titanic Dataset – EDA & Data Cleaning

This notebook performs exploratory data analysis and basic cleaning on the Titanic dataset. It prepares the dataset for future machine learning modeling.

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid')

## 2. Load Data

In [None]:
df = pd.read_csv('../data/titanic.csv')  # Adjust the path if needed
df.head()

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Overview of data
print(df.shape)
df.info()
df.describe(include='all')

In [None]:
# Missing values
missing = df.isnull().sum()
missing[missing > 0]
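
Raw counts are easier to judge next to the share of rows they affect. The following is a minimal sketch on a small made-up frame; the `toy` data and the helper name `missing_summary` are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(frame) * 100).round(1),
    })
    # Keep only affected columns, worst first
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)


# Hypothetical data, for illustration only
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the loaded data would give the same view for the real columns, which helps when deciding whether to impute a column or drop it.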

In [None]:
# Categorical variables distribution
categorical = df.select_dtypes(include='object')
for col in categorical.columns:
    print(df[col].value_counts())
    print('-'*40)

In [None]:
# Numerical variable distribution
numerical = df.select_dtypes(include='number')  # all integer and float columns, regardless of bit width
numerical.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10,6))
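# numeric_only=True skips string columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')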
plt.title('Correlation Matrix')
plt.show()
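
A full heatmap can be dense, so it sometimes helps to look only at how each numeric feature correlates with the target. The sketch below uses made-up rows and the variable name `target_corr` as assumptions; it restricts the frame to numeric columns before computing correlations.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'survived': [0, 1, 1, 0, 1],
    'fare': [7.0, 70.0, 50.0, 8.0, 30.0],
    'pclass': [3, 1, 1, 3, 2],
    'sex': ['male', 'female', 'female', 'male', 'female'],
})

# Keep numeric columns only; string columns cannot be correlated directly
numeric = toy.select_dtypes(include='number')

# Correlation of every numeric feature with the target, sorted ascending
target_corr = numeric.corr()['survived'].drop('survived').sort_values()
print(target_corr)
```

In this toy data `fare` correlates positively and `pclass` negatively with `survived`; the real dataset's values will differ.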

## 4. Data Cleaning

In [None]:
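# Fill missing 'age' with the median; assign back instead of using a chained inplace call,
# which recent pandas versions warn about and which may not modify df
df['age'] = df['age'].fillna(df['age'].median())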
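
To see what median imputation does, here is a small sketch on made-up ages (the `toy` frame is an assumption for illustration). It assigns the filled column back to the frame, a form that behaves the same whether or not pandas' copy-on-write mode is enabled.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps
toy = pd.DataFrame({'age': [22.0, np.nan, 26.0, 35.0, np.nan]})

# median() skips NaN by default, so this is the median of 22, 26 and 35
median_age = toy['age'].median()

# Assign the result back rather than filling in place on a column selection
toy['age'] = toy['age'].fillna(median_age)
print(toy['age'].tolist())  # [22.0, 26.0, 26.0, 35.0, 26.0]
```

The median is less sensitive to a few very old or very young passengers than the mean, which is one common reason to prefer it here.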

In [None]:
# Drop rows with missing 'embarked'
df.dropna(subset=['embarked'], inplace=True)

In [None]:
# Drop 'deck' due to too many missing values
if 'deck' in df.columns:
    df.drop(columns=['deck'], inplace=True)

In [None]:
# Encode 'sex' and 'embarked'
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)
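
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses three made-up passengers (an assumption for illustration) and applies the same two encoding steps.

```python
import pandas as pd

# Hypothetical passengers covering all three embarkation ports
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'embarked': ['S', 'C', 'Q'],
})

# Binary mapping for 'sex'
toy['sex'] = toy['sex'].map({'male': 0, 'female': 1})

# One-hot encode 'embarked'; drop_first removes the first category ('C'),
# so a row with both remaining indicators false means 'C'
encoded = pd.get_dummies(toy, columns=['embarked'], drop_first=True)
print(encoded.columns.tolist())  # ['sex', 'embarked_Q', 'embarked_S']
```

Dropping one indicator avoids a redundant column, which matters for linear models such as logistic regression. Note that `map` returns `NaN` for any value missing from the dictionary, so it is worth checking `df['sex'].isnull().sum()` afterwards.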

## 5. Preprocessing

In [None]:
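# Drop columns that duplicate the target or other features:
# 'alive' mirrors 'survived', 'embark_town' mirrors 'embarked', 'class' mirrors 'pclass';
# 'who', 'adult_male' and 'alone' are derived from sex, age, sibsp and parch
df.drop(columns=['embark_town', 'alive', 'class', 'who', 'adult_male', 'alone'], inplace=True, errors='ignore')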

In [None]:
# Train-test split
from sklearn.model_selection import train_test_split

X = df.drop('survived', axis=1)
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
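
With an imbalanced target, a plain random split can leave the test set with a different survival rate than the training set. One option, shown here as a sketch on made-up data rather than as the notebook's method, is the `stratify` parameter of `train_test_split`, which keeps class proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 5 survivors out of 20 passengers (25%)
toy = pd.DataFrame({
    'fare': range(20),
    'survived': [1] * 5 + [0] * 15,
})

X = toy.drop(columns='survived')
y = toy['survived']

# stratify=y preserves the 25% survival rate in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.mean(), y_test.mean())
```

Comparing `y_train.mean()` and `y_test.mean()` after any split is a quick sanity check on class balance.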

## 6. Conclusion

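The Titanic dataset has been cleaned and preprocessed: missing ages were filled with the median, rows without an embarkation port were removed, the sparse `deck` column was dropped, `sex` and `embarked` were encoded numerically, and the data was split 80/20 into training and test sets. It is now ready for machine learning models such as logistic regression or decision trees.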