
# Data Mining Project: Titanic Survival Classification

---

## 1. Introduction to the Problem

This project aims to **predict whether a passenger survived the Titanic disaster** based on their demographic and travel information.  
This is a **classification problem** because the target variable (Survived) is categorical (0 = No, 1 = Yes).  

**Questions:**  
- Can we accurately predict if a passenger survived based on their characteristics?  
- Which features are most important for survival?  



## 2. Introduction to the Data

The data for this problem is the **Titanic dataset**, which contains passenger information (name, age, gender, class, etc.) and survival status.  
- Source: [Kaggle Titanic Dataset](https://www.kaggle.com/c/titanic/data)  
- The dataset contains 891 rows and 12 columns.  

**Features include:**  
- **pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)  
- **sex**: Gender  
- **age**: Age in years  
- **sibsp**: # of siblings/spouses aboard  
- **parch**: # of parents/children aboard  
- **fare**: Passenger fare  
- **embarked**: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)  
- **survived**: Target (0 = No, 1 = Yes)  


In [None]:

import pandas as pd
import seaborn as sns
data = pd.read_csv("data/titanic.csv")

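As a quick sanity check after loading, it can help to confirm the shape of the table and see which columns have missing values before any pre-processing. The sketch below uses a small hand-made frame (`sample`, with made-up rows) and a hypothetical helper `summarize`; in the notebook the same helper would be applied to the loaded `data` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the notebook would use the loaded `data` frame.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "pclass": [3, 1, 3, 1, 2],
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, np.nan, 54.0, 14.0],
    "embarked": ["S", "C", "S", None, "Q"],
})


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing share for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })


summary = summarize(sample)
print(sample.shape)
print(summary)
```

Calling `summarize(data)` on the real table would show which columns need the missing-value handling described in the next section.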



## 3. Pre-processing the Data

Steps:  
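- Drop redundant columns (`class`, `deck`, `embark_town`, `alive`, `adult_male`, `who`). Note that `alive` restates the target, so keeping it would leak the answer into the features.  
- Handle missing values in `age` (filled with the median) and `embarked` (filled with the most frequent value).  
- Encode categorical variables (`sex`, `embarked`) as integers.  
- Drop rows with a missing target so that `survived` is a clean numeric column.  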


In [None]:
# Drop unnecessary columns
data = data.drop(columns=["class", "deck", "embark_town", "alive", "adult_male", "who"])

# Handle missing values 
data["age"] = data["age"].fillna(data["age"].median())
data["embarked"] = data["embarked"].fillna(data["embarked"].mode()[0])

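# Encode categorical variables as integers.
# LabelEncoder assigns codes in sorted order: female=0, male=1 and C=0, Q=1, S=2.
from sklearn.preprocessing import LabelEncoder

for col in ["sex", "embarked"]:
    # A fresh encoder per column keeps each mapping independent
    data[col] = LabelEncoder().fit_transform(data[col])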

# Drop rows with missing target
data = data.dropna(subset=["survived"])

data.head()

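Label encoding turns `embarked` into the integers 0, 1, 2, which a linear model such as logistic regression may read as an ordering between ports. One alternative, shown here only as a sketch on a made-up `sample` frame, is to one-hot encode `embarked` with `pd.get_dummies` while keeping `sex` as a single 0/1 column.

```python
import pandas as pd

# Hypothetical sample with the same column names as the notebook's data.
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
    "fare": [7.25, 71.28, 7.75, 8.05],
})

# sex is binary, so a single 0/1 column is enough.
sample["sex"] = sample["sex"].map({"female": 0, "male": 1})

# embarked has three unordered ports; one indicator column per port
# avoids implying an order such as C < Q < S.
encoded = pd.get_dummies(sample, columns=["embarked"], prefix="embarked", dtype=int)
print(encoded)
```

Tree-based models are usually less sensitive to this choice, so the label-encoded version used in this notebook remains a reasonable starting point for the decision tree and random forest.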

## 4. Data Understanding / Visualization 

We will visualize the dataset to explore:  
- Survival rate by gender, class, and age.  
- Correlation heatmap.  


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Map the 0/1 target to readable labels so seaborn builds the legend itself.
# Relabeling afterwards with plt.legend(labels=...) can attach labels to the wrong artists.
plot_data = data.assign(Survived=data["survived"].map({0: "No", 1: "Yes"}))
hue_order = ["No", "Yes"]

# Survival by gender
sns.countplot(x="sex", hue="Survived", hue_order=hue_order, data=plot_data)
plt.xlabel("Sex (0=Female, 1=Male)")
plt.ylabel("Count")
plt.title("Survival by Gender")
plt.show()

# Survival by class
sns.countplot(x="pclass", hue="Survived", hue_order=hue_order, data=plot_data)
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.title("Survival by Class")
plt.show()

# Age distribution
sns.histplot(data=plot_data, x="age", hue="Survived", hue_order=hue_order, bins=30, kde=True)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution by Survival")
plt.show()

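The plots show counts, while the list above also mentions survival rates and a correlation heatmap. A minimal sketch of both computations follows, using a small made-up `sample` frame that is already encoded the way the notebook's data is (sex: 0 = female, 1 = male). The numbers it produces describe the toy rows only, not the Titanic data.

```python
import pandas as pd

# Hypothetical, already-encoded sample (sex: 0 = female, 1 = male).
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 1],
    "pclass":   [3, 1, 3, 3, 1, 2, 3, 2],
    "sex":      [1, 0, 0, 1, 0, 1, 1, 0],
    "age":      [22, 38, 26, 35, 54, 27, 2, 14],
})

# The mean of a 0/1 target within a group is that group's survival rate.
rate_by_sex = sample.groupby("sex")["survived"].mean()
rate_by_class = sample.groupby("pclass")["survived"].mean()

# Pairwise Pearson correlations between the numeric columns.
corr = sample.corr()

print(rate_by_sex)
print(rate_by_class)
print(corr.round(2))
```

In the notebook, the resulting matrix could be drawn with `sns.heatmap(corr, annot=True, cmap="coolwarm")` followed by `plt.show()`.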

## 5. Modeling 

These are the three classification models tested:  
- **Logistic Regression** (linear baseline).  
- **Decision Tree** (non-linear, interpretable).  
- **Random Forest** (ensemble, usually more accurate).  


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Features and target
X = data.drop("survived", axis=1)
y = data["survived"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)

# Initialize models
log_model = LogisticRegression(max_iter=500)
tree_model = DecisionTreeClassifier(random_state=42)
forest_model = RandomForestClassifier(random_state=42)

# Fit models
log_model.fit(X_train, y_train)
tree_model.fit(X_train, y_train)
forest_model.fit(X_train, y_train)

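A single 70/30 split gives one accuracy figure that can shift noticeably with the random seed. As a hedged sketch of a more stable estimate, the block below runs stratified 5-fold cross-validation on a logistic regression wrapped in a scaling pipeline. The data here (`X_demo`, `y_demo`) is synthetic and generated with `make_classification`; in the notebook, `X` and `y` would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; the notebook would pass the
# Titanic features and the survived column instead.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

# Scaling inside the pipeline is refit on each training fold,
# so no information from the held-out fold leaks into it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Accuracy per fold: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The same `cross_val_score` call works for the decision tree and random forest, which do not need the scaling step.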


## 6. Evaluation 

Each model is evaluated on the held-out test set using:  
- **Accuracy**  
- **Precision, Recall, F1-score**  
- **Confusion Matrix**  


In [None]:

from sklearn.metrics import classification_report, confusion_matrix

models = {
    "Logistic Regression": log_model,
    "Decision Tree": tree_model,
    "Random Forest": forest_model
}

for name, model in models.items():
    print(f"--- {name} ---")
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\n")

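To make the link between the confusion matrix and the reported metrics concrete, here is a small worked example on toy labels (not model output from this notebook). It derives accuracy, precision, recall, and F1 by hand from the four confusion-matrix cells; the hand-computed values can be compared with scikit-learn's own `precision_score`, `recall_score`, and `f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels for illustration only (1 = survived, 0 = did not survive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted survivors, how many survived
recall = tp / (tp + fn)     # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here one survivor is missed and one non-survivor is flagged as a survivor, so precision and recall are both 3/4 while accuracy is 8/10.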


## 7. Storytelling 

The exploratory plots and the fitted models point to three patterns:  
- **Women had a much higher survival rate than men.**  
- **First-class passengers survived at higher rates than third-class passengers.**  
- **Younger passengers had a better chance of survival.**  

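These patterns address the second initial question, which features matter most for survival. For the first question, the classification reports in Section 6 indicate that **survival can be predicted with reasonable accuracy** from these features.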

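One way to back the question of which features matter most with numbers is to rank a random forest's impurity-based feature importances. The sketch below does this on synthetic data (`X_demo`, `y_demo`), built so that the label follows `sex` with 10% noise while `age` and `fare` are unrelated to it. It illustrates the technique only; the ranking for the real data would come from `forest_model.feature_importances_` paired with `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration: the label follows `sex` with 10% noise,
# while the other two columns are unrelated to it.
rng = np.random.default_rng(42)
n = 500
X_demo = pd.DataFrame({
    "sex": rng.integers(0, 2, n),
    "age": rng.uniform(1, 80, n),
    "fare": rng.uniform(5, 100, n),
})
flip = rng.random(n) < 0.10
# Label is 1 for sex == 0 and 0 for sex == 1, except on flipped rows.
y_demo = np.where(flip, X_demo["sex"], 1 - X_demo["sex"])

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; a higher value means the feature
# accounted for more of the impurity reduction across the trees.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

Impurity-based importances tend to favor continuous features with many distinct values, so `sklearn.inspection.permutation_importance` on the test set is a common cross-check.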


## 8. Impact Section 

Even though this is a historical dataset, the implications extend to real-world applications:  
- **Social impact**: Highlights class and gender inequality during disasters.  
- **Ethical impact**: Predictive models should not be used to justify discriminatory practices.  
- **Practical impact**: Understanding survival factors can inform future safety policies.  



## 9. References 

- Kaggle Titanic Dataset: https://www.kaggle.com/c/titanic/data  
- Scikit-learn documentation: https://scikit-learn.org/  
- Jupyter Notebook outline created with ChatGPT (GPT-5)



## 10. Code 

Full code is provided in this notebook.  
