![image.png](https://i.imgur.com/a3uAqnb.png)

# Logistic Regression for Titanic Survival Prediction - Homework Assignment

In this homework, you will implement a **Logistic Regression classifier** to predict passenger survival on the Titanic. This project will help you understand the fundamentals of classification using logistic regression.

## 📌 Project Overview
- **Task**: Predict passenger survival on the Titanic
- **Algorithm**: Logistic Regression for binary classification
- **Dataset**: Titanic passenger dataset (provided)
- **Goal**: Build an accurate classification model using scikit-learn

## 📚 Learning Objectives
By completing this assignment, you will:
- Understand logistic regression for binary classification problems
- Learn data preprocessing and feature engineering techniques
- Practice exploratory data analysis (EDA)
- Implement feature selection and model evaluation
- Learn about classification metrics and model performance
- Identify the most important features for survival prediction

## 1️⃣ Initial Setup and Library Installation

**Task**: Set up the environment by importing the helper that clears cell output. Library installation follows in the next step.

In [None]:
from IPython.display import clear_output

## 2️⃣ Library Installation (if needed)

**Task**: Install required libraries for the project.

In [None]:
# In case you run this notebook outside Colab (where the libraries aren't pre-installed), uncomment the lines below

# %pip install numpy
# %pip install pandas
# %pip install matplotlib
# %pip install seaborn
# %pip install scikit-learn

clear_output()

## 3️⃣ Import Libraries and Configuration

**Task**: Import all necessary libraries and set up configuration parameters.

**Requirements**:
- Import data processing libraries (pandas, numpy)
- Import visualization libraries (matplotlib, seaborn)
- Import scikit-learn modules for preprocessing and modeling
- Set random seeds for reproducibility
- Configure display options for better data visualization

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Configure matplotlib
plt.style.use('default')
sns.set_palette("husl")

## 4️⃣ Data Loading and Initial Exploration

**Task**: Load the Titanic dataset and perform initial exploration.

**Requirements**:
- Download and load the dataset
- Display basic information about the data
- Check data types and structure
- Identify the target variable and features
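
As a rough illustration of the loading and inspection steps listed above, the sketch below writes a three-row stand-in CSV to a temporary directory and reads it back with pandas. The directory `demo_dir`, the frame `demo_df`, and the rows themselves are made up for the example; in the notebook you would join the `path` returned by `kagglehub` with the file name instead.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the directory returned by kagglehub.dataset_download().
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "Titanic-Dataset.csv")

# Write three made-up rows so the example runs without any download.
with open(demo_csv, "w") as f:
    f.write("PassengerId,Survived,Pclass,Sex,Age\n")
    f.write("1,0,3,male,22\n")
    f.write("2,1,1,female,38\n")
    f.write("3,1,3,female,\n")

demo_df = pd.read_csv(demo_csv)

print(demo_df.shape)         # (rows, columns)
print(demo_df.dtypes)        # data type of each column
print(demo_df.isna().sum())  # missing values per column
```

Here `Survived` plays the role of the target variable and the remaining columns are candidate features.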

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset") # Titanic-Dataset.csv

print("Path to dataset files:", path)

In [None]:
# TODO: Load the Titanic dataset (Titanic-Dataset.csv, located in the `path` directory downloaded above)
titanic_data = None

# TODO: Display basic information about the dataset

## 5️⃣ Exploratory Data Analysis (EDA)

**Task**: Perform comprehensive exploratory data analysis to understand the data.

**Requirements**:
- Examine data structure and missing values
- Analyze the distribution of the target variable
- Explore relationships between features and survival
- Create visualizations to understand data patterns
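
One possible way to approach the target distribution and the feature-versus-survival relationships is with `value_counts` and `groupby`. The sketch below uses a small invented frame called `toy`, not the real dataset, so the numbers carry no meaning beyond showing the mechanics.

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 3, 1, 2],
})

# Share of each class in the target variable.
class_balance = toy["Survived"].value_counts(normalize=True)

# The mean of a 0/1 target within a group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(class_balance)
print(rate_by_sex)
print(rate_by_class)
```

Each resulting Series can be turned into a bar chart with its `.plot(kind="bar")` method, which is one simple option for the visualization requirement.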

In [None]:
# TODO: Display first few rows of the dataset

# TODO: Get basic information about the dataset (shape, data types)

# TODO: Check for missing values

# TODO: Display statistical summary of numerical features

In [None]:
# TODO: Analyze the target variable distribution (Survived)

# TODO: Create visualizations for survival distribution

In [None]:
# TODO: Explore survival rates by different categorical features (Sex, Pclass, Embarked)

# TODO: Create plots to visualize survival rates

In [None]:
# TODO: Analyze numerical features (Age, Fare, SibSp, Parch)

# TODO: Create histograms for numerical features

# TODO: Examine survival rates across different age groups and fare ranges

## 6️⃣ Data Preprocessing and Feature Engineering

**Task**: Clean and prepare the data for logistic regression modeling.

**Requirements**:
- Handle missing values appropriately
- Encode categorical variables
- Create new features if beneficial
- Scale numerical features if necessary
- Select relevant features for modeling
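
The sketch below shows one way the imputation, feature engineering, and encoding steps could look on a four-row invented frame. The choices here (median for `Age`, mode for `Embarked`, a hypothetical `HasCabin` flag, one-hot encoding with `drop_first=True`) are assumptions for illustration, not the required solution. Note also that computing the median on the full dataset before splitting lets a little test-set information into training; a stricter approach computes it on the training split only.

```python
import numpy as np
import pandas as pd

# Invented rows; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, "E46"],
    "Sex": ["male", "female", "female", "male"],
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 0, 1],
})

# Missing values: median for Age, most frequent value for Embarked.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Cabin is mostly missing, so keep only whether it was recorded (hypothetical feature name).
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

# Family features: the passenger plus siblings/spouses plus parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# Encoding: binary map for Sex, one-hot columns for Embarked.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy = pd.get_dummies(toy.drop(columns=["Cabin"]), columns=["Embarked"], drop_first=True)

print(toy)
```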

In [None]:
# Work on a copy so the original DataFrame stays untouched
titanic_processed = titanic_data.copy()

# TODO: Handle missing values
# - Fill missing Age values (consider using median or mean)
# - Fill missing Embarked values (consider using mode)
# - Handle missing Cabin values (consider creating a binary feature)

# TODO: Feature engineering
# - Create FamilySize feature from SibSp and Parch
# - Create IsAlone feature
# - Create Age groups or Fare groups if beneficial

In [None]:
# TODO: Encode categorical variables
# - Convert Sex to numerical values
# - Encode Embarked using appropriate method
# - Handle any other categorical features

# TODO: Drop irrelevant columns (Name, PassengerId, Ticket, etc.)

# TODO: Verify the processed dataset

## 7️⃣ Feature Selection and Data Splitting

**Task**: Select the most relevant features and split the data for training and testing.

**Requirements**:
- Separate features (X) from target variable (y)
- Split data into training and testing sets (80:20 ratio)
- Apply feature scaling if necessary
- Ensure no data leakage (fit the scaler on the training set only, then use it to transform the test set)
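
A minimal sketch of an 80:20 split followed by leakage-free scaling, using randomly generated arrays (`X_demo`, `y_demo`) in place of the real features. The `stratify` argument is an optional extra, assumed here to keep the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features, 60/40 class balance.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# 80:20 split; stratify keeps the 60/40 ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training data only, then reuse its statistics on the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Calling `fit_transform` on the full dataset before splitting would let the test set's mean and variance influence training, which is the kind of leakage the requirement refers to.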

In [None]:
# TODO: Separate features from target variable
# X = features, y = target (Survived)

# TODO: Split the data into training and testing sets (80:20 split)

# TODO: Apply feature scaling if necessary (StandardScaler)

# TODO: Display the shapes of training and testing sets

## 8️⃣ Logistic Regression Model Training

**Task**: Train a logistic regression model on the training data.

**Requirements**:
- Initialize LogisticRegression with appropriate parameters
- Fit the model on training data
- Optionally use cross-validation (e.g. `cross_val_score` from `sklearn.model_selection`) for a more stable performance estimate
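
To make the training step concrete, here is a hedged sketch on synthetic data where the label depends mainly on the first feature. The data, the variable names, and the parameter values (`max_iter=1000`, `cv=5`) are illustrative assumptions. The last lines check that, for a binary problem, the predicted probability is the sigmoid of the linear score built from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the label is driven by feature 0 plus a little noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_demo, y_demo)

# 5-fold cross-validated accuracy as a steadier estimate than a single split.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

# Reproduce predict_proba by hand: sigmoid(w . x + b).
linear_score = X_demo @ model.coef_[0] + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-linear_score))

print("coefficients:", model.coef_[0], "intercept:", model.intercept_[0])
print("mean CV accuracy:", cv_scores.mean())
```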

In [None]:
# TODO: Initialize the Logistic Regression model
# Consider parameters like random_state, max_iter

# TODO: Train the model on training data

# TODO: Display model parameters and coefficients

## 9️⃣ Model Evaluation and Performance Analysis

**Task**: Evaluate the trained model on both training and testing data.

**Requirements**:
- Make predictions on both training and testing sets
- Calculate various classification metrics
- Create confusion matrix
- Analyze model performance and potential overfitting
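
The metrics named above can be checked by hand on a tiny example. The label vectors below are invented so that the counts are easy to verify: 3 true positives, 3 true negatives, 1 false positive, and 1 false negative.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels for illustration: TP=3, TN=3, FP=1, FN=1.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 6/8
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)   = 3/4
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)   = 3/4
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

print(acc, prec, rec, f1)
print(cm)
```

Computing the same metrics on both the training and the testing predictions, and comparing them side by side, is one straightforward way to look for overfitting: a large gap in favor of the training set is the usual warning sign.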

In [None]:
# TODO: Make predictions on training and testing sets

# TODO: Calculate accuracy, precision, recall, and F1-score for both sets

# TODO: Display classification report

# TODO: Create and visualize confusion matrix

In [None]:
# TODO: Create a comprehensive performance comparison
# Compare training vs testing performance to check for overfitting

# TODO: Visualize model performance metrics

## 🔟 Feature Importance Analysis

**Task**: Analyze which features most strongly influence passenger survival.

**Requirements**:
- Extract and interpret model coefficients
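- Rank features by importance (absolute coefficient size, which is comparable across features only when they are on the same scale)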
- Create visualizations for feature importance
- Provide insights about survival factors
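
As a sketch of coefficient-based ranking, the example below fits a model on synthetic data with hypothetical feature names (`strong_signal`, `weak_signal`, `noise`) and sorts features by absolute coefficient. The exponentiated coefficient is the odds ratio: the multiplicative change in the odds of the positive class for a one-unit increase in that feature, holding the others fixed. The features here are already on a common scale, which is what makes the magnitudes comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data with a known structure: true weights are 2.0, -0.5, and 0.
rng = np.random.default_rng(0)
feature_names = ["strong_signal", "weak_signal", "noise"]  # hypothetical names
X_demo = rng.normal(size=(500, 3))
logits = 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]
y_demo = (logits + rng.logistic(size=500) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row per feature: signed coefficient, odds ratio, and magnitude for ranking.
importance = pd.DataFrame({
    "feature": feature_names,
    "coefficient": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0]),
})
importance["abs_coefficient"] = importance["coefficient"].abs()
importance = importance.sort_values(
    "abs_coefficient", ascending=False
).reset_index(drop=True)

print(importance)
```

A positive coefficient pushes predictions toward the positive class and a negative one pushes away from it, so the sign is as informative as the rank when writing up insights.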

In [None]:
# TODO: Extract model coefficients

# TODO: Create feature importance visualization

# TODO: Rank features by their influence on survival prediction

# TODO: Interpret the results and provide insights

## 📝 Evaluation Criteria

Your homework will be evaluated based on:

1. **Implementation Correctness (40%)**
   - Proper data preprocessing and handling of missing values
   - Correct implementation of logistic regression
   - Appropriate feature engineering and selection
   - Proper train-test split methodology

2. **Model Performance (30%)**
   - Reasonable classification metrics (accuracy, precision, recall, F1-score)
   - Proper evaluation methodology
   - Analysis of model performance

3. **Code Quality and Analysis (20%)**
   - Clean, readable code with appropriate comments
   - Comprehensive exploratory data analysis
   - Good coding practices and organization

4. **Feature Importance Analysis (10%)**
   - Identification of most important survival factors
   - Clear interpretation of model coefficients
   - Meaningful insights and conclusions