# **Survival Analysis Revamp: Death Prediction 2.0**
## **Project Overview**
This project aims to revamp the original **death prediction model** into a **production-grade survival analysis system**. Instead of directly predicting an age of death, we model the **probability of survival over time**. This lets us account for right-censoring: individuals who are still alive have no known age at death, but they still tell us that they survived at least to their current age.
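
To make the role of censoring concrete, here is a minimal sketch of a Kaplan-Meier estimate in plain Python. Kaplan-Meier is not one of the models listed in this project; it is used here only as the simplest illustration of a survival curve. The function name `kaplan_meier` and the toy data are assumptions made for the example.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: observed time for each individual
    events:    1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # everyone leaves the risk set at their own duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Only observed deaths reduce the survival estimate.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Censored individuals also leave the risk set, but without a drop.
        at_risk -= exits[t]
    return curve

# Toy data (hypothetical): the third individual is censored at time 8.
curve = kaplan_meier(durations=[5, 8, 8, 12], events=[1, 1, 0, 1])
print(curve)
```

The censored individual counts toward the risk set up to time 8 and then drops out without being treated as a death, which is the information a plain regression on age of death would discard.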

## **Why Survival Analysis?**
Survival analysis is widely used in **healthcare, finance, and customer retention**:
- **Healthcare:** Predict patient survival rates.
- **Finance:** Credit risk and loan default probabilities.
- **Subscription Businesses:** Customer churn prediction (e.g., Netflix, Spotify).

## **Key Steps**
### **1️⃣ Reframe as Survival Analysis**
- Convert the dataset to survival format: a duration column (time observed) and an event indicator (1 = death observed, 0 = censored).
- Use Python’s `lifelines` and PyTorch-based `pycox`.
- Handle **censored data** (people still alive in 2024).
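
The conversion described above can be sketched as follows. This is a plain-Python illustration under stated assumptions: the field names `birth_year` and `death_year` are hypothetical and may not match the actual dataset columns, and 2024 is taken as the observation cutoff from the step above. In the notebook itself the same logic would map to two vectorized pandas columns.

```python
CENSOR_YEAR = 2024  # observation cutoff; people alive at this point are censored

def to_survival_record(birth_year, death_year=None, censor_year=CENSOR_YEAR):
    """Return (duration, event) for one person.

    duration: years under observation
    event:    1 if death was observed, 0 if the person is censored
    """
    if death_year is not None and death_year <= censor_year:
        return death_year - birth_year, 1
    return censor_year - birth_year, 0

# Hypothetical rows: one observed death and one person still alive.
people = [
    {"birth_year": 1930, "death_year": 2010},
    {"birth_year": 1960, "death_year": None},
]
records = [to_survival_record(p["birth_year"], p["death_year"]) for p in people]
print(records)  # [(80, 1), (64, 0)]
```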

### **2️⃣ Train Survival Models**
- **Traditional Cox Proportional Hazards Model (`lifelines`)**
- **DeepSurv (Neural Networks for Survival Analysis)**
- **Transformer-based Time-to-Event Models (Temporal Fusion Transformer, TFT; Hugging Face Transformers)**
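
As background for the Cox and DeepSurv entries above, both models are trained against the Cox partial likelihood. The sketch below computes its negative log in plain Python (Breslow form, no special handling of ties). It is an illustration of the objective only, not `lifelines` or `pycox` code; the function name and toy data are assumptions.

```python
import math

def cox_neg_log_partial_likelihood(durations, events, log_risks):
    """Negative log partial likelihood for per-individual log-risk scores.

    For a Cox model the log-risk is a linear function of the covariates;
    for DeepSurv it is the output of a neural network.
    """
    nll = 0.0
    for i, (t_i, observed) in enumerate(zip(durations, events)):
        if not observed:
            continue  # censored rows appear only inside other rows' risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set_total = sum(
            math.exp(log_risks[j]) for j, t_j in enumerate(durations) if t_j >= t_i
        )
        nll -= log_risks[i] - math.log(risk_set_total)
    return nll

# Toy data (hypothetical): the second individual is censored at time 3.
durations = [2, 3, 5]
events = [1, 0, 1]
baseline = cox_neg_log_partial_likelihood(durations, events, [0.0, 0.0, 0.0])
informed = cox_neg_log_partial_likelihood(durations, events, [1.0, 0.0, -1.0])
print(baseline, informed)
```

Scores that rank the earliest death as highest risk give a lower loss than uninformative scores, which is the behaviour training tries to reinforce.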

### **3️⃣ Deploy as an API**
- Wrap the trained model in a **FastAPI** server.
- Package with **Docker**.
- Deploy using **Google Cloud Run / AWS Lambda**.

## **Technologies Used**
- **Libraries:** `lifelines`, `pycox`, `FastAPI`, `Hugging Face Transformers`
- **Model Training:** Traditional (Cox Model) & Deep Learning (DeepSurv, TFT)
- **Deployment:** FastAPI, Docker, Google Cloud Run/AWS Lambda

---

> 📌 **Next Steps:** Run the code cells below to load the datasets, preprocess them, and train the baseline Cox Proportional Hazards Model.


In [1]:
# Standard library, data handling, and plotting
import os

import kagglehub
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import norm

# scikit-learn utilities: text features, metrics, splitting, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.utils import resample

In [5]:
# Download the WHO Life Expectancy dataset from Kaggle and load it
life_exp_path = kagglehub.dataset_download("kumarajarshi/life-expectancy-who")
life_exp_file = os.path.join(life_exp_path, "Life Expectancy Data.csv")
life_exp_df = pd.read_csv(life_exp_file)

print(life_exp_df.head())

# Download the heart failure prediction dataset and load it
heart_path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")
heart_file = os.path.join(heart_path, "heart.csv")
heart_df = pd.read_csv(heart_file)

print(heart_df.head())

# Download the age dataset and load it
age_path = kagglehub.dataset_download("imoore/age-dataset")
age_file = os.path.join(age_path, "AgeDataset-V1.csv")
age_df = pd.read_csv(age_file)

print(age_df.head())

       Country  Year      Status  Life expectancy   Adult Mortality  \
0  Afghanistan  2015  Developing              65.0            263.0   
1  Afghanistan  2014  Developing              59.9            271.0   
2  Afghanistan  2013  Developing              59.9            268.0   
3  Afghanistan  2012  Developing              59.5            272.0   
4  Afghanistan  2011  Developing              59.2            275.0   

   infant deaths  Alcohol  percentage expenditure  Hepatitis B  Measles   ...  \
0             62     0.01               71.279624         65.0      1154  ...   
1             64     0.01               73.523582         62.0       492  ...   
2             66     0.01               73.219243         64.0       430  ...   
3             69     0.01               78.184215         67.0      2787  ...   
4             71     0.01                7.097109         68.0      3013  ...   

   Polio  Total expenditure  Diphtheria    HIV/AIDS         GDP  Population  \
0    6.