# Pancreatic Cancer Survival Prediction

This machine learning project predicts the survival status of pancreatic cancer patients from clinical data. Inspired by the Molecular Twin study, it trains two classifiers on the same train/test split: Logistic Regression with Recursive Feature Elimination (RFE) and a Random Forest.

In [None]:
# Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report, confusion_matrix


## Step 2: Upload and Load Data

In [None]:
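# Upload the CSV through the Colab file picker (this cell requires Google Colab).
from google.colab import files
uploaded = files.upload()
df = pd.read_csv("pre_processed_pancreatic_sample.csv")
# Drop the leftover index column from an earlier export, if it is present.
df = df.drop(columns=["Unnamed: 0"], errors="ignore")
df.head()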

## Step 3: Prepare Features and Labels

In [None]:
X = df.drop("Survival_Status", axis=1)
y = df["Survival_Status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
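
Survival labels are often imbalanced, and a plain random split can leave the test set with a different class ratio than the training set. The sketch below shows how passing `stratify` to `train_test_split` preserves the ratio. It uses synthetic data: `X_demo` and `y_demo` are hypothetical stand-ins, not the project's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for Survival_Status (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array([1] * 50 + [0] * 150)  # 25% positive class

# stratify=y_demo asks the splitter to keep the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.2f}, test={test_rate:.2f}")
```

Whether stratification matters here depends on how balanced `Survival_Status` is, which `y.value_counts()` would show.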

## Step 4: Standardize Features

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
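
The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. A minimal sketch of what that produces, on synthetic data (`X_train_demo` and `X_test_demo` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and non-unit spread (hypothetical data).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

scaler_demo = StandardScaler()
X_train_z = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from train
X_test_z = scaler_demo.transform(X_test_demo)        # reuses the train mean/std

# Training columns end up with mean 0 and standard deviation 1.
train_means = X_train_z.mean(axis=0)
train_stds = X_train_z.std(axis=0)
# Test columns are only approximately standardized, since they were not used for fitting.
test_means = X_test_z.mean(axis=0)
print(train_means.round(6), train_stds.round(6), test_means.round(3))
```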

## Step 5: Logistic Regression with RFE

In [None]:
# RFE repeatedly fits the model and drops the weakest feature until 15 remain.
lr = LogisticRegression(max_iter=1000)
rfe = RFE(lr, n_features_to_select=15)
X_train_rfe = rfe.fit_transform(X_train_scaled, y_train)
X_test_rfe = rfe.transform(X_test_scaled)
# RFE fits internal clones of lr, so fit lr itself on the selected features.
lr.fit(X_train_rfe, y_train)
y_pred_lr = lr.predict(X_test_rfe)
print("Logistic Regression Results:\n")
print(classification_report(y_test, y_pred_lr))
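
After fitting, RFE exposes which columns it kept through `support_` and the elimination order through `ranking_`. The sketch below shows one way to read them back as feature names. It runs on synthetic data from `make_classification`; the `feature_i` column names and the choice of 4 features are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical table (hypothetical column names).
X_arr, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(10)])

X_scaled = StandardScaler().fit_transform(X_demo)
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe_demo.fit(X_scaled, y_demo)

# support_ is a boolean mask over the input columns; ranking_ is 1 for kept features.
selected = X_demo.columns[rfe_demo.support_].tolist()
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(selected)
print(ranking)
```

In the notebook, the equivalent lookup would be `X.columns[rfe.support_]`, since the scaled arrays keep the column order of `X`.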

## Step 6: Random Forest Classifier

In [None]:
# Tree ensembles are insensitive to feature scaling, so the unscaled features are used.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Results:\n")
print(classification_report(y_test, y_pred_rf))
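
A single 80/20 split can give a noisy estimate on a small clinical sample. One possible complement, sketched here on synthetic data, is stratified k-fold cross-validation. The fold count of 5 and the ROC AUC metric are assumptions chosen for illustration, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (hypothetical stand-in for X, y).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Five stratified folds; each fold serves once as the validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC AUC per fold: {scores.round(3)}")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a rough sense of how much the reported metrics could move with a different split.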

## Step 7: Confusion Matrix and Feature Importance

In [None]:
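# confusion_matrix orders classes by sorted label value, so the tick labels
# below assume Survival_Status is encoded as 0 = Deceased and 1 = Survived.
cm = confusion_matrix(y_test, y_pred_rf)
class_names = ["Deceased", "Survived"]
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap="Blues", xticklabels=class_names, yticklabels=class_names)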
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

importances = rf.feature_importances_
features = X.columns
importance_df = pd.DataFrame({"Feature": features, "Importance": importances})
importance_df = importance_df.sort_values(by="Importance", ascending=False).head(15)

plt.figure(figsize=(10, 6))
sns.barplot(x="Importance", y="Feature", data=importance_df)
plt.title("Top 15 Important Features")
plt.show()
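
The `feature_importances_` attribute is impurity-based and computed on the training data, which can overstate high-cardinality or correlated features. Permutation importance on held-out data is one commonly used cross-check. The sketch below uses synthetic data, and the `feature_i` names and `n_repeats=10` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical features (hypothetical column names).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out set and measure the drop in accuracy.
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)
perm_df = pd.DataFrame(
    {"Feature": X_demo.columns, "Importance": result.importances_mean}
).sort_values(by="Importance", ascending=False)
print(perm_df)
```

Features that rank high in both the impurity-based and the permutation view are the more trustworthy candidates for important predictors.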

## Conclusion

This notebook applies the core idea of the Molecular Twin study, predicting patient survival from diverse clinical data, in a setting where only accessible clinical features are available. It compares Logistic Regression with RFE against a Random Forest on a held-out test set, and the Random Forest's feature importances highlight the predictors the model relied on most.