# 🫁 Lung Cancer Prediction using Logistic Regression

**Author:** Florencekumari Makwana  
**Dataset:** Lung Cancer Survey Dataset (309 records, 15 features)  
**Goal:** Predict whether a patient has lung cancer based on lifestyle and symptom data using Logistic Regression.

---

## 📋 Project Overview

Lung cancer is one of the leading causes of cancer-related deaths worldwide. Early detection significantly improves survival rates. In this project, we build a **binary classification model** using Logistic Regression to predict lung cancer presence based on survey responses about symptoms and lifestyle habits.

### Workflow
1. Data Loading & Exploration
2. Data Preprocessing
3. Train/Test Split
4. Model Training (Logistic Regression)
5. Model Evaluation (Accuracy, Precision, Recall, Specificity, Confusion Matrix)
6. Visualizations

## 📦 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    confusion_matrix, roc_auc_score, ConfusionMatrixDisplay
)

# Display settings
pd.set_option('display.max_columns', None)
sns.set_theme(style='whitegrid', palette='muted')
print('Libraries loaded successfully ✅')

## 📂 2. Load the Dataset

In [None]:
df = pd.read_csv('survey_lung_cancer.csv')

# Strip whitespace from column names (some have trailing spaces)
df.columns = df.columns.str.strip()

print(f'Dataset shape: {df.shape}')
print(f'Columns: {list(df.columns)}')
df.head()

## 🔍 3. Exploratory Data Analysis (EDA)

In [None]:
# Basic info and statistics
print('=== Dataset Info ===')
df.info()
print('\n=== Missing Values ===')
print(df.isnull().sum())
print('\n=== Statistical Summary ===')
df.describe()

In [None]:
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
# value_counts() sorts by frequency, so fix the order to keep labels and colors aligned
counts = df['LUNG_CANCER'].value_counts().reindex(['NO', 'YES'], fill_value=0)
axes[0].bar(counts.index, counts.values, color=['#2ecc71', '#e74c3c'], edgecolor='black', width=0.5)
axes[0].set_title('Lung Cancer Class Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Lung Cancer')
axes[0].set_ylabel('Count')
for i, v in enumerate(counts.values):
    axes[0].text(i, v + 1, str(v), ha='center', fontweight='bold')

# Pie chart (labels come from the same ordered index: NO=green, YES=red)
axes[1].pie(counts.values, labels=counts.index, autopct='%1.1f%%',
            colors=['#2ecc71', '#e74c3c'], startangle=90,
            wedgeprops={'edgecolor': 'white', 'linewidth': 2})
axes[1].set_title('Lung Cancer Proportion', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('class_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'\nClass balance | YES: {counts.get("YES", 0)} | NO: {counts.get("NO", 0)}')

In [None]:
# Gender & Age distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Gender distribution by lung cancer
gender_lc = df.groupby(['GENDER', 'LUNG_CANCER']).size().unstack(fill_value=0)
gender_lc.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'],
               edgecolor='black', rot=0)
axes[0].set_title('Gender vs Lung Cancer', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Count')
axes[0].legend(['NO', 'YES'])

# Age distribution
axes[1].hist(df[df['LUNG_CANCER'] == 'YES']['AGE'], bins=15, alpha=0.7,
             color='#e74c3c', label='YES', edgecolor='black')
axes[1].hist(df[df['LUNG_CANCER'] == 'NO']['AGE'], bins=15, alpha=0.7,
             color='#2ecc71', label='NO', edgecolor='black')
axes[1].set_title('Age Distribution by Lung Cancer', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Count')
axes[1].legend()

plt.tight_layout()
plt.savefig('gender_age_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Symptom feature comparison (binary features: 1=No, 2=Yes)
symptom_cols = [
    'SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE',
    'CHRONIC DISEASE', 'FATIGUE', 'ALLERGY', 'WHEEZING',
    'ALCOHOL CONSUMING', 'COUGHING', 'SHORTNESS OF BREATH',
    'SWALLOWING DIFFICULTY', 'CHEST PAIN'
]

# Calculate % of positive (value=2) symptom per lung cancer class
temp = df.copy()
temp['LUNG_CANCER_BIN'] = (temp['LUNG_CANCER'] == 'YES').astype(int)

symptom_rates = {}
for col in symptom_cols:
    yes_rate = ((temp[temp['LUNG_CANCER_BIN'] == 1][col] == 2).mean() * 100)
    no_rate  = ((temp[temp['LUNG_CANCER_BIN'] == 0][col] == 2).mean() * 100)
    symptom_rates[col] = {'Lung Cancer: YES': yes_rate, 'Lung Cancer: NO': no_rate}

symptom_df = pd.DataFrame(symptom_rates).T

fig, ax = plt.subplots(figsize=(14, 6))
x = np.arange(len(symptom_df))
width = 0.35
bars1 = ax.bar(x - width/2, symptom_df['Lung Cancer: YES'], width,
               label='Lung Cancer: YES', color='#e74c3c', alpha=0.85, edgecolor='black')
bars2 = ax.bar(x + width/2, symptom_df['Lung Cancer: NO'], width,
               label='Lung Cancer: NO', color='#2ecc71', alpha=0.85, edgecolor='black')
ax.set_xticks(x)
ax.set_xticklabels(symptom_df.index, rotation=45, ha='right', fontsize=10)
ax.set_ylabel('% of Patients with Symptom', fontsize=12)
ax.set_title('Symptom Prevalence by Lung Cancer Status', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_ylim(0, 110)
plt.tight_layout()
plt.savefig('symptom_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

## 🛠️ 4. Data Preprocessing

Steps:
- Encode the target variable (`LUNG_CANCER`): `YES → 1`, `NO → 0`
- Encode the `GENDER` column using one-hot encoding
- Verify there are no missing values
- Separate features (`X`) and target (`y`)

In [None]:
# Encode target variable
df['LUNG_CANCER'] = df['LUNG_CANCER'].map({'YES': 1, 'NO': 0})

# Check for missing values
missing = df.isnull().sum().sum()
print(f'Missing values: {missing} ({"✅ None found" if missing == 0 else "⚠️ Dropping rows with NaN"})')
df.dropna(inplace=True)

# Separate features and target
X = df.drop(columns=['LUNG_CANCER'])
y = df['LUNG_CANCER'].astype(int)

# One-hot encode categorical features (GENDER)
X = pd.get_dummies(X, drop_first=True)

print(f'Feature matrix shape: {X.shape}')
print(f'Target distribution | 1 (YES): {y.sum()} | 0 (NO): {(y == 0).sum()}')
X.head()

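## ✂️ 5. Train / Test Split

We split the dataset into **80% training** and **20% testing** using a fixed `random_state=42` for reproducibility. The split is not stratified, so the proportion of NO cases in the test set can differ from that of the full dataset.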

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print(f'Training samples : {len(X_train)}')
print(f'Testing samples  : {len(X_test)}')
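
Since `train_test_split` is called without `stratify=y`, the number of NO cases that land in the test set depends on the seed. The sketch below simulates that with the standard library only, using figures quoted in this notebook (309 records, 39 NO cases, 62 test rows). The helper name `negatives_in_test` and the choice of 2,000 trials are illustrative assumptions, not part of the original analysis.

```python
import random

N_TOTAL, N_NEG, N_TEST = 309, 39, 62  # records, NO cases, 20% test rows


def negatives_in_test(rng):
    """Count NO cases in one random, unstratified test split."""
    labels = [0] * N_NEG + [1] * (N_TOTAL - N_NEG)
    test = rng.sample(labels, N_TEST)  # sampling without replacement
    return test.count(0)


rng = random.Random(0)
draws = [negatives_in_test(rng) for _ in range(2000)]
mean_neg = sum(draws) / len(draws)

print(f'Average NO cases per test set: {mean_neg:.2f}')
print(f'Range over 2000 splits: {min(draws)} to {max(draws)}')
```

Passing `stratify=y` to `train_test_split` would pin the test set at roughly 8 NO cases, though it would also change every score reported later in this notebook.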

## 🤖 6. Model Training: Logistic Regression

Logistic Regression estimates the probability of class membership using the **sigmoid function**:

$$P(y=1 | X) = \frac{1}{1 + e^{-(w^T X + b)}}$$

We set `max_iter=500` (scikit-learn's default is 100) to give the solver enough iterations to converge on this dataset.
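
The sigmoid formula above can be evaluated by hand for a single sample. This is a minimal sketch with made-up weights: `w`, `b`, and `x` are hypothetical values chosen for illustration and are not the fitted model's parameters.

```python
import math


def sigmoid(z):
    """Map a log-odds score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """P(y=1 | x) for one sample: sigmoid of the linear score w.x + b."""
    z = sum(w_i * x_i for w_i, x_i in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical two-feature example (values chosen only for illustration)
w = [1.5, -0.5]
b = -1.0
x = [2, 1]  # linear score = 1.5*2 - 0.5*1 - 1.0 = 1.5
p = predict_proba(w, b, x)
print(f'P(y=1 | x) = {p:.3f}')
```

A score of 0 maps to a probability of exactly 0.5, which is the default decision threshold used by `model.predict`.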

In [None]:
# Train the model
model = LogisticRegression(max_iter=500, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Display model coefficients
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient', ascending=False)

print('Model trained successfully ✅')
print('\nTop 5 features by coefficient (most positive):')
print(coef_df.head())
print('\nBottom 5 features by coefficient (lowest or negative):')
print(coef_df.tail())

In [None]:
# Visualize feature coefficients
plt.figure(figsize=(10, 6))
colors = ['#e74c3c' if c > 0 else '#3498db' for c in coef_df['Coefficient']]
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, edgecolor='black')
plt.axvline(0, color='black', linewidth=0.8, linestyle='--')
plt.title('Logistic Regression: Feature Coefficients', fontsize=14, fontweight='bold')
plt.xlabel('Coefficient Value')
plt.tight_layout()
plt.savefig('feature_coefficients.png', dpi=150, bbox_inches='tight')
plt.show()

## 📊 7. Model Evaluation

We evaluate the model using the following metrics:

| Metric | Formula | What it measures |
|--------|---------|------------------|
| **Accuracy** | (TP+TN)/(TP+TN+FP+FN) | Overall correct predictions |
| **Precision** | TP/(TP+FP) | Of predicted positives, how many are correct |
| **Recall (Sensitivity)** | TP/(TP+FN) | Of actual positives, how many were found |
| **Specificity** | TN/(TN+FP) | Of actual negatives, how many were correctly identified |
| **ROC-AUC** | Area under ROC curve | Overall discrimination ability |
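
The first four formulas in the table need only the four confusion-matrix counts. The sketch below applies them directly; the function name `classification_metrics` and the example counts are hypothetical, chosen to keep the arithmetic easy to check.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the table's count-based metrics from a confusion matrix."""
    def safe_div(num, den):
        # Return 0.0 instead of raising when a class is absent
        return num / den if den else 0.0

    return {
        'accuracy': safe_div(tp + tn, tp + tn + fp + fn),
        'precision': safe_div(tp, tp + fp),
        'recall': safe_div(tp, tp + fn),
        'specificity': safe_div(tn, tn + fp),
    }


# Hypothetical counts, chosen only to illustrate the formulas
m = classification_metrics(tn=40, fp=10, fn=5, tp=45)
for name, value in m.items():
    print(f'{name:<12}: {value:.4f}')
```

ROC-AUC is left out because it depends on the ranking of predicted probabilities, not on the thresholded counts.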

In [None]:
# Compute metrics
accuracy  = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall    = recall_score(y_test, y_pred)
roc_auc   = roc_auc_score(y_test, y_prob)
confusion = confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = confusion.ravel()
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

print('=' * 40)
print('       MODEL EVALUATION METRICS')
print('=' * 40)
print(f'  Accuracy    : {accuracy:.4f}  ({accuracy*100:.2f}%)')
print(f'  Precision   : {precision:.4f}  ({precision*100:.2f}%)')
print(f'  Recall      : {recall:.4f}  ({recall*100:.2f}%)')
print(f'  Specificity : {specificity:.4f}  ({specificity*100:.2f}%)')
print(f'  ROC-AUC     : {roc_auc:.4f}')
print('=' * 40)
print(f'\nConfusion Matrix:')
print(f'  TN={tn}  FP={fp}')
print(f'  FN={fn}  TP={tp}')

In [None]:
# Visualize confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap
disp = ConfusionMatrixDisplay(confusion_matrix=confusion, display_labels=['NO (0)', 'YES (1)'])
disp.plot(ax=axes[0], colorbar=False, cmap='Blues')
axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# Metrics bar chart
metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'Specificity': specificity,
    'ROC-AUC': roc_auc
}
bars = axes[1].bar(metrics.keys(), metrics.values(),
                   color=['#3498db', '#2ecc71', '#e74c3c', '#9b59b6', '#f39c12'],
                   edgecolor='black', alpha=0.85)
axes[1].set_ylim(0, 1.1)
axes[1].set_ylabel('Score')
axes[1].set_title('Model Performance Metrics', fontsize=14, fontweight='bold')
for bar in bars:
    axes[1].text(bar.get_x() + bar.get_width()/2,
                 bar.get_height() + 0.02,
                 f'{bar.get_height():.3f}', ha='center', fontweight='bold', fontsize=10)

plt.tight_layout()
plt.savefig('model_evaluation.png', dpi=150, bbox_inches='tight')
plt.show()

## 🏁 8. Conclusion

### Results Summary

| Metric | Score |
|--------|-------|
| **Accuracy** | 96.77% |
| **Precision** | 98.33% |
| **Recall (Sensitivity)** | 98.33% |
| **Specificity** | 50.00% |
| **ROC-AUC** | 0.9250 |

**Confusion Matrix:** TN=1 | FP=1 | FN=1 | TP=59

---

### Key Findings

The Logistic Regression model performs **strongly on the positive class** of the lung cancer survey dataset, but the test set contains too few negative cases to judge how well it rules cancer out:

- **Accuracy (96.77%):** The model correctly classifies 60 of the 62 test samples. Because 60 of those samples are positive, a model that always predicts YES would reach the same accuracy.
- **Precision (98.33%):** When the model predicts lung cancer it is almost always correct, with only 1 false positive among 60 positive predictions.
- **Recall (98.33%):** The model identifies 59 of the 60 actual lung cancer cases, which matters in a medical screening context where missing a positive case has serious consequences.
- **Specificity (50.00%):** The model correctly identifies only 1 of the 2 actual negative cases in the test set. The dataset is severely imbalanced (~87% positive, with only 39 NO cases in total), and this split placed just 2 of them in the test set, so the figure is a very noisy estimate.
- **ROC-AUC (0.9250):** An AUC of 0.925 is well above the random baseline of 0.5 and suggests good discrimination between cancer and non-cancer cases, although with only 2 negatives in the test set it is also an imprecise estimate.
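
To put a number on how little 2 negative test cases can tell us, one option is a Wilson score interval around each proportion. This is a sketch of a standard binomial interval applied to the confusion-matrix counts reported above; it was not part of the original analysis, and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError('n must be positive')
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts from the reported confusion matrix: TN=1, FP=1, FN=1, TP=59
spec_lo, spec_hi = wilson_interval(1, 2)    # specificity = TN / (TN + FP)
rec_lo, rec_hi = wilson_interval(59, 60)    # recall = TP / (TP + FN)

print(f'Specificity 95% CI: {spec_lo:.2f} to {spec_hi:.2f}')
print(f'Recall      95% CI: {rec_lo:.2f} to {rec_hi:.2f}')
```

The specificity interval runs from roughly 0.09 to 0.91, so this test set neither supports nor rules out good performance on negative cases, while the recall interval stays above 0.9.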

### Feature Insights

Based on the learned model coefficients, the **top 5 most influential predictors** of lung cancer are:

1. **Fatigue** (coef: 1.478), the strongest positive predictor
2. **Alcohol Consuming** (coef: 1.351)
3. **Swallowing Difficulty** (coef: 1.335)
4. **Chronic Disease** (coef: 1.281)
5. **Coughing** (coef: 1.241)

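Notably, **Gender (Male)** had a slight negative coefficient (-0.185), suggesting males in this dataset were marginally less likely to be classified as positive after controlling for other factors. **Age** had a small coefficient (0.025), but it is expressed per year of age, whereas the symptom features are coded 1/2 and change by a single unit. A 40-year age difference shifts the log-odds by about 1.0 (0.025 × 40), which is comparable to a single symptom, so the raw age coefficient cannot be compared directly with those of the binary features unless the inputs are standardized first.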
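
Logistic regression coefficients are changes in log-odds per unit of the feature, so exponentiating them gives odds ratios that are easier to read. The sketch below uses the coefficient values quoted in this section. The column name `GENDER_M` assumes the one-hot step produced that name, and the 40-year age step is an arbitrary example.

```python
import math

# Coefficients as reported in this section (log-odds per unit increase)
coefs = {
    'FATIGUE': 1.478,
    'ALCOHOL CONSUMING': 1.351,
    'GENDER_M': -0.185,  # assumed name of the one-hot encoded gender column
    'AGE': 0.025,
}

# exp(coef) is the odds ratio for a one-unit increase in that feature
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f'{name:<18} odds ratio per unit: {ratio:.3f}')

# AGE is measured in years, so a wider step is a fairer comparison
age_40yr = math.exp(coefs['AGE'] * 40)
print(f'AGE odds ratio over 40 years: {age_40yr:.3f}')
```

An odds ratio above 1 raises the odds of a positive prediction and one below 1 lowers them. These are associations within this survey sample, not causal effects.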

### Limitations & Future Work

- The dataset is relatively small (309 records), which limits generalizability to real-world clinical settings.
- The severe class imbalance (~87% YES) inflates accuracy and recall while suppressing specificity. Techniques like **SMOTE**, **class weighting** (`class_weight='balanced'`), or **undersampling** could produce a more balanced model.
- Future experiments could benchmark against **Random Forest**, **XGBoost**, or **SVM** classifiers.
- **Cross-validation** (e.g., stratified 5-fold) would provide a more reliable performance estimate than a single train/test split, particularly for specificity.
- Feature engineering or collection of more NO-class samples would likely improve specificity and would allow it to be measured with more confidence.
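
As an illustration of the class-weighting option, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * count)`. The sketch below reproduces that arithmetic by hand for the class counts quoted in this notebook. `balanced_class_weights` is a helper written for this example, and in practice the weights would be derived from the training split only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Class counts quoted above: 39 NO (0) out of 309 records, so 270 YES (1)
y_all = [0] * 39 + [1] * 270
weights = balanced_class_weights(y_all)
print({cls: round(w, 3) for cls, w in weights.items()})
```

Each NO case would count roughly seven times as much as a YES case in the loss, which pushes the model to stop trading negatives for positive-class accuracy.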