# C2: ASSUMPTIONS OF LOGISTIC REGRESSION & MODEL EVALUATION

## Assumptions of Logistic Regression
Before performing Logistic Regression, we must check if certain conditions hold:

1. **Binary outcome**
    - The dependent variable must be binary (0/1, yes/no, true/false).
    - For multiclass problems, use an extension such as *multinomial logistic regression*.

2. **Independence of observations**
    - Each data point should be independent.
    - Example: A patient’s record should not influence another patient’s record.

3. **Linearity of Log-Odds**
    - Logistic regression assumes that the log-odds, $\log\frac{p}{1-p}$ where $p$ is the probability of the positive class, are a linear function of the predictors.
    - Example: $\log(\mathrm{odds}) = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{blood pressure}$

4. **No multicollinearity among predictors**
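    - Predictors should not be highly correlated with each other; strong correlation makes the coefficient estimates unstable and hard to interpret.
    - Example: Height in cm and height in inches (the two columns carry the same information).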

5. **Large sample size**
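    - Coefficients are estimated by maximum likelihood, which is more reliable on large datasets.
    - This is especially important when the event of interest is rare, because the rare class then contributes very few observations.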
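
To make the linearity-of-log-odds assumption concrete, the sketch below turns a linear combination of predictors into a probability with the logistic (sigmoid) function. The coefficient values are hypothetical and chosen only for illustration; they are not fitted to any data in this notebook.

```python
import math

def log_odds_to_probability(log_odds: float) -> float:
    """Apply the logistic (sigmoid) function to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients (assumed for illustration, not estimated).
beta_0, beta_age, beta_bp = -12.0, 0.05, 0.07

def predicted_probability(age: float, blood_pressure: float) -> float:
    """Linear model on the log-odds scale, then convert to a probability."""
    log_odds = beta_0 + beta_age * age + beta_bp * blood_pressure
    return log_odds_to_probability(log_odds)

p = predicted_probability(age=50, blood_pressure=140)
print(round(p, 3))  # log-odds = 0.3, so p is about 0.574
```

A log-odds of 0 corresponds to a probability of exactly 0.5, which is why the default decision boundary sits where the linear combination equals zero.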

## Model Evaluation Metrics
- Logistic regression is primarily used for classification.
- We need to evaluate performance using multiple metrics, not just accuracy.

### Confusion Matrix

|                     | Predicted Positive  | Predicted Negative  |
| ------------------- | ------------------- | ------------------- |
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |
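
As a minimal sketch, the four cells of the table can be counted directly from label lists. The helper name `confusion_counts` and the toy labels are assumptions made for this illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive case correctly flagged
        elif actual == positive:
            fn += 1  # positive case missed
        elif predicted == positive:
            fp += 1  # negative case wrongly flagged
        else:
            tn += 1  # negative case correctly passed
    return tp, fn, fp, tn

# Toy labels, made up for the example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # 3 1 1 3
```

Note that `sklearn.metrics.confusion_matrix` orders rows by actual class and columns by predicted class with label 0 first, so for 0/1 labels its output reads `[[TN, FP], [FN, TP]]`, which is the reverse of the ordering in the table above.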

### Metrics

1. **Accuracy**
    - $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
    - Shows overall correctness of the model.
    - Can be misleading in imbalanced datasets (e.g., when one class dominates).

2. **Precision**
    - $\mathrm{Precision} = \frac{TP}{TP + FP}$
    - Focus: Correctness of positive predictions.
    - Tells how many predicted positives were actually correct.
    - Example: Out of 100 predicted "cancer" cases, 80 were correct → precision = 0.8.

3. **Recall (Sensitivity / True Positive Rate)**
    - $\mathrm{Recall} = \frac{TP}{TP + FN}$
    - Focus: Capturing actual positives.
    - Tells how many of the real positive cases were correctly identified.
    - Example: Out of 100 actual cancer cases, the model detected 90 → recall = 0.9.

4. **F1 Score**
    - $\mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
    - Harmonic mean of precision and recall.
    - Useful when there is class imbalance and we want a balance between precision and recall.

5. **ROC Curve & Area Under Curve (AUC)**
    - The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate, $\frac{FP}{FP + TN}$, as the classification threshold varies.
    - AUC represents the area under this curve.
    - Interpretation:
        - AUC = 0.5 → Model is no better than random guessing.
        - AUC close to 1 → Model is very good at distinguishing between classes.
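
The formulas above can be checked by hand. The sketch below recomputes the precision and recall examples from the list, combines them into F1, and computes AUC through its ranking interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The function names and the four-sample score list are assumptions for this illustration.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_from(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

p = precision(tp=80, fp=20)   # 80 of 100 predicted positives are correct
r = recall(tp=90, fn=10)      # 90 of 100 actual positives are detected
print(p, r, round(f1_from(p, r), 3))  # 0.8 0.9 0.847

# Toy scores, made up for the example: 3 of 4 pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The precision and recall examples describe two separate situations, so the F1 value here only illustrates the arithmetic of the harmonic mean.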


In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Example dataset
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 23, 34, 45, 36, 50],
    'bp': [120, 130, 140, 150, 160, 110, 125, 135, 128, 142],
    'has_disease': [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
})

X = data[['age', 'bp']]
y = data['has_disease']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]

print("Confusion Matrix :", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred, zero_division=0))
print("Recall :", recall_score(y_test, y_pred, zero_division=0))
print("F1 score :", f1_score(y_test, y_pred, zero_division=0))
print("ROC-AUC :", roc_auc_score(y_test, y_prob))

Confusion Matrix : [[2 0]
 [0 1]]
Accuracy : 1.0
Precision : 1.0
Recall : 1.0
F1 score : 1.0
ROC-AUC : 1.0


## Model Performance

Consider an example with a dataset of 100 patients:
- 95 healthy (class 0)
- 5 with disease (class 1)

If a model always predicts "healthy," it will achieve 95% accuracy.  
However, this is misleading because it fails to identify any diseased patients.  

This is why we use additional metrics such as **Precision, Recall, F1-score, and ROC-AUC**, instead of relying only on accuracy.
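
The 95/5 example can be reproduced in a few lines. This is a small sketch using plain Python lists; the variable names are chosen for the illustration.

```python
# 95 healthy patients (class 0) and 5 with the disease (class 1).
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts healthy.
y_pred = [0] * 100

accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
recall = tp / (tp + fn)

print(accuracy)  # 0.95
print(recall)    # 0.0, since none of the 5 diseased patients is found
```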

## Handling Imbalanced Data

When one class is much rarer than the other, models tend to ignore the minority class.  
Examples include **fraud detection, cancer diagnosis, churn prediction, and rare event forecasting**.

### Techniques to Handle Imbalance

1. **Resampling**
    - **Oversampling the minority class**: Duplicate rare samples or create synthetic ones.
    - **Undersampling the majority class**: Randomly drop some frequent class samples.
    - **Trade-off**: Oversampling may cause overfitting; undersampling can result in loss of useful information.

2. **Synthetic Minority Oversampling Technique (SMOTE)**
    - Creates new synthetic samples for the minority class by interpolating between existing samples.
    - Adds new, slightly different minority samples instead of repeating identical rows, which tends to reduce the overfitting seen with simple duplication.
    - **Caution**: May create overlapping or noisy samples if not tuned carefully.

3. **Class Weight Adjustment**
    - Assign higher importance (weight) to the minority class during training.
    - In Logistic Regression, this can be done using `class_weight='balanced'`.
    - Helps the model pay more attention to rare events.

4. **Threshold Tuning**
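    - Logistic regression estimates a probability for each sample; in scikit-learn, `predict_proba` returns it, while `predict` applies a fixed cutoff.
    - The usual cutoff is **0.5**, but in imbalanced problems, lowering the threshold can help capture more positives.
    - **Trade-off**: Lowering the threshold increases Recall (it can never decrease it) but usually reduces Precision.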

5. **Evaluation with Proper Metrics**
    - In imbalanced datasets, Accuracy is not reliable.
    - Metrics like **Precision-Recall Curve, F1-score, ROC-AUC, and Matthews Correlation Coefficient (MCC)** are more informative.
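
Threshold tuning (technique 4) can be sketched without any library. The labels and probabilities below are invented for illustration, and `classify` and `precision_recall` are hypothetical helper names, not scikit-learn functions.

```python
def classify(probabilities, threshold=0.5):
    """Label a sample positive when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities (3 positives out of 10).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.90]

results = {}
for threshold in (0.5, 0.3):
    results[threshold] = precision_recall(y_true, classify(y_prob, threshold))
    print(threshold, results[threshold])
# At 0.5: precision 1.0, recall 2/3. At 0.3: precision 0.5, recall 1.0.
```

With a fitted scikit-learn model, the same idea would compare `model.predict_proba(X_test)[:, 1]` against the chosen threshold instead of calling `model.predict`.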


In [8]:
# Logistic regression with class weights
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1000, n_features=5, n_classes=2, weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

model_normal = LogisticRegression(max_iter=1000)
model_normal.fit(X_train, y_train)
print("Without class weight :", classification_report(y_test, model_normal.predict(X_test)))

model_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
model_balanced.fit(X_train, y_train)
print("With class weight :", classification_report(y_test, model_balanced.predict(X_test)))

Without class weight :               precision    recall  f1-score   support

           0       0.94      0.98      0.96       269
           1       0.70      0.45      0.55        31

    accuracy                           0.92       300
   macro avg       0.82      0.71      0.75       300
weighted avg       0.91      0.92      0.92       300

With class weight :               precision    recall  f1-score   support

           0       0.98      0.88      0.93       269
           1       0.45      0.84      0.58        31

    accuracy                           0.88       300
   macro avg       0.71      0.86      0.76       300
weighted avg       0.92      0.88      0.89       300
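
To see what `class_weight='balanced'` does, the sketch below reproduces the weighting rule given in the scikit-learn documentation, `n_samples / (n_classes * count_of_class)`, in plain Python. The 900/100 label split is a hypothetical example, not the training split used above.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Hypothetical labels: 900 majority samples and 100 minority samples.
y = [0] * 900 + [1] * 100

weights = balanced_class_weights(y)
print(weights)  # class 0 gets about 0.556, class 1 gets 5.0
```

Each minority sample therefore counts nine times as much as a majority sample in the loss, which is consistent with the higher recall and lower precision for class 1 in the weighted report above.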



In [None]:
# Using SMOTE oversampling
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("Before SMOTE :", pd.Series(y_train).value_counts())
print("After SMOTE :", pd.Series(y_train_res).value_counts())

model_smote = LogisticRegression(max_iter=1000)
model_smote.fit(X_train_res,y_train_res)
y_pred = model_smote.predict(X_test)

print("Confusion matrix :", confusion_matrix(y_test, y_pred))
print("Classification report :", classification_report(y_test, y_pred))

Before SMOTE : 0    626
1     74
Name: count, dtype: int64
After SMOTE : 0    626
1    626
Name: count, dtype: int64
Confusion matrix : [[237  32]
 [  5  26]]
Classification report :               precision    recall  f1-score   support

           0       0.98      0.88      0.93       269
           1       0.45      0.84      0.58        31

    accuracy                           0.88       300
   macro avg       0.71      0.86      0.76       300
weighted avg       0.92      0.88      0.89       300
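
The interpolation step behind SMOTE can be sketched as follows. This is a simplified illustration, not the `imblearn` implementation: the real algorithm picks the neighbor from the k nearest minority samples, whereas here the neighbor is passed in directly, and the function name `interpolate` is an assumption.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create a synthetic point on the segment between two minority samples."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)

# Two made-up minority-class samples with two features each.
minority_a = [1.0, 2.0]
minority_b = [3.0, 6.0]

synthetic = interpolate(minority_a, minority_b, rng)
print(synthetic)  # a point between minority_a and minority_b
```

Because every synthetic point lies between two existing minority samples, SMOTE can place points inside the majority region when the classes overlap, which is the noise risk mentioned in the caution above.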

