# Logistic Regression - Binary Classification

**Algorithm 6 of 7**

Logistic regression models the probability of a binary outcome. Here the power consumption regression problem is recast as a classification task: predicting whether consumption is High or Low relative to the training-set median.

**Key Concepts:**
- Sigmoid function for probability estimation
- Binary cross-entropy loss
- Decision boundary
- Classification metrics

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_curve, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
print("✅ Libraries loaded")

## 1. Load Data and Create Binary Labels

In [None]:
with open('../datasets/processed/household_preprocessed.pkl', 'rb') as f:
    data = pickle.load(f)

X_train = data['X_train_scaled']
X_test = data['X_test_scaled']
y_train_reg = data['y_train']
y_test_reg = data['y_test']

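# Create binary labels: High (1) vs Low (0) consumption
# The threshold is the training-set median, so test data does not influence it
# and the training classes come out roughly balanced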
threshold = y_train_reg.median()

y_train = (y_train_reg > threshold).astype(int).values
y_test = (y_test_reg > threshold).astype(int).values

print(f"Threshold for High/Low: {threshold:.3f} kW")
print("\nTraining set distribution:")
print(f"  Low (0): {(y_train == 0).sum():,} ({(y_train == 0).sum()/len(y_train)*100:.1f}%)")
print(f"  High (1): {(y_train == 1).sum():,} ({(y_train == 1).sum()/len(y_train)*100:.1f}%)")

print("\nTest set distribution:")
print(f"  Low (0): {(y_test == 0).sum():,} ({(y_test == 0).sum()/len(y_test)*100:.1f}%)")
print(f"  High (1): {(y_test == 1).sum():,} ({(y_test == 1).sum()/len(y_test)*100:.1f}%)")

## 2. Logistic Regression Theory

**Sigmoid Function:**
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Model:**
$$P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

**Decision Rule:**
- If $P(y=1|x) \geq 0.5$: Predict class 1 (High)
- If $P(y=1|x) < 0.5$: Predict class 0 (Low)
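
The 0.5 cutoff corresponds to a linear decision boundary in feature space. Because the sigmoid is monotonically increasing, the rule can be rewritten in terms of the score $w^T x + b$:

$$P(y=1|x) \geq 0.5 \iff \frac{1}{1 + e^{-(w^T x + b)}} \geq \frac{1}{2} \iff e^{-(w^T x + b)} \leq 1 \iff w^T x + b \geq 0$$

So the model predicts High exactly when $w^T x + b \geq 0$, and the decision boundary is the hyperplane $w^T x + b = 0$.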

**Loss Function (Binary Cross-Entropy):**
$$L = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$
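
The two formulas above can be sketched directly in NumPy. This is an illustration only, not the code scikit-learn runs internally; the scores `z` and labels `y` are made-up values, and the `eps` clipping is an added assumption to avoid `log(0)`.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps clips probabilities to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# Hypothetical scores z = w^T x + b for four samples, with their true labels
z = np.array([-2.0, 0.0, 1.0, 3.0])
y = np.array([0, 0, 1, 1])

p = sigmoid(z)                      # predicted P(y=1|x) per sample
loss = binary_cross_entropy(y, p)   # average penalty for these predictions
print("Probabilities:", p)
print("Loss:", loss)
```

A score of 0 maps to a probability of exactly 0.5, and predicting 0.5 for every sample gives a loss of $\log 2 \approx 0.693$, which is a useful baseline when reading loss values.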

## 3. Train Logistic Regression Model

In [None]:
print("="*70)
print("LOGISTIC REGRESSION")
print("="*70)

# Initialize and train model (scikit-learn defaults: L2 regularization with C=1.0, lbfgs solver)
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Predictions
y_pred = lr_model.predict(X_test)
y_pred_proba = lr_model.predict_proba(X_test)[:, 1]  # Probability of class 1

print("✅ Model trained successfully!")
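
To make the fitting step less of a black box, here is a minimal sketch of training by batch gradient descent on the unregularized cross-entropy loss. It is a swapped-in technique for illustration: scikit-learn's default uses the lbfgs optimizer with L2 regularization, so the weights will not match `lr_model` exactly. The function `fit_logistic_gd`, its hyperparameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Fit weights w and bias b by batch gradient descent on mean cross-entropy."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y=1|x)
        error = p - y                            # gradient of the loss w.r.t. the score
        w -= lr * (X.T @ error) / n_samples      # average gradient over the batch
        b -= lr * error.mean()
    return w, b

# Synthetic data (not the household dataset): two standardized features,
# with the label driven by the first feature plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

w_demo, b_demo = fit_logistic_gd(X_demo, y_demo)
p_demo = 1.0 / (1.0 + np.exp(-(X_demo @ w_demo + b_demo)))
acc_demo = ((p_demo >= 0.5).astype(int) == y_demo).mean()
print("Weights:", w_demo, "Bias:", b_demo)
print("Training accuracy:", acc_demo)
```

The learned weight on the first feature should dominate, since that feature alone determines the label up to noise.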

## 4. Model Evaluation

In [None]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)

print("="*70)
print("CLASSIFICATION METRICS")
print("="*70)
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"AUC-ROC:   {auc:.4f}")

print("\n" + "="*70)
print("DETAILED CLASSIFICATION REPORT")
print("="*70)
print(classification_report(y_test, y_pred, target_names=['Low (0)', 'High (1)']))

## 5. Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Low', 'High'],
            yticklabels=['Low', 'High'],
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Label', fontweight='bold', fontsize=12)
plt.ylabel('True Label', fontweight='bold', fontsize=12)
plt.title('Confusion Matrix - Logistic Regression', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Confusion Matrix:")
print(f"  True Negatives (TN):  {cm[0,0]:,}")
print(f"  False Positives (FP): {cm[0,1]:,}")
print(f"  False Negatives (FN): {cm[1,0]:,}")
print(f"  True Positives (TP):  {cm[1,1]:,}")
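
The four counts above are enough to recompute the Section 4 metrics by hand. The sketch below does so on a small set of made-up labels (not results from this notebook), which can help when checking what each metric rewards.

```python
import numpy as np

# Hypothetical toy labels (1 = High, 0 = Low); not results from this notebook
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Count each cell of the confusion matrix
tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))

accuracy_demo = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision_demo = tp / (tp + fp)                   # share of predicted High that are truly High
recall_demo = tp / (tp + fn)                      # share of true High that are detected
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy={accuracy_demo:.2f}, Precision={precision_demo:.2f}, "
      f"Recall={recall_demo:.2f}, F1={f1_demo:.2f}")
```

Applied to the real `cm`, the same formulas with `cm[1,1]`, `cm[0,0]`, `cm[0,1]`, and `cm[1,0]` should reproduce the values printed by the scikit-learn metric functions.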

## 6. ROC Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc:.4f})')
plt.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Random Classifier')
plt.xlabel('False Positive Rate', fontweight='bold', fontsize=12)
plt.ylabel('True Positive Rate', fontweight='bold', fontsize=12)
plt.title('ROC Curve - Logistic Regression', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"AUC-ROC Score: {auc:.4f}")
print("\nInterpretation:")
if auc > 0.9:
    print("  Excellent classification performance!")
elif auc > 0.8:
    print("  Good classification performance")
elif auc > 0.7:
    print("  Fair classification performance")
else:
    print("  Poor classification performance")
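
AUC also has a direct probabilistic reading: it is the probability that a randomly chosen High sample receives a higher score than a randomly chosen Low sample, with ties counted as one half. The sketch below computes it that way on made-up scores. The helper `auc_by_pairs` is hypothetical and compares every positive with every negative, so it is only practical for small samples.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]

auc_toy = auc_by_pairs(y_toy, s_toy)
print("Pairwise AUC:", auc_toy)  # 3 of 4 pairs are ordered correctly -> 0.75
```

On the same inputs, `roc_auc_score` should return the same value, since the area under the ROC curve equals this pairwise ranking probability.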

## Conclusions

**Logistic Regression Results:**
- Algorithm 6 of 7 successfully implemented
- Binary classification: High vs Low power consumption
- Test-set accuracy is printed by the evaluation cell in Section 4

**Key Metrics:**
- Precision: how often predictions of "High" are correct
- Recall: how often actual "High" cases are detected
- F1-Score: harmonic mean of precision and recall
- AUC-ROC: probability that a randomly chosen "High" sample is scored above a randomly chosen "Low" sample

**Applications:**
- Classify consumption patterns
- Predict if household will have high usage
- Support energy management decisions
- Flag periods of above-median consumption for further review