# Logistic Regression Tutorial
## Binary Classification: Predicting Expensive Houses

Welcome to **Logistic Regression** - our first classification algorithm!

### What you'll learn:
- How logistic regression differs from linear regression
- The sigmoid function and probability prediction
- Binary classification metrics (accuracy, precision, recall)
- ROC curves and confusion matrices
- Decision boundaries and feature importance

### Our Task:
Predict whether a house is **expensive** (>$350,000) based on its features.
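
The steps below assume `dataset.csv` already contains a binary `expensive` column. As a minimal sketch, such a column could be derived from `price` with the $350,000 cutoff stated above. The three sample rows and the `PRICE_THRESHOLD` name are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the real dataset.csv is not shown in this tutorial
houses = pd.DataFrame({"price": [250_000, 350_000, 420_000]})

# Strictly greater than the threshold counts as expensive (1), otherwise 0
PRICE_THRESHOLD = 350_000
houses["expensive"] = (houses["price"] > PRICE_THRESHOLD).astype(int)

print(houses)
```

Note that a house priced at exactly $350,000 falls into the "not expensive" class under this rule.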

Let's start classifying! 🎯

## Step 1: Import Libraries and Load Data

For classification, we need additional metrics and visualization tools.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                           f1_score, confusion_matrix, roc_auc_score,
                           roc_curve, classification_report)
import seaborn as sns

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Load the dataset
data = pd.read_csv('dataset.csv')

print("✅ Libraries imported and data loaded!")
print(f"Dataset shape: {data.shape}")
print(f"Columns: {list(data.columns)}")
data.head()

## Step 2: Explore the Target Variable

Let's analyze our binary target variable - whether houses are expensive or not.

In [None]:
# Analyze the target variable
print("=" * 50)
print("TARGET VARIABLE ANALYSIS")
print("=" * 50)

# Count classes (sorted by label so positions 0/1 match the plot labels below)
target_counts = data['expensive'].value_counts().sort_index()
print("Class distribution:")
print(f"Not Expensive (0): {target_counts[0]} ({target_counts[0]/len(data)*100:.1f}%)")
print(f"Expensive (1): {target_counts[1]} ({target_counts[1]/len(data)*100:.1f}%)")

# Visualize class distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Pie chart
labels = ['Not Expensive', 'Expensive']
colors = ['lightcoral', 'lightblue']
ax1.pie(target_counts.values, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
ax1.set_title('Class Distribution')

# Bar chart
ax2.bar(labels, target_counts.values, color=colors, alpha=0.7)
ax2.set_ylabel('Count')
ax2.set_title('Class Counts')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Check if dataset is balanced
balance_ratio = min(target_counts) / max(target_counts)
if balance_ratio > 0.8:
    print("✅ Dataset is well balanced")
elif balance_ratio > 0.6:
    print("👍 Dataset is reasonably balanced")
else:
    print("⚠️ Dataset is imbalanced - consider balancing techniques")
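
If the check above reports an imbalanced dataset, one common balancing technique is to weight the classes inversely to their frequency. The sketch below uses made-up labels (`y_example`) to show how scikit-learn computes such weights. It is one option among several, not necessarily the right one for this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 not expensive, 2 expensive
y_example = np.array([0] * 8 + [1] * 2)

# 'balanced' weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_example)
class_weights = {0: float(weights[0]), 1: float(weights[1])}
print(class_weights)  # the minority class receives the larger weight
```

Passing `class_weight='balanced'` to `LogisticRegression` applies the same weighting during training.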

## Step 3: Feature Analysis by Class

Let's analyze how our features differ between expensive and non-expensive houses.

In [None]:
# Analyze features by class
print("=" * 50)
print("FEATURE ANALYSIS BY CLASS")
print("=" * 50)

numerical_features = ['area', 'bedrooms', 'age']

for feature in numerical_features:
    expensive_mean = data[data['expensive'] == 1][feature].mean()
    not_expensive_mean = data[data['expensive'] == 0][feature].mean()
    
    print(f"\n{feature.upper()}:")
    print(f"  Expensive houses: {expensive_mean:.2f}")
    print(f"  Not expensive houses: {not_expensive_mean:.2f}")
    print(f"  Difference: {expensive_mean - not_expensive_mean:.2f}")

# Visualize feature distributions by class
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Feature Analysis by Class', fontsize=16, fontweight='bold')

colors = ['lightcoral', 'lightblue']
labels = ['Not Expensive', 'Expensive']

# Area distribution by class
for i, class_val in enumerate([0, 1]):
    class_data = data[data['expensive'] == class_val]['area']
    axes[0, 0].hist(class_data, alpha=0.7, label=labels[i], color=colors[i], bins=10)
axes[0, 0].set_xlabel('House Area (sq ft)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Area Distribution by Class')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

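# Box plot: Area by class
data.boxplot(column='area', by='expensive', ax=axes[0, 1])
# pandas overwrites the figure title with "Boxplot grouped by expensive"; restore ours
fig.suptitle('Feature Analysis by Class', fontsize=16, fontweight='bold')
axes[0, 1].set_xlabel('Expensive (0=No, 1=Yes)')
axes[0, 1].set_ylabel('Area (sq ft)')
axes[0, 1].set_title('Area by Class (Box Plot)')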

# Age distribution by class
for i, class_val in enumerate([0, 1]):
    class_data = data[data['expensive'] == class_val]['age']
    axes[1, 0].hist(class_data, alpha=0.7, label=labels[i], color=colors[i], bins=10)
axes[1, 0].set_xlabel('House Age (years)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Age Distribution by Class')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Scatter plot: Area vs Price, colored by class
for i, class_val in enumerate([0, 1]):
    class_data = data[data['expensive'] == class_val]
    axes[1, 1].scatter(class_data['area'], class_data['price'],
                      alpha=0.7, label=labels[i], color=colors[i])
axes[1, 1].set_xlabel('Area (sq ft)')
axes[1, 1].set_ylabel('Price ($)')
axes[1, 1].set_title('Area vs Price (by Class)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
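
The per-class means computed in the loop above can also be obtained in one step with `groupby`. This is a sketch on made-up rows that reuse the tutorial's column names; the numbers are not from `dataset.csv`.

```python
import pandas as pd

# Hypothetical rows using the same column names as the tutorial
sample = pd.DataFrame({
    "area": [1200, 1500, 2400, 2800],
    "bedrooms": [2, 3, 4, 4],
    "age": [30, 20, 10, 5],
    "expensive": [0, 0, 1, 1],
})

# One row per class, one column per feature
class_means = sample.groupby("expensive")[["area", "bedrooms", "age"]].mean()
print(class_means)

# Difference: expensive mean minus not-expensive mean
mean_diff = class_means.loc[1] - class_means.loc[0]
print(mean_diff)
```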

## Step 4: Data Preparation for Classification

Prepare our features and target for logistic regression, including encoding categorical variables and scaling.

In [None]:
# Prepare data for logistic regression
print("=" * 50)
print("DATA PREPARATION")
print("=" * 50)

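# Handle categorical variables (encode location as integer codes).
# Note: integer codes impose an arbitrary order on locations, so the
# location coefficient should be interpreted with caution.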
label_encoder = LabelEncoder()
data_processed = data.copy()
data_processed['location_encoded'] = label_encoder.fit_transform(data_processed['location'])

print("Location encoding:")
for i, location in enumerate(label_encoder.classes_):
    print(f"  {location} -> {i}")

# Select features (excluding price as it's used to create target)
feature_columns = ['area', 'bedrooms', 'age', 'location_encoded']
X = data_processed[feature_columns].values
y = data_processed['expensive'].values

print(f"\nFeatures shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature columns: {feature_columns}")

# Split the data with stratification (maintains class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Ensures balanced split
)

print(f"\nData split:")
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")

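# Check class distribution in splits (minlength=2 keeps both classes even if one is absent)
train_dist = np.bincount(y_train, minlength=2)
test_dist = np.bincount(y_test, minlength=2)
print(f"\nClass distribution (counts, then percentages):")
print(f"Training: {train_dist} ({np.round(train_dist/len(y_train)*100, 1)})")
print(f"Testing: {test_dist} ({np.round(test_dist/len(y_test)*100, 1)})")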
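
Label encoding turns each location into an integer, which a linear model reads as an ordered quantity. A common alternative for unordered categories is one-hot encoding. The sketch below assumes hypothetical location names, since the real categories in `dataset.csv` are not shown here.

```python
import pandas as pd

# Hypothetical location values; the real categories come from dataset.csv
sample = pd.DataFrame({
    "area": [1200, 2400, 1800],
    "location": ["suburb", "downtown", "rural"],
})

# drop_first=True drops one category to avoid redundant (collinear) columns
encoded = pd.get_dummies(sample, columns=["location"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding each remaining location gets its own coefficient, measured relative to the dropped category.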

## Step 5: Feature Scaling

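Standardize the features to zero mean and unit variance. Scikit-learn's `LogisticRegression` applies L2 regularization by default, so features on very different scales would be penalized unevenly, and the solver also converges faster on scaled inputs. The scaler is fit on the training set only, so no test-set statistics leak into training.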

In [None]:
# Feature scaling (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("=" * 50)
print("FEATURE SCALING")
print("=" * 50)
print("✅ Features scaled successfully!")

# Show scaling effect
print("\nScaling effect (first 3 features):")
for i, feature in enumerate(feature_columns[:3]):
    print(f"\n{feature}:")
    print(f"  Original - Mean: {X_train[:, i].mean():.2f}, Std: {X_train[:, i].std():.2f}")
    print(f"  Scaled   - Mean: {X_train_scaled[:, i].mean():.2f}, Std: {X_train_scaled[:, i].std():.2f}")
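
To make the scaling step concrete, the sketch below standardizes a single made-up feature by hand and checks the result against `StandardScaler`. The arrays are illustrative values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test values for a single feature (area in sq ft)
train = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
test = np.array([[1750.0], [3000.0]])

# Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses

manual_train = (train - mean) / std
manual_test = (test - mean) / std  # test data reuses the training statistics

scaler = StandardScaler().fit(train)
print(np.allclose(manual_train, scaler.transform(train)))
print(np.allclose(manual_test, scaler.transform(test)))
```

Because the test set is transformed with the training mean and standard deviation, its scaled values need not have mean 0 or standard deviation 1.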

## Step 6: Train Logistic Regression Model

Train our logistic regression classifier and analyze the learned parameters.

In [None]:
# Train logistic regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

print("=" * 50)
print("MODEL TRAINING")
print("=" * 50)
print("✅ Model trained successfully!")

# Extract model parameters
coefficients = model.coef_[0]
intercept = model.intercept_[0]

print(f"\nModel Parameters:")
print(f"Intercept: {intercept:.4f}")
print(f"\nCoefficients:")
for feature, coef in zip(feature_columns, coefficients):
    print(f"  {feature:15s}: {coef:8.4f}")

# Feature importance analysis
feature_importance = list(zip(feature_columns, np.abs(coefficients)))
feature_importance.sort(key=lambda x: x[1], reverse=True)

print(f"\n📊 Feature Importance (coefficient magnitudes):")
for i, (feature, importance) in enumerate(feature_importance, 1):
    print(f"  {i}. {feature:15s}: {importance:.4f}")

# Coefficient interpretation
# Features were standardized, so each odds ratio applies to a one-standard-deviation increase
print(f"\n🔍 Coefficient Interpretation (Odds Ratios per 1 std increase):")
for feature, coef in zip(feature_columns, coefficients):
    odds_ratio = np.exp(coef)
    if coef > 0:
        effect = "increases"
    else:
        effect = "decreases"
    print(f"  {feature:15s}: multiplies odds by {odds_ratio:.3f} ({effect} them)")
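
Because the model is trained on standardized features, `exp(coef)` is the odds multiplier for a one-standard-deviation increase. The sketch below shows how that could be converted to an odds ratio per original unit. Both numbers (`coef_scaled`, `feature_std`) are made up for illustration and are not results from this model.

```python
import numpy as np

# Hypothetical values: a coefficient learned on a standardized feature
# and that feature's standard deviation in the training set
coef_scaled = 1.2      # log-odds change per one standard deviation
feature_std = 600.0    # e.g. area, in sq ft

odds_ratio_per_std = np.exp(coef_scaled)
odds_ratio_per_unit = np.exp(coef_scaled / feature_std)
odds_ratio_per_100 = np.exp(coef_scaled * 100 / feature_std)

print(f"Per 1 std:     x{odds_ratio_per_std:.3f}")
print(f"Per 1 sq ft:   x{odds_ratio_per_unit:.5f}")
print(f"Per 100 sq ft: x{odds_ratio_per_100:.3f}")
```

In the notebook, the per-feature standard deviations would come from `scaler.scale_`.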

## Step 7: Make Predictions and Get Probabilities

Use our trained model to make predictions and get probability estimates.

In [None]:
# Make predictions
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]  # Probability of class 1

print("=" * 50)
print("PREDICTIONS")
print("=" * 50)
print(f"✅ Predictions completed on {len(y_pred)} test samples!")
print(f"Probability range: {y_prob.min():.3f} to {y_prob.max():.3f}")

# Show sample predictions
print(f"\n📋 Sample Predictions:")
print(f"{'Actual':>8} {'Predicted':>10} {'Probability':>12} {'Confidence':>12}")
print("-" * 45)

for i in range(min(10, len(y_test))):
    actual = y_test[i]
    predicted = y_pred[i]
    probability = y_prob[i]
    confidence = max(probability, 1 - probability)
    
    # Column widths match the header above so the values line up
    print(f"{actual:8d} {predicted:10d} {probability:12.3f} {confidence:12.3f}")

# Prediction statistics (minlength=2 avoids an IndexError if only one class is predicted)
pred_dist = np.bincount(y_pred, minlength=2)
print(f"\n📊 Prediction Distribution:")
print(f"Predicted Not Expensive: {pred_dist[0]}")
print(f"Predicted Expensive: {pred_dist[1]}")
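
The probabilities returned by `predict_proba` come from passing a linear score through the sigmoid function. The sketch below reproduces them by hand for a small binary model. The data is synthetic, and names such as `X_demo` and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, already-scaled data: one feature, two classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then squash with the sigmoid
z = X_demo @ demo_model.coef_[0] + demo_model.intercept_[0]
manual_prob = sigmoid(z)

# A probability above 0.5 corresponds to a positive score z
manual_pred = (z > 0).astype(int)

print(np.allclose(manual_prob, demo_model.predict_proba(X_demo)[:, 1]))
print(np.array_equal(manual_pred, demo_model.predict(X_demo)))
```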

## Step 8: Model Evaluation

Evaluate our classification model using comprehensive metrics.

In [None]:
# Calculate classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print("=" * 50)
print("MODEL EVALUATION")
print("=" * 50)

print("📊 Classification Metrics:")
print(f"  Accuracy:  {accuracy:.4f} ({accuracy*100:.1f}%)")
print(f"  Precision: {precision:.4f} ({precision*100:.1f}%)")
print(f"  Recall:    {recall:.4f} ({recall*100:.1f}%)")
print(f"  F1-Score:  {f1:.4f}")
print(f"  ROC-AUC:   {roc_auc:.4f}")

# Confusion Matrix (labels fixed so the matrix is always 2x2)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(f"\n📈 Confusion Matrix:")
print(f"              Predicted")
print(f"           Not Exp  Expensive")
print(f"Actual Not Exp  {tn:3d}      {fp:3d}")
print(f"    Expensive   {fn:3d}      {tp:3d}")

print(f"\n🔍 Detailed Breakdown:")
print(f"  True Positives (TP):  {tp} - Correctly predicted expensive")
print(f"  True Negatives (TN):  {tn} - Correctly predicted not expensive")
print(f"  False Positives (FP): {fp} - Incorrectly predicted expensive")
print(f"  False Negatives (FN): {fn} - Incorrectly predicted not expensive")

# Performance interpretation
print(f"\n💡 Performance Interpretation:")
if accuracy > 0.9:
    print("🌟 Excellent accuracy!")
elif accuracy > 0.8:
    print("✅ Good accuracy!")
elif accuracy > 0.7:
    print("👍 Fair accuracy.")
else:
    print("⚠️ Poor accuracy - model needs improvement.")

if roc_auc > 0.9:
    print("🌟 Excellent discrimination ability!")
elif roc_auc > 0.8:
    print("✅ Good discrimination ability!")
elif roc_auc > 0.7:
    print("👍 Fair discrimination ability.")
else:
    print("⚠️ Poor discrimination.")

# Detailed classification report
print(f"\n📋 Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Expensive', 'Expensive']))
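
The metrics above all follow from the four confusion-matrix counts. As a worked example, the sketch below computes them by hand from made-up counts. These are not results from this dataset, and the `_demo` names are used only to avoid overwriting the notebook's variables.

```python
# Hypothetical confusion-matrix counts, not results from this dataset
tp_demo, fp_demo, fn_demo, tn_demo = 40, 10, 20, 30
total = tp_demo + fp_demo + fn_demo + tn_demo

accuracy_demo = (tp_demo + tn_demo) / total       # share of all predictions that are right
precision_demo = tp_demo / (tp_demo + fp_demo)    # of predicted expensive, how many really are
recall_demo = tp_demo / (tp_demo + fn_demo)       # of actual expensive, how many are found
f1_demo = (2 * precision_demo * recall_demo
           / (precision_demo + recall_demo))      # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy_demo:.3f}")   # 0.700
print(f"Precision: {precision_demo:.3f}")  # 0.800
print(f"Recall:    {recall_demo:.3f}")     # 0.667
print(f"F1-Score:  {f1_demo:.3f}")         # 0.727
```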

## Step 9: Visualization of Results

Create comprehensive visualizations to understand our model's performance.

In [None]:
# Create comprehensive visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Logistic Regression Classification Results', fontsize=16, fontweight='bold')

# 1. Confusion Matrix Heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0],
            xticklabels=['Not Expensive', 'Expensive'],
            yticklabels=['Not Expensive', 'Expensive'])
axes[0, 0].set_title('Confusion Matrix')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('Actual')

# 2. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
axes[0, 1].plot(fpr, tpr, color='darkorange', lw=2, 
                label=f'ROC curve (AUC = {roc_auc:.3f})')
axes[0, 1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
axes[0, 1].set_xlim([0.0, 1.0])
axes[0, 1].set_ylim([0.0, 1.05])
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend(loc="lower right")
axes[0, 1].grid(True, alpha=0.3)

# 3. Probability Distribution by Class
prob_expensive = y_prob[y_test == 1]
prob_not_expensive = y_prob[y_test == 0]

axes[0, 2].hist(prob_not_expensive, alpha=0.7, label='Not Expensive', 
                color='lightcoral', bins=15, density=True)
axes[0, 2].hist(prob_expensive, alpha=0.7, label='Expensive', 
                color='lightblue', bins=15, density=True)
axes[0, 2].axvline(x=0.5, color='red', linestyle='--', linewidth=2, label='Threshold')
axes[0, 2].set_xlabel('Predicted Probability')
axes[0, 2].set_ylabel('Density')
axes[0, 2].set_title('Probability Distribution by Class')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Feature Importance
coefficients_abs = np.abs(coefficients)
feature_names = feature_columns

importance_order = np.argsort(coefficients_abs)[::-1]
sorted_features = [feature_names[i] for i in importance_order]
sorted_coefficients = coefficients_abs[importance_order]

bars = axes[1, 0].bar(range(len(sorted_features)), sorted_coefficients, 
                      alpha=0.7, color=['blue', 'green', 'red', 'orange'])
axes[1, 0].set_xlabel('Features')
axes[1, 0].set_ylabel('Coefficient Magnitude')
axes[1, 0].set_title('Feature Importance')
axes[1, 0].set_xticks(range(len(sorted_features)))
axes[1, 0].set_xticklabels(sorted_features, rotation=45)
axes[1, 0].grid(True, alpha=0.3)

# 5. Prediction Confidence
confidence = np.maximum(y_prob, 1 - y_prob)
correct_predictions = (y_pred == y_test)

axes[1, 1].scatter(confidence[correct_predictions], [1]*sum(correct_predictions), 
                   alpha=0.6, color='green', label='Correct', s=30)
axes[1, 1].scatter(confidence[~correct_predictions], [0]*sum(~correct_predictions), 
                   alpha=0.6, color='red', label='Incorrect', s=30)
axes[1, 1].set_xlabel('Prediction Confidence')
axes[1, 1].set_ylabel('Prediction Outcome')
axes[1, 1].set_title('Prediction Confidence vs Accuracy')
axes[1, 1].set_yticks([0, 1])
axes[1, 1].set_yticklabels(['Incorrect', 'Correct'])
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Metrics Comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
values = [accuracy, precision, recall, f1, roc_auc]
colors_metric = ['blue', 'green', 'red', 'orange', 'purple']

bars = axes[1, 2].bar(metrics, values, alpha=0.7, color=colors_metric)
axes[1, 2].set_ylabel('Score')
axes[1, 2].set_title('Classification Metrics Summary')
axes[1, 2].set_ylim([0, 1.1])  # headroom so value labels above tall bars are not clipped
axes[1, 2].tick_params(axis='x', rotation=45)
axes[1, 2].grid(True, alpha=0.3)

# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    axes[1, 2].text(bar.get_x() + bar.get_width()/2., height + 0.01,
                    f'{value:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
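
The AUC shown in the ROC panel can also be read as the probability that a randomly chosen expensive house receives a higher predicted probability than a randomly chosen non-expensive one. The sketch below checks this interpretation on made-up labels and scores (`y_true_demo`, `y_prob_demo`) against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob_demo = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90])

pos = y_prob_demo[y_true_demo == 1]
neg = y_prob_demo[y_true_demo == 0]

# Compare every positive with every negative; ties count as half a win
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(pairwise_auc)
print(roc_auc_score(y_true_demo, y_prob_demo))
```

Here 12 of the 16 positive-negative pairs are ranked correctly, so both values are 0.75.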

## Step 10: Summary and Key Learnings

### 🎯 What We Accomplished:
1. **Framed price prediction as classification** using the binary `expensive` target
2. **Analyzed class distributions** and feature differences
3. **Prepared data** with encoding and scaling
4. **Trained logistic regression** classifier
5. **Made probability predictions** on test data
6. **Evaluated performance** with classification metrics
7. **Visualized results** with ROC curves and confusion matrices

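### 📊 Key Results:
The values are computed in Step 8 and depend on your dataset:
- **Accuracy** (`accuracy`): share of all predictions that are correct
- **ROC-AUC** (`roc_auc`): ability to rank expensive houses above non-expensive ones
- **Precision** (`precision`): share of predicted-expensive houses that really are expensive
- **Recall** (`recall`): share of actual expensive houses the model identifies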

### 💡 Key Learnings:
- **Logistic regression** passes a linear score through the sigmoid function to produce a probability
- **Classification metrics** (accuracy, precision, recall, F1) replace regression metrics such as MSE and R²
- **ROC-AUC** measures the model's ability to distinguish the classes across all thresholds
- **Confusion matrix** provides a detailed breakdown of correct and incorrect predictions
- **Feature scaling** keeps regularization fair across features and helps the solver converge
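
Since the sigmoid equals 0.5 exactly when its input is 0, the decision boundary of logistic regression is the set of points where the linear score `w * x + b` is zero. The sketch below illustrates this for a one-feature model. The parameters `w` and `b` are made up, not the ones learned in Step 6.

```python
import numpy as np

# Hypothetical parameters for a model with a single standardized feature
w = 2.0    # coefficient
b = -1.0   # intercept

def predict_proba_1d(x):
    """P(expensive | x) for the one-feature model."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# p = 0.5 exactly where w * x + b = 0
boundary = -b / w
print(f"Decision boundary at x = {boundary}")
print(predict_proba_1d(boundary))        # 0.5 on the boundary
print(predict_proba_1d(boundary + 1.0))  # above 0.5 on the positive side
print(predict_proba_1d(boundary - 1.0))  # below 0.5 on the negative side
```

With several features the same condition defines a straight line or flat hyperplane, which is why logistic regression is called a linear classifier.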

### 🚀 Next Steps:
- Explore **polynomial regression** for non-linear relationships
- Learn about **regularization** (L2/Ridge, L1/Lasso) to control overfitting
- Try **multi-class classification** problems
- Experiment with **feature engineering** and selection

### 🤔 Questions to Consider:
- How would different probability thresholds affect results?
- What if we had more than 2 classes?
- How do we handle highly imbalanced datasets?
- When is classification better than regression?
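
As a starting point for the first question, the sketch below sweeps the probability threshold over made-up labels and scores and reports precision and recall at each setting. The arrays and the `precision_recall_at` helper are illustrative, not part of the notebook.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y_true_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob_demo = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90])

def precision_recall_at(threshold):
    """Precision and recall when predicting class 1 for prob >= threshold."""
    y_hat = (y_prob_demo >= threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true_demo == 1)))
    fp = int(np.sum((y_hat == 1) & (y_true_demo == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true_demo == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this example raising the threshold lowers recall, because fewer houses are flagged as expensive. Precision often rises in exchange, although that is not guaranteed.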

Congratulations on mastering logistic regression! 🎉