# Task 2.1: Baseline Model - Logistic Regression Implementation 

## Objective

The goal of this task is to establish a baseline classification model using Logistic Regression. This model will serve as the benchmark for comparing more complex supervised learning algorithms in subsequent tasks.

## Understanding Key Metrics

Since this is a supervised classification task, we use specific metrics to evaluate how well our model performs:

1. **Accuracy:** Fraction of all predictions that are correct. Ranges from 0 to 1, where 1 indicates perfect classification.
2. **Precision (Macro):** Unweighted average of per-class precision, where a class's precision is the proportion of predictions for that class that are correct.
3. **Recall (Macro):** Unweighted average of per-class recall, where a class's recall is the proportion of actual members of that class that are correctly identified.
4. **F1-Score (Macro):** Unweighted average of per-class F1 scores, each being the harmonic mean of that class's precision and recall. Ranges from 0 to 1.
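
To make the macro averaging concrete, the sketch below computes the four metrics by hand on a tiny set of invented labels. The label lists and the `safe_div` helper are assumptions for illustration only; the notebook itself relies on the scikit-learn functions imported in Step 1.

```python
# Toy labels, invented purely for illustration
y_true = ["Excellent", "Excellent", "Fair", "Fair", "Poor", "Poor"]
y_pred = ["Excellent", "Fair", "Fair", "Fair", "Poor", "Excellent"]


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

per_class = {}
for cls in sorted(set(y_true)):
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    per_class[cls] = (precision, recall, f1)

# Macro averaging: every class counts equally, regardless of its size
n = len(per_class)
macro_precision = sum(p for p, _, _ in per_class.values()) / n
macro_recall = sum(r for _, r, _ in per_class.values()) / n
macro_f1 = sum(f for _, _, f in per_class.values()) / n

print(f"accuracy={accuracy:.4f}  precision={macro_precision:.4f}  "
      f"recall={macro_recall:.4f}  f1={macro_f1:.4f}")
```

Because each class contributes equally to the macro averages, a rare class that is predicted poorly pulls these scores down even when overall accuracy looks high.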

## Step 1: Environment Setup and Library Imports

Import all required libraries for data manipulation, modeling, and evaluation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import pickle
from pathlib import Path
import warnings
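# Suppress library warnings for cleaner output. Note that this also hides
# scikit-learn's ConvergenceWarning, so compare n_iter_ against max_iter after training.
warnings.filterwarnings('ignore')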

print("Libraries imported successfully!")

## Step 2: Load Data and Feature Selection


**Features to EXCLUDE:**
- **IDs:** `id`, `host_id` (unique identifiers, not predictive)
- **Dates:** `host_since`, `first_review`, `last_review` (already converted to numeric features)
- **Target:** `value_category` (what we're predicting)
- **Data Leakage:** `fp_score`, `rating_normalized`, `price_normalized` (used to create the target)

**Features to INCLUDE (62 total):**
- Price and accommodation features
- Review scores (CRITICAL!)
- Host characteristics
- Availability metrics
- Location features
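- Categorical encodings

**Note:** Step 3 removes the price-derived columns and two review scores (`review_scores_rating`, `review_scores_value`) as additional leakage, so the feature count printed there is the one the model actually trains on.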

In [None]:
# Load the full dataset
df = pd.read_csv('../../data/processed/listings_cleaned_with_target.csv')

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")

# Define features to EXCLUDE
exclude_cols = [
    'id', 'host_id', 'host_since', 'first_review', 'last_review',
    'value_category',  # Target variable
    'fp_score',  # Data leakage: used to create target
    'rating_normalized',  # Data leakage: part of fp_score
    'price_normalized'  # Data leakage: part of fp_score
]

# Select all other columns as features
feature_cols = [col for col in df.columns if col not in exclude_cols]

print(f"\nFeatures selected: {len(feature_cols)}")
print("\nFirst 10 features:")
for i, col in enumerate(feature_cols[:10], 1):
    print(f"  {i}. {col}")
print("  ...")

# Separate features and target
X = df[feature_cols].copy()
y = df['value_category'].copy()

print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nClass balance:")
print(y.value_counts(normalize=True) * 100)

## Step 3: Train-Test Split (80-20)

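Split the data into training (80%) and testing (20%) sets, stratifying on the target so both sets keep the same class proportions. The same cell also drops the remaining leaky features (price-derived columns and the review scores tied to the target) from both sets.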
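
As a rough illustration of what stratification does, the sketch below splits each class separately so the test set keeps the original class proportions. The `stratified_split` helper and the toy label counts are hypothetical, and this is a simplified stand-in rather than how `train_test_split(..., stratify=y)` is implemented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out about test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Imbalanced toy target: 10% / 60% / 30% (invented counts)
labels = ["Excellent"] * 10 + ["Fair"] * 60 + ["Poor"] * 30
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without stratification, a purely random 20% sample could by chance contain very few examples of the smallest class, which would make the per-class test metrics unstable.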

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y  # Maintain class distribution
)

# Remove additional leaky features (price-derived columns and review scores tied to the target)
leaky_features = [
    'price', 'price_normalized', 'price_per_person', 'price_per_bathroom',
    'price_per_bedroom', 'review_scores_rating', 'review_scores_value',
    'value_density', 'estimated_revenue_l365d'  # these also use price
]

# Drop leaky features that exist in the dataset
cols_to_drop = [col for col in leaky_features if col in X_train.columns]
X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

# Update feature_cols to match the remaining columns
feature_cols = X_train.columns.tolist()

print(f"Dropped {len(cols_to_drop)} leaky features: {cols_to_drop}")
print(f"Remaining features: {X_train.shape[1]}")

print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Testing set: {X_test.shape[0]} samples ({X_test.shape[0]/len(df)*100:.1f}%)")
print(f"Number of features: {X_train.shape[1]}")

print(f"\nClass distribution in training set:")
for category in sorted(y_train.unique()):
    count = (y_train == category).sum()
    pct = count / len(y_train) * 100
    print(f"  {category}: {count} ({pct:.2f}%)")

## Step 4: Feature Scaling (Standardization)

**Why Scale?**
- Features have different ranges (e.g., price: 0-1000, bedrooms: 1-5)
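- Regularized Logistic Regression is sensitive to feature scales: the L2 penalty acts on all coefficients equally, and the solver converges faster on standardized inputs
- Standardization transforms each feature to mean=0, std=1, using statistics computed on the training set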

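**Important:** Fit the scaler on the training data only, then use it to transform both the training and test sets. Fitting on the full dataset would leak test-set statistics into training.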
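
The fit-on-train-only rule can be sketched with plain Python on a single invented feature. The values below are made up, and `pstdev` (population standard deviation) is used because `StandardScaler` also divides by the population standard deviation.

```python
from statistics import mean, pstdev

# One toy feature, already split into train and test (invented values)
train = [100.0, 200.0, 300.0, 400.0]
test = [250.0, 1000.0]

# "Fit": learn the statistics from the training data only
mu = mean(train)
sigma = pstdev(train)  # population std (ddof=0)

# "Transform": apply the same training statistics to both sets
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print("train mean/std:", round(mean(train_scaled), 6), round(pstdev(train_scaled), 6))
print("test scaled:   ", [round(x, 3) for x in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction, while the test values generally do not, and that is expected. Note also that pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), so a check done with it reports a training std slightly above 1.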

In [None]:
# Initialize scaler
scaler = StandardScaler()

# Fit on training data and transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_cols, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_cols, index=X_test.index)

print("Features scaled successfully!")
print(f"\nScaling verification (first 3 features):")
for col in feature_cols[:3]:
    print(f"\n{col}:")
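    # ddof=0 matches the population std that StandardScaler uses, so the train std prints as 1
    print(f"  Train - Mean: {X_train_scaled[col].mean():.6f}, Std: {X_train_scaled[col].std(ddof=0):.6f}")
    print(f"  Test  - Mean: {X_test_scaled[col].mean():.6f}, Std: {X_test_scaled[col].std(ddof=0):.6f}")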

## Step 5: Train Logistic Regression Model

**Hyperparameters:**
- **C=1.0:** Inverse of regularization strength (smaller values mean stronger regularization)
- **penalty='l2':** L2 (ridge-style) penalty that discourages large coefficients and helps limit overfitting
- **max_iter=1000:** Maximum number of solver iterations allowed for convergence
- **random_state=42:** Fixed seed for reproducibility
- **class_weight='balanced':** Weights each class inversely to its frequency to compensate for class imbalance
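
For intuition about `class_weight='balanced'`, scikit-learn documents the weights as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to invented class counts; the `y_toy` labels are an assumption and do not reflect the real target distribution.

```python
from collections import Counter

# Invented, imbalanced toy target
y_toy = ["Excellent"] * 10 + ["Fair"] * 60 + ["Poor"] * 30

counts = Counter(y_toy)
n_samples = len(y_toy)
n_classes = len(counts)

# Balanced weight for each class: n_samples / (n_classes * class_count)
weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

for cls in sorted(weights):
    print(f"{cls}: count={counts[cls]}, weight={weights[cls]:.3f}")
```

With these weights every class contributes the same total weight to the loss (count times weight is identical across classes), so the rarest class is no longer dominated by the most frequent one.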

In [None]:
# Initialize and train model
lr_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
    C=1.0,
    penalty='l2',
    class_weight='balanced'
)

print("Training model...")
lr_model.fit(X_train_scaled, y_train)
print("Model training complete!")

print(f"\nModel details:")
print(f"  Classes: {lr_model.classes_}")
print(f"  Number of iterations: {lr_model.n_iter_[0]}")

## Step 6: Model Evaluation

Evaluate model performance on both training and testing sets.

In [None]:
# Make predictions
y_train_pred = lr_model.predict(X_train_scaled)
y_test_pred = lr_model.predict(X_test_scaled)

# Calculate metrics
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred, average='macro')
test_recall = recall_score(y_test, y_test_pred, average='macro')
test_f1 = f1_score(y_test, y_test_pred, average='macro')

print("=" * 60)
print("MODEL PERFORMANCE")
print("=" * 60)
print(f"Training Accuracy:   {train_acc:.4f} ({train_acc*100:.2f}%)")
print(f"Testing Accuracy:    {test_acc:.4f} ({test_acc*100:.2f}%)")
print(f"Precision (Macro):   {test_precision:.4f}")
print(f"Recall (Macro):      {test_recall:.4f}")
print(f"F1-Score (Macro):    {test_f1:.4f}")

# Check for overfitting
overfit_gap = train_acc - test_acc
print(f"\nOverfitting check:")
print(f"  Train-Test Gap: {overfit_gap:.4f} ({overfit_gap*100:.2f}%)")
if overfit_gap < 0.05:
    print("  Good generalization!")
elif overfit_gap < 0.10:
    print("  Slight overfitting")
else:
    print("  Significant overfitting")

## Step 7: Detailed Classification Report

Per-class performance breakdown showing precision, recall, and F1-score for each value category.

In [None]:
print("\n" + "=" * 60)
print("DETAILED CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(y_test, y_test_pred))

# Prediction distribution
print("\nPrediction distribution on test set:")
pred_counts = pd.Series(y_test_pred).value_counts()
for category in sorted(pred_counts.index):
    count = pred_counts[category]
    pct = count / len(y_test_pred) * 100
    print(f"  {category}: {count} ({pct:.2f}%)")

## Step 8: Feature Importance Analysis

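Analyze which features are most important for classification by examining model coefficients. Because the inputs were standardized in Step 4, coefficient magnitudes are comparable across features.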

In [None]:
# Get feature coefficients for each class. Column names come from lr_model.classes_
# so they always match the row order of coef_ (assumes three or more classes,
# where coef_ has one row per class).
coef_cols = [f"{cls}_coef" for cls in lr_model.classes_]
feature_importance = pd.DataFrame(lr_model.coef_.T, columns=coef_cols)
feature_importance.insert(0, 'feature', feature_cols)

# Calculate average absolute coefficient across classes
feature_importance['avg_abs_coef'] = feature_importance[coef_cols].abs().mean(axis=1)
feature_importance = feature_importance.sort_values('avg_abs_coef', ascending=False)

print("\n" + "=" * 60)
print("TOP 15 MOST IMPORTANT FEATURES")
print("=" * 60)
for _, row in feature_importance.head(15).iterrows():
    print(f"{row['feature']:45s} | Avg |Coef|: {row['avg_abs_coef']:.4f}")

print("\nInterpretation:")
print("  - Larger |coefficient| = more important for classification")
print("  - Positive coef = increases probability of that class")
print("  - Negative coef = decreases probability of that class")

## Step 9: Save Results and Model

Save all outputs for future use and comparison with other models.
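
When the saved scaler and model are reused later, both need to be loaded and applied in the same order as here (scale first, then predict). The sketch below shows only the pickle round trip, using a stand-in dictionary and a temporary directory; the `artifact` object and file name are hypothetical and not the notebook's actual outputs.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted object (hypothetical; the notebook pickles `scaler` and `lr_model`)
artifact = {"name": "standard_scaler", "mean_": [250.0], "scale_": [111.8]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "artifact.pkl"

    # Save: binary write mode is required for pickle
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # Load: binary read mode, returns an equal copy of the original object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Pickle files should only be loaded from trusted sources, and scikit-learn objects are best unpickled with the same library version that created them.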

In [None]:
# Create directories if they don't exist
Path('../../data/processed').mkdir(parents=True, exist_ok=True)
Path('../../models').mkdir(parents=True, exist_ok=True)

# Save scaled data
X_train_scaled.to_csv('../../data/processed/X_train_scaled.csv', index=False)
X_test_scaled.to_csv('../../data/processed/X_test_scaled.csv', index=False)
y_train.to_frame(name='value_category').to_csv('../../data/processed/y_train.csv', index=False)
y_test.to_frame(name='value_category').to_csv('../../data/processed/y_test.csv', index=False)

# Save scaler
with open('../../models/standard_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Save model
with open('../../models/logistic_regression_model.pkl', 'wb') as f:
    pickle.dump(lr_model, f)

# Save results
results_df = pd.DataFrame({
    'model': ['Logistic Regression'],
    'train_accuracy': [train_acc],
    'test_accuracy': [test_acc],
    'precision_macro': [test_precision],
    'recall_macro': [test_recall],
    'f1_macro': [test_f1]
})
results_df.to_csv('../../data/processed/logistic_regression_results.csv', index=False)

# Save predictions
predictions_df = pd.DataFrame({
    'y_true': y_test.values,
    'y_pred': y_test_pred
})
predictions_df.to_csv('../../data/processed/logistic_regression_predictions.csv', index=False)

print("ALL FILES SAVED SUCCESSFULLY!")
print("\nFiles created:")
print("   data/processed/")
print("     ├── X_train_scaled.csv")
print("     ├── X_test_scaled.csv")
print("     ├── y_train.csv")
print("     ├── y_test.csv")
print("     ├── logistic_regression_results.csv")
print("     └── logistic_regression_predictions.csv")
print("\n   models/")
print("     ├── logistic_regression_model.pkl")
print("     └── standard_scaler.pkl")