# Task 2.1: Baseline Model - Logistic Regression Implementation 

## Objective

The goal of this task is to establish a baseline classification model using Logistic Regression. This model will serve as the benchmark for comparing more complex supervised learning algorithms in subsequent tasks.





## Understanding Key Metrics

Since this is a supervised classification task, we use specific metrics to evaluate how well our model performs:

1. **Accuracy:** - Measures the overall correctness of predictions. Range is 0 to 1, where 1 indicates perfect classification.
2. **Precision (Macro):** - Average precision across all classes, measuring the proportion of correct positive predictions.
3. **Recall (Macro):** - Average recall across all classes, measuring the proportion of actual positives correctly identified.
4. **F1-Score (Macro):** - Harmonic mean of precision and recall, providing a balanced measure of model performance. Range is 0 to 1.

## Step 1: Environment Setup and Library Imports

Import all required libraries for data manipulation, modeling, and evaluation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import pickle
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Step 2: Load Clean Data from T1.5

**CRITICAL:** We load the pre-processed data from T1.5 which contains only landlord-controlled features.

**Features excluded (by T1.5):**
- **Review-based:** All review scores, number of reviews, reviews per month, etc.
- **Target leakage:** fp_score, rating_normalized, price_normalized, value_category
- **Identifiers:** id, host_id
- **Dates:** host_since, first_review, last_review

**Features included:**
- Price and accommodation features (price, bedrooms, beds, bathrooms, accommodates)
- Host characteristics (superhost, response rate, listings count)
- Location (latitude, longitude, neighbourhood, city)
- Availability metrics
- Property attributes (property_type, room_type)
- Engineered features (host_years, space_efficiency, etc.)

In [None]:
# Load the CLEAN dataset from T1.5 (landlord features only)
print("="*80)
print("="*80)

# Load pre-split data from T1.5
X_train = pd.read_csv('../../data/processed/X_train_landlord.csv')
X_test = pd.read_csv('../../data/processed/X_test_landlord.csv')
y_train = pd.read_csv('../../data/processed/y_train_landlord.csv').squeeze()
y_test = pd.read_csv('../../data/processed/y_test_landlord.csv').squeeze()

print(f"\n Data loaded successfully!")
print(f" Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f" Test set: {X_test.shape[0]} samples, {X_test.shape[1]} features")

print(f"\n Target distribution (Training):")
print(y_train.value_counts())
print(f"\n Class balance (Training):")
print(y_train.value_counts(normalize=True) * 100)

print(f"\n Target distribution (Test):")
print(y_test.value_counts())

print(f"\n Features included:")
feature_cols = X_train.columns.tolist()
print(f"  Total features: {len(feature_cols)}")
for i, col in enumerate(feature_cols[:15], 1):
    print(f"  {i}. {col}")
if len(feature_cols) > 15:
    print(f"  ... and {len(feature_cols) - 15} more")



In [None]:
# Find and remove non-numeric columns
print("Checking for non-numeric columns...")
non_numeric = X_train.select_dtypes(include=['object']).columns.tolist()
print(f"Non-numeric columns found: {non_numeric}")

if non_numeric:
    X_train = X_train.drop(columns=non_numeric)
    X_test = X_test.drop(columns=non_numeric)
    feature_cols = X_train.columns.tolist()
    print(f" Dropped {len(non_numeric)} columns. New shape: {X_train.shape}")

   
feature_cols = X_train.columns.tolist()
print(f"Final feature set has {len(feature_cols)} numeric features.")

## Step 3: Train Logistic Regression Model

**Hyperparameters:**
- **C=1.0:** Regularization strength (inverse, smaller = stronger regularization)
- **penalty='l2':** Ridge regularization (prevents overfitting)
- **max_iter=1000:** Maximum iterations for convergence
- **random_state=42:** For reproducibility
- **class_weight='balanced':** Automatically adjust weights for class imbalance

In [None]:
# Initialize and train model
lr_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
    C=1.0,
    penalty='l2',
    class_weight='balanced'
)
X_train_scaled = pd.read_csv('../../data/processed/X_train_landlord_scaled.csv')
X_test_scaled = pd.read_csv('../../data/processed/X_test_landlord_scaled.csv')


print("Training model...")
lr_model.fit(X_train_scaled, y_train)
print(" Model training complete!")

print(f"\nModel details:")
print(f"  Classes: {lr_model.classes_}")
print(f"  Number of iterations: {lr_model.n_iter_[0]}")

## Step 4: Model Evaluation

Evaluate model performance on both training and testing sets.

In [None]:
# Make predictions
y_train_pred = lr_model.predict(X_train_scaled)
y_test_pred = lr_model.predict(X_test_scaled)

# Calculate metrics
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred, average='macro')
test_recall = recall_score(y_test, y_test_pred, average='macro')
test_f1 = f1_score(y_test, y_test_pred, average='macro')

print("="*80)
print("Model Performance Summary")
print("="*80)
print(f"Training Accuracy:   {train_acc:.4f} ({train_acc*100:.2f}%)")
print(f"Testing Accuracy:    {test_acc:.4f} ({test_acc*100:.2f}%)")
print(f"Precision (Macro):   {test_precision:.4f}")
print(f"Recall (Macro):      {test_recall:.4f}")
print(f"F1-Score (Macro):    {test_f1:.4f}")

# Check for overfitting
overfit_gap = train_acc - test_acc
print(f"\nOverfitting check:")
print(f"  Train-Test Gap: {overfit_gap:.4f} ({overfit_gap*100:.2f}%)")
if overfit_gap < 0.05:
    print("  Good generalization!")
elif overfit_gap < 0.10:
    print("  Slight overfitting")
else:
    print("  Significant overfitting")

print("\n" + "="*80)
print("This model can predict value for NEW listings without reviews!")
print("="*80)

## Step 5: Detailed Classification Report

Per-class performance breakdown showing precision, recall, and F1-score for each value category.

In [None]:
print("\n" + "="*80)
print("="*80)
print(classification_report(y_test, y_test_pred))

# Prediction distribution
print("\nPrediction distribution on test set:")
pred_counts = pd.Series(y_test_pred).value_counts()
for category in sorted(pred_counts.index):
    count = pred_counts[category]
    pct = count / len(y_test_pred) * 100
    print(f"  {category}: {count} ({pct:.2f}%)")

# Actual distribution
print("\nActual distribution on test set:")
actual_counts = pd.Series(y_test).value_counts()
for category in sorted(actual_counts.index):
    count = actual_counts[category]
    pct = count / len(y_test) * 100
    print(f"  {category}: {count} ({pct:.2f}%)")

## Step 6: Feature Importance Analysis

Analyze which features are most important for classification by examining model coefficients.

In [None]:
# Get feature coefficients for each class
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'Excellent_coef': lr_model.coef_[0],
    'Fair_coef': lr_model.coef_[1],
    'Poor_coef': lr_model.coef_[2]
})

# Calculate average absolute coefficient
feature_importance['avg_abs_coef'] = feature_importance[['Excellent_coef', 'Fair_coef', 'Poor_coef']].abs().mean(axis=1)
feature_importance = feature_importance.sort_values('avg_abs_coef', ascending=False)


for i, row in feature_importance.head(20).iterrows():
    print(f"{row['feature']:50s} | Avg |Coef|: {row['avg_abs_coef']:.4f}")

print("\n Interpretation:")
print("  - Larger |coefficient| = more important for classification")
print("  - Positive coef = increases probability of that class")
print("  - Negative coef = decreases probability of that class")
print("\n Key Insight:")
print("  Features related to pricing, location, and amenities are most influential.")

## Step 7: Save Results and Model

Save all outputs for future use and comparison with other models.

In [None]:
# Create directories if they don't exist
Path('../../models').mkdir(parents=True, exist_ok=True)

# Save scaled data with _landlord suffix
X_train_scaled.to_csv('../../data/processed/X_train_landlord_scaled_lr.csv', index=False)
X_test_scaled.to_csv('../../data/processed/X_test_landlord_scaled_lr.csv', index=False)



# Save model
with open('../../models/logistic_regression_landlord.pkl', 'wb') as f:
    pickle.dump(lr_model, f)

# Save results
results_df = pd.DataFrame({
    'model': ['Logistic Regression (Landlord Features)'],
    'train_accuracy': [train_acc],
    'test_accuracy': [test_acc],
    'precision_macro': [test_precision],
    'recall_macro': [test_recall],
    'f1_macro': [test_f1],
    'num_features': [len(feature_cols)],
    'data_leakage': ['NO - Review features removed'],
    'production_ready': ['YES - Can predict for new listings']
})
results_df.to_csv('../../data/processed/logistic_regression_landlord_results.csv', index=False)

# Save predictions
predictions_df = pd.DataFrame({
    'y_true': y_test.values,
    'y_pred': y_test_pred
})
predictions_df.to_csv('../../data/processed/logistic_regression_landlord_predictions.csv', index=False)

# Save feature importance
feature_importance.to_csv('../../data/processed/logistic_regression_landlord_feature_importance.csv', index=False)

print("\nFiles created:")
print(" data/processed/")
print("     ├── X_train_landlord_scaled_lr.csv")
print("     ├── X_test_landlord_scaled_lr.csv")
print("     ├── logistic_regression_landlord_results.csv")
print("     ├── logistic_regression_landlord_predictions.csv")
print("     └── logistic_regression_landlord_feature_importance.csv")
print("\n models/")
print("     ├── logistic_regression_landlord.pkl")
print("     └── standard_scaler_landlord.pkl")


## Summary and Next Steps

### What We Accomplished:
1. **Fixed Data Leakage:** Removed all review-based features from model input
2. **Production-Ready:** Model can predict for new listings without reviews
3. **Baseline Established:** This serves as the benchmark for more complex models

###  Key Findings:
- **Features Used:** Only landlord-controlled attributes
- **Generalization:** Good (Small train-test gap)
- **Class Balance:** Model handles all three value categories fairly



