#  ML Models with NLP Integration

##  Integrating Word2Vec NLP Scores with Machine Learning Models

---

###  Project Overview

This notebook represents a **critical milestone** in our Airbnb value prediction project. We are integrating advanced NLP features (Word2Vec scores) with our existing machine learning models to improve prediction accuracy.

###  Objectives

1. **Load the enhanced dataset** with Word2Vec NLP scores
2. **Train three models** with cross-validation:
   - Logistic Regression (Baseline)
   - Random Forest (Ensemble)
   - XGBoost (Advanced Boosting)
3. **Compare performance** using rigorous cross-validation
4. **Analyze feature importance** to understand model decisions

###  What's New?

**Word2Vec NLP Score Integration:**

Based on Fatih's comprehensive NLP analysis (see `mfa_Advanced_NLP-AND-Feature_Engineering.ipynb`), we selected the **Word2Vec 100-dimension model** for the following reasons:

- **Best Cost-Performance Balance:** 51.70% accuracy, within 0.74 percentage points of BERT (52.44%), at low computational cost (2/5 vs 5/5)
- **Airbnb-Specific Vocabulary:** Trained specifically on our listing descriptions
- **Normalized Score:** Ranges from -1 (Poor Value) to +1 (Excellent Value)
- **Efficient:** 100 dimensions vs 768 for BERT

**Comparison of NLP Models:**

| Model | Accuracy | F1 Score | Computational Cost | Selected |
|-------|----------|----------|--------------------|----------|
| Baseline (TF-IDF + VADER) | 49.68% | 0.49 | Very Low (1/5) | No |
| **Word2Vec (100D)** | **51.70%** | **0.51** | **Low (2/5)** | **Yes** |
| BERT (768D) | 52.44% | 0.52 | Very High (5/5) | No |

**Why Word2Vec?**

> *"Considering the success and computational costs of 3 different NLP models, the word2vec model, obtained with 100 vector dimensions, was trained on the data using only its own features. The word2vec 100 dimension model scores the description between -1 and 1. We use this because it was the best in the calculation cost and performance equation among the results obtained with 3 different models."*  
> — Fatih 

---

###  Dataset Information

- **Total Samples:** 19,913 Airbnb listings
- **Features:** 27 landlord-controlled features + 1 NLP score
- **Target Classes:** 3 balanced classes (Poor, Fair, Excellent Value)
- **Data Source:** `final_data_with_nlp_score.csv`

---

##  Step 1: Environment Setup and Library Imports

### Libraries Used:

**Data Processing:**
- `pandas` - Data manipulation and analysis
- `numpy` - Numerical computations

**Machine Learning Models:**
- `LogisticRegression` - Linear baseline model
- `RandomForestClassifier` - Ensemble learning
- `XGBClassifier` - Gradient boosting

**Model Evaluation:**
- `cross_validate` - K-fold cross-validation
- `StratifiedKFold` - Maintains class distribution in folds
- `classification_report` - Detailed per-class metrics
- `confusion_matrix` - Error analysis

**Preprocessing:**
- `StandardScaler` - Feature normalization
- `LabelEncoder` - Target encoding

**Visualization:**
- `matplotlib` & `seaborn` - Plotting and visualization

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Model Evaluation
from sklearn.model_selection import cross_validate, StratifiedKFold, train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)


##  Step 2: Data Loading and Initial Exploration

### Data Source

We load the **final integrated dataset** that combines:
1. **Landlord-controlled features** (26 features) - from previous preprocessing
2. **Word2Vec NLP score** (1 feature) - from Fatih's NLP analysis
3. **Target variable** - value_category (Poor/Fair/Excellent Value)

### Key Features in Dataset:

**Pricing Features:**
- `price` - Listing price (scaled)
- `price_per_bedroom` - Price efficiency metric
- `price_per_bathroom` - Price efficiency metric

**Property Features:**
- `accommodates`, `bedrooms`, `beds` - Capacity metrics
- `room_type_*` - One-hot encoded room types
- `property_type_*` - Encoded property types

**Location Features:**
- `latitude`, `longitude` - Geographic coordinates
- `neighbourhood_*` - Encoded neighborhood information

**Host Features:**
- `host_is_superhost` - Host status
- `host_identity_verified` - Verification status
- `host_response_rate` - Host responsiveness

**Availability Features:**
- `availability_30/60/90/365` - Booking availability
- `instant_bookable` - Booking flexibility

**Engineered Features:**
- `space_efficiency` - Space utilization metric

**NLP Feature:**
- `w2v_score` - Word2Vec description score (-1 = Poor Value to +1 = Excellent Value)

---

In [None]:
# Load the final dataset with NLP scores
print("="*80)
print("Loading enhanced dataset with NLP integration...")
print("="*80)

df = pd.read_csv('../../data/finalized/final_data_with_nlp_score.csv')

print(f"\n Dataset loaded successfully!")
print(f"\n Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

# Display basic information
print(f"\n Column Names:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i:2d}. {col}")

# Check for missing values
missing_count = df.isnull().sum().sum()
print(f"\n Missing Values: {missing_count}")

# Display first few rows
print(f"\n First 3 rows of the dataset:")
print(df.head(3))

# Check w2v_score statistics
print(f"\n Word2Vec Score Statistics:")
print(df['w2v_score'].describe())

print(f"\n Interpretation:")
print(f"  • Mean score: {df['w2v_score'].mean():.4f} (close to 0 = balanced)")
print(f"  • Std deviation: {df['w2v_score'].std():.4f}")
print(f"  • Range: [{df['w2v_score'].min():.4f}, {df['w2v_score'].max():.4f}]")
print(f"  • Scores near +1 indicate 'Excellent Value' descriptions")
print(f"  • Scores near -1 indicate 'Poor Value' descriptions")

print("\n" + "="*80)

##  Step 3: Target Variable Analysis

### Understanding Our Target: `value_category`

The target variable represents the **value-for-money** classification of each listing:

**Class Definitions:**
- **Poor_Value:** High price relative to quality/features (encoded as 2 in Step 4)
- **Fair_Value:** Balanced price-to-quality ratio (encoded as 1)
- **Excellent_Value:** Low price relative to quality/features (encoded as 0)

**Why Class Balance Matters:**
- Imbalanced classes can bias model predictions
- We use `StratifiedKFold` to maintain class distribution in CV folds
- Logistic Regression and Random Forest use `class_weight='balanced'` to handle any imbalance; the XGBoost model in this notebook is not given class weights
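To make the `class_weight='balanced'` setting concrete, here is a minimal sketch of the weighting heuristic scikit-learn applies, `n_samples / (n_classes * count_per_class)`. The labels in `y_toy` are made up for illustration and are not the project's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): class 0 is twice as frequent as classes 1 and 2
y_toy = np.array([0, 0, 0, 0, 1, 1, 2, 2])
classes = np.unique(y_toy)

# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

# Same computation through scikit-learn's helper
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

for cls, w_manual, w_sklearn in zip(classes, manual_weights, sklearn_weights):
    print(f"class {cls}: manual={w_manual:.4f}, sklearn={w_sklearn:.4f}")
```

With classes that are already close to balanced, these weights all sit near 1, so the setting has little effect.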

---

In [None]:

# Count distribution
target_counts = df['value_category'].value_counts().sort_index()
target_pcts = df['value_category'].value_counts(normalize=True).sort_index() * 100

print(f"\n Class Distribution:")
print(f"\n{'Category':<20} {'Count':>10} {'Percentage':>12}")
print("-" * 45)
for category in sorted(df['value_category'].unique()):
    count = target_counts[category]
    pct = target_pcts[category]
    print(f"{category:<20} {count:>10,} {pct:>11.2f}%")

print(f"\n{'Total':<20} {len(df):>10,} {100.0:>11.2f}%")

# Check balance
max_pct = target_pcts.max()
min_pct = target_pcts.min()
imbalance_ratio = max_pct / min_pct

print(f"\n Balance Analysis:")
print(f"  • Imbalance Ratio: {imbalance_ratio:.2f}:1")
if imbalance_ratio < 1.5:
    print(f" Classes are well-balanced!")
elif imbalance_ratio < 3:
    print(f" Slight imbalance detected")
else:
    print(f"  Significant imbalance - will use class weights")

# Visualize distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
axes[0].bar(target_counts.index, target_counts.values, color=['#e74c3c', '#f39c12', '#27ae60'])
axes[0].set_xlabel('Value Category', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=12, fontweight='bold')
axes[0].set_title('Target Variable Distribution', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Pie chart
axes[1].pie(target_counts.values, labels=target_counts.index, autopct='%1.1f%%',
            colors=['#e74c3c', '#f39c12', '#27ae60'], startangle=90)
axes[1].set_title('Class Proportion', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "="*80)

##  Step 4: Feature-Target Separation and Encoding

### Data Preparation Steps:

1. **Separate Features (X) and Target (y)**
   - X: All columns except `value_category`
   - y: Only `value_category`

2. **Handle Boolean Columns**
   - Convert boolean features to integers (0/1)
   - Required for model compatibility

3. **Encode Target Variable**
   - Convert text labels to numeric codes
   - Excellent_Value → 0
   - Fair_Value → 1
   - Poor_Value → 2

4. **Train-Test Split**
   - 80% training, 20% testing
   - Stratified split maintains class balance
   - Random state = 42 for reproducibility
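The encoding order above follows from `LabelEncoder` sorting class names alphabetically. The sketch below checks that on a small made-up label array (30 rows, 10 per class, an assumption for illustration) and shows that a stratified 80/20 split keeps the class counts equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy labels (illustrative only): 10 rows per class
labels = np.array(["Poor_Value", "Fair_Value", "Excellent_Value"] * 10)

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted alphabetically, so the index of each name is its numeric code
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# Stratified 80/20 split on the encoded labels
idx_train, idx_test, y_tr, y_te = train_test_split(
    np.arange(len(codes)), codes, test_size=0.2, random_state=42, stratify=codes
)
print("train counts:", np.bincount(y_tr), "test counts:", np.bincount(y_te))
```

Because the numeric codes follow alphabetical order rather than value order, `label_encoder.inverse_transform` is the safe way to turn predictions back into class names.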

---

In [None]:


# Separate X and y
X = df.drop('value_category', axis=1)
y = df['value_category']

print(f"\n Features (X): {X.shape}")
print(f" Target (y): {y.shape}")

# Convert boolean columns to int
bool_cols = X.select_dtypes(include=['bool']).columns.tolist()
if bool_cols:
    print(f"\n Converting {len(bool_cols)} boolean columns to integers:")
    for col in bool_cols:
        print(f"  • {col}")
        X[col] = X[col].astype(int)

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(f"\n Target Encoding Mapping:")
for idx, class_name in enumerate(label_encoder.classes_):
    print(f"  {class_name} → {idx}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_encoded
)

print(f"\n Data Split:")
print(f"  Training set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"  Testing set:  {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# Verify stratification
print(f"\n Class distribution maintained in splits:")
train_dist = pd.Series(y_train).value_counts(normalize=True).sort_index() * 100
test_dist = pd.Series(y_test).value_counts(normalize=True).sort_index() * 100

print(f"\n{'Class':<15} {'Train %':>10} {'Test %':>10}")
print("-" * 38)
for idx in range(len(label_encoder.classes_)):
    class_name = label_encoder.classes_[idx]
    print(f"{class_name:<15} {train_dist[idx]:>9.2f}% {test_dist[idx]:>9.2f}%")

print("\n" + "="*80)

##  Step 5: Feature Scaling

### Why would we need to scale features?

**Problems:**
- Features have different scales (e.g., price: 0-1000, bedrooms: 1-10)
- Models like Logistic Regression are sensitive to feature scales
- Large-scale features can dominate the model

**Solution: StandardScaler**
- Transforms features to have mean=0 and std=1
- Formula: `z = (x - μ) / σ`
- Preserves the shape of the distribution

**Important notes:**
-  Fit scaler on training data only
-  Transform both train and test using the same scaler
-  Never fit scaler on the test data (causes data leakage)
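As a quick check of the formula and the fit-on-train-only rule, the sketch below scales synthetic data (two made-up features with different scales) and compares `StandardScaler` against a manual `z = (x - μ) / σ` that uses the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales (illustrative only)
rng = np.random.default_rng(42)
X_train_toy = rng.normal(loc=[100.0, 2.0], scale=[50.0, 1.0], size=(200, 2))
X_test_toy = rng.normal(loc=[100.0, 2.0], scale=[50.0, 1.0], size=(50, 2))

# Fit on the training data only
toy_scaler = StandardScaler().fit(X_train_toy)

# Manual z-score using TRAINING statistics; StandardScaler uses the population std (ddof=0)
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)
X_test_manual = (X_test_toy - mu) / sigma

# The test set is transformed with the training statistics, never refit
X_test_scaled = toy_scaler.transform(X_test_toy)
print("manual matches StandardScaler:", np.allclose(X_test_manual, X_test_scaled))
```

Note that pandas' `.std()` defaults to the sample standard deviation (ddof=1), so a pandas check on scaled training data gives a value very slightly above 1 rather than exactly 1.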

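**Which models need scaling?**
- Logistic Regression: benefits directly, because regularization and solver convergence depend on feature scale
- Random Forest: not required, because tree splits are unaffected by monotonic rescaling of a feature
- XGBoost: not required, for the same reason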

---

In [None]:


# Initialize scaler
scaler = StandardScaler()

# Fit on training data and transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print(f"\n Features scaled successfully!")
print(f"\n Scaled Training Data Statistics:")
print(f"  Mean: {X_train_scaled.mean().mean():.6f} (should be ≈ 0)")
print(f"  Std:  {X_train_scaled.std().mean():.6f} (should be ≈ 1)")

# Show example of scaling effect
print(f"\n Example: 'price' feature before and after scaling:")
print(f"  Before - Mean: {X_train['price'].mean():.4f}, Std: {X_train['price'].std():.4f}")
print(f"  After  - Mean: {X_train_scaled['price'].mean():.4f}, Std: {X_train_scaled['price'].std():.4f}")

print(f"\n Example: 'w2v_score' feature before and after scaling:")
print(f"  Before - Mean: {X_train['w2v_score'].mean():.4f}, Std: {X_train['w2v_score'].std():.4f}")
print(f"  After  - Mean: {X_train_scaled['w2v_score'].mean():.4f}, Std: {X_train_scaled['w2v_score'].std():.4f}")

print("\n" + "="*80)

##  Step 6: Cross-Validation Strategy

### Why do we need Cross-Validation?

**Problem with Single Train-Test Split:**
- Results depend on which samples end up in train vs test
- May get lucky/unlucky with the split
- Less reliable performance estimate

**Solution: K-Fold Cross-Validation**
- Split data into K folds (we use K=5)
- Train K times, each time using different fold as test set
- Average results across all folds
- More robust and reliable performance estimate
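One caveat when combining scaling with cross-validation: if the scaler is fit once on the whole training split (as in Step 5) and that scaled data is then cross-validated, each validation fold has already contributed to the scaling statistics. The effect is usually small for standardization, but a `Pipeline` avoids it by refitting the scaler inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the project's dataset; it is one possible arrangement, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# The scaler is refit on the training part of each fold, then applied to that fold's validation part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

demo_results = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(demo_results["test_score"].round(3))
```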

### Our Strategy: 5-Fold Stratified Cross-Validation

```
Fold 1: [Test] [Train] [Train] [Train] [Train]
Fold 2: [Train] [Test] [Train] [Train] [Train]
Fold 3: [Train] [Train] [Test] [Train] [Train]
Fold 4: [Train] [Train] [Train] [Test] [Train]
Fold 5: [Train] [Train] [Train] [Train] [Test]
```

**Stratified:** Each fold maintains the same class distribution as the original dataset
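To see what stratification does in practice, the following sketch splits a deliberately imbalanced toy label vector (15/10/5 samples, chosen only to make the effect visible) and counts the classes in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (illustrative only): 15 of class 0, 10 of class 1, 5 of class 2
y_toy = np.array([0] * 15 + [1] * 10 + [2] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count each class in every validation fold
fold_counts = [
    np.bincount(y_toy[test_idx], minlength=3).tolist()
    for _, test_idx in skf.split(X_toy, y_toy)
]
for i, counts in enumerate(fold_counts, 1):
    print(f"Fold {i}: class counts in validation part = {counts}")
```

Every validation fold keeps the 3:2:1 ratio of the full label vector, which a plain `KFold` would not guarantee.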

### Metrics We'll Track:

1. **Accuracy** - Overall correctness
2. **Precision (Macro)** - Average precision across classes
3. **Recall (Macro)** - Average recall across classes
4. **F1-Score (Macro)** - Harmonic mean of precision and recall
5. **Training Time** - Computational efficiency
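Macro averaging computes each metric per class and then takes the unweighted mean, so every class counts equally regardless of its size. A small sketch with made-up predictions illustrates this:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions for three classes (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

# average=None returns one score per class
per_class_precision = precision_score(y_true, y_pred, average=None)
per_class_recall = recall_score(y_true, y_pred, average=None)
per_class_f1 = f1_score(y_true, y_pred, average=None)

# average='macro' is the unweighted mean of the per-class scores
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print("per-class precision:", per_class_precision.round(4))
print("macro precision:    ", round(macro_precision, 4))
print("macro recall:       ", round(macro_recall, 4))
print("macro F1:           ", round(macro_f1, 4))
```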

---

In [None]:


# Define cross-validation strategy
cv_folds = 5
cv_strategy = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)

print(f"\n Cross-Validation Configuration:")
print(f"  • Strategy: Stratified K-Fold")
print(f"  • Number of Folds: {cv_folds}")
print(f"  • Shuffle: Yes")
print(f"  • Random State: 42 (for reproducibility)")

# Define scoring metrics
scoring_metrics = {
    'accuracy': 'accuracy',
    'precision': 'precision_macro',
    'recall': 'recall_macro',
    'f1': 'f1_macro'
}

print(f"\n Evaluation Metrics:")
for metric_name, metric_func in scoring_metrics.items():
    print(f"  • {metric_name.capitalize()}: {metric_func}")

print(f"\n Why These Metrics?")
print(f"  • Accuracy: Overall performance")
print(f"  • Precision: How many predicted positives are actually positive")
print(f"  • Recall: How many actual positives we correctly identified")
print(f"  • F1-Score: Balance between precision and recall")
print(f"  • Macro averaging: Treats all classes equally (important for balanced evaluation)")

print("\n" + "="*80)

##  Step 7: Model 1 - Logistic Regression (Baseline)

### Model Overview

**Logistic Regression** is our baseline model - simple, interpretable, and fast.

### How It Works:

1. **Linear Combination:** Calculates weighted sum of features
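2. **Sigmoid Function:** Converts a score to a probability (0 to 1) in the binary case
3. **Multi-class Extension:** Uses One-vs-Rest or Softmax for 3 classes; with the default `lbfgs` solver, recent scikit-learn versions fit a multinomial (softmax) model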
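The steps above can be sketched in a few lines of NumPy. The weight matrix `W` and bias `b` below are random placeholders rather than fitted coefficients; the point is only to show how linear scores become class probabilities through softmax.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps the exponentials numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(4, 3))  # 4 samples, 3 features (illustrative only)
W = rng.normal(size=(3, 3))      # hypothetical weights: features x classes
b = np.zeros(3)                  # hypothetical bias, one per class

scores = X_toy @ W + b           # step 1: linear combination per class
probs = softmax(scores)          # steps 2-3: probabilities over the 3 classes
pred = probs.argmax(axis=1)      # predicted class = highest probability

print(probs.round(3))
print("predicted classes:", pred)
```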

### Hyperparameters:

- **C=1.0** - Regularization strength (inverse)
  - Smaller C = stronger regularization = simpler model
  - Larger C = weaker regularization = more complex model
  
- **penalty='l2'** - Ridge regularization
  - Prevents overfitting by penalizing large coefficients
  - L2 shrinks coefficients but doesn't eliminate them
  
- **max_iter=1000** - Maximum iterations for convergence
  - Ensures the optimization algorithm has enough time to converge
  
- **class_weight='balanced'** - Automatic class balancing
  - Adjusts weights inversely proportional to class frequencies
  - Prevents bias toward majority class
  
- **random_state=42** - Reproducibility

### Advantages:
-  Fast training and prediction
-  Interpretable coefficients
-  Works well with scaled features
-  Low risk of overfitting

### Disadvantages:
-  Assumes linear relationships
-  May underfit complex patterns
-  Sensitive to feature scaling

---

In [None]:


# Initialize model
lr_model = LogisticRegression(
    C=1.0,
    penalty='l2',
    max_iter=1000,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

print(f"\n Model Configuration:")
print(f"  • Algorithm: Logistic Regression")
print(f"  • Regularization: L2 (Ridge)")
print(f"  • C parameter: 1.0")
print(f"  • Class weights: Balanced")
print(f"  • Max iterations: 1000")

# Perform cross-validation
print(f"\n Running 5-Fold Cross-Validation...")

lr_cv_results = cross_validate(
    lr_model,
    X_train_scaled,
    y_train,
    cv=cv_strategy,
    scoring=scoring_metrics,
    return_train_score=True,
    n_jobs=-1
)

# Calculate mean and std for each metric
print(f"\n Cross-Validation Results:")
print(f"\n{'Metric':<20} {'Mean':>10} {'Std':>10} {'Min':>10} {'Max':>10}")
print("-" * 62)

for metric in ['accuracy', 'precision', 'recall', 'f1']:
    test_scores = lr_cv_results[f'test_{metric}']
    mean_score = test_scores.mean()
    std_score = test_scores.std()
    min_score = test_scores.min()
    max_score = test_scores.max()
    
    print(f"{metric.capitalize():<20} {mean_score:>10.4f} {std_score:>10.4f} {min_score:>10.4f} {max_score:>10.4f}")

# Training time
avg_fit_time = lr_cv_results['fit_time'].mean()
avg_score_time = lr_cv_results['score_time'].mean()

print(f"\n Computational Performance:")
print(f"  • Average training time per fold: {avg_fit_time:.3f} seconds")
print(f"  • Average scoring time per fold: {avg_score_time:.3f} seconds")
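# Folds run in parallel (n_jobs=-1), so this sum is not wall-clock time
print(f"  • Summed fit + score time over {cv_folds} folds: {(avg_fit_time + avg_score_time) * cv_folds:.3f} seconds")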

# Train final model on full training set
print(f"\n Training final model on full training set...")
lr_model.fit(X_train_scaled, y_train)

# Evaluate on test set
y_pred_lr = lr_model.predict(X_test_scaled)
test_accuracy_lr = accuracy_score(y_test, y_pred_lr)
test_f1_lr = f1_score(y_test, y_pred_lr, average='macro')

print(f"\n Final Test Set Performance:")
print(f"  • Test Accuracy: {test_accuracy_lr:.4f} ({test_accuracy_lr*100:.2f}%)")
print(f"  • Test F1-Score: {test_f1_lr:.4f}")

# Store results
lr_results = {
    'model_name': 'Logistic Regression',
    'cv_accuracy_mean': lr_cv_results['test_accuracy'].mean(),
    'cv_accuracy_std': lr_cv_results['test_accuracy'].std(),
    'cv_f1_mean': lr_cv_results['test_f1'].mean(),
    'cv_f1_std': lr_cv_results['test_f1'].std(),
    'test_accuracy': test_accuracy_lr,
    'test_f1': test_f1_lr,
    'avg_fit_time': avg_fit_time
}

print(f"\n Logistic Regression training has been completed!")
print("\n" + "="*80)

##  Step 8: Model 2 - Random Forest 

### Model Overview

**Random Forest** is an ensemble method that combines multiple decision trees.

### How It Works:

1. **Bootstrap Sampling:** Create multiple random subsets of training data
2. **Build Trees:** Train a decision tree on each subset
3. **Random Features:** Each split considers only a random subset of features
4. **Voting:** Final prediction is the majority vote from all trees
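The four steps above can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally; in particular, scikit-learn's forest averages the trees' class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (assumption, stands in for the real features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

rng = np.random.default_rng(42)
trees = []
for seed in range(25):
    # Step 1: bootstrap sample (draw rows with replacement)
    boot_idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Steps 2-3: fit a tree that considers a random feature subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_demo[boot_idx], y_demo[boot_idx]))

# Step 4: majority vote across trees for every sample
all_votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_votes.T])

print("training accuracy of the hand-built ensemble:", (forest_pred == y_demo).mean().round(3))
```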

### Hyperparameters:

- **n_estimators=100** - Number of trees in the forest
  - More trees = better performance but slower training
  - 100 is a good balance for our dataset size
  
- **max_depth=20** - Maximum depth of each tree
  - Controls model complexity
  - Prevents individual trees from overfitting
  
- **min_samples_split=10** - Minimum samples to split a node
  - Higher values prevent overfitting
  - Ensures splits are statistically meaningful
  
- **min_samples_leaf=4** - Minimum samples in leaf nodes
  - Prevents creating leaves with very few samples
  - Improves generalization
  
- **class_weight='balanced'** - Automatic class balancing
  
- **random_state=42** - Reproducibility
  
- **n_jobs=-1** - Use all CPU cores for parallel training

### Advantages:
-  Handles non-linear relationships
-  Robust to outliers
-  Provides feature importance
-  Less prone to overfitting than single trees
-  Works with unscaled features

### Disadvantages:
-  Slower than logistic regression
-  Less interpretable
-  Larger model size

---

In [None]:


# Initialize model
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    min_samples_split=10,
    min_samples_leaf=4,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

print(f"\n Model Configuration:")
print(f"  • Algorithm: Random Forest")
print(f"  • Number of trees: 100")
print(f"  • Max depth: 20")
print(f"  • Min samples split: 10")
print(f"  • Min samples leaf: 4")
print(f"  • Class weights: Balanced")

# Perform cross-validation
print(f"\n Running 5-Fold Cross-Validation...")

rf_cv_results = cross_validate(
    rf_model,
    X_train_scaled,
    y_train,
    cv=cv_strategy,
    scoring=scoring_metrics,
    return_train_score=True,
    n_jobs=-1
)

# Calculate mean and std for each metric
print(f"\n Cross-Validation Results:")
print(f"\n{'Metric':<20} {'Mean':>10} {'Std':>10} {'Min':>10} {'Max':>10}")
print("-" * 62)

for metric in ['accuracy', 'precision', 'recall', 'f1']:
    test_scores = rf_cv_results[f'test_{metric}']
    mean_score = test_scores.mean()
    std_score = test_scores.std()
    min_score = test_scores.min()
    max_score = test_scores.max()
    
    print(f"{metric.capitalize():<20} {mean_score:>10.4f} {std_score:>10.4f} {min_score:>10.4f} {max_score:>10.4f}")

# Training time
avg_fit_time = rf_cv_results['fit_time'].mean()
avg_score_time = rf_cv_results['score_time'].mean()

print(f"\n Computational Performance:")
print(f"  • Average training time per fold: {avg_fit_time:.3f} seconds")
print(f"  • Average scoring time per fold: {avg_score_time:.3f} seconds")
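# Folds run in parallel (n_jobs=-1), so this sum is not wall-clock time
print(f"  • Summed fit + score time over {cv_folds} folds: {(avg_fit_time + avg_score_time) * cv_folds:.3f} seconds")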

# Train final model on full training set
print(f"\n Training final model on full training set...")
rf_model.fit(X_train_scaled, y_train)

# Evaluate on test set
y_pred_rf = rf_model.predict(X_test_scaled)
test_accuracy_rf = accuracy_score(y_test, y_pred_rf)
test_f1_rf = f1_score(y_test, y_pred_rf, average='macro')

print(f"\n Final Test Set Performance:")
print(f"  • Test Accuracy: {test_accuracy_rf:.4f} ({test_accuracy_rf*100:.2f}%)")
print(f"  • Test F1-Score: {test_f1_rf:.4f}")

# Store results
rf_results = {
    'model_name': 'Random Forest',
    'cv_accuracy_mean': rf_cv_results['test_accuracy'].mean(),
    'cv_accuracy_std': rf_cv_results['test_accuracy'].std(),
    'cv_f1_mean': rf_cv_results['test_f1'].mean(),
    'cv_f1_std': rf_cv_results['test_f1'].std(),
    'test_accuracy': test_accuracy_rf,
    'test_f1': test_f1_rf,
    'avg_fit_time': avg_fit_time
}

print(f"\n Random Forest training has been completed!")
print("\n" + "="*80)

##  Step 9: Model 3 - XGBoost 

### Model Overview

**XGBoost** (Extreme Gradient Boosting) is a powerful gradient boosting algorithm.

### How It Works:

1. **Sequential Learning:** Builds trees one at a time
2. **Error Correction:** Each new tree focuses on correcting previous errors
3. **Gradient Descent:** Uses gradients to minimize loss function
4. **Regularization:** Built-in L1/L2 regularization prevents overfitting
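The sequential error-correction idea can be illustrated with a bare-bones gradient boosting loop for regression under squared loss. This is a conceptual sketch on synthetic data only: XGBoost itself additionally uses second-order gradient information, explicit regularization terms, and a multi-class objective, none of which appear here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
losses = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # Residuals are the negative gradient of the squared loss w.r.t. the current prediction
    residuals = y_demo - prediction
    # Each new shallow tree is fit to the errors left by the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Take a small step in the direction the tree suggests
    prediction += learning_rate * tree.predict(X_demo)
    losses.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: start={losses[0]:.4f}, after 50 rounds={losses[-1]:.4f}")
```

A smaller `learning_rate` makes each step more conservative, which is why it is usually paired with a larger number of boosting rounds.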

### XGBoost vs Random Forest:

| Aspect | Random Forest | XGBoost |
|--------|--------------|----------|
| Tree Building | Parallel (independent) | Sequential (corrective) |
| Learning | Bagging (averaging) | Boosting (error correction) |
| Overfitting Risk | Lower | Higher (needs tuning) |
| Training Speed | Faster | Slower |
| Accuracy | Good | Often Better |

### Hyperparameters:

- **learning_rate=0.1** - Step size for each tree
  - Smaller = more conservative, needs more trees
  - Larger = more aggressive, may overfit
  
- **max_depth=6** - Maximum depth of each tree
  - Controls complexity of individual trees
  - Deeper trees can capture more complex patterns
  
- **n_estimators=200** - Number of boosting rounds
  - More rounds = better fit but risk of overfitting
  - 200 is a good balance
  
- **subsample=0.8** - Fraction of samples for each tree
  - Adds randomness to prevent overfitting
  - 0.8 means use 80% of data for each tree
  
- **colsample_bytree=0.8** - Fraction of features for each tree
  - Similar to Random Forest's feature randomness
  - Prevents reliance on single features
  
- **objective='multi:softmax'** - Multi-class classification
  
- **eval_metric='mlogloss'** - Multi-class log loss
  
- **random_state=42** - Reproducibility
  
- **n_jobs=-1** - Parallel processing

### Advantages:
-  Often highest accuracy
-  Handles complex patterns
-  Built-in regularization
-  Feature importance
-  Handles missing values

### Disadvantages:
- Slowest training time
- More hyperparameters to tune
- Risk of overfitting if not tuned properly
- Less interpretable

---

In [None]:
# Model 3: XGBoost with Cross-Validation

# Initialize model
xgb_model = XGBClassifier(
    learning_rate=0.1,
    max_depth=6,
    n_estimators=200,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',
    eval_metric='mlogloss',
    random_state=42,
    n_jobs=-1
)

print(f"\n Model Configuration:")
print(f"  • Algorithm: XGBoost (Gradient Boosting)")
print(f"  • Learning rate: 0.1")
print(f"  • Max depth: 6")
print(f"  • Number of estimators: 200")
print(f"  • Subsample ratio: 0.8")
print(f"  • Column sample ratio: 0.8")
print(f"  • Objective: Multi-class softmax")

# Perform cross-validation
print(f"\n Running 5-Fold Cross-Validation...")

xgb_cv_results = cross_validate(
    xgb_model,
    X_train_scaled,
    y_train,
    cv=cv_strategy,
    scoring=scoring_metrics,
    return_train_score=True,
    n_jobs=-1
)

# Calculate mean and std for each metric
print(f"\n Cross-Validation Results:")
print(f"\n{'Metric':<20} {'Mean':>10} {'Std':>10} {'Min':>10} {'Max':>10}")
print("-" * 62)

for metric in ['accuracy', 'precision', 'recall', 'f1']:
    test_scores = xgb_cv_results[f'test_{metric}']
    mean_score = test_scores.mean()
    std_score = test_scores.std()
    min_score = test_scores.min()
    max_score = test_scores.max()
    
    print(f"{metric.capitalize():<20} {mean_score:>10.4f} {std_score:>10.4f} {min_score:>10.4f} {max_score:>10.4f}")

# Training time
avg_fit_time = xgb_cv_results['fit_time'].mean()
avg_score_time = xgb_cv_results['score_time'].mean()

print(f"\n Computational Performance:")
print(f"  • Average training time per fold: {avg_fit_time:.3f} seconds")
print(f"  • Average scoring time per fold: {avg_score_time:.3f} seconds")
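# Folds run in parallel (n_jobs=-1), so this sum is not wall-clock time
print(f"  • Summed fit + score time over {cv_folds} folds: {(avg_fit_time + avg_score_time) * cv_folds:.3f} seconds")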

# Train final model on full training set
print(f"\n Training final model on full training set...")
xgb_model.fit(X_train_scaled, y_train)

# Evaluate on test set
y_pred_xgb = xgb_model.predict(X_test_scaled)
test_accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
test_f1_xgb = f1_score(y_test, y_pred_xgb, average='macro')

print(f"\n Final Test Set Performance:")
print(f"  • Test Accuracy: {test_accuracy_xgb:.4f} ({test_accuracy_xgb*100:.2f}%)")
print(f"  • Test F1-Score: {test_f1_xgb:.4f}")

# Store results
xgb_results = {
    'model_name': 'XGBoost',
    'cv_accuracy_mean': xgb_cv_results['test_accuracy'].mean(),
    'cv_accuracy_std': xgb_cv_results['test_accuracy'].std(),
    'cv_f1_mean': xgb_cv_results['test_f1'].mean(),
    'cv_f1_std': xgb_cv_results['test_f1'].std(),
    'test_accuracy': test_accuracy_xgb,
    'test_f1': test_f1_xgb,
    'avg_fit_time': avg_fit_time
}

print(f"\n XGBoost training has been completed!")
print("\n" + "="*80)

##  Step 10: Comprehensive Model Comparison

### Comparison Criteria

We compare all three models across multiple dimensions:

1. **Cross-Validation Performance** - Most reliable metric
2. **Test Set Performance** - Final validation
3. **Computational Efficiency** - Training time
4. **Stability** - Standard deviation across folds

### What to Look For:

- **Best CV Accuracy** - Most reliable predictor of real-world performance
- **Lowest Std** - Most stable and consistent model
- **Test vs CV Agreement** - Models should perform similarly on both
- **Speed vs Accuracy Trade-off** - Is extra training time worth the accuracy gain?
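One further check that fits these criteria is the gap between training-fold and validation-fold scores, which `cross_validate` reports when `return_train_score=True`. The sketch below runs on synthetic data with a small Random Forest, both assumptions chosen for illustration; the same calculation could be applied to any of the `*_cv_results` dictionaries.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic 3-class data (assumption, stands in for the real features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
res = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="accuracy", return_train_score=True,
)

train_mean = res["train_score"].mean()
val_mean = res["test_score"].mean()
gap = train_mean - val_mean        # a large gap suggests overfitting
val_std = res["test_score"].std()  # a small std suggests a stable model

print(f"train={train_mean:.3f}  validation={val_mean:.3f}  gap={gap:.3f}  std={val_std:.3f}")
```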

### Expected Outcomes:

Based on previous experiments:
- **Logistic Regression:** Fast, decent baseline (~95% accuracy)
- **Random Forest:** Good balance, robust (~95% accuracy)
- **XGBoost:** Highest accuracy, but slowest (~95-96% accuracy)

---

In [None]:


# Create comparison DataFrame
comparison_df = pd.DataFrame([lr_results, rf_results, xgb_results])

# Display comparison table
print(f"\n Cross-Validation Performance (5-Fold):")
print(f"\n{'Model':<20} {'CV Accuracy':>15} {'CV F1-Score':>15} {'Std (Acc)':>12}")
print("-" * 65)

for _, row in comparison_df.iterrows():
    print(f"{row['model_name']:<20} "
          f"{row['cv_accuracy_mean']:>14.4f} "
          f"{row['cv_f1_mean']:>14.4f} "
          f"{row['cv_accuracy_std']:>11.4f}")

print(f"\n Final Test Set Performance:")
print(f"\n{'Model':<20} {'Test Accuracy':>15} {'Test F1-Score':>15}")
print("-" * 53)

for _, row in comparison_df.iterrows():
    print(f"{row['model_name']:<20} "
          f"{row['test_accuracy']:>14.4f} "
          f"{row['test_f1']:>14.4f}")

print(f"\n Computational Efficiency:")
print(f"\n{'Model':<20} {'Avg Training Time':>20} {'Relative Time':>15}")
print("-" * 58)

# Relative time = fit time divided by the fastest model's fit time (1.00x = fastest)
min_time = comparison_df['avg_fit_time'].min()
for _, row in comparison_df.iterrows():
    relative_time = row['avg_fit_time'] / min_time
    print(f"{row['model_name']:<20} "
          f"{row['avg_fit_time']:>18.3f}s "
          f"{relative_time:>14.2f}x")

# Find best model
best_cv_idx = comparison_df['cv_accuracy_mean'].idxmax()
best_test_idx = comparison_df['test_accuracy'].idxmax()
fastest_idx = comparison_df['avg_fit_time'].idxmin()

print(f"\n Model Rankings:")
print(f"  • Best CV Accuracy: {comparison_df.loc[best_cv_idx, 'model_name']} "
      f"({comparison_df.loc[best_cv_idx, 'cv_accuracy_mean']:.4f})")
print(f"  • Best Test Accuracy: {comparison_df.loc[best_test_idx, 'model_name']} "
      f"({comparison_df.loc[best_test_idx, 'test_accuracy']:.4f})")
print(f"  • Fastest Training: {comparison_df.loc[fastest_idx, 'model_name']} "
      f"({comparison_df.loc[fastest_idx, 'avg_fit_time']:.3f}s)")

# Calculate improvement from baseline
baseline_acc = comparison_df.loc[0, 'cv_accuracy_mean']
print(f"\n Improvement Over Baseline (Logistic Regression):")
for idx, row in comparison_df.iterrows():
    if idx > 0:
        # A difference between two accuracies is in percentage points, not percent
        improvement = (row['cv_accuracy_mean'] - baseline_acc) * 100
        print(f"  • {row['model_name']}: {improvement:+.2f} percentage points of CV accuracy")



## Step 11: Performance Visualization

Visual comparison of model performance across different metrics.

### Visualizations:

1. **Accuracy Comparison** - Bar chart with error bars (std)
2. **F1-Score Comparison** - Shows balanced performance
3. **Training Time Comparison** - Computational cost
4. **Accuracy vs Speed Trade-off** - Scatter plot

---

In [None]:
# Visualization of Model Comparison Results

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. CV Accuracy Comparison with error bars
ax1 = axes[0, 0]
models = comparison_df['model_name']
cv_acc = comparison_df['cv_accuracy_mean']
cv_std = comparison_df['cv_accuracy_std']
colors = ['#3498db', '#2ecc71', '#e74c3c']

bars1 = ax1.bar(models, cv_acc, yerr=cv_std, capsize=10, color=colors, alpha=0.7, edgecolor='black')
ax1.set_ylabel('Accuracy', fontsize=12, fontweight='bold')
ax1.set_title('Cross-Validation Accuracy (with Std Dev)', fontsize=14, fontweight='bold')
ax1.set_ylim([0.90, 1.0])
ax1.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, acc, std in zip(bars1, cv_acc, cv_std):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + std + 0.002,
             f'{acc:.4f}\n±{std:.4f}',
             ha='center', va='bottom', fontsize=10, fontweight='bold')

# 2. F1-Score Comparison
ax2 = axes[0, 1]
cv_f1 = comparison_df['cv_f1_mean']
test_f1 = comparison_df['test_f1']

x = np.arange(len(models))
width = 0.35

bars2a = ax2.bar(x - width/2, cv_f1, width, label='CV F1', color='#3498db', alpha=0.7, edgecolor='black')
bars2b = ax2.bar(x + width/2, test_f1, width, label='Test F1', color='#e74c3c', alpha=0.7, edgecolor='black')

ax2.set_ylabel('F1-Score', fontsize=12, fontweight='bold')
ax2.set_title('F1-Score: CV vs Test Set', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(models)
ax2.set_ylim([0.90, 1.0])
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# 3. Training Time Comparison
ax3 = axes[1, 0]
train_times = comparison_df['avg_fit_time']

bars3 = ax3.bar(models, train_times, color=colors, alpha=0.7, edgecolor='black')
ax3.set_ylabel('Time (seconds)', fontsize=12, fontweight='bold')
ax3.set_title('Average Training Time per Fold', fontsize=14, fontweight='bold')
ax3.grid(axis='y', alpha=0.3)

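# Add value labels (loop variable renamed so it does not shadow the stdlib `time` module)
label_offset = max(train_times) * 0.02  # scale the gap to the data instead of a fixed 0.05 s
for bar, fit_time in zip(bars3, train_times):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + label_offset,
             f'{fit_time:.2f}s',
             ha='center', va='bottom', fontsize=10, fontweight='bold')
# Set the y-limit once, after the loop, leaving headroom for the labels
ax3.set_ylim(0, max(train_times) * 1.15)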

# 4. Accuracy vs Speed Trade-off
ax4 = axes[1, 1]

for idx, row in comparison_df.iterrows():
    ax4.scatter(row['avg_fit_time'], row['cv_accuracy_mean'], 
               s=300, color=colors[idx], alpha=0.7, edgecolor='black', linewidth=2)
    ax4.annotate(row['model_name'], 
                (row['avg_fit_time'], row['cv_accuracy_mean']),
                xytext=(10, 10), textcoords='offset points',
                fontsize=11, fontweight='bold',
                bbox=dict(boxstyle='round,pad=0.5', facecolor=colors[idx], alpha=0.3))

ax4.set_xlabel('Training Time (seconds)', fontsize=12, fontweight='bold')
ax4.set_ylabel('CV Accuracy', fontsize=12, fontweight='bold')
ax4.set_title('Accuracy vs Training Time Trade-off', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()



## Step 12: Feature Importance Analysis

### Understanding Feature Importance

Feature importance tells us which features contribute most to the model's predictions.

**For Tree-Based Models (Random Forest & XGBoost):**
- Based on how much each feature reduces impurity/error
- Higher importance = more useful for making predictions

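**For Logistic Regression:**
- Based on absolute coefficient values (averaged across classes in the multi-class case)
- Larger coefficients = stronger influence on predictions, provided the features are on a comparable scale (e.g., standardized)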
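
As a minimal sketch of the coefficient-based ranking, the snippet below averages absolute coefficients over classes and ranks the features. The coefficient matrix is hypothetical (invented for illustration, not taken from the trained `lr_model`), and it assumes the features were already scaled so that magnitudes are comparable.

```python
# Hypothetical coefficient matrix for a 3-class logistic regression:
# one row per class, one column per feature (illustrative values only,
# assuming standardized features).
feature_names = ["price", "bedrooms", "w2v_score"]
coef = [
    [-2.0,  0.4,  0.05],   # class 0
    [ 0.1, -0.2, -0.02],   # class 1
    [ 1.9, -0.3,  0.03],   # class 2
]

# Mean absolute coefficient per feature, averaged over the classes.
n_classes = len(coef)
mean_abs = [
    sum(abs(row[j]) for row in coef) / n_classes
    for j in range(len(feature_names))
]

# Rank features from largest to smallest mean absolute coefficient.
ranking = sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)
rank_of = {name: position for position, (name, _) in enumerate(ranking, start=1)}

for name, score in ranking:
    print(f"{rank_of[name]:<4} {name:<12} {score:.4f}")
```

Taking the absolute value before averaging matters: a feature that pushes strongly toward one class and strongly away from another would otherwise average out to roughly zero.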

### Why This Matters:

1. **Model Interpretability** - Understand what drives predictions
2. **Feature Selection** - Identify which features could be removed
3. **Business Insights** - Learn what makes a listing valuable
4. **Validation** - Ensure model uses sensible features

### Expected Important Features:

- **price** - Direct impact on value perception
- **price_per_bedroom/bathroom** - Value efficiency metrics
- **bedrooms, beds** - Capacity features
- **location features** - Geographic value

---

In [None]:

# Get feature names
feature_names = X_train.columns.tolist()

# 1. Random Forest Feature Importance
print(f"\n Random Forest - Top 10 Most Important Features:")
rf_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\n{'Rank':<6} {'Feature':<25} {'Importance':>12}")
print("-" * 45)
for idx, (_, row) in enumerate(rf_importance.head(10).iterrows(), 1):
    print(f"{idx:<6} {row['feature']:<25} {row['importance']:>12.6f}")

# 2. XGBoost Feature Importance
print(f"\n XGBoost - Top 10 Most Important Features:")
xgb_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\n{'Rank':<6} {'Feature':<25} {'Importance':>12}")
print("-" * 45)
for idx, (_, row) in enumerate(xgb_importance.head(10).iterrows(), 1):
    print(f"{idx:<6} {row['feature']:<25} {row['importance']:>12.6f}")

# 3. Logistic Regression Coefficients
print(f"\n Logistic Regression - Top 10 Features by Coefficient Magnitude:")
# For multi-class, take mean absolute coefficient across classes
lr_coef_mean = np.abs(lr_model.coef_).mean(axis=0)
lr_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': lr_coef_mean
}).sort_values('importance', ascending=False)

print(f"\n{'Rank':<6} {'Feature':<25} {'Abs Coefficient':>16}")
print("-" * 49)
for idx, (_, row) in enumerate(lr_importance.head(10).iterrows(), 1):
    print(f"{idx:<6} {row['feature']:<25} {row['importance']:>16.6f}")

# Check where w2v_score ranks in each model's importance table
print("\nNLP Feature (w2v_score) Rankings:")
rf_rank = rf_importance.reset_index(drop=True)
rf_rank = rf_rank[rf_rank['feature'] == 'w2v_score'].index[0] + 1

xgb_rank = xgb_importance.reset_index(drop=True)
xgb_rank = xgb_rank[xgb_rank['feature'] == 'w2v_score'].index[0] + 1

lr_rank = lr_importance.reset_index(drop=True)
lr_rank = lr_rank[lr_rank['feature'] == 'w2v_score'].index[0] + 1

print(f"  • Random Forest: Rank #{rf_rank} out of {len(feature_names)}")
print(f"  • XGBoost: Rank #{xgb_rank} out of {len(feature_names)}")
print(f"  • Logistic Regression: Rank #{lr_rank} out of {len(feature_names)}")

# Use the worst (largest) rank so that "across all models" is true of every model
worst_rank = max(rf_rank, xgb_rank, lr_rank)
if worst_rank <= 10:
    print("\nNLP feature is highly important across all models!")
elif worst_rank <= 15:
    print("\nNLP feature has moderate importance.")
else:
    print("\nNLP feature has lower importance than expected.")

print("\n" + "="*80)

## Step 13: Detailed Classification Reports

### Per-Class Performance Analysis

Classification reports show how well each model performs on each value category.

**Metrics Explained:**

- **Precision:** Of all listings predicted as this class, what % were correct?
  - High precision = few false positives
  
- **Recall:** Of all actual listings in this class, what % did we identify?
  - High recall = few false negatives
  
- **F1-Score:** Harmonic mean of precision and recall
  - Balanced measure of performance
  
- **Support:** Number of actual samples in this class
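
The definitions above can be sketched in plain Python. This is an illustrative re-implementation on invented toy labels, not the code path the notebook uses (that is `classification_report`); the helper name `per_class_report` is hypothetical.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter()            # predicted as c and actually c
    predicted = Counter(y_pred)     # how often each class was predicted
    support = Counter(y_true)       # how often each class actually occurs
    for actual, guess in zip(y_true, y_pred):
        if actual == guess:
            true_pos[actual] += 1

    report = {}
    for c in labels:
        precision = true_pos[c] / predicted[c] if predicted[c] else 0.0
        recall = true_pos[c] / support[c] if support[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1, support[c])
    return report

# Toy labels, invented for illustration only.
y_true = ["Poor", "Poor", "Fair", "Fair", "Fair", "Excellent"]
y_pred = ["Poor", "Fair", "Fair", "Fair", "Excellent", "Excellent"]

for name, (p, r, f1, n) in per_class_report(y_true, y_pred).items():
    print(f"{name:<10} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```

In this toy case "Poor" has perfect precision but only 50% recall: every listing predicted as Poor really was Poor, yet half of the actual Poor listings were missed.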

**What to Look For:**

- Are all three classes performing similarly?
- Is one class much harder to predict?
- Do models agree on which class is hardest?

---

In [None]:
# Detailed Classification Reports
print("="*80)
print("DETAILED CLASSIFICATION REPORTS")
print("="*80)

# Get class names
class_names = label_encoder.classes_

# 1. Logistic Regression
print(f"\n LOGISTIC REGRESSION")
print("-" * 80)
print(classification_report(y_test, y_pred_lr, target_names=class_names, digits=4))

# 2. Random Forest
print(f"\n RANDOM FOREST")
print("-" * 80)
print(classification_report(y_test, y_pred_rf, target_names=class_names, digits=4))

# 3. XGBoost
print(f"\n XGBOOST")
print("-" * 80)
print(classification_report(y_test, y_pred_xgb, target_names=class_names, digits=4))

print("\n" + "="*80)

## Step 14: Confusion Matrix Analysis

### Understanding Confusion Matrices

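A confusion matrix shows where the model makes mistakes. Rows are actual classes and columns are predicted classes:

```
                     Predicted
                 Poor   Fair   Excellent
Actual Poor      [ok]   [err]  [err]
       Fair      [err]  [ok]   [err]
       Excellent [err]  [err]  [ok]
```

- **Diagonal:** Correct predictions (true positives for that row's class)
- **Off-diagonal:** Errors. Each one is a false negative for the actual (row) class and a false positive for the predicted (column) class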

**Common Error Patterns:**

- **Adjacent Class Confusion:** Fair ↔ Poor or Fair ↔ Excellent
  - Expected: These classes are similar
  
- **Extreme Class Confusion:** Poor ↔ Excellent
  - Problematic: These classes are very different
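
To make the two error patterns measurable, the sketch below counts adjacent and extreme confusions separately. It assumes the classes are listed in ordinal order (Poor < Fair < Excellent), which may differ from the order in `label_encoder.classes_`; the helper names and the toy labels are hypothetical.

```python
# Class order is assumed to follow the ordinal scale Poor < Fair < Excellent.
classes = ["Poor", "Fair", "Excellent"]

def confusion_counts(y_true, y_pred, classes):
    """Build a count matrix: rows are actual classes, columns are predicted."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for actual, guess in zip(y_true, y_pred):
        matrix[index[actual]][index[guess]] += 1
    return matrix

def error_breakdown(matrix):
    """Return (adjacent_rate, extreme_rate) as fractions of all samples.

    Adjacent errors are one step apart on the ordinal scale;
    extreme errors are two or more steps apart.
    """
    total = sum(sum(row) for row in matrix)
    adjacent = extreme = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if abs(i - j) == 1:
                adjacent += count
            elif abs(i - j) >= 2:
                extreme += count
    return adjacent / total, extreme / total

# Toy labels, invented for illustration only.
y_true = ["Poor", "Poor", "Fair", "Fair", "Excellent", "Excellent"]
y_pred = ["Poor", "Fair", "Fair", "Excellent", "Excellent", "Poor"]

cm = confusion_counts(y_true, y_pred, classes)
adjacent_rate, extreme_rate = error_breakdown(cm)
print(f"Adjacent-class errors: {adjacent_rate:.1%}")
print(f"Extreme-class errors:  {extreme_rate:.1%}")
```

A model whose errors are almost all adjacent is behaving sensibly on an ordinal target; a non-trivial extreme rate is the signal worth investigating.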

---

In [None]:
# Confusion Matrix Visualization

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

models_pred = [
    ('Logistic Regression', y_pred_lr),
    ('Random Forest', y_pred_rf),
    ('XGBoost', y_pred_xgb)
]

for idx, (model_name, y_pred) in enumerate(models_pred):
    cm = confusion_matrix(y_test, y_pred)
    
    # Normalize to percentages
    cm_pct = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    
    # Plot
    sns.heatmap(cm_pct, annot=True, fmt='.1f', cmap='Blues', 
                xticklabels=class_names, yticklabels=class_names,
                ax=axes[idx], cbar_kws={'label': 'Percentage (%)'})
    
    axes[idx].set_title(f'{model_name}\nConfusion Matrix', 
                       fontsize=12, fontweight='bold')
    axes[idx].set_ylabel('Actual', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel('Predicted', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()



## Step 15: Final Summary and Conclusions


We have successfully integrated Word2Vec NLP scores with machine learning models and evaluated their performance using rigorous cross-validation.

### Model Performance Summary:

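All three models achieved excellent performance (>95% accuracy), which is consistent with:
1. Effective feature engineering
2. Sound data preprocessing
3. No leakage from review-based features, which were removed before training

High accuracy alone cannot rule out leakage, though: `value_category` is derived from `rating / price`, and `price` is itself an input feature, so part of this score reflects how the target was constructed.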

### NLP Feature Impact

**The Word2Vec score (`w2v_score`) showed minimal contribution to model predictions:**

| Model | w2v_score Rank | Out of |
|-------|----------------|--------|
| Random Forest | #17 | 27 |
| XGBoost | #25 | 27 |
| Logistic Regression | #27 (last) | 27 |

**Why This Happened:**
- The target variable (`value_category`) is derived from `rating / price`
- Since review-based features were removed, `price` and price-derived features dominate predictions
- Listing descriptions (captured by w2v_score) don't strongly correlate with actual guest ratings
- When combined with powerful price features, NLP contribution becomes negligible

While Fatih's standalone NLP model achieved 51.7% accuracy using only descriptions, that signal is overshadowed once the price features are added.
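
The mechanism can be illustrated on synthetic data. The sketch below does not use the project's dataset or pipeline: `price`, `rating`, and `text_score` are randomly generated with arbitrary ranges, and `text_score` is independent of the other two by construction (a stronger assumption than the weak correlation described above). It only shows why a target built as `rating / price` tracks `price` so closely.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000

# Synthetic stand-ins (assumed ranges, not real Airbnb values).
price = [rng.uniform(40, 400) for _ in range(n)]
rating = [rng.uniform(4.0, 5.0) for _ in range(n)]
text_score = [rng.uniform(-1, 1) for _ in range(n)]  # unrelated to rating and price

# Continuous value score, built the same way as the target's underlying ratio.
value = [r / p for r, p in zip(rating, price)]

corr_price = pearson(price, value)
corr_text = pearson(text_score, value)
print(f"corr(price, value)      = {corr_price:+.3f}")
print(f"corr(text_score, value) = {corr_text:+.3f}")
```

Because ratings vary over a narrow range while prices vary widely, the ratio is driven almost entirely by price, so any feature that does not track price or rating has little left to explain.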

### Model Selection Recommendations:

**For Production Deployment:**
- If **speed is critical**: Use Logistic Regression
- If **balance is needed**: Use Random Forest  
- If **maximum accuracy is required**: Use XGBoost

