<a href="https://colab.research.google.com/github/MohamedElquesni/ACL-International-Hotel-Booking-Analytics-/blob/Mohamed/Milestone%201/Milestone%201.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Milestone 1: International Hotel Booking Analytics
## Predicting Hotel Country Groups using Machine Learning (ML)

**Team Number:** [90]

---

## Project Overview

**Goal:** Build a multi-class classification model to predict the country group of hotels based on user demographics, hotel characteristics, and review scores.

**Dataset:**
- 50,000 reviews across 25 hotels in 25 countries
- 2,000 unique users with demographic information
- 11 target country groups

**Deliverables:**
1. A cleaned dataset after the feature engineering step.
2. Data engineering insights, including:
   - The best city for each traveller type.
   - The top three countries with the best value-for-money scores per traveller age group.
3. A trained classification model (statistical ML or shallow FFNN).
4. Model interpretation and explainability through XAI techniques (SHAP and LIME).
5. An inference function.

---

# Section 1: Data Cleaning

## 1.1 - Importing Libraries

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

## 1.2 - Loading and Assessing Datasets

### Hotels Dataset

#### Loading the Hotels Dataset


In [4]:
df_hotels = pd.read_csv('/content/hotels.csv')
df_hotels.head()


#### Renaming Hotel Columns --> `hotel_` + Original Column Name

In [None]:
df_hotels.columns = [
    col if col == 'hotel_id' or col == 'hotel_name' else 'hotel_' + col
    for col in df_hotels.columns
]


#### Checking for Null Values

In [None]:
df_hotels.isnull().sum()

#### Checking for Duplicated Values

In [None]:
df_hotels.duplicated().sum()

### Reviews Dataset

In [None]:
df_reviews = pd.read_csv('/content/reviews.csv')
df_reviews.head()

#### Renaming Reviews Columns --> `review_` + Original Column Name

In [None]:
df_reviews.columns = [
    col if col == 'review_id'
           or col == 'user_id'
           or col == 'hotel_id'
           or col == 'review_date'
           or col == 'review_text'
        else 'review_' + col
    for col in df_reviews.columns
]


#### Checking for Null Values

In [5]:
df_reviews.isnull().sum()


#### Checking for Duplicated Values

In [None]:
df_reviews.duplicated().sum()

### Users Dataset

In [None]:
df_users = pd.read_csv('/content/users.csv')
df_users.head()  # Show the first 5 rows

#### Renaming Users Columns --> `user_` + Original Column Name

In [None]:
df_users.columns = [
    col if col == 'user_id'
        or col == 'user_gender'
        else 'user_' + col
    for col in df_users.columns
]


#### Checking for Null Values

In [None]:
df_users.isnull().sum()

#### Checking for Duplicated Values

In [None]:
df_users.duplicated().sum()
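
The null and duplicate checks above are repeated for each of the three tables. As a minimal sketch, the same checks can be collected into one summary table. The helper name `quality_summary` and the toy frames are hypothetical stand-ins for `df_hotels`, `df_reviews`, and `df_users`, not part of the original notebook.

```python
import pandas as pd


def quality_summary(frames):
    """Return one row per table with its row, null-cell, and duplicate-row counts."""
    rows = []
    for name, frame in frames.items():
        rows.append({
            "table": name,
            "rows": len(frame),
            "null_cells": int(frame.isnull().sum().sum()),
            "duplicate_rows": int(frame.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("table")


# Toy frames for illustration only (hypothetical values)
toy_frames = {
    "hotels": pd.DataFrame({"hotel_id": [1, 2], "hotel_city": ["Dubai", None]}),
    "users": pd.DataFrame({"user_id": [1, 1], "user_gender": ["Female", "Female"]}),
}

summary = quality_summary(toy_frames)
print(summary)
```

In the notebook itself, the call would presumably be `quality_summary({"hotels": df_hotels, "reviews": df_reviews, "users": df_users})`.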

## 1.3 - Merging Datasets

In [None]:
# Merge reviews with users on user_id
df_merged = pd.merge(df_reviews, df_users, on='user_id', how='left')

# Merge the result with hotels on hotel_id
df_merged = pd.merge(df_merged, df_hotels, on='hotel_id', how='left')

df_merged.head()

In [None]:
df_merged.info()
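
A left merge silently keeps reviews whose `user_id` or `hotel_id` has no match, and silently multiplies rows if the lookup table has duplicate keys. The sketch below shows one way to guard against both using the `validate` and `indicator` options of `pd.merge`. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_reviews and df_users (hypothetical values)
reviews = pd.DataFrame({
    "review_id": [10, 11, 12],
    "user_id": [1, 2, 9],      # user 9 does not exist in the users table
    "hotel_id": [100, 100, 101],
})
users = pd.DataFrame({"user_id": [1, 2], "user_age_group": ["25-34", "35-44"]})

# validate="many_to_one" raises MergeError if user_id is duplicated in `users`;
# indicator=True adds a `_merge` column recording where each row was found.
merged = pd.merge(
    reviews, users, on="user_id", how="left",
    validate="many_to_one", indicator=True,
)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"Rows before: {len(reviews)}, after: {len(merged)}, unmatched: {len(unmatched)}")
```

If the row count is unchanged and `unmatched` is empty, the merge neither dropped nor duplicated any review.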

## 1.4 - Dropping Unnecessary Columns

In [None]:
# Drop unnecessary columns that do not contribute to the predictive modeling task
# These columns are either textual, identifiers or dates that add no generalizable value

df_merged.drop(
    columns=[
        'review_date',
        'review_text',
        'user_join_date',
        'hotel_name'
    ],
    inplace=True
)

df_merged.info()


---

## 1.5 - Cleaned Dataset Summary

All datasets have been successfully loaded, cleaned, and merged:
- No null values found.
- No duplicate records.
- Columns renamed with prefixes for clarity.
- Unnecessary columns were dropped.
- Final merged dataset contains 50,000 reviews with complete hotel characteristics and user information.


# Section 2: Data Engineering Questions

Using the cleaned and merged dataset, we analyze and visualize the following:

1. **Best City for Each Traveller Type**
   - Identify the city with the highest average review score for each traveller type.

2. **Top 3 Countries by Value-for-Money per Age Group**
   - Find the top 3 countries with the highest value-for-money score per traveller’s age group.

## 2.1 - Best City for Each Traveller Type

### Heatmap Analysis

In [6]:
pivot = pd.pivot_table(
    df_merged,
    index="hotel_city",
    columns="user_traveller_type",
    values="review_score_overall",
    aggfunc="mean"
)

plt.figure(figsize=(12, 6))

sns.heatmap(pivot, annot=True, cmap="copper", fmt=".2f", linewidths=0.5)

# Plot formatting
plt.title("Average Review Score per City and Traveller Type", fontsize=14)
plt.xlabel("Traveller Type")
plt.ylabel("City")
plt.show()



### Insights

Using the heatmap visualization, we can observe clear differences in average review scores across traveller types and cities:

- **Business travellers:** Dubai achieved the highest average score of **8.97**.
- **Couples:** Amsterdam recorded the highest average score of **9.10**.
- **Families:** Dubai again stood out with an average score of **9.21**.
- **Solo travellers:** Amsterdam had the highest average score of **9.11**.
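
Reading the maximum of each heatmap column by eye is error-prone, so the same answer can be extracted programmatically. The following is a small sketch on toy data. The scores and traveller-type labels are made up for illustration and are not the dataset's values.

```python
import pandas as pd

# Toy data (hypothetical scores); the column names mirror the merged dataset
toy = pd.DataFrame({
    "hotel_city": ["Dubai", "Dubai", "Amsterdam", "Amsterdam"],
    "user_traveller_type": ["Business", "Couple", "Business", "Couple"],
    "review_score_overall": [9.0, 8.0, 8.5, 9.1],
})

pivot_toy = toy.pivot_table(
    index="hotel_city",
    columns="user_traveller_type",
    values="review_score_overall",
    aggfunc="mean",
)

# idxmax returns the row label (city) holding the largest value in each column
best = pd.DataFrame({
    "best_city": pivot_toy.idxmax(),
    "mean_score": pivot_toy.max(),
})
print(best)
```

Applied to the real `pivot`, the same two calls should reproduce the city and score listed for each traveller type.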

## 2.2 - Top 3 Countries by Value-for-Money per Age Group

### Heatmap Analysis

In [None]:
pivot = pd.pivot_table(
    df_merged,
    index="hotel_country",
    columns="user_age_group",
    values="review_score_value_for_money",
    aggfunc="mean"
)

plt.figure(figsize=(14, 6))

sns.heatmap(
    pivot,
    annot=True,
    cmap="copper",
    fmt=".2f",
    linewidths=0.5
)

# Plot formatting
plt.title("Average Value-for-Money Score per Country and Age Group", fontsize=14)
plt.xlabel("User Age Group")
plt.ylabel("Hotel Country")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


### Insights

Top 3 countries by value-for-money score for each age group:

- **18–24:** China (8.71), Netherlands (8.70), Canada (8.66)
- **25–34:** China (8.73), Netherlands (8.68), Spain (8.63)
- **35–44:** China (8.70), Netherlands (8.69), New Zealand (8.65)
- **45–54:** China (8.72), New Zealand (8.67), Netherlands (8.65)
- **55+:** Netherlands (8.70), New Zealand (8.63), China (8.60)
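
The ranking above can also be produced directly rather than read off the heatmap. A minimal sketch, assuming toy scores (hypothetical, not the dataset's values): compute the mean per age group and country, sort within each age group, and keep the first three rows.

```python
import pandas as pd

# Toy data for illustration only (hypothetical scores)
toy = pd.DataFrame({
    "hotel_country": ["China", "Netherlands", "Canada", "Spain"] * 2,
    "user_age_group": ["18-24"] * 4 + ["25-34"] * 4,
    "review_score_value_for_money": [8.7, 8.6, 8.5, 8.0, 8.1, 8.8, 8.2, 8.6],
})

means = (
    toy.groupby(["user_age_group", "hotel_country"])["review_score_value_for_money"]
    .mean()
    .reset_index()
)

# Sort by score within each age group, then keep the three highest rows per group
top3 = (
    means.sort_values(
        ["user_age_group", "review_score_value_for_money"],
        ascending=[True, False],
    )
    .groupby("user_age_group")
    .head(3)
)
print(top3)
```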

---

# Section 3: Exploratory Data Analysis (EDA)

**Objective:** Understand data distributions, correlations, and patterns to inform feature engineering and modeling decisions.

---

## 3.1 - Target Variable Analysis

### Create Target Variable

We group the 25 countries into 11 geographic regions (country groups) to create our classification target.

In [7]:
# Create a mapping dictionary from country names to geographic regions (country groups)
# This groups the 25 countries into 11 regions (country groups) for classification
country_to_group = {
    # North America
    'United States': 'North_America',
    'Canada': 'North_America',

    # Western Europe
    'Germany': 'Western_Europe',
    'France': 'Western_Europe',
    'United Kingdom': 'Western_Europe',
    'Netherlands': 'Western_Europe',
    'Spain': 'Western_Europe',
    'Italy': 'Western_Europe',

    # Eastern Europe
    'Russia': 'Eastern_Europe',

    # East Asia
    'China': 'East_Asia',
    'Japan': 'East_Asia',
    'South Korea': 'East_Asia',

    # Southeast Asia
    'Thailand': 'Southeast_Asia',
    'Singapore': 'Southeast_Asia',

    # Middle East
    'United Arab Emirates': 'Middle_East',
    'Turkey': 'Middle_East',

    # Africa
    'Egypt': 'Africa',
    'Nigeria': 'Africa',
    'South Africa': 'Africa',

    # Oceania
    'Australia': 'Oceania',
    'New Zealand': 'Oceania',

    # South America
    'Brazil': 'South_America',
    'Argentina': 'South_America',

    # South Asia
    'India': 'South_Asia',

    # North America (Mexico separate due to different characteristics)
    'Mexico': 'North_America_Mexico'
}

# Apply the mapping to create our target variable
df_merged['country_group'] = df_merged['hotel_country'].map(country_to_group)

df_merged.head()



In [None]:
# Count how many reviews belong to each country group
print("Distribution of reviews across country groups:")
print(df_merged['country_group'].value_counts().sort_index())
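
`Series.map` returns `NaN` for any country that is missing from the dictionary (for example because of a spelling or whitespace difference), and `value_counts()` drops `NaN` by default, so such rows would not show up in the counts above. A short sketch of a check, using a hypothetical two-entry mapping and toy country values:

```python
import pandas as pd

# Hypothetical, shortened mapping and toy values for illustration
country_to_group_toy = {"Canada": "North_America", "Japan": "East_Asia"}
countries = pd.Series(["Canada", "Japan", "japan ", "Peru"], name="hotel_country")

groups = countries.map(country_to_group_toy)

# Any country that failed to map shows up here
unmapped = sorted(countries[groups.isna()].unique())
print(f"Unmapped countries: {unmapped}")
```

On the real data, an empty list would confirm that all 25 countries were assigned a group.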

### Visualize Target Distribution

Now let's visualize the distribution to better understand class imbalance.

In [None]:
# Get the distribution sorted by count (descending)
target_dist = df_merged['country_group'].value_counts().sort_values(ascending=False)

# Create a figure with 2 subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# First plot: Bar chart showing counts
target_dist.plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of Reviews by Country Group', fontsize=14)
axes[0].set_xlabel('Country Group')
axes[0].set_ylabel('Number of Reviews')
axes[0].tick_params(axis='x', rotation=45)

# Second plot: Pie chart showing percentages
axes[1].pie(target_dist, labels=target_dist.index, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Percentage Distribution by Country Group', fontsize=14)

plt.tight_layout()
plt.show()

# Calculate and display class imbalance statistics
largest_class = target_dist.idxmax()
smallest_class = target_dist.idxmin()
imbalance_ratio = target_dist.max() / target_dist.min()

print(f"\nClass Imbalance Analysis:")
print(f"Largest class: {largest_class} with {target_dist.max()} samples")
print(f"Smallest class: {smallest_class} with {target_dist.min()} samples")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
print(f"\nThis means the largest class has {imbalance_ratio:.2f}x more samples than the smallest class.")

---

## 3.2 - Numerical Features Analysis

Let's analyze the distribution and statistics of numerical features to understand their behavior.

In [None]:
# Identify all numerical columns in the dataset
numerical_cols = df_merged.select_dtypes(include=[np.number]).columns.tolist()

print(f"Total numerical features: {len(numerical_cols)}\n")
print("List of numerical features:")
for i, col in enumerate(numerical_cols, 1):
    print(f"  {i}. {col}")

print("\n")
print("Statistical Summary of Numerical Features:")
print("\n")
df_merged[numerical_cols].describe()

In [None]:
# Visualize the distribution of numerical features using histograms
# This helps us understand if features are normally distributed, skewed, etc.

fig, axes = plt.subplots(4, 4, figsize=(18, 16))
axes = axes.ravel()  # Flatten the 2D array to 1D for easier iteration

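# Plot a histogram for each numerical column, skipping identifier columns (up to 16 plots)
id_cols = ['review_id', 'user_id', 'hotel_id']
plot_cols = [col for col in numerical_cols if col not in id_cols][:16]
for idx, col in enumerate(plot_cols):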
    axes[idx].hist(df_merged[col], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{col}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel('Value', fontsize=9)
    axes[idx].set_ylabel('Frequency', fontsize=9)
    axes[idx].grid(axis='y', alpha=0.3)

plt.suptitle('Distribution of Numerical Features', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# Use box plots to detect outliers in numerical features
# Box plots show: median (center line), quartiles (box edges), and outliers (dots beyond whiskers)

fig, axes = plt.subplots(4, 4, figsize=(18, 16))
axes = axes.ravel()

plot_idx = 0  # Separate counter for axes indexing
# Create box plot for each numerical column
for idx, col in enumerate(numerical_cols[:19]):
    if col == 'review_id' or col == 'user_id' or col == 'hotel_id':
        continue
    axes[plot_idx].boxplot(df_merged[col], vert=True)
    axes[plot_idx].set_title(f'{col}', fontsize=10, fontweight='bold')
    axes[plot_idx].set_ylabel('Value', fontsize=9)
    axes[plot_idx].grid(axis='y', alpha=0.3)
    plot_idx += 1

plt.suptitle('Box Plots for Outlier Detection', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n Insights:")
print("- The box represents the interquartile range (IQR): 25th to 75th percentile")
print("- The line inside the box is the median (50th percentile)")
print("- Whiskers extend to 1.5 * IQR from the box")
print("- Points beyond whiskers are potential outliers")

## 3.3 - Correlation Analysis

Correlation analysis helps us identify relationships between numerical features. High correlation can indicate:
- Redundant features (multicollinearity)
- Features that move together
- Potential feature combinations

In [None]:
# Calculate correlation matrix (Pearson correlation coefficient)
# Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation)

numerical_cols = [col for col in numerical_cols
                           if col not in ['review_id', 'user_id', 'hotel_id']]

correlation_matrix = df_merged[numerical_cols].corr()

# Visualize using a heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(
    correlation_matrix,
    annot=True,
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=0.5,
    cbar_kws={'label': 'Correlation Coefficient'}
)
plt.title('Correlation Matrix Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Find and print highly correlated pairs (correlation > 0.8 or < -0.8)
print("\n")
print("Highly Correlated Feature Pairs (|correlation| > 0.8):")
print("\n")
print(f"{'Feature 1':<35} {'Feature 2':<35} {'Correlation':>10}")
print("-"*80)

highly_correlated_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.8:
            feat1 = correlation_matrix.columns[i]
            feat2 = correlation_matrix.columns[j]
            print(f"{feat1:<35} {feat2:<35} {corr_value:>10.3f}")
            highly_correlated_pairs.append((feat1, feat2, corr_value))


## 3.4 - Categorical Features Analysis

Let's examine the distribution of categorical features to understand user demographics and traveller characteristics.

In [None]:
# Define categorical features to analyze
categorical_features = ['user_gender', 'user_age_group', 'user_traveller_type','user_country']

print("CATEGORICAL FEATURES DISTRIBUTION")

for feat in categorical_features:
    print(f"\n{feat.upper().replace('_', ' ')}:")
    print("-" * 40)
    counts = df_merged[feat].value_counts()

    # Display counts and percentages
    for value, count in counts.items():
        percentage = (count / len(df_merged)) * 100
        print(f"  {value:<30} {count:>6} ({percentage:>5.2f}%)")

    print(f"\n  Total unique values: {df_merged[feat].nunique()}")
    print("="*80)

In [None]:
# Visualize categorical features using bar charts
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
axes = axes.ravel()

# Create a bar chart for each categorical feature
for idx, col in enumerate(categorical_features):
    # Get value counts and plot
    value_counts = df_merged[col].value_counts()
    value_counts.plot(kind='bar', ax=axes[idx], color='teal', alpha=0.7, edgecolor='black')

    # Formatting
    axes[idx].set_title(f'{col.replace("_", " ").title()} Distribution',
                        fontsize=13, fontweight='bold')
    axes[idx].set_xlabel(col.replace('_', ' ').title(), fontsize=11)
    axes[idx].set_ylabel('Count', fontsize=11)
    axes[idx].tick_params(axis='x', rotation=45)
    axes[idx].grid(axis='y', alpha=0.3)

    # Add value labels on top of bars
    for container in axes[idx].containers:
        axes[idx].bar_label(container, fmt='%d', padding=3, fontsize=9)

plt.suptitle('Distribution of Categorical Features', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 3.5 - EDA Summary & Key Insights

Based on the exploratory data analysis, here are the key findings:

### Target Variable (country_group)
**Class imbalance**: The dataset shows a moderate class imbalance, with a 6:1 ratio between the largest and smallest country groups.
- *Largest class*: Western Europe (most reviews).
- *Smallest class*: Eastern Europe (fewest reviews).

This imbalance indicates the need for *stratified sampling* during the train/test split to ensure all regions are properly represented in the model.

### Data Leakage
The features *hotel_city*, *hotel_country*, and all hotel baseline scores would leak the target. Because there are only 25 unique hotels, each of these values identifies a hotel and therefore its country group, so the model could “cheat” by looking up already-known information rather than learning genuine patterns.
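
One way to test a feature for this kind of leakage is to check whether each of its values maps to a single target class. The sketch below uses a hypothetical helper, `determines_target`, on toy data. It is an illustration of the idea, not part of the original analysis.

```python
import pandas as pd


def determines_target(frame, feature, target):
    """True if every distinct value of `feature` maps to exactly one `target` value."""
    return bool((frame.groupby(feature)[target].nunique() <= 1).all())


# Toy data for illustration only
toy = pd.DataFrame({
    "hotel_city": ["Dubai", "Dubai", "Tokyo", "Cairo"],
    "user_gender": ["Female", "Male", "Female", "Female"],
    "country_group": ["Middle_East", "Middle_East", "East_Asia", "Africa"],
})

city_leaks = determines_target(toy, "hotel_city", "country_group")
gender_leaks = determines_target(toy, "user_gender", "country_group")
print(f"hotel_city determines target: {city_leaks}")
print(f"user_gender determines target: {gender_leaks}")
```

A feature for which this returns `True` on the full dataset acts as a lookup key for the target and is a candidate for exclusion.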

### Numerical Features
**Scale**: Most numerical features (review and baseline scores) are on a 0–10 scale.

**Distribution**: They are roughly normally distributed with a slight left (negative) skew, since most hotels receive fairly high ratings and the tail extends toward lower scores.

**Outliers**: Only a few mild outliers were detected in the box plots.

**Variance**: Features show reasonable variance, which is helpful for modeling.

### Feature Correlations
**Review scores**: Highly correlated with each other (users who rate one aspect highly tend to rate others highly too).

**Hotel baseline scores**: Show moderate correlation.

### Categorical Features

**User Gender:**
- Fairly balanced distribution across Male/Female/Other.
- No major gender bias in the dataset.

**User Age Group:**
- Most users fall in the 25-44 age range.
- Represents the primary demographic for hotel reviews.

**Traveller Type:**
- *Couples* and *Families* are the most common (about 60% of reviews).
- *Business* and *Solo* travellers are less common but still well-represented.



---

# Section 4: Feature Engineering

**Objective:** Create deviation features to capture how individual user experiences differ from hotel baselines.

**Approach:** We will use user demographics and deviation features to predict country groups.

## 4.1 - Deviation Features

**Justification:** Deviation features capture how individual user experiences differ from hotel baselines.

**Formula:** `deviation = individual_review_score - hotel_baseline_score`

This indicates:
- Whether the user's experience was better or worse than the hotel's average
- How user satisfaction compares to typical hotel performance
- Individual user preferences relative to hotel standards

### Computing the Deviation

In [None]:
# Formula: deviation = individual_review_score - hotel_baseline_score

df_merged['deviation_cleanliness'] = (
    df_merged['review_score_cleanliness'] - df_merged['hotel_cleanliness_base']
)

df_merged['deviation_comfort'] = (
    df_merged['review_score_comfort'] - df_merged['hotel_comfort_base']
)

df_merged['deviation_facilities'] = (
    df_merged['review_score_facilities'] - df_merged['hotel_facilities_base']
)

df_merged['deviation_location'] = (
    df_merged['review_score_location'] - df_merged['hotel_location_base']
)

df_merged['deviation_staff'] = (
    df_merged['review_score_staff'] - df_merged['hotel_staff_base']
)

df_merged['deviation_value_for_money'] = (
    df_merged['review_score_value_for_money'] - df_merged['hotel_value_for_money_base']
)

deviation_cols = [col for col in df_merged.columns if col.startswith('deviation_')]
print(df_merged[deviation_cols].head())

### Analysis

In [None]:
df_merged[deviation_cols].describe()

---

# Section 5: Data Preprocessing

**Objective:** Prepare data to be ready for our machine learning model through encoding, scaling, and splitting.

**Selected Features:**
- Categorical: *user_gender*, *user_age_group*, *user_traveller_type* (user demographics)
- Numerical: score-based features and the newly engineered *deviation* features
- Target: *country_group*

## 5.0 - Importing libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## 5.1 - Feature Selection

In [None]:
df_processed = df_merged.copy()

selected_columns = [
    # Categorical features
    'user_gender', 'user_age_group', 'user_traveller_type',

    # Review scores (excluding overall)
    'review_score_cleanliness', 'review_score_comfort',
    'review_score_facilities', 'review_score_location',
    'review_score_staff', 'review_score_value_for_money',

    # Deviation features
    'deviation_cleanliness', 'deviation_comfort',
    'deviation_facilities', 'deviation_location',
    'deviation_staff', 'deviation_value_for_money',

    # Target
    'country_group'
]

df_processed = df_processed[selected_columns]
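
One point that may be worth checking about this feature set: since `deviation = review_score - hotel_baseline`, a model that receives both the review score and its deviation can in principle reconstruct the hotel baseline as `review_score - deviation`. The sketch below demonstrates this on synthetic data. The three baseline values and the noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical hotels, each with a fixed baseline score
baseline = rng.choice([8.2, 8.7, 9.1], size=1000)

# Individual review scores scatter around the hotel's baseline
review = np.round(baseline + rng.normal(0, 0.5, size=1000), 1)
deviation = review - baseline

# The baseline is an exact linear function of the two "allowed" features
recovered = review - deviation
n_distinct = len(np.unique(np.round(recovered, 6)))

print(f"Baseline recovered exactly: {np.allclose(recovered, baseline)}")
print(f"Distinct recovered values: {n_distinct}")
```

If the baselines are unique per hotel, keeping both columns may reintroduce the hotel identity that the leakage discussion set out to remove. Training once with only the deviation features (or only the review scores) and comparing scores would show how much the model relies on this.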

## 5.2 - Encoding

**Objective**: Converting categorical features into numerical form.

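*One-hot encoding:* applied to all three categorical features. It is the natural choice for unordered (nominal) features such as gender and traveller type; `user_age_group` is ordinal, but one-hot encoding it avoids imposing an arbitrary numeric spacing between groups.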

In [None]:
df_processed = pd.get_dummies(
    df_processed,
    columns=['user_gender', 'user_age_group', 'user_traveller_type'],
    drop_first=True
)

print("After encoding:")
print(f"\nColumn names:")
print(df_processed.columns.tolist())
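
With `drop_first=True`, the first category of each feature (in sorted order) is dropped and becomes the implicit reference level. The sketch below shows the effect on a toy column, and one possible way to align a new row to the training columns at inference time with `reindex`. The category labels are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"user_traveller_type": ["Business", "Couple", "Solo", "Couple"]})

full = pd.get_dummies(toy, columns=["user_traveller_type"])
reduced = pd.get_dummies(toy, columns=["user_traveller_type"], drop_first=True)

print(full.columns.tolist())     # one column per category
print(reduced.columns.tolist())  # reference category removed

# At inference time a single row contains only one category, so encode it
# without drop_first and align it to the training columns instead.
train_cols = reduced.columns
new_row = pd.DataFrame({"user_traveller_type": ["Solo"]})
new_encoded = (
    pd.get_dummies(new_row, columns=["user_traveller_type"])
    .reindex(columns=train_cols, fill_value=0)
)
print(new_encoded)
```

Storing the training column list alongside the model is one way to keep the inference function consistent with the encoding used here.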

## 5.3 - Split Features and Target

In [None]:
X = df_processed.drop('country_group', axis=1)
y = df_processed['country_group']

## 5.4 - Train-Test Split

Split the data into 80% training and 20% test sets using stratified sampling.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Dataset Split:")
print(f"Train: {X_train.shape} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test:  {X_test.shape} ({len(X_test)/len(X)*100:.1f}%)")
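
To confirm that `stratify=y` preserved the class proportions, the class shares of the full, train, and test sets can be compared side by side. A minimal sketch on a toy target with a 6:3:1 imbalance (hypothetical class labels and sizes):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target for illustration
y_toy = pd.Series(["A"] * 600 + ["B"] * 300 + ["C"] * 100)
X_toy = pd.DataFrame({"x": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Class shares in each split, aligned on the class label
shares = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()

max_gap = shares[["train", "test"]].sub(shares["full"], axis=0).abs().max().max()
print(shares)
print(f"Largest deviation from the full-data share: {max_gap:.4f}")
```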

## 5.5 - Encode Target Variable

Convert country_group labels to numerical format using LabelEncoder.

In [None]:
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print("Class Mapping:")
for i, class_name in enumerate(label_encoder.classes_):
    print(f"{i}: {class_name}")

# Section 6: Model Development

**Objective:** Train and evaluate two classification models to predict our target variable (*country_group*).

We implemented:
1. **Logistic Regression**: A linear baseline model.
2. **Random Forest Classifier (RFC)**: A machine learning model that makes predictions by combining the results of many decision trees.

## 6.0 - Importing Libraries

In [8]:
from sklearn.linear_model import LogisticRegression  # Linear classification model
from sklearn.ensemble import RandomForestClassifier  # Ensemble tree-based model
from sklearn.model_selection import GridSearchCV  # Exhaustive search over parameter grid
from sklearn.metrics import accuracy_score         # Proportion of correct predictions
from sklearn.metrics import classification_report  # Detailed per-class metrics
from sklearn.metrics import confusion_matrix       # Matrix of actual vs predicted labels
from sklearn.metrics import precision_score        # Precision = TP / (TP + FP)
from sklearn.metrics import recall_score           # Recall = TP / (TP + FN)
from sklearn.metrics import f1_score               # Harmonic mean of precision and recall

## 6.2 - Logistic Regression

**What is Logistic Regression?**
- A linear model that predicts class probabilities using the logistic (sigmoid) function, which generalizes to the softmax function when there are more than two classes, as here.

- Simple, fast, and interpretable, serving as a strong baseline.

**Key Parameters:**
- `max_iter=1000`: Maximum number of iterations for the optimizer to converge.
- `class_weight='balanced'`: Automatically adjusts weights to handle any class imbalance.
- `random_state=42`: Ensures reproducibility of results.
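- `n_jobs=-1`: Requests all available CPU cores. In scikit-learn this setting only parallelizes one-vs-rest fitting, so it may have no effect with the default multinomial solver.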
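
To make `class_weight='balanced'` concrete: scikit-learn documents the weight of each class as `n_samples / (n_classes * count_of_that_class)`, so rarer classes get larger weights. The sketch below compares `compute_class_weight` against that formula on a toy label array (the class sizes are hypothetical).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy encoded labels with a 6:3:1 imbalance
y_toy = np.array([0] * 600 + [1] * 300 + [2] * 100)
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The documented formula, computed by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

Here the smallest class receives six times the weight of the largest, mirroring the 6:1 count ratio.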

In [9]:
# Initialize the model
lr_model = LogisticRegression(
    max_iter=1000,              # Maximum iterations
    class_weight='balanced',    # Handle any class imbalance automatically
    random_state=42,            # For reproducibility
    n_jobs=-1                   # Use all CPU cores
)

### Model Training

In [10]:
# Train the model
lr_model.fit(X_train, y_train_encoded)


### Model Testing

In [None]:
# Test the model
y_test_pred_lr = lr_model.predict(X_test)

# Calculate performance metrics
lr_test_acc = accuracy_score(y_test_encoded, y_test_pred_lr)
lr_test_prec = precision_score(y_test_encoded, y_test_pred_lr, average='weighted', zero_division=0)
lr_test_rec = recall_score(y_test_encoded, y_test_pred_lr, average='weighted', zero_division=0)
lr_test_f1 = f1_score(y_test_encoded, y_test_pred_lr, average='weighted', zero_division=0)

print("Model Performance")
print("\n")
print(f"{'Accuracy':<12} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
print("-"*80)
print(f"{lr_test_acc:<12.4f} {lr_test_prec:<12.4f} {lr_test_rec:<12.4f} {lr_test_f1:<12.4f}")

## 6.3 - Random Forest Classifier

**What is Random Forest?**
- An ensemble of decision trees that vote on the final prediction.
- Each tree is trained on a bootstrap sample of the training data and considers a random subset of features at each split.

Using *GridSearchCV* to determine the best parameters:

**What is GridSearchCV?**
- Exhaustively searches through a specified parameter grid.
- Performs k-fold cross-validation for each parameter combination, which reduces the risk of tuning to a single lucky validation split.
- Automatically selects the best hyperparameters based on a scoring metric.

**Hyperparameter Grid:**
- `n_estimators`: Number of trees in the forest (more trees usually give more stable predictions, but train more slowly).
- `max_depth`: Maximum depth of each tree (controls tree complexity).
- `min_samples_split`: Minimum samples required to split a node (prevents overfitting).
- `min_samples_leaf`: Minimum samples required at each leaf node (prevents overfitting).

In [None]:
# Initialize the model
rf_base = RandomForestClassifier(
    class_weight='balanced',  # Handle class imbalance
    random_state=42,          # For reproducibility
    n_jobs=-1                 # Use all CPU cores
)

# GridSearchCV will try all combinations of these parameters
param_grid = {
    'n_estimators': [50, 100, 200],           # Number of trees to test
    'max_depth': [10, 15, 20, None],          # Maximum tree depth (None = unlimited)
    'min_samples_split': [2, 5, 10],          # Minimum samples to split a node
    'min_samples_leaf': [1, 2, 4]             # Minimum samples at leaf nodes
}

# Initialize GridSearchCV
# cv=5 means 5-fold cross-validation: splits training data into 5 parts, trains on 4 parts and validates on 1 part, rotating through all combinations
grid_search = GridSearchCV(
    estimator=rf_base,           # The model to tune
    param_grid=param_grid,       # Parameter combinations to try
    cv=5,                        # 5-fold cross-validation
    scoring='f1_weighted',       # Optimization metric (weighted F1-score)
    n_jobs=-1,                   # Parallel processing
    verbose=2                    # Print progress updates
)

### Model Training

In [None]:
# Train all possible models using the parameter grid
grid_search.fit(X_train, y_train_encoded)

# Extract the best model found by GridSearchCV
rf_model = grid_search.best_estimator_

# Display the best parameters found
print("Best Parameters Found:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest Cross-Validation F1-Score: {grid_search.best_score_:.4f}")

### Model Testing

In [None]:
# Test the model
y_test_pred_rf = rf_model.predict(X_test)

# Calculate performance metrics
rf_test_acc = accuracy_score(y_test_encoded, y_test_pred_rf)
rf_test_prec = precision_score(y_test_encoded, y_test_pred_rf, average='weighted', zero_division=0)
rf_test_rec = recall_score(y_test_encoded, y_test_pred_rf, average='weighted', zero_division=0)
rf_test_f1 = f1_score(y_test_encoded, y_test_pred_rf, average='weighted', zero_division=0)

print("Model Performance")
print("\n")
print(f"{'Accuracy':<12} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
print("-"*80)
print(f"{rf_test_acc:<12.4f} {rf_test_prec:<12.4f} {rf_test_rec:<12.4f} {rf_test_f1:<12.4f}")


## 6.4 - Understanding GridSearchCV Results

### **Overview**

GridSearchCV tested **108 different parameter combinations** with **5-fold cross-validation**, resulting in **540 total model trainings**. Here's what that means:

**The Process:**
1. **Parameter Grid**: We defined 4 hyperparameters with multiple values each:
   - `n_estimators`: 3 options [50, 100, 200]
   - `max_depth`: 4 options [10, 15, 20, None]
   - `min_samples_split`: 3 options [2, 5, 10]
   - `min_samples_leaf`: 3 options [1, 2, 4]
   - Total combinations: 3 × 4 × 3 × 3 = **108 combinations**

2. **5-Fold Cross-Validation**: For each combination:
   - Split training data into 5 parts
   - Train on 4 parts, validate on 1 part
   - Rotate 5 times so each part serves as validation once
   - Average the 5 results to get reliable performance estimate

3. **Total Work**: 108 combinations × 5 folds = **540 model trainings**
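
The arithmetic above can be checked with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` iterates over. A short sketch using the grid from this section:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_combinations = len(ParameterGrid(param_grid))
n_folds = 5
n_cv_fits = n_combinations * n_folds

print(f"Combinations: {n_combinations}")
print(f"Cross-validation fits: {n_cv_fits}")
```

With the default `refit=True`, `GridSearchCV` also trains the best combination once more on the full training set, so the total number of fits is one more than the cross-validation count.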

---

### **Best Parameters Found:**

GridSearchCV selected the best-scoring combination among those tested:

| Parameter | Best Value | What It Means |
|-----------|------------|---------------|
| **n_estimators** | 200 | Use 200 trees (maximum we tested) - more trees = more diverse predictions |
| **max_depth** | None | Allow trees to grow to unlimited depth - captures complex patterns |
| **min_samples_split** | 5 | Need at least 5 samples to split a node - prevents splitting on noise |
| **min_samples_leaf** | 1 | Allow leaf nodes with 1 sample - aggressive but balanced by ensemble |

**Why These Parameters Work:**
- **200 trees**: Our 11-class problem benefits from many diverse decision trees voting together
- **Unlimited depth**: With 40,000 training samples (80% of the 50,000 reviews), we have enough data to support deep trees without severe overfitting
- **min_samples_split=5**: Strikes a balance - not too restrictive (like 10) but prevents splitting on tiny groups (like 2)
- **min_samples_leaf=1**: Seems aggressive, but the ensemble of 200 trees averages out individual tree overfitting

---

### **Cross-Validation F1-Score: 0.9120 (91.20%)**

This is **excellent** performance! Here's why this metric matters:

**What is F1-Score?**
- Harmonic mean of Precision and Recall
- Ranges from 0 (worst) to 1 (perfect)
- Weighted F1 averages the per-class F1 scores in proportion to each class's support, so it reflects our 6:1 class imbalance
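
Weighted F1 and accuracy are related but not identical. The toy example below (hypothetical labels, not model output) computes both, along with the weighted F1 by hand from the per-class scores, to show that they can differ when the model over-predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: six samples of class 0, three of class 1, one of class 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
# The toy "model" over-predicts the majority class
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]

acc = accuracy_score(y_true, y_pred)
f1_weighted = f1_score(y_true, y_pred, average="weighted")

# Same value by hand: per-class F1 averaged with class counts as weights
per_class_f1 = f1_score(y_true, y_pred, average=None)
support = np.bincount(y_true)
f1_manual = float(np.sum(per_class_f1 * support) / support.sum())

print(f"Accuracy:    {acc:.4f}")
print(f"Weighted F1: {f1_weighted:.4f}")
```

In this example accuracy is 0.80 while weighted F1 is about 0.76, because the minority class's low recall pulls its F1 down.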

**Why Cross-Validation Score Matters:**
- **NOT** based on a single test/validation split
- **Averaged** across 5 different validation sets
- Suggests the model **generalizes well** to unseen data drawn from the same distribution
- More reliable than a single train/test split

**What 91.20% Means:**
- It is the support-weighted average of the per-class F1 scores, not the share of correct predictions, although the two are often close
- This is the performance **during training** on held-out validation folds
- It is encouraging evidence that the model will perform well on new hotel review data, to be confirmed on the test set


# Section 7: Model Evaluation

**Objective:** Evaluate the best performing model (Random Forest with optimized hyperparameters) using the traditional evaluation metrics.


## 7.1 - Confusion Matrix Analysis

Visualize the confusion matrix to understand where the Random Forest model makes correct predictions and misclassifications.

In [None]:
plt.figure(figsize=(10, 8))
y_test_pred = rf_model.predict(X_test)
cm = confusion_matrix(y_test_encoded, y_test_pred)
class_names = label_encoder.classes_

# Plot
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names,
            cbar_kws={'label': 'Number of Predictions'})
plt.title("Random Forest - Confusion Matrix", fontsize=14, fontweight="bold", pad=15)
plt.xlabel("Predicted Country Group", fontsize=12)
plt.ylabel("True Country Group", fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("\nClass Labels Mapping:")
for i, class_name in enumerate(label_encoder.classes_):
    print(f"  {i}: {class_name}")

## 7.2 - Classification Report

Per-class performance metrics showing precision, recall, and F1-score.

In [None]:
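y_test_pred_rf = rf_model.predict(X_test)

# Report only the classes that actually occur in the test labels or predictions
present = np.unique(np.concatenate([y_test_encoded, y_test_pred_rf]))
present_names = label_encoder.classes_[present]

print(classification_report(
    y_test_encoded, y_test_pred_rf,
    labels=present,
    target_names=present_names,
    zero_division=0
))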

## 7.4 - Evaluation Summary

### **Model Performance:**

The Random Forest model achieved:
- **Test Accuracy: ~90%** - Excellent for an 11-class classification problem.

---

# Section 8: Model Explainability (XAI)

**Objective:** Use SHAP and LIME to explain model predictions and identify important features.

**Why XAI Matters:**
- Understand which features drive predictions.
- Build trust in the model's decision-making process.
- Identify potential biases or unexpected patterns.

**Tools:**
- **SHAP (SHapley Additive exPlanations)**: *Game theory-based approach* for global and local explanations.
- **LIME (Local Interpretable Model-agnostic Explanations)**: Local *surrogate models* for individual predictions.
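
The game-theory idea behind SHAP can be illustrated with an exact Shapley computation on a tiny two-player game. This is a conceptual sketch only: the `payoff` table and the helper `shapley_values` are hypothetical, and the SHAP library uses far more efficient algorithms for tree models.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}

# Hypothetical payoff: the "model output" when only a subset of features is known
payoff = {
    frozenset(): 0.0,
    frozenset({"location"}): 4.0,
    frozenset({"staff"}): 2.0,
    frozenset({"location", "staff"}): 10.0,
}

phi = shapley_values(["location", "staff"], lambda s: payoff[frozenset(s)])
print(phi)
```

The two contributions sum to the full payoff (10.0), which is the additivity property that lets SHAP values be read as a decomposition of a prediction.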

## 8.1 - SHAP Analysis

In [None]:
# - SHAP Analysis -
# !pip install shap
import shap
import numpy as np

# Background sample (reference distribution)
background = shap.sample(X_train, 100, random_state=42)
# Initialize the SHAP explainer
explainer = shap.TreeExplainer(rf_model, background)
# Sample your dataset
sample_size = 500
X_test_sample = shap.sample(X_test, sample_size, random_state=42)
# Compute SHAP values on the sample only
shap_values = explainer.shap_values(X_test_sample)




### 8.1.1 - Global Feature Importance with SHAP

**What is Global Feature Importance?**
- Shows which features are most important across ALL predictions.
- Helps identify the most influential features in the model.
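
A bar-style global ranking is typically the mean absolute SHAP value per feature. The sketch below shows that aggregation on a small hand-made array; the values in `shap_demo` and the names in `feature_names_demo` are illustrative assumptions, not results from the model.

```python
import numpy as np

# Hypothetical SHAP values: 4 samples x 3 features for a single class
feature_names_demo = ["deviation_staff", "review_score_location", "user_gender_Male"]
shap_demo = np.array([
    [ 0.20, -0.50,  0.01],
    [-0.10,  0.40,  0.00],
    [ 0.30, -0.30,  0.02],
    [-0.20,  0.60, -0.01],
])

# Global importance = mean absolute SHAP value per feature
global_importance = np.abs(shap_demo).mean(axis=0)

# Print features from most to least important
for idx in np.argsort(global_importance)[::-1]:
    print(f"{feature_names_demo[idx]:<24} {global_importance[idx]:.3f}")
```

Taking the absolute value first matters: positive and negative contributions would otherwise cancel out and hide an influential feature.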

In [None]:
plt.figure(figsize=(10, 8))
# shap_values was computed on X_test_sample, so plot against the same rows
shap.summary_plot(shap_values, X_test_sample, plot_type="bar", show=False)
plt.title("SHAP Bar Plot", fontsize=14, fontweight='bold', pad=20)
plt.xlabel("Mean |SHAP Value|", fontsize=11)
plt.tight_layout()
plt.show()

### 8.1.2 - Local Feature Importance with SHAP

**What is Local Feature Importance?**
- Explain WHY the model made a specific prediction for a specific sample.
- Show how each feature contributed to that individual prediction.
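
A local explanation is additive: the base value plus the per-feature contributions reproduces the model output for that sample. The sketch below checks this for a linear model, where the contributions have a simple closed form under an independent-features assumption. The model weights, `background` data, and instance `x` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))       # hypothetical reference data
weights = np.array([1.5, -2.0, 0.5])          # hypothetical linear model
bias = 0.3

def f(X):
    """Linear model output for one row or a batch of rows."""
    return X @ weights + bias

x = np.array([1.0, 0.5, -1.0])                # the instance to explain

base_value = f(background).mean()             # expected model output over the background
phi = weights * (x - background.mean(axis=0)) # per-feature contributions for a linear model

print("base value:", round(base_value, 4))
print("contributions:", np.round(phi, 4))
print("base + sum(contributions):", round(base_value + phi.sum(), 4))
print("model output f(x):", round(f(x), 4))
```

A force plot visualizes the same decomposition: features push the output up or down from the base value until it reaches the prediction.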


In [None]:
shap.initjs()

sample_index = 50

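# Get the model prediction and true label for this test row
row = X_test.iloc[[sample_index]]
pred_class = rf_model.predict(row)[0]
true_class = np.asarray(y_test_encoded)[sample_index]

pred_name = label_encoder.classes_[pred_class]
true_name = label_encoder.classes_[true_class]

# Get prediction probabilities
pred_proba = rf_model.predict_proba(row)[0]
confidence = pred_proba[pred_class] * 100

print(f"\nSample {sample_index}:")
print(f"  True Label: {true_name}")
print(f"  Predicted: {pred_name} (Confidence: {confidence:.1f}%)")

# Compute SHAP values for this exact row. The precomputed shap_values cover
# X_test_sample, which may not follow X_test's row order, so they cannot be
# indexed with sample_index directly.
row_shap = explainer.shap_values(row)
if isinstance(row_shap, list):
    # Older SHAP versions: one (n_samples, n_features) array per class
    shap_vals_for_sample = row_shap[pred_class][0]
else:
    # Newer SHAP versions: a single (n_samples, n_features, n_classes) array
    shap_vals_for_sample = np.asarray(row_shap)[0, :, pred_class]

# Expected value (base value) for the predicted class
expected_val = explainer.expected_value[pred_class]

# Create the force plot
shap.force_plot(
    expected_val,
    shap_vals_for_sample,
    row.iloc[0],
    feature_names=X_test.columns.tolist()
)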

## 8.2 - LIME Analysis

**What is LIME?**
- Local Interpretable Model-agnostic Explanations.
- Fits a simple, interpretable (linear) surrogate model around one instance to explain that individual prediction.
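
The core idea can be sketched in three steps: perturb the instance, weight the perturbed points by proximity, and fit a weighted linear model. This is a rough illustration of the principle, not the `lime` library's actual implementation; `black_box`, the noise scale, and `kernel_width` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

def black_box(X):
    """Hypothetical non-linear model standing in for one class of predict_proba."""
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1] ** 2)))

instance = np.array([0.5, 1.0])

# 1) Perturb the instance with Gaussian noise
perturbed = instance + rng.normal(scale=0.3, size=(2000, 2))

# 2) Weight perturbed points by proximity to the instance (exponential kernel)
distances = np.linalg.norm(perturbed - instance, axis=1)
kernel_width = 0.5
sample_weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# 3) Fit a weighted linear surrogate to the black-box outputs
surrogate = Ridge(alpha=1e-3)
surrogate.fit(perturbed, black_box(perturbed), sample_weight=sample_weights)

print("Local coefficients:", np.round(surrogate.coef_, 3))
```

The signs of the surrogate coefficients indicate whether each feature pushes the prediction up or down near this instance. Because the perturbations are random, the coefficients shift slightly when the seed changes.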



In [None]:
# Install LIME
# !pip install lime
import lime
import lime.lime_tabular

# Initialize LIME explainer
explainer_lime = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,              # Training data to understand feature distributions
    feature_names=X_train.columns.tolist(),    # Feature names for readable output
    class_names=label_encoder.classes_.tolist(),  # All 11 country group names
    mode='classification',                      # Classification task
    random_state=42                             # For reproducibility
)

### 8.2.1 - LIME Explanations for Individual Samples

Explaining the same sample (index 50) used for SHAP so the two methods can be compared.

In [None]:
# Reuse explainer_lime from Section 8.2 (created with random_state=42 for reproducibility)
sample_index = 50

# Generate LIME explanation
explanation = explainer_lime.explain_instance(
    data_row=X_test.values[sample_index],          # same row index as SHAP
    predict_fn=rf_model.predict_proba,             # use Random Forest probabilities
    num_features=10,                               # show top 10 contributing features
    top_labels=1                                   # explain the predicted class rather than the default label 1
)

explanation.show_in_notebook(show_table=True, show_all=False)


### 8.2.2 - SHAP vs LIME Comparison

**Key Differences:**

| Aspect | SHAP | LIME |
|--------|------|------|
| **Method** | Game theory (Shapley values) | Local linear approximation |
| **Speed** | Often slower (exact calculations; cost grows with the number of trees and background size) | Often faster (sampling-based) |
| **Consistency** | Guaranteed consistency | May vary between runs |
| **Global view** | Yes (by aggregating local values) | No (only local) |



---

# Section 9: Inference Function

**Objective:** Create a function that accepts raw input and returns the model prediction in natural language.
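
One detail to watch when encoding raw input is that a single row contains only one category per categorical column. The sketch below shows what happens with `drop_first=True` in that case, and one way to align the encoded row with the training columns. The column list `train_columns` is a hypothetical example, not the notebook's actual feature set.

```python
import pandas as pd

# Hypothetical training-time columns (as produced with drop_first=True on the full data)
train_columns = [
    "review_score_staff",
    "user_gender_Male", "user_gender_Other",
    "user_traveller_type_Couple", "user_traveller_type_Family", "user_traveller_type_Solo",
]

row = pd.DataFrame({
    "review_score_staff": [9.0],
    "user_gender": ["Male"],
    "user_traveller_type": ["Solo"],
})

# Pitfall: on a single row, drop_first=True drops the only category present
lossy = pd.get_dummies(row, columns=["user_gender", "user_traveller_type"], drop_first=True)

# Safer: encode without dropping, then align to the training columns
aligned = (
    pd.get_dummies(row, columns=["user_gender", "user_traveller_type"])
    .reindex(columns=train_columns, fill_value=0)
    .astype(float)
)

print(list(lossy.columns))
print(aligned.iloc[0].to_dict())
```

Reindexing against the training columns also discards the baseline category that was dropped during training, so the encoded row matches whatever the model was fitted on.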

## 9.1 - Define Inference Function

In [None]:

def predict_country_group(
    # User demographics
    user_gender,           # 'Male', 'Female', or 'Other'
    user_age_group,        # '18-24', '25-34', '35-44', '45-54', '55+'
    user_traveller_type,   # 'Business', 'Couple', 'Family', 'Solo'

    # Review scores (0-10 scale)
    review_score_cleanliness,
    review_score_comfort,
    review_score_facilities,
    review_score_location,
    review_score_staff,
    review_score_value_for_money,

    # Hotel baseline scores (0-10 scale)
    hotel_cleanliness_base,
    hotel_comfort_base,
    hotel_facilities_base,
    hotel_location_base,
    hotel_staff_base,
    hotel_value_for_money_base
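):
    """Predict the hotel country group from raw user, review, and hotel baseline inputs.

    Relies on the notebook-level rf_model, label_encoder, and X_train.
    Returns a dict with the predicted group, confidence (%), top 3 classes, and a text explanation.
    """

    # Step 1: Create input dataframe with raw features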
    input_data = pd.DataFrame({
        'user_gender': [user_gender],
        'user_age_group': [user_age_group],
        'user_traveller_type': [user_traveller_type],
        'review_score_cleanliness': [review_score_cleanliness],
        'review_score_comfort': [review_score_comfort],
        'review_score_facilities': [review_score_facilities],
        'review_score_location': [review_score_location],
        'review_score_staff': [review_score_staff],
        'review_score_value_for_money': [review_score_value_for_money]
    })

    # Step 2: Feature Engineering - Calculate deviation features
    input_data['deviation_cleanliness'] = review_score_cleanliness - hotel_cleanliness_base
    input_data['deviation_comfort'] = review_score_comfort - hotel_comfort_base
    input_data['deviation_facilities'] = review_score_facilities - hotel_facilities_base
    input_data['deviation_location'] = review_score_location - hotel_location_base
    input_data['deviation_staff'] = review_score_staff - hotel_staff_base
    input_data['deviation_value_for_money'] = review_score_value_for_money - hotel_value_for_money_base

    # Step 3: One-hot encoding for categorical features
    # Note: drop_first=True must not be used here. With a single row, each
    # categorical column has only one category, so drop_first would discard it
    # and every dummy column would end up as 0.
    input_encoded = pd.get_dummies(
        input_data,
        columns=['user_gender', 'user_age_group', 'user_traveller_type']
    )

    # Step 4: Align with the training columns (adds missing dummies as 0,
    # drops any column unseen in training, and matches the column order)
    input_encoded = input_encoded.reindex(columns=X_train.columns, fill_value=0)

    # Step 5: Make prediction
    prediction = rf_model.predict(input_encoded)[0]
    prediction_proba = rf_model.predict_proba(input_encoded)[0]

    # Step 6: Get prediction details
    predicted_group = label_encoder.classes_[prediction]
    confidence = prediction_proba[prediction] * 100

    # Get top 3 predictions
    top_3_indices = np.argsort(prediction_proba)[-3:][::-1]
    top_3_predictions = [
        {
            'country_group': label_encoder.classes_[idx],
            'probability': prediction_proba[idx] * 100
        }
        for idx in top_3_indices
    ]

    # Step 7: Create country group to countries mapping for explanation
    country_group_map = {
        'North_America': 'North America (United States, Canada)',
        'Western_Europe': 'Western Europe (Germany, France, UK, Netherlands, Spain, Italy)',
        'Eastern_Europe': 'Eastern Europe (Russia)',
        'East_Asia': 'East Asia (China, Japan, South Korea)',
        'Southeast_Asia': 'Southeast Asia (Thailand, Singapore)',
        'Middle_East': 'Middle East (United Arab Emirates, Turkey)',
        'Africa': 'Africa (Egypt, Nigeria, South Africa)',
        'Oceania': 'Oceania (Australia, New Zealand)',
        'South_America': 'South America (Brazil, Argentina)',
        'South_Asia': 'South Asia (India)',
        'North_America_Mexico': 'Mexico'
    }

    # Step 8: Generate explanation
    explanation = f"Based on the user profile ({user_gender}, {user_age_group}, {user_traveller_type}) "
    explanation += f"and review scores, this hotel is most likely located in "
    explanation += f"{country_group_map.get(predicted_group, predicted_group)}."

    # Return comprehensive result
    return {
        'predicted_group': predicted_group,
        'predicted_region': country_group_map.get(predicted_group, predicted_group),
        'confidence': round(confidence, 2),
        'top_3_predictions': top_3_predictions,
        'explanation': explanation
    }

## 9.2 - Test Inference Function

Testing the inference function with different user profiles and scenarios.

In [None]:
# Scenario 1: Young couple on vacation with high ratings
print("Scenario 1: Young Couple - High Satisfaction")

result1 = predict_country_group(
    # User demographics
    user_gender='Female',
    user_age_group='25-34',
    user_traveller_type='Couple',

    # Review scores (high ratings)
    review_score_cleanliness=9.5,
    review_score_comfort=9.0,
    review_score_facilities=8.5,
    review_score_location=9.5,
    review_score_staff=9.0,
    review_score_value_for_money=8.0,

    # Hotel baseline scores (average)
    hotel_cleanliness_base=8.0,
    hotel_comfort_base=8.0,
    hotel_facilities_base=7.5,
    hotel_location_base=8.5,
    hotel_staff_base=8.0,
    hotel_value_for_money_base=7.5
)

print(f"\nPrediction: {result1['predicted_group']}")
print(f"Region: {result1['predicted_region']}")
print(f"Confidence: {result1['confidence']}%")
print(f"\n{result1['explanation']}")
print(f"\nTop 3 Predictions:")
for i, pred in enumerate(result1['top_3_predictions'], 1):
    print(f"  {i}. {pred['country_group']}: {pred['probability']:.2f}%")

In [None]:
# Scenario 2: Business traveller with mixed reviews
print("Scenario 2: Business Traveller - Mixed Experience")

result2 = predict_country_group(
    # User demographics
    user_gender='Male',
    user_age_group='35-44',
    user_traveller_type='Business',

    # Review scores (mixed - some below baseline)
    review_score_cleanliness=7.0,
    review_score_comfort=8.0,
    review_score_facilities=6.5,
    review_score_location=9.0,
    review_score_staff=7.5,
    review_score_value_for_money=6.0,

    # Hotel baseline scores
    hotel_cleanliness_base=8.5,
    hotel_comfort_base=8.0,
    hotel_facilities_base=7.5,
    hotel_location_base=8.0,
    hotel_staff_base=8.5,
    hotel_value_for_money_base=7.0
)

print(f"\nPrediction: {result2['predicted_group']}")
print(f"Region: {result2['predicted_region']}")
print(f"Confidence: {result2['confidence']}%")
print(f"\n{result2['explanation']}")
print(f"\nTop 3 Predictions:")
for i, pred in enumerate(result2['top_3_predictions'], 1):
    print(f"  {i}. {pred['country_group']}: {pred['probability']:.2f}%")

In [None]:
# Scenario 3: Family vacation with budget concerns
print("Scenario 3: Family Travellers - Budget-Conscious")

result3 = predict_country_group(
    # User demographics
    user_gender='Female',
    user_age_group='45-54',
    user_traveller_type='Family',

    # Review scores (good value for money emphasis)
    review_score_cleanliness=8.0,
    review_score_comfort=7.5,
    review_score_facilities=8.0,
    review_score_location=7.0,
    review_score_staff=9.0,
    review_score_value_for_money=9.5,

    # Hotel baseline scores
    hotel_cleanliness_base=7.5,
    hotel_comfort_base=7.0,
    hotel_facilities_base=7.5,
    hotel_location_base=7.5,
    hotel_staff_base=8.0,
    hotel_value_for_money_base=8.0
)

print(f"\nPrediction: {result3['predicted_group']}")
print(f"Region: {result3['predicted_region']}")
print(f"Confidence: {result3['confidence']}%")
print(f"\n{result3['explanation']}")
print(f"\nTop 3 Predictions:")
for i, pred in enumerate(result3['top_3_predictions'], 1):
    print(f"  {i}. {pred['country_group']}: {pred['probability']:.2f}%")

***See you next milestone!***