<a href="https://colab.research.google.com/github/MohamedElquesni/ACL-International-Hotel-Booking-Analytics-/blob/Nadine/Milestone%201/Milestone%201.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Milestone 1: International Hotel Booking Analytics
## Predicting Hotel Country Groups using Machine Learning (ML)

**Team Number:** [90]

---

## Project Overview

**Goal:** Build a multi-class classification model to predict the country group of hotels based on user demographics, hotel characteristics, and review scores.

**Dataset:**
- 50,000 reviews across 25 hotels in 25 countries
- 2,000 unique users with demographic information
- 11 target country groups

**Deliverables:**
1. A cleaned dataset after the feature engineering step.
2. Data engineering insights, including:
   - The best city for each traveller type.
   - The top three countries with the best value-for-money scores per traveller age group.

3. A trained classification model (statistical ML or shallow FFNN).
4. Model interpretation and explainability through XAI techniques (SHAP and LIME).
5. An inference function.

---

# Section 1: Data Cleaning

## 1.1 - Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

## 1.2 - Loading and Assessing Datasets

### Hotels Dataset

#### Loading the Hotels Dataset


In [None]:
df_hotels = pd.read_csv('../Dataset [Original]/hotels.csv')
df_hotels.head()

#### Renaming Hotel Columns --> `hotel_` + Original Column Name

In [None]:
df_hotels.columns = [
    col if col == 'hotel_id' or col == 'hotel_name' else 'hotel_' + col
    for col in df_hotels.columns
]


#### Checking for Null Values

In [None]:
df_hotels.isnull().sum()

#### Checking for Duplicated Values

In [None]:
df_hotels.duplicated().sum()

### Reviews Dataset

In [None]:
df_reviews = pd.read_csv('../Dataset [Original]/reviews.csv')
df_reviews.head()

#### Renaming Reviews Columns --> `review_` + Original Column Name

In [None]:
df_reviews.columns = [
    col if col == 'review_id'
           or col == 'user_id'
           or col == 'hotel_id'
           or col == 'review_date'
           or col == 'review_text'
        else 'review_' + col
    for col in df_reviews.columns
]


#### Checking for Null Values

In [None]:
df_reviews.isnull().sum()

#### Checking for Duplicated Values

In [None]:
df_reviews.duplicated().sum()

### Users Dataset

In [None]:
df_users = pd.read_csv('../Dataset [Original]/users.csv')
df_users.head()  # Show the first 5 rows

#### Renaming Users Columns --> `user_` + Original Column Name

In [None]:
df_users.columns = [
    col if col == 'user_id'
        or col == 'user_gender'
        else 'user_' + col
    for col in df_users.columns
]
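
The three renaming cells follow the same pattern: prefix every column except a short list of columns that are left unchanged. A minimal sketch of a reusable helper is shown below; the name `add_prefix_except` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def add_prefix_except(df, prefix, keep):
    """Prefix every column name with `prefix`, except those listed in `keep`."""
    return df.rename(columns=lambda col: col if col in keep else prefix + col)

# Toy stand-in for df_users (illustrative values only)
toy_users = pd.DataFrame({"user_id": [1], "user_gender": ["Male"], "age_group": ["25-34"]})

renamed = add_prefix_except(toy_users, "user_", keep={"user_id", "user_gender"})
print(renamed.columns.tolist())
```

Because `rename` returns a new frame, the original table stays untouched until the result is assigned back.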


#### Checking for Null Values

In [None]:
df_users.isnull().sum()

#### Checking for Duplicated Values

In [None]:
df_users.duplicated().sum()
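
The null and duplicate checks above are repeated for each table. As a sketch, they could be collected into one summary; the helper name `quality_report` and the toy frames below are assumptions for illustration only.

```python
import pandas as pd

def quality_report(frames):
    """Return one row per table with its row count, null-cell count, and duplicate-row count."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "table": name,
            "rows": len(df),
            "null_cells": int(df.isnull().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("table")

# Toy stand-ins for df_hotels / df_users (illustrative values only)
toy_hotels = pd.DataFrame({"hotel_id": [1, 2, 2], "hotel_city": ["Dubai", "Cairo", "Cairo"]})
toy_users = pd.DataFrame({"user_id": [10, 11], "user_gender": ["Female", None]})

report = quality_report({"hotels": toy_hotels, "users": toy_users})
print(report)
```

In the notebook, the same call would take `df_hotels`, `df_reviews`, and `df_users`.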

## 1.3 - Merging Datasets

In [None]:
# Merge reviews with users on user_id
df_merged = pd.merge(df_reviews, df_users, on='user_id', how='left')

# Merge the result with hotels on hotel_id
df_merged = pd.merge(df_merged, df_hotels, on='hotel_id', how='left')

df_merged.head()

In [None]:
df_merged.info()
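
A left merge silently keeps reviews whose `user_id` or `hotel_id` has no match and can multiply rows if a key is duplicated on the right side. The sketch below, on toy frames, shows how the `validate` and `indicator` parameters of `pd.merge` can surface both problems; the data values are made up.

```python
import pandas as pd

toy_reviews = pd.DataFrame({"review_id": [1, 2, 3], "user_id": [10, 10, 99], "hotel_id": [1, 2, 1]})
toy_users = pd.DataFrame({"user_id": [10, 11], "user_age_group": ["25-34", "35-44"]})

# validate="many_to_one" raises MergeError if user_id is not unique in toy_users;
# indicator=True adds a _merge column showing where each row found a match.
checked = pd.merge(toy_reviews, toy_users, on="user_id", how="left",
                   validate="many_to_one", indicator=True)

unmatched = checked[checked["_merge"] == "left_only"]
print(f"Rows before: {len(toy_reviews)}, after: {len(checked)}, unmatched: {len(unmatched)}")
```

If the row count is unchanged and `unmatched` is empty, the merge neither duplicated nor orphaned any review.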

## 1.4 - Dropping Unnecessary Columns

In [None]:
# Drop unnecessary columns that do not contribute to the predictive modeling task
# These columns are either textual, identifiers or dates that add no generalizable value

df_merged.drop(
    columns=[
        'review_date',
        'review_text',
        'user_join_date',
        'hotel_name'
    ],
    inplace=True
)

df_merged.info()


---

## 1.5 - Cleaned Dataset Summary

All datasets have been successfully loaded, cleaned, and merged:
- No null values found.
- No duplicate records.
- Columns renamed with prefixes for clarity.
- Unnecessary columns were dropped.
- Final merged dataset contains 50,000 reviews with complete hotel characteristics and user information.


# Section 2: Data Engineering Questions

Using the cleaned and merged dataset, we analyze and visualize the following:

1. **Best City for Each Traveller Type**
   - Identify the city with the highest average review score for each traveller type.

2. **Top 3 Countries by Value-for-Money per Age Group**
   - Find the top 3 countries with the highest value-for-money score per traveller age group.

## 2.1 - Best City for Each Traveller Type

### Heatmap Analysis

In [None]:
pivot = pd.pivot_table(
    df_merged,
    index="hotel_city",
    columns="user_traveller_type",
    values="review_score_overall",
    aggfunc="mean"
)

plt.figure(figsize=(12, 6))
sns.heatmap(pivot, annot=True, cmap="copper", fmt=".2f", linewidths=0.5)

# Plot formatting
plt.title("Average Review Score per City and Traveller Type", fontsize=14)
plt.xlabel("Traveller Type")
plt.ylabel("City")
plt.show()


### Insights

Using the heatmap visualization, we can observe clear differences in average review scores across traveller types and cities:

- **Business travellers:** Dubai achieved the highest average score of **8.97**.
- **Couples:** Amsterdam recorded the highest average score of **9.10**.
- **Families:** Dubai again stood out with an average score of **9.21**.
- **Solo travellers:** Amsterdam had the highest average score of **9.11**.
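
The best city per traveller type can also be read off the pivot table programmatically instead of from the heatmap. The following is a minimal sketch on toy data; the cities, traveller types, and scores are illustrative and are not the notebook's results.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel_city": ["Dubai", "Dubai", "Amsterdam", "Amsterdam", "Cairo", "Cairo"],
    "user_traveller_type": ["Business", "Couple", "Business", "Couple", "Business", "Couple"],
    "review_score_overall": [9.0, 8.5, 8.7, 9.1, 8.0, 8.2],
})

pivot = toy.pivot_table(index="hotel_city", columns="user_traveller_type",
                        values="review_score_overall", aggfunc="mean")

# idxmax() returns the row label (city) of the highest mean in each column (traveller type)
best = pd.DataFrame({"best_city": pivot.idxmax(), "mean_score": pivot.max()})
print(best)
```

Applied to the notebook's `pivot`, the same two calls would give one city and one score per traveller type.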

## 2.2 - Top 3 Countries by Value-for-Money per Age Group

### Heatmap Analysis

In [None]:
pivot = pd.pivot_table(
    df_merged,
    index="hotel_country",
    columns="user_age_group",
    values="review_score_value_for_money",
    aggfunc="mean"
)

plt.figure(figsize=(14, 6))

sns.heatmap(
    pivot,
    annot=True,
    cmap="copper",
    fmt=".2f",
    linewidths=0.5
)

# Plot formatting
plt.title("Average Value-for-Money Score per Country and Age Group", fontsize=14)
plt.xlabel("User Age Group")
plt.ylabel("Hotel Country")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


### Insights

Top 3 countries by value-for-money score for each age group:

- **18–24:** China (8.71), Netherlands (8.70), Canada (8.66)
- **25–34:** China (8.73), Netherlands (8.68), Spain (8.63)
- **35–44:** China (8.70), Netherlands (8.69), New Zealand (8.65)
- **45–54:** China (8.72), New Zealand (8.67), Netherlands (8.65)
- **55+:** Netherlands (8.70), New Zealand (8.63), China (8.60)
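
A ranking like the one above can be computed directly instead of being read from the heatmap. The sketch below uses toy values (not the notebook's data) and keeps the three highest country means per age group.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel_country": ["China", "Netherlands", "Canada", "Spain"] * 2,
    "user_age_group": ["18-24"] * 4 + ["25-34"] * 4,
    "review_score_value_for_money": [8.7, 8.6, 8.5, 8.1, 8.2, 8.8, 8.0, 8.6],
})

# Mean value-for-money score per (age group, country)
means = (toy.groupby(["user_age_group", "hotel_country"])["review_score_value_for_money"]
            .mean()
            .reset_index())

# Sort within each age group by descending score, then keep the first three rows per group
top3 = (means.sort_values(["user_age_group", "review_score_value_for_money"],
                          ascending=[True, False])
             .groupby("user_age_group")
             .head(3))
print(top3)
```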

---

# Section 3: Exploratory Data Analysis (EDA)

**Objective:** Understand data distributions, correlations, and patterns to inform feature engineering and modeling decisions.

---

## 3.1 - Target Variable Analysis

### Create Target Variable

We group the 25 countries into 11 geographic regions (country groups) to create our classification target.

In [None]:
# Create a mapping dictionary from country names to geographic regions (country groups)
# This groups the 25 countries into 11 regions (country groups) for classification
country_to_group = {
    # North America
    'United States': 'North_America',
    'Canada': 'North_America',

    # Western Europe
    'Germany': 'Western_Europe',
    'France': 'Western_Europe',
    'United Kingdom': 'Western_Europe',
    'Netherlands': 'Western_Europe',
    'Spain': 'Western_Europe',
    'Italy': 'Western_Europe',

    # Eastern Europe
    'Russia': 'Eastern_Europe',

    # East Asia
    'China': 'East_Asia',
    'Japan': 'East_Asia',
    'South Korea': 'East_Asia',

    # Southeast Asia
    'Thailand': 'Southeast_Asia',
    'Singapore': 'Southeast_Asia',

    # Middle East
    'United Arab Emirates': 'Middle_East',
    'Turkey': 'Middle_East',

    # Africa
    'Egypt': 'Africa',
    'Nigeria': 'Africa',
    'South Africa': 'Africa',

    # Oceania
    'Australia': 'Oceania',
    'New Zealand': 'Oceania',

    # South America
    'Brazil': 'South_America',
    'Argentina': 'South_America',

    # South Asia
    'India': 'South_Asia',

    # North America (Mexico separate due to different characteristics)
    'Mexico': 'North_America_Mexico'
}

# Apply the mapping to create our target variable
df_merged['country_group'] = df_merged['hotel_country'].map(country_to_group)

df_merged.head()


In [None]:
# Count how many reviews belong to each country group
print("Distribution of reviews across country groups:")
print(df_merged['country_group'].value_counts().sort_index())
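
Because `Series.map` returns `NaN` for any key missing from the dictionary, a misspelled or unexpected country name would silently produce a missing target. A small check, sketched here on a toy mapping with made-up values, can list such countries.

```python
import pandas as pd

toy_map = {"Egypt": "Africa", "Japan": "East_Asia"}
toy = pd.DataFrame({"hotel_country": ["Egypt", "Japan", "Peru"]})
toy["country_group"] = toy["hotel_country"].map(toy_map)

# Collect the country names that did not match any dictionary key
unmapped = sorted(toy.loc[toy["country_group"].isna(), "hotel_country"].unique())
print("Unmapped countries:", unmapped)
```

An empty list means every country was assigned a group.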

### Visualize Target Distribution

Now let's visualize the distribution to better understand class imbalance.

In [None]:
# Get the distribution sorted by count (descending)
target_dist = df_merged['country_group'].value_counts().sort_values(ascending=False)

# Create a figure with 2 subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# First plot: Bar chart showing counts
target_dist.plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of Reviews by Country Group', fontsize=14)
axes[0].set_xlabel('Country Group')
axes[0].set_ylabel('Number of Reviews')
axes[0].tick_params(axis='x', rotation=45)

# Second plot: Pie chart showing percentages
axes[1].pie(target_dist, labels=target_dist.index, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Percentage Distribution by Country Group', fontsize=14)

plt.tight_layout()
plt.show()

# Calculate and display class imbalance statistics
largest_class = target_dist.idxmax()
smallest_class = target_dist.idxmin()
imbalance_ratio = target_dist.max() / target_dist.min()

print("\nClass Imbalance Analysis:")
print(f"Largest class: {largest_class} with {target_dist.max()} samples")
print(f"Smallest class: {smallest_class} with {target_dist.min()} samples")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
print(f"\nThis means the largest class has {imbalance_ratio:.2f}x as many samples as the smallest class.")

---

## 3.2 - Numerical Features Analysis

Let's analyze the distribution and statistics of numerical features to understand their behavior.

In [None]:
# Identify all numerical columns in the dataset
numerical_cols = df_merged.select_dtypes(include=[np.number]).columns.tolist()

print(f"Total numerical features: {len(numerical_cols)}\n")
print("List of numerical features:")
for i, col in enumerate(numerical_cols, 1):
    print(f"  {i}. {col}")

print("\n")
print("Statistical Summary of Numerical Features:")
print("\n")
df_merged[numerical_cols].describe()

In [None]:
# Visualize the distribution of numerical features using histograms
# This helps us understand if features are normally distributed, skewed, etc.

fig, axes = plt.subplots(4, 4, figsize=(18, 16))
axes = axes.ravel()  # Flatten the 2D array to 1D

# Plot a histogram for each numerical column, skipping the ID columns
id_cols = ['review_id', 'user_id', 'hotel_id']
plot_cols = [col for col in numerical_cols if col not in id_cols][:len(axes)]
for ax, col in zip(axes, plot_cols):
    ax.hist(df_merged[col], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    ax.set_title(col, fontsize=10, fontweight='bold')
    ax.set_xlabel('Value', fontsize=9)
    ax.set_ylabel('Frequency', fontsize=9)
    ax.grid(axis='y', alpha=0.3)   # alpha = transparency

# Hide any axes left unused in the 4x4 grid
for ax in axes[len(plot_cols):]:
    ax.set_visible(False)

plt.suptitle('Distribution of Numerical Features', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# Use box plots to detect outliers in numerical features
# Box plots show: median (center line), quartiles (box edges), and outliers (dots beyond whiskers)

fig, axes = plt.subplots(4, 4, figsize=(18, 16))
axes = axes.ravel()

# Create a box plot for each numerical column, skipping the ID columns
id_cols = ['review_id', 'user_id', 'hotel_id']
plot_cols = [col for col in numerical_cols if col not in id_cols][:len(axes)]
for ax, col in zip(axes, plot_cols):
    ax.boxplot(df_merged[col], vert=True)
    ax.set_title(col, fontsize=10, fontweight='bold')
    ax.set_ylabel('Value', fontsize=9)
    ax.grid(axis='y', alpha=0.3)

# Hide any axes left unused in the 4x4 grid
for ax in axes[len(plot_cols):]:
    ax.set_visible(False)

plt.suptitle('Box Plots for Outlier Detection', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n Insights:")
print("- The box represents the interquartile range (IQR): 25th to 75th percentile")
print("- The line inside the box is the median (50th percentile)")
print("- Whiskers extend to 1.5 * IQR from the box")
print("- Points beyond whiskers are potential outliers")
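
The whisker rule described above can also be applied numerically to count potential outliers per feature. This is a minimal sketch; the helper name `count_iqr_outliers` and the toy scores are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the box-plot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy scores with one clearly low value (illustrative only)
toy_scores = pd.Series([8.0, 8.5, 8.6, 8.7, 8.8, 9.0, 9.1, 2.0])
n_outliers = count_iqr_outliers(toy_scores)
print(n_outliers)
```

In the notebook, `df_merged[numerical_cols].apply(count_iqr_outliers)` would give one count per numerical column.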

## 3.3 - Correlation Analysis

Correlation analysis helps us identify relationships between numerical features. High correlation can indicate:
- Redundant features (multicollinearity)
- Features that move together
- Potential feature combinations

In [None]:
# Calculate correlation matrix (Pearson correlation coefficient)
# Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation)

numerical_cols = [
    col for col in numerical_cols
    if col not in ['review_id', 'user_id', 'hotel_id']
]

correlation_matrix = df_merged[numerical_cols].corr()

# Visualize using a heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(
    correlation_matrix,
    annot=True,
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=0.5,
    cbar_kws={'label': 'Correlation Coefficient'}
)
plt.title('Correlation Matrix Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Find and print highly correlated pairs (correlation > 0.8 or < -0.8)
print("\n")
print("Highly Correlated Feature Pairs (|correlation| > 0.8):")
print("\n")
print(f"{'Feature 1':<35} {'Feature 2':<35} {'Correlation':>10}")
print("-"*80)

highly_correlated_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.8:
            feat1 = correlation_matrix.columns[i]
            feat2 = correlation_matrix.columns[j]
            print(f"{feat1:<35} {feat2:<35} {corr_value:>10.3f}")
            highly_correlated_pairs.append((feat1, feat2, corr_value))
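
The nested loop above visits each pair of columns once. An equivalent vectorised approach, sketched here on synthetic data, masks the correlation matrix to its strict upper triangle and filters by absolute value; the column names and random data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "score_a": base,
    "score_b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of score_a
    "score_c": rng.normal(size=200),                    # independent noise
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# NaN entries (masked cells) never satisfy the comparison, so they are filtered out here
strong = pairs[pairs.abs() > 0.8].sort_values(key=np.abs, ascending=False)
print(strong)
```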


## 3.4 - Categorical Features Analysis

Let's examine the distribution of categorical features to understand user demographics and traveller characteristics.

In [None]:
# Define categorical features to analyze
categorical_features = ['user_gender', 'user_age_group', 'user_traveller_type','user_country']

print("CATEGORICAL FEATURES DISTRIBUTION")

for feat in categorical_features:
    print(f"\n{feat.upper().replace('_', ' ')}:")
    print("-" * 40)
    counts = df_merged[feat].value_counts()

    # Display counts and percentages
    for value, count in counts.items():
        percentage = (count / len(df_merged)) * 100
        print(f"  {value:<30} {count:>6} ({percentage:>5.2f}%)")

    print(f"\n  Total unique values: {df_merged[feat].nunique()}")
    print("="*80)

In [None]:
# Visualize categorical features using bar charts
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
axes = axes.ravel()

# Create a bar chart for each categorical feature
for idx, col in enumerate(categorical_features):
    # Get value counts and plot
    value_counts = df_merged[col].value_counts()
    value_counts.plot(kind='bar', ax=axes[idx], color='teal', alpha=0.7, edgecolor='black')

    # Formatting
    axes[idx].set_title(f'{col.replace("_", " ").title()} Distribution',
                        fontsize=13, fontweight='bold')
    axes[idx].set_xlabel(col.replace('_', ' ').title(), fontsize=11)
    axes[idx].set_ylabel('Count', fontsize=11)
    axes[idx].tick_params(axis='x', rotation=45)
    axes[idx].grid(axis='y', alpha=0.3)

    # Add value labels on top of bars
    for container in axes[idx].containers:
        axes[idx].bar_label(container, fmt='%d', padding=3, fontsize=9)

plt.suptitle('Distribution of Categorical Features', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 3.5 - EDA Summary & Key Insights

Based on the exploratory data analysis, here are the key findings:

### Target Variable (country_group)
**Class imbalance**: The dataset shows a moderate class imbalance, with a 6:1 ratio between the largest and smallest country groups.
- *Largest class*: Western Europe (most reviews).
- *Smallest class*: Eastern Europe (fewest reviews).

This imbalance indicates the need for *stratified sampling* during the train/test split to ensure all regions are properly represented in the model.

### Data Leakage
The features *hotel_city*, *hotel_country*, and *all hotel baseline scores* would allow the model to “cheat”: because there are only 25 unique hotels, each of these values identifies the hotel and therefore its country group, so the model would infer the target from already-known information rather than learning genuine patterns.

### Numerical Features
**Scale**: Most numerical features (review and baseline scores) are on a 0–10 scale.

**Distribution**: They are roughly normally distributed, with values concentrated toward the upper end of the scale (most hotels receive fairly high ratings).

**Outliers**: Only a few mild outliers were detected in the box plots.

**Variance**: Features show reasonable variance, which is helpful for modeling.

### Feature Correlations
**Review scores**: Highly correlated with each other (users who rate one aspect highly tend to rate others highly too).

**Hotel baseline scores**: Show moderate correlation.

### Categorical Features

**User Gender:**
- Fairly balanced distribution across Male/Female/Other.
- No major gender bias in the dataset.

**User Age Group:**
- Most users fall in the 25-44 age range.
- This range represents the primary demographic for hotel reviews.

**Traveller Type:**
- *Couples* and *Families* are the most common, making up about 60% of reviews.
- *Business* and *Solo* travellers are less common but still well represented.



---

# Section 4: Feature Engineering

**Objective:** Create deviation features to capture how individual user experiences differ from hotel baselines.

**Approach:** We will use user demographics and deviation features to predict country groups.

## 4.1 - Deviation Features

**Justification:** Deviation features capture how individual user experiences differ from hotel baselines.

**Formula:** deviation = individual_review_score - hotel_baseline_score

This indicates:
- Whether the user's experience was better or worse than the hotel's average
- How user satisfaction compares to typical hotel performance
- Individual user preferences relative to hotel standards
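
As a small worked instance of the formula, the sketch below computes deviations for two aspects on a toy frame. The column names follow the notebook's naming pattern, while the values are made up for illustration.

```python
import pandas as pd

aspects = ["cleanliness", "comfort"]
toy = pd.DataFrame({
    "review_score_cleanliness": [9.0, 7.5],
    "hotel_cleanliness_base": [8.5, 8.5],
    "review_score_comfort": [8.0, 9.2],
    "hotel_comfort_base": [8.8, 8.8],
})

# deviation = individual_review_score - hotel_baseline_score, one column per aspect
for aspect in aspects:
    toy[f"deviation_{aspect}"] = toy[f"review_score_{aspect}"] - toy[f"hotel_{aspect}_base"]

print(toy[[f"deviation_{a}" for a in aspects]])

# Sanity check: review score minus deviation gives back the hotel baseline
recovered = toy["review_score_cleanliness"] - toy["deviation_cleanliness"]
print(recovered.tolist())
```

The sanity check shows that a review score and its deviation together determine the hotel baseline. It may therefore be worth checking whether keeping both in the feature set reintroduces the baseline information discussed under data leakage.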

### Computing the Deviation

In [None]:
# Formula: deviation = individual_review_score - hotel_baseline_score

df_merged['deviation_cleanliness'] = (
    df_merged['review_score_cleanliness'] - df_merged['hotel_cleanliness_base']
)

df_merged['deviation_comfort'] = (
    df_merged['review_score_comfort'] - df_merged['hotel_comfort_base']
)

df_merged['deviation_facilities'] = (
    df_merged['review_score_facilities'] - df_merged['hotel_facilities_base']
)

df_merged['deviation_location'] = (
    df_merged['review_score_location'] - df_merged['hotel_location_base']
)

df_merged['deviation_staff'] = (
    df_merged['review_score_staff'] - df_merged['hotel_staff_base']
)

df_merged['deviation_value_for_money'] = (
    df_merged['review_score_value_for_money'] - df_merged['hotel_value_for_money_base']
)

deviation_cols = [col for col in df_merged.columns if col.startswith('deviation_')]
print(df_merged[deviation_cols].head())

### Analysis

In [None]:
df_merged[deviation_cols].describe()

---

# Section 5: Data Preprocessing

**Objective:** Prepare the data for our machine learning models through feature selection, encoding, and splitting.

**Selected Features:**
- Categorical: *user_gender*, *user_age_group*, *user_traveller_type* (user demographics)
- Numerical: review score features and the newly engineered *deviation* features
- Target: *country_group*

## 5.0 - Importing Libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## 5.1 - Feature Selection

In [None]:
df_processed = df_merged.copy()

selected_columns = [
    # Categorical features
    'user_gender', 'user_age_group', 'user_traveller_type',

    # Review scores (excluding overall)
    'review_score_cleanliness', 'review_score_comfort', 
    'review_score_facilities', 'review_score_location',
    'review_score_staff', 'review_score_value_for_money',

    # Deviation features
    'deviation_cleanliness', 'deviation_comfort',
    'deviation_facilities', 'deviation_location',
    'deviation_staff', 'deviation_value_for_money',
    
    # Target
    'country_group'
]

df_processed = df_processed[selected_columns]

## 5.2 - Encoding

**Objective**: Converting categorical features into numerical form.

*One-hot encoding:* used for unordered (nominal) features.

In [None]:
df_processed = pd.get_dummies(
    df_processed,
    columns=['user_gender', 'user_age_group', 'user_traveller_type'],
    drop_first=True,
    dtype=int  # emit 0/1 integers; recent pandas versions default to booleans
)

print("After encoding:")
print(f"\nColumn names:")
print(df_processed.columns.tolist())

## 5.3 - Split Features and Target

In [None]:
X = df_processed.drop('country_group', axis=1)
y = df_processed['country_group']

## 5.4 - Train-Validation-Test Split

Split the data into 70% training, 15% validation, and 15% test sets using stratified sampling.

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# split 30% into 15% val and 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print("Dataset Split:")
print(f"Train: {X_train.shape}")
print(f"Val:   {X_val.shape}")
print(f"Test:  {X_test.shape}")
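
To confirm that stratification preserves the class proportions in every subset, the proportions can be compared side by side. The sketch below repeats the same two-step split on a synthetic, imbalanced target; the toy classes `A`, `B`, and `C` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 60% / 30% / 10% across three classes
y_toy = pd.Series(["A"] * 600 + ["B"] * 300 + ["C"] * 100)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Class share in each subset; the columns should be close to identical
proportions = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "val": y_va.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).round(3)
print(proportions)
```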

## 5.5 - Encode Target Variable

Convert country_group labels to numerical format using LabelEncoder.

In [None]:
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
y_test_encoded = label_encoder.transform(y_test)

print("Class Mapping:")
for i, class_name in enumerate(label_encoder.classes_):
    print(f"{i}: {class_name}")

---

# Section 6: Model Development

**Objective:** Train and compare multiple classification models to predict country groups.

We will train 4 different models:
1. Logistic Regression (baseline linear model)
2. Random Forest (ensemble tree-based)
3. XGBoost (gradient boosted trees)
4. Shallow Neural Network (feedforward network)

## 6.1 - Import Required Libraries

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import tensorflow as tf
from tensorflow import keras

print("Libraries imported successfully!")

## 6.2 - Model 1: Logistic Regression

A linear model that serves as our baseline. With scikit-learn's default `lbfgs` solver, it fits a single multinomial (softmax) model across all classes rather than separate one-vs-rest classifiers.

In [None]:
print("Training Logistic Regression...")

# Initialize with class_weight to handle imbalance
lr_model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

# Train on training set
lr_model.fit(X_train, y_train)

# Predict on validation set
y_val_pred_lr = lr_model.predict(X_val)

# Calculate accuracy
lr_accuracy = accuracy_score(y_val, y_val_pred_lr)

print(f"Logistic Regression Validation Accuracy: {lr_accuracy:.4f} ({lr_accuracy*100:.2f}%)")
print("Model trained successfully!")

## 6.3 - Model 2: Random Forest

An ensemble of decision trees that votes on the final prediction. Good at capturing non-linear patterns.

In [None]:
print("Training Random Forest...")

# Initialize with balanced class weights
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=2,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

# Train on training set
rf_model.fit(X_train, y_train)

# Predict on validation set
y_val_pred_rf = rf_model.predict(X_val)

# Calculate accuracy
rf_accuracy = accuracy_score(y_val, y_val_pred_rf)

print(f"Random Forest Validation Accuracy: {rf_accuracy:.4f} ({rf_accuracy*100:.2f}%)")
print("Model trained successfully!")

## 6.4 - Model 3: XGBoost

Gradient boosted trees that build sequentially, with each tree correcting errors of previous trees. Often the best performer for tabular data.

In [None]:
print("Training XGBoost...")

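# Compute per-sample weights from the class frequencies ('balanced' mode)
# XGBoost's scale_pos_weight only applies to binary problems, so sample weights
# are used instead to handle the 6:1 class imbalance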
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight('balanced', y_train)

# Initialize XGBoost
xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    eval_metric='mlogloss'
)

# Train with sample weights to handle imbalance
xgb_model.fit(X_train, y_train_encoded, sample_weight=sample_weights)

# Predict on validation set
y_val_pred_xgb = xgb_model.predict(X_val)

# Convert predictions back to class names
y_val_pred_xgb_labels = label_encoder.inverse_transform(y_val_pred_xgb)

# Calculate accuracy
xgb_accuracy = accuracy_score(y_val, y_val_pred_xgb_labels)

print(f"XGBoost Validation Accuracy: {xgb_accuracy:.4f} ({xgb_accuracy*100:.2f}%)")
print("Model trained successfully!")
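
For reference, the `'balanced'` mode used for the sample weights assigns each row the weight `n_samples / (n_classes * class_count)` for its class. The sketch below reproduces that formula by hand on a toy target and compares it with `compute_sample_weight`; the class labels are made up.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy target with a 6:3:1 class ratio
y_toy = np.array(["A"] * 6 + ["B"] * 3 + ["C"] * 1)

weights = compute_sample_weight("balanced", y_toy)

# Manual version of the 'balanced' formula: n_samples / (n_classes * class_count)
classes, counts = np.unique(y_toy, return_counts=True)
manual = dict(zip(classes, len(y_toy) / (len(classes) * counts)))
print(manual)
```

Rows from rarer classes receive larger weights, so each class contributes equally to the total weighted loss.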

## 6.5 - Model 4: Shallow Neural Network

A feedforward neural network with 2 hidden layers. Can learn complex non-linear patterns through activation functions.

In [None]:
print("Building and training Neural Network...")

# Build shallow neural network
nn_model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(len(label_encoder.classes_), activation='softmax')  # one output per country group
])

# Compile model
nn_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Display model architecture
nn_model.summary()

# Train model (with early stopping to prevent overfitting)
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history = nn_model.fit(
    X_train, y_train_encoded,
    validation_data=(X_val, y_val_encoded),
    epochs=50,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0
)

# Evaluate on validation set
val_loss, val_accuracy = nn_model.evaluate(X_val, y_val_encoded, verbose=0)

print(f"\nNeural Network Validation Accuracy: {val_accuracy:.4f} ({val_accuracy*100:.2f}%)")
print("Model trained successfully!")

## 6.6 - Model Comparison

Compare all 4 models on the validation set to identify the best performer.

In [None]:
# Create comparison dataframe
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost', 'Neural Network'],
    'Validation Accuracy': [lr_accuracy, rf_accuracy, xgb_accuracy, val_accuracy],
    'Type': ['Linear', 'Tree Ensemble', 'Gradient Boosting', 'Deep Learning']
})

# Sort by accuracy
results = results.sort_values('Validation Accuracy', ascending=False)

print("="*70)
print("MODEL COMPARISON - VALIDATION SET")
print("="*70)
print(results.to_string(index=False))
print("="*70)

# Visualize comparison
plt.figure(figsize=(10, 6))
bars = plt.bar(results['Model'], results['Validation Accuracy'], 
               color=['#3498db', '#2ecc71', '#e74c3c', '#f39c12'])

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2%}',
             ha='center', va='bottom', fontweight='bold')

plt.xlabel('Model', fontsize=12)
plt.ylabel('Validation Accuracy', fontsize=12)
plt.title('Model Performance Comparison', fontsize=14, fontweight='bold')
plt.xticks(rotation=15)
plt.ylim(0, max(results['Validation Accuracy']) * 1.15)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Identify best model
best_model_name = results.iloc[0]['Model']
best_accuracy = results.iloc[0]['Validation Accuracy']

print(f"\nBest Model: {best_model_name}")
print(f"Best Validation Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")

---

# Section 7: Model Evaluation

**Objective:** Evaluate and compare all models using accuracy, precision, recall, and F1-score.

## 7.1 - Model Comparison

TODO: Evaluate all models on test set and create comparison table

In [None]:
# TODO: Model evaluation
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
#
# models = {
#     'Logistic Regression': lr_model,
#     'Random Forest': rf_model,
#     'XGBoost': xgb_model,
#     'Neural Network': nn_model
# }
#
# results = []
# for name, model in models.items():
#     y_pred = model.predict(X_test)
#     results.append({
#         'Model': name,
#         'Accuracy': accuracy_score(y_test, y_pred),
#         'Precision': precision_score(y_test, y_pred, average='weighted'),
#         'Recall': recall_score(y_test, y_pred, average='weighted'),
#         'F1-Score': f1_score(y_test, y_pred, average='weighted')
#     })
#
# results_df = pd.DataFrame(results)
# print(results_df)
#
# # Confusion matrix for best model
# best_model = rf_model  # Choose your best
# y_pred = best_model.predict(X_test)
# cm = confusion_matrix(y_test, y_pred)
# plt.figure(figsize=(10, 8))
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
# plt.title('Confusion Matrix')
# plt.show()
#
# # Classification report
# print(classification_report(y_test, y_pred))
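
One point to consider when completing this evaluation: the four models return predictions in different forms (string labels, label-encoded integers, and softmax probabilities), so they need to be brought to a common form before computing metrics against `y_test`. The sketch below shows one way to do that; the helper `to_labels` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def to_labels(raw_pred, classes):
    """Map model output to class-name strings.

    Handles probability matrices (argmax per row), integer class indices,
    and predictions that are already label strings.
    """
    raw_pred = np.asarray(raw_pred)
    if raw_pred.ndim == 2:                          # e.g. softmax probabilities
        raw_pred = raw_pred.argmax(axis=1)
    if np.issubdtype(raw_pred.dtype, np.integer):   # e.g. label-encoded predictions
        return np.asarray(classes)[raw_pred]
    return raw_pred                                 # already strings

classes = np.array(["Africa", "East_Asia", "Oceania"])  # toy stand-in for label_encoder.classes_
y_true = np.array(["Africa", "Oceania", "East_Asia", "Africa"])

probs = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
encoded = np.array([0, 2, 1, 0])

for name, raw in {"probabilities": probs, "encoded": encoded, "strings": y_true}.items():
    y_pred = to_labels(raw, classes)
    print(name, accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro"))
```

Given the class imbalance, reporting macro-averaged scores alongside weighted ones can make weak performance on small classes easier to spot.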

---

# Section 8: Model Explainability (XAI)

**Objective:** Use SHAP and LIME to explain model predictions and identify important features.

## 8.1 - SHAP Analysis

TODO: Global and local explanations using SHAP

In [None]:
# TODO: SHAP analysis
# import shap
#
# # Initialize explainer (use TreeExplainer for tree-based models)
# explainer = shap.TreeExplainer(best_model)
# shap_values = explainer.shap_values(X_test)
#
# # Global feature importance
# shap.summary_plot(shap_values, X_test)
# shap.summary_plot(shap_values, X_test, plot_type="bar")
#
# # Local explanations (pick 3-5 instances)
# for i in [0, 10, 50]:
#     shap.force_plot(explainer.expected_value, shap_values[i], X_test.iloc[i])
#     shap.waterfall_plot(shap.Explanation(values=shap_values[i],
#                                          base_values=explainer.expected_value,
#                                          data=X_test.iloc[i]))

## 8.2 - LIME Analysis

TODO: Local instance explanations using LIME

In [None]:
# TODO: LIME analysis
# import lime
# import lime.lime_tabular
#
# # Initialize explainer
# explainer = lime.lime_tabular.LimeTabularExplainer(
#     X_train.values,
#     feature_names=X_train.columns,
#     class_names=['North_America', 'Western_Europe', ...],  # all 11 classes
#     mode='classification'
# )
#
# # Explain same instances as SHAP for comparison
# for idx in [0, 10, 50]:
#     exp = explainer.explain_instance(X_test.iloc[idx].values,
#                                       best_model.predict_proba)
#     exp.show_in_notebook()
#     exp.as_pyplot_figure()

---

# Section 9: Inference Function

**Objective:** Create a deployable function that accepts raw input and returns predictions in natural language.

TODO: Build inference function that preprocesses input and returns human-readable predictions

In [None]:
# TODO: Create inference function
# def predict_country_group(user_gender, user_age_group, user_traveller_type,
#                           user_country, review_score_overall,
#                           review_score_cleanliness, review_score_comfort,
#                           review_score_facilities, review_score_location,
#                           review_score_staff, review_score_value_for_money,
#                           hotel_cleanliness_base, hotel_comfort_base,
#                           hotel_facilities_base, hotel_location_base,
#                           hotel_staff_base, hotel_value_for_money_base):
#     """
#     Predicts the country group of a hotel based on user and hotel features.
#
#     Returns:
#         str: Predicted country group in natural language
#     """
#
#     # 1. Create input dataframe
#     input_data = pd.DataFrame({
#         'user_gender': [user_gender],
#         'user_age_group': [user_age_group],
#         # ... add all features
#     })
#
#     # 2. Apply same preprocessing (encoding, scaling, feature engineering)
#     # ... your preprocessing code here
#
#     # 3. Make prediction
#     prediction = best_model.predict(processed_input)[0]
#
#     # 4. Convert to natural language
#     country_group_map = {
#         'North_America': 'North America (United States, Canada)',
#         'Western_Europe': 'Western Europe (Germany, France, UK, Netherlands, Spain, Italy)',
#         # ... rest of mappings
#     }
#
#     return country_group_map[prediction]
#
# # Test with examples
# example_1 = predict_country_group(
#     user_gender='Male',
#     user_age_group='25-34',
#     user_traveller_type='Business',
#     # ... provide all parameters
# )
# print(f"Prediction: {example_1}")
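
The preprocessing step of the inference function has to reproduce the training layout exactly. A single record only produces dummy columns for the categories it contains, so the encoded row must be aligned to the training columns. The sketch below illustrates this with a shortened feature set; `preprocess_single`, the column list, and the input values are assumptions rather than the notebook's final implementation.

```python
import pandas as pd

CATEGORICAL = ["user_gender", "user_age_group", "user_traveller_type"]
ASPECTS = ["cleanliness", "comfort"]  # shortened for the toy example

def preprocess_single(raw, train_columns):
    """Turn one raw record (dict) into a one-row frame matching the training columns."""
    row = pd.DataFrame([raw])
    for aspect in ASPECTS:
        row[f"deviation_{aspect}"] = row[f"review_score_{aspect}"] - row[f"hotel_{aspect}_base"]
    row = row.drop(columns=[f"hotel_{a}_base" for a in ASPECTS])
    row = pd.get_dummies(row, columns=CATEGORICAL, dtype=int)
    # Align to the training layout: missing dummy columns become 0,
    # and columns the model never saw are dropped.
    return row.reindex(columns=train_columns, fill_value=0)

# Hypothetical training layout (in the notebook this would be X_train.columns)
train_columns = [
    "review_score_cleanliness", "review_score_comfort",
    "deviation_cleanliness", "deviation_comfort",
    "user_gender_Male", "user_gender_Other",
    "user_age_group_25-34", "user_traveller_type_Couple",
]

example = preprocess_single({
    "user_gender": "Male", "user_age_group": "18-24", "user_traveller_type": "Couple",
    "review_score_cleanliness": 9.0, "review_score_comfort": 8.0,
    "hotel_cleanliness_base": 8.5, "hotel_comfort_base": 8.8,
}, train_columns)
print(example.T)
```

Note that `drop_first=True` is not used here: on a single row it would drop the only category present. Reindexing against the training columns handles the reference categories instead.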