# Milestone 1: International Hotel Booking Analytics

---

## Project Overview

This notebook implements a comprehensive machine learning pipeline to predict the country group of hotel reviews based on user demographics, hotel characteristics, and review scores.

**Dataset:**
- Hotels Dataset: 25 hotels
- Reviews Dataset: 50,000 reviews
- Users Dataset: 2,000 users

**Deadline:** October 22, 2025 at 11:59 PM

---

## Import Libraries

In [3]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

# Model Explainability
import shap
# import lime

# Database
import sqlite3

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

Libraries imported successfully!


## Load Datasets

In [4]:
# Define paths
DATA_PATH = '../../Dataset [Original]/'

# Load datasets
hotels_df = pd.read_csv(DATA_PATH + 'hotels.csv')
reviews_df = pd.read_csv(DATA_PATH + 'reviews.csv')
users_df = pd.read_csv(DATA_PATH + 'users.csv')

print(f"Hotels Dataset Shape: {hotels_df.shape}")
print(f"Reviews Dataset Shape: {reviews_df.shape}")
print(f"Users Dataset Shape: {users_df.shape}")

Hotels Dataset Shape: (25, 13)
Reviews Dataset Shape: (50000, 12)
Users Dataset Shape: (2000, 6)


---

# 📦 Objective 1: Data Cleaning

## Description
Remove unnecessary columns and handle null values/duplicates to prepare clean datasets for analysis.

---

### 📋 1.1: Initial Data Exploration

Examine the structure and quality of each dataset.

In [None]:
# Hotels Dataset
print("=" * 80)
print("HOTELS DATASET")
print("=" * 80)
print("\nFirst 5 rows:")
display(hotels_df.head())
print("\nDataset Info:")
hotels_df.info()  # info() prints its report directly and returns None
print("\nMissing Values:")
print(hotels_df.isnull().sum())
print("\nDuplicate Rows:", hotels_df.duplicated().sum())

In [None]:
# Reviews Dataset
print("=" * 80)
print("REVIEWS DATASET")
print("=" * 80)
print("\nFirst 5 rows:")
display(reviews_df.head())
print("\nDataset Info:")
reviews_df.info()  # info() prints its report directly and returns None
print("\nMissing Values:")
print(reviews_df.isnull().sum())
print("\nDuplicate Rows:", reviews_df.duplicated().sum())

In [None]:
# Users Dataset
print("=" * 80)
print("USERS DATASET")
print("=" * 80)
print("\nFirst 5 rows:")
display(users_df.head())
print("\nDataset Info:")
users_df.info()  # info() prints its report directly and returns None
print("\nMissing Values:")
print(users_df.isnull().sum())
print("\nDuplicate Rows:", users_df.duplicated().sum())

### 📋 1.2: Handle Missing Values

Identify and handle missing values appropriately for each dataset.

In [None]:
# TODO: Implement missing value handling
# Strategy: Analyze each column and decide on appropriate handling (drop, fill, etc.)
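
One possible way to approach this step is sketched below on a toy frame rather than the real data. The column names (`user_id`, `score_overall`, `traveler_type`) and the choice of median and sentinel fills are assumptions made for illustration, not the project's actual schema or strategy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the datasets; column names are hypothetical.
toy_df = pd.DataFrame({
    "user_id": [1, 2, None, 4],
    "score_overall": [8.5, np.nan, 7.0, 9.0],
    "traveler_type": ["Solo", None, "Family", "Couple"],
})

# Share of missing values per column, to guide the drop-vs-fill decision.
missing_share = toy_df.isnull().mean()

# Rows without an identifier cannot be joined later, so drop them.
cleaned = toy_df.dropna(subset=["user_id"]).copy()

# Fill numeric gaps with the median and categorical gaps with a sentinel.
cleaned["score_overall"] = cleaned["score_overall"].fillna(cleaned["score_overall"].median())
cleaned["traveler_type"] = cleaned["traveler_type"].fillna("Unknown")

print(missing_share)
print(cleaned)
```

Whether to drop or fill depends on how much of each column is missing, so the per-column share is worth inspecting before committing to either.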


### 📋 1.3: Handle Duplicate Rows

Remove duplicate entries if any exist.

In [None]:
# TODO: Remove duplicates
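
A minimal sketch of duplicate removal, shown on a toy frame with hypothetical column names. It treats only fully identical rows as duplicates; whether a subset of key columns should be used instead is a decision the real data would have to inform.

```python
import pandas as pd

# Toy frame with one exact duplicate row; column names are hypothetical.
toy_reviews = pd.DataFrame({
    "review_id": [1, 2, 2, 3],
    "score_overall": [8.0, 9.0, 9.0, 7.5],
})

n_before = len(toy_reviews)

# Keep the first occurrence of each fully identical row and renumber the index.
deduped = toy_reviews.drop_duplicates(keep="first").reset_index(drop=True)

print(f"Removed {n_before - len(deduped)} duplicate row(s)")
```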


### 📋 1.4: Remove Unnecessary Columns

Identify and remove columns that are not needed for analysis.

In [None]:
# TODO: Remove unnecessary columns
# Document which columns are removed and why
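
One way to keep the removal and its justification together is to store both in a single dictionary. The columns below (`lat`, `lon`) and the reasons given are purely illustrative assumptions; the real candidates depend on the actual schema.

```python
import pandas as pd

# Toy hotels table; column names are hypothetical.
toy_hotels = pd.DataFrame({
    "hotel_id": [1, 2],
    "city": ["Paris", "Tokyo"],
    "lat": [48.86, 35.68],
    "lon": [2.35, 139.69],
})

# Hypothetical choice: map each dropped column to the reason for dropping it.
columns_to_drop = {"lat": "not used in analysis", "lon": "not used in analysis"}

# errors="ignore" keeps the cell re-runnable after the columns are already gone.
trimmed = toy_hotels.drop(columns=list(columns_to_drop), errors="ignore")

for name, reason in columns_to_drop.items():
    print(f"Dropped {name!r}: {reason}")
```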


### 📋 1.5: Data Cleaning Summary

Summarize the cleaning operations performed.

In [None]:
# TODO: Print summary of cleaning operations
print("Cleaned Hotels Dataset Shape:", hotels_df.shape)
print("Cleaned Reviews Dataset Shape:", reviews_df.shape)
print("Cleaned Users Dataset Shape:", users_df.shape)

---

# 📦 Objective 2: Data Engineering Questions

## Description
Analyze the dataset to answer specific business questions with visualizations.

---

### 📋 2.1: Question 1 - Which city is best for each traveler type?

**Task:** For each traveler type (Solo, Business, Family, Couple), recommend the best city based on the given reviews.

**Approach:**
- Merge datasets to get city and traveler type information
- Calculate average scores per city for each traveler type
- Visualize results
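
The merge and aggregation steps of this approach could look roughly like the sketch below. It runs on toy tables, and every column name (`hotel_id`, `user_id`, `city`, `traveler_type`, `score_overall`) is an assumption about the real schema. "Best" is taken here to mean the highest mean overall score, which is only one possible definition.

```python
import pandas as pd

# Toy tables; all column names here are assumptions about the real schema.
hotels = pd.DataFrame({"hotel_id": [1, 2], "city": ["Paris", "Tokyo"]})
users = pd.DataFrame({"user_id": [10, 11], "traveler_type": ["Solo", "Family"]})
reviews = pd.DataFrame({
    "user_id": [10, 10, 11, 11],
    "hotel_id": [1, 2, 1, 2],
    "score_overall": [9.0, 7.0, 6.5, 8.5],
})

# Attach the hotel's city and the reviewer's traveler type to every review.
merged = reviews.merge(hotels, on="hotel_id", how="left").merge(users, on="user_id", how="left")

# Mean overall score for every (traveler type, city) pair.
city_scores = (
    merged.groupby(["traveler_type", "city"])["score_overall"].mean().reset_index()
)

# For each traveler type, keep the row with the highest mean score.
best_idx = city_scores.groupby("traveler_type")["score_overall"].idxmax()
best_city = city_scores.loc[best_idx].set_index("traveler_type")["city"]

print(best_city)
```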

In [None]:
# TODO: Merge datasets


In [None]:
# TODO: Calculate best city for each traveler type


In [None]:
# TODO: Visualize results


**Analysis:**
- [TODO: Add insights and interpretation]

### 📋 2.2: Question 2 - Top 3 countries with best value-for-money per age group

**Task:** What are the top 3 countries with the best value-for-money score per traveler's age group?

**Approach:**
- Create age groups from user ages
- Calculate average value-for-money scores by country and age group
- Identify top 3 countries for each age group
- Visualize results
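
A sketch of the binning and ranking steps, assuming a merged frame with hypothetical columns `age`, `country`, and `score_value_for_money`. The bin edges and labels are illustrative choices, not ones prescribed by the task.

```python
import pandas as pd

# Toy merged frame; column names and age bins are assumptions.
toy = pd.DataFrame({
    "age": [22, 23, 24, 25, 41, 42, 43, 44],
    "country": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "score_value_for_money": [9.0, 8.0, 7.0, 6.0, 5.0, 9.5, 8.5, 7.5],
})

# Bucket ages into labelled, right-closed intervals: (17, 24], (24, 34], ...
bins = [17, 24, 34, 44, 54, 120]
labels = ["18-24", "25-34", "35-44", "45-54", "55+"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Mean value-for-money score per (age group, country) pair.
mean_scores = (
    toy.groupby(["age_group", "country"], observed=True)["score_value_for_money"]
    .mean()
    .reset_index()
)

# Sort best-first within each age group, then keep the first three rows per group.
top3 = (
    mean_scores.sort_values(["age_group", "score_value_for_money"], ascending=[True, False])
    .groupby("age_group", observed=True)
    .head(3)
)

print(top3)
```

Passing `observed=True` keeps empty age buckets out of the result, which matters because `pd.cut` produces a categorical column.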

In [None]:
# TODO: Create age groups


In [None]:
# TODO: Calculate top 3 countries per age group


In [None]:
# TODO: Visualize results


**Analysis:**
- [TODO: Add insights and interpretation]

---

# 📦 Objective 3: Predictive Modeling

## Description
Develop a statistical ML model or a shallow feed-forward neural network (FFNN) to predict the country group stored in the new `country_group` column.

**Target:** country_group (11 classes)

**Features:**
- Score-Based Features from the hotel's info and the users' reviews
- Features about the user, including their age group, type, and gender
- Quality-Based Features representing overall score and value for money

**Evaluation Metrics:** Accuracy, Precision, Recall, F1-score

---

### 📋 3.1: Create Country Groups

Map countries to their respective country groups.

In [None]:
# Define country group mapping
country_group_mapping = {
    'United States': 'North_America',
    'Canada': 'North_America',
    'Germany': 'Western_Europe',
    'France': 'Western_Europe',
    'United Kingdom': 'Western_Europe',
    'Netherlands': 'Western_Europe',
    'Spain': 'Western_Europe',
    'Italy': 'Western_Europe',
    'Russia': 'Eastern_Europe',
    'China': 'East_Asia',
    'Japan': 'East_Asia',
    'South Korea': 'East_Asia',
    'Thailand': 'Southeast_Asia',
    'Singapore': 'Southeast_Asia',
    'United Arab Emirates': 'Middle_East',
    'Turkey': 'Middle_East',
    'Egypt': 'Africa',
    'Nigeria': 'Africa',
    'South Africa': 'Africa',
    'Australia': 'Oceania',
    'New Zealand': 'Oceania',
    'Brazil': 'South_America',
    'Argentina': 'South_America',
    'India': 'South_Asia',
    'Mexico': 'North_America_Mexico'
}

# TODO: Apply mapping to create country_group column
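
Applying the mapping could be as simple as a `Series.map` call. The sketch below repeats a three-entry excerpt of the mapping so it runs on its own, and it assumes the country lives in a column named `country`, which may not match the real data.

```python
import pandas as pd

# Excerpt of the mapping defined above, repeated so the sketch runs on its own.
mapping_excerpt = {"Canada": "North_America", "Japan": "East_Asia", "Egypt": "Africa"}

# Toy frame; the 'country' column name is an assumption.
toy_users = pd.DataFrame({"country": ["Japan", "Canada", "Atlantis"]})

toy_users["country_group"] = toy_users["country"].map(mapping_excerpt)

# Countries missing from the mapping become NaN, so surface them explicitly.
unmapped = toy_users.loc[toy_users["country_group"].isna(), "country"].unique().tolist()
print("Unmapped countries:", unmapped)
```

Checking for unmapped values guards against spelling variants in the country names silently producing missing targets.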


### 📋 3.2: Feature Engineering

Create and select features for the predictive model.

In [None]:
# TODO: Create age groups


In [None]:
# TODO: Merge datasets to create feature set


In [None]:
# TODO: Create score-based features
# - score_overall, score_cleanliness, score_comfort, etc.


In [None]:
# TODO: Create quality-based features


In [None]:
# TODO: Encode categorical features
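
A hedged sketch of one encoding scheme: one-hot encoding for nominal features and `LabelEncoder` for the target. The toy columns (`gender`, `traveler_type`, `score_overall`) are assumed names, and one-hot encoding is only one reasonable option.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy feature table; column names are hypothetical.
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "traveler_type": ["Solo", "Family", "Couple"],
    "score_overall": [8.0, 7.5, 9.0],
    "country_group": ["Africa", "Oceania", "Africa"],
})

# One-hot encode nominal features so no artificial ordering is implied.
X = pd.get_dummies(
    toy.drop(columns="country_group"),
    columns=["gender", "traveler_type"],
    dtype=int,
)

# Label-encode the target; the fitted encoder is kept to decode predictions later.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(toy["country_group"])

print(X.columns.tolist())
print(y)
```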


**Feature Selection Justification:**
- [TODO: Document which features were selected and why]

### 📋 3.3: Data Preprocessing

Prepare data for model training.

In [None]:
# TODO: Split data into features (X) and target (y)


In [None]:
# TODO: Train-test split


In [None]:
# TODO: Scale features
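
The split and scaling steps might be sketched as follows on synthetic data. The 80/20 ratio, the stratified split, and the fixed seed are illustrative assumptions. The scaler is fitted on the training portion only, so that test-set statistics do not leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and a balanced two-class target.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
y = np.repeat([0, 1], 50)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, then reuse it for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```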


### 📋 3.4: Model Training

Train the classification model(s).

In [None]:
# TODO: Train model(s)
# Options: Random Forest, XGBoost, Shallow Neural Network, etc.
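
As an illustration of the first option listed, here is a random forest trained on synthetic three-class data. The hyperparameters, including `class_weight="balanced"`, are assumptions picked for the sketch and have not been tuned for the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic 3-class data: each class is shifted so the problem is learnable.
y_train = np.repeat([0, 1, 2], 40)
X_train = rng.normal(size=(120, 5)) + y_train[:, None] * 3.0

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))
```

Training accuracy alone says little about generalization, so the held-out test set remains the basis for the evaluation step.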


**Model Selection Justification:**
- [TODO: Explain why this model was chosen]

### 📋 3.5: Model Evaluation

Evaluate model performance using accuracy, precision, recall, and F1-score.

In [None]:
# TODO: Make predictions


In [None]:
# TODO: Calculate evaluation metrics


In [None]:
# TODO: Display classification report


In [None]:
# TODO: Plot confusion matrix
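
The four metrics and the confusion matrix can be computed as in the sketch below, shown with small hand-written label arrays instead of real predictions. Macro averaging is an assumption; weighted averaging may suit the task better if the classes are imbalanced.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels standing in for y_test and the model's predictions.
y_test = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

accuracy = accuracy_score(y_test, y_pred)

# Macro averaging gives every class equal weight regardless of its size.
precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
recall = recall_score(y_test, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")
print(cm)
```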


**Performance Analysis:**
- [TODO: Analyze model performance and discuss results]

### 📋 3.6: Save Model and Processed Data

Save the trained model and cleaned dataset.

In [None]:
# TODO: Save model


In [None]:
# TODO: Save cleaned dataset with country_group column
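
One common way to persist both artefacts is `joblib` for the model and CSV for the data. The file names, the temporary output directory, and the toy logistic regression below are all placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data and model standing in for the real cleaned dataset and trained model.
toy = pd.DataFrame({
    "score_overall": [1.0, 2.0, 8.0, 9.0],
    "country_group": ["A", "A", "B", "B"],
})
model = LogisticRegression().fit(toy[["score_overall"]], toy["country_group"])

# Hypothetical output locations inside a temporary directory.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "country_group_model.joblib"
data_path = out_dir / "cleaned_with_country_group.csv"

joblib.dump(model, model_path)
toy.to_csv(data_path, index=False)

# Reload to confirm the saved model round-trips.
reloaded = joblib.load(model_path)
print(reloaded.predict(toy[["score_overall"]]))
```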


---

# 📦 Objective 4: Model Explainability

## Description
Apply explainable AI techniques (SHAP, LIME) to interpret predictions and identify the most influential features.

---

### 📋 4.1: SHAP Analysis

Use SHAP (SHapley Additive exPlanations) to understand feature importance and contributions.

In [None]:
# TODO: Initialize SHAP explainer


In [None]:
# TODO: Calculate SHAP values


In [None]:
# TODO: Plot SHAP summary plot


In [None]:
# TODO: Plot SHAP feature importance


**SHAP Analysis Insights:**
- [TODO: Interpret SHAP results and explain which features are most influential]

### 📋 4.2: LIME Analysis

Use LIME (Local Interpretable Model-agnostic Explanations) to explain individual predictions.

In [None]:
# TODO: Initialize LIME explainer


In [None]:
# TODO: Explain individual predictions


**LIME Analysis Insights:**
- [TODO: Interpret LIME results for sample predictions]

### 📋 4.3: Overall Feature Importance Summary

Summarize the most influential features across both explainability methods.

In [None]:
# TODO: Compare and summarize feature importance from SHAP and LIME


**Key Findings:**
- [TODO: Summarize the most influential features and their impact on predictions]

---

# 📦 Inference Function

## Description
Create a function that takes raw input and returns model predictions.

---

In [None]:
def predict_country_group(input_data):
    """
    Predicts the country group for given input data.
    
    Parameters:
    -----------
    input_data : dict or pd.DataFrame
        Input features containing:
        - Score-based features (cleanliness, comfort, facilities, etc.)
        - User demographics (age_group, gender, traveler_type)
        - Quality features (overall_score, value_for_money)
    
    Returns:
    --------
    str : Predicted country group
    """
    # TODO: Implement inference function
    pass
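
A minimal sketch of how such a function could be wired together, assuming a fitted scaler, label encoder, and model are available. Everything here is a stand-in: the two-feature toy training set, the `feature_columns` list, and the `predict_country_group_sketch` name are hypothetical, and the real function would also need to apply the project's own feature engineering and categorical encoding.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy training artefacts standing in for the real fitted objects.
train = pd.DataFrame({
    "score_overall": [9.0, 8.5, 3.0, 2.5],
    "value_for_money": [8.0, 9.0, 2.0, 3.0],
    "country_group": ["Oceania", "Oceania", "Africa", "Africa"],
})
feature_columns = ["score_overall", "value_for_money"]
scaler = StandardScaler().fit(train[feature_columns])
encoder = LabelEncoder().fit(train["country_group"])
model = RandomForestClassifier(n_estimators=25, random_state=0).fit(
    scaler.transform(train[feature_columns]),
    encoder.transform(train["country_group"]),
)


def predict_country_group_sketch(input_data):
    """Return the predicted label for one record (dict or one-row DataFrame)."""
    frame = pd.DataFrame([input_data]) if isinstance(input_data, dict) else input_data.copy()
    # Enforce the training-time column order; a missing feature raises KeyError.
    features = frame[feature_columns]
    encoded = model.predict(scaler.transform(features))
    # Map the integer class back to its original string label.
    return encoder.inverse_transform(encoded)[0]


print(predict_country_group_sketch({"score_overall": 8.8, "value_for_money": 8.5}))
```

Reusing the same scaler and encoder objects that were fitted during training keeps inference consistent with how the model saw its data.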

### Test Inference Function

In [None]:
# TODO: Test the inference function with sample data


---

# 📊 Final Summary

## Project Completion Checklist

- [ ] **Objective 1: Data Cleaning** - Completed
- [ ] **Objective 2: Data Engineering Questions** - Completed
  - [ ] Question 1: Best city per traveler type
  - [ ] Question 2: Top 3 countries per age group for value-for-money
- [ ] **Objective 3: Predictive Modeling** - Completed
  - [ ] Model trained and evaluated
  - [ ] Metrics: Accuracy, Precision, Recall, F1-score
- [ ] **Objective 4: Model Explainability** - Completed
  - [ ] SHAP analysis
  - [ ] LIME analysis
- [ ] **Inference Function** - Implemented and tested

## Deliverables Status

1. ✅ Jupyter Notebook with complete workflow
2. ✅ Cleaned dataset with country_group column
3. ✅ Report answering data engineering questions
4. ✅ Trained predictive model
5. ✅ XAI outputs (SHAP/LIME)
6. ✅ Inference function

---

**Submission Deadline:** October 22, 2025 at 11:59 PM
