# Milestone 1: International Hotel Booking Analytics

---

## Project Overview

This notebook builds a machine learning pipeline that predicts the country group associated with each hotel review from user demographics, hotel characteristics, and review scores.

**Dataset:**
- Hotels Dataset: 25 hotels
- Reviews Dataset: 50,000 reviews
- Users Dataset: 2,000 users

**Deadline:** October 22, 2025 at 11:59 PM

---

## Import Libraries

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

# Model Explainability
import shap
# import lime

# Database
import sqlite3

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

Libraries imported successfully!


## Load Datasets

In [2]:
# Define paths
DATA_PATH = '../../Dataset [Original]/'

# Load datasets
hotels_df = pd.read_csv(DATA_PATH + 'hotels.csv')
reviews_df = pd.read_csv(DATA_PATH + 'reviews.csv')
users_df = pd.read_csv(DATA_PATH + 'users.csv')

print(f"Hotels Dataset Shape: {hotels_df.shape}")
print(f"Reviews Dataset Shape: {reviews_df.shape}")
print(f"Users Dataset Shape: {users_df.shape}")

Hotels Dataset Shape: (25, 13)
Reviews Dataset Shape: (50000, 12)
Users Dataset Shape: (2000, 6)


---

# 📦 Objective 1: Data Cleaning

## Description
Remove unnecessary columns and handle null values/duplicates to prepare clean datasets for analysis.

---

### 📋 1.1: Initial Data Exploration

Examine the structure and quality of each dataset.

In [3]:
# Hotels Dataset
print("=" * 80)
print("HOTELS DATASET")
print("=" * 80)
print("\nFirst 5 rows:")
display(hotels_df.head())
print("\nDataset Info:")
hotels_df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\nMissing Values:")
print(hotels_df.isnull().sum())
print("\nDuplicate Rows:", hotels_df.duplicated().sum())

HOTELS DATASET

First 5 rows:


Unnamed: 0,hotel_id,hotel_name,city,country,star_rating,lat,lon,cleanliness_base,comfort_base,facilities_base,location_base,staff_base,value_for_money_base
0,1,The Azure Tower,New York,United States,5,40.758,-73.9855,9.1,8.8,8.9,9.5,8.6,8.0
1,2,The Royal Compass,London,United Kingdom,5,51.5072,-0.1276,9.0,9.2,8.8,9.4,9.0,7.9
2,3,L'Étoile Palace,Paris,France,5,48.8566,2.3522,8.8,9.4,8.7,9.6,9.3,8.1
3,4,Kyo-to Grand,Tokyo,Japan,5,35.6895,139.6917,9.6,9.0,9.3,8.5,9.5,8.2
4,5,The Golden Oasis,Dubai,United Arab Emirates,5,25.2769,55.2962,9.3,9.5,9.6,8.9,9.4,8.5



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   hotel_id              25 non-null     int64  
 1   hotel_name            25 non-null     object 
 2   city                  25 non-null     object 
 3   country               25 non-null     object 
 4   star_rating           25 non-null     int64  
 5   lat                   25 non-null     float64
 6   lon                   25 non-null     float64
 7   cleanliness_base      25 non-null     float64
 8   comfort_base          25 non-null     float64
 9   facilities_base       25 non-null     float64
 10  location_base         25 non-null     float64
 11  staff_base            25 non-null     float64
 12  value_for_money_base  25 non-null     float64
dtypes: float64(8), int64(2), object(3)
memory usage: 2.7+ KB

Missing Values:
hotel_id                0
hotel_n

In [4]:
# Reviews Dataset
print("=" * 80)
print("REVIEWS DATASET")
print("=" * 80)
print("\nFirst 5 rows:")
display(reviews_df.head())
print("\nDataset Info:")
reviews_df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\nMissing Values:")
print(reviews_df.isnull().sum())
print("\nDuplicate Rows:", reviews_df.duplicated().sum())

REVIEWS DATASET

First 5 rows:


Unnamed: 0,review_id,user_id,hotel_id,review_date,score_overall,score_cleanliness,score_comfort,score_facilities,score_location,score_staff,score_value_for_money,review_text
0,1,1600,1,2022-10-07,8.7,8.6,8.7,8.5,9.0,8.8,8.7,Practice reduce young our because machine. Rec...
1,2,432,4,2020-03-24,9.1,10.0,9.1,9.0,8.6,9.4,8.6,Test cover traditional black. Process tell Mr ...
2,3,186,18,2023-12-18,8.8,9.7,8.8,8.3,8.7,8.1,8.6,Friend million student social study yeah. Grow...
3,4,1403,19,2022-06-22,8.9,9.0,8.8,8.5,9.6,9.1,8.3,Huge girl already remain truth behind card. Ap...
4,5,1723,17,2022-07-02,9.1,8.9,9.5,9.3,8.3,9.4,8.9,Cover feeling call community serve television ...



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   review_id              50000 non-null  int64  
 1   user_id                50000 non-null  int64  
 2   hotel_id               50000 non-null  int64  
 3   review_date            50000 non-null  object 
 4   score_overall          50000 non-null  float64
 5   score_cleanliness      50000 non-null  float64
 6   score_comfort          50000 non-null  float64
 7   score_facilities       50000 non-null  float64
 8   score_location         50000 non-null  float64
 9   score_staff            50000 non-null  float64
 10  score_value_for_money  50000 non-null  float64
 11  review_text            50000 non-null  object 
dtypes: float64(7), int64(3), object(2)
memory usage: 4.6+ MB

Missing Values:
review_id                0
user_id                  0
hotel_id  

In [5]:
# Users Dataset
print("=" * 80)
print("USERS DATASET")
print("=" * 80)
print("\nFirst 5 rows:")
display(users_df.head())
print("\nDataset Info:")
users_df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\nMissing Values:")
print(users_df.isnull().sum())
print("\nDuplicate Rows:", users_df.duplicated().sum())

USERS DATASET

First 5 rows:


Unnamed: 0,user_id,user_gender,country,age_group,traveller_type,join_date
0,1,Female,United Kingdom,35-44,Solo,2024-09-29
1,2,Male,United Kingdom,25-34,Solo,2023-11-29
2,3,Female,Mexico,25-34,Family,2022-04-03
3,4,Male,India,35-44,Family,2023-12-02
4,5,Other,Japan,25-34,Solo,2021-12-18



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   user_id         2000 non-null   int64 
 1   user_gender     2000 non-null   object
 2   country         2000 non-null   object
 3   age_group       2000 non-null   object
 4   traveller_type  2000 non-null   object
 5   join_date       2000 non-null   object
dtypes: int64(1), object(5)
memory usage: 93.9+ KB

Missing Values:
user_id           0
user_gender       0
country           0
age_group         0
traveller_type    0
join_date         0
dtype: int64

Duplicate Rows: 0


### 📋 1.2: Handle Missing Values

Identify and handle missing values appropriately for each dataset.

In [6]:
# Check for missing values across all datasets
print("Missing Values Summary:")
print("=" * 80)
print(f"Hotels Dataset - Total missing values: {hotels_df.isnull().sum().sum()}")
print(f"Reviews Dataset - Total missing values: {reviews_df.isnull().sum().sum()}")
print(f"Users Dataset - Total missing values: {users_df.isnull().sum().sum()}")
print("\n✓ No missing values found in any dataset - no imputation needed!")

Missing Values Summary:
Hotels Dataset - Total missing values: 0
Reviews Dataset - Total missing values: 0
Users Dataset - Total missing values: 0

✓ No missing values found in any dataset - no imputation needed!


### 📋 1.3: Handle Duplicate Rows

Remove duplicate entries if any exist.

In [7]:
# Check for duplicate rows across all datasets
print("Duplicate Rows Summary:")
print("=" * 80)
print(f"Hotels Dataset - Duplicate rows: {hotels_df.duplicated().sum()}")
print(f"Reviews Dataset - Duplicate rows: {reviews_df.duplicated().sum()}")
print(f"Users Dataset - Duplicate rows: {users_df.duplicated().sum()}")
print("\n✓ No duplicate rows found in any dataset - no removal needed!")

Duplicate Rows Summary:
Hotels Dataset - Duplicate rows: 0
Reviews Dataset - Duplicate rows: 0
Users Dataset - Duplicate rows: 0

✓ No duplicate rows found in any dataset - no removal needed!


### 📋 1.4: Remove Unnecessary Columns

Identify and remove columns that are not needed for analysis.

In [8]:
# Remove unnecessary columns from each dataset

# Hotels Dataset
hotels_columns_to_remove = [
    'hotel_name',  # Text identifier - not useful for numerical modeling
    'lat',         # Geographic coordinates - not mentioned in modeling requirements
    'lon'          # Geographic coordinates - city/country already captured
]
# Only drop columns that actually exist
hotels_to_drop = [col for col in hotels_columns_to_remove if col in hotels_df.columns]
if hotels_to_drop:
    hotels_df = hotels_df.drop(columns=hotels_to_drop)
    print(f"Hotels Dataset - Removed {len(hotels_to_drop)} columns: {hotels_to_drop}")
else:
    print("Hotels Dataset - No columns to remove (already cleaned or don't exist)")

# Reviews Dataset
reviews_columns_to_remove = [
    'review_id',    # Just an identifier for tracking - no predictive value
    'review_date',  # Temporal analysis not in scope for Milestone 1
    'review_text'   # NLP/text analysis not required for this milestone
]
# Only drop columns that actually exist
reviews_to_drop = [col for col in reviews_columns_to_remove if col in reviews_df.columns]
if reviews_to_drop:
    reviews_df = reviews_df.drop(columns=reviews_to_drop)
    print(f"Reviews Dataset - Removed {len(reviews_to_drop)} columns: {reviews_to_drop}")
else:
    print("Reviews Dataset - No columns to remove (already cleaned or don't exist)")

# Users Dataset
users_columns_to_remove = [
    'join_date'  # Temporal data not relevant to country group prediction
]
# Only drop columns that actually exist
users_to_drop = [col for col in users_columns_to_remove if col in users_df.columns]
if users_to_drop:
    users_df = users_df.drop(columns=users_to_drop)
    print(f"Users Dataset - Removed {len(users_to_drop)} column(s): {users_to_drop}")
else:
    print("Users Dataset - No columns to remove (already cleaned or don't exist)")

print("\n" + "="*80)
print("Remaining Columns:")
print("="*80)
print(f"\nHotels ({len(hotels_df.columns)} columns): {list(hotels_df.columns)}")
print(f"\nReviews ({len(reviews_df.columns)} columns): {list(reviews_df.columns)}")
print(f"\nUsers ({len(users_df.columns)} columns): {list(users_df.columns)}")

Hotels Dataset - Removed 3 columns: ['hotel_name', 'lat', 'lon']
Reviews Dataset - Removed 3 columns: ['review_id', 'review_date', 'review_text']
Users Dataset - Removed 1 column(s): ['join_date']

Remaining Columns:

Hotels (10 columns): ['hotel_id', 'city', 'country', 'star_rating', 'cleanliness_base', 'comfort_base', 'facilities_base', 'location_base', 'staff_base', 'value_for_money_base']

Reviews (9 columns): ['user_id', 'hotel_id', 'score_overall', 'score_cleanliness', 'score_comfort', 'score_facilities', 'score_location', 'score_staff', 'score_value_for_money']

Users (5 columns): ['user_id', 'user_gender', 'country', 'age_group', 'traveller_type']


### 📋 1.5: Data Cleaning Summary

Summarize the cleaning operations performed.

In [43]:
# Data Cleaning Summary
print("Data Cleaning Complete!")
print("="*80)
print("\nFinal Dataset Shapes:")
print(f"  Hotels Dataset:  {hotels_df.shape[0]} rows × {hotels_df.shape[1]} columns")
print(f"  Reviews Dataset: {reviews_df.shape[0]} rows × {reviews_df.shape[1]} columns")
print(f"  Users Dataset:   {users_df.shape[0]} rows × {users_df.shape[1]} columns")
print("\n✓ All datasets cleaned and ready for analysis")

Data Cleaning Complete!

Final Dataset Shapes:
  Hotels Dataset:  25 rows × 10 columns
  Reviews Dataset: 50000 rows × 9 columns
  Users Dataset:   2000 rows × 5 columns

✓ All datasets cleaned and ready for analysis


---

# 📦 Objective 2: Data Engineering Questions

## Description
Analyze the dataset to answer specific business questions with visualizations.

---

### 📋 2.1: Question 1 - Which city is best for each traveler type?

**Task:** For each traveler type (Solo, Business, Family, Couple), recommend the best city based on the given reviews.

**Approach:**
- Merge datasets to get city and traveler type information
- Calculate average scores per city for each traveler type
- Visualize results
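
The approach above can be sketched on a few made-up rows. This is a minimal illustration rather than the notebook's final answer: the three small frames and their values are invented, and ranking cities by mean score_overall is an assumption about what "best" means here.

```python
import pandas as pd

# Tiny illustrative frames; the real notebook would use hotels_df, reviews_df, users_df.
hotels = pd.DataFrame({
    "hotel_id": [1, 2, 3],
    "city": ["New York", "London", "Paris"],
})
users = pd.DataFrame({
    "user_id": [10, 11, 12],
    "traveller_type": ["Solo", "Family", "Solo"],
})
reviews = pd.DataFrame({
    "user_id": [10, 10, 11, 11, 12, 12],
    "hotel_id": [1, 2, 2, 3, 1, 3],
    "score_overall": [8.0, 9.0, 7.5, 9.5, 8.4, 8.6],
})

# Attach the traveller type and the hotel's city to every review.
merged = (
    reviews
    .merge(users, on="user_id", how="inner")
    .merge(hotels, on="hotel_id", how="inner")
)

# Mean overall score per (traveller type, city) pair.
city_scores = (
    merged
    .groupby(["traveller_type", "city"], as_index=False)["score_overall"]
    .mean()
)

# Keep the highest-scoring city for each traveller type.
best_city = city_scores.loc[
    city_scores.groupby("traveller_type")["score_overall"].idxmax()
].reset_index(drop=True)

print(best_city)
```

In the real data both hotels_df and users_df carry a country column, so the second merge would rename them to country_x and country_y unless the suffixes argument is set explicitly.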

In [10]:
# TODO: Merge datasets


In [11]:
# TODO: Calculate best city for each traveler type


In [12]:
# TODO: Visualize results


**Analysis:**
- [TODO: Add insights and interpretation]

### 📋 2.2: Question 2 - Top 3 countries with best value-for-money per age group

**Task:** What are the top 3 countries with the best value-for-money score per traveler's age group?

**Approach:**
- Use the age_group column already present in the users dataset (the data contains no raw ages to bin)
- Calculate average value-for-money scores by country and age group
- Identify top 3 countries for each age group
- Visualize results
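
A minimal sketch of the ranking step, using invented rows. It assumes that "country" means the hotel's country (not the reviewer's) and that the ranking metric is the mean of score_value_for_money; both choices should be checked against the assignment brief.

```python
import pandas as pd

# Illustrative data only; the column names mirror the cleaned notebook frames.
reviews = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "hotel_id": [1, 2, 3, 4, 1, 2, 3, 4],
    "score_value_for_money": [8.0, 9.0, 7.0, 8.5, 6.0, 7.5, 9.5, 8.0],
})
users = pd.DataFrame({"user_id": [1, 2], "age_group": ["25-34", "35-44"]})
hotels = pd.DataFrame({
    "hotel_id": [1, 2, 3, 4],
    "country": ["France", "Japan", "Spain", "Egypt"],
})

merged = reviews.merge(users, on="user_id").merge(hotels, on="hotel_id")

# Mean value-for-money score per (age group, hotel country) pair.
avg_value = (
    merged
    .groupby(["age_group", "country"], as_index=False)["score_value_for_money"]
    .mean()
)

# Sort best-first inside each age group, then keep the first three rows per group.
top3 = (
    avg_value
    .sort_values(["age_group", "score_value_for_money"], ascending=[True, False])
    .groupby("age_group")
    .head(3)
    .reset_index(drop=True)
)

print(top3)
```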

In [13]:
# TODO: Attach the existing age_group column from users_df


In [14]:
# TODO: Calculate top 3 countries per age group


In [15]:
# TODO: Visualize results


**Analysis:**
- [TODO: Add insights and interpretation]

---

# 📦 Objective 3: Predictive Modeling

## Description
Develop a statistical ML model or a shallow feed-forward neural network (FFNN) to predict the country group stored in the new column 'country_group'.

**Target:** country_group (11 classes)

**Features:**
- Score-Based Features from the hotel's info and the users' reviews
- Features about the user, including their age group, type, and gender
- Quality-Based Features representing overall score and value for money

**Evaluation Metrics:** Accuracy, Precision, Recall, F1-score

---

### 📋 3.1: Create Country Groups

Map countries to their respective country groups.

In [16]:
# Define country group mapping
country_group_mapping = {
    'United States': 'North_America',
    'Canada': 'North_America',
    'Germany': 'Western_Europe',
    'France': 'Western_Europe',
    'United Kingdom': 'Western_Europe',
    'Netherlands': 'Western_Europe',
    'Spain': 'Western_Europe',
    'Italy': 'Western_Europe',
    'Russia': 'Eastern_Europe',
    'China': 'East_Asia',
    'Japan': 'East_Asia',
    'South Korea': 'East_Asia',
    'Thailand': 'Southeast_Asia',
    'Singapore': 'Southeast_Asia',
    'United Arab Emirates': 'Middle_East',
    'Turkey': 'Middle_East',
    'Egypt': 'Africa',
    'Nigeria': 'Africa',
    'South Africa': 'Africa',
    'Australia': 'Oceania',
    'New Zealand': 'Oceania',
    'Brazil': 'South_America',
    'Argentina': 'South_America',
    'India': 'South_Asia',
    'Mexico': 'North_America_Mexico'
}

# TODO: Apply mapping to create country_group column
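
One way the mapping step might look, shown on a toy frame. The sketch assumes the target is derived from the user's country column; the three-entry dictionary is only an excerpt of the mapping above, and "Atlantis" is a made-up value included to show how unmapped countries surface as missing values.

```python
import pandas as pd

# Excerpt of the notebook's mapping, repeated so the sketch runs on its own.
country_group_excerpt = {
    "United Kingdom": "Western_Europe",
    "Japan": "East_Asia",
    "Mexico": "North_America_Mexico",
}

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    # The last country is deliberately absent from the mapping.
    "country": ["United Kingdom", "Mexico", "Japan", "Atlantis"],
})

# Series.map returns NaN for keys that are missing from the dictionary.
users["country_group"] = users["country"].map(country_group_excerpt)

# List any countries that did not receive a group so they can be fixed or dropped.
unmapped = sorted(users.loc[users["country_group"].isna(), "country"].unique())
print("Unmapped countries:", unmapped)
```

Checking for unmapped values before training avoids silently introducing missing targets.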


### 📋 3.2: Feature Engineering

Create and select features for the predictive model.

In [17]:
# TODO: Attach the existing age_group column from users_df


In [18]:
# TODO: Merge datasets to create feature set


In [19]:
# TODO: Create score-based features
# - score_overall, score_cleanliness, score_comfort, etc.


In [20]:
# TODO: Create quality-based features


In [21]:
# TODO: Encode categorical features
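
A hedged sketch of one common encoding scheme: one-hot encoding for the categorical inputs and LabelEncoder for the target. The rows are invented, and one-hot encoding is only one reasonable choice; ordinal encoding of age_group would be another.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows using the notebook's column names.
features = pd.DataFrame({
    "user_gender": ["Female", "Male", "Other", "Male"],
    "age_group": ["35-44", "25-34", "25-34", "35-44"],
    "traveller_type": ["Solo", "Solo", "Family", "Family"],
    "score_overall": [8.7, 9.1, 8.8, 8.9],
})
target = pd.Series(
    ["Western_Europe", "East_Asia", "Africa", "East_Asia"], name="country_group"
)

# One-hot encode the categorical inputs; numeric columns pass through unchanged.
categorical_cols = ["user_gender", "age_group", "traveller_type"]
X = pd.get_dummies(features, columns=categorical_cols, dtype=int)

# Encode the string target as integers; classes_ keeps the reverse lookup.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target)

print(X.columns.tolist())
print(dict(enumerate(label_encoder.classes_)))
```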


**Feature Selection Justification:**
- [TODO: Document which features were selected and why]

### 📋 3.3: Data Preprocessing

Prepare data for model training.

In [22]:
# TODO: Split data into features (X) and target (y)


In [23]:
# TODO: Train-test split


In [24]:
# TODO: Scale features
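
The split and scaling steps could be sketched as follows, on random stand-in data. The 80/20 split, the stratification, and the seed are assumptions. The scaler is fitted on the training portion only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in features and a balanced two-class target.
rng = np.random.default_rng(42)
X = rng.normal(loc=8.5, scale=0.5, size=(100, 4))
y = np.repeat([0, 1], 50)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```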


### 📋 3.4: Model Training

Train the classification model(s).

In [25]:
# TODO: Train model(s)
# Options: Random Forest, XGBoost, Shallow Neural Network, etc.
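
As one possible starting point, a random forest can be trained as sketched below. The synthetic data from make_classification stands in for the real feature matrix, and the hyperparameters (100 trees, balanced class weights) are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the engineered features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency,
# which can help if some country groups are rare.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```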


**Model Selection Justification:**
- [TODO: Explain why this model was chosen]

### 📋 3.5: Model Evaluation

Evaluate model performance using accuracy, precision, recall, and F1-score.

In [26]:
# TODO: Make predictions


In [27]:
# TODO: Calculate evaluation metrics


In [28]:
# TODO: Display classification report


In [29]:
# TODO: Plot confusion matrix
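
The four required metrics and the confusion matrix can be computed as in this sketch. The label vectors are hand-written for illustration, and macro averaging is an assumption; weighted averaging may suit imbalanced country groups better.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels standing in for y_test and the model's predictions.
y_true = ["Africa", "Africa", "East_Asia", "East_Asia", "Oceania", "Oceania"]
y_pred = ["Africa", "East_Asia", "East_Asia", "East_Asia", "Oceania", "Africa"]
labels = ["Africa", "East_Asia", "Oceania"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging gives every class equal weight regardless of its size.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Rows are true classes and columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  "
      f"Recall: {recall:.3f}  F1: {f1:.3f}")
print(classification_report(y_true, y_pred, zero_division=0))
print(cm)
```

The resulting matrix can be passed to seaborn's heatmap (already imported in the notebook as sns) for the plot.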


**Performance Analysis:**
- [TODO: Analyze model performance and discuss results]

### 📋 3.6: Save Model and Processed Data

Save the trained model and cleaned dataset.

In [30]:
# TODO: Save model


In [31]:
# TODO: Save cleaned dataset with country_group column
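
A minimal persistence sketch using joblib for the model and CSV for the data. The file names and the temporary output directory are hypothetical, and the small logistic regression is only a stand-in for whichever model is finally trained.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in model and data; replace with the trained model and cleaned frame.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
cleaned = pd.DataFrame({
    "user_id": [1, 2],
    "country_group": ["East_Asia", "Africa"],
})

# Hypothetical output location; a project folder would be used in practice.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "country_group_model.joblib"
data_path = out_dir / "cleaned_with_country_group.csv"

joblib.dump(model, model_path)
cleaned.to_csv(data_path, index=False)  # index=False avoids a stray index column

# Reload both artifacts to confirm that they round-trip.
restored_model = joblib.load(model_path)
restored_data = pd.read_csv(data_path)
print(restored_data)
```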


---

# 📦 Objective 4: Model Explainability

## Description
Apply explainable AI techniques (SHAP, LIME) to interpret predictions and identify the most influential features.

---

### 📋 4.1: SHAP Analysis

Use SHAP (SHapley Additive exPlanations) to understand feature importance and contributions.

In [32]:
# TODO: Initialize SHAP explainer


In [33]:
# TODO: Calculate SHAP values


In [34]:
# TODO: Plot SHAP summary plot


In [35]:
# TODO: Plot SHAP feature importance


**SHAP Analysis Insights:**
- [TODO: Interpret SHAP results and explain which features are most influential]

### 📋 4.2: LIME Analysis

Use LIME (Local Interpretable Model-agnostic Explanations) to explain individual predictions.

In [36]:
# TODO: Initialize LIME explainer


In [37]:
# TODO: Explain individual predictions


**LIME Analysis Insights:**
- [TODO: Interpret LIME results for sample predictions]

### 📋 4.3: Overall Feature Importance Summary

Summarize the most influential features across both explainability methods.

In [38]:
# TODO: Compare and summarize feature importance from SHAP and LIME


**Key Findings:**
- [TODO: Summarize the most influential features and their impact on predictions]

---

# 📦 Inference Function

## Description
Create a function that takes raw input and returns model predictions.

---

In [39]:
def predict_country_group(input_data):
    """
    Predicts the country group for given input data.
    
    Parameters:
    -----------
    input_data : dict or pd.DataFrame
        Input features containing:
        - Score-based features (score_cleanliness, score_comfort, score_facilities, etc.)
        - User demographics (age_group, user_gender, traveller_type)
        - Quality features (score_overall, score_value_for_money)
    
    Returns:
    --------
    str : Predicted country group
    """
    # TODO: Implement inference function
    pass
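
One way the stub might be filled in is to bundle preprocessing and the classifier in a single scikit-learn Pipeline, so raw input goes through exactly the same transformations as the training data. Everything below is a sketch under assumptions: the function name predict_country_group_sketch, the five-column feature subset, the toy training rows, and the logistic regression are all hypothetical stand-ins.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["age_group", "user_gender", "traveller_type"]
NUMERIC = ["score_overall", "score_value_for_money"]

# Toy training rows, invented purely so the sketch runs end to end.
train = pd.DataFrame({
    "age_group": ["25-34", "35-44", "25-34", "35-44", "25-34", "35-44"],
    "user_gender": ["Female", "Male", "Male", "Female", "Other", "Male"],
    "traveller_type": ["Solo", "Family", "Solo", "Family", "Solo", "Family"],
    "score_overall": [9.1, 7.2, 9.0, 7.0, 9.3, 7.4],
    "score_value_for_money": [8.9, 6.8, 9.1, 7.1, 9.0, 6.9],
})
labels = ["East_Asia", "Western_Europe"] * 3

# Preprocessing and model live in one object, so inference cannot skip a step.
pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("num", StandardScaler(), NUMERIC),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train, labels)


def predict_country_group_sketch(input_data, fitted_pipeline=pipeline):
    """Predict the country group for a dict or for the first row of a DataFrame."""
    if isinstance(input_data, dict):
        frame = pd.DataFrame([input_data])
    else:
        frame = input_data.copy()
    # Fail early with a clear message if a required column is absent.
    missing = [c for c in CATEGORICAL + NUMERIC if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing input columns: {missing}")
    return str(fitted_pipeline.predict(frame[CATEGORICAL + NUMERIC])[0])


sample = {
    "age_group": "25-34",
    "user_gender": "Female",
    "traveller_type": "Solo",
    "score_overall": 9.2,
    "score_value_for_money": 9.0,
}
print(predict_country_group_sketch(sample))
```

Setting handle_unknown="ignore" lets the encoder accept category values it never saw during training instead of raising an error at inference time.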

### Test Inference Function

In [40]:
# TODO: Test the inference function with sample data


---

# 📊 Final Summary

## Project Completion Checklist

- [x] **Objective 1: Data Cleaning** - Completed
- [ ] **Objective 2: Data Engineering Questions** - Completed
  - [ ] Question 1: Best city per traveler type
  - [ ] Question 2: Top 3 countries per age group for value-for-money
- [ ] **Objective 3: Predictive Modeling** - Completed
  - [ ] Model trained and evaluated
  - [ ] Metrics: Accuracy, Precision, Recall, F1-score
- [ ] **Objective 4: Model Explainability** - Completed
  - [ ] SHAP analysis
  - [ ] LIME analysis
- [ ] **Inference Function** - Implemented and tested

## Deliverables Status

1. ✅ Jupyter Notebook with complete workflow
2. ✅ Cleaned dataset with country_group column
3. ✅ Report answering data engineering questions
4. ✅ Trained predictive model
5. ✅ XAI outputs (SHAP/LIME)
6. ✅ Inference function

---

**Submission Deadline:** October 22, 2025 at 11:59 PM
