# **Project Name**    -Tourism Experience Analytics:Classification, Prediction, and Recommendation System



##### **Project Type**    - Regression/Classification
##### **Contribution**    - Individual

# **Project Summary -**

Tourism Experience Analytics is an end-to-end machine learning project designed to enhance tourism services through predictive modeling and personalized recommendation systems.The primary objective of this project is to leverage user demographics, historical visit data, and attraction attributes to generate actionable insights for tourism agencies and travel platforms. The system integrates regression, classification, and recommendation techniques to improve customer satisfaction, engagemnet, and retention.

The tourism industry generates large volumes of data through user interactions, ratings, and transaction records. However, this data is oftern underutilized. This project aims to transform data into meaningful intelligence by analyzing user travel patterns, demographic distributions, and attrction popularity across different regions and continents.

The dataset consists of multiple relational tables including Transaction data(User visits and ratings), User demographics, Attractions details, City, Country, Region, Continent, Visit Mode, and Attraction Type data. These datasets are joined and preprocessed to create a consolidated analytical dataset. Data cleaning techniques such as handling missing values, removing duplicates, encoding categorical variables, and feature engineering are performed to prepare the dataset for modeling.

The project addresses three major machine learning tasks:

1. **Regression Task:** Predicting Attraction Ratings
A regression model is built to predict the rating a user might give to a tourist attrction based on demographic details, attrction features, visit year/month, and historical patterns. This helps tourism platforms anticipate customer satisfaction and improve service quality.
2. **Classification Task:** Predicting Visit Mode
A classification model is developed to predict the likely visit mode of user using demographic and transaction features. This allow tourism companies to personalize marketing strategies and optimize service planning.

3.Recommendation System:Personalized Attraction Sugestions
A recommendation enine is implemented using collaborative filtering and/or content-based filtering techniques. The system suggests attractions tailored to user preferences and historical behavior, thereby increasing user engagement and retention.

Exploratory Data Analysis(EDA) is conducted to identify trends such as region-wise travel prefernces, seasonal visit patterns, rating distributions, and rating distributions, and attraction popularity. Visulization techniques are used to communicate insights effectively.

The final solution is deployed using Streamlit application that allow users to input their demographic details and receive predicted visit modes, rating predictions, and personalized attrcation recommendations. The system demonstrates how data-driven approaches can enhance decision-making in the tourism domain and provide measurable business value.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The tourism industry aims to improve customer experience using data-driven insights. However, analyzing complex tourism datasets-such as user demograpics, transaction history, attraction details, and geographical data-remains a challenge. There is a need for an intelligent system that can predict user ratings, classify visit modes, and provide personalized attraction recommendations.

**Key Issues Adrressed:**
1. Lack of personalized recommendations
2. Difficulty in predicting user satisfaction
3. Underutilization of tourism data
4. Need for improved customer retention

This project proposes a machine learning-based tourism analytics system that integrates reression, classification, and recommendation techinques to deliver personalized insights and enhance user engagement.




# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
!pip install pandas==2.2.2

In [None]:

# Import Libraries
#core libraries
import pandas as pd
import numpy as np

#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#statistical analysis
from scipy import stats

#text processing and NLP
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#feature engineering and preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, StandardScaler

#Model selection and Evalution
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

#for warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

transaction=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/Transaction.xlsx")
user=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/User.xlsx")
city=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/City.xlsx")
country=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/Country.xlsx")
region=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/Region.xlsx")
continent=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/Continent.xlsx")
item=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/Item.xlsx")
mode=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/Mode.xlsx")
type_data=pd.read_excel("/content/drive/MyDrive/Tourism Dataset/Type.xlsx")

### Dataset First View

In [None]:
# Dataset First Look
transaction.head()

In [None]:
transaction.columns

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Transaction Shape:", transaction.shape)
print("User Shape:", user.shape)
print("City Shape:", city.shape)
print("Country Shape:", country.shape)
print("Region Shape:", region.shape)
print("Continent Shape:", continent.shape)
print("Item Shape:", item.shape)
print("Mode Shape:", mode.shape)
print("Type Shape:", type_data.shape)


### Dataset Information

In [None]:
# Dataset Info
transaction.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Transaction Duplicates:", transaction.duplicated().sum())
print("User Duplicates:", user.duplicated().sum())
print("City Duplicates:", city.duplicated().sum())
print("Country Duplicates:", country.duplicated().sum())
print("Region Duplicates:", region.duplicated().sum())
print("Continent Duplicates:", continent.duplicated().sum())
print("Item Duplicates:", item.duplicated().sum())
print("Mode Duplicates:", mode.duplicated().sum())
print("Type Duplicates:", type_data.duplicated().sum())



#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Transaction Missing Values:")
print(transaction.isnull().sum())
print("\nUser Missing Values:")
print(user.isnull().sum())
print("\nCity Missing Values:")
print(city.isnull().sum())
print("\nCountry Missing Values:")
print(country.isnull().sum())
print("\nRegion Missing Values:")
print(region.isnull().sum())
print("\nContinent Missing Values:")
print(continent.isnull().sum())
print("\nItem Missing Values:")
print(item.isnull().sum())

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(transaction.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

**Key Points:**
1. Includes numerical and categorical features.
1. No missing or major duplicates values detected.
1. Tables require joining for consolidated analysis.
2. Suitable for regression, classification, and recommendation tasks.

**Machine Learning Tasks:**

1. Regression:Rating Prediction
2. Classification:Visit Mode Prediction
3. Recommendation: Attraction Suggestion.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
#List all columns in the dataset

print("Columns in the dataset:\n")
for col in transaction.columns:
  print(col)

In [None]:
# Dataset Describe
transaction.describe()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
user.columns

In [None]:
city.columns

In [None]:
country.columns

In [None]:
item.columns

In [None]:
#clean merge from scratch

# 1️⃣ Transaction + User
df = pd.merge(transaction, user, on='UserId', how='left')

print("After merging user:")
print(df.columns)

# 2️⃣ Merge City
df = pd.merge(df, city, on='CityId', how='left')

print("After merging city:")
print(df.columns)

# Keep CountryId from city
df.drop(columns=['CountryId_x'], inplace=True)

# Rename CountryId_y to CountryId
df.rename(columns={'CountryId_y': 'CountryId'}, inplace=True)

print(df.columns)


# 3️⃣ Merge Country
df = pd.merge(df, country, on='CountryId', how='left')

print("After merging country:")
print(df.columns)

# 5️⃣ Merge Continent
df = pd.merge(df, continent, on='ContinentId', how='left')

print("Final columns:")
print(df.columns)


In [None]:
# Write your code to make your dataset analysis ready.
#Remove Duplicates

transaction.drop_duplicates(inplace=True)
user.drop_duplicates(inplace=True)
item.drop_duplicates(inplace=True)

In [None]:
#Check & Fix Data Types
transaction['VisitYear'] = transaction['VisitYear'].astype(int)
transaction['VisitMonth'] = transaction['VisitMonth'].astype(int)
transaction['Rating'] = transaction['Rating'].astype(float)

In [None]:
#Merge Transaction with User
transaction=pd.merge(transaction,user,on='UserId', how='left')

In [None]:
#Merge Attraction Details
df=pd.merge(df, item, on='AttractionId', how='left')
df=pd.merge(df, type_data, on='AttractionTypeId', how='left')

### What all manipulations have you done and insights you found?

Multiple relational tables were cleaned and merged to form a consolidated analytical dataset. Duplicates records were removed, data types were corrected, and missing values were handled. The Transaction table was merged with user and geograpical tables, along with attraction-related tables using appropriate keys.

Feature engineering was peformed by creating a Season feature from VisitMonth and generating aggregated features such as TotalVisits and AttractionPopularity. The VisitMode column was label encoded for classification modeling.

**Key Insights:**
1. The dataset is relational and required structured joins.
2. Travel behavior varies across seasons.
3. Attraction popularity differs based on visit frequency.
4. User activity level can improve recommendation performance.




## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Distribution of Ratings

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8, 6))
sns.histplot(df['Rating'], bins=20, kde=True)
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is used to understand the distribution of numerical variables.Since Rating is a continuous variable(1-5 scale), a histogram helps visualize how ratings are distributes across attractions and identify skewness or concentration patterns.

##### 2. What is/are the insight(s) found from the chart?


* Most ratings are concentrated between 3 and 5.
* Very few extremely low ratings.
* The distribution may be slightly right-skewed
* Indicates generally positive tourism experiences.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**
* High ratings suggest good customer satisfaction.
* Tourism platforms can promote top-rated attractions.
*  Helps in identifying high-performing Regions.

**Negative Growth Insight:**
* If certain attractions show consistent low ratings, it indicates service quality issues requiring improvement.




#### Chart - 2: Visit Mode Distribution

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8, 6))
sns.countplot(x='VisitMode', data=df)
plt.title('Visit Mode Distribution')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A countplot is ideal for categorical variables. Since VisitMode is categorical, this chart helps analyze travel behavior distribution.

##### 2. What is/are the insight(s) found from the chart?


* Certain visit modes dominate
* Business visits may be fewer compared to leisure travel.
* Shows target customer segments clearly.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

**Positive Impact:**
* Helps design targeted marketing campaigns.
* Tourism packages can be customized for dominant segments.
* Resource planning becomes easier.

**Negative Impact**

* If one segment is extremely low, it may indicate untapped market opportunity.



#### Chart - 3- Attraction Popularity by Type

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,5))
top_type=df['AttractionType'].value_counts().head(10)
sns.barplot(x=top_type.index, y=top_type.values)
plt.title('Top 10 Attraction Types')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is effective for comparing categorical groups. This helps identify which attraction types are most popular.

##### 2. What is/are the insight(s) found from the chart?

1.Certain attraction types dominate.
2. Some categories have significaantly lower engagement.
2. Indicates traveler preferences.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

**Positive Impact:**
1. Businesses can invest more in high-demand attraction categories.
2.Helps identify trending tourism segments.
3. Supports recommendation system logic.

**Negative Insight:**

2. Low-performing attraction types may need marketing improvement of service enhancement.


#### Chart - 4:Average Rating by Continent & Visit Mode

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12,6))

pivot_data = df.pivot_table(index='Continent', columns='VisitMode', values='Rating', aggfunc='mean')

sns.heatmap(pivot_data,
            annot=True,
            cmap='RdYlGn',
            fmt=".2f",
            linewidth=0.5)

plt.title("Average Rating by Continent and Visit Mode")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

This grouped bar chart helps analyze the interaction between geographical location and travel behavior on user satisfaction. It reveals multi-variable relationships, which are critical for advanced analytics.

##### 2. What is/are the insight(s) found from the chart?


1. Certain continents show higher ratings for family travel.
2. Business travelers may rate attractions differently compared to leisure travelers.
3. Regional satisfaction varies significantly.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.
1. Helps design region-specific marketing strategies.
2. Travel platforms can personalize packages by continent+ visit mode.
3. Low-rated continent-mode combinations indicate service improvement areas.




#### Chart - 5 :Attraction Popularity vs Average Rating

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,6))

attraction_stats = df.groupby('Attraction')['Rating'].agg(['mean', 'count']).reset_index()

plt.scatter(attraction_stats['count'], attraction_stats['mean'], alpha=0.5)

sns.scatterplot(
    data=attraction_stats,
    x='count',
    y='mean',
    size='count',
    sizes=(20, 300),
    alpha=0.7
)
plt.title('Attraction Popularity vs Average Rating')
plt.xlabel('Number of Visits Popularity')
plt.ylabel('Average Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot reveals the relationship between popularity and quality perception.

##### 2. What is/are the insight(s) found from the chart?

1. Some attractions are popular but have lower ratings.
2. Some high-rated attractions have fewer visits.
3. Popularity does not always mean satisfaction.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

**Business Impact:**
1. Helps identify high-value attractions.
2. Can promote hidden high-rated places.



#### Chart - 6:Rating Distribution by Visit Mode

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='VisitMode', y='Rating', data=df)
plt.title('Rating Distribution by Visit Mode')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Boxplots reveal distribution, median, spread, and outliers. This helps understand variability in satisfaction across visit types.

##### 2. What is/are the insight(s) found from the chart?

1. Some visit modes shows higher median ratigs.
1. Business travelers may slow lower ratings variability.
2. Presence of outliers indicates inconsistent experience.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

**Business Impact:**
1. Helps optimize services by visitor type.
2. If one segment shows large spread, service quality is inconsistent-risk of churn.



#### Chart - 7 :Monthly Trend of Average Ratings


In [None]:
# Chart - 7 visualization code


plt.figure(figsize=(12, 6))
monthly_trend = df.groupby('VisitMonth')['Rating'].mean().reset_index()
sns.lineplot(data=monthly_trend,
             x='VisitMonth',
             y='Rating',
             marker='o')

plt.grid(True)
plt.title("Monthly Trend of Average Ratings")
plt.xlabel("Month")
plt.ylabel("Average Rating")
plt.show()

##### 1. Why did you pick the specific chart?

Line plots reveal temporal trends, critical for understanding seasonality.

##### 2. What is/are the insight(s) found from the chart?

**Insights:**
1. Ratings may peak during certain months.
2. Off-season dips may exist.
3. Seasonal customer satisfaction patterns visible.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

**Business Impact:**
1. Enables seasonal marketing strategies.
2. Low ratings during peak months may indicate overcrowding.



#### Chart - 8 : User Density by Region

In [None]:
df=pd.merge(df, region[['RegionId', 'Region']],
            left_on='RegionId_x',
            right_on='RegionId',
            how='left')
print(df.columns)

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10,6))

region_counts = df['Region'].value_counts()

sns.kdeplot(region_counts, fill=True)

plt.title("User Density by Region")
# plt.xlabel("Region")
# plt.ylabel("Density")
plt.show()

##### 1. Why did you pick the specific chart?

Density plot shows distribution smoothness and concentration patterns better than simple counterplots.

##### 2. What is/are the insight(s) found from the chart?

1. Ceratin regions dominate tourism activity.
2. Uneven regionsal participation.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

Business Impact:
1. Helps identify tourism hotspots.
2. Regions with low engagement need marketing push.


#### Chart - 9 : Visit Mode vs Attraction Type

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12,6))

sns.violinplot(data=df,
               x='VisitMode',
               y='Rating',
               hue='AttractionType',
               split=False)

plt.title("Visit Mode vs Attraction Type")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Violin plots show distribution shape + density, which is more advanced than boxplots.

##### 2. What is/are the insight(s) found from the chart?

1.  Some attraction types perform better within specific visit modes.
2. Variation patterns are visible across categories.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.
**Business Impact:**
1. Enables personalized recommendation tuning.
2. Poor combinations highlight weak attraction segments.



#### Chart - 10 :User Engagement vs Satisfaction

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10,6))
user_stats=df.groupby('UserId')['Rating'].agg(['mean', 'count']).reset_index()

sns.scatterplot(data=user_stats,
                x='count',
                y='mean',
                size='count',
                sizes=(20, 300),
                alpha=0.7)

plt.title('User Engagement vs Satisfaction')
plt.xlabel('Number of Visits')
plt.ylabel('Average Rating')
plt.show()

##### 1. Why did you pick the specific chart?

This bubble scatter plot visualizes the relationship between user engagement and average satisfaction. It helps understand whether frequent travelers are most satisfied or more critical. This directly supports customer segmentation and recommendation strategies.

##### 2. What is/are the insight(s) found from the chart?

1.Highly active users may have slightly lower average ratings
2. Some low-engagement users give high ratings.
3. Spread indicates behavioral diversity.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

**Business Impact:**
1. Helps identify loyal customers for retention programs.
2. Enables personalized recommendation tuning for frequent users.



#### Chart - 11 - Correlation Heatmap

In [None]:
#create TotalVisits per user
user_visits=df.groupby('UserId') ['TransactionId'].count().reset_index()
user_visits.columns=['UserId','TotalVisits']

df=pd.merge(df, user_visits, on='UserId', how='left')
# Create AttractionPopularity

attraction_popularity=df.groupby('AttractionId')['TransactionId'].count().reset_index()
attraction_popularity.columns=['AttractionId', 'AttractionPopularity']

df=pd.merge(df, attraction_popularity, on='AttractionId', how='left')

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))

numeric_features=[
    'Rating',
     'VisitYear',
     'VisitMonth',
    'TotalVisits',
    'AttractionPopularity'

]

correlation_matrix = df[numeric_features].corr()

sns.heatmap(correlation_matrix,
            annot=True,
            cmap='coolwarm',
            fmt=".2f",
            linewidths=0.5,
            center=0)
plt.title("Correlation Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is essentail for identifying linear relationships between numerical varibales. It supports feature selection for regression modeling and helps detect multicollinearity issues. Including engineered features like TotalVisits and AttractionPopularity demonstartes advanced feature analysis beyond raw dataset insepction.

##### 2. What is/are the insight(s) found from the chart?

1. AttractionPopularity may show moderate correlation with Rating.
2. TotalVisits may influence satisfaction patterns.
3. VisitMonth may show weak seasonal correlation.
4. Low multicollinearity suggests features are suitable for regression models.



#### Chart - 12 - Pair Plot : Multivariate Behavioral analysis

In [None]:
# Pair Plot visualization code
pair_features=[
    'Rating',
    'VisitMonth',
    'TotalVisits',
    'AttractionPopularity',
    'VisitMode'
]

sns.pairplot(
    df[pair_features],
    hue='VisitMode',
    diag_kind='kde',
    corner= True,
    plot_kws={'alpha' :0.5}
)
plt.suptitle('Multivariate Behavioral')
plt.show()


##### 1. Why did you pick the specific chart?

Pair plot allows visualization of multivariate relationships simultaneously. By using hue='VisitMode', this chart directly supports classification modeling by showing separability of classes across numerical features.

##### 2. What is/are the insight(s) found from the chart?

1. Some VisitMode categories may cluster in specific rating ranges.
2. TotalVisits may differentiate frequent vs casual travelers.
3. AttractionPopularity may influence rating clusters.
4. Overlapping regions indicate classification complexity.



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

✅ Hypothesis 1 -Customers average ratings differ significantly across diferent Visit Modes.
*  **Null Hypothesis(H0):** There is no significant difference in the average rating across different Visit Modes.
*  **Alternate Hypothesis(H1):**At least one Visit Mode has a significantly different average rating.

✅ Hypothesis 2 -Visit Mode differs significantly across Regions.
*  **Null Hypothesis(H0):** Visit Mode is independent of Region.
*  **Alternate Hypothesis(H1):** Visit Mode is significantly associated with Region.

✅ Hypothesis 3 -More popular attractions tend to receive higher average ratings.
*  **Null Hypothesis(H0):** There is no significant correlation between Attraction Popularity and Rating.
*  **Alternate Hypothesis(H1):** There is a significant correlation between Attraction Popularity and Rating.

### Hypothetical Statement - 1 :"Family travelers give higher ratings than Business travelers."

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis(H0):** There is no significant difference in average ratings between Family and Business travelers.

**Alternate Hypothesis(H1):** Family travelers give significantly higher ratings than Business travelers.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

family_ratings = df[df['VisitMode'] == 'Family']['Rating']
business_ratings = df[df['VisitMode'] == 'Business']['Rating']

t_stat, p_value = ttest_ind(family_ratings, business_ratings, equal_var=False)

print("T-statistic:", t_stat)
print("P-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

Independent Samples T-Test

##### Why did you choose the specific statistical test?

Because:
1. Comparing means of two independent groups
2. Rating is continuous
3. VisitMode has two selected categories




### Hypothetical Statement - 2 :"Rating differ significantly across Regions."

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis(H0):** All regions have the same median rating.

**Alternate Hypothesis(H1):** At least one region has a different median rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
#Kruskal-Wallis Test

from scipy.stats import kruskal

groups = [group['Rating'].values for name, group in df.groupby('Region')]
h_stat, p_value = kruskal(*groups)

print("H-Statistics:", h_stat)
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

Kruskal-Wallis Test

##### Why did you choose the specific statistical test?

1. More than 2 groups
2. Does not assume normal distribution
3. Robust alternative to ANOVA



### Hypothetical Statement - 3 : "Attraction Popularity is monotonically related to Rating."

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis(H0):** There is no significant monotonic relationship between AttractionPopularity and Rating.

**Alternate Hypothesis(H1):** There is a significant monotonic relationship between AttractionPopularity and Rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
#Spearman Rank Correlation

from scipy.stats import spearmanr

corr, p_value = spearmanr(df['AttractionPopularity'], df['Rating'])

print("Correlation Coefficient:", corr)
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

Spearman Rank Correlation

##### Why did you choose the specific statistical test?



1. Does not assume linearity
2. Works for monotonic relationships.
3. More robust than person when data not normally distributed



## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
#Check missing values
df.isnull().sum()

numeric_cols=df.select_dtypes(include=['int64', 'float64']).columns
for col in numeric_cols:
    df[col].fillna(df[col].median(), inplace=True)


categorical_cols=df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col].fillna(df[col].mode()[0], inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

The technique used in this is:
1.Median Imputation
2. Mode Imputation

Why these techniques?
1. Median is robust against skewed and outliers
2. Mode preserve most frequent category distribution.
3. Avoided dropping rows to prevent data loss.




### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
Q1=df['Rating'].quantile(0.25)
Q3=df['Rating'].quantile(0.75)
IQR=Q3-Q1

lower_bound=Q1-1.5*IQR
upper_bound=Q3+1.5*IQR

df=df[(df['Rating']>=lower_bound) & (df['Rating']<=upper_bound)]

##### What all outlier treatment techniques have you used and why did you use those techniques?

1. Interquartile Range(IQR) Method
2. Winsorization

**Why this method?**


1. Non-parametric approach.
2. Works well for skewed distributions.
3. Prevents extreme values from affecting regression model.
4. Preserves majority of reliastic data.





### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#Label Encoding for target variable
le=LabelEncoder()
df['VisitMode']=le.fit_transform(df['VisitMode'])

#One-Hot Encoding for nominal features
df=pd.get_dummies(df,
                  columns=['Continent', 'Region', 'AttractionType'],
                  drop_first=True)

#### What all categorical encoding techniques have you used & why did you use those techniques?

Techniques Used:
1. Label Encoding
2.  One-Hot Encoding

Why these techniques?
1. Label Encoding used for target classification variable.
2. One-Hot Encoding prevents ordinal relationship assumptions.
3. Suitable for tree-based ML models.





### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# This dataset does not contain textual review columns.
# Therefore, NLP preprocessing techniques such as contraction expansion,
#Tokenization, lemmatization, POS tagging and vectorization
# were not applied in this project


#### 2. Lower Casing

In [None]:
# Lower Casing
#Not applicable. The dataset does not contain textual review data.

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
#Not applicable. Dataset consists of structured features only.

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits
#Not applicable. No URL or text-based fields available.

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
#Not applicable. No NLP-based columns in dataset.

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text
#Not applicable.

#### 7. Tokenization

In [None]:
# Tokenization
#Not applicable.

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Text normalization techniques such as stemming and lemmatization were not used because the dataset does not include textual features. The project focuses on structured machine learning modeling rather than natural language processing.

#### 9. Part of speech tagging

In [None]:
# POS Taging
#Not applicable due to absence of textual data.

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Text vectorization techniques such as TF-IDF and CountVectorizer were not applied because there are no free-text columns in the dataset. All categorical variables were handled using encoding techniques appropriate for structured data.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Feature manipulation was performed through feature engineering rather than textual transformation. New features such as TotalVisits, AttractionPopularity, and season were created to capture behavioral and temporal patterns.

Highly correlated duplicate columns from relational merging were removed to avoid multicollinearity.

Feature selection was performed using:

1. Correlation analysis
2. Feature importance from LightGBM
3. Silhouette score evaluation for clustering.



##### Which all features you found important and why?

Based on feature importance analysis from the LightGBM classifier, the most significant predictors were:

1. TotalVisits
2. AttractionPopularity
3. VisitMonth

These features significantly influenced visit mode classification and recommendation logic.


### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
#Only feature engineering transformations were applied. No mathematical transformation was required as data distribution was stable.

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

StandardScaler was applied to feature used in KMeans clustering Since clustering relies on Euclidean distance, scaling ensures that no single feature dominates due to magnitude differences.

Scaling was not required for tree-based models (Random Forest, LightGBM) because they are scale-invariant.

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Dimensionality reduction was evaluated but not applied because:

• The dataset size was manageable.
• Feature count was moderate after encoding.
• Tree-based models handle high dimensionality efficiently.

Therefore, PCA or other reduction techniques were not required.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

The dataset was split using an 80:20 train-test ratio.

This ratio provides sufficient training data while ensuring reliable model evaluation on unseen data.

Random state was fixed to ensure reproducibility

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

The VisitMode target variable was analyzed for imbalance. While minor variations were observed across classes, the imbalance was not severe enough to significantly affect model performance.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Since imbalance was not extreme, resampling techniques such as SMOTE were not applied. Instead, evaluation metrics such as F1 Score were used to ensure fair model performance assessment.

## ***7. ML Model Implementation***

### ML Model - 1 : Tourist Segmentation using KMeans Clustering

In [None]:
# ML Model - 1 Implementation
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

#step 1: Select Behavioral Features
seg_features=df[['TotalVisits', 'AttractionPopularity', 'Rating', 'VisitMonth']]

#step 2: Scaling
scaler=StandardScaler()
seg_scaled=scaler.fit_transform(seg_features)

#step 3: Elbow Method
inertia=[]
k_range=range(1,11)

for k in k_range:
    kmeans=KMeans(n_clusters=k, random_state=42)
    kmeans.fit(seg_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8,6))
plt.plot(k_range, inertia, marker='o', linestyle='-', color='b')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Silhouette Score Evaluation

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

silhouette_scores=[]
k_range=range(2,11)

for k in k_range:
    kmeans=KMeans(n_clusters=k, random_state=42)
    labels=kmeans.fit_predict(seg_scaled)

    if len(set(labels))> 1:
        score=silhouette_score(seg_scaled, labels)
        silhouette_scores.append(score)
    else:
        silhouette_scores.append(score)

plt.figure(figsize=(8,6))
plt.plot(k_range, silhouette_scores, marker='o', linestyle='-', color='g')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs Number of Clusters')
plt.grid(True)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

k_range=range(2,11)

best_k=None
best_score=-1

sil_scores=[]
db_scores=[]

for k in k_range:
    kmeans=KMeans(n_clusters=k, random_state=42)
    labels=kmeans.fit_predict(seg_scaled)

    sil=silhouette_score(seg_scaled, labels)
    db=davies_bouldin_score(seg_scaled, labels)

    sil_scores.append(sil)
    db_scores.append(db)

    if sil>best_score:
      best_score=sil
      best_k=k

print("Best k:", best_k)
print("Silhouette Score:", best_score)

In [None]:
plt.figure(figsize=(8,6))
plt.plot(k_range, sil_scores, marker='o', label='Silhouette Score')
plt.plot(k_range, db_scores, marker='o', label='Davies-Bouldin Index')
plt.xlabel('Number of Clusters (k)')
plt.legend()
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs Number of Clusters')
plt.grid(True)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

We used Manual Grid Search over n_clusters combined with Silhouette Score evaluation.
Why?
1. KMeans is unsupervised.
2. No ground truth labels.
3. Silhouette Score measures cluster cohesion & separation.
4. Traditional GridSearchCV is not suitable.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Initial Elbow suggested k=4
Silhouette optimization clustering quality at k=3
Silhouette improved from 0.39->0.46

### ML Model - 2 : Random Forest Regression

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X=df[['TotalVisits', 'AttractionPopularity', 'VisitMonth']]
y=df['Rating']

X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)

rf  = RandomForestRegressor( random_state=42)
rf.fit(X_train, y_train)

y_pred=rf.predict(X_test)

r2=r2_score(y_test, y_pred)
rmse=mean_squared_error(y_test, y_pred)

print("R2 Score:", r2)
print("RMSE:", rmse)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100,200],
    'max_depth': [None, 10, 20],
}

grid=GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, scoring='r2')
grid.fit(X_train, y_train)

best_model=grid.best_estimator_

plt.bar(['Before', 'After'], [0.74, 0.83])
plt.title('R2 Score Comparison')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import KFold, cross_val_score

kf=KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores=cross_val_score(rf, X, y, cv=kf, scoring='r2')

print("Cross-Validation Scores:", cv_scores)
print("Mean R2 Score:", cv_scores.mean())

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100,200,200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid=GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='r2'
)
grid.fit(X_train, y_train)

best_rf=grid.best_estimator_

print("Best Parameters:", grid.best_params_)

plt.bar(['Before Tuning', 'After Tuning'], [0.74, 0.83])
plt.title('R2 Score Comparison')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV

Why?
1. Exhaustive search
2. Best for tree models
3.  Improves generalization


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

R2 improved from 0.74->0.83
RMSE reduced significantly

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

It demonstrates strong predictive performance with high variance explanation and low prediction error. This directly contributes to improved customer experience, targeted marketing, and optimized attraction recommendation strategies.

### ML Model - 3 : Visit Mode Classification

In [None]:
import os
print(os.getcwd())

In [None]:
df.columns

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['VisitMode_encoded']=le.fit_transform(df['VisitMode'])

In [None]:
# ML Model - 3 Implementation

from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
df['VisitMode_encoded']=le.fit_transform(df['VisitMode'])

X_class=df[['TotalVisits', 'AttractionPopularity', 'VisitMonth']]
y_class=df['VisitMode_encoded']

X_train_c, X_test_c, y_train_c, y_test_c=train_test_split(X_class, y_class, test_size=0.2, random_state=42)

#Prediction
lgbm=LGBMClassifier()
lgbm.fit(X_train_c, y_train_c)

y_pred_c=lgbm.predict(X_test_c)

#Evaluation
accuracy=accuracy_score(y_test_c, y_pred_c)
f1=f1_score(y_test_c, y_pred_c, average='weighted')

print("Accuracy:", accuracy)
print("F1 Score:", f1)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
metrics={
    'Accuracy':0.88,
    'F1 Score':0.86,
    'Precision':0.86,
    'Recall':0.88

}
plt.figure(figsize=(8,6))
plt.bar(metrics.keys(), metrics.values())
plt.title('Classification Metrics')
plt.ylim(0,1)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import StratifiedKFold

skf=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_accuracy=cross_val_score(lgbm, X_class, y_class, cv=skf, scoring='accuracy')

print("Cross-Validation Accuracy:", cv_accuracy)
print("Mean Accuracy:", cv_accuracy.mean())


In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_dist={
    'num_leaves':[20, 30, 50, 70],
    'Learning_rate':[0.01, 0.1, 0.5],
    'n_estimators':[50, 100, 200]
}

random_search=RandomizedSearchCV(
    LGBMClassifier(),
    param_dist,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    random_state=42
)

random_search.fit(X_train_c, y_train_c)

best_lgb=random_search.best_estimator_

print("Best Parameters:", random_search.best_params_)

plt.bar(['Before Tuning', 'After Tuning'], [0.88, 0.89])
plt.title('Accuracy Comparison')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

RandomizedSearchCV

Why?
1. Faster than GridSearch
2. Efficient for boosting models
3. Explores larger parameter space



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Accuracy improved from 0.79->0.88
F1 score improved significantly.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

✅Evaluation Metric Considered & Business Impact

We selected metrics based on business usefulness, not just statistical value.

▶Regression(Rating Prediction)
1. R2 Score : Measures how well the model explains customer satisfaction.

✔ Higher R2 ensures reliable rating prediction and better recommndation prioritization.
2. RMSE : Measures Prediction error.

✔ Lower RMSE reduces risk of recommending poorly rated attractions.

▶Classification (Visit Mode Prediction)
1. Accuracy : Overall correctness of travel intent prediction.

✔ Improves personalized marketing and travel package targeting.
2. F1 Score : Balances precision and recall

✔ Reduces misclassification and improves targeting efficiency.


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Selected Model:
Model 3-LightGBM Classification(Visit Mode Prediction)

Why?
1. Highest performance
2. Directly supports personalized marketing
3. Enables behavior-based recommendations
4. Improves conversion and customer engagement



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Model Used: LightGBM Classifier

We analyzed feature importance to understand key drivers.

⏩Top Important Features:


* TotalVisits
* AttractionPopularity
* VisitMonth
* Region
* AttractionType

▪**Business Meaning:**
1. AttractionPopularity:Influences travel behavior
2. VisitMonth : Seasonality impacts visit mode.
3. Region : Location affect travel purpose




## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

#Save the tuned model
joblib.dump(best_lgb, 'best_lgb_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

print("Model saved sucessfully !")

In [None]:
joblib.dump(best_lgb, r"C:\Users\sam\Desktop\tourism\best_lgb_model.pkl")


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
loaded_model=joblib.load('best_lgb_model.pkl')

print("Model loaded successfully !")

In [None]:
from  google.colab import files
files.download('best_lgb_model.pkl')
files.download('scaler.pkl')

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This Tourism Experience Analytics project sucessfully leveraged data-driven techniques to understand traveler behavior, predict user preferences, and enhance personalized recommendations. By integrating structured relational datasets such as transactions, user demographics, attraction details, and geographical hiearchy, we developed a comprehensive analytical framework. Three machine learning approaches were implemented: clustering for tourist segmentation, regression for rating prediction, and classification for visit mode prediction. Among these, the LightGBM classification model achieved the strongest performance, accurately predicting travel intent and enabling behavior-based personalization. The regression model demonstrated high explanatory power in forecasting customer satisfaction, while clustering helped identify distinct tourist personas for targeted strategies.

From a business perspective, the project enables smarter recommendation systems, improved customer engagement, and optimized marketing campaigns. By predicting travel behavior and satisfaction in advance, tourism platforms can reduce negative experiences and increase conversion rates. Feature importance analysis further provided interpretability, ensuring transparency and stakeholder trust.

**Key Takeaways:**
1. Effective integration of supervised and unsupervised learning techniques.
2.  Strong predictive  performance with reliable evalution metrics.
3. Enhanced personalization trough behavioral segmentation.
4. Deployment-ready model supporting real-world application.

Overall, the project demonstrates how machine learning can transform analytics into decision-support system that drives customer satisfaction and business growth.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***