<a href="https://colab.research.google.com/github/Sudu2025/TourPkg_Prediction/blob/main/Tourism_Pkg_Predict_V1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

**Business Context**

"Visit with Us," a leading travel company, is revolutionizing the tourism industry by leveraging data-driven strategies to optimize operations and customer engagement. While introducing a new package offering, such as the Wellness Tourism Package, the company faces challenges in targeting the right customers efficiently. The manual approach to identifying potential customers is inconsistent, time-consuming, and prone to errors, leading to missed opportunities and suboptimal campaign performance.

To address these issues, the company aims to implement a scalable and automated system that integrates customer data, predicts potential buyers, and enhances decision-making for marketing strategies. By utilizing an MLOps pipeline, the company seeks to achieve seamless integration of data preprocessing, model development, deployment, and CI/CD practices for continuous improvement. This system will ensure efficient targeting of customers, timely updates to the predictive model, and adaptation to evolving customer behaviors, ultimately driving growth and customer satisfaction.

**Objective**

As an MLOps Engineer at "Visit with Us," your responsibility is to **design and deploy an MLOps pipeline on GitHub to automate the end-to-end workflow for predicting customer purchases**. The primary objective is to build a model that predicts whether a customer will purchase the newly introduced Wellness Tourism Package before contacting them. The pipeline will include data cleaning, preprocessing, transformation, model building, training, evaluation, and deployment, ensuring consistent performance and scalability. By leveraging GitHub Actions for CI/CD integration, the system will enable automated updates, streamline model deployment, and improve operational efficiency. This robust predictive solution will empower policymakers to make data-driven decisions, enhance marketing strategies, and effectively target potential customers, thereby driving customer acquisition and business growth.

**Data Description**

The dataset contains customer and interaction data that serve as key attributes for predicting the likelihood of purchasing the Wellness Tourism Package. The detailed attributes are:


**Customer Details**

**CustomerID:** Unique identifier for each customer.

**ProdTaken:** Target variable indicating whether the customer has purchased a package (0: No, 1: Yes).

**Age:** Age of the customer.

**TypeofContact:** The method by which the customer was contacted (Company Invited or Self Inquiry).

**CityTier:** The city category based on development, population, and living standards (Tier 1 > Tier 2 > Tier 3).

**Occupation:** Customer's occupation (e.g., Salaried, Freelancer).

**Gender:** Gender of the customer (Male, Female).

**NumberOfPersonVisiting:** Total number of people accompanying the customer on the trip.

**PreferredPropertyStar:** Preferred hotel rating by the customer.

**MaritalStatus:** Marital status of the customer (Single, Married, Divorced).

**NumberOfTrips:** Average number of trips the customer takes annually.

**Passport:** Whether the customer holds a valid passport (0: No, 1: Yes).

**OwnCar:** Whether the customer owns a car (0: No, 1: Yes).

**NumberOfChildrenVisiting:** Number of children below age 5 accompanying the customer.

**Designation:** Customer's designation in their current organization.

**MonthlyIncome:** Gross monthly income of the customer.

**Customer Interaction Data**

**PitchSatisfactionScore:** Score indicating the customer's satisfaction with the sales pitch.

**ProductPitched:** The type of product pitched to the customer.

**NumberOfFollowups:** Total number of follow-ups by the salesperson after the sales pitch.

**DurationOfPitch:** Duration of the sales pitch delivered to the customer.

**Prerequisites**

- Create a GitHub repo
  - Go to your **GitHub profile**
  - Click on **Your repositories**, then select ***New***
    - Repository name: **TourPkg_Prediction**
    - Check the box to add a **README.md** file
    - Click on **Create repository**
- Add the Hugging Face access token as a GitHub Actions secret so the workflow can run
  - Go to your Hugging Face **Profile**
  - Navigate to **Access Tokens**
  - Create a **New token**
    - Token type: **Write**
    - Token name: **MLOps**
    - Click on **Create token**
    - Copy the generated token
  - Now, go to the GitHub repo
    - Click on **Settings**
    - Navigate to **Secrets and variables**
    - Click on **Actions**
    - Add a **Repository secret**
      - Name: **HF_TOKEN**
      - Secret: **Paste the token created under Hugging Face access tokens**
      - Click on **Add secret**

- Create a Hugging Face Space
  - Go to **Hugging Face**
  - Open your **Profile**
  - Click on **New Space**
    - In the Space creation form, enter the details below
      - Space name: **TourPkg-Prediction**
      - Select the Space SDK: **Docker**
      - Choose a Docker template: **Streamlit**
      - Click on **Create Space**
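
Once the secret exists, the scripts in this notebook read it from the `HF_TOKEN` environment variable. The sketch below shows one way to fail early with a readable message when the variable is missing. `require_token` is a hypothetical helper, not part of `huggingface_hub`, and it assumes the workflow exposes the secret to the job under the same name.

```python
import os


def require_token(env=None, name="HF_TOKEN"):
    """Return the token stored under `name`, or raise a clear error if it is missing.

    `env` defaults to the process environment; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set. Add it as a repository secret and expose it to the job."
        )
    return token


# Demo with an explicit mapping and a made-up token value
print(require_token({"HF_TOKEN": "hf_example_value"}))
```

Calling `require_token()` with no arguments checks the real environment, which is what a script run by GitHub Actions would do.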

## **Model Building**

In [1]:
# Create a master folder to keep all files created when executing the below code cells
import os
os.makedirs("tourism_project", exist_ok=True)

In [2]:
# Create a folder for storing the model building files
os.makedirs("tourism_project/model_building", exist_ok=True)

**Data Registration**

Create a "data" subfolder inside the master folder, then register the data on the Hugging Face dataset space.




In [3]:
os.makedirs("tourism_project/data", exist_ok=True)

Once the data folder has been created by the cell above, upload tourism.csv into it.
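
Before registering the file, it can help to confirm that its header carries the columns listed in the Data Description. The following is a minimal sketch using only the standard library. `missing_columns` is a hypothetical helper, and the demo header is made up and deliberately incomplete.

```python
import csv
import io

# Column names taken from the Data Description section
EXPECTED_COLUMNS = {
    "CustomerID", "ProdTaken", "Age", "TypeofContact", "CityTier", "Occupation",
    "Gender", "NumberOfPersonVisiting", "PreferredPropertyStar", "MaritalStatus",
    "NumberOfTrips", "Passport", "OwnCar", "NumberOfChildrenVisiting", "Designation",
    "MonthlyIncome", "PitchSatisfactionScore", "ProductPitched", "NumberOfFollowups",
    "DurationOfPitch",
}


def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header, sorted by name."""
    header = next(csv.reader(csv_file), [])
    return sorted(EXPECTED_COLUMNS - {name.strip() for name in header})


# Demo on an in-memory header instead of the real file
demo = io.StringIO("CustomerID,ProdTaken,Age\n")
print(len(missing_columns(demo)), "columns missing from the demo header")
```

For the real file, open it with `open("tourism_project/data/tourism.csv", newline="")` and pass the file handle to `missing_columns`.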

In [4]:
%%writefile tourism_project/model_building/data_register.py
from huggingface_hub.utils import RepositoryNotFoundError
from huggingface_hub import HfApi
import os


# Dataset repo on the Hugging Face Hub, in the form namespace/name
repo_id = "Sudu2025/TourPkg_Prediction"
repo_type = "dataset"

# Initialize the API client with the token from the HF_TOKEN environment variable
api = HfApi(token=os.getenv("HF_TOKEN"))

# Step 1: Check whether the dataset repo exists and create it if it does not
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f"Dataset repo '{repo_id}' already exists. Using it.")
except RepositoryNotFoundError:
    print(f"Dataset repo '{repo_id}' not found. Creating new dataset repo...")
    api.create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
    print(f"Dataset repo '{repo_id}' created.")

# Step 2: Upload the contents of the local data folder to the dataset repo
api.upload_folder(
    folder_path="tourism_project/data",
    repo_id=repo_id,
    repo_type=repo_type,
)

Writing tourism_project/model_building/data_register.py
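
Hub repository IDs are written as `namespace/name`, so a malformed ID (extra path segments, placeholder brackets) only fails once the script reaches the network. A quick local check can catch that earlier. This is a simplified sketch: `looks_like_repo_id` is a hypothetical helper, its character pattern is an assumption and not the Hub's full rule set, and it does not accept IDs without a namespace.

```python
import re

# Simplified pattern: letters, digits, '.', '_' and '-' in each part (an assumption,
# not the complete naming rules enforced by the Hub)
_PART = re.compile(r"[A-Za-z0-9._-]+")


def looks_like_repo_id(repo_id):
    """Return True when repo_id has exactly two non-empty, plainly named parts."""
    parts = repo_id.split("/")
    return len(parts) == 2 and all(_PART.fullmatch(part) for part in parts)


for candidate in ["user/dataset_name", "<user>/dataset_name", "user/dataset/extra"]:
    print(candidate, looks_like_repo_id(candidate))
```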


**Data Preparation**
- Load the dataset directly from the Hugging Face data space.
- Perform data cleaning and remove any unnecessary columns.
- Split the cleaned dataset into training and testing sets, and save them locally.
- Upload the resulting train and test datasets back to the Hugging Face data space

In [6]:
%%writefile tourism_project/model_building/prep.py
# for data manipulation
import pandas as pd
# for reading the access token from the environment
import os
# for splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
# for uploading files to the Hugging Face Hub
from huggingface_hub import HfApi

# Hub client and dataset location
api = HfApi(token=os.getenv("HF_TOKEN"))
REPO_ID = "Sudu2025/TourPkg_Prediction"
DATASET_PATH = f"hf://datasets/{REPO_ID}/tourism.csv"

df = pd.read_csv(DATASET_PATH)
print("Dataset loaded successfully.")

# Drop the unique identifier and the two columns excluded from modeling
df.drop(columns=['CustomerID', 'OwnCar', 'NumberOfChildrenVisiting'], inplace=True)

# Categorical columns are kept as text here: the training pipeline one-hot encodes them,
# so the deployed app can pass the same raw category values at prediction time

# Define target variable
target_col = 'ProdTaken'

# Split into X (features) and y (target)
X = df.drop(columns=[target_col])
y = df[target_col]

# Perform train-test split, keeping the share of buyers similar in both sets
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Xtrain.to_csv("Xtrain.csv", index=False)
Xtest.to_csv("Xtest.csv", index=False)
ytrain.to_csv("ytrain.csv", index=False)
ytest.to_csv("ytest.csv", index=False)


files = ["Xtrain.csv", "Xtest.csv", "ytrain.csv", "ytest.csv"]

# Upload each split to the dataset repo under its file name
for file_path in files:
    api.upload_file(
        path_or_fileobj=file_path,
        path_in_repo=os.path.basename(file_path),
        repo_id=REPO_ID,
        repo_type="dataset",
    )

Overwriting tourism_project/model_building/prep.py
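
After the split, it is worth confirming that the share of buyers (`ProdTaken` = 1) is similar in the training and test targets. Here is a minimal sketch using only the standard library. `positive_rate` is a hypothetical helper, and the demo values are made up rather than taken from the dataset.

```python
import csv
import io


def positive_rate(target_csv, column="ProdTaken"):
    """Return the share of rows whose target equals 1 in a CSV that has a header row."""
    rows = list(csv.DictReader(target_csv))
    if not rows:
        raise ValueError("no rows found")
    positives = sum(1 for row in rows if int(float(row[column])) == 1)
    return positives / len(rows)


# Demo on tiny in-memory targets (made-up values)
demo_train = io.StringIO("ProdTaken\n1\n0\n0\n0\n")
demo_test = io.StringIO("ProdTaken\n0\n1\n")
print(positive_rate(demo_train), positive_rate(demo_test))
```

For the saved files, pass `open("ytrain.csv", newline="")` and `open("ytest.csv", newline="")` instead of the in-memory buffers.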


**Model Building with Experimentation Tracking**
- Load the train and test data from the Hugging Face data space
- Define a model and parameters
- Tune the model with the defined parameters
- Log all the tuned parameters
- Evaluate the model performance
- Register the best model in the Hugging Face model hub
- The model can be built with any of the following algorithms: Decision Tree, Bagging, Random Forest, AdaBoost, Gradient Boosting, or XGBoost
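
The "log all the tuned parameters" step can be sketched as appending one JSON line per run to a log. This is a plain-file stand-in for a tracking tool, not an implementation the notebook provides. `log_run` is a hypothetical helper, and the metric value in the demo is a placeholder.

```python
import io
import json
from datetime import datetime, timezone


def log_run(stream, params, metrics):
    """Write one JSON line describing a run to a text stream and return the record."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "metrics": metrics,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# Demo with an in-memory stream and placeholder values
buffer = io.StringIO()
log_run(buffer, {"max_depth": 3, "learning_rate": 0.05}, {"f1": 0.0})
print(buffer.getvalue().strip())
```

Passing a file opened in append mode, for example `open("experiment_log.jsonl", "a")`, keeps a history across runs. After a grid search, `grid_search.best_params_` is a natural value for `params`.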

In [7]:
%%writefile tourism_project/model_building/train.py
# for data manipulation
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# for model training, tuning, and evaluation
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score

# for model serialization
import joblib
import os
# for uploading the model to the Hugging Face Hub
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

api = HfApi(token=os.getenv("HF_TOKEN"))

DATA_REPO = "hf://datasets/Sudu2025/TourPkg_Prediction"

Xtrain = pd.read_csv(f"{DATA_REPO}/Xtrain.csv")
Xtest = pd.read_csv(f"{DATA_REPO}/Xtest.csv")
# The target files hold a single column, so squeeze them into 1-D Series
ytrain = pd.read_csv(f"{DATA_REPO}/ytrain.csv").squeeze("columns")
ytest = pd.read_csv(f"{DATA_REPO}/ytest.csv").squeeze("columns")

# Define features (the target, ProdTaken, is not a feature)
numeric_features = ['Age', 'CityTier', 'NumberOfPersonVisiting', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport', 'MonthlyIncome', 'PitchSatisfactionScore', 'NumberOfFollowups', 'DurationOfPitch']
categorical_features = ['TypeofContact', 'Occupation', 'Gender', 'MaritalStatus', 'Designation', 'ProductPitched']

# Preprocessing pipeline
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

# ProdTaken is binary (0/1), so use an XGBoost classifier
xgb_model = xgb.XGBClassifier(random_state=42, objective="binary:logistic")

# Define hyperparameter grid (make_pipeline names the step after the lowercased class name)
param_grid = {
    'xgbclassifier__n_estimators': [50, 100],
    'xgbclassifier__max_depth': [2, 3],
    'xgbclassifier__learning_rate': [0.01, 0.05],
    'xgbclassifier__colsample_bytree': [0.6, 0.8],
    'xgbclassifier__subsample': [0.6, 0.8],
    'xgbclassifier__reg_lambda': [0.5, 1],
}

# Create pipeline
model_pipeline = make_pipeline(preprocessor, xgb_model)

# Grid search with cross-validation, scored by F1 on the positive class
grid_search = GridSearchCV(
    model_pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1
)
grid_search.fit(Xtrain, ytrain)

# Best model
best_model = grid_search.best_estimator_
print("Best Params:\n", grid_search.best_params_)

# Predictions
y_pred_train = best_model.predict(Xtrain)
y_pred_test = best_model.predict(Xtest)

# Evaluation
print("\nTraining Performance:")
print(classification_report(ytrain, y_pred_train))

print("\nTest Performance:")
print(classification_report(ytest, y_pred_test))
print("Test ROC AUC:", roc_auc_score(ytest, best_model.predict_proba(Xtest)[:, 1]))

# Save best model
joblib.dump(best_model, "tourismpkg_prediction_model_v1.joblib")


# Upload to Hugging Face (model repo IDs take the form namespace/name)
repo_id = "Sudu2025/tourismpkg_prediction_model"
repo_type = "model"

# Step 1: Check whether the model repo exists and create it if it does not
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f"Model repo '{repo_id}' already exists. Using it.")
except RepositoryNotFoundError:
    print(f"Model repo '{repo_id}' not found. Creating new model repo...")
    api.create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
    print(f"Model repo '{repo_id}' created.")

# Step 2: Upload the serialized model
api.upload_file(
    path_or_fileobj="tourismpkg_prediction_model_v1.joblib",
    path_in_repo="tourismpkg_prediction_model_v1.joblib",
    repo_id=repo_id,
    repo_type=repo_type,
)

Writing tourism_project/model_building/train.py
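
The cost of the grid search follows directly from the grid: every combination of values is a candidate, and each candidate is fitted once per cross-validation fold. The sketch below works through that arithmetic for the six two-value hyperparameters and `cv=5` used in `train.py`. The pipeline step prefix is omitted from the keys because only the value counts matter here.

```python
from itertools import product

# Same value lists as the grid in train.py, with the pipeline step prefix left out
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.01, 0.05],
    "colsample_bytree": [0.6, 0.8],
    "subsample": [0.6, 0.8],
    "reg_lambda": [0.5, 1],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)  # 64 candidates, 320 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training set.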


**Model Deployment**
- Define a Dockerfile and list all configurations
- Load the saved model from the Hugging Face model hub
- Get the inputs and save them into a dataframe
- Define a dependencies file for the deployment
- Define a hosting script that can push all the deployment files into the Hugging Face space
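
The hosting step in the list above could look roughly like the sketch below, which pushes the deployment folder to the Space with `HfApi.upload_folder`. The Space ID combines the account name from the notebook link with the Space name from the prerequisites, which is an assumption. `upload_kwargs` and `push_to_space` are hypothetical helpers, and the upload call is left commented out so the snippet can be run without network access.

```python
import os

# Assumed Space ID: account name plus the Space name chosen in the prerequisites
SPACE_REPO_ID = "Sudu2025/TourPkg-Prediction"
DEPLOYMENT_DIR = "tourism_project/deployment"


def upload_kwargs(folder=DEPLOYMENT_DIR, repo_id=SPACE_REPO_ID):
    """Build the keyword arguments passed to HfApi.upload_folder."""
    return {
        "folder_path": folder,
        "repo_id": repo_id,
        "repo_type": "space",
        "path_in_repo": "",  # upload into the root of the Space
    }


def push_to_space():
    """Upload the deployment folder to the Space, using the HF_TOKEN environment variable."""
    # Imported here so the helper above works without huggingface_hub installed
    from huggingface_hub import HfApi

    api = HfApi(token=os.getenv("HF_TOKEN"))
    api.upload_folder(**upload_kwargs())


print(upload_kwargs())
# push_to_space()  # uncomment to perform the upload
```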

In [8]:
os.makedirs("tourism_project/deployment", exist_ok=True)

In [9]:
%%writefile tourism_project/deployment/Dockerfile
# Use the official Python 3.9 base image
FROM python:3.9

# Create a non-root user and run the remaining steps as that user
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

# Set the working directory inside the container
WORKDIR $HOME/app

# Copy all files from the current directory on the host into the container, owned by the user
COPY --chown=user . $HOME/app

# Install Python dependencies listed in requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt

# Define the command to run the Streamlit app on port "8501" and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

Writing tourism_project/deployment/Dockerfile


**Streamlit App**

Please ensure that the web app script is named app.py.

In [None]:
%%writefile tourism_project/deployment/app.py
import streamlit as st
import pandas as pd
from huggingface_hub import hf_hub_download
import joblib

# Download and load the model from Hugging Face Hub (repo IDs take the form namespace/name)
model_path = hf_hub_download(
    repo_id="Sudu2025/tourismpkg_prediction_model",
    filename="tourismpkg_prediction_model_v1.joblib"
)
model = joblib.load(model_path)

# Streamlit UI for Wellness Tourism Package purchase prediction
st.title("Tourism Package Prediction App")
st.write("""
This application predicts whether a customer is likely to purchase the **Wellness Tourism Package** based on customer details and sales interaction data.
Please enter the required information below to get a prediction.
""")



# User input
age = st.number_input("Age", min_value=18, max_value=100, value=30, step=1)
typeofcontact = st.selectbox("TypeofContact", ["Company Invited", "Self Enquiry"])
citytier = st.number_input("CityTier", min_value=1, max_value=3, value=1, step=1)
occupation = st.selectbox("Occupation", ["Salaried", "Free Lancer", "Small Business", "Large Business"])
gender = st.selectbox("Gender", ["Male", "Female"])
nrofpersonvisiting = st.number_input("NumberOfPersonVisiting", min_value=1, max_value=8, value=2, step=1)
prfpropertystar = st.number_input("PreferredPropertyStar", min_value=3, max_value=5, value=3, step=1)
maritalstatus = st.selectbox("MaritalStatus", ["Single", "Married", "Unmarried", "Divorced"])
nroftrips = st.number_input("NumberOfTrips", min_value=1, max_value=20, value=3, step=1)
passport = st.number_input("Passport", min_value=0, max_value=1, value=1, step=1)
designation = st.selectbox("Designation", ["Manager", "Senior Manager", "Executive", "AVP", "VP"])
monthlyincome = st.number_input("MonthlyIncome", min_value=1000, max_value=40000, value=5000, step=100)
csi = st.number_input("PitchSatisfactionScore", min_value=1, max_value=5, value=2, step=1)
productpitched = st.selectbox("ProductPitched", ["Basic", "Standard", "Deluxe", "Super Deluxe", "King"])
nroffups = st.number_input("NumberOfFollowups", min_value=1, max_value=6, value=2, step=1)
pitchduration = st.number_input("DurationOfPitch", min_value=5, max_value=40, value=10, step=1)

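# Assemble input into a single-row DataFrame whose column names match the training features
input_data = pd.DataFrame([{
    'Age': age,
    'TypeofContact': typeofcontact,
    'CityTier': citytier,
    'Occupation': occupation,
    'Gender': gender,
    'NumberOfPersonVisiting': nrofpersonvisiting,
    'PreferredPropertyStar': prfpropertystar,
    'MaritalStatus': maritalstatus,
    'NumberOfTrips': nroftrips,
    'Passport': passport,
    'Designation': designation,
    'MonthlyIncome': monthlyincome,
    'PitchSatisfactionScore': csi,
    'ProductPitched': productpitched,
    'NumberOfFollowups': nroffups,
    'DurationOfPitch': pitchduration
}])

# Prediction
if st.button("Predict Purchase"):
    prediction = model.predict(input_data)[0]
    st.subheader("Prediction Result:")
    # An output of 0.5 or above is treated as a predicted purchase (ProdTaken = 1)
    if prediction >= 0.5:
        st.success("The customer is **likely to purchase** the Wellness Tourism Package.")
    else:
        st.info("The customer is **unlikely to purchase** the Wellness Tourism Package.")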
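
The trained pipeline selects columns by name, so the single-row input must carry exactly the feature names used in training. The sketch below checks a row for missing or unexpected names before it reaches the model. `check_features` is a hypothetical helper, and the feature list is an assumption derived from `prep.py`: every described column except `CustomerID`, `ProdTaken`, `OwnCar`, and `NumberOfChildrenVisiting`.

```python
# Assumed training features: described columns minus the target and the dropped columns
TRAINING_FEATURES = [
    "Age", "TypeofContact", "CityTier", "Occupation", "Gender",
    "NumberOfPersonVisiting", "PreferredPropertyStar", "MaritalStatus",
    "NumberOfTrips", "Passport", "Designation", "MonthlyIncome",
    "PitchSatisfactionScore", "ProductPitched", "NumberOfFollowups",
    "DurationOfPitch",
]


def check_features(row):
    """Return (missing, unexpected) column names for one input row given as a dict."""
    missing = [name for name in TRAINING_FEATURES if name not in row]
    unexpected = [name for name in row if name not in TRAINING_FEATURES]
    return missing, unexpected


# Demo with a made-up row that has one valid and one stray column
missing, unexpected = check_features({"Age": 30, "bmi": 27.5})
print(len(missing), unexpected)
```

In the app, a non-empty result could be shown with `st.error` instead of calling `model.predict`.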