<a href="https://colab.research.google.com/github/Sudu2025/TourPkg_Prediction/blob/main/Tourism_Pkg_Predict_V1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem Statement**

**Business Context**

"Visit with Us," a leading travel company, is revolutionizing the tourism industry by leveraging data-driven strategies to optimize operations and customer engagement. While introducing a new package offering, such as the Wellness Tourism Package, the company faces challenges in targeting the right customers efficiently. The manual approach to identifying potential customers is inconsistent, time-consuming, and prone to errors, leading to missed opportunities and suboptimal campaign performance.

To address these issues, the company aims to implement a scalable and automated system that integrates customer data, predicts potential buyers, and enhances decision-making for marketing strategies. By utilizing an MLOps pipeline, the company seeks to achieve seamless integration of data preprocessing, model development, deployment, and CI/CD practices for continuous improvement. This system will ensure efficient targeting of customers, timely updates to the predictive model, and adaptation to evolving customer behaviors, ultimately driving growth and customer satisfaction.

**Objective**

As an MLOps Engineer at "Visit with Us," your responsibility is to **design and deploy an MLOps pipeline on GitHub to automate the end-to-end workflow for predicting customer purchases**. The primary objective is to build a model that predicts whether a customer will purchase the newly introduced Wellness Tourism Package before contacting them. The pipeline will include data cleaning, preprocessing, transformation, model building, training, evaluation, and deployment, ensuring consistent performance and scalability. By leveraging GitHub Actions for CI/CD integration, the system will enable automated updates, streamline model deployment, and improve operational efficiency. This robust predictive solution will empower policymakers to make data-driven decisions, enhance marketing strategies, and effectively target potential customers, thereby driving customer acquisition and business growth.

**Data Description**

The dataset contains customer and interaction data that serve as key attributes for predicting the likelihood of purchasing the Wellness Tourism Package. The detailed attributes are:


**Customer Details**

**CustomerID:** Unique identifier for each customer.

**ProdTaken:** Target variable indicating whether the customer has purchased a package (0: No, 1: Yes).

**Age:** Age of the customer.

**TypeofContact:** The method by which the customer was contacted (Company Invited or Self Inquiry).

**CityTier:** The city category based on development, population, and living standards (Tier 1 > Tier 2 > Tier 3).

**Occupation:** Customer's occupation (e.g., Salaried, Freelancer).

**Gender:** Gender of the customer (Male, Female).

**NumberOfPersonVisiting:** Total number of people accompanying the customer on the trip.

**PreferredPropertyStar:** Preferred hotel rating by the customer.

**MaritalStatus:** Marital status of the customer (Single, Married, Divorced).

**NumberOfTrips:** Average number of trips the customer takes annually.

**Passport:** Whether the customer holds a valid passport (0: No, 1: Yes).

**OwnCar:** Whether the customer owns a car (0: No, 1: Yes).

**NumberOfChildrenVisiting:** Number of children below age 5 accompanying the customer.

**Designation:** Customer's designation in their current organization.

**MonthlyIncome:** Gross monthly income of the customer.

**Customer Interaction Data**

**PitchSatisfactionScore:** Score indicating the customer's satisfaction with the sales pitch.

**ProductPitched:** The type of product pitched to the customer.

**NumberOfFollowups:** Total number of follow-ups by the salesperson after the sales pitch.

**DurationOfPitch:** Duration of the sales pitch delivered to the customer.
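
Before any preprocessing, it can help to confirm that the loaded file actually contains the attributes described above. The following is a minimal sketch and not part of the original notebook: `EXPECTED_COLUMNS` only transcribes the column names from this description, and `find_missing_columns` is a hypothetical helper introduced for illustration.

```python
# Column names transcribed from the data description above.
EXPECTED_COLUMNS = [
    "CustomerID", "ProdTaken", "Age", "TypeofContact", "CityTier",
    "Occupation", "Gender", "NumberOfPersonVisiting", "PreferredPropertyStar",
    "MaritalStatus", "NumberOfTrips", "Passport", "OwnCar",
    "NumberOfChildrenVisiting", "Designation", "MonthlyIncome",
    "PitchSatisfactionScore", "ProductPitched", "NumberOfFollowups",
    "DurationOfPitch",
]


def find_missing_columns(actual_columns):
    """Return the expected columns that are absent from actual_columns."""
    present = set(actual_columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

With pandas, this could be called as `find_missing_columns(df.columns)` right after the CSV is read; an empty list means every described attribute is present.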

## **Model Building**

In [1]:
# Create a master folder to keep all files created when executing the below code cells
import os
os.makedirs("tourism_project", exist_ok=True)

In [3]:
# Create a folder for storing the model building files
os.makedirs("tourism_project/model_building", exist_ok=True)

**Data Registration**

Create a `data` subfolder inside the master folder, then register the data in a Hugging Face dataset repository.




In [4]:
os.makedirs("tourism_project/data", exist_ok=True)

Once the `data` folder has been created by the cell above, upload `tourism.csv` into it.

In [5]:
%%writefile tourism_project/model_building/data_register.py
from huggingface_hub.utils import RepositoryNotFoundError
from huggingface_hub import HfApi
import os


# Hugging Face repo ID in "owner/name" form
repo_id = "Sudu2025/TourPkg_Prediction"
repo_type = "dataset"

# Initialize the API client with the token read from the environment
api = HfApi(token=os.getenv("HF_TOKEN"))

# Step 1: Check whether the dataset repo exists; create it if it does not
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f"Dataset repo '{repo_id}' already exists. Using it.")
except RepositoryNotFoundError:
    print(f"Dataset repo '{repo_id}' not found. Creating it...")
    api.create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
    print(f"Dataset repo '{repo_id}' created.")

# Step 2: Upload the local data folder to the dataset repo
api.upload_folder(
    folder_path="tourism_project/data",
    repo_id=repo_id,
    repo_type=repo_type,
)

Writing tourism_project/model_building/data_register.py
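
The check-then-create step in the registration script can also be expressed with the `exist_ok=True` argument that `create_repo` accepts. The sketch below is one possible variant, not the notebook's own code: `ensure_dataset_repo` and `upload_data_folder` are hypothetical helper names, and the `api` argument is assumed to be an `HfApi` instance.

```python
def ensure_dataset_repo(api, repo_id, private=False):
    """Create the dataset repo if it is missing; reuse it if it exists."""
    # exist_ok=True turns create_repo into a no-op when the repo is already
    # there, so no separate repo_info() probe or exception handling is needed.
    return api.create_repo(
        repo_id=repo_id,
        repo_type="dataset",
        private=private,
        exist_ok=True,
    )


def upload_data_folder(api, repo_id, folder_path="tourism_project/data"):
    """Ensure the repo exists, then push the local data folder to it."""
    ensure_dataset_repo(api, repo_id)
    return api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="dataset",
    )


# Usage (needs network access and a valid HF_TOKEN in the environment):
#   import os
#   from huggingface_hub import HfApi
#   api = HfApi(token=os.getenv("HF_TOKEN"))
#   upload_data_folder(api, "your-username/TourPkg_Prediction")
```

Passing the client in as an argument keeps the helpers easy to exercise with a stand-in object, which is convenient when the script runs inside a CI job.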


**Data Preparation**
- Load the dataset directly from the Hugging Face data space.
- Perform data cleaning and remove any unnecessary columns.
- Split the cleaned dataset into training and testing sets, and save them locally.
- Upload the resulting train and test datasets back to the Hugging Face data space

In [6]:
%%writefile tourism_project/model_building/prep.py
# for data manipulation
import pandas as pd
# for reading the access token from the environment
import os
# for splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
# for converting text categories into numerical codes
from sklearn.preprocessing import LabelEncoder
# for authenticating with the Hugging Face Hub and uploading files
from huggingface_hub import HfApi

# Create the API client and load the dataset from the Hugging Face dataset repo
api = HfApi(token=os.getenv("HF_TOKEN"))
DATASET_PATH = "hf://datasets/Sudu2025/TourPkg_Prediction/tourism.csv"
df = pd.read_csv(DATASET_PATH)
print("Dataset loaded successfully.")

# Drop the unique identifier and two further columns excluded from modeling
df.drop(columns=['CustomerID', 'OwnCar', 'NumberOfChildrenVisiting'], inplace=True)

# Encode categorical columns as integer codes (a fresh encoder is fitted per column)
# Note: 'Designation' is not encoded here; if it holds text labels it must be handled before training
categorical_cols = ['TypeofContact', 'Occupation', 'Gender', 'MaritalStatus', 'ProductPitched']
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])


# Define target variable
target_col = 'ProdTaken'

# Split into X (features) and y (target)
X = df.drop(columns=[target_col])
y = df[target_col]

# Perform train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Save the four splits locally
Xtrain.to_csv("Xtrain.csv", index=False)
Xtest.to_csv("Xtest.csv", index=False)
ytrain.to_csv("ytrain.csv", index=False)
ytest.to_csv("ytest.csv", index=False)


files = ["Xtrain.csv", "Xtest.csv", "ytrain.csv", "ytest.csv"]

# Upload each split to the Hugging Face dataset repo
for file_path in files:
    api.upload_file(
        path_or_fileobj=file_path,
        path_in_repo=file_path.split("/")[-1],  # just the filename
        repo_id="Sudu2025/TourPkg_Prediction",
        repo_type="dataset",
    )

Writing tourism_project/model_building/prep.py
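
The preparation script converts categories to integer codes but does not keep the category-to-code mappings, which a deployed model would need in order to encode new customer records the same way. The following library-free sketch shows one way such mappings could be built and stored. It is an assumption-laden illustration: `fit_label_mappings` and `apply_label_mappings` are hypothetical helpers, the sample rows are made up, and codes are assigned in sorted order of the distinct values, which matches how `LabelEncoder` orders string classes. Missing values are not handled.

```python
import json


def fit_label_mappings(rows, columns):
    """Build one {category: integer code} mapping per column."""
    mappings = {}
    for column in columns:
        # Sorted order keeps the codes deterministic across runs.
        distinct = sorted({row[column] for row in rows})
        mappings[column] = {value: code for code, value in enumerate(distinct)}
    return mappings


def apply_label_mappings(row, mappings):
    """Return a copy of row with mapped columns replaced by their codes."""
    encoded = dict(row)
    for column, mapping in mappings.items():
        # A KeyError here signals a category that was not seen during fitting.
        encoded[column] = mapping[row[column]]
    return encoded


# Made-up sample rows using category values named in the data description.
sample_rows = [
    {"TypeofContact": "Self Inquiry", "Gender": "Male"},
    {"TypeofContact": "Company Invited", "Gender": "Female"},
]
mappings = fit_label_mappings(sample_rows, ["TypeofContact", "Gender"])

# The mappings are plain dicts, so they can be stored next to the model as JSON.
serialized = json.dumps(mappings, indent=2)
restored = json.loads(serialized)

new_customer = {"TypeofContact": "Self Inquiry", "Gender": "Female"}
print(apply_label_mappings(new_customer, restored))
```

With pandas, the `rows` argument could be produced by `df.to_dict("records")`.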


**Model Building with Experimentation Tracking**
- Load the train and test data from the Hugging Face data space
- Define a model and parameters
- Tune the model with the defined parameters
- Log all the tuned parameters
- Evaluate the model performance
- Register the best model in the Hugging Face model hub
- The model can be built with any of the following algorithms: Decision Tree, Bagging, Random Forest, AdaBoost, Gradient Boosting, or XGBoost.
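
The tuning, logging, and evaluation steps listed above can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the notebook's training script: the data is a synthetic stand-in generated with `make_classification` rather than the real train and test files, Random Forest is picked arbitrarily from the listed algorithms, the parameter grid is deliberately tiny, and the "log" is a plain Python list because the source does not name a tracking tool.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Xtrain/Xtest/ytrain/ytest files (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define a model and a small illustrative parameter grid.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)

# Tune the model: every grid combination is cross-validated on the train set.
search.fit(Xtrain, ytrain)

# Log each tried combination together with its mean cross-validated F1 score.
trial_log = [
    {"params": params, "mean_cv_f1": float(score)}
    for params, score in zip(
        search.cv_results_["params"], search.cv_results_["mean_test_score"]
    )
]

# Evaluate the best estimator on the held-out test split.
best_model = search.best_estimator_
predictions = best_model.predict(Xtest)
test_metrics = {
    "accuracy": float(accuracy_score(ytest, predictions)),
    "f1": float(f1_score(ytest, predictions)),
}

print("Best parameters:", search.best_params_)
print("Trials logged:", len(trial_log))
print("Test metrics:", test_metrics)
```

In a real run, the entries of `trial_log` would go to whichever experiment tracker the project uses, and `best_model` would be serialized and uploaded to a model repository on the Hugging Face Hub.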