{  “cells”: \[   {    “cell_type”: “markdown”,    “metadata”: {},  
 “source”: \[     “\# **Voyage Analytics: End-to-End MLOps for Travel
Recommendations**”,     “”,     “\## **Project Overview**”,     “This
project aims to optimize travel decision-making using Machine Learning
with the specific `new_users`, `flights`, and `hotels` datasets.”,    
“”,     “\### **Objectives**”,     “1.  **User Profiling (NLP):** Train
a Naive Bayes classifier on the `new_users.csv` dataset to learn
name-gender patterns for future user onboarding.”,     “2.  **Flight
Price Prediction:** Build a gradient-boosted regression model (XGBoost)
using `flights.csv` to forecast ticket prices.”,     “3.  **Hotel
Recommendation:** Create a content-based recommendation engine using
`hotels.csv` to suggest properties based on similarity.”,     “”,    
“**MLOps Integration:** Uses MLflow for experiment tracking, hosted on DagsHub when a repository is configured.”    \]   },
  {    “cell_type”: “markdown”,    “metadata”: {},    “source”: \[    
“\## **1. Setup & Environment**”    \]   },   {    “cell_type”: “code”,
   “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# Install necessary libraries”,     “%pip install -q
dagshub mlflow xgboost”    \]   },   {    “cell_type”: “code”,  
 “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# Standard Libraries”,     “import numpy as np”,    
“import pandas as pd”,     “import matplotlib.pyplot as plt”,    
“import seaborn as sns”,     “import warnings”,     “”,     “\# Machine
Learning - Preprocessing & Metrics”,     “from sklearn.preprocessing
import StandardScaler, OneHotEncoder, MinMaxScaler, LabelEncoder”,    
“from sklearn.model_selection import train_test_split”,     “from
sklearn.metrics import mean_squared_error, r2_score, accuracy_score,
classification_report”,     “from sklearn.compose import
ColumnTransformer”,     “from sklearn.pipeline import make_pipeline”,  
  “”,     “\# Machine Learning - Models”,     “from sklearn.ensemble
import RandomForestRegressor”,     “from sklearn.feature_extraction.text
import CountVectorizer”,     “from sklearn.naive_bayes import
MultinomialNB”,     “from sklearn.neighbors import NearestNeighbors”,  
  “import xgboost as xgb”,     “”,     “\# MLOps”,     “import dagshub”,
    “import mlflow”,     “import mlflow.sklearn”,     “”,     “\#
Configuration”,     “sns.set_style(‘darkgrid’)”,    
“warnings.filterwarnings(‘ignore’)”,     “%matplotlib inline”    \]   },
  {    “cell_type”: “code”,    “execution_count”: null,    “metadata”:
{},    “outputs”: \[\],    “source”: \[     “\# Initialize MLOps
Tracking”,     “\# Update ‘repo_owner’ and ‘repo_name’ with your actual
DagsHub details if available”,     “try:”,     ”  
 dagshub.init(repo_owner=‘chandrapapr1501’,
repo_name=‘Group_project_experiments_flights_etc’, mlflow=True)“,    ”  
 mlflow.autolog()“,    ”except Exception as e:“,    ”  
 print(f"DagsHub/MLflow not configured ({e}). Running in local mode.")”    \]  
},   {    “cell_type”: “markdown”,    “metadata”: {},    “source”: \[  
  “\## **2. Data Loading**”,     “Loading the specific datasets:
`new_users.csv`, `flights.csv`, and `hotels.csv`.”    \]   },   {  
 “cell_type”: “code”,    “execution_count”: null,    “metadata”: {},  
 “outputs”: \[\],    “source”: \[     “\# Load Datasets (Assuming files
are in the local directory)”,     “try:”,     ”    user_df =
pd.read_csv("new_users.csv")“,    ”    flights_df =
pd.read_csv("flights.csv")“,    ”    hotels_df =
pd.read_csv("hotels.csv")“,    ”    “,    ”    print("Datasets loaded
successfully.")“,    ”except FileNotFoundError:“,    ”    print("Files
not found. Please upload ‘new_users.csv’, ‘flights.csv’, and
‘hotels.csv’.")”    \]   },   {    “cell_type”: “code”,  
 “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# DATASET ADAPTATION”,     “\# ‘new_users.csv’ does
not have a user ID code, but ‘flights.csv’ has ‘userCode’.”,     “\# We
generate a ‘code’ column based on the index to align the schemas.”,    
“”,     “user_df\[‘code’\] = user_df.index”,     “print(f"Users:
{user_df.shape}\nFlights: {flights_df.shape}\nHotels:
{hotels_df.shape}")”,     “user_df.head()”    \]   },   {  
 “cell_type”: “markdown”,    “metadata”: {},    “source”: \[     “\##
**3. User Profiling: Gender Classification (NLP)**”,     “The
`new_users.csv` file appears to have labeled data. We will use this to
**train** a model that can predict gender for any future users who sign
up without providing it.”    \]   },   {    “cell_type”: “code”,  
 “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# — STEP 1: PREPROCESSING —”,     “\# Ensure Name is
string, lowercased, and stripped of whitespace”,     “user_df\[‘name’\]
= user_df\[‘name’\].astype(str).str.lower().str.strip()”,    
“user_df\[‘gender’\] =
user_df\[‘gender’\].astype(str).str.strip().str.title()”,     “”,    
“\# Flag missing labels (after astype(str) and title(), NaN reads as ‘Nan’)”,     “missing_mask =
user_df\[‘gender’\].isin(\[‘None’, ‘Nan’, ‘Null’, ‘’\])”,     “”,     “if missing_mask.sum() \> 0:”,     ”  
 print(f"Found {missing_mask.sum()} missing gender labels. Excluding them from
training.")“,    ”else:“,    ”    print("Dataset is fully labeled. Training
prediction model for future use.")“,    ”“,    ”\# — STEP 2: BUILD
PIPELINE —“,    ”\# Char-level n-grams work best for names (e.g., "a" vs
"o" ending)“,    ”gender_model = make_pipeline(“,    ”  
 CountVectorizer(analyzer=‘char_wb’, ngram_range=(2, 4)),“,    ”  
 MultinomialNB()“,    ”)“,    ”“,    ”\# — STEP 3: TRAIN TEST SPLIT —“,
   ”X = user_df\[\~missing_mask\]\[‘name’\]“,    ”y =
user_df\[\~missing_mask\]\[‘gender’\]“,    ”“,    ”X_train_g, X_test_g,
y_train_g, y_test_g = train_test_split(X, y, test_size=0.2,
random_state=42)“,    ”“,    ”\# — STEP 4: TRAIN & EVALUATE —“,  
 ”gender_model.fit(X_train_g, y_train_g)“,    ”preds_g =
gender_model.predict(X_test_g)“,    ”“,    ”print(f"\nGender Model
Accuracy: {accuracy_score(y_test_g, preds_g):.2f}")“,    ”“,    ”\# Test
on new names“,    ”sample_names = \["Alexander", "Sophia", "Ravi",
"Priya"\]“,    ”print(f"\nTest Predictions: {dict(zip(sample_names,
gender_model.predict(\[n.lower() for n in sample_names\])))}")”    \]  
},   {    “cell_type”: “markdown”,    “metadata”: {},    “source”: \[  
  “\## **4. Feature Engineering: Flight Prices**”,     “Using
`flights.csv` to predict ticket prices.”    \]   },   {    “cell_type”:
“code”,    “execution_count”: null,    “metadata”: {},    “outputs”:
\[\],    “source”: \[     “\# Create a copy for processing”,    
“df_flights = flights_df.copy()”,     “”,     “\# 1. Date Features”,    
“df_flights\["date"\] = pd.to_datetime(df_flights\["date"\])”,    
“df_flights\["day_name"\] = df_flights\["date"\].dt.day_name()”,    
“df_flights\["month"\] = df_flights\["date"\].dt.month”,     “”,     “\#
2. Drop Duplicates”,     “df_flights.drop_duplicates(inplace=True,
keep="first")”,     “”,     “\# 3. Feature Selection”,     “\# Drop IDs
that don’t help prediction”,     “drop_cols = \["travelCode",
"userCode", "date"\] ”,     “df_flights_processed =
df_flights.drop(drop_cols, axis=1)”,     “”,     “print("Processed
Flight Data Shape:", df_flights_processed.shape)”,    
“df_flights_processed.head(3)”    \]   },   {    “cell_type”: “code”,  
 “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# 4. Encoding & Scaling Pipeline”,     “”,    
“cat_features = \["from", "to", "flightType", "agency", "day_name"\]”,  
  “num_features = \["distance", "time"\]”,     “”,     “preprocessor =
ColumnTransformer(”,     ”    transformers=\[“,    ”        ("cat",
OneHotEncoder(handle_unknown=‘ignore’), cat_features),“,    ”      
 ("num", StandardScaler(), num_features)“,    ”    \],“,    ”  
 remainder="passthrough" \# Pass remaining columns (e.g. month) through unchanged“,    ”)“,    ”“,    ”\# Separate
X and y“,    ”X = df_flights_processed.drop("price", axis=1)“,    ”y =
df_flights_processed\["price"\]“,    ”“,    ”\# 5. Data Splitting (split before
fitting so the encoder and scaler only see training rows)“,    ”X_train_raw,
X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)“,    ”“,    ”\# Apply transformations: fit on train, reuse on
test“,    ”X_train = preprocessor.fit_transform(X_train_raw)“,    ”X_test =
preprocessor.transform(X_test_raw)“,  
 ”“,    ”print(f"Train Shape: {X_train.shape}, Test Shape:
{X_test.shape}")”    \]   },   {    “cell_type”: “markdown”,  
 “metadata”: {},    “source”: \[     “\## **5. Model Development: Flight
Price Regression**”    \]   },   {    “cell_type”: “code”,  
 “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# Helper function”,     “def eval_metrics(actual,
pred):”,     ”    rmse = np.sqrt(mean_squared_error(actual, pred))“,  
 ”    r2 = r2_score(actual, pred)“,    ”    return rmse, r2”    \]   },
  {    “cell_type”: “code”,    “execution_count”: null,    “metadata”:
{},    “outputs”: \[\],    “source”: \[     “\# XGBoost Regressor
(hand-set hyperparameters)”,     “mlflow.set_experiment("Flight_Prices_New_Dataset")”,    
“”,     “params = {”,     ”    ‘n_estimators’: 500,“,    ”  
 ‘learning_rate’: 0.05,“,    ”    ‘max_depth’: 8,“,    ”    ‘subsample’:
0.8,“,    ”    ‘colsample_bytree’: 0.8,“,    ”    ‘n_jobs’: -1,“,    ”  
 ‘random_state’: 42“,    ”}“,    ”“,    ”xgb_model =
xgb.XGBRegressor(\*\*params)“,    ”“,    ”with
mlflow.start_run(run_name="XGBoost_V2"):“,    ”    # Train“,    ”  
 xgb_model.fit(X_train, y_train)“,    ”    “,    ”    # Predict“,    ”  
 y_pred = xgb_model.predict(X_test)“,    ”    rmse, r2 =
eval_metrics(y_test, y_pred)“,    ”    “,    ”    # Log metrics“,    ”  
 mlflow.log_metric("rmse", rmse)“,    ”    mlflow.log_metric("r2", r2)“,
   ”    “,    ”    print(f"XGBoost Performance:\nRMSE: {rmse:.2f}\nR2
Score: {r2:.4f}")”    \]   },   {    “cell_type”: “markdown”,  
 “metadata”: {},    “source”: \[     “\## **6. Recommendation System:
Hotels**”,     “Using `hotels.csv` to find similar hotels based on
location, average price, average stay length, and popularity (booking count).”    \]   },   {    “cell_type”: “code”,
   “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# 1. Aggregation: Create "Hotel Profiles"”,     “\#
We aggregate because one hotel has multiple bookings (rows)”,    
“hotel_profiles = hotels_df.groupby(‘name’).agg({”,     ”    ‘place’:
‘first’,         \# Location“,    ”    ‘price’: ‘mean’,          #
Average Price“,    ”    ‘days’: ‘mean’,           \# Average Stay
Duration“,    ”    ‘total’: ‘count’          # Popularity (Bookings
count)“,    ”}).rename(columns={‘total’: ‘popularity’}).reset_index()“,
   ”“,    ”print(f"Unique Hotels Identified:
{hotel_profiles.shape\[0\]}")“,    ”“,    ”\# 2. Encoding Location“,  
 ”le_place = LabelEncoder()“,    ”hotel_profiles\[‘place_encoded’\] =
le_place.fit_transform(hotel_profiles\[‘place’\])“,    ”“,    ”\# 3.
Scaling Features“,    ”scaler_rec = MinMaxScaler()“,    ”rec_features =
\[‘place_encoded’, ‘price’, ‘days’, ‘popularity’\]“,  
 ”hotel_features_matrix =
scaler_rec.fit_transform(hotel_profiles\[rec_features\])“,    ”“,    ”\#
4. Build KNN Model“,    ”knn_model = NearestNeighbors(n_neighbors=6,
metric=‘manhattan’, algorithm=‘brute’)“,  
 ”knn_model.fit(hotel_features_matrix)”    \]   },   {    “cell_type”:
“code”,    “execution_count”: null,    “metadata”: {},    “outputs”:
\[\],    “source”: \[     “def get_hotel_recommendations(hotel_name):”,
    ”    """Return the 5 most similar hotels by location, price, stay length, and
popularity, or a message string if the hotel is unknown."""“,    ”    “,    ”    if hotel_name not in
hotel_profiles\[‘name’\].values:“,    ”        return f"Hotel
‘{hotel_name}’ not found in database."“,    ”“,    ”    # Get Index“,  
 ”    idx = hotel_profiles\[hotel_profiles\[‘name’\] ==
hotel_name\].index\[0\]“,    ”“,    ”    # Find Neighbors“,    ”  
 distances, indices =
knn_model.kneighbors(\[hotel_features_matrix\[idx\]\])“,    ”“,    ”  
 # Exclude the first one (itself)“,    ”    similar_indices =
indices\[0\]\[1:\]“,    ”    “,    ”    # Return Result“,    ”    return
hotel_profiles.iloc\[similar_indices\]\[\[‘name’, ‘place’, ‘price’,
‘popularity’\]\]”    \]   },   {    “cell_type”: “code”,  
 “execution_count”: null,    “metadata”: {},    “outputs”: \[\],  
 “source”: \[     “\# Test Recommendation”,     “sample_hotel =
hotel_profiles\[‘name’\].iloc\[0\]”,     “print(f"Searching
recommendations for: {sample_hotel}\n")”,    
“print(get_hotel_recommendations(sample_hotel))”    \]   },   {  
 “cell_type”: “markdown”,    “metadata”: {},    “source”: \[     “\##
**7. Conclusion**”,     “1.  **Users:** A character n-gram Naive Bayes classifier was trained on the labeled names in
`new_users.csv` and evaluated on a held-out 20% split.”,     “2.  **Flights:** An XGBoost
regressor was trained on route, agency, flight type, distance, duration, and date features
from `flights.csv`, with RMSE and R2 logged to MLflow.”,     “3.
 **Hotels:** A nearest-neighbour recommender was built on `hotels.csv`
by aggregating bookings into per-hotel profiles.”    \]   }
 \],  “metadata”: {   “kernelspec”: {    “display_name”: “Python 3”,  
 “language”: “python”,    “name”: “python3”   },   “language_info”: {  
 “codemirror_mode”: {     “name”: “ipython”,     “version”: 3    },  
 “file_extension”: “.py”,    “mimetype”: “text/x-python”,    “name”:
“python”,    “nbconvert_exporter”: “python”,    “pygments_lexer”:
“ipython3”,    “version”: “3.10.12”   }  },  “nbformat”: 4,
 “nbformat_minor”: 5 }