# TravelTide 2.0

In this notebook we load all tables from the `.csv` files saved in version 1.0 of this project. This shortens the loading time and reuses the features created in the previous version, so we can start building and aggregating additional data and information.

## Imports

In [1]:
# Importing required libraries
import pandas as pd
from datetime import datetime
from collections import defaultdict
import numpy as np 

## Data Upload

### Sessions

|Metric|Description|
|------|-----------|
|session_count|Total number of sessions|
|booking_count|Number of sessions where a booking was made|
|cancellations|Number of sessions where a cancellation occurred|
|avg_session_duration_min|Average session duration in minutes|
|booking_conversion_rate|Ratio of bookings to total sessions|
|cancellation_rate|Ratio of cancellations to total bookings|
|explorer_bucket|User type based on session and booking behavior|
|flight_discount_booking_rate|Ratio of flight bookings that used a discount|
|hotel_discount_booking_rate|Ratio of hotel bookings that used a discount|
|average_flight_discount|Average discount amount applied to flights|
|average_hotel_discount|Average discount amount applied to hotels|
|discount_booking_rate|Generalized rate combining both flight and hotel discounts|
|discount_sensitivity_bucket|User discount sensitivity label|
|avg_days_between_sessions|Average number of days between sessions|
|weekend_user|Categorization as ‘Weekend’ or ‘Weekday’ user|
|days_since_last_booking_bucket|Time bucket since user’s last booking|


In [8]:
# Reading CSV data file into a DataFrame
df_sessions = pd.read_csv('df_sessions.csv')

### Sessions Features

In [10]:
# Reading CSV data file into a DataFrame
sessions_features = pd.read_csv('sessions_features.csv')
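
A side note on reloading from `.csv`: column types such as datetimes are not stored in the file, which is why `session_start` is converted again further down. The minimal sketch below uses a made-up two-row CSV (not the project data) to show that `parse_dates` can do the conversion at load time.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for df_sessions.csv.
csv_text = (
    "user_id,session_start,booking_made\n"
    "1,2023-01-04 10:15:00,True\n"
    "1,2023-01-09 18:30:00,False\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["session_start"])
print(demo.dtypes)
```

Whether this is worth doing here depends on how often the column is needed; converting once with `pd.to_datetime`, as this notebook does, gives the same result.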

## New Features

**Sessions and Booking Behaviour**

|Bucket|Definition|
|------|---------|
|Unengaged| Fewer than 5 sessions and no bookings|
|Silent Explorer| 5+ sessions and no bookings|
|Casual Explorer| Few sessions and at least 1 booking|
|Engaged Explorer| 6-10 sessions and bookings|
|Power Explorer|11+ sessions and bookings|


In [12]:
def assign_bucket(row):
    # Bucket a user by session volume and by whether they ever booked.
    if row['session_count'] >= 5 and row['booking_count'] == 0:
        return "Silent Explorer"
    # "Few sessions" covers up to 5, so bookers with exactly 5 sessions are not left unclassified.
    elif row['session_count'] <= 5 and row['booking_count'] >= 1:
        return "Casual Explorer"
    elif 6 <= row['session_count'] <= 10 and row['booking_count'] >= 1:
        return "Engaged Explorer"
    elif row['session_count'] > 10 and row['booking_count'] >= 1:
        return "Power Explorer"
    else:
        return "Unengaged"

# Applying a function row-wise to compute a new column
sessions_features['explorer_bucket'] = sessions_features.apply(assign_bucket, axis=1)
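
The same bucketing can also be written without a row-wise `apply`. The sketch below uses `np.select` on made-up values; it treats "few sessions" as five or fewer, which is an assumption, and the toy frame only borrows the column names of `sessions_features`.

```python
import numpy as np
import pandas as pd

# Toy data: the column names follow sessions_features, the values are invented.
demo = pd.DataFrame({
    "session_count": [2, 7, 3, 8, 12],
    "booking_count": [0, 0, 1, 2, 4],
})

has_booking = demo["booking_count"] >= 1

# Conditions are checked in order; the first match wins.
conditions = [
    (demo["session_count"] >= 5) & ~has_booking,
    (demo["session_count"] <= 5) & has_booking,   # assumption: "few" means <= 5
    demo["session_count"].between(6, 10) & has_booking,
    (demo["session_count"] > 10) & has_booking,
]
labels = ["Silent Explorer", "Casual Explorer", "Engaged Explorer", "Power Explorer"]

demo["explorer_bucket"] = np.select(conditions, labels, default="Unengaged")
print(demo)
```

On a table with one row per user, the vectorized form is usually faster than `apply(..., axis=1)`, while the function form is easier to read and extend.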

**Discounts** \
The main idea of this part is to analyse the behavior of the users with respect to discounts.

1. First we flagged the flight and hotel bookings that had a discount: if `flight_discount`/`hotel_discount` is **true** and a booking was made, the session gets 1; otherwise 0.
2. Average discounts given per user for both *flights* and *hotels*.
   - For many users a discount was flagged, but the discount amount column holds no information about how much it was. The idea was to leave these missing amounts out of the average, since counting them as 0 would give wrong average values.
3. We created buckets based on the discounts and booking behavior of the users.

1. Calculate if a booking was made using a discount

In [13]:
df_sessions['booked_with_flight_discount'] = (
    (df_sessions['flight_discount'] == True) & (df_sessions['booking_made'] == True)
).astype(int)

df_sessions['booked_with_hotel_discount'] = (
    (df_sessions['hotel_discount'] == True) & (df_sessions['booking_made'] == True)
).astype(int)

2. Average discounts given per user for both *flights* and *hotels*. For some sessions where `flight_discount` or `hotel_discount` is **true**, the discount amount is `NaN`. For calculation purposes we leave these values unchanged for now; `mean` skips `NaN` by default.

$\rightarrow$ `hotel_discount_booking_rate` and `flight_discount_booking_rate` are the per-user means of the two flags above, i.e. the share of a user's sessions that ended in a booking with the respective discount.

In [14]:
# Aggregating discount-related metrics per user:
# - Calculates average booking rates with discounts and average discount amounts.
discount_aggregates = df_sessions.groupby('user_id').agg({
    'booked_with_flight_discount': 'mean',
    'booked_with_hotel_discount': 'mean',
    'flight_discount_amount': 'mean',
    'hotel_discount_amount': 'mean'
}).rename(columns={
    'booked_with_flight_discount': 'flight_discount_booking_rate',
    'booked_with_hotel_discount': 'hotel_discount_booking_rate',
    'flight_discount_amount': 'average_flight_discount',
    'hotel_discount_amount': 'average_hotel_discount'
}).reset_index().round(2)
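
To make the meaning of these rates concrete, here is a small sketch on invented sessions for two users. It contrasts the mean of the flag over all sessions (what the aggregation above computes) with the share of bookings that used a discount. The names `rate_per_session` and `rate_per_booking` are illustrative only.

```python
import pandas as pd

# Invented sessions: user 1 has 4 sessions and 2 bookings, user 2 never books.
demo = pd.DataFrame({
    "user_id":         [1, 1, 1, 1, 2, 2],
    "flight_discount": [True, True, False, False, True, False],
    "booking_made":    [True, False, True, False, False, False],
})

demo["booked_with_flight_discount"] = (
    demo["flight_discount"] & demo["booking_made"]
).astype(int)

per_user = demo.groupby("user_id").agg(
    session_count=("booking_made", "size"),
    booking_count=("booking_made", "sum"),
    discounted_bookings=("booked_with_flight_discount", "sum"),
    rate_per_session=("booked_with_flight_discount", "mean"),
)

# Share of bookings with a discount; undefined (NaN) for users without bookings.
per_user["rate_per_booking"] = per_user["discounted_bookings"] / per_user["booking_count"]
print(per_user)
```

For user 1 the two rates differ (one discounted booking out of four sessions versus one out of two bookings), so it matters which denominator later steps assume.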

In [15]:
# Removing duplicates
sessions_features = sessions_features.drop_duplicates(subset='user_id')

# Merging datasets to enrich features
sessions_features = sessions_features.merge(discount_aggregates, on='user_id', how='left')

3. Buckets

|Bucket|Definition|Rate|
|-------|-----------|--|
|No Booking & No Discount Exposure|Booking Conversion Rate of 0% and No Discounts|-|
|Not Responsive to Discounts|Booking Conversion Rate of 0% and Discounts|-|
|Non-sensitive|Books without any Discounts|0%|
|Mildly sensitive|Sometimes Uses Discounts|>0% - <50%|
|Highly sensitive|Books Almost Only With Discounts|50% - <100%|
|Only with discounts|Books Only With Discounts|100%|


In [16]:
# Recovering the number of bookings made with flight and hotel discounts per user.
# The discount booking rates are means over all sessions, so they are multiplied by
# session_count (not booking_count) and rounded to whole bookings.
# Assumes session_count counts the same sessions as df_sessions.
sessions_features['flight_discount_bookings'] = (
    sessions_features['flight_discount_booking_rate'] * sessions_features['session_count']
).round()

sessions_features['hotel_discount_bookings'] = (
    sessions_features['hotel_discount_booking_rate'] * sessions_features['session_count']
).round()

In [17]:
# Calculating total number of discounted bookings per user by summing flight and hotel discounts.
sessions_features['discounted_bookings'] = (
    sessions_features['flight_discount_bookings'] + sessions_features['hotel_discount_bookings']
).round(2)

In [18]:
# Calculating the overall discount booking rate per user.
# This represents the share of bookings made with either a flight or hotel discount.
sessions_features['discount_booking_rate'] = (
    sessions_features['discounted_bookings'] / sessions_features['booking_count']
).round(2)

In [19]:
# Assigning users to discount sensitivity buckets based on their booking behavior and discount exposure.
# Cases:
# - No Booking & No Discount Exposure: user never booked and was never shown discounts.
# - Not Responsive to Discounts: discounts were available, but the user did not book.
# - For users with bookings: calculates the share of bookings made with discounts and assigns sensitivity levels:
#     - Non-sensitive (0% discount bookings)
#     - Mildly sensitive (< 50% discount bookings)
#     - Highly sensitive (50-99% discount bookings)
#     - Only with discounts (100% discount bookings)
def assign_discount_sensitivity_bucket(row):
    if row['booking_count'] == 0 and \
       pd.isna(row['average_flight_discount']) and \
       pd.isna(row['average_hotel_discount']):
        return 'No Booking & No Discount Exposure'
    
    elif row['booking_count'] == 0 and (
        not pd.isna(row['average_flight_discount']) or not pd.isna(row['average_hotel_discount'])
    ):
        return 'Not Responsive to Discounts'
    
    else:
        
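        # The rates are means over sessions, so scale by session_count to recover
        # the number of discounted bookings before dividing by the booking count.
        total_discounted_bookings = np.round(
            (row['flight_discount_booking_rate'] + row['hotel_discount_booking_rate'])
            * row['session_count']
        )

        discount_booking_rate = total_discounted_bookings / row['booking_count']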
        
        if discount_booking_rate == 0:
            return 'Non-sensitive'
        elif discount_booking_rate < 0.5:
            return 'Mildly sensitive'
        elif discount_booking_rate < 1.0:
            return 'Highly sensitive'
        else:
            return 'Only with discounts'


sessions_features['discount_sensitivity_bucket'] = sessions_features.apply(
    assign_discount_sensitivity_bucket, axis=1
)
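
The threshold logic of the bucket function can be checked in isolation. The sketch below is a stand-alone helper (the name `sensitivity_from_share` is hypothetical) that takes a share of discounted bookings between 0 and 1 and applies the same cut-offs, so the boundary cases are easy to see.

```python
def sensitivity_from_share(share):
    """Map a share of discounted bookings (0..1) to a sensitivity label."""
    if share == 0:
        return "Non-sensitive"
    elif share < 0.5:
        return "Mildly sensitive"
    elif share < 1.0:
        return "Highly sensitive"
    return "Only with discounts"


# Boundary values: exactly 50% already counts as highly sensitive.
examples = {share: sensitivity_from_share(share) for share in (0.0, 0.25, 0.5, 0.99, 1.0)}
for share, label in examples.items():
    print(f"{share:.2f} -> {label}")
```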

**Average Days Between Sessions**

Here we calculate the average number of days between sessions per user.

In [1]:
# 1. Convert session_start to datetime.
df_sessions['session_start'] = pd.to_datetime(df_sessions['session_start'])

# 2. Sort sessions chronologically per user.
df_sessions_sorted = df_sessions.sort_values(by=['user_id', 'session_start'])

# 3. Compute day difference between consecutive sessions for each user.
df_sessions_sorted['days_between'] = df_sessions_sorted.groupby('user_id')['session_start'].diff().dt.days

# 4. Calculate the average days between sessions per user.
avg_days_between_sessions = df_sessions_sorted.groupby('user_id')['days_between'].mean().reset_index().round(2)

# 5. Rename column for clarity.
avg_days_between_sessions.rename(columns={'days_between': 'avg_days_between_sessions'}, inplace=True)

# Merging datasets to enrich features
sessions_features = sessions_features.merge(avg_days_between_sessions, on='user_id', how='left')
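
One detail of the steps above: `.dt.days` keeps only the whole-day part of each gap. The sketch below, on invented timestamps, compares that with fractional days computed via `total_seconds()`. Which of the two is preferable for this feature is a judgment call.

```python
import pandas as pd

# Invented sessions: user 1 has three sessions, user 2 only one.
demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "session_start": pd.to_datetime([
        "2023-01-01 09:00", "2023-01-03 21:00", "2023-01-10 09:00", "2023-02-01 12:00",
    ]),
}).sort_values(["user_id", "session_start"])

# Gap to the previous session of the same user (NaT for each user's first session).
gaps = demo.groupby("user_id")["session_start"].diff()

demo["days_between_whole"] = gaps.dt.days                     # truncated to whole days
demo["days_between_exact"] = gaps.dt.total_seconds() / 86400  # fractional days

summary = demo.groupby("user_id")[["days_between_whole", "days_between_exact"]].mean()
print(summary)
```

Users with a single session have no gap at all, so their average stays `NaN` in both variants.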


**Day Preference** 

Here we analyse whether a user has more sessions on weekdays or on the weekend.

In [21]:
# Categorizing each session as 'Weekday' or 'Weekend' based on the session_start date.
# Days 5 and 6 (Saturday, Sunday) are labeled as 'Weekend', others as 'Weekday'.
df_sessions['day_type'] = df_sessions['session_start'].dt.dayofweek.apply(
    lambda x: 'Weekend' if x >= 5 else 'Weekday'
)

In [22]:
day_type_counts = df_sessions.groupby(['user_id', 'day_type']).size().unstack(fill_value=0).reset_index()

In [23]:
# Classifying users based on their session activity:
# - 'Weekend User' if more sessions on weekends
# - 'Weekday User' if more sessions on weekdays
# - 'Balanced' if equal sessions on both
def classify_user(row): 
    if row['Weekend'] > row['Weekday']:
        return 'Weekend User'
    elif row['Weekday'] > row['Weekend']:
        return 'Weekday User'
    else: 
        return 'Balanced'
day_type_counts['session_day_type'] = day_type_counts.apply(classify_user, axis=1)

In [24]:
# Merging datasets to enrich features
sessions_features = sessions_features.merge(day_type_counts[['user_id', 'session_day_type']], on='user_id', how='left')
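
For illustration, the day-preference steps can be run end to end on a handful of invented sessions. This sketch uses `np.where`, `pd.crosstab` and `np.select` instead of `apply`; it is an alternative formulation under the same rules, not the code used above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 4, 4],
    "session_start": pd.to_datetime([
        "2023-01-02", "2023-01-03", "2023-01-07",  # Mon, Tue, Sat
        "2023-01-07", "2023-01-08",                # Sat, Sun
        "2023-01-04",                              # Wed
        "2023-01-05", "2023-01-08",                # Thu, Sun
    ]),
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
demo["day_type"] = np.where(demo["session_start"].dt.dayofweek >= 5, "Weekend", "Weekday")

# One row per user, one column per day type; reindex guards against a missing column.
counts = pd.crosstab(demo["user_id"], demo["day_type"]).reindex(
    columns=["Weekday", "Weekend"], fill_value=0
)

counts["session_day_type"] = np.select(
    [counts["Weekend"] > counts["Weekday"], counts["Weekday"] > counts["Weekend"]],
    ["Weekend User", "Weekday User"],
    default="Balanced",
)
print(counts)
```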

In [25]:
df_sessions["day_type"].value_counts()

day_type
Weekday    1807663
Weekend     707639
Name: count, dtype: int64

**Days Since Last Booking**

|Recency Bucket|Description|
|--------------|-----------|
|<= 7 days|Very Active|
|8–14 days|Recent|
|15–30 days|Semi-recent|
|31–90 days|Dormant|
|90+ days|At Risk|

In [26]:
latest_date = df_sessions['session_start'].max().normalize()

In [27]:
# Filter only bookings
bookings = df_sessions[df_sessions['booking_made'] == True]
# Get last booking per user
last_booking = bookings.groupby('user_id')['session_start'].max().reset_index()

last_booking.rename(columns={'session_start': 'last_booking_date'}, inplace=True)

In [28]:
# Calculating the number of days since each user's last booking.
# Subtracting the normalized last_booking_date from the latest session date.
last_booking['days_since_last_booking'] = (
    latest_date - last_booking['last_booking_date'].dt.normalize()
).dt.days

In [29]:
def bucket_recency(days):
    if days <= 7:
        return '<= 7 days'
    elif days <= 14:
        return '8–14 days'
    elif days <= 30:
        return '15–30 days'
    elif days <= 90:
        return '31–90 days'
    else:
        return '90+ days'

last_booking['recency_bucket'] = last_booking['days_since_last_booking'].apply(bucket_recency)

In [30]:
# Merging datasets
sessions_features = sessions_features.merge(
    last_booking[['user_id', 'days_since_last_booking', 'recency_bucket']],
    on='user_id',
    how='left'
)

In [31]:
sessions_features['recency_bucket'] = sessions_features['recency_bucket'].fillna('No Booking')
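
The recency buckets can also be expressed with `pd.cut`, which makes the interval edges explicit. The sketch below runs on made-up day counts and reuses the labels from the table; a missing value stands in for a user without bookings.

```python
import numpy as np
import pandas as pd

# Made-up values, including every bucket boundary and one user without a booking.
days = pd.Series([0, 7, 8, 14, 30, 31, 90, 91, np.nan])

buckets = pd.cut(
    days,
    bins=[-np.inf, 7, 14, 30, 90, np.inf],  # right-closed: (-inf, 7], (7, 14], ...
    labels=["<= 7 days", "8–14 days", "15–30 days", "31–90 days", "90+ days"],
)

# Missing values get their own label, mirroring the fillna step above.
buckets = buckets.cat.add_categories("No Booking").fillna("No Booking")
print(pd.DataFrame({"days": days, "bucket": buckets}))
```

Note that exactly 90 days still falls into `31–90 days`; the `90+ days` bucket starts at 91.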

### Flights


Metric|Description
------|-----------
flight_booking_count|Total number of flight bookings
total_flight_spent|Total amount spent on flights
avg_seats_booked|Average number of seats booked per trip


In [32]:
# Reading CSV data file into a DataFrame
df_flights = pd.read_csv('df_flights.csv')

### Flights Features

In [34]:
# Reading CSV data file into a DataFrame
flights_features = pd.read_csv('flights_features.csv')

#### "scaled_ADS_per_km"


In [44]:
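# Step 1: Adding user_id to flights data by merging with sessions via trip_id.
# Sessions without a trip are dropped and each trip_id is kept once, so that
# trips that appear in more than one session do not duplicate flight rows.
trip_users = df_sessions[["trip_id", "user_id"]].dropna(subset=["trip_id"]).drop_duplicates(subset="trip_id")
df_flights = df_flights.merge(
    trip_users,
    on="trip_id",
    how="left"
)

# Step 2: Adding origin airport coordinates from users data to flights.
# df_users is loaded in the Users section below; run that cell first.
df_flights = df_flights.merge(
    df_users[["user_id", "home_airport_lat", "home_airport_lon"]],
    on="user_id",
    how="left"
)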

In [46]:
# Function to calculate the great-circle distance between two coordinates using the Haversine formula.
# Returns distance in kilometers.
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth's radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])  # Convert degrees to radians
    dlat = lat2 - lat1  # Latitude difference
    dlon = lon2 - lon1  # Longitude difference
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2  # Haversine formula part
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))  # Angular distance in radians
    return R * c  # Distance in km

# Calculating the flight distance in kilometers for each trip using home and destination airport coordinates.
df_flights["flight_distance_km"] = haversine(
    df_flights["home_airport_lat"], df_flights["home_airport_lon"],
    df_flights["destination_airport_lat"], df_flights["destination_airport_lon"]
)

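# Calculating the total amount saved in dollars through flight discounts per user.
# Multiplying total flight spending by the average flight discount rate.
# The discount is looked up by user_id; multiplying columns of different tables
# directly would align them by row index instead of by user.
avg_flight_discount = sessions_features.set_index("user_id")["average_flight_discount"]
flights_features["flight_discount_dollars_saved"] = (
    flights_features["total_flight_spent"] * flights_features["user_id"].map(avg_flight_discount)
)

# Calculating the scaled dollars saved per kilometer.
# Measures how much discount a user saves per kilometer flown,
# using the total distance flown per user.
total_km_per_user = df_flights.groupby("user_id")["flight_distance_km"].sum()
flights_features["scaled_ADS_per_km"] = (
    flights_features["flight_discount_dollars_saved"]
    / flights_features["user_id"].map(total_km_per_user)
)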

# Cleaning up extreme values:
# Replace infinite values with NaN, then drop rows where scaled_ADS_per_km could not be calculated.
flights_features = flights_features.replace([np.inf, -np.inf], np.nan)
flights_features = flights_features.dropna(subset=["scaled_ADS_per_km"])
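
As a sanity check on the Haversine formula used above, the sketch below re-implements it as a stand-alone function (named `haversine_km` here to avoid clashing with the notebook's helper) and evaluates it on cases whose answer is known analytically: identical points, one degree of longitude on the equator, and a quarter of a great circle.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * np.arcsin(np.sqrt(a))


same_point = haversine_km(52.5, 13.4, 52.5, 13.4)  # identical coordinates
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)      # 1 degree along the equator
quarter = haversine_km(0.0, 0.0, 90.0, 0.0)        # equator to the pole

print(same_point, one_degree, quarter)
```

One degree on the equator should equal `radius_km * pi / 180` and the quarter circle `radius_km * pi / 2`. The `arcsin` form used here is mathematically equivalent to the `arctan2` form in the notebook.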

## Bargain Seekers

Bargain Seekers are users who are highly price-sensitive and tend to respond well to discounts and promotional offers.

To identify this segment, we created a Bargain Index combining the following metrics:
- **flight_discount_booking_rate**: Share of bookings made with flight discounts.
- **average_flight_discount**: Average discount percentage applied to flight bookings.
- **scaled_ADS_per_km**: Dollars saved per kilometer flown, adjusted to user behavior.

Users with a high Bargain Index are those who consistently benefit from discounts and prioritize lower prices when making travel decisions.

For segmentation, we selected the **top 10% of users with the highest Bargain Index** to receive perks that match their discount-driven behavior.
These perks include **exclusive flight and hotel discount offers** targeted to encourage repeat bookings.

#### Bargain Index

In [48]:
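# Session-based metrics are looked up by user_id so that each row of
# flights_features is combined with the values of the same user.
sessions_by_user = sessions_features.set_index("user_id")
flights_features["bargain_index"] = (
    flights_features["user_id"].map(sessions_by_user["flight_discount_booking_rate"]) *
    flights_features["user_id"].map(sessions_by_user["average_flight_discount"]) *
    flights_features["scaled_ADS_per_km"]
)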

In [49]:
# 90th percentile cutoff
cutoff_90 = flights_features["bargain_index"].quantile(0.90)

flights_features["bargain_perk_segment"] = (flights_features["bargain_index"] >= cutoff_90)
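
A small worked example of the index and the top-10% cut-off, on invented numbers for four users. Both toy tables are indexed by `user_id`, so pandas matches the rows by user even though they are stored in a different order; the table names `sessions_demo` and `flights_demo` are hypothetical.

```python
import pandas as pd

# Invented per-user metrics; note the different row order in the two tables.
sessions_demo = pd.DataFrame({
    "user_id": [3, 1, 2, 4],
    "flight_discount_booking_rate": [0.50, 0.10, 0.00, 0.25],
    "average_flight_discount": [0.20, 0.10, 0.15, 0.20],
}).set_index("user_id")

flights_demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "scaled_ADS_per_km": [0.5, 1.0, 2.0, 1.0],
}).set_index("user_id")

# Series arithmetic aligns on the index (user_id), not on row position.
bargain_index = (
    sessions_demo["flight_discount_booking_rate"]
    * sessions_demo["average_flight_discount"]
    * flights_demo["scaled_ADS_per_km"]
)

cutoff = bargain_index.quantile(0.90)
segment = bargain_index >= cutoff
print(bargain_index, cutoff, segment, sep="\n\n")
```

Because the index is a plain product, a user with a zero in any of the three components gets an index of 0, regardless of the other two.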

### Hotels

Metric|Description
------|-----------
hotels_booking_count|Total number of hotel bookings
total_hotel_spent|Total amount spent on hotels
avg_nights_booked|Average number of nights booked per stay


In [50]:
# Reading CSV data file into a DataFrame
df_hotels = pd.read_csv('df_hotels.csv')

### Hotels Features

In [52]:
# Reading CSV data file into a DataFrame
hotels_features = pd.read_csv('hotels_features.csv')

### Users

Column|Description
------|-----------
user_id|Unique user identifier
birthdate|Date of birth
gender|Gender
married|Marital status
has_children|Whether the user has children
home_country|Country of residence
home_city|City of residence
sign_up_date|Date of account creation
age|Calculated age based on birthdate
age_bucket|Grouped age range 



In [37]:
# Reading CSV data file into a DataFrame
df_users = pd.read_csv('df_users.csv')

#### Joining Users Table and Feature Tables

In [53]:
users_features = df_users.copy()

# Merging datasets
users_features = users_features.merge(sessions_features, on="user_id", how="left")
users_features = users_features.merge(flights_features, on="user_id", how="left")
users_features = users_features.merge(hotels_features, on="user_id", how="left")
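
When joining feature tables like this, it can help to confirm that each table really has one row per user and to see which users found no match. The sketch below shows both checks on invented tables, using the `validate` and `indicator` options of `merge`.

```python
import pandas as pd

# Invented tables: user 2 has no flight features.
users_demo = pd.DataFrame({"user_id": [1, 2, 3], "home_country": ["usa", "canada", "usa"]})
flights_demo = pd.DataFrame({"user_id": [1, 3], "flight_booking_count": [2, 5]})

joined = users_demo.merge(
    flights_demo,
    on="user_id",
    how="left",
    validate="one_to_one",  # raises pandas.errors.MergeError if user_id repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Users kept by the left join but without a match in the feature table.
unmatched = joined.loc[joined["_merge"] == "left_only", "user_id"].tolist()
print(joined)
print("Users without flight features:", unmatched)
```

The `_merge` column would normally be dropped again before saving the result.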

## Save .CSV 

In [55]:
df_sessions.to_csv('df_sessions_2.csv', index=False)
df_flights.to_csv('df_flights_2.csv', index=False)
#df_hotels.to_csv('df_hotels.csv', index=False)
#df_users.to_csv('df_users.csv', index=False)

sessions_features.to_csv('sessions_features_2.csv', index=False)
flights_features.to_csv('flights_features_2.csv', index=False)
#hotel_agg.to_csv('hotels_features.csv', index=False)
users_features.to_csv('users_features.csv', index=False)