# TravelTide 3.0

In this notebook we load all tables from the `.csv` files saved in version 2.0 of this project. This shortens the loading time and keeps the features created in the previous version, so that we can build and aggregate further data on top of them.

$\rightarrow$ See Notion board for other tasks.

## Imports

In [1]:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text
from datetime import datetime
from collections import defaultdict
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## Data Upload

### Sessions

|Metric|Description|
|------|-----------|
|session_count|Total number of sessions|
|booking_count|Number of sessions where a booking was made|
|cancellations|Number of sessions where a cancellation occurred|
|avg_session_duration_min|Average session duration in minutes|
|booking_conversion_rate|Bookings as a share of total sessions (in %)|
|cancellation_rate|Ratio of cancellations to total bookings|
|explorer_bucket|User type based on session and booking behavior|
|flight_discount_booking_rate|Share of sessions in which a booking was made with a flight discount|
|hotel_discount_booking_rate|Share of sessions in which a booking was made with a hotel discount|
|average_flight_discount|Average discount amount applied to flights|
|average_hotel_discount|Average discount amount applied to hotels|
|discount_booking_rate|Combined rate across flight and hotel discounts|
|discount_sensitivity_bucket|User discount sensitivity label|
|avg_days_between_sessions|Average number of days between sessions|
|session_day_type|Categorization as ‘Weekend User’, ‘Weekday User’ or ‘Balanced’|
|recency_bucket|Time bucket since the user’s last booking|


In [8]:
df_sessions = pd.read_csv('df_sessions.csv')

In [9]:
df_sessions

Unnamed: 0,session_id,user_id,trip_id,session_start,session_end,flight_discount,hotel_discount,flight_discount_amount,hotel_discount_amount,flight_booked,hotel_booked,page_clicks,cancellation,booking_made,session_duration_min
0,73956-cfd4601ebfea4c198cd738d43cdc848f,73956,,2023-03-29 12:26:00,2023-03-29 12:27:07,True,False,0.10,,False,False,9,False,0,1.12
1,74042-363a73533e4b48138d66938ad17081c6,74042,74042-c0ccd5ba2a1b4d698e88fa5ce493afcb,2023-03-29 21:17:00,2023-03-29 21:19:24,False,False,,,True,True,19,False,1,2.40
2,75154-36e1e3698f354ba8b4fca0329fda968b,75154,75154-b07e18b3aa32428c921e414f13a173c9,2023-03-29 14:11:00,2023-03-29 14:14:39,True,False,0.30,,False,True,30,False,1,3.65
3,76778-aa70d12e56a34a5785fc5779272f5e52,76778,,2023-03-29 14:36:00,2023-03-29 14:37:08,True,False,0.10,,False,False,9,False,0,1.13
4,78511-d44b047bae774d788b3ff92b67af466d,78511,,2023-03-29 22:04:00,2023-03-29 22:05:34,False,False,,,False,False,13,False,0,1.57
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2515297,70148-1c2b516de152446c9df9b8cae357ae96,70148,70148-4595801f5f004fd9910d1168d0cd21f9,2023-03-29 20:31:00,2023-03-29 20:34:30,False,False,,,True,True,28,False,1,3.50
2515298,71006-cc40bd8e9c9d48bdb2a59e82e8aa7673,71006,,2023-03-29 15:34:00,2023-03-29 15:34:32,True,False,0.30,,False,False,4,False,0,0.53
2515299,71786-0bfa66e406b046e0a9e42c9b744f4465,71786,,2023-03-29 20:54:00,2023-03-29 20:54:08,True,False,0.15,,False,False,1,False,0,0.13
2515300,72507-694cfccd26534da09c5ed726c3c993dd,72507,72507-5f4ffde2e82d4bcd98027426372ae568,2023-03-29 01:42:00,2023-03-29 01:44:44,False,False,,,True,True,22,False,1,2.73
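
The `Unnamed: 0` column in the output above is most likely the row index that was written to the CSV when the file was saved. A minimal sketch, assuming a small in-memory CSV (the `csv_text` sample is hypothetical and only mimics the file layout), of how `index_col=0` and `parse_dates` could avoid the extra column and parse the timestamps at load time:

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the layout of df_sessions.csv
csv_text = (
    ",session_id,user_id,session_start,session_end\n"
    "0,a1,73956,2023-03-29 12:26:00,2023-03-29 12:27:07\n"
    "1,b2,74042,2023-03-29 21:17:00,2023-03-29 21:19:24\n"
)

# index_col=0 reads the first (unnamed) column as the index instead of as data
sample_sessions = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=["session_start", "session_end"],
)

print(sample_sessions.dtypes)
```

Writing the files with `index=False`, as the save cell at the end of this notebook does, avoids the extra column in the first place.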


### Sessions Features

In [10]:
sessions_features = pd.read_csv('sessions_features.csv')

In [11]:
sessions_features.columns

Index(['user_id', 'session_count', 'booking_count', 'cancellations',
       'avg_session_duration_min', 'booking_conversion_rate',
       'cancellation_rate'],
      dtype='object')

## New Features

**Sessions and Booking Behaviour**

|Bucket|Condition|
|------|---------|
|Unengaged|Fewer than 5 sessions and no bookings|
|Silent Explorer|5+ sessions and no bookings|
|Casual Explorer|Few sessions and at least 1 booking|
|Engaged Explorer|6-10 sessions and bookings|
|Power Explorer|11+ sessions and bookings|


In [12]:
def assign_bucket(row):
    if row['session_count'] >= 5 and row['booking_count'] == 0:
        return "Silent Explorer"
    # <= 5 so that users with exactly 5 sessions and a booking do not fall through to "Unengaged"
    elif row['session_count'] <= 5 and row['booking_count'] >= 1:
        return "Casual Explorer"
    elif 6 <= row['session_count'] <= 10 and row['booking_count'] >= 1:
        return "Engaged Explorer"
    elif row['session_count'] > 10 and row['booking_count'] >= 1:
        return "Power Explorer"
    else:
        return "Unengaged"  # users who fit none of the categories above

# Apply the function to every row
sessions_features['explorer_bucket'] = sessions_features.apply(assign_bucket, axis=1)
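
Row-wise `apply` can be slow on a table with several hundred thousand users. As a sketch only, following the bucket table above (the `toy` frame is made up, and reading "few sessions" as "up to 5" is an assumption), the same rules could be expressed with `np.select`, which evaluates the conditions column-wise and takes the first one that matches:

```python
import numpy as np
import pandas as pd

# Made-up users, one per bucket
toy = pd.DataFrame({
    "session_count": [3, 7, 4, 8, 12],
    "booking_count": [0, 0, 1, 2, 5],
})

has_booking = toy["booking_count"] >= 1
sessions = toy["session_count"]

conditions = [
    (sessions >= 5) & ~has_booking,          # Silent Explorer
    (sessions <= 5) & has_booking,           # Casual Explorer (assumption: "few" = up to 5)
    sessions.between(6, 10) & has_booking,   # Engaged Explorer
    (sessions > 10) & has_booking,           # Power Explorer
]
labels = ["Silent Explorer", "Casual Explorer", "Engaged Explorer", "Power Explorer"]

# Rows that match no condition get the default label
toy["explorer_bucket"] = np.select(conditions, labels, default="Unengaged")
print(toy)
```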

**Discounts** \
This part analyses the behavior of the users with respect to discounts.

1. First we flagged the flight and hotel bookings that had a discount: if `flight_discount`/`hotel_discount` is **true** and a booking was made, the session gets 1, otherwise 0.
2. Then we averaged the discounts given per user for both *flights* and *hotels*.
   - For many sessions a discount is flagged, but the discount amount column is empty. These missing values are left out of the average, since counting them as 0 would distort it.
3. Finally we created buckets based on the discount and booking behavior of the users.

1. Calculate if a booking was made using a discount

In [13]:
df_sessions['booked_with_flight_discount'] = (
    (df_sessions['flight_discount'] == True) & (df_sessions['booking_made'] == True)
).astype(int)

df_sessions['booked_with_hotel_discount'] = (
    (df_sessions['hotel_discount'] == True) & (df_sessions['booking_made'] == True)
).astype(int)

2. Average discounts given per user for both *flights* and *hotels*. For some sessions where `flight_discount` or `hotel_discount` is **true**, the discount amount is `NaN`. For the calculation we leave these values as they are for now.

$\rightarrow$ `flight_discount_booking_rate` and `hotel_discount_booking_rate` are means over all of a user's sessions, so they give the share of sessions in which a booking was made with the respective discount.

In [14]:
# Aggregate per user
discount_aggregates = df_sessions.groupby('user_id').agg({
    'booked_with_flight_discount': 'mean',  # share of sessions booked with a flight discount
    'booked_with_hotel_discount': 'mean',   # share of sessions booked with a hotel discount
    'flight_discount_amount': 'mean',       # average flight discount (NaN values are skipped)
    'hotel_discount_amount': 'mean'         # average hotel discount (NaN values are skipped)
}).rename(columns={
    'booked_with_flight_discount': 'flight_discount_booking_rate',
    'booked_with_hotel_discount': 'hotel_discount_booking_rate',
    'flight_discount_amount': 'average_flight_discount',
    'hotel_discount_amount': 'average_hotel_discount'
}).reset_index().round(2)
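
Because the `mean` above is taken over all of a user's sessions, the resulting rates have sessions in the denominator. If a rate relative to bookings is wanted instead, one option is to restrict the data to booking sessions before averaging. A minimal sketch on a made-up frame (the `toy_sessions` data and the `rate_per_booking` name are hypothetical):

```python
import pandas as pd

toy_sessions = pd.DataFrame({
    "user_id":         [1, 1, 1, 2, 2],
    "flight_discount": [True, False, True, False, False],
    "booking_made":    [1, 1, 0, 1, 0],
})

toy_sessions["booked_with_flight_discount"] = (
    toy_sessions["flight_discount"] & (toy_sessions["booking_made"] == 1)
).astype(int)

# Share of all sessions (what a plain groupby-mean yields)
rate_per_session = toy_sessions.groupby("user_id")["booked_with_flight_discount"].mean()

# Share of booking sessions only
booked = toy_sessions[toy_sessions["booking_made"] == 1]
rate_per_booking = booked.groupby("user_id")["booked_with_flight_discount"].mean()

print(rate_per_session.round(2).to_dict())
print(rate_per_booking.round(2).to_dict())
```

In the toy data, user 1 booked twice and used a flight discount once, so the per-booking rate is one half while the per-session rate is one third.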

In [15]:
# Prepare sessions_features: remove duplicate rows per user
sessions_features = sessions_features.drop_duplicates(subset='user_id')

# Merge the aggregated discount information
sessions_features = sessions_features.merge(discount_aggregates, on='user_id', how='left')

3. Buckets

|Bucket|Definition|Rate|
|-------|-----------|--|
|No Booking & No Discount Exposure|Booking conversion rate of 0% and no discounts|-|
|Not Responsive to Discounts|Booking conversion rate of 0% and discounts|-|
|Non-sensitive|Books without any discounts|0%|
|Mildly sensitive|Sometimes uses discounts|>0% - 50%|
|Highly sensitive|Books almost only with discounts|>50% - <100%|
|Only with discounts|Books only with discounts|100%|


In [16]:
sessions_features['flight_discount_bookings'] = (
    sessions_features['flight_discount_booking_rate'] * sessions_features['booking_count']
).round(2)

sessions_features['hotel_discount_bookings'] = (
    sessions_features['hotel_discount_booking_rate'] * sessions_features['booking_count']
).round(2)

In [17]:
sessions_features['discounted_bookings'] = (
    sessions_features['flight_discount_bookings'] + sessions_features['hotel_discount_bookings']
).round(2)

In [18]:
sessions_features['discount_booking_rate'] = (
    sessions_features['discounted_bookings'] / sessions_features['booking_count']
).round(2)

In [19]:
def assign_discount_sensitivity_bucket(row):
    # Case 1: no bookings and no discount amount on record
    if row['booking_count'] == 0 and \
       pd.isna(row['average_flight_discount']) and \
       pd.isna(row['average_hotel_discount']):
        return 'No Booking & No Discount Exposure'

    # Case 2: no bookings, but discounts were offered
    elif row['booking_count'] == 0 and (
        not pd.isna(row['average_flight_discount']) or not pd.isna(row['average_hotel_discount'])
    ):
        return 'Not Responsive to Discounts'

    # Case 3: bookings exist → compute the discount booking rate
    else:
        # Combined discount booking rate (booking_count cancels out, leaving the sum of both rates)
        total_discounted_bookings = (
            row['flight_discount_booking_rate'] + row['hotel_discount_booking_rate']
        ) * row['booking_count']

        discount_booking_rate = total_discounted_bookings / row['booking_count']

        # Buckets based on the discount booking rate
        if discount_booking_rate == 0:
            return 'Non-sensitive'
        elif discount_booking_rate <= 0.5:  # the bucket table puts exactly 50% in 'Mildly sensitive'
            return 'Mildly sensitive'
        elif discount_booking_rate < 1.0:
            return 'Highly sensitive'
        else:
            return 'Only with discounts'

# Apply to the sessions_features DataFrame
sessions_features['discount_sensitivity_bucket'] = sessions_features.apply(
    assign_discount_sensitivity_bucket, axis=1
)
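
The combined rate used in the function reduces to the sum of the flight and hotel rates, since `booking_count` cancels out. The short check below uses the values shown for user 1269 in the final table of this notebook (rates 0.0 and 0.67 with 2 bookings) plus one made-up user, to illustrate that the sum can exceed 1 when bookings use both kinds of discount:

```python
import pandas as pd

check = pd.DataFrame({
    "user_id": [1269, -1],                        # -1 is a made-up user
    "flight_discount_booking_rate": [0.0, 1.0],
    "hotel_discount_booking_rate": [0.67, 1.0],
    "booking_count": [2, 3],
})

# Same arithmetic as in the notebook: (flight rate + hotel rate) * bookings / bookings
discounted = (
    check["flight_discount_booking_rate"] + check["hotel_discount_booking_rate"]
) * check["booking_count"]
check["discount_booking_rate"] = (discounted / check["booking_count"]).round(2)

print(check[["user_id", "discount_booking_rate"]])
```

Values above 1 end up in the 'Only with discounts' bucket, which is worth keeping in mind when reading that label.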

**Average Days Between Sessions**

In [20]:
# Make sure session_start is a datetime type
df_sessions['session_start'] = pd.to_datetime(df_sessions['session_start'])

# Sort by user_id and session_start
df_sessions_sorted = df_sessions.sort_values(by=['user_id', 'session_start'])

# For each user: difference between consecutive session start times, in whole days
df_sessions_sorted['days_between'] = df_sessions_sorted.groupby('user_id')['session_start'].diff().dt.days

# Average gap per user
avg_days_between_sessions = df_sessions_sorted.groupby('user_id')['days_between'].mean().reset_index().round(2)
avg_days_between_sessions.rename(columns={'days_between': 'avg_days_between_sessions'}, inplace=True)

# Merge into the sessions_features table
sessions_features = sessions_features.merge(avg_days_between_sessions, on='user_id', how='left')
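
To make the calculation concrete, here is a small sketch on made-up timestamps (the `toy` frame is hypothetical). Note that `.dt.days` keeps only whole days, so a gap of 1 day and 23 hours counts as 1; dividing `total_seconds()` by 86400 would keep the fraction.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "session_start": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-03 09:00", "2023-01-10 09:00", "2023-02-01 12:00",
    ]),
}).sort_values(["user_id", "session_start"])

# Gap to the previous session of the same user (NaT for a user's first session)
gaps = toy.groupby("user_id")["session_start"].diff()

toy["days_between"] = gaps.dt.days                           # whole days only
toy["days_between_exact"] = gaps.dt.total_seconds() / 86400  # fractional days

avg_gap = toy.groupby("user_id")[["days_between", "days_between_exact"]].mean()
print(avg_gap)
```

Users with a single session have no gap at all, so their average stays `NaN` after the merge.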

**Day Preference** 

Here we analysed whether a user has more sessions on weekdays or on the weekend.

In [21]:
df_sessions['day_type'] = df_sessions['session_start'].dt.dayofweek.apply(
    lambda x: 'Weekend' if x >= 5 else 'Weekday'
)

In [22]:
day_type_counts = df_sessions.groupby(['user_id', 'day_type']).size().unstack(fill_value=0).reset_index()

In [23]:
def classify_user(row): 
    if row['Weekend'] > row['Weekday']:
        return 'Weekend User'
    elif row['Weekday'] > row['Weekend']:
        return 'Weekday User'
    else: 
        return 'Balanced'

day_type_counts['session_day_type'] = day_type_counts.apply(classify_user, axis=1)

In [24]:
sessions_features = sessions_features.merge(day_type_counts[['user_id', 'session_day_type']], on='user_id', how='left')

In [25]:
df_sessions["day_type"].value_counts()

day_type
Weekday    1807663
Weekend     707639
Name: count, dtype: int64
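
The same classification can be sketched without row-wise `apply`. The example below is illustrative only: the `toy` sessions are made up, and `np.sign` on the difference of the two counts is just one way to express the three-way comparison.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3],
    "session_start": pd.to_datetime([
        "2023-03-29", "2023-03-30", "2023-04-01",   # Wed, Thu, Sat
        "2023-04-01", "2023-04-02",                 # Sat, Sun
        "2023-03-29", "2023-04-01",                 # Wed, Sat
    ]),
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
toy["day_type"] = np.where(toy["session_start"].dt.dayofweek >= 5, "Weekend", "Weekday")

counts = pd.crosstab(toy["user_id"], toy["day_type"])
# reindex guards against a missing column when no session falls on that day type
counts = counts.reindex(columns=["Weekday", "Weekend"], fill_value=0)

diff = np.sign(counts["Weekend"] - counts["Weekday"])
counts["session_day_type"] = diff.map({1: "Weekend User", -1: "Weekday User", 0: "Balanced"})
print(counts)
```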

**Days Since Last Booking**

|Recency Bucket|Description|
|--------------|-----------|
|<= 7 days|Very Active|
|8–14 days|Recent|
|15–30 days|Semi-recent|
|31–90 days|Dormant|
|90+ days|At Risk|

In [26]:
latest_date = df_sessions['session_start'].max().normalize()

In [27]:
# Filter only bookings
bookings = df_sessions[df_sessions['booking_made'] == True]

# Get last booking per user
last_booking = bookings.groupby('user_id')['session_start'].max().reset_index()
last_booking.rename(columns={'session_start': 'last_booking_date'}, inplace=True)

In [28]:
last_booking['days_since_last_booking'] = (
    latest_date - last_booking['last_booking_date'].dt.normalize()
).dt.days

In [29]:
def bucket_recency(days):
    if days <= 7:
        return '<= 7 days'
    elif days <= 14:
        return '8–14 days'
    elif days <= 30:
        return '15–30 days'
    elif days <= 90:
        return '31–90 days'
    else:
        return '90+ days'

last_booking['recency_bucket'] = last_booking['days_since_last_booking'].apply(bucket_recency)

In [30]:
sessions_features = sessions_features.merge(
    last_booking[['user_id', 'days_since_last_booking', 'recency_bucket']],
    on='user_id',
    how='left'
)

In [31]:
sessions_features['recency_bucket'] = sessions_features['recency_bucket'].fillna('No Booking')
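
The recency rules can also be written with `pd.cut`, whose bins are right-inclusive by default and therefore match the `<=` comparisons in `bucket_recency`. A sketch on made-up values (the `days` series is hypothetical); missing values stay missing and can be filled with 'No Booking' afterwards:

```python
import numpy as np
import pandas as pd

days = pd.Series([0, 7, 8, 30, 31, 90, 91, np.nan])

bins = [-np.inf, 7, 14, 30, 90, np.inf]
labels = ["<= 7 days", "8–14 days", "15–30 days", "31–90 days", "90+ days"]

recency = pd.cut(days, bins=bins, labels=labels)        # intervals are (low, high]
recency = recency.astype(object).fillna("No Booking")   # leave the categorical dtype before filling
print(recency.tolist())
```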

### Flights


Metric|Description
------|-----------
flight_booking_count|Total number of flight bookings
total_flight_spent|Total amount spent on flights
avg_seats_booked|Average number of seats booked per trip


In [32]:
df_flights = pd.read_csv('df_flights.csv')

In [47]:
df_flights.head()

Unnamed: 0,trip_id,origin_airport,destination,destination_airport,seats,return_flight_booked,departure_time,return_time,checked_bags,trip_airline,destination_airport_lat,destination_airport_lon,base_fare_usd,user_id,home_airport_lat,home_airport_lon,flight_distance_km,flight_discount_dollars_saved
0,74042-c0ccd5ba2a1b4d698e88fa5ce493afcb,YUL,Detroit,DET,1,True,2023-04-05 11:00:00,2023-04-18 11:00:00,1,Ryanair,42.409,-83.01,135.98,74042,45.468,-73.741,815.742843,
1,80855-b6aed08efb70447f8de08e381f2bc5be,YOW,Memphis,MEM,1,True,2023-04-03 15:00:00,2023-04-15 15:00:00,1,Frontier Airlines,35.042,-89.977,300.31,80855,45.323,-75.669,1663.925059,
2,82871-fdd087a2c3b84176bae94fa58a571b4e,YHU,Hurghada,HRG,2,True,2023-10-12 15:00:00,2023-11-01 15:00:00,1,Transaero Airlines,27.184,33.798,3417.46,82871,45.517,-73.417,9103.191134,
3,82909-7efa59e027b645f18910aef37be444f2,LNK,Jacksonville,JAX,1,True,2023-04-08 12:00:00,2023-04-13 12:00:00,1,Alitalia,30.494,-81.688,339.56,82909,40.851,-96.759,1778.510436,
4,84527-6bbfd82819b549b08f0a08659b0105d7,IAH,Nashville,BNA,1,True,2023-04-07 10:00:00,2023-04-10 10:00:00,1,Southwest Airlines,36.124,-86.678,223.57,84527,29.98,-95.34,1056.736341,


### Flights Features

In [34]:
flights_features = pd.read_csv('flights_features.csv')

#### "scaled_ADS_per_km"


In [44]:
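# Step 1: flights → sessions (to get the user_id of each trip)
# A trip_id may appear in more than one session row, so keep one row per trip
trip_users = df_sessions[["trip_id", "user_id"]].dropna(subset=["trip_id"]).drop_duplicates("trip_id")
df_flights = df_flights.merge(
    trip_users,
    on="trip_id",
    how="left"
)

# Step 2: flights → users (to get the coordinates of the home airport)
# df_users is loaded in the Users section (cell In [37])
df_flights = df_flights.merge(
    df_users[["user_id", "home_airport_lat", "home_airport_lon"]],
    on="user_id",
    how="left"
)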

In [46]:
# Compute the flight distance (great-circle distance in km)
def haversine(lat1, lon1, lat2, lon2):
    R = 6371
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return R * c

df_flights["flight_distance_km"] = haversine(
    df_flights["home_airport_lat"], df_flights["home_airport_lon"],
    df_flights["destination_airport_lat"], df_flights["destination_airport_lon"]
)

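# Discount in USD per user. The feature tables do not share a row index with
# each other or with df_flights, so the values are aligned on user_id first
# (assumes one row per user_id in both feature tables).
avg_discount = sessions_features.set_index("user_id")["average_flight_discount"]
total_spent = flights_features.set_index("user_id")["total_flight_spent"]
dollars_saved = total_spent * avg_discount
df_flights["flight_discount_dollars_saved"] = df_flights["user_id"].map(dollars_saved)

# scaled_ADS_per_km: dollars saved per km flown
# (assumption: the user's total flight distance is the denominator)
total_km = df_flights.groupby("user_id")["flight_distance_km"].sum()
flights_features["scaled_ADS_per_km"] = flights_features["user_id"].map(dollars_saved / total_km)

# Remove infinite and missing values
flights_features = flights_features.replace([np.inf, -np.inf], np.nan)
flights_features = flights_features.dropna(subset=["scaled_ADS_per_km"])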
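
As a quick sanity check of the distance formula, the sketch below recomputes the first row of `df_flights.head()` (home airport at 45.468, -73.741 and destination at 42.409, -83.01), which the table lists with a `flight_distance_km` of about 815.74. The function is repeated so that the snippet runs on its own.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    R = 6371
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Coordinates taken from the first row of the flights table
distance_km = haversine(45.468, -73.741, 42.409, -83.01)
print(round(float(distance_km), 1))
```

Because the function only uses NumPy ufuncs, it works unchanged on whole DataFrame columns, which is how the notebook calls it.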

#### Bargain Index

In [48]:
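# Compute bargain_index, aligning the session features on user_id
# (the two feature tables do not share a row index)
sf = sessions_features.set_index("user_id")
discount_score = sf["flight_discount_booking_rate"] * sf["average_flight_discount"]
flights_features["bargain_index"] = (
    flights_features["user_id"].map(discount_score) *
    flights_features["scaled_ADS_per_km"]
)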
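
When columns from two feature tables are combined, pandas aligns them on the row index, not on `user_id`. If the tables are sorted differently or contain different users, values from different users end up in the same product. A minimal sketch with made-up tables (`left` and `right` are hypothetical) that contrasts row-index alignment with alignment on `user_id`:

```python
import pandas as pd

left = pd.DataFrame({"user_id": [10, 20, 30], "rate": [0.5, 1.0, 0.0]})
right = pd.DataFrame({"user_id": [30, 10], "per_km": [2.0, 4.0]})

# Row-index alignment: row 0 of `left` (user 10) meets row 0 of `right` (user 30)
by_position = left["rate"] * right["per_km"]

# Alignment on user_id: look up each user's own value first
per_km_by_user = right.set_index("user_id")["per_km"]
by_user = left["rate"] * left["user_id"].map(per_km_by_user)

print(by_position.tolist())
print(by_user.tolist())
```

In the toy data, user 10 gets 0.5 × 2.0 under row-index alignment but 0.5 × 4.0 once the lookup goes through `user_id`.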

In [49]:
# 90th percentile cutoff
cutoff_90 = flights_features["bargain_index"].quantile(0.90)

# New column: Bargain Perk Segment (True/False)
flights_features["bargain_perk_segment"] = (
    flights_features["bargain_index"] >= cutoff_90
)
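
A small sketch of the percentile flag on made-up values (the `toy_index` series is hypothetical). `quantile` ignores `NaN`, and a `NaN` index compares as `False`, so users without a bargain index are never placed in the segment:

```python
import numpy as np
import pandas as pd

toy_index = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.nan])

cutoff = toy_index.quantile(0.90)   # linear interpolation over the 10 non-missing values
in_segment = toy_index >= cutoff    # NaN >= cutoff evaluates to False

print(cutoff, int(in_segment.sum()))
```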

### Hotels

Metric|Description
------|-----------
hotels_booking_count|Total number of hotel bookings
total_hotel_spent|Total amount spent on hotels
avg_nights_booked|Average number of nights booked per stay


In [50]:
df_hotels = pd.read_csv('df_hotels.csv')

In [51]:
df_hotels.head()

Unnamed: 0,trip_id,hotel_name,nights,rooms,check_in_time,check_out_time,hotel_per_room_usd,location
0,99955-8f761db158784003940e13ef517d9685,Aman Resorts,1,1,2023-04-07 21:11:06.045,2023-04-09 11:00:00,359.0,Austin
1,100370-681ac8cf157049e58469a596b90a1ed6,Crowne Plaza,4,1,2023-04-05 09:55:10.785,2023-04-09 11:00:00,76.0,Toronto
2,102366-40fa771da1b140c289c10cd222875eaf,Extended Stay,5,1,2023-04-09 11:00:00.000,2023-04-14 11:00:00,149.0,Toronto
3,102411-2006895270e0486ab6b2fa801ef66e96,Extended Stay,4,3,2023-04-08 13:07:04.350,2023-04-13 11:00:00,190.0,Tucson
4,103933-3a88027c51354990ab177c01c64545cb,NH Hotel,0,1,2023-04-05 19:21:53.055,2023-04-06 11:00:00,86.0,El Paso


### Hotels Features

In [52]:
hotels_features = pd.read_csv('hotels_features.csv')

### Users

Column|Description
------|-----------
user_id|Unique user identifier
birthdate|Date of birth
gender|Gender
married|Marital status
has_children|Whether the user has children
home_country|Country of residence
home_city|City of residence
sign_up_date|Date of account creation
age|Calculated age based on birthdate
age_bucket|Grouped age range 



In [37]:
df_users = pd.read_csv('df_users.csv')

#### Joining Users Table and Feature Tables

In [53]:
# Start with the users table
users_features = df_users.copy()

# Merge all feature tables
users_features = users_features.merge(sessions_features, on="user_id", how="left")
users_features = users_features.merge(flights_features, on="user_id", how="left")
users_features = users_features.merge(hotels_features, on="user_id", how="left")
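
Left joins like these silently multiply rows when a feature table holds more than one row per `user_id`. As a hedged sketch (the two small frames are made up), `merge` accepts a `validate` argument that raises a `MergeError` in that situation instead:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "home_city": ["Austin", "Toronto"]})
features = pd.DataFrame({"user_id": [1, 1, 2], "session_count": [3, 3, 5]})  # user 1 is duplicated

try:
    users.merge(features, on="user_id", how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# After removing the duplicate, the same merge keeps one row per user
clean = features.drop_duplicates(subset="user_id")
merged = users.merge(clean, on="user_id", how="left", validate="one_to_one")
print(duplicate_detected, len(merged))
```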

In [54]:
users_features

Unnamed: 0,user_id,birthdate,gender,married,has_children,home_country,home_city,home_airport,home_airport_lat,home_airport_lon,sign_up_date,age,age_bucket,session_count,booking_count,cancellations,avg_session_duration_min,booking_conversion_rate,cancellation_rate,explorer_bucket,flight_discount_booking_rate,hotel_discount_booking_rate,average_flight_discount,average_hotel_discount,flight_discount_bookings,hotel_discount_bookings,discounted_bookings,discount_booking_rate,discount_sensitivity_bucket,avg_days_between_sessions,session_day_type,days_since_last_booking,recency_bucket,flight_booking_count,total_flight_spent,avg_seats_booked,scaled_ADS_per_km,bargain_index,bargain_perk_segment,hotels_booking_count,total_hotel_spent,avg_nights_booked
0,440,1967-01-26,M,False,False,Usa,Long Beach,LGB,33.818,-118.151,2021-04-17,58,55-64,3,1,0,2.37,33.33,0.0,Casual Explorer,0.0,0.00,,,0.0,0.00,0.00,0.00,Non-sensitive,82.00,Weekday User,40.0,31–90 days,1.0,394.97,1.0,0.029058,0.001947,False,1.0,410.0,18.0
1,564,1986-07-06,F,False,False,Usa,New York,LGA,40.777,-73.872,2021-04-19,38,35-44,3,1,0,1.17,33.33,0.0,Casual Explorer,0.0,0.00,0.1,,0.0,0.00,0.00,0.00,Non-sensitive,30.50,Weekday User,181.0,90+ days,1.0,375.58,1.0,0.006077,0.000000,False,,,
2,1269,1991-08-21,F,False,False,Canada,Montreal,YMX,45.680,-74.039,2021-05-11,33,25-34,3,2,0,1.88,66.67,0.0,Casual Explorer,0.0,0.67,,0.1,0.0,1.34,1.34,0.67,Highly sensitive,91.50,Weekday User,15.0,15–30 days,,,,,,,1.0,159.0,2.0
3,1279,1966-11-15,F,True,True,Usa,San Antonio,SAT,29.534,-98.470,2021-05-11,58,55-64,3,1,0,1.79,33.33,0.0,Casual Explorer,0.0,0.00,,,0.0,0.00,0.00,0.00,Non-sensitive,67.50,Weekday User,25.0,15–30 days,,,,,,,1.0,220.0,1.0
4,4145,1965-10-12,M,True,True,Canada,Quebec,YQB,46.788,-71.398,2021-06-02,59,55-64,4,1,0,1.27,25.00,0.0,Casual Explorer,0.0,0.00,0.2,,0.0,0.00,0.00,0.00,Non-sensitive,26.67,Balanced,163.0,90+ days,,,,,,,1.0,279.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
611211,687002,1967-09-14,F,False,False,Usa,Chicago,MDW,41.786,-87.752,2023-03-23,57,55-64,4,0,0,1.70,0.00,0.0,Unengaged,0.0,0.00,0.1,,0.0,0.00,0.00,,Not Responsive to Discounts,1.33,Weekday User,,No Booking,,,,,,,,,
611212,687342,2000-12-14,M,False,False,Canada,Calgary,YYC,51.114,-114.020,2023-03-23,24,18-24,3,0,0,3.83,0.00,0.0,Unengaged,0.0,0.00,,,0.0,0.00,0.00,,No Booking & No Discount Exposure,1.00,Weekday User,,No Booking,,,,,,,,,
611213,689837,2004-12-06,M,False,False,Usa,Atlanta,ATL,33.640,-84.427,2023-03-24,20,18-24,3,0,0,0.38,0.00,0.0,Unengaged,0.0,0.00,,,0.0,0.00,0.00,,No Booking & No Discount Exposure,0.00,Weekend User,,No Booking,,,,,,,,,
611214,690058,1994-07-19,F,False,True,Usa,Colorado Springs,COS,38.806,-104.700,2023-03-24,30,25-34,3,0,0,0.95,0.00,0.0,Unengaged,0.0,0.00,,,0.0,0.00,0.00,,No Booking & No Discount Exposure,1.50,Weekday User,,No Booking,,,,,,,,,


## Save .CSV 

In [55]:
df_sessions.to_csv('df_sessions_2.csv', index=False)
df_flights.to_csv('df_flights_2.csv', index=False)
#df_hotels.to_csv('df_hotels.csv', index=False)
#df_users.to_csv('df_users.csv', index=False)

sessions_features.to_csv('sessions_features_2.csv', index=False)
flights_features.to_csv('flights_features_2.csv', index=False)
#hotel_agg.to_csv('hotels_features.csv', index=False)
users_features.to_csv('users_features.csv', index=False)