## ✈️ *TravelTide* - Customer Segmentation Project
---
### Business and Data Understanding

### 1. Introduction
This notebook introduces the analytical framework for the TravelTide segmentation project.  
The goal is to build a data-driven understanding of user behavior and prepare the foundation for a clustering model that will support the design of a personalized rewards program.

The analysis follows a structured approach:  
- clarify the business problem  
- review the available data sources  
- identify the fields relevant for modeling  
- outline how business needs map to measurable variables  

This ensures that the subsequent technical work is aligned with the strategic objectives of the company.

**Author**: Maria Petralia  
**Project**: TravelTide - Customer Segmentation & Perk Strategy  
**Context**: MasterSchool - Data Science Program  
**Date**: Feb 2026

### 2. Business Understanding

#### 2.1 Company Context
TravelTide is a young travel-booking platform founded in 2021, during a period of post-pandemic recovery.  
As a result, the available data covers a relatively short historical window (up to July 2023), making short-term engagement and conversion patterns particularly relevant for understanding user behavior.

Elena, the Head of Marketing, aims to improve customer retention by introducing a personalized rewards program.  
To design effective perks, she needs a clear view of how different users interact with the platform, how frequently they book, how they respond to discounts, and what types of trips they tend to take.

The business objective is therefore to identify distinct customer segments that reflect meaningful behavioral differences.  
These segments will guide the selection of perks that are relevant, actionable, and aligned with TravelTide's early-stage growth strategy.


### 3. Data Understanding

This phase focuses on reviewing the available data sources and assessing their relevance for the segmentation task.  
Understanding the structure, granularity, and quality of the data is essential before moving into feature engineering and modeling.

#### 3.1 Available Tables

The dataset includes four main tables that capture demographic information, browsing behavior, and booking activity on the TravelTide platform.  
Each table contributes data at a specific level of granularity.

#### users
Customer demographic and home-location information.
- user_id — unique user identifier (primary key)
- birthdate — date of birth
- gender — gender category
- married — marriage status
- has_children — whether the user has children
- home_country — country of residence
- home_city — city of residence
- home_airport — preferred home airport
- home_airport_lat / home_airport_lon — geographic coordinates of home airport
- sign_up_date — account creation date

#### sessions
Session-level browsing interactions. Only sessions with at least two clicks are included.
- session_id — unique session identifier (primary key)
- user_id — user identifier (foreign key)
- trip_id — identifier linking to flight or hotel bookings (foreign key)
- session_start / session_end — session timestamps
- flight_discount / hotel_discount — whether a discount was offered
- flight_discount_amount / hotel_discount_amount — discount percentage
- flight_booked / hotel_booked — whether a booking occurred in the session
- page_clicks — number of page clicks
- cancellation — whether the session was used to cancel a trip

#### flights
Flight booking information associated with a trip_id.
- trip_id — unique trip identifier (primary key)
- origin_airport — departure airport
- destination / destination_airport — destination city and airport
- seats — number of seats booked
- return_flight_booked — whether a return flight was booked
- departure_time / return_time — timestamps for departure and return
- checked_bags — number of checked bags
- trip_airline — airline used for the trip
- destination_airport_lat / destination_airport_lon — geographic coordinates of destination airport
- base_fare_usd — pre-discount airfare price

#### hotels
Hotel booking information associated with a trip_id.
- trip_id — unique trip identifier (primary key)
- hotel_name — hotel brand
- nights — number of nights stayed
- rooms — number of rooms booked
- check_in_time / check_out_time — timestamps for hotel stay
- hotel_per_room_usd — pre-discount price per room per night

#### 3.2 Key Fields for the Analysis
Not all fields contribute equally to the segmentation task.  
The following variables are particularly relevant for understanding user behavior and constructing meaningful features:

- **trip_id** — links sessions to bookings and enables conversion analysis.
- **page_clicks** — proxy for engagement and interaction intensity.
- **flight_booked / hotel_booked** — indicators of conversion within a session.
- **flight_discount_amount / hotel_discount_amount** — exposure to promotional incentives.
- **cancellation** — signals potential dissatisfaction or specific travel patterns.
- **base_fare_usd / hotel_per_room_usd** — support monetary feature creation.
- **nights / rooms / seats** — reflect trip characteristics and travel intensity.
- **home_airport_lat / home_airport_lon** — enable geographic or distance-based features.
- **sign_up_date** — allows derivation of account age and recency metrics.
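
As an illustration of the distance-based features mentioned above, the great-circle distance between a home airport and a destination airport can be computed with the haversine formula. This is a minimal sketch, not the project's confirmed method: the function name `haversine_km` is hypothetical, and the coordinates in the usage line are illustrative values rather than rows from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates only (roughly New York JFK and Los Angeles LAX).
distance = haversine_km(40.64, -73.78, 33.94, -118.41)
print(f"{distance:.0f} km")
```

In the dataset, the inputs would presumably be home_airport_lat / home_airport_lon from users and destination_airport_lat / destination_airport_lon from flights.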

#### 3.3 Analytical Implications

A clear understanding of the dataset supports several key decisions in the modeling workflow:

- identifying which variables can be used directly as features  
- determining which fields require transformation or aggregation  
- understanding how tables can be joined to create a unified customer-level dataset  
- recognizing behavioral signals that may differentiate user segments  
- anticipating potential data quality issues such as missing values or inconsistent granularity  

These insights guide the feature engineering process and ensure that the clustering model is built on reliable, interpretable, and behaviorally meaningful information.
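
The join logic can be sketched with an in-memory SQLite database. This is only an illustration under assumptions: the tables are reduced to a few columns, the column types are guesses, and every row is made up. It shows how a LEFT JOIN on trip_id keeps sessions without a booking while attaching flight and hotel details where they exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id TEXT PRIMARY KEY, user_id INTEGER, trip_id TEXT,
                       flight_booked INTEGER, hotel_booked INTEGER);
CREATE TABLE flights (trip_id TEXT PRIMARY KEY, seats INTEGER, base_fare_usd REAL);
CREATE TABLE hotels (trip_id TEXT PRIMARY KEY, nights INTEGER, rooms INTEGER,
                     hotel_per_room_usd REAL);

-- Made-up rows for illustration only.
INSERT INTO sessions VALUES
    ('s1', 1, 't1', 1, 1), ('s2', 1, NULL, 0, 0), ('s3', 2, 't2', 1, 0);
INSERT INTO flights VALUES ('t1', 2, 300.0), ('t2', 1, 150.0);
INSERT INTO hotels VALUES ('t1', 3, 1, 120.0);
""")

# One row per session; flight and hotel columns are NULL when nothing was booked.
rows = conn.execute("""
SELECT s.session_id, s.user_id, s.trip_id, f.base_fare_usd, h.nights
FROM sessions AS s
LEFT JOIN flights AS f ON f.trip_id = s.trip_id
LEFT JOIN hotels  AS h ON h.trip_id = s.trip_id
ORDER BY s.session_id
""").fetchall()

for row in rows:
    print(row)
```

One caveat: a session flagged as a cancellation may carry the same trip_id as the session in which the trip was booked. If so, the trip details would appear on both rows and would need handling before any aggregation to the customer level.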

### 4. Mapping Business Needs to Data 
The goal of this project is to support the design of a personalized rewards program by identifying distinct groups of TravelTide users based on their behavior. To achieve this, the business requirements must be translated into measurable variables that can be derived from the available data.

#### 4.1 Business Requirements

Elena aims to:
- improve customer retention through a personalized rewards program  
- understand which types of perks different users may value  
- identify behavioral patterns that distinguish high-value, occasional, and discount-sensitive customers  

These objectives require a data-driven view of how users interact with the platform and how their browsing and booking behavior varies across the customer base.

#### 4.2 Data Requirements

To address these needs, the analysis must capture:
- **engagement** (how actively users interact with the platform)  
- **conversion behavior** (whether browsing sessions lead to bookings)  
- **travel patterns** (frequency, destinations, trip characteristics)  
- **spending behavior** (flight and hotel costs, trip size)  
- **discount sensitivity** (whether users book when promotions are offered)  
- **account history** (tenure, recency within a short historical window)  

These dimensions can be derived from the users, sessions, flights, and hotels tables.
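
As one illustration of the spending dimension, the sketch below estimates what a user paid for a trip. It rests on assumptions that the data dictionary does not confirm: discount amounts are stored as fractions (0.10 for 10%), a missing discount means no discount, and base_fare_usd is the fare for the whole booking rather than per seat. The function names are hypothetical.

```python
def flight_spend_usd(base_fare_usd, discount_amount=None):
    """Discounted flight spend; assumes base_fare_usd covers the whole booking."""
    # A missing discount (None) is treated as no discount.
    return base_fare_usd * (1.0 - (discount_amount or 0.0))


def hotel_spend_usd(hotel_per_room_usd, nights, rooms, discount_amount=None):
    """Discounted hotel spend from the per-room, per-night price."""
    return hotel_per_room_usd * nights * rooms * (1.0 - (discount_amount or 0.0))


# Made-up values for illustration only.
trip_total = (flight_spend_usd(300.0, discount_amount=0.10)
              + hotel_spend_usd(120.0, nights=3, rooms=1))
print(round(trip_total, 2))
```

If base_fare_usd turns out to be a per-seat price, the flight term would also need to be multiplied by seats.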

#### 4.3 Linking Requirements to Data Sources

The following mapping outlines how each analytical need connects to specific fields in the dataset:

- **Engagement**  
  Derived from session-level metrics such as page_clicks, number of sessions, and session duration.

- **Conversion Behavior**  
  Captured through flight_booked, hotel_booked, and the presence of a trip_id in sessions.

- **Travel Patterns**  
  Inferred from flights and hotels data, including seats, nights, rooms, destinations, and trip frequency.

- **Spending Behavior**  
  Based on base_fare_usd and hotel_per_room_usd, combined with trip characteristics.

- **Discount Sensitivity**  
  Measured through flight_discount_amount, hotel_discount_amount, and whether bookings occur when discounts are offered.

- **Account History**  
  Derived from sign_up_date and the timestamps of sessions and bookings.
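
Part of this mapping can be sketched as a per-user aggregation of session rows. This is a minimal illustration under assumptions: the session rows are made up, the timestamp format is a guess, and the feature names (`conversion_rate`, `discount_booking_rate`, and so on) are hypothetical definitions rather than the ones the project will adopt.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Made-up session rows for illustration only; keys mirror the sessions table.
sessions = [
    {"user_id": 1, "session_start": "2023-01-05 10:00:00",
     "session_end": "2023-01-05 10:04:00", "page_clicks": 12,
     "flight_discount": True, "hotel_discount": False,
     "flight_booked": True, "hotel_booked": True},
    {"user_id": 1, "session_start": "2023-02-01 09:00:00",
     "session_end": "2023-02-01 09:02:00", "page_clicks": 4,
     "flight_discount": False, "hotel_discount": False,
     "flight_booked": False, "hotel_booked": False},
    {"user_id": 2, "session_start": "2023-03-10 18:30:00",
     "session_end": "2023-03-10 18:31:00", "page_clicks": 6,
     "flight_discount": True, "hotel_discount": True,
     "flight_booked": False, "hotel_booked": False},
]


def user_features(rows):
    """Aggregate session rows into engagement, conversion, and discount features."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["user_id"]].append(row)

    features = {}
    for user_id, user_rows in grouped.items():
        n = len(user_rows)
        durations = [
            (datetime.strptime(r["session_end"], FMT)
             - datetime.strptime(r["session_start"], FMT)).total_seconds()
            for r in user_rows
        ]
        booked = [r["flight_booked"] or r["hotel_booked"] for r in user_rows]
        offered = [r["flight_discount"] or r["hotel_discount"] for r in user_rows]
        n_offered = sum(offered)
        n_booked_when_offered = sum(b and o for b, o in zip(booked, offered))
        features[user_id] = {
            "n_sessions": n,
            "avg_page_clicks": sum(r["page_clicks"] for r in user_rows) / n,
            "avg_session_seconds": sum(durations) / n,
            "conversion_rate": sum(booked) / n,
            # Undefined (None) for users who were never offered a discount.
            "discount_booking_rate": (
                n_booked_when_offered / n_offered if n_offered else None
            ),
        }
    return features


features = user_features(sessions)
for user_id, values in features.items():
    print(user_id, values)
```

Cancellation sessions are not treated separately here. In practice they would probably need to be excluded or flagged before computing conversion rates.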

#### 4.4 Analytical Implications

This mapping ensures that the feature engineering process focuses on variables that directly support the business objective.  
By aligning the analytical work with the needs of the rewards program, the resulting customer segments will be interpretable, actionable, and relevant for designing personalized perks.

### 5. Next Steps

The next notebook will begin the Data Understanding and Preparation phase by focusing on the SQL exploration of the raw tables.  
This includes:

- inspecting the raw tables using SQL to confirm data quality and consistency  
- evaluating missing values, duplicates, and potential data integrity issues  
- applying the filtering criteria suggested by Elena to check whether the data supports her hypothesis that some customer groups would be particularly interested in the proposed perks  
- aggregating session-level information to build a unified session-level dataset
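
The first two checks in the list above might look like the following sketch, which runs SQL against an in-memory SQLite database. The rows are made up and include a deliberate duplicate; the actual database engine and schema used in the next notebook may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id TEXT, user_id INTEGER, trip_id TEXT,
                       page_clicks INTEGER);

-- Made-up rows for illustration only, including a deliberate duplicate.
INSERT INTO sessions VALUES
    ('s1', 1, 't1', 10), ('s1', 1, 't1', 10), ('s2', 2, NULL, 3);
""")

# Missing values: COUNT(col) skips NULLs, so the difference is the NULL count.
missing_trip_ids = conn.execute(
    "SELECT COUNT(*) - COUNT(trip_id) FROM sessions"
).fetchone()[0]

# Duplicates: session_id values that appear more than once.
duplicate_session_ids = conn.execute("""
SELECT session_id, COUNT(*) AS n_rows
FROM sessions
GROUP BY session_id
HAVING COUNT(*) > 1
""").fetchall()

print("sessions without trip_id:", missing_trip_ids)
print("duplicated session_id values:", duplicate_session_ids)
```

A missing trip_id is not necessarily a data quality problem: it is the expected value for a browsing session that did not lead to a booking.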
