# ✈️Flight Selection Analysis and Prediction — *FlightRank 2025*

This notebook tackles a **ranking-based recommendation system** challenge applied to the context of **corporate travel**. The goal is to predict which flight a user (business traveler) is most likely to select among multiple available options in a search session.

> 📎 The complete project is available on [GitHub — FlightRank 2025](https://github.com/LeopoldoZanellato/FlightRank-2025)

---

## 🎯Problem and Objective

This is a **supervised learning problem with a ranking focus**, where observations are grouped by search sessions (`ranker_id`).

- Each `ranker_id` represents a real flight search.
- Each group contains multiple flight options.
- Exactly **one** of these options was selected (`selected = 1`).

Our goal is to **train a model that ranks the options correctly** so that the flight chosen by the user appears among the top-ranked options.

---

## 📏Evaluation Metric: HitRate@3

The official competition metric is **HitRate@3**, which checks whether the flight actually chosen appears among the **top 3 ranked options** for each search group.

The formula is:

`HitRate@3 = (1 / |Q|) * ∑ 𝟙(rank_i ≤ 3)`

Where:

- `|Q|` is the number of evaluated search sessions (with more than 10 flights).
- `rank_i` is the rank assigned by the model to the correct flight in session `i`.
- `𝟙(rank_i ≤ 3)` equals 1 if the flight is in the top 3, and 0 otherwise.

> **Important:** only sessions with **more than 10 flight options** are considered in the final metric.



In [1]:
import pandas as pd
import os
import subprocess
import zipfile
import matplotlib.pyplot as plt
import lightgbm as lgb
import numpy as np

from itertools import product
from sklearn.model_selection import GroupShuffleSplit
from itertools import chain
from sklearn.model_selection import GroupKFold

In [2]:
# resetando as configurações 
pd.reset_option('display.max_columns')

# setando as configurações de coluna maxima
pd.set_option('display.max_columns', None)

## ⚙️Download and Extraction of Data

Before starting the analysis, we need to ensure that the competition data is available locally.

This step performs:

1. **Verification**: checks whether the files have already been downloaded.
2. **Automatic download** via the Kaggle API (if necessary).
3. **Extraction** of the `.zip` files into the `data/aeroclub/` folder.

> The Kaggle API was configured using the `kaggle.json` file directly on Windows, without using the project directory.



In [3]:
def download_files():
    # Define caminhos
    zip_path = "data/aeroclub-recsys-2025.zip"
    extract_path = "data/aeroclub"
    
    # Cria a pasta base se necessário
    os.makedirs("data", exist_ok=True)

    # Verifica se o arquivo .zip já foi baixado
    if not os.path.exists(zip_path):
        print("🔽 Baixando arquivos da competição...")
        subprocess.run([
            "kaggle", "competitions", "download",
            "-c", "aeroclub-recsys-2025",
            "-p", "data"
        ])
    else:
        print("✅ Arquivo ZIP já existe. Pulando download.")

    # Verifica se os arquivos já foram extraídos
    if not os.path.exists(extract_path) or not os.listdir(extract_path):
        print("📦 Extraindo arquivos...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)
    else:
        print("✅ Arquivos já extraídos. Pulando extração.")

In [4]:
# Executa
download_files()

🔽 Baixando arquivos da competição...
✅ Arquivos já extraídos. Pulando extração.


## 📚 3. Data Loading

In this step, we load the **training dataset**, available in the `train.parquet` file.

This dataset contains complete information about **flight search sessions**, including:

- Flight and user identifiers  
- Company-related data  
- Route and schedule information  
- Total price and taxes  
- Cancellation and rebooking rules  
- Indication of which flight was selected (`selected = 1`)

> This will be the **main dataset** used for building, training, and validating the recommendation model.



In [5]:
train = pd.read_parquet("data/aeroclub/train.parquet")

In [6]:
def reduce_memory_usage(df):
    for col in df.columns:
        col_type = df[col].dtypes
        
        if col_type == 'float64':
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif col_type == 'int64':
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif col_type == 'object':
            num_unique = df[col].nunique()
            num_total = len(df[col])
            if num_unique / num_total < 0.5:
                df[col] = df[col].astype('category')
    
    return df
train = reduce_memory_usage(train)

In [7]:
df_train_raw = train.copy()

## 🧹 4. Selection of Relevant Columns

With the dataset loaded, the next step is to **select only the most relevant columns** for the baseline model.

The focus is on keeping variables that provide **useful information for flight recommendation**, including:

- Identifiers (`Id`, `ranker_id`, `profileId`, etc.)
- Passenger and company information
- Route details, timings, and connections
- Price data, taxes, and refund/exchange policies
- Flight segment details (airline, seat, baggage)
- Target variable: `selected` (indicates the flight chosen by the user)

> This step reduces dimensionality and improves model performance by eliminating irrelevant or redundant columns.

---

### Optional Sampling for Prototyping

During early experiments, it is common to work with a **reduced sample of the data** to speed up iteration. For that, we created a helper function that allows:

- Selecting only the desired columns  
- Optionally limiting the number of rows loaded

### Examples:

```python
# Load only a sample (1 million rows)
df_train = load_subset(df_train_raw, columns_to_keep, max_rows=1_000_000)

# Load the full dataset (all rows)
df_train = load_subset(df_train_raw, columns_to_keep)



In [8]:
# Define as colunas que você quer manter
columns_to_keep = [
    # Identifiers
    'Id',  # num
    'ranker_id', 
    'profileId', 
    'companyID',
    
    # User info
    'sex', 'nationality', 'frequentFlyer', 'isVip', 'bySelf', 'isAccess3D',

    # Company info
    'corporateTariffCode',

    # Search & route
    'searchRoute', 'requestDate',

    # Pricing
    'totalPrice', 'taxes',

    # Flight timing
    'legs0_departureAt', 'legs0_arrivalAt', 'legs0_duration',
    'legs1_departureAt', 'legs1_arrivalAt', 'legs1_duration',

    # Segment-level info (só do segmento 0 da ida para simplificar no baseline)
    'legs0_segments0_departureFrom_airport_iata',
    'legs0_segments0_arrivalTo_airport_iata',
    'legs0_segments0_arrivalTo_airport_city_iata',
    'legs0_segments0_marketingCarrier_code',
    'legs0_segments0_operatingCarrier_code',
    'legs0_segments0_aircraft_code',
    'legs0_segments0_flightNumber',
    'legs0_segments0_duration',
    'legs0_segments0_baggageAllowance_quantity',
    'legs0_segments0_baggageAllowance_weightMeasurementType',
    'legs0_segments0_cabinClass',
    'legs0_segments0_seatsAvailable', 
    'legs0_segments1_departureFrom_airport_iata',
    'legs0_segments2_departureFrom_airport_iata',
    'legs0_segments3_departureFrom_airport_iata',

    # Cancellation & exchange rules
    'miniRules0_monetaryAmount', 'miniRules0_percentage', 'miniRules0_statusInfos',
    'miniRules1_monetaryAmount', 'miniRules1_percentage', 'miniRules1_statusInfos',

    # Pricing policy
    'pricingInfo_isAccessTP', 'pricingInfo_passengerCount',

    # Target
    'selected'
]

# Filtra os dados para o baseline
def load_subset(df, columns,  max_rows=None):
    if max_rows:
        return df[columns].iloc[:max_rows].copy()
    else:
        return df[columns].copy()

# Exemplo de uso
df_train = load_subset(df_train_raw, columns_to_keep, max_rows=1_000_000) # ONLY 1M

#############################          IMPORTANT      ########################################
#############################          IMPORTANT      ########################################
#############################          IMPORTANT      ########################################
#############################          IMPORTANT      ########################################
#df_train = load_subset(df_train_raw, columns_to_keep) # ALL REGISTERS

## 🛠️ 5. Feature Engineering

In this step, we transform raw columns into more informative, consistent, and suitable variables for use in machine learning models.

The features will be built in subtopics, organized by type of transformation.

---

### 🧾 5.1 Data Type Correction (`dtypes`)

The first step is to ensure that data types are correct and optimized.

- Categorical columns initially encoded as `category` are analyzed and converted to:
  - `int` or `float`, when possible  
  - `bool`, if they contain only logical values  
  - `category`, in all other cases

- The `nationality` column, which arrives as an integer, is converted to `string` to preserve its categorical meaning.

> This standardization is essential to avoid errors and ensure that the model interprets variables correctly.

In [9]:
def fix_column_types(df):
    df_fixed = df.copy()
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # Tenta converter para tipo numérico
            try:
                df_fixed[col] = pd.to_numeric(df[col])
            except:
                # Se não for numérico, tenta bool
                unique_vals = df[col].dropna().unique()
                if set(unique_vals) <= {True, False}:
                    df_fixed[col] = df[col].astype(bool)
                else:
                    df_fixed[col] = df[col].astype(str)
    return df_fixed
df_train = fix_column_types(df_train)

# Ajusta a nacionalidade (está em Int)
df_train["nationality"] = df_train["nationality"].astype("str")
df_train['companyID'] = df_train['companyID'].astype('category')

df_train.dtypes  # Checar resultado

Id                                                                 int32
ranker_id                                                         object
profileId                                                          int32
companyID                                                       category
sex                                                                 bool
nationality                                                       object
frequentFlyer                                                     object
isVip                                                               bool
bySelf                                                              bool
isAccess3D                                                          bool
corporateTariffCode                                                Int64
searchRoute                                                       object
requestDate                                               datetime64[ns]
totalPrice                                         

### 🛫 5.2 Number of Segments in the Outbound Flight

In this step, we create variables related to the **structure of the outbound flight**, focusing on the number of connections.

#### What is being done:

- **`n_segments_ida`**: calculates the total number of segments in the outbound leg.
  - By definition, every flight has at least **segment 0** (origin to first destination).
  - If there are connections (segments 1, 2, or 3), this number is incremented.
  
- **`has_connections_ida`**: a boolean variable indicating whether the outbound flight has one or more connections.

- After extracting this information, the auxiliary columns for segments 1 to 3 are dropped, as they are no longer needed directly.

> These features help the model distinguish between direct and connecting flights, which may influence the corporate traveler's decision.

In [10]:
# 1. Cria a feature 'n_segments_ida' baseada na presença de segmentos 1, 2, 3
segments_ida_cols = {
    1: 'legs0_segments1_departureFrom_airport_iata',
    2: 'legs0_segments2_departureFrom_airport_iata',
    3: 'legs0_segments3_departureFrom_airport_iata'
}

# Começa com 1 porque o segmento 0 está sempre presente
df_train['n_segments_ida'] = 1

for seg, col in segments_ida_cols.items():
    df_train['n_segments_ida'] += df_train[col].notnull().astype(int)

# 2. Cria a flag booleana se há conexões (mais de 1 segmento na ida)
df_train['has_connections_ida'] = (df_train['n_segments_ida'] > 1).astype('boolean')

df_train.drop('legs0_segments1_departureFrom_airport_iata', inplace=True, axis=1)
df_train.drop('legs0_segments2_departureFrom_airport_iata', inplace=True, axis=1)
df_train.drop('legs0_segments3_departureFrom_airport_iata', inplace=True, axis=1)

### 🎫 5.3 Frequent Flyer Programs (`frequentFlyer`)

The `frequentFlyer` column indicates which loyalty programs the passenger is associated with. Since it contains multiple codes concatenated with "/", it needs to be processed into more informative and model-friendly features.

#### 🧠 Applied transformations:

- **`frequentFlyer_count`**: indicates how many frequent flyer programs the passenger participates in (number of codes separated by "/").

- **`hasFrequentFlyer`**: a binary variable indicating whether the passenger participates in at least one program.

- **`ff_XXX`**: individual boolean columns for each airline, representing whether the passenger is a member of that specific loyalty program.

- After feature extraction, the original `frequentFlyer` column is removed from the dataset, as its information is now decomposed into more specific variables.

> These features help capture the user's affinity with specific airlines, which can strongly influence their flight choice.

In [11]:
def count_frequent_flyers(value):
    if pd.isna(value):
        return 0
    return len(str(value).split('/'))

df_train['frequentFlyer_count'] = df_train['frequentFlyer'].apply(count_frequent_flyers)

# Cria flag binária para frequent flyer
df_train['hasFrequentFlyer'] = df_train['frequentFlyer'].notnull().astype(int)

# Substituir valores NaN por string vazia
ff_series = df_train['frequentFlyer'].fillna('').astype(str)

# Dividir por '/' para obter lista
ff_lists = ff_series.str.split('/')

all_programs = set(chain.from_iterable(ff_lists))
print(f"Total de companhias únicas: {len(all_programs)}")

df_train.drop('frequentFlyer', axis=1, inplace=True)

Total de companhias únicas: 41


### ⏰ 5.4 Processing Dates, Times, and Durations

In this step, we extract temporal information from date and time columns, and convert duration columns into numerical formats.

#### 📅 Applied transformations:

- **Datetime conversion**: the columns `requestDate`, `legs0_departureAt`, `legs0_arrivalAt`, `legs1_departureAt`, and `legs1_arrivalAt` are converted to `datetime` type.

- **Creation of new temporal variables**:
  - `legs0_dep_hour` / `legs1_dep_hour`: departure hour (outbound and return).
  - `legs0_dep_dayofweek` / `legs1_dep_dayofweek`: day of the week of departure.
  - `trip_days`: trip duration in days (return - outbound).
  - `booking_to_trip_days`: number of days between the search date and the departure.

- **Binary flags**:
  - `ida_fds` / `volta_fds`: indicates whether the flight occurs on a weekend.
  - `ida_comercial` / `volta_comercial`: indicates whether the flight occurs during business hours (between 07:00 and 19:00).

#### ⏱️ Conversion of durations to minutes

- The columns `legs0_duration` and `legs1_duration` (in text format) are converted into **total duration in minutes**, becoming numeric variables.

- The `legs0_segments0_duration` column, corresponding to the first segment of the outbound flight, is also converted into minutes in a new variable: `legs0_duration_minutes`.

> Handling temporal variables is essential for capturing behavioral patterns, such as preferences for daytime flights, short trips, or early booking.

In [12]:
# 🗓️ Colunas de datas e horários
cols_datetime = [
    'requestDate',
    'legs0_departureAt', 'legs0_arrivalAt',
    'legs1_departureAt', 'legs1_arrivalAt'
]
def process_datetime_and_duration(df):
    df_processed = df.copy()

    # Datas para datetime
    for col in cols_datetime:
        df_processed[col] = pd.to_datetime(df_processed[col], errors='coerce')

    # Features de hora e dia da semana
    df_processed['legs0_dep_hour'] = df_processed['legs0_departureAt'].dt.hour
    df_processed['legs0_dep_dayofweek'] = df_processed['legs0_departureAt'].dt.dayofweek
    df_processed['legs1_dep_hour'] = df_processed['legs1_departureAt'].dt.hour
    df_processed['legs1_dep_dayofweek'] = df_processed['legs1_departureAt'].dt.dayofweek

    # Dias entre ida e volta (duração da viagem)
    df_processed['trip_days'] = (df_processed['legs1_departureAt'] - df_processed['legs0_departureAt']).dt.days

    # Dias de antecedência (request → ida)
    df_processed['booking_to_trip_days'] = (df_processed['legs0_departureAt'] - df_processed['requestDate']).dt.days

    # Final de semana (ida/volta)
    df_processed['ida_fds'] = df_processed['legs0_dep_dayofweek'].isin([5, 6]).astype(int)
    df_processed['volta_fds'] = df_processed['legs1_dep_dayofweek'].isin([5, 6]).astype(int)

    # Horário comercial (7h às 19h)
    def is_business_hour(hour):
        return int(7 <= hour <= 19)

    df_processed['ida_comercial'] = df_processed['legs0_dep_hour'].apply(is_business_hour)
    df_processed['volta_comercial'] = df_processed['legs1_dep_hour'].apply(is_business_hour)

    # ⏱️ Converter colunas de duração para minutos
    def clean_and_convert_duration(col):
        return (
            col
            .fillna("00:00:00")
            .astype(str)
            .str.strip()
            .str.replace("nan", "00:00:00")
            .pipe(pd.to_timedelta, errors='coerce')
            .dt.total_seconds() / 60  # minutos
        )

    cols_duration = ['legs0_duration', 'legs1_duration']
    for col in cols_duration:
        df_processed[col] = clean_and_convert_duration(df_processed[col])

    return df_processed

In [13]:
df_train['legs0_duration_minutes'] = (
    pd.to_timedelta(
        df_train['legs0_segments0_duration'].fillna("00:00:00").astype(str).str.strip(),
        errors='coerce'
    ).dt.total_seconds() / 60  # em minutos
)

df_train.drop('legs0_segments0_duration', axis=1, inplace=True)

In [14]:
# ✅ Applicação
df_train = process_datetime_and_duration(df_train)
df_train.drop(columns=cols_datetime, inplace=True)

### 📍 5.5 Processing the Search Route (`searchRoute`)

The `searchRoute` column represents the full searched travel route, including both outbound and return legs, and is encoded as a string in the format:

OUTBOUND/RETURN → e.g., "GRUFOR/FORGRU" or "GRUCGHFOR/FORGRUCGH"

    
#### 🛠️ Applied transformations:

- The `searchRoute` column is converted to `string` type to ensure consistency.

- The string is split into two parts:
  - **`route_ida`**: outbound leg of the trip
  - **`route_volta`**: return leg of the trip (if present)

- From each leg, the following are extracted:
  - **`ida_from`** and **`ida_to`**: origin and destination of the outbound leg
  - **`volta_from`** and **`volta_to`**: origin and destination of the return leg

- An auxiliary variable `searchRoute_count` was also created to indicate the number of legs in the route (via splitting by "/"), used only for validation and later removed.

- The original `searchRoute` column was dropped after decomposing the information.

> This decomposition allows the model to capture origin-destination patterns, as well as distinct behaviors in multi-leg routes — valuable insights for the recommendation task.

In [15]:
df_train['searchRoute'] = df_train['searchRoute'].astype(str)
df_train['searchRoute_count'] = df_train['searchRoute'].apply(lambda x: x.split("/"))
df_train['searchRoute_count'] = df_train['searchRoute_count'].apply(lambda x: len(x))
print(f" min {min(df_train['searchRoute_count'])}")
print(f" max {max(df_train['searchRoute_count'])}")
df_train.drop('searchRoute_count', axis=1, inplace=True)

 min 1
 max 2


In [16]:
# Garante que searchRoute está como string
df_train['searchRoute'] = df_train['searchRoute'].astype(str)

# Separa ida e volta
df_train[['route_ida', 'route_volta']] = df_train['searchRoute'].str.split('/', expand=True)

# Extrai origem e destino da ida
df_train['ida_from'] = df_train['route_ida'].str[:3]
df_train['ida_to'] = df_train['route_ida'].str[3:]

# Extrai origem e destino da volta (se existir)
df_train['volta_from'] = df_train['route_volta'].str[:3]
df_train['volta_to'] = df_train['route_volta'].str[3:]

df_train.drop('searchRoute', axis=1, inplace=True)

In [17]:
df_train.head()

Unnamed: 0,Id,ranker_id,profileId,companyID,sex,nationality,isVip,bySelf,isAccess3D,corporateTariffCode,totalPrice,taxes,legs0_duration,legs1_duration,legs0_segments0_departureFrom_airport_iata,legs0_segments0_arrivalTo_airport_iata,legs0_segments0_arrivalTo_airport_city_iata,legs0_segments0_marketingCarrier_code,legs0_segments0_operatingCarrier_code,legs0_segments0_aircraft_code,legs0_segments0_flightNumber,legs0_segments0_baggageAllowance_quantity,legs0_segments0_baggageAllowance_weightMeasurementType,legs0_segments0_cabinClass,legs0_segments0_seatsAvailable,miniRules0_monetaryAmount,miniRules0_percentage,miniRules0_statusInfos,miniRules1_monetaryAmount,miniRules1_percentage,miniRules1_statusInfos,pricingInfo_isAccessTP,pricingInfo_passengerCount,selected,n_segments_ida,has_connections_ida,frequentFlyer_count,hasFrequentFlyer,legs0_duration_minutes,legs0_dep_hour,legs0_dep_dayofweek,legs1_dep_hour,legs1_dep_dayofweek,trip_days,booking_to_trip_days,ida_fds,volta_fds,ida_comercial,volta_comercial,route_ida,route_volta,ida_from,ida_to,volta_from,volta_to
0,0,98ce0dabf6964640b63079fbafd42cbe,2087645,57323,True,36,False,True,False,,16884.0,370.0,160.0,155.0,TLK,KJA,KJA,KV,KV,YK2,216,1.0,0.0,1.0,9.0,,,,,,,1.0,1,1,3,True,3,1,160.0,15,5,9.0,1.0,23.0,29,1,0,1,1,TLKKJA,KJATLK,TLK,KJA,KJA,TLK
1,1,98ce0dabf6964640b63079fbafd42cbe,2087645,57323,True,36,False,True,True,123.0,51125.0,2240.0,445.0,505.0,TLK,OVB,OVB,S7,S7,E70,5358,1.0,0.0,1.0,4.0,2300.0,,1.0,3500.0,,1.0,1.0,1,0,3,True,3,1,170.0,9,5,22.0,1.0,24.0,29,1,0,1,0,TLKKJA,KJATLK,TLK,KJA,KJA,TLK
2,2,98ce0dabf6964640b63079fbafd42cbe,2087645,57323,True,36,False,True,False,,53695.0,2240.0,445.0,505.0,TLK,OVB,OVB,S7,S7,E70,5358,1.0,0.0,1.0,4.0,2300.0,,1.0,3500.0,,1.0,1.0,1,0,3,True,3,1,170.0,9,5,22.0,1.0,24.0,29,1,0,1,0,TLKKJA,KJATLK,TLK,KJA,KJA,TLK
3,3,98ce0dabf6964640b63079fbafd42cbe,2087645,57323,True,36,False,True,True,123.0,81880.0,2240.0,445.0,505.0,TLK,OVB,OVB,S7,S7,E70,5358,1.0,0.0,1.0,4.0,0.0,,1.0,0.0,,1.0,1.0,1,0,3,True,3,1,170.0,9,5,22.0,1.0,24.0,29,1,0,1,0,TLKKJA,KJATLK,TLK,KJA,KJA,TLK
4,4,98ce0dabf6964640b63079fbafd42cbe,2087645,57323,True,36,False,True,False,,86070.0,2240.0,445.0,505.0,TLK,OVB,OVB,S7,S7,E70,5358,1.0,0.0,1.0,4.0,0.0,,1.0,0.0,,1.0,1.0,1,0,3,True,3,1,170.0,9,5,22.0,1.0,24.0,29,1,0,1,0,TLKKJA,KJATLK,TLK,KJA,KJA,TLK


## 🧪 6. Training Preparation

With all features processed, the next step is to prepare the data for training the ranking model with LightGBM.

---

### 🎯 6.1 Target and Group Definition

- **`target_col`**: target variable indicating whether the flight was selected (`selected = 1`).
- **`group_col`**: identifies each flight search session (`ranker_id`), used to properly group options in the ranking model.

---

### 🧮 6.2 Feature Organization

Features are divided into three types:

- 🔢 **Numerical (`numeric_cols`)**: continuous values like price, duration, baggage count, etc.
- 🏷️ **Categorical (`categorical_cols`)**: variables representing codes, airports, airlines, etc.
- ✅ **Boolean (`boolean_cols`)**: indicator variables (e.g., `isVip`, `ida_fds`, `hasFrequentFlyer`, etc.)

These lists are combined into the final `features` variable, which will be used as input for the model.

---

### 🧪 6.3 Train/Validation Split

`GroupShuffleSplit` is used to perform the **split while respecting groups (`ranker_id`)**, ensuring that all options from the same search session appear **either in training or in validation**, but not both.

---

### 📦 6.4 Dataset Construction for LightGBM

The `train_dataset` and `val_dataset` objects are created, which are optimized LightGBM structures for ranking:

- Include the data (`X_train`, `X_val`) and targets (`y_train`, `y_val`)
- Receive the list of categorical columns
- Incorporate the groups (`group=...`) required for **supervised ranking**
- Define the `max_bin` parameter, which controls discretization of continuous variables (used to speed up training and allow GPU usage)

> This structure is essential for using the **`lambdarank` objective**, as the model needs to understand the comparison groups.


In [18]:
# --- Target e grupo
target_col = "selected"
group_col = "ranker_id"

# --- Categóricas para LightGBM
categorical_cols = [
    'companyID',
    'nationality',
    'legs0_segments0_departureFrom_airport_iata',
    'legs0_segments0_arrivalTo_airport_iata',
    'legs0_segments0_arrivalTo_airport_city_iata',
    'legs0_segments0_marketingCarrier_code',
    'legs0_segments0_operatingCarrier_code',
    'legs0_segments0_aircraft_code',
    'corporateTariffCode',
    
    # novas features categóricas da searchRoute
    'route_ida',
    'route_volta',
    'ida_from',
    'ida_to',
    'volta_from',
    'volta_to'
]

# --- Booleanas e numéricas
boolean_cols = [
    'sex', 'isVip', 'bySelf',
    'pricingInfo_isAccessTP', 'hasFrequentFlyer',
    'ida_fds', 'volta_fds',
    'ida_comercial', 'volta_comercial',
    'isAccess3D',
    'has_connections_ida'
] + [col for col in df_train.columns if col.startswith("ff_")]

numeric_cols = [
    'totalPrice', 'taxes',
    'legs0_duration', 'legs1_duration',
    'legs0_segments0_baggageAllowance_quantity',
    'legs0_segments0_baggageAllowance_weightMeasurementType',
    'legs0_segments0_cabinClass',
    'legs0_segments0_seatsAvailable',
    'miniRules0_monetaryAmount', 'miniRules0_percentage',
    'miniRules1_monetaryAmount', 'miniRules1_percentage',
    'booking_to_trip_days', 'trip_days',
    'legs0_dep_hour', 'legs0_dep_dayofweek',
    'legs1_dep_hour', 'legs1_dep_dayofweek',
    'frequentFlyer_count', 'legs0_duration_minutes'
]
features = numeric_cols + categorical_cols + boolean_cols

# --- Converte categóricas para category
for col in categorical_cols:
    df_train[col] = df_train[col].astype("category")

In [19]:
# --- Separação por grupo (ranker_id)
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(gss.split(df_train, groups=df_train["ranker_id"]))

df_train_split = df_train.iloc[train_idx].copy()
df_val = df_train.iloc[val_idx].copy()

# --- Features e targets
X_train = df_train_split[features]
y_train = df_train_split[target_col]
groups_train = df_train_split[group_col].value_counts().sort_index().values

X_val = df_val[features]
y_val = df_val[target_col]
groups_val = df_val[group_col].value_counts().sort_index().values

dataset_params = {
    "max_bin": 63 
}

# --- Criação dos Datasets
train_dataset = lgb.Dataset(
    X_train,
    label=y_train,
    group=groups_train,
    categorical_feature=categorical_cols,
    params=dataset_params  # 💡 AQUI é onde max_bin deve ir também!
)

val_dataset = lgb.Dataset(
    X_val,
    label=y_val,
    group=groups_val,
    categorical_feature=categorical_cols,
    reference=train_dataset,
    params=dataset_params
)



## 🧮 7. Hyperparameter Search with LightGBM

Before training the final model, we perform a **manual Grid Search** to identify the best combination of hyperparameters for the `Lambdarank` model.

---

### 🔧 7.1 Parameters Tested

The search is conducted over the following hyperparameters:

- `learning_rate`: learning rate (e.g., 0.05)
- `num_leaves`: tree complexity (e.g., 63, 127)
- `min_data_in_leaf`: regularization via minimum samples per leaf (e.g., 50, 70, 100)

All possible combinations of these values are tested using `itertools.product`.

---

### 🧪 7.2 Training and Validation

For each parameter combination:

1. The LightGBM model is trained with:
   - Objective: `lambdarank`
   - Metric: `ndcg@3`
   - Early stopping after 50 rounds without improvement

2. The model's performance is evaluated based on the **best NDCG@3** score achieved on the validation set.

3. The best model and parameter set are stored.

---

### ✅ 7.3 Search Result

At the end of the search:

- The **best parameter combination** is displayed
- The **NDCG@3 score** is reported
- A **validation prediction** is performed using the best model
- The **top-1 accuracy** is calculated, i.e., the fraction of sessions where the correct flight was ranked first by the model

> This evaluation serves as a practical check of the recommendation quality before the final training on the full dataset.

In [20]:
param_grid = {
    'learning_rate': [0.05],
    'num_leaves': [63, 127],        
    'min_data_in_leaf': [50, 70, 100]     
}
# Gera todas combinações de parâmetros
param_combinations = list(product(*param_grid.values()))
param_keys = list(param_grid.keys())

best_score = -1
best_model = None
best_params = None

for combo in param_combinations:
    param_set = dict(zip(param_keys, combo))
    print(f"Treinando com: {param_set}")

    params = {
        "objective": "lambdarank",
        "metric": "ndcg",
        "ndcg_eval_at": [3],
        "boosting_type": "gbdt",
        "feature_fraction": 0.8,
        "bagging_fraction": 0.8,
        "bagging_freq": 1,
        "seed": 42,
        "verbosity": -1,
        "num_threads": 8,
        **param_set
    }

    model = lgb.train(
        params,
        train_dataset,
        valid_sets=[val_dataset],
        valid_names=["valid"],
        num_boost_round=1000,
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
    )

    score = model.best_score["valid"]["ndcg@3"]

    if score > best_score:
        best_score = score
        best_model = model
        best_params = param_set

print("\n✅ Melhor combinação:")
print(best_params)
print(f"NDCG@3: {best_score:.5f}")


Treinando com: {'learning_rate': 0.05, 'num_leaves': 63, 'min_data_in_leaf': 50}
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[281]	valid's ndcg@3: 0.801424
Treinando com: {'learning_rate': 0.05, 'num_leaves': 63, 'min_data_in_leaf': 70}
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[285]	valid's ndcg@3: 0.804463
Treinando com: {'learning_rate': 0.05, 'num_leaves': 63, 'min_data_in_leaf': 100}
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[293]	valid's ndcg@3: 0.806283
Treinando com: {'learning_rate': 0.05, 'num_leaves': 127, 'min_data_in_leaf': 50}
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[250]	valid's ndcg@3: 0.809945
Treinando com: {'learning_rate': 0.05, 'num_leaves': 127, 'min_data_in_leaf': 70}
Training until validation scores don't improve for 50 rounds
Early stopping, best it

In [21]:
# --- Predição
y_pred = model.predict(X_val)

# --- Avaliação Top-1
df_pred = df_val.copy()
df_pred['y_true'] = y_val
df_pred['y_pred'] = y_pred

df_pred_sorted = df_pred.sort_values(['ranker_id', 'y_pred'], ascending=[True, False])
df_top1 = df_pred_sorted.groupby('ranker_id').head(1)

acertos = df_top1['y_true'].sum()
total = df_top1.shape[0]

print(f"Voos escolhidos corretamente (top1): {acertos} de {total} sessões")
print(f"Acurácia top1: {acertos / total:.4f}")

Voos escolhidos corretamente (top1): 590 de 1542 sessões
Acurácia top1: 0.3826


## 🧠 8. Final Training with Validation

With the best hyperparameters defined, we proceed with the full training of the `LightGBM` model using the `lambdarank` objective.

---

### ⚙️ 8.1 Model Configuration

The model is configured with:

- `objective = "lambdarank"`: supervised ranking model
- `metric = "ndcg"` with `ndcg_eval_at = [3]`: directly optimizes the evaluation metric
- Hyperparameters:
  - `"learning_rate": best_params['learning_rate']`
  - `"num_leaves": best_params['num_leaves']`
  - `"min_data_in_leaf": best_params['min_data_in_leaf']`
  - `feature_fraction = 0.8`
  - `bagging_fraction = 0.8`
  - `bagging_freq = 1`
- `early_stopping`: stops training after 120 rounds without improvement
- Training runs for a maximum of `num_boost_round = 1000`

---

### 📊 8.2 Model Evaluation

After training, a prediction is performed on the validation set (`X_val`). The evaluation considers:

- Sorting flight options within each group (`ranker_id`) based on the predicted score (`y_pred`)
- Computing **top-1 accuracy**, i.e., the fraction of sessions where the correct flight is ranked in the **first position**

> This metric serves as a more direct proxy for assessing the model's practical effectiveness, complementing NDCG@3.


In [22]:
params = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [3],
    "learning_rate": best_params['learning_rate'],
    "num_leaves": best_params['num_leaves'],
    "min_data_in_leaf": best_params['min_data_in_leaf'],
    "boosting_type": "gbdt",
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "seed": 42,
    "verbosity": -1,
    "num_threads": 8  # ✅ Aqui está correto
}

model = lgb.train(
    params,
    train_dataset,
    valid_sets=[train_dataset, val_dataset],
    valid_names=["train", "valid"],
    num_boost_round=1000, 
    callbacks=[lgb.early_stopping(stopping_rounds=120)],
)


Training until validation scores don't improve for 120 rounds
Early stopping, best iteration is:
[308]	train's ndcg@3: 0.964658	valid's ndcg@3: 0.814894


In [23]:
# --- Predição
y_pred = model.predict(X_val)

# --- Avaliação Top-1
df_pred = df_val.copy()
df_pred['y_true'] = y_val
df_pred['y_pred'] = y_pred

df_pred_sorted = df_pred.sort_values(['ranker_id', 'y_pred'], ascending=[True, False])
df_top1 = df_pred_sorted.groupby('ranker_id').head(1)

acertos = df_top1['y_true'].sum()
total = df_top1.shape[0]

print(f"Voos escolhidos corretamente (top1): {acertos} de {total} sessões")
print(f"Acurácia top1: {acertos / total:.4f}")


Voos escolhidos corretamente (top1): 590 de 1542 sessões
Acurácia top1: 0.3826


## 🏁 9. Final Training with the Entire Dataset

After validating the model and finding the best hyperparameter configuration, we perform the **final training using 100% of the training data**.

---

### 📦 9.1 Full Dataset

In this stage, we use all available observations:

- `X_full`: all features from the `df_train` base
- `y_full`: target variable (`selected`)
- `groups_full`: grouping structure (`ranker_id`) with all groups

These data are converted into a LightGBM `Dataset` optimized for ranking.

---

### 🧠 9.2 Use of `best_iteration`

During validation, the model trained with `early_stopping` identified an optimal number of boosting rounds — stored in `model.best_iteration`.

This value represents the point where the model:

- Achieved **best validation performance**
- Before starting to **overfit**

⚠️ Therefore, when training on the full dataset, we use:

```python
num_boost_round = model.best_iteration



In [24]:

# ✅ Treinamento final com TODO o dataset de treino
#     usando best_iteration encontrado na validação
# ============================================================

X_full = df_train[features]
y_full = df_train[target_col]
groups_full = df_train[group_col].value_counts().sort_index().values

full_dataset = lgb.Dataset(X_full, y_full, group=groups_full, categorical_feature=categorical_cols)

# ⚠️ Usa o número ideal de iterações do treino anterior
final_model = lgb.train(
    params,
    full_dataset,
    num_boost_round=model.best_iteration 
)

In [25]:
final_model.save_model('modelo_final.txt')

<lightgbm.basic.Booster at 0x1b361f6a0e0>

## 📤 10. Submission Generation

- Load the trained model from `modelo_final.txt`.
- Apply the same preprocessing steps used during training.
- Perform predictions with the final model and sort by `y_pred`.
- Generate the `submission.csv` file with the ranked flight selections (`selected`).

In [26]:
model = lgb.Booster(model_file='modelo_final.txt')

In [27]:
# ============================================================
# ## 6. Geração de Submissão
# ============================================================

# 1. Ler test.parquet
df_test = pd.read_parquet("data/aeroclub/test.parquet")

# 2. Aplicar transformações mínimas necessárias
df_test['ranker_id'] = df_test['ranker_id'].astype(str)
df_test['nationality'] = df_test['nationality'].astype(str)
df_test['searchRoute'] = df_test['searchRoute'].astype(str)

# --- Frequent Flyer (mesmos one-hot do treino)
df_test['frequentFlyer'] = df_test['frequentFlyer'].fillna('').astype(str)
ff_lists_test = df_test['frequentFlyer'].str.split('/')

In [28]:
all_programs = set(chain.from_iterable(ff_lists))
print(f"Total de companhias únicas: {len(all_programs)}")


for program in all_programs:
    if program == '':
        continue
    df_test[f'ff_{program}'] = ff_lists_test.apply(lambda x: int(program in x))

for col in [col for col in df_test.columns if col.startswith("ff_")]:
    df_test[col] = df_test[col].astype(pd.BooleanDtype())

df_test['frequentFlyer_count'] = df_test['frequentFlyer'].apply(count_frequent_flyers)
df_test['hasFrequentFlyer'] = df_test['frequentFlyer'].notnull().astype(int)
df_test.drop(columns=['frequentFlyer'], inplace=True)



Total de companhias únicas: 41


In [29]:
# --- Datas
cols_datetime = [
    'requestDate',
    'legs0_departureAt', 'legs0_arrivalAt',
    'legs1_departureAt', 'legs1_arrivalAt'
]
for col in cols_datetime:
    df_test[col] = pd.to_datetime(df_test[col], errors='coerce')

df_test['legs0_dep_hour'] = df_test['legs0_departureAt'].dt.hour
df_test['legs0_dep_dayofweek'] = df_test['legs0_departureAt'].dt.dayofweek
df_test['legs1_dep_hour'] = df_test['legs1_departureAt'].dt.hour
df_test['legs1_dep_dayofweek'] = df_test['legs1_departureAt'].dt.dayofweek
df_test['trip_days'] = (df_test['legs1_departureAt'] - df_test['legs0_departureAt']).dt.days
df_test['booking_to_trip_days'] = (df_test['legs0_departureAt'] - df_test['requestDate']).dt.days
df_test['ida_fds'] = df_test['legs0_dep_dayofweek'].isin([5, 6]).astype(int)
df_test['volta_fds'] = df_test['legs1_dep_dayofweek'].isin([5, 6]).astype(int)

df_test['ida_comercial'] = df_test['legs0_dep_hour'].apply(lambda x: int(7 <= x <= 19))
df_test['volta_comercial'] = df_test['legs1_dep_hour'].apply(lambda x: int(7 <= x <= 19))

df_test.drop(columns=cols_datetime, inplace=True)

In [30]:
######################################
# Cria a feature 'n_segments_ida' baseada na presença de segmentos 1, 2, 3
segments_ida_cols = {
    1: 'legs0_segments1_departureFrom_airport_iata',
    2: 'legs0_segments2_departureFrom_airport_iata',
    3: 'legs0_segments3_departureFrom_airport_iata'
}

# Começa com 1 porque o segmento 0 está sempre presente
df_test['n_segments_ida'] = 1

for seg, col in segments_ida_cols.items():
    df_test['n_segments_ida'] += df_test[col].notnull().astype(int)

# Cria a flag booleana se há conexões (mais de 1 segmento na ida)
df_test['has_connections_ida'] = (df_test['n_segments_ida'] > 1).astype('boolean')

# companyID como categoria
df_test['companyID'] = df_test['companyID'].astype('category')

# isAccess3D como boolean
df_test['isAccess3D'] = df_test['isAccess3D'].astype('boolean')

df_test.drop('legs0_segments1_departureFrom_airport_iata', inplace=True, axis=1)
df_test.drop('legs0_segments2_departureFrom_airport_iata', inplace=True, axis=1)
df_test.drop('legs0_segments3_departureFrom_airport_iata', inplace=True, axis=1)
########################


# --- Duração
def clean_and_convert_duration(col):
    return (
        col
        .fillna("00:00:00")
        .astype(str)
        .str.strip()
        .str.replace("nan", "00:00:00")
        .pipe(pd.to_timedelta, errors='coerce')
        .dt.total_seconds() / 60
    )

df_test['legs0_duration'] = clean_and_convert_duration(df_test['legs0_duration'])
df_test['legs1_duration'] = clean_and_convert_duration(df_test['legs1_duration'])
df_test['legs0_segments0_duration'] = clean_and_convert_duration(df_test['legs0_segments0_duration'])
df_test['legs0_duration_minutes'] = df_test['legs0_duration']
df_test.drop(columns=['legs0_segments0_duration'], inplace=True)

# --- SearchRoute features
df_test[['route_ida', 'route_volta']] = df_test['searchRoute'].str.split('/', expand=True)
df_test['ida_from'] = df_test['route_ida'].str[:3]
df_test['ida_to'] = df_test['route_ida'].str[3:]
df_test['volta_from'] = df_test['route_volta'].str[:3]
df_test['volta_to'] = df_test['route_volta'].str[3:]
df_test.drop('searchRoute', axis=1, inplace=True)


In [31]:
df_train['companyID'] = df_train['companyID'].astype('category')
# Converte para booleano (caso ainda não esteja)
df_train['isAccess3D'] = df_train['isAccess3D'].astype('boolean')
# Cria flag indicando se há conexões na ida
df_train['has_connections_ida'] = (df_train['n_segments_ida'] > 1).astype('boolean')

# --- Tipagem
for col in categorical_cols:
    df_test[col] = df_test[col].astype("category")

for col in boolean_cols:
    if col in df_test.columns:
        df_test[col] = df_test[col].astype('boolean')

In [32]:
# Prever com o modelo
X_test = df_test[features]
df_test['y_pred'] = model.predict(X_test)

# 4. Gerar submissão
df_test_sorted = df_test.sort_values(['ranker_id', 'y_pred'], ascending=[True, False])
df_test_sorted['selected'] = df_test_sorted.groupby('ranker_id').cumcount() + 1

submission = df_test_sorted[['Id', 'ranker_id', 'selected']]
submission.to_csv("submission.csv", index=False)
print("✅ Arquivo de submissão salvo como 'submission.csv'")


✅ Arquivo de submissão salvo como 'submission.csv'
