# Logistic Regression for Flight-Class Prediction

Before we dive in, let’s briefly recap **how Logistic Regression works** and why it’s appropriate for a multiclass ticket-class problem.

---

## Logistic Regression “Behind the Mask”

In the binary case, a logistic regression model predicts the probability of the positive class by passing a weighted sum of the inputs through the **sigmoid** (logistic) function:

\[
\sigma(z)\;=\;\frac{1}{1 + e^{-z}}
\]

where \(z = \theta^T x\). In the multiclass (“softmax”) version, we compute one linear score \(\theta_k^T x\) per class and normalize the scores into probabilities via:

\[
P(y=k\mid x) = \frac{\exp(\theta_k^T x)}{\sum_j \exp(\theta_j^T x)}.
\]
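
As a rough illustration of the two formulas above, the sketch below implements the sigmoid and the softmax in plain Python. The function names and the example scores are illustrative assumptions, not part of this notebook's pipeline. Subtracting the maximum score inside `softmax` is a common numerical-stability trick that leaves the probabilities unchanged.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn one linear score per class into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores theta_k^T x for four ticket classes
probs = softmax([2.0, 1.0, 0.5, -3.0])

# With two classes and scores (z, 0), softmax reduces to the sigmoid
two_class = softmax([1.5, 0.0])
```

The last line shows how the binary model is a special case of the multiclass one: fixing one class score at zero recovers \(\sigma(z)\) for the other class.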

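Rather than minimizing squared errors as in ordinary least squares (OLS), logistic regression fits \(\theta\) by **Maximum Likelihood Estimation (MLE)**: it chooses the parameters under which the observed class labels are most probable.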
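
Written out for training pairs \((x_i, y_i)\), \(i = 1, \dots, n\) (this indexing is an assumption introduced here for illustration), the MLE objective for the softmax model is:

\[
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{i=1}^{n} \log P(y = y_i \mid x_i)
\;=\; \arg\min_{\theta} \; -\sum_{i=1}^{n} \log \frac{\exp(\theta_{y_i}^T x_i)}{\sum_j \exp(\theta_j^T x_i)}.
\]

The right-hand side is the cross-entropy loss. Unlike OLS, it has no closed-form minimizer, so it is solved iteratively.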

---

## Task

In this notebook we will:

1. Load our **cleaned airfare** dataset  
2. Preprocess numeric and categorical features  
3. Resample rare classes with **SMOTE**  
4. Train a **multinomial** `LogisticRegression`  
5. Evaluate via **classification report** and **confusion matrix**

---




In [2]:
## 1 Setup

# standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn + imblearn pipeline & utilities
from sklearn.model_selection     import train_test_split
from sklearn.compose             import ColumnTransformer
from sklearn.preprocessing       import OneHotEncoder, StandardScaler
from imblearn.pipeline           import Pipeline as ImbPipeline
from imblearn.over_sampling      import SMOTE
from sklearn.linear_model        import LogisticRegression
from sklearn.metrics             import classification_report, confusion_matrix

sns.set_style("whitegrid")
%matplotlib inline

In [3]:
## 2 Load & Inspect Data

df = pd.read_csv(
    r"C:\Users\jbats\Projects\cmor438\supervised-learning\Logistic Regression\Cleaned_dataset.csv"
)
df.head()

df["Class"].value_counts()



Class
Economy            252033
Business           126834
Premium Economy     73077
First                 144
Name: count, dtype: int64
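
The counts above are heavily skewed, which is what motivates the resampling step later on. A small sketch, using only the counts printed above, of each class's share of the data (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {
    "Economy": 252033,
    "Business": 126834,
    "Premium Economy": 73077,
    "First": 144,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<16} {share:.4%}")

# Ratio of the largest class to the smallest one
imbalance_ratio = max(counts.values()) / min(counts.values())
```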

In [5]:
## 3 Define Features & Target

X = df.drop(columns=["Flight_code", "Class"])
y = df["Class"]

numeric_cols = ["Duration_in_hours", "Days_left", "Fare"]
categorical_cols = [
    "Journey_day", "Airline", "Source",
    "Departure", "Total_stops", "Arrival", "Destination"
]

In [6]:
## 4 Preprocessing, SMOTE, Model Pipeline

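# Scale numeric columns and one-hot encode categorical ones.
# Columns not listed here are dropped (ColumnTransformer's default remainder).
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="first", sparse_output=False), categorical_cols)
])

# The imblearn pipeline applies SMOTE only during fit, so the test set
# is never resampled.
clf = ImbPipeline([
    ("pre",   preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("lr",    LogisticRegression(
        # Note: multi_class is deprecated in scikit-learn >= 1.5, where
        # multinomial is already used for targets with 3+ classes.
        multi_class="multinomial",
        solver="saga",
        # SMOTE's default strategy already equalizes the training classes,
        # so this re-weighting has essentially no additional effect.
        class_weight="balanced",
        max_iter=1000,
        verbose=0
    ))
])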
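
To give a feel for what the `smote` step does, here is a simplified, pure-Python sketch of the core SMOTE idea: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours. This is an illustration only, not imblearn's implementation; the helper `smote_like_samples` and the toy points are hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy 2-D minority class (made-up values)
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority_points, n_new=5)
```

The `k` argument here plays the role of SMOTE's `k_neighbors` parameter: a larger value lets synthetic points spread across more of the minority region.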


In [7]:
## 5 Train/Test Split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
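
Because the split is stratified, each class keeps roughly the same share in both partitions. A back-of-the-envelope sketch, based on the class counts printed earlier, of about how many rows of each class land in the 20% test set (exact numbers may differ by a row or so because of rounding inside scikit-learn):

```python
# Class counts copied from the value_counts() output earlier in the notebook
counts = {
    "Economy": 252033,
    "Business": 126834,
    "Premium Economy": 73077,
    "First": 144,
}
test_size = 0.2

# Approximate per-class row counts under a stratified split
approx_test = {name: round(n * test_size) for name, n in counts.items()}
approx_train = {name: n - approx_test[name] for name, n in counts.items()}

for name in counts:
    print(f"{name:<16} train ~{approx_train[name]:>6}  test ~{approx_test[name]:>5}")
```

With only around 29 `First` rows in the test set, the per-class metrics for that class will be noisy, which is worth keeping in mind when reading the evaluation.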


In [None]:
## 6 Fit and Predict

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)


In [None]:
## 7 Eval and Confusion Matrix
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=clf.classes_,
            yticklabels=clf.classes_,
            cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()
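
The classification report and the heatmap are two views of the same counts. As a sketch, per-class precision and recall can be read off a confusion matrix whose rows are actual classes and whose columns are predicted classes (the layout used by `confusion_matrix`). The helper name and the 3x3 matrix below are made-up toy data, not output from this model.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} from a square confusion matrix
    with rows = actual class and columns = predicted class."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = cm[k][k]
        actual_total = sum(cm[k])                    # row sum
        predicted_total = sum(row[k] for row in cm)  # column sum
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / actual_total if actual_total else 0.0
        metrics[label] = (precision, recall)
    return metrics

# Toy example with three hypothetical classes
toy_labels = ["A", "B", "C"]
toy_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
toy_metrics = per_class_metrics(toy_cm, toy_labels)
```

Reading along a row tells you where a class's true examples went (recall); reading down a column tells you what a predicted label actually contained (precision).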


## 8 Next Steps

- Experiment with different SMOTE parameters (`sampling_strategy`, `k_neighbors`).
- Tune `C` and `penalty` in `LogisticRegression` via `GridSearchCV`.
- Add interaction or time-of-day features to boost Premium Economy recall.
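
As a starting point for the tuning idea, here is a minimal `GridSearchCV` sketch on a small synthetic dataset. The toy data, the grid values, the `f1_macro` scoring choice, and the simplified pipeline (no SMOTE, default solver) are all assumptions made to keep the example quick to run; they are not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic 3-class problem standing in for the airfare data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

toy_clf = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as "<step name>__<parameter>"
param_grid = {"lr__C": [0.01, 0.1, 1.0, 10.0]}

# Macro-averaged F1 weights every class equally, regardless of size
search = GridSearchCV(toy_clf, param_grid, scoring="f1_macro", cv=3)
search.fit(X_toy, y_toy)
best_C = search.best_params_["lr__C"]
```

For the pipeline in this notebook, the same naming rule would give keys such as `lr__C` or `smote__k_neighbors`, since its steps are named `lr` and `smote`.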

