---
## Perceptron Classifier for Airline Fare Prediction
#### Language: Python 3.x
---


### Table of Contents
1. [Introduction](#Introduction)  
2. [Imports](#Imports)  
3. [Data Loading](#Data-Loading)  
4. [Data Preprocessing](#Data-Preprocessing)  
5. [Model Training](#Model-Training)  
6. [Evaluation](#Evaluation)  
7. [Conclusion](#Conclusion)


### Introduction <a id="Introduction"></a>
We’ll build a simple binary Perceptron that classifies whether an airline's average fare exceeds \$1,000.  
The notebook is organized into sections that follow the workflow: loading, preprocessing, training, and evaluation.


### Imports <a id="Imports"></a>


In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model    import Perceptron
from sklearn.pipeline        import make_pipeline
from sklearn.preprocessing   import StandardScaler
from sklearn.metrics         import classification_report, accuracy_score
from sklearn.utils           import resample


### Data Loading <a id="Data-Loading"></a>


In [6]:
df = pd.read_csv(r"C:\Users\jbats\airline-market-fare-prediction-data\Airline_Market_Fare_Prediction_Data\MarketFarePredictionData.csv")

# Quick peek
df.head()


| Unnamed: 0 | MktCoupons | OriginCityMarketID | DestCityMarketID | OriginAirportID | DestAirportID | Carrier | NonStopMiles | RoundTrip | ODPairID | Pax | ... | Circuity | Slot | Non_Stop | MktMilesFlown | OriginCityMarketID_freq | DestCityMarketID_freq | OriginAirportID_freq | DestAirportID_freq | Carrier_freq | ODPairID_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 178 | 152 | 170 | 255 | 6 | 1807.0 | 1.0 | 4035 | 136.0 | ... | 1.36746 | 0 | 0.0 | 1992.449761 | 0.004138 | 0.039783 | 0.004138 | 0.022049 | 0.116826 | 0.000132 |
| 1 | 2 | 178 | 152 | 170 | 194 | 20 | 1798.0 | 1.0 | 4035 | 136.0 | ... | 1.051724 | 0 | 0.0 | 1992.449761 | 0.004138 | 0.039783 | 0.004138 | 0.008368 | 0.307651 | 0.000132 |
| 2 | 2 | 178 | 152 | 170 | 260 | 6 | 1784.0 | 0.0 | 4035 | 136.0 | ... | 1.034753 | 0 | 0.0 | 1992.449761 | 0.004138 | 0.039783 | 0.004138 | 0.009366 | 0.116826 | 0.000132 |
| 3 | 2 | 178 | 152 | 170 | 255 | 6 | 1807.0 | 1.0 | 4035 | 136.0 | ... | 1.029884 | 0 | 0.0 | 1992.449761 | 0.004138 | 0.039783 | 0.004138 | 0.022049 | 0.116826 | 0.000132 |
| 4 | 2 | 178 | 152 | 170 | 194 | 20 | 1798.0 | 1.0 | 4035 | 136.0 | ... | 1.062291 | 0 | 0.0 | 1992.449761 | 0.004138 | 0.039783 | 0.004138 | 0.008368 | 0.307651 | 0.000132 |


### Data Preprocessing <a id="Data-Preprocessing"></a>
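1. Create a binary label from `Average_Fare`: 0 if fare ≤ 1000, 1 if fare > 1000  
2. Choose features  
3. Split into train and test sets (stratified, so both keep the original class ratio)  
4. Upsample the minority class in the training set only; the test set is left untouched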


In [7]:
# 1) Binary target
df["label"] = (df["Average_Fare"] > 1000).astype(int)

# 2) Feature selection
features = ["MktMilesFlown", "NonStopMiles", "RoundTrip", "Carrier_freq"]
X = df[features]
y = df["label"]

# 3) Train/Test split (stratify to preserve class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4) Combine & oversample positives
train_df = pd.concat([X_train, y_train.rename("label")], axis=1)
neg = train_df[train_df["label"] == 0]
pos = train_df[train_df["label"] == 1]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=42)
train_bal = pd.concat([neg, pos_up]).sample(frac=1, random_state=42)
Xb, yb = train_bal[features], train_bal["label"]
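
A quick sanity check can confirm that the split-then-oversample steps above do what the list describes. The sketch below is illustrative only: it uses a small synthetic frame (`toy`, with made-up values) instead of the real fare data, and simply counts labels before and after resampling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the fare data (hypothetical values, not the real dataset):
# 20 positives and 980 negatives.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "NonStopMiles": rng.uniform(100, 3000, size=1000),
    "label": np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)],
})

# Stratified split keeps the class ratio in both parts
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Oversample positives in the training part only
neg = train[train["label"] == 0]
pos = train[train["label"] == 1]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=42)
train_bal = pd.concat([neg, pos_up]).sample(frac=1, random_state=42)

train_counts = train["label"].value_counts().to_dict()
bal_counts = train_bal["label"].value_counts().to_dict()
test_counts = test["label"].value_counts().to_dict()

print("train before oversampling:", train_counts)
print("train after oversampling: ", bal_counts)
print("test (untouched):         ", test_counts)
```

The balanced training frame should show equal counts for both labels, while the test counts stay skewed. Running the same `value_counts()` calls on `train_df`, `train_bal`, and `y_test` in the notebook would give the corresponding check on the real data.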


### Model Training <a id="Model-Training"></a>
Pipeline: **StandardScaler → Perceptron** (no class weights, since the oversampled training set is already balanced).


In [8]:
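# Standardize features, then fit a linear Perceptron
model = make_pipeline(
    StandardScaler(),
    Perceptron(max_iter=1000, tol=1e-3)
)

# Train on the balanced (oversampled) training set
model.fit(Xb, yb)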
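
The pipeline above skips class weights because the training set was balanced by oversampling. An alternative that this notebook does not use is to leave the training data as-is and pass `class_weight="balanced"` to `Perceptron`. The sketch below shows that variant on synthetic data from `make_classification`; the data, the `weighted_model` name, and the `random_state` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 3% positives); not the airline dataset
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.97, 0.03], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Same pipeline shape as the notebook, but reweighting classes instead of resampling
weighted_model = make_pipeline(
    StandardScaler(),
    Perceptron(max_iter=1000, tol=1e-3, class_weight="balanced", random_state=42),
)
weighted_model.fit(X_tr, y_tr)

recall_pos = recall_score(y_te, weighted_model.predict(X_te))
print(f"Positive-class recall with class_weight='balanced': {recall_pos:.3f}")
```

Reweighting avoids duplicating rows, which keeps the training set at its original size. Whether it matches the oversampled model on the fare data would need to be checked empirically.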


### Evaluation <a id="Evaluation"></a>
Compute accuracy and the classification report on the original, unbalanced test set.


In [9]:
y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(
    y_test, y_pred,
    target_names=["fare ≤ 1000", "fare > 1000"]
))


Accuracy: 0.990
              precision    recall  f1-score   support

 fare ≤ 1000       1.00      0.99      0.99    316204
 fare > 1000       0.02      1.00      0.03        52

    accuracy                           0.99    316256
   macro avg       0.51      1.00      0.51    316256
weighted avg       1.00      0.99      0.99    316256
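
The report above pairs a recall of 1.00 with a precision of 0.02 for the `fare > 1000` class. A confusion matrix makes it easier to see how that combination arises when positives are rare. The sketch below uses hypothetical labels and predictions (not the notebook's actual `y_test` and `y_pred`) in which every positive is caught but many negatives are flagged as well.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth: 5 positives among 1000 samples
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# Hypothetical predictions: all 5 positives are found, plus 45 false alarms
y_hat = np.zeros(1000, dtype=int)
y_hat[:50] = 1

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found
accuracy = (tp + tn) / len(y_true)  # dominated by the majority class

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# precision=0.100 recall=1.000 accuracy=0.955
```

In the notebook, `confusion_matrix(y_test, y_pred)` would give the real counts behind the reported figures.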



### Conclusion <a id="Conclusion"></a>
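- Oversampling balanced the training set, and the model reaches 1.00 recall on the `fare > 1000` class. Precision on that class is only 0.02, however, so nearly all predicted positives are false alarms, and the 0.990 accuracy mostly reflects the majority class (316,204 of 316,256 test rows).  
- You can tune further by adding polynomial features or adjusting the Perceptron's learning rate (`eta0`).  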
- Next steps:  
  - Visualize ROC curves  
  - Extract and interpret `model.named_steps['perceptron'].coef_`  
  - Compare with logistic regression baseline
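
The next steps listed above could start from something like the following sketch. It is not run on the fare data: it uses synthetic data from `make_classification`, and the `candidates` dictionary and `random_state` values are assumptions. Because `Perceptron` has no `predict_proba`, the ROC inputs come from `decision_function`, which both pipelines expose.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 10% positives); not the airline dataset
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Perceptron pipeline and a logistic regression baseline with the same scaling
candidates = {
    "perceptron": make_pipeline(
        StandardScaler(), Perceptron(max_iter=1000, tol=1e-3, random_state=42)
    ),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

auc = {}
for name, pipe in candidates.items():
    pipe.fit(X_tr, y_tr)
    scores = pipe.decision_function(X_te)      # signed distance to the hyperplane
    fpr, tpr, _ = roc_curve(y_te, scores)      # points that could be plotted as a ROC curve
    auc[name] = roc_auc_score(y_te, scores)
    print(f"{name}: ROC AUC = {auc[name]:.3f} ({len(fpr)} curve points)")

# One weight per (standardized) feature; shape is (1, n_features) for a binary problem
weights = candidates["perceptron"].named_steps["perceptron"].coef_
print("Perceptron weights:", weights.ravel())
```

Since the features are standardized before the Perceptron sees them, the weights are comparable in magnitude across features. Plotting `fpr` against `tpr` (for example with matplotlib) would give the ROC curves themselves.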
