# Large-Scale Data Processing and Classification with Polars

# Deadline: week 2 (each week of delay is penalized by 30 pts). Latest deadline: week 4.

### Dataset: Airline Passenger Satisfaction (CSV)
Dataset link: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

The Kaggle dataset contains train.csv and test.csv. Use only **train.csv** for training and evaluation (create your own 80/20 split).

Goal: Build an end-to-end pipeline that:
* loads and processes a large structured CSV using Polars (lazy mode)
* performs cleaning + feature engineering
* trains a classification model
* evaluates performance using standard metrics
* explains lazy plan optimizations 

## Task 1: Lazy Loading + Initial Profiling (10 pts)
* load the dataset lazily with Polars (use the scan_csv(...) function)
* print the data info (column names, number of rows and number of columns)

## Task 2: Data Cleaning (15 pts)

Using Polars expressions:
* remove irrelevant columns (e.g., id)
* handle missing values (fill numerical columns with the column median, categorical columns with "Unknown")
* convert the target column to binary: "satisfied" -> 1, "neutral or dissatisfied" -> 0

## Task 3: Feature Engineering (20 pts)

Create at least 3 engineered features, for example:
* delay_flag = 1 if Departure Delay in Minutes or Arrival Delay in Minutes > threshold (e.g. 15 minutes), else 0
* delay_total_minutes = Departure Delay in Minutes + Arrival Delay in Minutes
* long_flight = 1 if Flight Distance > median else 0
* business_travel = 1 if Type of Travel == "Business travel", else 0
  
Requirement: do this in lazy mode using .with_columns(...).

## Task 4: Encoding Categorical Features (15 pts)

Identify categorical columns. Apply one-hot encoding using:
* to_dummies() (after collecting)

## Task 5: Lazy Optimization Analysis (10 pts)

Print:
* naive plan: lf.explain(optimized=False)
* optimized plan: lf.explain(optimized=True) (this is the default, so lf.explain() prints the same plan)

## Task 6: Train/Test Split + Conversion to NumPy (10 pts)

* Shuffle rows and split: 80% train and 20% test
* Convert to NumPy only at the final step: X_train, y_train, X_test, y_test
* No pandas for splitting

## Task 7: Classification Model and Evaluation (20 pts)
* apply a standard scaler and train a logistic regression classifier
* compute accuracy, precision, recall and F1-score
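
A hedged sketch of the modelling step using scikit-learn (an assumption, since the task does not name a library). It uses synthetic arrays in place of the `X_train`, `y_train`, `X_test`, `y_test` produced in Task 6, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the arrays produced in Task 6.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# The pipeline fits the scaler on the training data only, then the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Wrapping the scaler and classifier in one pipeline means the test set is scaled with statistics learned from the training set, which avoids leaking test information into the model.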

All tasks must be done in a notebook. Use clear section headers for each task.