# Flight Ticket Price Prediction

**Goal:** Predict flight ticket prices with a Random Forest regressor. The notebook covers data loading, quick EDA, feature engineering, model training, and evaluation in concise, reproducible steps.

**Dataset columns expected:** `Airline`, `Date_of_Journey`, `Source`, `Destination`, `Route`, `Dep_Time`, `Arrival_Time`, `Duration`, `Total_Stops`, `Additional_Info`, `Price`

> **Note:** Upload your dataset file (e.g., `train.csv`) to the working directory before running the notebook. In Google Colab, use the file upload button or mount Google Drive and set the path accordingly.


In [None]:
# 1. Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

print('Libraries imported successfully')

In [None]:
# 2. Load dataset
# Replace 'train.csv' with your filename if different.
DATA_PATH = 'train.csv'  # <-- change this if your file has a different name

df = pd.read_csv(DATA_PATH)
print('Dataset loaded — shape:', df.shape)
df.head()
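
The expected columns listed at the top of the notebook can be checked right after loading, so a misnamed or missing column fails early instead of deep inside the preprocessing step. The sketch below is one possible check; `EXPECTED_COLUMNS`, `missing_columns`, and the toy `demo` frame are illustrative names and data, not part of the original notebook.

```python
import pandas as pd

# Column names copied from the "Dataset columns expected" list above
EXPECTED_COLUMNS = [
    'Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
    'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
    'Additional_Info', 'Price',
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]

# Toy frame with only two of the expected columns (made-up values)
demo = pd.DataFrame({'Airline': ['Airline A'], 'Price': [1000]})
print('Missing columns:', missing_columns(demo))
```

On the real data, `missing_columns(df)` should return an empty list before the later cells are run.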

In [None]:
# 3. Quick EDA
print('Missing values per column:\n', df.isnull().sum())
print('\nData types:\n', df.dtypes)
print('\nPrice summary:')
print(df['Price'].describe())

# Basic plots: price distribution
plt.figure()
plt.hist(df['Price'].dropna(), bins=40)
plt.title('Price distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()

In [None]:
# 4. Preprocessing & Feature Engineering
# We'll:
# - Parse Date_of_Journey into day and month
# - Extract hour/minute from Dep_Time and Arrival_Time
# - Convert Duration to total minutes
# - Clean Total_Stops (map 'non-stop' or '0 stops' to 0)

def preprocess(df):
    df = df.copy()
    # Date
    df['Date_of_Journey'] = pd.to_datetime(df['Date_of_Journey'], dayfirst=True, errors='coerce')
    df['Journey_Day'] = df['Date_of_Journey'].dt.day
    df['Journey_Month'] = df['Date_of_Journey'].dt.month

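    # Times: keep only the HH:MM part so that values carrying extra text
    # (e.g. a trailing date; an assumption about the raw data) still parse
    hhmm = r'(\d{1,2}:\d{2})'
    df['Dep_Time'] = pd.to_datetime(df['Dep_Time'].astype(str).str.extract(hhmm, expand=False), format='%H:%M', errors='coerce')
    df['Dep_Hour'] = df['Dep_Time'].dt.hour
    df['Dep_Minute'] = df['Dep_Time'].dt.minute

    df['Arrival_Time'] = pd.to_datetime(df['Arrival_Time'].astype(str).str.extract(hhmm, expand=False), format='%H:%M', errors='coerce')
    df['Arrival_Hour'] = df['Arrival_Time'].dt.hour
    df['Arrival_Minute'] = df['Arrival_Time'].dt.minute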

    # Duration: convert strings like '2h 50m' or '50m' to total minutes
    def duration_to_mins(x):
        if pd.isnull(x):
            return np.nan
        parts = x.split()
        mins = 0
        for p in parts:
            if 'h' in p:
                try:
                    mins += int(p.replace('h', '')) * 60
                except ValueError:
                    pass  # skip malformed hour tokens
            elif 'm' in p:
                try:
                    mins += int(p.replace('m', ''))
                except ValueError:
                    pass  # skip malformed minute tokens
        return mins
    df['Duration_mins'] = df['Duration'].apply(duration_to_mins)

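    # Total stops: 'non-stop' contains no digits, so map it to 0 explicitly;
    # otherwise extract the number (e.g. '2 stops' -> 2)
    stops = df['Total_Stops'].astype(str).str.strip().str.lower()
    df['Total_Stops_num'] = stops.str.extract(r'(\d+)', expand=False).astype(float)
    df.loc[stops == 'non-stop', 'Total_Stops_num'] = 0.0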

    return df

# Apply preprocessing
df_processed = preprocess(df)
print('After feature engineering — columns:', df_processed.columns.tolist())
df_processed[['Journey_Day','Journey_Month','Dep_Hour','Arrival_Hour','Duration_mins','Total_Stops_num']].head()
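
To make the parsing rules above concrete, here is a small standalone sketch that converts a few sample strings. It uses regular expressions rather than the token loop in `preprocess`, and the helper names `duration_to_minutes` and `stops_to_number` are illustrative, not part of the notebook.

```python
import re

def duration_to_minutes(text):
    """Convert strings such as '2h 50m', '19h', or '5m' to total minutes."""
    hours = re.search(r'(\d+)\s*h', text)
    minutes = re.search(r'(\d+)\s*m', text)
    if not hours and not minutes:
        return None  # flag unparseable input instead of counting it as 0
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

def stops_to_number(text):
    """Map 'non-stop' to 0 and strings like '2 stops' to 2."""
    if text.strip().lower() == 'non-stop':
        return 0
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

print([duration_to_minutes(s) for s in ['2h 50m', '19h', '5m', 'unknown']])
# [170, 1140, 5, None]
print([stops_to_number(s) for s in ['non-stop', '1 stop', '2 stops']])
# [0, 1, 2]
```

Returning `None` for unparseable durations is a design choice of this sketch: it lets a later `dropna()` remove bad rows rather than treating them as zero-minute flights.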

In [None]:
# 5. Prepare features and target
FEATURES = ['Airline','Source','Destination','Journey_Day','Journey_Month','Dep_Hour','Dep_Minute','Arrival_Hour','Arrival_Minute','Duration_mins','Total_Stops_num']
TARGET = 'Price'

data = df_processed[FEATURES + [TARGET]].dropna()
X = data[FEATURES]
y = data[TARGET]

print('Final dataset for modeling — X shape:', X.shape, 'y shape:', y.shape)
X.head()

In [None]:
# 6. Train/test split and modeling pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Categorical and numeric handling
cat_cols = ['Airline','Source','Destination']
num_cols = ['Journey_Day','Journey_Month','Dep_Hour','Dep_Minute','Arrival_Hour','Arrival_Minute','Duration_mins','Total_Stops_num']

preprocessor = ColumnTransformer(
    transformers=[
        # sparse_output replaces the removed `sparse` argument (scikit-learn >= 1.2)
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols),
        ('num', 'passthrough', num_cols)
    ]
)

model = RandomForestRegressor(random_state=42, n_estimators=100)

pipe = Pipeline(steps=[('pre', preprocessor), ('model', model)])

# Fit
pipe.fit(X_train, y_train)
print('Model trained')
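
The `handle_unknown='ignore'` setting matters when the test split contains a category that never appeared in training. A minimal sketch of that behavior, using made-up category labels `'A'`, `'B'`, and `'C'`:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit([['A'], ['B']])

# 'C' was never seen during fit, so its row is encoded as all zeros
encoded = encoder.transform([['A'], ['C']]).toarray()
print(encoded)
# [[1. 0.]
#  [0. 0.]]
```

Without `handle_unknown='ignore'`, the default setting raises an error at transform time when it meets the unseen category.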

In [None]:
# 7. Evaluation
y_pred = pipe.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f'R2 score: {r2:.4f}')
print(f'MAE: {mae:.2f}')
print(f'RMSE: {rmse:.2f}')

# Scatter plot: Actual vs Predicted
plt.figure()
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Ticket Price')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.show()
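
For reference, the three metrics can be computed by hand on a tiny example. The prices below are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up actual and predicted prices
y_true = np.array([3000.0, 5000.0, 7000.0, 9000.0])
y_hat = np.array([3200.0, 4800.0, 7500.0, 8500.0])

errors = y_true - y_hat                          # [-200, 200, -500, 500]
mae = np.mean(np.abs(errors))                    # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(f'MAE: {mae:.2f}')    # MAE: 350.00
print(f'RMSE: {rmse:.2f}')  # RMSE: 380.79
print(f'R2: {r2:.4f}')      # R2: 0.9710
```

RMSE is larger than MAE here because the two 500-unit errors dominate once squared.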

In [None]:
# 8. Feature importance (impurity-based; each one-hot column is scored separately)
pre = pipe.named_steps['pre']
cat_features = pre.named_transformers_['cat'].get_feature_names_out(cat_cols)
# Column order matches the ColumnTransformer: encoded categoricals first, then numerics
all_features = list(cat_features) + num_cols

importances = pipe.named_steps['model'].feature_importances_
feat_imp = pd.Series(importances, index=all_features).sort_values(ascending=False)[:20]

plt.figure(figsize=(8,5))
plt.bar(feat_imp.index, feat_imp.values)
plt.xticks(rotation=90)
plt.title('Top feature importances')
plt.tight_layout()
plt.show()

print('Top features:\n', feat_imp)
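
Because one-hot encoding splits each categorical column into many indicator columns, it can help to sum the importances back to the original column. The sketch below shows one way to do that; the importance values are made up for illustration and `original_column` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up importances keyed by encoded feature name (they sum to 1.0)
encoded_importances = pd.Series({
    'Airline_A': 0.10, 'Airline_B': 0.15,
    'Source_X': 0.05, 'Source_Y': 0.05,
    'Destination_P': 0.05,
    'Duration_mins': 0.40, 'Total_Stops_num': 0.20,
})
cat_cols = ['Airline', 'Source', 'Destination']

def original_column(name):
    """Map 'Airline_A' back to 'Airline'; numeric columns map to themselves."""
    for col in cat_cols:
        if name.startswith(col + '_'):
            return col
    return name

# groupby with a function applies it to each index label
grouped = encoded_importances.groupby(original_column).sum().sort_values(ascending=False)
print(grouped)
```

With the real model, `feat_imp` computed before the top-20 cut could be passed through the same grouping.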

## 9. Conclusion

- We built a Random Forest model to predict flight ticket prices using common flight features.
- Primary evaluation: **R² score**, **MAE**, and **RMSE** (printed above).
- Visual checks: Actual vs Predicted scatter plot and feature importance chart.

### Next steps / improvements
- Hyperparameter tuning (GridSearchCV) to further improve R².
- More advanced feature engineering (e.g., holiday flags, seasonal features).
- Try gradient boosting (XGBoost / LightGBM) for potential performance gains.
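
As a starting point for the tuning step, here is a minimal `GridSearchCV` sketch. It runs on synthetic data so it is self-contained; the grid values, `cv=3`, and the names `X_demo`, `y_demo`, and `demo_pipe` are assumptions for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data standing in for the flight features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

demo_pipe = Pipeline(steps=[('model', RandomForestRegressor(random_state=42))])

# Pipeline parameters are addressed as '<step name>__<parameter>'
param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 5],
}

search = GridSearchCV(demo_pipe, param_grid, cv=3, scoring='r2')
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best cross-validated R2:', round(search.best_score_, 3))
```

In this notebook the same grid could be passed with `pipe` in place of `demo_pipe` and fitted on `X_train`, `y_train`, since the model step is already named `'model'`.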
