# Task 2

---

## Predictive modeling of customer bookings

This Jupyter notebook includes some code to get you started with this predictive modeling task. We will use various packages for data manipulation, feature engineering and machine learning.

### Exploratory data analysis

First, we must explore the data in order to better understand what we have and the statistical properties of the dataset.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/customer_booking.csv", encoding="ISO-8859-1")
df.head()

| | num_passengers | sales_channel | trip_type | purchase_lead | length_of_stay | flight_hour | flight_day | route | booking_origin | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Internet | RoundTrip | 262 | 19 | 7 | Sat | AKLDEL | New Zealand | 1 | 0 | 0 | 5.52 | 0 |
| 1 | 1 | Internet | RoundTrip | 112 | 20 | 3 | Sat | AKLDEL | New Zealand | 0 | 0 | 0 | 5.52 | 0 |
| 2 | 2 | Internet | RoundTrip | 243 | 22 | 17 | Wed | AKLDEL | India | 1 | 1 | 0 | 5.52 | 0 |
| 3 | 1 | Internet | RoundTrip | 96 | 31 | 4 | Sat | AKLDEL | New Zealand | 0 | 0 | 1 | 5.52 | 0 |
| 4 | 2 | Internet | RoundTrip | 68 | 22 | 15 | Wed | AKLDEL | India | 1 | 0 | 1 | 5.52 | 0 |


The `.head()` method shows the first 5 rows of the dataset, which is useful for visually inspecting the columns.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   num_passengers         50000 non-null  int64  
 1   sales_channel          50000 non-null  object 
 2   trip_type              50000 non-null  object 
 3   purchase_lead          50000 non-null  int64  
 4   length_of_stay         50000 non-null  int64  
 5   flight_hour            50000 non-null  int64  
 6   flight_day             50000 non-null  object 
 7   route                  50000 non-null  object 
 8   booking_origin         50000 non-null  object 
 9   wants_extra_baggage    50000 non-null  int64  
 10  wants_preferred_seat   50000 non-null  int64  
 11  wants_in_flight_meals  50000 non-null  int64  
 12  flight_duration        50000 non-null  float64
 13  booking_complete       50000 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.3+ 

The `.info()` method describes the data: the column names, their data types and the number of non-null values in each. Every column has 50,000 non-null values out of 50,000 rows, so there are no nulls to handle. Some columns should be converted to other data types, e.g. `flight_day`.

To provide more context, below is a more detailed data description, explaining exactly what each column means:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel booking was made on
- `trip_type` = trip Type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country from where booking was made
- `wants_extra_baggage` = if the customer wanted extra baggage in the booking
- `wants_preferred_seat` = if the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = if the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)
- `booking_complete` = flag indicating if the customer completed the booking

Before we compute any statistics on the data, let's do any necessary data conversion.

In [4]:
df["flight_day"].unique()

array(['Sat', 'Wed', 'Thu', 'Mon', 'Sun', 'Tue', 'Fri'], dtype=object)

In [5]:
mapping = {
    "Mon": 1,
    "Tue": 2,
    "Wed": 3,
    "Thu": 4,
    "Fri": 5,
    "Sat": 6,
    "Sun": 7,
}

df["flight_day"] = df["flight_day"].map(mapping)

In [6]:
df["flight_day"].unique()

array([6, 3, 4, 1, 7, 2, 5])
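
`Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so it is worth confirming that every label was covered. A minimal sketch on a made-up column (the `"Sunday"` label is hypothetical and does not occur in the booking data):

```python
import pandas as pd

mapping = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}

# Hypothetical column containing one label the mapping does not know
days = pd.Series(["Sat", "Wed", "Sunday", "Mon"])
mapped = days.map(mapping)

# Labels that were not in the dictionary come back as NaN
unmapped = days[mapped.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

On the real data, the `unique()` output above already shows seven integer values and no `nan`, so the mapping was complete.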

In [7]:
df.describe()

| | num_passengers | purchase_lead | length_of_stay | flight_hour | flight_day | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 1.59124 | 84.94048 | 23.04456 | 9.06634 | 3.81442 | 0.66878 | 0.29696 | 0.42714 | 7.277561 | 0.14956 |
| std | 1.020165 | 90.451378 | 33.88767 | 5.41266 | 1.992792 | 0.470657 | 0.456923 | 0.494668 | 1.496863 | 0.356643 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4.67 | 0.0 |
| 25% | 1.0 | 21.0 | 5.0 | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 5.62 | 0.0 |
| 50% | 1.0 | 51.0 | 17.0 | 9.0 | 4.0 | 1.0 | 0.0 | 0.0 | 7.57 | 0.0 |
| 75% | 2.0 | 115.0 | 28.0 | 13.0 | 5.0 | 1.0 | 1.0 | 1.0 | 8.83 | 0.0 |
| max | 9.0 | 867.0 | 778.0 | 23.0 | 7.0 | 1.0 | 1.0 | 1.0 | 9.5 | 1.0 |


The `.describe()` method gives us descriptive statistics for the dataset (by default it covers numeric columns only, which is why the text columns are missing from the output). This gives a quick overview of the mean, min, max and overall distribution of each column.
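
The text columns can be summarised too by changing which dtypes `.describe()` selects. A minimal sketch on a made-up three-row frame (not the booking data), assuming we simply exclude the numeric columns:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two text columns and one numeric column
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Internet", "Mobile"],
    "trip_type": ["RoundTrip", "OneWay", "RoundTrip"],
    "purchase_lead": [262, 112, 68],
})

# Default behaviour: numeric columns only
numeric_summary = toy.describe()

# Excluding numeric dtypes reports count, unique, top (most frequent) and freq
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```

For the booking data, the equivalent call would be `df.describe(exclude=[np.number])`.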

From this point, you should continue exploring the dataset with some visualisations and other metrics that you think may be useful. Then, you should prepare your dataset for predictive modelling. Finally, you should train your machine learning model, evaluate it with performance metrics and output visualisations for the contributing variables. All of this analysis should be summarised in your single slide.

## 1 - Quick recap and next steps

This notebook continues from the provided **Getting Started** notebook. Follow the cells below in order. Code cells are commented, and markdown cells contain instructions and explanations. You'll perform:

- Additional exploratory data analysis (EDA)
- Feature engineering & preprocessing
- Model training (Random Forest)
- Cross-validation & evaluation
- Feature importance visualization
- Saving key charts and generating a single PowerPoint summary slide

Make sure to run the cells sequentially in VS Code or Jupyter.

### 1.1 - Imports and configuration
Run this cell to import required libraries. If a library is missing, install it in your environment (e.g., `pip install scikit-learn matplotlib seaborn python-pptx`).

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import joblib
import os

# Optional: for PowerPoint export
try:
    from pptx import Presentation
    from pptx.util import Inches, Pt
    PPTX_AVAILABLE = True
except ImportError:
    PPTX_AVAILABLE = False
    print('python-pptx not available; PowerPoint slide creation cell will skip or raise instructions to install it.')

# plotting defaults
# If running in a Jupyter environment you may want to enable the inline magic:
# %matplotlib inline
plt.rcParams['figure.figsize'] = (10,6)
sns.set(style='whitegrid')

### 1.2 - Load the dataset
The starter notebook reads `data/customer_booking.csv`. If your copy of the file lives elsewhere, update `file_path` in the cell below.

In [None]:
# Update this path if you run the notebook on your machine
file_path = "data/customer_booking.csv"

# If that path doesn't exist, fall back to a machine-specific absolute path (edit this for your setup)
if not os.path.exists(file_path):
    alt = r"C:\Users\Sandeep\Desktop\British Airways - Data Science Project\customer_booking.csv"
    if os.path.exists(alt):
        file_path = alt

df = pd.read_csv(file_path, encoding="ISO-8859-1")
print('Loaded:', file_path)
df.head()

## 2 - Extended Exploratory Data Analysis (EDA)
Run the following cells to get a better view of the dataset distributions and relationships with the target variable `booking_complete`.

In [None]:
# Basic info & target distribution
print(df.shape)
df.info()  # info() prints its report itself and returns None, so no print() wrapper
print('\nTarget value counts:')
print(df['booking_complete'].value_counts(normalize=False))
print('\nTarget proportions:')
print(df['booking_complete'].value_counts(normalize=True))

# Numeric summary
display(df.describe().T)

In [None]:
# Visualisations: target balance
plt.figure(figsize=(6,4))
sns.countplot(x='booking_complete', data=df)
plt.title('Booking Complete: Counts')
plt.show()

# Purchase lead distribution
if 'purchase_lead' in df.columns:
    plt.figure(figsize=(8,4))
    sns.histplot(df['purchase_lead'], bins=50, kde=True)
    plt.title('Distribution of Purchase Lead (days)')
    plt.show()

# Booking completion by trip_type (if exists)
if 'trip_type' in df.columns:
    plt.figure(figsize=(8,4))
    sns.barplot(x='trip_type', y='booking_complete', data=df, estimator=np.mean)
    plt.title('Booking completion rate by trip_type')
    plt.show()
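
The bar plot above shows the mean of `booking_complete` per category. The same rates can be tabulated with `groupby`, which also makes the group sizes visible. A minimal sketch on made-up rows (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy data: five bookings across two sales channels
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Internet", "Mobile", "Mobile", "Internet"],
    "booking_complete": [0, 1, 0, 0, 1],
})

# Completion rate and number of rows per category, highest rate first
rates = (
    toy.groupby("sales_channel")["booking_complete"]
    .agg(rate="mean", n="size")
    .sort_values("rate", ascending=False)
)
print(rates)
```

Reporting `n` next to the rate helps avoid over-reading a high or low rate that rests on only a few rows.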

## 3 - Preprocessing & Feature Engineering
Suggested steps:

- Map `flight_day` to numeric (starter mapping)
- Create `num_addons` from the add-on boolean columns
- Bin `purchase_lead` into categories (optional)
- Reduce `route` cardinality by grouping into regions (optional)
- One-hot encode categorical features (or use ordinal encoding where appropriate)

The following cell implements these steps in a robust way (it checks columns exist before creating features).

In [None]:
# Copy df to avoid accidental changes
data = df.copy()

# Map flight_day if present and string-coded (starter mapping)
if 'flight_day' in data.columns and not pd.api.types.is_numeric_dtype(data['flight_day']):
    mapping = {'Mon':1,'Tue':2,'Wed':3,'Thu':4,'Fri':5,'Sat':6,'Sun':7}
    data['flight_day'] = data['flight_day'].map(mapping)

# Create num_addons if addon columns exist (assumes 0/1 or boolean)
addon_cols = [c for c in ['wants_extra_baggage','wants_preferred_seat','wants_in_flight_meals'] if c in data.columns]
if addon_cols:
    data['num_addons'] = data[addon_cols].sum(axis=1)
else:
    data['num_addons'] = 0

# Bin purchase_lead into categories
if 'purchase_lead' in data.columns:
    bins = [-1,7,30,90,365, 10000]
    labels = ['last_minute','short_term','medium_term','long_term','very_long']
    data['lead_bucket'] = pd.cut(data['purchase_lead'], bins=bins, labels=labels)

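# Reduce route cardinality: extract destination region if present (simple heuristic)
# Routes in the sample rows look like 'AKLDEL' (origin code + destination code, no separator),
# so the destination is assumed to be the last three characters.
if 'route' in data.columns:
    dest = data['route'].str.strip().str[-3:]
    data['dest'] = dest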
    def map_region(x):
        if pd.isna(x): return 'Other'
        x = x.upper()
        if x.startswith(('JFK','LAX','YYZ','EWR','SFO','BOS','MIA','ORD')):
            return 'North America'
        if x.startswith(('FRA','CDG','FCO','MAD','IST','DUB','MAN','BCN')):
            return 'Europe'
        if x.startswith(('HND','HKG','DEL','BOM','SIN','NRT','PVG')):
            return 'Asia'
        if x.startswith(('DXB','DOH','AUH')):
            return 'Middle East'
        return 'Other'
    data['dest_region'] = data['dest'].apply(map_region)
else:
    data['dest_region'] = 'Unknown'

# One-hot encode selected categorical columns (drop first to avoid multicollinearity)
cat_cols = []
for c in ['sales_channel','trip_type','lead_bucket','booking_origin','dest_region']:
    if c in data.columns:
        cat_cols.append(c)

data_enc = pd.get_dummies(data, columns=cat_cols, drop_first=True)

# Drop columns not useful for modeling (IDs or text-heavy freeform)
drop_like = [c for c in ['route','dest'] if c in data_enc.columns]
data_enc = data_enc.drop(columns=drop_like)

print('Prepared data shape:', data_enc.shape)
data_enc.head()
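
To see how the `purchase_lead` bins behave at their edges, here is a small sketch using the same `bins` and `labels` as the cell above on a handful of hand-picked values. With `pd.cut`'s default `right=True`, each interval includes its right edge, so 7 days is still `last_minute` and 8 days is `short_term`.

```python
import pandas as pd

bins = [-1, 7, 30, 90, 365, 10000]
labels = ["last_minute", "short_term", "medium_term", "long_term", "very_long"]

# Boundary values chosen to show where each bucket starts and ends
lead = pd.Series([0, 7, 8, 30, 31, 365, 867])
buckets = pd.cut(lead, bins=bins, labels=labels)

print(pd.DataFrame({"purchase_lead": lead, "lead_bucket": buckets}))
```

The lower edge of -1 is what keeps a lead time of 0 days inside the first bucket instead of turning it into `NaN`.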

## 4 - Train/Test split
We'll stratify by the target to keep class balance across splits.

In [None]:
# Prepare X and y
target = 'booking_complete'
if target not in data_enc.columns:
    raise ValueError(f"Target column '{target}' not found in data_enc columns. Columns: {list(data_enc.columns)}")
X = data_enc.drop(columns=[target])
y = data_enc[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
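
As a quick illustration of what `stratify=y` does, the sketch below splits synthetic labels with a 15% positive rate (close to the rate of `booking_complete`) and compares the class balance of the two parts. The arrays `X_toy` and `y_toy` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 150 positives out of 1,000 rows
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep (almost exactly) the original positive rate
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without stratification the two rates can drift apart by chance, which matters more when the positive class is small.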

## 5 - Train a Random Forest classifier
A random forest exposes feature importances, which helps with interpretability. You can tune hyperparameters using `GridSearchCV` if desired.

In [None]:
# Train Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced', n_jobs=-1)
rf.fit(X_train, y_train)

# Save the model for future use
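model_path = '/mnt/data/rf_customer_booking.joblib'  # change this path to suit your machine
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # create the output folder if it is missing
joblib.dump(rf, model_path)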
print('Model saved to', model_path)
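
For context on `class_weight='balanced'`: scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts implied by the `describe()` output earlier (50,000 rows with a `booking_complete` mean of 0.14956). The model itself computes the weights on the training split, so treat these numbers as approximate.

```python
# Class counts implied by the describe() output: 50,000 rows, mean(booking_complete) = 0.14956
n_samples = 50_000
n_pos = round(n_samples * 0.14956)  # completed bookings
n_neg = n_samples - n_pos           # incomplete bookings
n_classes = 2

# "balanced" heuristic: n_samples / (n_classes * class_count)
weight_neg = n_samples / (n_classes * n_neg)
weight_pos = n_samples / (n_classes * n_pos)

print(f"class 0 weight: {weight_neg:.3f}")
print(f"class 1 weight: {weight_pos:.3f}")
```

A completed booking therefore counts roughly 5.7 times as much as an incomplete one during training, which offsets the imbalance between the classes.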

## 6 - Model evaluation
Evaluate on the test set and with cross-validation. ROC AUC is the primary metric because it scores the predicted probabilities rather than a single classification threshold.

In [None]:
# Predictions & metrics
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:,1]

print('Classification report:')
print(classification_report(y_test, y_pred))
print('\nConfusion matrix:')
print(confusion_matrix(y_test, y_pred))
print('\nROC AUC:', roc_auc_score(y_test, y_proba))

# Cross-validation AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(rf, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
print('\nCross-validation AUC scores:', cv_scores)
print('Mean CV AUC:', cv_scores.mean())
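
One way to read the AUC values above: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The pure-Python sketch below computes it that way on five made-up scores. The helper `pairwise_auc` is written for this illustration and is not part of scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical)
y_true = [0, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80]

auc = pairwise_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Here 5 of the 6 positive/negative pairs are ordered correctly, so the AUC is 5/6. A model that scores at random would land near 0.5.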

## 7 - Feature importance
Plot the top features and save the figure for inclusion in the PowerPoint slide.

In [None]:
# Feature importances
importances = rf.feature_importances_
feat_names = X.columns
indices = np.argsort(importances)[::-1]

top_n = 15
top_idx = indices[:top_n]
plt.figure(figsize=(10,8))
sns.barplot(x=importances[top_idx], y=feat_names[top_idx])
plt.title(f'Top {top_n} Feature Importances - Random Forest')
plt.tight_layout()
fig_path = '/mnt/data/feature_importances.png'
os.makedirs(os.path.dirname(fig_path), exist_ok=True)  # create the output folder if it is missing
plt.savefig(fig_path, dpi=150)
plt.show()
print('Saved feature importance plot to', fig_path)

## 8 - Save evaluation plots (ROC)
Save ROC curve image.

In [None]:
# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0,1],[0,1],'--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Test Set')
plt.legend(loc='lower right')
roc_path = '/mnt/data/roc_curve.png'
os.makedirs(os.path.dirname(roc_path), exist_ok=True)  # create the output folder if it is missing
plt.savefig(roc_path, dpi=150)
plt.show()
print('Saved ROC curve to', roc_path)

## 9 - Create a single-slide PowerPoint summary (optional)
If `python-pptx` is installed, this cell will create a single slide summarising model performance and embedding the feature importance & ROC images. If not installed, you'll see a message with install instructions.

In [None]:
if not PPTX_AVAILABLE:
    print('python-pptx is not installed. To enable PowerPoint creation run: pip install python-pptx')
else:
    prs = Presentation()
    try:
        template_path = 'resources/powerpoint_template.pptx'
        if os.path.exists(template_path):
            prs = Presentation(template_path)
    except Exception:
        pass

    slide_layout = prs.slide_layouts[5] if len(prs.slide_layouts) > 5 else prs.slide_layouts[0]
    slide = prs.slides.add_slide(slide_layout)

    # Title
    try:
        title = slide.shapes.title
        title.text = 'Predicting Customer Bookings — Key Results'
    except Exception:
        pass

    # Add a textbox with summary
    left = Inches(0.5)
    top = Inches(1.4)
    width = Inches(5.0)
    height = Inches(3.5)
    txBox = slide.shapes.add_textbox(left, top, width, height)
    tf = txBox.text_frame
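    tf.text = 'Model: Random Forest'
    # Fill in the metrics computed in the evaluation and feature importance cells above
    summary_lines = [
        f'Test ROC AUC: {roc_auc_score(y_test, y_proba):.3f}',
        f'Mean CV AUC: {cv_scores.mean():.3f}',
        'Top features: ' + ', '.join(feat_names[top_idx][:5]),
        'Business insight: (add concise insight)',
    ]
    for line in summary_lines:
        p = tf.add_paragraph()
        p.text = line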

    # Insert images if available
    img_path = '/mnt/data/feature_importances.png'
    if os.path.exists(img_path):
        slide.shapes.add_picture(img_path, Inches(5.6), Inches(1.4), width=Inches(4.0))
    roc_img = '/mnt/data/roc_curve.png'
    if os.path.exists(roc_img):
        slide.shapes.add_picture(roc_img, Inches(5.6), Inches(4.4), width=Inches(3.5))

    out_pptx = '/mnt/data/Booking_Prediction_Summary.pptx'
    os.makedirs(os.path.dirname(out_pptx), exist_ok=True)  # create the output folder if it is missing
    prs.save(out_pptx)
    print('Saved PowerPoint summary to', out_pptx)

## 10 - Next steps and tips
- Consider hyperparameter tuning with GridSearchCV or RandomizedSearchCV to improve performance.
- Try alternative models (XGBoost / LightGBM) and compare AUC.
- If `route` has meaningful codes, build a proper region mapping rather than heuristic string matching.
- Document any caveats (class imbalance, missing contextual features).
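
As a starting point for the tuning suggestion above, here is a minimal `GridSearchCV` sketch. It runs on synthetic data from `make_classification` so that it is self-contained. The grid values are illustrative assumptions, not recommended settings, and on the real data you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the booking data (about 15% positives)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Small illustrative grid; widen it for a real search
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best mean CV AUC:", round(search.best_score_, 3))
```

Tuning on the training split only, and keeping the test set for a single final check, avoids an optimistic estimate of performance.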

---

**Run the notebook locally in VS Code or Jupyter to execute the cells.**