# SAR Rental Car Cancellation Prediction

This notebook replicates in Python an analysis previously done in R. It covers:
- Data Cleaning
- Feature Engineering
- Modeling (Logistic Regression, Random Forest)
- Evaluation and Insights

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load cleaned dataset
df = pd.read_csv("SAR_Cleaned_Median.csv")
df.head()

## Data Preparation

In [None]:
# Separate the features from the target column.
# Any identifier or low-variance columns still in the file should also be dropped here.
X = df.drop(columns=["Car_Cancellation"])
y = df["Car_Cancellation"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
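
Cancellation targets are often imbalanced, and a purely random split can leave the test set with a different cancellation rate than the training set. The sketch below shows one way to check the class balance and keep it equal across both splits with the `stratify` argument of `train_test_split`. It runs on synthetic stand-in data: the `demo_X` and `demo_y` names, the two feature columns, and the 10% positive rate are illustrative assumptions, not properties of `SAR_Cleaned_Median.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 10% cancellations coded as 1.
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
})
demo_y = pd.Series(
    np.r_[np.ones(100, dtype=int), np.zeros(900, dtype=int)],
    name="Car_Cancellation",
)

# Share of each class before splitting.
print(demo_y.value_counts(normalize=True))

# stratify=demo_y keeps the class proportions the same in train and test.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

print("train positive rate:", ys_train.mean())
print("test positive rate:", ys_test.mean())
```

For the real data, the same idea would amount to passing `stratify=y` to the split, if the class counts turn out to be skewed.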


## Logistic Regression

In [None]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred_log = log_reg.predict(X_test)

print("Classification Report - Logistic Regression")
print(classification_report(y_test, y_pred_log))
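
The imports at the top also bring in `confusion_matrix` and `accuracy_score`, which can complement the classification report by showing the raw counts of correct and incorrect predictions. A minimal sketch follows, assuming the target is coded as 0 (kept) and 1 (cancelled). The helper `summarize_predictions` and the `demo_true` / `demo_pred` lists are hypothetical and exist only for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix


def summarize_predictions(y_true, y_pred):
    """Return accuracy and the four confusion-matrix counts for a 0/1 target."""
    # labels=[0, 1] fixes the layout so ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "tn": int(tn),
        "fp": int(fp),
        "fn": int(fn),
        "tp": int(tp),
    }


# Hand-made labels (assumption), not model output from this notebook.
demo_true = [0, 0, 0, 0, 1, 1, 0, 1]
demo_pred = [0, 0, 1, 0, 1, 0, 0, 1]

summary = summarize_predictions(demo_true, demo_pred)
print(summary)
```

In the notebook itself, the call would presumably be `summarize_predictions(y_test, y_pred_log)`, and likewise for the Random Forest predictions.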

## Random Forest Classifier

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Classification Report - Random Forest")
print(classification_report(y_test, y_pred_rf))

# Feature importance
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.nlargest(10).plot(kind='barh', title='Top 10 Feature Importances')
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()
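
The `feature_importances_` values plotted above are impurity-based, and they are known to favor features with many distinct values. As a cross-check, one option is permutation importance on held-out data, which is a different technique from the one used in this notebook. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the dataset, the `feat_*` column names, and the `demo_*` variables are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 features, only 2 of them informative.
features, target = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
demo_df = pd.DataFrame(features, columns=[f"feat_{i}" for i in range(6)])

demo_train, demo_test, target_train, target_test = train_test_split(
    demo_df, target, test_size=0.2, random_state=42, stratify=target
)

demo_rf = RandomForestClassifier(n_estimators=100, random_state=42)
demo_rf.fit(demo_train, target_train)

# Shuffle each column of the test set in turn and record the mean drop in score.
perm_result = permutation_importance(
    demo_rf, demo_test, target_test, n_repeats=10, random_state=42
)
perm_importances = pd.Series(
    perm_result.importances_mean, index=demo_df.columns
).sort_values(ascending=False)
print(perm_importances)
```

If the two rankings disagree sharply on the real data, that is a signal to treat the impurity-based plot with some caution.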

## Conclusion
- Random Forest performed well on the held-out test set, and its feature importances show which inputs the model relies on.
- Distance, booking type, and time-based features ranked as the strongest predictors of cancellation.
- This notebook provides an interpretable baseline for modeling rental car cancellations.