# SAR Rental Cancellation Prediction – Fully Explained

This project aims to predict rental car cancellations using user behavior, booking patterns, and geospatial data.

## Step 1: Import Required Libraries and Load the Data

In [None]:
# Import essential libraries for data handling, modeling, and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load cleaned dataset
df = pd.read_csv("SAR_Cleaned_Median.csv")  # Replace with your dataset if different
df.head()

## Step 2: Understand the Dataset

In [None]:
# Check the shape and overview
print("Dataset Shape:", df.shape)
df.info()
df.describe()
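
Cancellation targets are often imbalanced, so it can help to look at the class shares and the fraction of missing values before modeling. The following is a minimal sketch under that assumption: `summarize_target` is a hypothetical helper name, and the toy frame only stands in for `df`.

```python
import pandas as pd

def summarize_target(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return the count and share of each class in the target column."""
    counts = frame[target].value_counts()
    return pd.DataFrame({"count": counts, "share": counts / len(frame)})

# Toy data standing in for df; real values will differ
toy = pd.DataFrame({
    "Car_Cancellation": [0, 0, 0, 1],
    "trip_distance_km": [5.0, None, 12.5, 3.2],
})

balance = summarize_target(toy, "Car_Cancellation")
missing_share = toy.isna().mean()  # fraction of missing values per column

print(balance)
print(missing_share)
```

In the notebook itself, the same two calls could be run as `summarize_target(df, "Car_Cancellation")` and `df.isna().mean()`.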

## Step 3: Feature Description


- `trip_distance_km`: Computed using the Haversine formula between pickup and drop-off locations.
- `booking_type`, `package_type`: Categorical predictors affecting booking reliability.
- `booking_created`, `from_date`, `to_date`: Used to derive timing features (booking gap, urgency).
- `Car_Cancellation`: Target variable.
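
The distance and timing features described above can be sketched as follows. This is an illustration rather than the code that produced the cleaned dataset: the coordinate column names (`from_lat`, `from_long`, `to_lat`, `to_long`), the helper names, and the toy row are all assumptions, and the booking gap is taken here to mean the hours between `booking_created` and `from_date`.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; inputs are degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def booking_gap_hours(created, start):
    """Hours between when a booking was created and when the trip starts."""
    delta = pd.to_datetime(start) - pd.to_datetime(created)
    return delta.dt.total_seconds() / 3600

# Toy row; the coordinate column names are hypothetical
trips = pd.DataFrame({
    "from_lat": [12.9716], "from_long": [77.5946],
    "to_lat": [13.0827], "to_long": [80.2707],
    "booking_created": ["2013-01-01 08:00"],
    "from_date": ["2013-01-01 20:30"],
})

trips["trip_distance_km"] = haversine_km(
    trips["from_lat"], trips["from_long"], trips["to_lat"], trips["to_long"]
)
trips["booking_gap_hours"] = booking_gap_hours(trips["booking_created"], trips["from_date"])
print(trips[["trip_distance_km", "booking_gap_hours"]])
```

Because both helpers operate on whole columns, they avoid a row-by-row `apply`, which tends to matter on larger booking tables.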


## Step 4: Preprocessing

In [None]:
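# Separate features (X) and target (y); replace "Car_Cancellation" if your target column differs
# Note: the scikit-learn models below expect every column in X to be numeric
X = df.drop(columns=["Car_Cancellation"])
y = df["Car_Cancellation"]

# Split into training (80%) and testing (20%) sets; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)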
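
scikit-learn estimators need numeric input, so categorical columns such as `booking_type` have to be encoded if the cleaned file does not already do so (whether it does is not stated here). The sketch below shows one possible approach on toy data: one-hot encoding with `pd.get_dummies`, plus a stratified split that keeps the cancellation rate equal in both sets. The category values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df; the category values are made up
toy = pd.DataFrame({
    "booking_type": ["online", "mobile", "online", "phone"] * 5,
    "trip_distance_km": [4.2, 11.0, 7.5, 23.1] * 5,
    "Car_Cancellation": [0, 0, 0, 1] * 5,
})

X_toy = toy.drop(columns=["Car_Cancellation"])
y_toy = toy["Car_Cancellation"]

# One-hot encode every non-numeric column so the estimators receive numbers only
categorical_cols = X_toy.select_dtypes(exclude="number").columns
X_encoded = pd.get_dummies(X_toy, columns=list(categorical_cols), dtype=int)

# stratify=y_toy keeps the share of cancellations the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_encoded, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_encoded.columns.tolist())
print("Train rate:", y_tr.mean(), "Test rate:", y_te.mean())
```

Raw timestamp columns such as `booking_created` would need similar treatment: either convert them into numeric timing features or drop them before fitting.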

## Step 5: Logistic Regression

In [None]:
# Initialize and train logistic regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Make predictions
y_pred_log = log_reg.predict(X_test)

# Evaluate the model
print("Classification Report - Logistic Regression")
print(classification_report(y_test, y_pred_log))
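
Step 1 imports `confusion_matrix` and `accuracy_score`, which can complement the classification report by showing exactly how many cancellations were missed. A minimal sketch follows, assuming that `1` marks a cancelled booking; `evaluate_predictions` is a hypothetical helper, and the toy labels stand in for `y_test` and `y_pred_log`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_predictions(y_true, y_pred):
    """Return accuracy and a labeled confusion matrix for a binary target."""
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    cm_df = pd.DataFrame(
        cm,
        index=["actual_kept", "actual_cancelled"],   # assumes 1 = cancelled
        columns=["pred_kept", "pred_cancelled"],
    )
    return acc, cm_df

# Toy labels; in the notebook these would be y_test and y_pred_log
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]

acc, cm_df = evaluate_predictions(y_true_toy, y_pred_toy)
print(f"Accuracy: {acc:.3f}")
print(cm_df)
```

On an imbalanced target, accuracy alone can look high even when most cancellations are missed, so the `actual_cancelled` row is usually the one to watch.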

## Step 6: Random Forest Classifier

In [None]:
# Train Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf.predict(X_test)
print("Classification Report - Random Forest")
print(classification_report(y_test, y_pred_rf))

# Feature importance plot: keep the 10 largest, sorted so the most important bar is on top
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.nlargest(10).sort_values().plot(kind="barh", title="Top 10 Most Important Features")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()
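
The `feature_importances_` attribute is impurity-based and computed from the training data, so it can overstate high-cardinality features. Permutation importance on held-out data is one common cross-check. The sketch below uses synthetic data from `make_classification` in place of the rental features, so the column names and scores are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the rental features
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42
)
X_syn = pd.DataFrame(X_syn, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each held-out column in turn and measure the average drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
perm_importances = pd.Series(
    result.importances_mean, index=X_te.columns
).sort_values(ascending=False)
print(perm_importances)
```

In the notebook, the equivalent call would be `permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)`.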

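## Step 7: Conclusion & Next Steps

- Compare the two classification reports and inspect the feature importance plot to see which booking patterns are associated with cancellations.
- Engineer better features, or try boosting algorithms such as XGBoost.
- Add more domain-specific variables.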
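
As a starting point for the boosting suggestion, the sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, since it needs no extra installation; it is a different implementation, so results will not match XGBoost exactly. The data is synthetic and imbalanced by construction, and the hyperparameters are untuned defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the rental features
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Gradient boosting builds shallow trees sequentially, each correcting the previous ones
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_tr, y_tr)

y_pred_gb = gb.predict(X_te)
print("Classification Report - Gradient Boosting")
print(classification_report(y_te, y_pred_gb))
```

Swapping in the notebook's `X_train`, `X_test`, `y_train`, and `y_test` would allow a direct comparison with the logistic regression and random forest reports.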