<a href="https://www.kaggle.com/code/dipds109/ola-smote-bagging-boosting?scriptVersionId=160922489" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# OLA Driver Churn


In [None]:
import numpy as np 
import pandas as pd 

## Importing the data

In [None]:
df=pd.read_csv('ola_driver_scaler.csv')

In [None]:
df.head(10)

In [None]:
df.info()

Most of the columns are numerical, except the City column, which contains the code of the city the driver is from.

In [None]:
df['Driver_ID'].nunique()

Counting the distinct values in the Driver_ID column, we can see that there are 2381 unique driver IDs.

In [None]:
df.isnull().sum()

We have 61 missing Age values and 52 missing Gender values. We will address this, but before that let us convert the date columns to the proper datatypes.

In [None]:
date_columns = ["MMM-YY", "Dateofjoining", "LastWorkingDate"]

for column in date_columns:
    df[column] = pd.to_datetime(df[column])

print(df.dtypes)

#### KNN imputation to fill in the numerical columns

In [None]:
from sklearn.impute import KNNImputer

numerical_columns = ['Driver_ID', 'Age', 'Education_Level', 'Income', 'Joining Designation', 'Grade', 'Total Business Value', 'Quarterly Rating']
numerical_data = df[numerical_columns]

imputer = KNNImputer(n_neighbors=5)  # You can adjust the number of neighbors as needed
imputed_numerical_data = imputer.fit_transform(numerical_data)
# Reuse df's index so the assignment below aligns row by row
imputed_numerical_df = pd.DataFrame(imputed_numerical_data, columns=numerical_columns, index=df.index)

df[numerical_columns] = imputed_numerical_df

# Verify that missing values have been imputed
missing_values = df.isnull().sum()
print("Missing values after imputation:")
print(missing_values)
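
To make the imputation step easier to follow, here is a minimal sketch of what KNNImputer does on a tiny, made-up table (the values and the choice of `n_neighbors=2` are assumptions for illustration, not taken from the dataset). The missing Age is replaced by the average Age of the two rows whose remaining columns are closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: the second row has a missing Age
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0, 50.0],
    "Income": [50000.0, 51000.0, 90000.0, 95000.0],
})

# Neighbours are found with a NaN-aware Euclidean distance on the observed columns
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# The two nearest rows by Income are rows 0 and 2, so Age becomes (30 + 40) / 2
print(filled)
```

Since the distance is computed on raw values, columns with a large scale (such as Income) tend to dominate the choice of neighbours.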


## Feature Engineering

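Creating a column named Quarterly_Rating_Change, which holds the change in each driver's quarterly rating from one record to the next, and a flag named Rating_Increased that marks the records where the rating has improved.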

In [None]:
df['Quarterly_Rating_Change'] = df.groupby('Driver_ID')['Quarterly Rating'].diff()
df['Rating_Increased'] = (df['Quarterly_Rating_Change'] > 0).astype(int)
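
As a small illustration of how the grouped `diff()` behaves, here is a sketch on made-up ratings for two hypothetical drivers. The first record of every driver has no previous value, so its difference is NaN and the flag stays 0.

```python
import pandas as pd

# Hypothetical ratings for two drivers, already in chronological order
toy = pd.DataFrame({
    "Driver_ID": [1, 1, 1, 2, 2],
    "Quarterly Rating": [2, 2, 3, 4, 3],
})

# diff() restarts for every driver, so values never leak across drivers
toy["Quarterly_Rating_Change"] = toy.groupby("Driver_ID")["Quarterly Rating"].diff()

# NaN > 0 evaluates to False, so the first record of each driver gets 0
toy["Rating_Increased"] = (toy["Quarterly_Rating_Change"] > 0).astype(int)
print(toy)
```

The result depends on row order within each driver, so the data should be sorted by reporting date before taking the difference.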


Creating a column named target. Drivers with no last working date are assumed to still be with the organization and are allotted a target value of 0. Drivers who left the organization are allotted a target value of 1.

In [None]:
df['target']=0
df.loc[df['LastWorkingDate'].notnull(), 'target'] = 1

Creating a column named Income_Increased that flags the drivers whose income has increased over time. It is derived from a temporary Income_Diff column.

In [None]:
import pandas as pd

# Sort the dataset by 'Driver_ID' and 'MMM-YY' to ensure the data is in the correct order
df.sort_values(by=['Driver_ID', 'MMM-YY'], inplace=True)

# Create a new column 'Income_Increased' with default value 0
df['Income_Increased'] = 0

# Calculate the difference in income for each driver
df['Income_Diff'] = df.groupby('Driver_ID')['Income'].diff()

# Set 'Income_Increased' value to 1 for drivers whose income has increased
df.loc[df['Income_Diff'] > 0, 'Income_Increased'] = 1

# Drop the 'Income_Diff' column if you no longer need it
df.drop(columns=['Income_Diff'], inplace=True)


Finally, creating a column for the total number of days each driver has been associated with the organization since their date of joining. The dataset covers the years 2019-20, so I have taken 12/12/2020 as the last day. For drivers with no last working date, the total working days are calculated with reference to this day.

In [None]:
df['Dateofjoining']=pd.to_datetime(df['Dateofjoining'])
df['LastWorkingDate']=pd.to_datetime(df['LastWorkingDate'])

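# Use '12/12/2020' as the default date for LastWorkingDate where it is missing
default_date = pd.to_datetime('12/12/2020')
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['LastWorkingDate'] = df['LastWorkingDate'].fillna(default_date)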

# Calculate the difference and store it in 'total_num_of_days' column
df['total_num_of_days'] = (df['LastWorkingDate'] - df['Dateofjoining']).dt.days
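
The tenure calculation can be checked by hand on two made-up drivers, one who left and one who is still active. The dates below are hypothetical; only the 12/12/2020 cutoff comes from the notebook.

```python
import pandas as pd

# Hypothetical drivers: the first left, the second has no last working date
toy = pd.DataFrame({
    "Dateofjoining": pd.to_datetime(["2019-01-01", "2020-06-01"]),
    "LastWorkingDate": pd.to_datetime(["2019-04-11", None]),
})

# Same cutoff as above, used only where the last working date is missing
cutoff = pd.Timestamp("2020-12-12")
end_date = toy["LastWorkingDate"].fillna(cutoff)

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days
toy["total_num_of_days"] = (end_date - toy["Dateofjoining"]).dt.days
print(toy)
```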


Checking whether all the columns show up properly.

In [None]:
df.head(8)

Creating an aggregated dataset by grouping the data by driver ID. This was done because there were multiple entries for each driver.

In [None]:
aggregated_data = df.groupby('Driver_ID').agg({
    'Age': 'last',
    'Gender': 'max',
    'City': 'last',
    'Education_Level': 'last',
    'Income': 'last',
    'Dateofjoining': 'first',
    'LastWorkingDate': 'last',
    'Joining Designation': 'first',
    'Grade': 'last',
    'Total Business Value': 'sum',
    'Quarterly Rating': 'last',
    'target':'last',
    'Rating_Increased':'max',
    'Income_Increased':'max',
    'total_num_of_days':'max'
    
}).reset_index()

# Displaying the first few rows of the aggregated data
aggregated_data.head(10)
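
To show how the different aggregation functions collapse several monthly records into one row per driver, here is a sketch on a made-up frame with two hypothetical drivers. It uses the same dictionary style of `agg` as the cell above.

```python
import pandas as pd

# Hypothetical monthly records for two drivers
toy = pd.DataFrame({
    "Driver_ID": [1, 1, 2, 2, 2],
    "Income": [100, 120, 80, 80, 80],
    "Total Business Value": [10, 20, 0, 5, 5],
    "target": [0, 1, 0, 0, 0],
})

per_driver = toy.groupby("Driver_ID").agg({
    "Income": "last",               # most recent income
    "Total Business Value": "sum",  # business value over all records
    "target": "last",               # churn flag from the latest record
}).reset_index()
print(per_driver)
```

Note that 'first' and 'last' depend on the row order inside each group, so the frame needs to be sorted chronologically beforehand.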

## Univariate Analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 'aggregated_data' holds one row per driver (built in the previous step)

# List of continuous variables
continuous_vars = ['Age', 'Income', 'Total Business Value','total_num_of_days']

# Create distribution plots for continuous variables
for var in continuous_vars:
    plt.figure(figsize=(8, 4))
    
    # Plot the distribution
    sns.histplot(aggregated_data[var], kde=True)
    
    # Calculate and add average line
    average = aggregated_data[var].mean()
    plt.axvline(average, color='red', linestyle='--', label='Average')
    
    # Calculate and add 25% and 75% lines
    percentile_25 = aggregated_data[var].quantile(0.25)
    percentile_75 = aggregated_data[var].quantile(0.75)
    plt.axvline(percentile_25, color='green', linestyle='--', label='25% Percentile')
    plt.axvline(percentile_75, color='blue', linestyle='--', label='75% Percentile')
    
    plt.title(f'Distribution of {var}')
    plt.xlabel(var)
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()


From the graphs we can see that:
- The average age is a little less than 35. Most of the drivers are in the 29 to 37 age bracket.
- The average income for the drivers is a little over 50,000. Most of them earn between 37K and 75K.
- The average time a driver is associated with the company is about 500 days. Most drivers stay between 200 and 1000 days.

Suggestions:
- The company can offer a tenure-based incentive for the drivers: the longer they are associated with the company, the more rewards or money they get. This could reduce driver churn.
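
The values read off the plots (mean and the 25% and 75% lines) can also be checked numerically with `describe()`. The sketch below uses made-up ages only to show how the summary is read; on the real data the same call would be applied to the continuous columns of `aggregated_data`.

```python
import pandas as pd

# Hypothetical ages, only to show how the summary is read
toy = pd.DataFrame({"Age": [25, 30, 35, 40, 45]})

# describe() returns count, mean, std, min, quartiles and max in one Series
summary = toy["Age"].describe()
print(summary[["mean", "25%", "75%"]])
```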

In [None]:
city_driver_count= aggregated_data['City'].value_counts()
plt.figure()
city_driver_count.plot(kind='bar')
plt.title('Number of Drivers in Each City')
plt.xlabel('City')
plt.ylabel('Number of Drivers')
plt.xticks(rotation=45)
plt.show()

The company can try to reach out to more people in the cities with fewer drivers and offer them more incentives to increase driver onboarding. This will help keep a balance when the churn is high.

## Bivariate Analysis 

In [None]:

# Scatter plot for Age vs. Income
plt.figure(figsize=(10, 6))

# Scatter plot
sns.scatterplot(data=aggregated_data, x='Age', y='Income')

# Calculate and add average lines
average_age = aggregated_data['Age'].mean()
average_income = aggregated_data['Income'].mean()

plt.axhline(average_income, color='red', linestyle='--', label='Average Income')
plt.axvline(average_age, color='blue', linestyle='--', label='Average Age')


plt.title('Scatter Plot: Age vs. Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.legend()
plt.show()


In [None]:
numeric_columns = aggregated_data.select_dtypes(include=['int64', 'float64'])

# Calculate the correlation matrix
corr_matrix = numeric_columns.corr()

# Create a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()


Notable Correlations:

- Income and Grade: There is a strong positive correlation of 0.74 between Income and Grade, suggesting that as one's grade increases, their income tends to increase as well.
- Joining Designation and Grade: This pair also has a strong positive correlation (0.71), indicating a likely trend that individuals with a higher initial designation tend to reach higher grades.
- Total Business Value and several factors: Total Business Value has moderate positive correlations with Income (0.38), Joining Designation (0.38), and Grade (0.38). This suggests that higher income, higher joining designation, and higher grade are associated with higher business value generated.
- Target and several factors: The variable 'target' (1 for drivers who left, 0 for drivers who stayed) has a moderate negative correlation with Total Business Value (-0.38) and Quarterly Rating (-0.51), indicating that drivers with higher business value and higher quarterly ratings are less likely to have left the company.

Weak or No Correlation: Several variables such as Driver_ID, Age, Gender, and Education Level show very little to no correlation with other variables, indicated by the colors close to white. This means that these factors do not have a strong linear relationship with the others in the dataset.
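
When the heatmap gets crowded, it can be easier to read only the column of correlations with the target, sorted from most negative to most positive. A minimal sketch on made-up values (the numbers are hypothetical and chosen so that the rating is clearly lower for drivers who left):

```python
import pandas as pd

# Hypothetical per-driver data; target = 1 means the driver left
toy = pd.DataFrame({
    "Quarterly Rating": [1, 1, 2, 3, 4, 4],
    "Age": [30, 41, 35, 29, 38, 33],
    "target": [1, 1, 1, 0, 0, 0],
})

# Keep only the correlations with target and sort them in ascending order
target_corr = toy.corr()["target"].drop("target").sort_values()
print(target_corr)
```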

In [None]:
(aggregated_data['target']).value_counts().plot(kind='pie', figsize=(4, 4), colors=['darkcyan','red'], autopct='%1.0f%%')

print('=' * 30)
print((aggregated_data['target']).value_counts())
print('=' * 30)

The pie plot shows there is a significant imbalance in the dataset. We will create the base models first, and then I will try to handle the imbalance and check whether the model performance improves after addressing it.

## Preparing the DataSet for ML-Model

Scaling the data

In [None]:
from sklearn.preprocessing import StandardScaler


# List of columns to standardize (mostly numerical columns)
columns_to_standardize = ['Driver_ID', 'Age', 'Education_Level', 'Income', 'Joining Designation', 'Grade',
                           'Total Business Value', 'Quarterly Rating', 'Rating_Increased',
                           'Income_Increased', 'total_num_of_days']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the selected columns and transform them
aggregated_data[columns_to_standardize] = scaler.fit_transform(aggregated_data[columns_to_standardize])

# Now, the specified columns in 'aggregated_data' are standardized
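
The scaler above is fitted on all drivers before the train/test split, which lets the test rows influence the mean and standard deviation. A common alternative, sketched here on randomly generated data (the arrays and the `stratify=y` choice are assumptions for illustration), is to split first and fit the scaler on the training part only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features and labels, generated only for this sketch
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Reuse the training statistics for the test rows
X_test_scaled = scaler.transform(X_test)
```

Tree-based models such as random forests and XGBoost are largely insensitive to feature scaling, so this matters more for distance-based or linear models.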


One hot encoding of the city column

In [None]:
aggregated_data= pd.get_dummies(aggregated_data, columns=['City'])


# Print the resulting DataFrame
print(aggregated_data)
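
For reference, this is what one-hot encoding does to a city column, shown on three made-up drivers. The city codes `C1` and `C2` are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical drivers with made-up city codes
toy = pd.DataFrame({"Driver_ID": [1, 2, 3], "City": ["C1", "C2", "C1"]})

# Each distinct city becomes its own indicator column named City_<code>
encoded = pd.get_dummies(toy, columns=["City"])
print(encoded)
```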

#### Checking the data before splitting

In [None]:
aggregated_data.info()

In [None]:
aggregated_data.head(10)

Splitting the data

In [None]:
# Separate the features (X) from the target (y); the date columns are dropped
X = aggregated_data.drop(columns=['target', 'Dateofjoining', 'LastWorkingDate'])
y = aggregated_data['target']

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


In [None]:
import xgboost as xgb

# Create an XGBoost Classifier with scale_pos_weight
xgb_classifier = xgb.XGBClassifier(scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum())

# Fit the classifier on the training data
xgb_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = xgb_classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


The overall accuracy of both models is pretty good at 88%. Let's see if we can improve the performance by implementing SMOTE.
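
On an imbalanced dataset, accuracy alone can look good even when one class is ignored, which is why the per-class recall in the classification report is worth reading too. The sketch below uses made-up labels with an assumed 80/20 split and a baseline that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 80 samples of class 1 and 20 of class 0
y = np.array([1] * 80 + [0] * 20)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)              # share of correct predictions
rec0 = recall_score(y, pred, pos_label=0)  # share of class 0 that was found
print(acc, rec0)
```

Here the baseline reaches 80% accuracy while finding none of the minority class, so a model should be judged against that kind of floor.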

### Imbalance treatment

In [None]:
from imblearn.over_sampling import SMOTE
X = aggregated_data.drop(columns=['target', 'Dateofjoining', 'LastWorkingDate'])
y = aggregated_data['target']

# Initialize the SMOTE object
smote = SMOTE(random_state=42)

# Apply SMOTE to the dataset to balance the classes
X_resampled, y_resampled = smote.fit_resample(X, y)

# Create a new DataFrame with the resampled data
resampled_data = pd.concat([X_resampled, y_resampled], axis=1)

# Check the class distribution after applying SMOTE
print("Class distribution after SMOTE:\n", resampled_data['target'].value_counts())

Creating the train test split from the SMOTE dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
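
One caveat with the approach above: the data is resampled before it is split, so synthetic rows built from real rows that end up in the test set can also appear in the training set, and the test scores can look better than they would on unseen drivers. A safer order is to split first and resample only the training part. The sketch below shows that order on generated data, using plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE; with imblearn the equivalent step would be calling `smote.fit_resample` on `X_train` and `y_train` only. All column names and class sizes here are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 150 rows of class 1 and 50 of class 0
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = pd.Series([1] * 150 + [0] * 50, name="target")

# Split first, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Oversample the minority class inside the training part only
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["target"] == 1]
minority = train[train["target"] == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["target"].value_counts())
print(y_test.value_counts())  # the test set keeps the original imbalance
```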

In [None]:
xgb_classifier= xgb.XGBClassifier(
    random_state=42  # Set the random seed for reproducibility
)

xgb_classifier.fit(X_train, y_train)
y_pred = xgb_classifier.predict(X_test)
print(classification_report(y_test, y_pred))

Random forest performed better once the imbalance treatment was applied. Now let's evaluate it with stratified k-fold cross-validation and see whether the accuracy holds up across the folds.

In [None]:
from sklearn.model_selection import StratifiedKFold
rf_classifier = RandomForestClassifier(random_state=42)

# Initialize StratifiedKFold with 10 folds
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Initialize lists to store evaluation results
classification_reports = []

# Perform Stratified K-Fold cross-validation
for train_index, test_index in stratified_kfold.split(X_resampled, y_resampled):
    X_train, X_test = X_resampled.iloc[train_index], X_resampled.iloc[test_index]
    y_train, y_test = y_resampled.iloc[train_index], y_resampled.iloc[test_index]

    # Fit the classifier on the training data
    rf_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)

    # Evaluate the model and store the classification report
    classification_reports.append(classification_report(y_test, y_pred))

# Print the classification reports for each fold
for i, report in enumerate(classification_reports, 1):
    print(f"Fold {i} Classification Report:")
    print(report)
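
Ten separate reports are hard to compare at a glance. As a possible shortcut, `cross_val_score` runs the same kind of stratified loop and returns one score per fold, which can then be summarized by its mean and spread. The sketch below uses a generated dataset and the F1 score; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical dataset generated only for this sketch
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One F1 score per fold; the model is refitted from scratch on each fold
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(f"F1 mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```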

## Model Evaluation

In [None]:
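from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

# X_test and y_test here are the held-out rows of the last cross-validation fold,
# and rf_classifier is the model fitted on that fold's training rows
fpr, tpr, _ = roc_curve(y_test, rf_classifier.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)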

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
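
The area under the ROC curve can be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. A small sketch with four made-up scores shows that `auc(fpr, tpr)` and `roc_auc_score` give the same number.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area computed from the curve points
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)

# Same area computed directly from labels and scores
area_direct = roc_auc_score(y_true, y_score)

# 3 of the 4 positive/negative pairs are ranked correctly
print(area_from_curve, area_direct)
```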


In [None]:
classification_report_str = classification_report(y_test, y_pred)
confusion_matrix_arr = confusion_matrix(y_test, y_pred)

# Print the Classification Report and Confusion Matrix
print("Classification Report:")
print(classification_report_str)

print("\nConfusion Matrix:")
print(confusion_matrix_arr)
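
To connect the confusion matrix with the precision and recall in the classification report, the four cells can be unpacked and the metrics recomputed by hand. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# For binary labels, rows are true classes and columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(tn, fp, fn, tp, precision, recall)
```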