<a href="https://colab.research.google.com/github/Harsh-Burande/flipkart_customer_satisfaction_prediction/blob/main/Flipkart_Customer_Support.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Classification - Flipkart Customer Service Satisfaction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member -**    Harsh Burande

# **Project Summary -**

**Project Summary: Flipkart Customer Support Analysis**

This project focused on analyzing Flipkart’s customer support data to evaluate operational performance, identify factors affecting customer satisfaction (CSAT), and provide actionable insights to enhance service quality. The primary business objective was to understand patterns in customer interactions, pinpoint areas causing dissatisfaction, and suggest interventions to improve overall CSAT scores.

We began with exploratory data analysis (EDA) to gain a clear understanding of the dataset, which included attributes such as channel name, category, sub-category, agent information, response times, and CSAT scores. Initial visualizations, including bar charts, line charts, treemaps, and scatterplots, helped uncover key patterns. For example, the distribution of response times revealed that most tickets were resolved within a reasonable timeframe, but certain channels experienced delays. Additionally, CSAT trends over time showed that satisfaction remained relatively stable between 4.16 and 4.36 for most of August, with a notable drop to 3.19 on 28th August, indicating a specific period of operational strain.

To evaluate the impact of various features on customer satisfaction, we created a Week column to analyze trends at a weekly level and compared performance by channel, agent shift, and ticket category. This step indicated that faster response times, and the call-based channels (Outcall and Inbound) that achieve them, correlated with higher CSAT scores, while the slower Email channel and certain high-volume categories were more likely to generate dissatisfied customers.

Next, we applied classification modeling to predict satisfaction levels and understand feature importance. Logistic Regression, Random Forest, and Decision Tree models were implemented, with resampling techniques such as SMOTE and Tomek Links used to handle class imbalance between high and low CSAT labels. Logistic Regression provided interpretable coefficients, identifying response time, week, and channel as top predictors. Random Forest offered improved predictive performance and clearer feature importance rankings, while the Decision Tree model, although slightly lower in accuracy, provided high interpretability and a visual structure to understand decision paths impacting satisfaction. Threshold tuning further helped optimize model outputs, ensuring that predictions for low and high CSAT were actionable.

Throughout the modeling process, key insights emerged:

Response time is the dominant factor influencing customer satisfaction. Faster resolutions consistently correlate with higher CSAT scores.

Channel-specific performance varies; some channels naturally achieve faster resolutions and higher satisfaction.

High-volume categories with low satisfaction represent critical areas where interventions can have the greatest impact.

Agent and shift-level variations indicate that training and process standardization can improve overall support quality.

Based on these findings, the project recommends that Flipkart focus on reducing response times across slower channels, targeting low-satisfaction categories for process improvement, monitoring agent performance, and leveraging predictive models to proactively address potential dissatisfaction. Collecting more structured feedback in the future will further refine predictions and enhance operational decision-making.

In conclusion, this project successfully combined EDA, predictive modeling, and visualization to provide a comprehensive view of Flipkart’s customer support operations. By implementing the recommended strategies, Flipkart can improve both customer experience and operational efficiency, directly supporting the business objective of increasing customer satisfaction. The analysis also lays the foundation for continuous monitoring and iterative improvements, enabling data-driven decision-making in support operations.

# **GitHub Link -**

https://github.com/Harsh-Burande/flipkart_customer_satisfaction_prediction


# **Problem Statement**


**Problem Statement.**
Flipkart’s customer support handles thousands of queries daily, but inconsistent response times, varying agent performance, and complex issues impact customer satisfaction (CSAT). This project analyzes historical support data to identify key factors affecting CSAT, evaluate channel and agent performance, and build predictive models to proactively improve customer experience and operational efficiency.

* To analyze Flipkart’s customer support data and leverage insights and predictive models to improve customer satisfaction (CSAT) and optimize operational efficiency.

#### **Define Your Business Objective?**

To improve Flipkart’s customer satisfaction (CSAT) and support efficiency by identifying key factors affecting satisfaction and using predictive models to proactively manage high-risk tickets.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Loading ML libraries
from sklearn.model_selection import train_test_split

from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
flipkart_df = pd.read_csv('/content/drive/MyDrive/Module 6 capstone projecct/Customer_support_data.csv')

### Dataset First View

In [None]:
# Dataset First Look
flipkart_df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
flipkart_df.shape

### Dataset Information

In [None]:
# Dataset Info
flipkart_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
flipkart_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
flipkart_df.isnull().sum()

In [None]:
# Created a clone of the real flipkart dataset
flipkart_df_cloned = flipkart_df.copy()

In [None]:
# Visualizing the missing values

### What did you know about your dataset?

Answer Here
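
The "Visualizing the missing values" cell above contains no code. A minimal sketch of one way to fill it is shown below, assuming a per-column summary of null percentages is what is wanted. The helper names `missing_value_summary` and `plot_missing_values` and the toy frame are illustrative and not part of the notebook; in the notebook the summary would be built from `flipkart_df`.

```python
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, highest first."""
    summary = pd.DataFrame({
        "column": df.columns,
        "missing_count": df.isnull().sum().values,
        "missing_pct": (df.isnull().mean() * 100).round(2).values,
    })
    return summary.sort_values("missing_pct", ascending=False).reset_index(drop=True)


def plot_missing_values(summary: pd.DataFrame):
    """Bar chart of the summary; Plotly is imported lazily, as in the notebook."""
    import plotly.express as px
    return px.bar(summary, x="column", y="missing_pct",
                  title="Missing values by column (%)")


# Toy frame standing in for flipkart_df (values are made up)
toy_df = pd.DataFrame({
    "channel_name": ["Inbound", "Email", None, "Outcall"],
    "Item_price": [None, None, None, 499.0],
    "CSAT Score": [5, 1, 4, 5],
})
summary = missing_value_summary(toy_df)
print(summary)
```

Calling `plot_missing_values(summary).show()` would render the chart; columns near 100% are the natural candidates for dropping.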

## ***2. Understanding Your Variables***


1. **Unique id -**                Unique identifier for each record                                             
2. **Channel name -**             Customer service channel used (e.g., inbound, outcall, email)                       
3. **Category -**                 Broad category of the customer’s issue                                        
4. **Sub-category -**             Specific sub-category of the issue                                            
5. **Customer Remarks -**         Feedback or comments provided by the customer                                 
6. **Order id -**                 Identifier of the order linked to the support interaction
7. **Order date time -**          Date and time when the order was placed                                       
8. **Issue reported at -**        Timestamp when the issue was reported                                         
9. **Issue responded -**          Timestamp when the issue was responded to                                      
10. **Survey response date -**     Date of the customer satisfaction survey                                      
11. **Customer city -**            City of the customer                                                          
12. **Product category -**         Category of the product involved in the issue                                 
13. **Item price -**               Price of the product                                                          
14. **Connected handling time -**  Time taken (in minutes/seconds) by the agent to handle the interaction         
15. **Agent name -**              Name of the customer service agent handling the issue                          
16. **Supervisor -**               Name of the supervisor monitoring the agent                                    
17. **Manager -**                  Name of the manager responsible                                                
18. **Tenure Bucket -**            Agent’s experience level (e.g., 0–6 months, 6–12 months, etc.)                 
19. **Agent Shift -**              Shift timing of the agent (e.g., Morning, Evening, Night)                      
20. **CSAT Score -**               Customer Satisfaction score (target variable, numerical or categorical rating)


### Variables Description

In [None]:
flipkart_df.describe()

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
flipkart_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Drop the columns that are mostly null.
flipkart_df.drop(columns=['Order_id', 'order_date_time', 'Customer_City', 'Product_category', 'Item_price', 'connected_handling_time', 'Supervisor'], inplace=True)

In [None]:
flipkart_df.isnull().sum()

In [None]:
flipkart_df.head()

In [None]:
# Change type of date columns from object to datetime
flipkart_df['Issue_reported at'] = pd.to_datetime(flipkart_df['Issue_reported at'], format="%d/%m/%Y %H:%M")
flipkart_df['issue_responded'] = pd.to_datetime(flipkart_df['issue_responded'], format="%d/%m/%Y %H:%M")
flipkart_df['Survey_response_Date'] = pd.to_datetime(flipkart_df['Survey_response_Date'])

In [None]:
flipkart_df.info()

In [None]:
# Add a column holding the gap between issue reported and issue responded
flipkart_df['response_time'] = flipkart_df['issue_responded'] - flipkart_df['Issue_reported at']
flipkart_df

In [None]:
# Inspect rows with a negative response time (they are dropped in the next cell).
flipkart_df.loc[flipkart_df['response_time'] < pd.Timedelta(0)]

In [None]:
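# Keep only rows with a non-negative response time; .copy() avoids chained-assignment issues later
flipkart_df = flipkart_df[flipkart_df['response_time'] >= pd.Timedelta(0)].copy()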

In [None]:
# Converted response time into hours.
flipkart_df['response_time'] = flipkart_df['response_time'].dt.total_seconds() / 3600

In [None]:
# Map each channel_name to an integer code.
channel_map = {'Outcall':1, 'Inbound':2, 'Email':3}
flipkart_df['channel_name_code'] = flipkart_df['channel_name'].map(channel_map)

In [None]:
flipkart_df.shape

In [None]:
flipkart_df.info()

In [None]:
flipkart_df

In [None]:
flipkart_df.loc[flipkart_df['response_time'] == flipkart_df['response_time'].max()]

In [None]:
# Week start date (Monday) of each survey response
flipkart_df['Week'] = flipkart_df['Survey_response_Date'].dt.to_period('W').dt.start_time

In [None]:
flipkart_df.describe(include = 'all')

### What all manipulations have you done and insights you found?

**Manipulations**
1. Dropped the columns that were mostly null ('Order_id', 'order_date_time', 'Customer_City', 'Product_category', 'Item_price', 'connected_handling_time', 'Supervisor').
2. Converted the date columns from object to datetime.
3. Added a `response_time` column holding the gap between `Issue_reported at` and `issue_responded`.
4. Checked for negative `response_time` values and dropped those rows.
5. Converted `response_time` to hours.
6. Created `channel_name_code`, an integer code for each `channel_name`.
7. Added a `Week` column holding the start of the week of `Survey_response_Date`.
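
The response-time steps listed above can be sketched end to end on a toy frame. The column names match the notebook; the timestamps are made up, and the survey date is written in ISO form purely for simplicity, which is an assumption about nothing more than this example.

```python
import pandas as pd

# Toy rows standing in for flipkart_df (timestamps are illustrative)
toy = pd.DataFrame({
    "Issue_reported at": ["01/08/2023 11:13", "01/08/2023 12:52", "01/08/2023 20:16"],
    "issue_responded": ["01/08/2023 11:47", "01/08/2023 12:54", "01/08/2023 20:10"],
    "Survey_response_Date": ["2023-08-01", "2023-08-01", "2023-08-09"],
})

# Parse the timestamps with the same day-first format used in the notebook
fmt = "%d/%m/%Y %H:%M"
toy["Issue_reported at"] = pd.to_datetime(toy["Issue_reported at"], format=fmt)
toy["issue_responded"] = pd.to_datetime(toy["issue_responded"], format=fmt)
toy["Survey_response_Date"] = pd.to_datetime(toy["Survey_response_Date"])

# Gap between report and response, as a Timedelta
toy["response_time"] = toy["issue_responded"] - toy["Issue_reported at"]

# Drop rows where the response is recorded before the report
clean = toy[toy["response_time"] >= pd.Timedelta(0)].copy()

# Convert the Timedelta to hours
clean["response_time"] = clean["response_time"].dt.total_seconds() / 3600

# Monday that starts the week of each survey response
clean["Week"] = clean["Survey_response_Date"].dt.to_period("W").dt.start_time

print(clean[["response_time", "Week"]])
```

The third toy row has a response six minutes before the report, so it is removed by the non-negative filter, mirroring the row-dropping step in the notebook.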

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Response Time Distribution

In [None]:
# Chart - 1 visualization code
fig = px.histogram(flipkart_df, x='response_time', nbins=50)
fig.update_layout(title_text='Response time distribution', xaxis_title_text='Response time in hours', yaxis_title_text='Issues responded')
fig.show()

In [None]:
flipkart_df.columns

In [None]:
200 / 60

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

More than 64,600 issues received a response within the first hour of being reported, which indicates prompt handling on the agents' side.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2 Average Response Time Taken By Each Channel

In [None]:
# Chart - 2 visualization code

response_time = flipkart_df.groupby('channel_name')['response_time'].mean().reset_index()
# response_time
fig = px.bar(response_time, x='channel_name', y='response_time')
fig.update_layout(title_text='Response time by channel', xaxis_title_text='Channel', yaxis_title_text='Average response time in hour')
fig.show()

##### 1. Why did you pick the specific chart?

We chose a bar chart because it clearly shows the differences in average response time across channels.

##### 2. What is/are the insight(s) found from the chart?

**1. Outcall channel:**
When we look at customer support across different channels, a clear pattern emerges. Outbound calls (Outcall) are the fastest channel, with an average response time of just `2.87 hours`. When agents proactively reach out, customers get a response quickly.

**2. Inbound channel:**
Inbound calls follow closely behind, averaging `2.91 hours`. The gap between Outcall and Inbound is minimal, which tells us that both types of call-based interactions are handled with similar speed.

**3. Email channel:**

Emails take `3.61 hours` on average to receive a response, the highest among all channels. This delay highlights an area where customers might feel less prioritized compared to call-based support.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3 Response Time By Agent Shifts

In [None]:
# Chart - 3 visualization code
shift_by_response_time = flipkart_df.groupby('Agent Shift')['response_time'].agg(avg_response_time="mean", min_response_time="min", max_response_time="max").reset_index().sort_values('avg_response_time', ascending=False)
shift_by_response_time
fig = px.bar(shift_by_response_time, x='Agent Shift', y='avg_response_time',
             hover_data={
        "min_response_time": True,
        "max_response_time": True
    })
fig.update_layout(title_text='Average response time by agent shift', xaxis_title_text='Agent Shift', yaxis_title_text='Average Response Time In Hour')

fig.show()

##### 1. Why did you pick the specific chart?

We chose a bar chart because it clearly shows the differences in average response time across agent shifts.

##### 2. What is/are the insight(s) found from the chart?

1. **Afternoon shift** records the highest average response time of `3.41 hours`, followed closely by the **Night shift** at `3.35 hours`.

2. The **Morning shift** is the most efficient with an average response time of `2.85 hours`, while the **Evening shift** follows at `2.91 hours`.

3. The **Split shift** averages `3.14 hours`, which places it between the Evening and Night shifts.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4 CSAT Score distribution overall

In [None]:
# Chart - 4 visualization code
csat_score_count = flipkart_df['CSAT Score'].value_counts().reset_index()
csat_score_count.columns = ["CSAT Score", "Count"]
fig = px.pie(csat_score_count, names='CSAT Score', values="Count")
fig.update_layout(title_text='CSAT Score distribution',
                  width=800)
fig.show()

##### 1. Why did you pick the specific chart?

We picked a pie chart because it shows the proportion of each CSAT score well.

##### 2. What is/are the insight(s) found from the chart?

1. The **5-star** rating dominates the chart, contributing `69.3%` with a remarkable `57,352 reviews`. This clearly shows that the majority of customers are highly satisfied.

2. Surprisingly, the **1-star** rating stands as the second-largest contributor, accounting for `13.2%` with `10,934 reviews`. This indicates a significant portion of dissatisfied customers that cannot be ignored.

3. The **4-star** reviews follow closely, forming `13%` with `10,774 reviews`, showing that many customers were satisfied but still found room for improvement.

4. Meanwhile, **3-star** ratings `(2,478 reviews)` and **2-star** ratings `(1,241 reviews)` together form a smaller share, highlighting a minority of neutral-to-dissatisfied experiences.
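
The percentages quoted above follow directly from the review counts. A quick arithmetic check, using only the counts stated in this section (the variable names are illustrative):

```python
# Review counts per CSAT score, as quoted in the insights above
counts = {5: 57352, 1: 10934, 4: 10774, 3: 2478, 2: 1241}

total = sum(counts.values())

# Share of each score, in percent, rounded to one decimal place
shares = {score: round(100 * n / total, 1) for score, n in counts.items()}

# Combined share of the 2-star and 3-star ratings
low_to_neutral_share = round(100 * (counts[2] + counts[3]) / total, 1)

# Share of ratings that are 4 or 5 stars
satisfied_share = round(100 * (counts[4] + counts[5]) / total, 1)

print(total, shares, low_to_neutral_share, satisfied_share)
```

The 2-star and 3-star ratings together come to about 4.5% of all reviews. Ratings of 4 or 5 make up about 82.3%, so if the same rows feed the modeling step, a binary target defined as `CSAT Score >= 4` would be imbalanced at roughly 82% to 18%.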

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

While the product/service enjoys an overwhelmingly positive response, the presence of a sizable share of 1-star reviews signals a need to investigate recurring issues to further improve customer satisfaction.

#### Chart - 5 CSAT Score by Category and Sub-category

In [None]:
# Chart - 5 visualization code
CSAT_treemap = flipkart_df.groupby(['category', 'Sub-category']).agg(
                                    Avg_csat = ('CSAT Score','mean'),
                                    Count = ('CSAT Score','count')
                                    ).reset_index()
CSAT_treemap
fig = px.treemap(CSAT_treemap, path=['category', 'Sub-category'], values='Count',
                 color='Avg_csat',
                 color_continuous_scale='RdBu',
                 hover_data={'Avg_csat': ':.2f'})
fig.show()

##### 1. Why did you pick the specific chart?

We chose the treemap because it represents hierarchical data in an intuitive and visually appealing way. In our case, the ‘Category’ acts as the parent, while the ‘Sub-category’ serves as the child node. This allows us to easily explore patterns within categories. The treemap not only displays the relative contribution of each sub-category but also encodes the average CSAT Score through color, making it effective for analyzing both distribution and performance at multiple levels simultaneously.

##### 2. What is/are the insight(s) found from the chart?

**Returns**

1. **Total CSAT count:** 42,649 (highest among all categories)

2. **Top sub-category by count:** Reverse Pickup Enquiry → 12,676 cases with avg CSAT 4.19.

3. **Highest avg CSAT:** Return Request & Missing → 4.61.

4. **Lowest avg CSAT:** Service Center – Service Denial → 3.22

**Order Related**

1. **Total CSAT count:** 22,274 (2nd largest share)

2. **Top sub-category by count:** Delay → 7,150 cases with avg CSAT 4.01

3. **Next biggest sub-category:** Order Status Enquiry → 6,546 cases with avg CSAT 4.20

4. **Highest avg CSAT:** Customer Requested Modification → 4.53

5. **Lowest avg CSAT:** Seller Cancelled Order → 3.57


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6 CSAT Score Over Time

In [None]:
# Chart - 6 visualization code
timely_csat_score = flipkart_df.groupby('Survey_response_Date').agg(Avg_csat=('CSAT Score','mean'),
                                                                    Q1_csat=('CSAT Score', lambda x: x.quantile(0.25)),  # 25th percentile
                                                                    Q3_csat=('CSAT Score', lambda x: x.quantile(0.75)),  # 75th percentile
                                                                    count=('CSAT Score','count')).reset_index()
timely_csat_score

fig = px.line(timely_csat_score, x='Survey_response_Date', y='Avg_csat',
              title='CSAT Score Over Time', hover_data=['Q1_csat', 'Q3_csat', 'count'])
fig.update_xaxes(tickangle=45, dtick="D1")
fig.update_traces(line=dict(width=3))

fig.show()

##### 1. Why did you pick the specific chart?

##### 2. What is/are the insight(s) found from the chart?

**Overall trend:** Throughout August, customer satisfaction remained consistently high, with average CSAT scores `between 4.16 and 4.36`.

**Anomaly detected:** On 28th August, the average CSAT dropped sharply to `3.19`, indicating a significant dip in customer satisfaction on that day.

**Next steps:** Investigate the root cause of this drop, which could be a system outage, delayed deliveries, high query volume, or specific sub-category issues. This day could be a key pain point for the Flipkart support team.
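
Rather than spotting such a dip by eye, low days can be flagged programmatically. The sketch below uses a median and median-absolute-deviation rule, which is one possible choice and not the method used in this notebook. The helper `flag_low_days`, the cut-off `k = 3`, and the daily values are illustrative; only the `3.19` figure and the 4.16 to 4.36 range come from the chart.

```python
import pandas as pd


def flag_low_days(daily_avg: pd.Series, k: float = 3.0) -> pd.Series:
    """Return the days whose average CSAT is more than k MADs below the median."""
    median = daily_avg.median()
    mad = (daily_avg - median).abs().median()
    return daily_avg[daily_avg < median - k * mad]


# Illustrative daily averages; in the notebook this would be
# timely_csat_score.set_index('Survey_response_Date')['Avg_csat']
daily_avg = pd.Series(
    [4.20, 4.30, 4.16, 4.36, 3.19, 4.25],
    index=pd.to_datetime(["2023-08-24", "2023-08-25", "2023-08-26",
                          "2023-08-27", "2023-08-28", "2023-08-29"]),
)

flagged = flag_low_days(daily_avg)
print(flagged)
```

A robust rule like this is less affected by the outlier itself than a mean and standard deviation rule would be, which matters when one day is far from the rest.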

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7 Top & Bottom Performers (Sub-categories)

In [None]:
# Chart - 7 visualization code

top_performing_sub_categories = flipkart_df.groupby('Sub-category').agg(avg_csat_score=('CSAT Score',"mean"),
                                                                        total_reviews=('CSAT Score','count'),
                                                                        min_csat_score=('CSAT Score',"min"),
                                                                        max_csat_score=('CSAT Score',"max")
                                                                        ).reset_index().sort_values("total_reviews", ascending=False)
top_performing_sub_categories
fig = px.bar(top_performing_sub_categories, x='Sub-category', y='avg_csat_score', hover_data=["total_reviews", "min_csat_score", "max_csat_score"], color="avg_csat_score", color_continuous_scale="RdBu")
fig.update_layout(title_text='Top Performing Sub-categories', xaxis_title_text="Sub-category", yaxis_title_text="Average CSAT score")

fig.show()

##### 1. Why did you pick the specific chart?

We selected a bar chart because it makes the differences in average CSAT score across sub-categories easy to see.

##### 2. What is/are the insight(s) found from the chart?

1. The chart is sorted in descending order of the total number of reviews in each sub-category.
2. **Reverse Pickup Enquiry** has 21,676 reviews and a moderate average CSAT score of `4.18`, followed by the **Return request** sub-category with 8,207 reviews and an average CSAT score of `4.16`.
3. The worst performer is the **Unable to login** sub-category: although it has only 6 reviews, it has the lowest average CSAT score of `2`.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8 CSAT Score Distribution and Distribution by Category

In [None]:
# Chart - 8 visualization code

fig = px.histogram(flipkart_df, x='CSAT Score', nbins=6, title='CSAT Score Distribution')
fig.show()

fig = px.box(flipkart_df, x='category', y='CSAT Score', title='CSAT Distribution by Category')
fig.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

**1. CSAT Score Distribution:**
* The CSAT score of 5 dominates the histogram with `57,352` reviews.
* The CSAT score of 1 comes next with `10,934` reviews.
* The CSAT score of 2 has the lowest count, with just `1,241` reviews.

**2. CSAT Distribution by Category:**
* Almost every category has a median CSAT score of `5`.
* Only the **Others** category lags, with a median CSAT score of `4`.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9 CSAT vs Query Volume by Category

In [None]:
# Chart - 9 visualization code
query_volume = flipkart_df.groupby('category')['CSAT Score'].agg(['mean','count']).reset_index()
fig_scatter = px.scatter(query_volume, x='count', y='mean',
                         labels={'count':'Query Volume', 'mean':'Average CSAT'},
                         title='CSAT vs Query Volume', hover_data=['category'])
fig_scatter.show()


##### 1. Why did you pick the specific chart?

A scatter plot is a good choice for showing the relationship between two continuous variables, here query volume and average CSAT.

##### 2. What is/are the insight(s) found from the chart?

1. The **Returns** category has the highest query volume, `42,649`, along with a strong average CSAT score of `4.34`, followed by **Order Related** with a query volume of `22,274` and an average CSAT score of `4.08`.
2. The worst performer is **Others**, with a query volume of only `96` and an average CSAT score of `3.38`, which is unexpectedly low for such a small volume of queries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

In [None]:
flipkart_df.isnull().sum()

# Training the Model
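
The threshold sweep later in this section evaluates each cut-off on the test set. One possible refinement, sketched here on synthetic data rather than the Flipkart dataset, is to pick the threshold on a separate validation split so that the test score stays an independent check. The synthetic data, the `balanced_accuracy_score` criterion, and all variable names below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three notebook features; about 80% of labels are class 1
X, y = make_classification(n_samples=2000, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.2, 0.8], random_state=42)

# Hold out the test set first, then carve a validation set out of the remainder
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

model = RandomForestClassifier(n_estimators=100, max_depth=15,
                               class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Score each candidate threshold on the validation split only
thresholds = [0.3, 0.35, 0.4, 0.45, 0.5]
val_probs = model.predict_proba(X_val)[:, 1]
val_scores = {
    th: balanced_accuracy_score(y_val, (val_probs >= th).astype(int))
    for th in thresholds
}
best_threshold = max(val_scores, key=val_scores.get)

# Report the chosen threshold once on the untouched test split
test_probs = model.predict_proba(X_test)[:, 1]
test_score = balanced_accuracy_score(y_test, (test_probs >= best_threshold).astype(int))
print("Best threshold:", best_threshold, "Test balanced accuracy:", round(test_score, 3))
```

Balanced accuracy is used here because plain accuracy can look good on imbalanced labels even when the minority class is mostly missed; another criterion such as recall on the low-CSAT class could be substituted.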

In [None]:
# Create the binary target: 1 (satisfied) when CSAT Score >= 4, else 0
flipkart_df["satisfaction_level"] = (flipkart_df["CSAT Score"] >= 4).astype(int)

In [None]:
flipkart_df["satisfaction_level"].value_counts()

In [None]:
flipkart_df.columns

In [None]:
# Feature columns and target used for modeling
features = ["response_time", "channel_name_code", "Week"]
target = "satisfaction_level"

In [None]:
# Split the data into features (X) and target (y)

X = flipkart_df[features].copy()
y = flipkart_df[target]

In [None]:
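# Use the ISO week number instead of raw nanosecond timestamps so the feature stays on a small scale
X["Week"] = X["Week"].dt.isocalendar().week.astype(int)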

In [None]:
# train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# Resample the training data: SMOTE oversampling followed by Tomek Links cleaning
smote = SMOTE(sampling_strategy = 'minority', random_state=42)
tomek = TomekLinks(sampling_strategy='majority')

resamplePipeline = Pipeline(steps=[('smote', smote), ('tomek', tomek)])


X_res, y_res = resamplePipeline.fit_resample(X_train, y_train)

# Check class balance before and after resampling
print("Before resampling:", y_train.value_counts())
print("After SMOTE + Tomek Links:", y_res.value_counts())

In [None]:
# Logistic Regression

model=LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_res, y_res)
# model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_mat_lg = confusion_matrix(y_test, y_pred)
Classification_report = classification_report(y_test, y_pred)

print("Accuracy = ",accuracy)
print("\nConfusion_matrix = ",conf_mat_lg)
print("\nClassification report = ", Classification_report)

In [None]:
# # Random Forest

# rf_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)

# param_dist = {
#     'n_estimators': [100, 200, 300, 500, 800, 1000],
#     'max_depth': [None, 5, 10, 20, 30, 50],
#     'min_samples_split': [2, 5, 10, 15],
#     'min_samples_leaf': [1, 2, 4, 6],
#     'max_features': ['sqrt', 'log2', None],
#     'class_weight': [None, 'balanced', 'balanced_subsample']
# }

# # Randomized Search
# random_search = RandomizedSearchCV(
#     estimator=rf_model,
#     param_distributions=param_dist,
#     n_iter=30,              # number of random combinations to try
#     cv=3,                   # 3-fold cross validation
#     verbose=2,
#     random_state=42,
#     n_jobs=-1
# )

# random_search.fit(X_train, y_train)

# # Check balance
# print("Best Parameters:", random_search.best_params_)
# print("Best score:", random_search.best_score_)


# # y_pred = rf_model.predict(X_test)

# # accuracy = accuracy_score(y_test, y_pred)
# # confu_mat_rf = confusion_matrix(y_test, y_pred)
# # classf_rep_rf = classification_report(y_test, y_pred)

# # print("Accuracy = ",accuracy)
# # print("\nConfusion_matrix = ",confu_mat_rf)
# # print("\nClassification report = ",classf_rep_rf)

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)

print("Before SMOTE:", y_train.value_counts())
print("After SMOTE:", y_train_res.value_counts())

rf_model = RandomForestClassifier(
              n_estimators=100,
              max_depth=15,
              random_state=42,
              n_jobs=-1,
              class_weight="balanced")

rf_model.fit(X_train_res, y_train_res)

y_pred = rf_model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

In [None]:
# --- Threshold Tuning Cell ---

# Predict probabilities for High CSAT
y_probs = rf_model.predict_proba(X_test_scaled)[:,1]

# Try different thresholds
thresholds = [0.3, 0.35, 0.4, 0.45, 0.5]

for th in thresholds:
    y_pred_new = (y_probs >= th).astype(int)
    acc = accuracy_score(y_test, y_pred_new)
    confu = confusion_matrix(y_test, y_pred_new)
    print(f"\nThreshold = {th}")
    print("Accuracy =", acc)
    print("Confusion Matrix:\n", confu)


In [None]:
# Displaying Feature importances
importances = rf_model.feature_importances_
feature_names = X_train.columns  # original column names (X_train_scaled is a NumPy array without names)

feat_imp_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance; only three features exist, so head(20) keeps them all
feat_imp_df = feat_imp_df.sort_values(by='Importance', ascending=False).head(20)

# Plotly horizontal bar chart
fig = px.bar(feat_imp_df,
             x='Importance',
             y='Feature',
             orientation='h',
             title='Top 3 Feature Importances - Random Forest',
             text='Importance')

fig.update_traces(texttemplate='%{text:.4f}', textposition='outside')
fig.update_layout(yaxis={'categoryorder':'total ascending'}, height=600)
fig.show()

## **5. Solution to Business Objective**

In [None]:
# Decision Tree

dt_model = DecisionTreeClassifier(random_state=42,
                                  max_depth=5,
                                  class_weight="balanced")

dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_mat_dt = confusion_matrix(y_test, y_pred)
class_rept_dt = classification_report(y_test, y_pred)

print("Accuracy =", accuracy)
print("\nConfusion Matrix:\n", conf_mat_dt)
print("\nClassification Report:\n", class_rept_dt)

In [None]:
# displaying feature importance
feat_importances = dt_model.feature_importances_
features = X_train.columns

# Create DataFrame
feat_df = pd.DataFrame({
    'Feature': features,
    'Importance': feat_importances
})

# Take top 3 features
top3_feat = feat_df.sort_values(by='Importance', ascending=False).head(3)

# --- Plotly horizontal bar chart ---
fig = px.bar(top3_feat,
             x='Importance',
             y='Feature',
             orientation='h',
             title='Top 3 Feature Importances - Decision Tree',
             text='Importance')

fig.update_traces(texttemplate='%{text:.4f}', textposition='outside')
fig.update_layout(yaxis={'categoryorder':'total ascending'}, height=400)
fig.show()

#### What do you suggest the client do to achieve the Business Objective?

**1. Prioritize Fast Response Times:** Our analysis shows that response time is the strongest factor affecting CSAT. Reducing delays, especially in high-volume channels, will directly improve customer satisfaction.

**2. Optimize Channel Management:** Channels like chat tend to have faster resolutions and higher CSAT, while slower channels like email show lower satisfaction. Consider reallocating resources or automating certain processes in slower channels.

**3. Focus on High-Impact Categories:** Identify categories or sub-categories with low CSAT and high ticket volume. Targeted training, SOP improvements, or process automation can reduce pain points in these areas.

**4. Monitor Agent Performance:** Some shifts or agents consistently perform better. Recognize top performers and provide coaching to underperforming shifts to maintain uniform quality.

**5. Leverage Predictive Models:** Use the trained classification models to predict potential low CSAT tickets in real-time. This allows proactive intervention, improving overall satisfaction before issues escalate.

**6. Continuous Feedback Analysis:** Even though customer remarks had many missing values, collecting structured feedback regularly can refine predictive models and provide actionable insights.
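
As a rough illustration of recommendation 5, the sketch below ranks incoming tickets by their predicted probability of Low CSAT so that the riskiest ones can be reviewed first. It is a sketch under stated assumptions, not the project's deployment code: the helper `flag_low_csat`, the cutoff `min_prob_low`, and the synthetic data are all hypothetical, and label 0 is assumed to mean Low CSAT.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def flag_low_csat(model, X_new, min_prob_low=0.5):
    """Return (row_index, P(Low CSAT)) pairs at or above min_prob_low,
    sorted from most to least at-risk. Assumes label 0 means Low CSAT."""
    low_col = list(model.classes_).index(0)        # column holding P(label 0)
    prob_low = model.predict_proba(X_new)[:, low_col]
    flagged = np.flatnonzero(prob_low >= min_prob_low)
    order = flagged[np.argsort(-prob_low[flagged])]  # highest risk first
    return [(int(i), float(prob_low[i])) for i in order]


# Synthetic stand-in for the ticket features (not the Flipkart data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.2, 0.8], random_state=42
)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_demo[:300], y_demo[:300])

# Treat the last 100 rows as newly arrived tickets.
at_risk = flag_low_csat(demo_model, X_demo[300:], min_prob_low=0.6)
print(f"{len(at_risk)} of 100 tickets flagged for follow-up")
```

In practice, the new tickets would need the same preprocessing (encoding and scaling) as the training data before being scored, and the cutoff would be chosen from the threshold analysis rather than fixed in advance.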

# **Conclusion**

In this project, we analyzed Flipkart’s customer support data to understand response patterns, channel performance, agent effectiveness, and factors affecting customer satisfaction (CSAT). Through visualizations like bar charts, line charts, treemaps, and scatterplots, we observed trends over time, identified high-volume issues, and pinpointed key pain points affecting CSAT. Classification models, including Logistic Regression, Random Forest, and Decision Tree, helped predict satisfaction levels and highlighted the most influential features such as response time, week, and channel. Resampling (SMOTE) was applied to the training data to handle class imbalance, so that the models were less biased toward the majority class. Overall, the project provided actionable insights into operational efficiency, revealing that faster responses and certain channels positively impact CSAT, and it laid the groundwork for targeted improvements in customer support processes.

## **Overall Model Review**

Logistic Regression provided interpretable coefficients, highlighting key predictors like response time and channel, but had moderate accuracy due to class imbalance. Random Forest improved predictive performance and handled non-linear relationships well, offering clear feature importance rankings. Decision Tree offered high interpretability and visual decision paths, but lower accuracy than Random Forest, making it more suitable for insights than for prediction.
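
One way to put the three models on an equal footing is to score them on the same stratified folds with a metric that is less sensitive to class imbalance than accuracy. The sketch below is only an illustration on synthetic data: the Logistic Regression settings are assumptions, the tree and forest settings mirror the cells above, and the resulting scores say nothing about the Flipkart results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the ticket data (not project data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=42
)

candidates = {
    # Logistic Regression hyperparameters are assumed for this sketch.
    "Logistic Regression": LogisticRegression(
        max_iter=1000, class_weight="balanced"
    ),
    "Decision Tree": DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=15, class_weight="balanced",
        random_state=42
    ),
}

# Same folds for every model so the scores are directly comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    cv_scores[name] = scores.mean()
    print(f"{name}: macro-F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Macro-F1 averages the per-class F1 scores, so a model that ignores Low-CSAT tickets is penalised even if its overall accuracy looks high.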
