<a href="https://colab.research.google.com/github/Kapoortigers007/Pankaj_End-2-End_Unsupervised_ML_Project_II/blob/main/Pankaj_E2E_Machine_Learnings_Flipkart_Customer_Service_Satisfaction_Class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Classification - Flipkart Customer Service Satisfaction



##### **Name -** Pankaj

# **Project Summary -**

#### Project Summary: Classification - Flipkart Customer Service Satisfaction
In the highly competitive world of e-commerce, customer satisfaction is a critical business driver. Flipkart, being one of India’s largest online retail platforms, handles thousands of customer support interactions daily through various channels like inbound calls, outcalls, and email. The quality of these interactions directly affects customer retention, loyalty, and public reputation.

This project focuses on building a machine learning model to predict customer satisfaction (CSAT) based on past service interactions. The objective is to identify key drivers of satisfaction, uncover patterns across different customer support teams and service categories, and ultimately enable Flipkart to improve the quality and efficiency of its customer service operations.

#### Business Problem
The dataset provided contains over 85,000 customer support records with 20 attributes, including:
- Communication channel (channel_name)
- Interaction category (category)
- Product information
- Agent details
- Timestamps (issue_reported_at, issue_responded)
- CSAT score (1 to 5)

The goal is to classify whether a customer was satisfied (CSAT ≥ 4) or not satisfied (CSAT < 4) using these attributes.
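
A minimal sketch of how the binary target described above could be derived. It assumes the score column is named `csat_score` (its name after the snake_case conversion later in the notebook); the column name `is_satisfied` and the toy values are hypothetical.

```python
import pandas as pd

# Toy scores standing in for the real CSAT column (illustrative values only)
toy = pd.DataFrame({"csat_score": [5, 4, 3, 1, 5, 2]})

# 1 = satisfied (CSAT >= 4), 0 = not satisfied (CSAT < 4)
toy["is_satisfied"] = (toy["csat_score"] >= 4).astype(int)

# Share of each class in the derived target
class_share = toy["is_satisfied"].value_counts(normalize=True)
print(class_share)
```

Checking the class share early is useful because a heavily skewed target affects which evaluation metrics are meaningful.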

# **Problem Statement**


**Classification Problem —** Predicting Flipkart Customer Service Satisfaction\
**Goal:** Understand satisfaction drivers, assess support team performance, and optimize service quality.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, answer the following:
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact?
*   Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, answer the following:

*   Explain the ML Model used and its performance using Evaluation metric Score Chart.

*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# for datetime manipulation
import datetime

# for text handling
import re

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("Customer_support_data.csv")
print('Dataset Successfully Loaded!!!')


### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
#print(len(df))
#print(len(df.columns))
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isna(), cbar=True, cmap='viridis')

### What did you know about your dataset?

- There are 85,907 unique records with 20 attributes: 17 object columns, 2 float columns, and 1 integer column.
- The dataset has many columns with 'NaN' values. Needs thorough inspection.
- There are no duplicate values in the dataset.
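
To make the "needs thorough inspection" step concrete, here is a small sketch that ranks columns by their share of missing values. `missing_summary` is a hypothetical helper, and the toy frame only borrows a few column names from the dataset; the values are made up.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_pct", ascending=False)

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Customer_City": [np.nan, "HYDERABAD", np.nan, np.nan],
    "Item_price": [499.0, np.nan, 1299.0, 250.0],
    "channel_name": ["Inbound", "Email", "Outcall", "Inbound"],
})
report = missing_summary(toy)
print(report)
```

On the real data the same call would be `missing_summary(df)`, which gives a sortable table alongside the heatmap.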

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
df.describe(include = 'object')

### Variables Description

1) **'Unique id'** and **'Order_id'** are identifiers with no predictive value, so they can be dropped during the model-building phase.
2)  There are **3 unique 'channel_name' values, 12 unique 'category' values, and 57 unique 'Sub-category' values**.
3)  Out of **1782 unique cities**, the most frequent is **'HYDERABAD' with 722 entries**.
4)  There are **5 agent shifts**; the most common is **'Morning' with 41,426 interactions**.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for 'channel_name' variable.
df['channel_name'].value_counts(normalize = True)

- Looks like **Inbound** is the dominant channel for customer interaction.

In [None]:
# Check Unique Values for 'category' variable with their value counts.
df['category'].value_counts(normalize = True)

- The bulk of customer issues fall in the **'Returns'**, **'Order Related'**, and **'Refund Related'** categories, approximately **84%** of interactions.

In [None]:
# Check Unique Values for 'Agent_shift' variable.
df['Agent Shift'].value_counts(normalize = True)

- Morning and Evening shifts handle the vast majority of interactions. This can be analyzed further for staffing and shift-level performance.

In [None]:
# Check Unique Values for 'Tenure Bucket' variable.
df['Tenure Bucket'].value_counts(normalize = True)

- A significant portion of agents are either experienced (>90 days) or very new (On Job Training).

In [None]:
# Check Unique Values for 'Product_category' variable.
df['Product_category'].value_counts(normalize = True)

- **Electronics (27.36%), Lifestyle (23.94%),** and **Books & General Merchandise (19.3%)** are the top 3 product categories by interaction volume.

In [None]:
# Check Unique Values for 'CSAT Score' variable.
df['CSAT Score'].value_counts(normalize = True)

- This is a typical **'J-shaped'** distribution. A high percentage of 5s is good, but the substantial number of 1s highlights areas of improvement.

In [None]:
# Check Unique Values for 'Customer_city' variable.
df['Customer_City'].unique()

In [None]:
# Percentage of people who didn't enter the city name.
df['Customer_City'].isnull().sum() / len(df['Customer_City'])  * 100

- **80%** of city data is missing, which makes this column hard to use for predictive modelling.

In [None]:
# Check max value for 'Item Price' variable.
df['Item_price'].max()

In [None]:
# Check CSAT Score by Channel Name.
df.groupby('channel_name')['CSAT Score'].mean().sort_values(ascending=False)

- Email support has a noticeably lower average CSAT score than Outcall and Inbound. Why are Email customers less satisfied?

In [None]:
# Check CSAT Score by Grouping against Manager and Supervisor.
df.groupby(['Manager','Supervisor'])['CSAT Score'].mean().sort_values(ascending=True)

- Nathan Patel and Zoe Yamamoto have high average CSAT, unlike Oliver Nguyen and Dylan Kim. The first supervisor in each pair works under the same manager, John Smith, so there might be a need for training.

In [None]:
# Check CSAT score by product category.
df.groupby('Product_category')['CSAT Score'].mean().sort_values(ascending=True)

- **GiftCard (3.23), Furniture (3.62), Mobile (3.65), Home Appliances (3.70)** are the categories with the lowest CSAT score. These could be pain points that lead many customers to give low CSAT scores.

In [None]:
# Check CSAT Score by Category.
df.groupby('category')['CSAT Score'].mean().sort_values(ascending=False)

- **Others (3.43), Cancellation (3.99), Product Queries (4.04), Order Related (4.10)** are the lowest CSAT interaction categories.

In [None]:
df.describe(include='all').transpose()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
df = pd.read_csv('Customer_support_data.csv')
df.head()

In [None]:
# Convert all column names to snake_case.
import re

def convert_to_snake_case(column_name):
    """Return column_name in lowercase snake_case."""
    # convert to lowercase
    s = column_name.lower()

    # replace each run of non-alphanumeric characters (spaces, hyphens, etc.) with an underscore
    s = re.sub(r'[^a-z0-9_]+','_', s)

    # remove leading or trailing underscores
    s = s.strip('_')

    # replace multiple underscores with a single underscore
    s = re.sub(r'_+','_',s)

    return s


In [None]:
df.columns = [convert_to_snake_case(col) for col in df.columns]

In [None]:
df.columns

In [None]:
# Drop irrelevant columns
df = df.drop(['unique_id','order_id'], axis = 1)
df.head()

In [None]:
# convert timestamp columns to datetime format
date_time_cols= ['order_date_time','issue_reported_at','issue_responded','survey_response_date']

for col in date_time_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce',dayfirst=True)  # 'coerce' will turn parsing errors into NaT (Not a Time)


In [None]:
df['csat_score'].dtype

In [None]:
# Replace missing value of city column 'NaN' to 'Missing_City'.
df['customer_city'] = df['customer_city'].fillna('Missing_City')
df['customer_city'].unique()

In [None]:
# Fill 'NaN' values in the customer_remark to empty string.
df['customer_remarks'] = df['customer_remarks'].fillna('')
df['customer_remarks']

In [None]:
df.info()

In [None]:
df.head()

In [None]:
# Fill 'NaNs' in Product Category with a string like 'No_Product_Context'.
df['product_category'] = df['product_category'].fillna('No_Product_Context')
df['product_category']

In [None]:
# Fill 'item_price' 'NaN' values with '0'.
df['item_price'] = df['item_price'].fillna(0)
df['item_price']

In [None]:
# create a binary column named 'has_order_date' with '1' if 'order_date_time' is not NaN, 0 otherwise.
df['has_order_date']  = df['order_date_time'].notna().astype(int)
df['has_order_date']

In [None]:
df['has_order_date'].unique()

In [None]:
# Create a binary column 'is_connected_call' with '1' if 'connected_handling_time' is not NaN, 0 otherwise.
df['is_connected_call'] = df['connected_handling_time'].notna().astype(int)
df['is_connected_call']

In [None]:
# Fill 'NaNs' in the original 'connected_handling_time' with 0
df['connected_handling_time'] = df['connected_handling_time'].fillna(0)
df['connected_handling_time']

In [None]:
df.head()

In [None]:
df.info()

### What all manipulations have you done and insights you found?

1) Converted all column names to snake_case.

2) Dropped the unique_id and order_id columns.

3) Converted timestamp columns to datetime objects.

4) Handled missing values in customer_city, customer_remarks, product_category, item_price, and connected_handling_time.

5) Created new binary indicator columns (has_order_date, is_connected_call) to capture the information about missingness in order_date_time and connected_handling_time.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# display summary stats of data
df.describe().transpose()

In [None]:
df.describe(include = 'object').transpose()

#### Chart - 1  Pie Chart showing Customer Service Channels

In [None]:
fig = plt.figure(figsize = (3,3))
data = df['channel_name'].value_counts()
plt.pie(data,
       labels = [
           f'{data.index[0]} : {data.values[0]}',
           f'{data.index[1]} : {data.values[1]}',
           f'{data.index[2]} : {data.values[2]}'
       ],
        autopct= '%1.1f%%'
       )

plt.title('Distribution of Customer Service Channel')

##### 1. Why did you pick the specific chart?

To understand how customers interact with support — via Inbound, Outcall, or Email.

##### 2. What is/are the insight(s) found from the chart?

- Majority (~79%) of interactions are through Inbound.
- Email support is minimal (~3.5%).

##### 3. Will the gained insights help creating a positive business impact?

- Inbound is the most critical channel for service experience.
- Opportunity to improve and scale Email and Outcall support.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Email's low share might indicate underutilization or poor awareness of that support option.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize = (5,1))
sns.boxplot(x = df['item_price'], fliersize = 1)
plt.title('Item Price Boxplot')

##### 1. Why did you pick the specific chart?

To detect price outliers and understand spread.

##### 2. What is/are the insight(s) found from the chart?

- Many extreme high-value outliers present.
- Most values are concentrated below ₹5,000.

##### 3. Will the gained insights help creating a positive business impact?

Outliers can distort analysis; must be handled carefully during modeling.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

If outliers are genuine, support may be skewed toward high-value items — risk of bias.
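
A rough way to quantify what the boxplot shows, assuming the default 1.5 × IQR whisker rule that matplotlib and seaborn boxplots use. `iqr_outlier_mask` is a hypothetical helper and the prices below are made-up values.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)

# Toy prices with one extreme value (illustrative only)
toy_prices = pd.Series([200, 350, 400, 450, 500, 650, 800, 95000])
mask = iqr_outlier_mask(toy_prices)
print(f"{mask.sum()} of {len(toy_prices)} values flagged as outliers")
```

Because missing prices were filled with 0 during wrangling, it may be worth applying the rule to the non-zero prices only, for example `iqr_outlier_mask(df.loc[df['item_price'] > 0, 'item_price'])`.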

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize = (5,3))
sns.histplot(data = df,
             x = 'item_price',
             bins = range(0,1500,200)
            )
# Mark the mean item price; the label is computed instead of hard-coded
mean = df['item_price'].mean()
plt.axvline(mean, color = 'red', linestyle = '--')
plt.text(700, 20000, f'mean = {mean:.0f}', color = 'red')
plt.title("Item Price Histogram")

##### 1. Why did you pick the specific chart?

To observe the distribution and skewness in pricing.

##### 2. What is/are the insight(s) found from the chart?

- Highly right-skewed.
- Mean item price ≈ ₹1134, but most transactions are much lower.

##### 3. Will the gained insights help creating a positive business impact?


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes — ignoring skew may result in poor segmentation of premium vs low-value items.

#### Chart - 4 Barplot showing distribution of Agent Tenure Buckets

In [None]:
# Chart - 4 visualization code
plt.figure(figsize= (7,5))
tenure_bucket = df['tenure_bucket'].value_counts()
# Pass hue explicitly: recent seaborn versions warn when palette is used without hue
ax = sns.barplot(x= tenure_bucket.index, y = tenure_bucket.values, hue = tenure_bucket.index, palette = 'viridis', legend=False)
plt.title('Agent Tenure Buckets Barplots')

##### 1. Why did you pick the specific chart?

To examine distribution of agent experience (tenure).

##### 2. What is/are the insight(s) found from the chart?

- Majority have >90 days or are in training.
- Less experienced (0–60 days) agents are fewer.

##### 3. Will the gained insights help creating a positive business impact?

New agents might need better onboarding as they impact customer satisfaction early on.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Potential quality dips if large workloads are handled by fresh hires.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
category_csat = df.groupby('category')['csat_score'].mean().sort_values(ascending = False)
plt.figure(figsize = (12,5))
ax = sns.barplot(x = category_csat.index, y = category_csat.values, palette = 'coolwarm')
plt.xticks(rotation = 45)
plt.grid(axis = 'y', linestyle = '--', alpha = 0.7)
#ax.axhline(df['csat_score'].mean(), ls = '--', color = 'red', label = 'Median_CSAT_Score')
plt.title("Average CSAT Score by Interaction Category",fontsize = 16)

##### 1. Why did you pick the specific chart?

To identify which support types yield higher/lower satisfaction.

##### 2. What is/are the insight(s) found from the chart?

- App/Website, Payments, and Returns have highest CSAT.
- Others and Cancellation are lowest.

##### 3. Will the gained insights help creating a positive business impact?

Prioritize training for teams handling low-CSAT categories.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes — Cancellation and Others indicate dissatisfaction. Fix root causes like delay/refund policies.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Mean CSAT per channel, highest first
channel_csat = df.groupby('channel_name')['csat_score'].mean().sort_values(ascending = False)
plt.figure(figsize = (12,5))
ax = sns.barplot(x = channel_csat.index, y = channel_csat.values, palette = 'viridis')
plt.xticks(rotation = 45)
plt.grid(axis = 'y', linestyle = '--', alpha = 0.7)
#ax.axhline(df['csat_score'].mean(), ls = '--', color = 'red', label = 'Median_CSAT_Score')
plt.title("Average CSAT Score by Channel Name",fontsize = 16)

##### 1. Why did you pick the specific chart?

To compare performance across support channels.

##### 2. What is/are the insight(s) found from the chart?

- Outcall slightly edges Inbound.
- Email scores the lowest.

##### 3. Will the gained insights help creating a positive business impact?

Email team training or process revamp can boost overall CSAT.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes — Poor email CSAT can affect customers who prefer non-verbal channels.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
category_order = df.groupby('category')['csat_score'].mean().sort_values(ascending=False).index.tolist()
channel_order = df.groupby('channel_name')['csat_score'].mean().sort_values().index.tolist()
plt.figure(figsize=(12, 5)) # Make the figure wider
sns.barplot(
    data=df,
    x='category',
    y='csat_score',
    hue='channel_name',
    order=category_order, # Apply order for categories
    hue_order=channel_order, # Apply order for channels
    palette='viridis'
)
plt.title('Average CSAT Score by Interaction Category and Channel', fontsize=18)
plt.xticks(rotation=45, ha='right', fontsize=10)

##### 1. Why did you pick the specific chart?

To analyze how CSAT varies by both interaction type and channel — multivariate view.

##### 2. What is/are the insight(s) found from the chart?

- Email underperforms across many categories.
- Inbound is more consistent.
- Variance is high in some categories.

##### 3. Will the gained insights help creating a positive business impact?

Enables targeting specific category+channel combinations for improvement.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes — CSAT drops in specific category-channel pairs (e.g. Email + Cancellation) need urgent fixes.
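
One way to surface the weakest category-channel pairs numerically instead of reading them off the chart is a grouped summary. The sketch below uses made-up rows with the notebook's column names; on the real data the same `groupby` would run on `df`.

```python
import pandas as pd

# Toy interactions (illustrative values, not results from the dataset)
toy = pd.DataFrame({
    "category": ["Returns", "Returns", "Cancellation",
                 "Cancellation", "Returns", "Cancellation"],
    "channel_name": ["Inbound", "Email", "Inbound",
                     "Email", "Inbound", "Email"],
    "csat_score": [5, 4, 4, 1, 4, 2],
})

# Mean CSAT and interaction count for every category/channel pair, lowest mean first
pair_stats = (
    toy.groupby(["category", "channel_name"])["csat_score"]
       .agg(mean_csat="mean", interactions="count")
       .reset_index()
       .sort_values("mean_csat")
)
print(pair_stats)

# The first row is the pair with the lowest average score
weakest = pair_stats.iloc[0]
```

Keeping the `interactions` count next to the mean helps avoid over-reacting to pairs that have only a handful of records.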

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

In [None]:
import scipy.stats as stats

### Hypothetical Statement - 1
Is there a statistically significant difference in the average CSAT scores across different customer communication channels (channel_name)?

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H_0):** There is no statistically significant difference in the average CSAT score among the different customer communication channels (Email, Inbound, Outcall). Any observed differences are due to random chance.

**Alternate Hypothesis (H_1):** At least one channel's average CSAT score is statistically different from the others.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Isolate the 'csat_score' for each channel type
csat_email = df[df['channel_name'] == 'Email']['csat_score']
csat_inbound = df[df['channel_name'] == 'Inbound']['csat_score']
csat_outcall = df[df['channel_name'] == 'Outcall']['csat_score']

In [None]:
# Perform the One-Way ANOVA test
# Pass each group's data as separate arguments to f_oneway
f_statistic, p_value_simple = stats.f_oneway(csat_email, csat_inbound, csat_outcall)

In [None]:
print("\n--- One-Way ANOVA Results (using scipy.stats.f_oneway) ---")
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value_simple:.4f}")

In [None]:
df['channel_name'].unique()

##### Which statistical test have you done to obtain P-Value?

One-Way Analysis of Variance (ANOVA)

##### Why did you choose the specific statistical test?

Because we are comparing the mean CSAT scores of more than two independent groups (Email, Inbound, Outcall). One-way ANOVA is ideal for this scenario.

### Hypothetical Statement - 2
Is there a statistically significant difference in the average CSAT scores across different interaction categories (category)?

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H_0):** There is no statistically significant difference in the average CSAT score among the various interaction category types. Any observed differences are due to random chance.

**Alternate Hypothesis (H_1):** At least one interaction category's average CSAT score is statistically different from the others.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Get all unique categories
unique_categories = df['category'].unique()

# This extracts the 'csat_score' for all rows belonging to a specific 'category'
csat_scores_by_category = [df[df['category'] == category]['csat_score'] for category in unique_categories]


In [None]:
# The asterisk (*) unpacks the list of Series, passing each Series as a separate argument
f_statistic, p_value_simple = stats.f_oneway(*csat_scores_by_category)

In [None]:
print("\n--- One-Way ANOVA Results (using scipy.stats.f_oneway) ---")
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value_simple:.4f}")

##### Which statistical test have you done to obtain P-Value?

One-Way ANOVA using scipy.stats.f_oneway()

##### Why did you choose the specific statistical test?


We are testing whether there is a significant difference in mean CSAT scores across multiple categorical groups. One-way ANOVA is suitable to check differences between >2 group means.

### Hypothetical Statement - 3

Does the presence of an order date (has_order_date) affect the customer satisfaction score (CSAT)?

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H₀):**
There is no statistically significant difference in the average CSAT scores between interactions that have an order date and those that do not.
Any observed difference is due to random chance. \
**Alternate Hypothesis (H₁):**
There is a statistically significant difference in the average CSAT scores between interactions that have an order date and those that do not.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Group 1: Interactions where has_order_date is 1 (True)
csat_with_order = df[df['has_order_date'] == 1]['csat_score']

# Group 2: Interactions where has_order_date is 0 (False)
csat_without_order = df[df['has_order_date'] == 0]['csat_score']


In [None]:
# We'll use equal_var=False (Welch's t-test) which does not assume equal variances, making it more robust.
t_statistic, p_value_ttest = stats.ttest_ind(a=csat_with_order, b=csat_without_order, equal_var=False)


In [None]:
print("\n--- Independent Samples t-test Results (CSAT by has_order_date) ---")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value_ttest:.4f}")


##### Which statistical test have you done to obtain P-Value?

Welch’s t-test (independent samples t-test with unequal variances) using scipy.stats.ttest_ind() with equal_var=False.

##### Why did you choose the specific statistical test?

This is a binary comparison of two independent groups (has_order_date = 1 vs 0). Welch’s t-test is preferred when variances may not be equal.
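
The three tests above print a p-value but leave the decision implicit. A minimal sketch of the decision rule, assuming a significance level of 0.05 (the notebook does not state one); `interpret_p_value` is a hypothetical helper.

```python
def interpret_p_value(p_value: float, alpha: float = 0.05) -> str:
    """Turn a p-value into a reject / fail-to-reject statement about H0."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return f"p = {p_value:.4g} < {alpha}: reject H0"
    return f"p = {p_value:.4g} >= {alpha}: fail to reject H0"

# Illustrative p-values, not results from this dataset
for p in (0.0003, 0.27):
    print(interpret_p_value(p))
```

In the notebook this could be called as `interpret_p_value(p_value_simple)` or `interpret_p_value(p_value_ttest)`. With tens of thousands of rows even small differences in mean CSAT can be statistically significant, so the size of the difference is worth reporting alongside the p-value.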

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
df.info()

In [None]:
# Make a copy to ensure original df is not modified if you want to rerun parts
df_processed = df.copy()
df_processed.head(10)

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Check Missing Values
df_processed.isnull().sum()

In [None]:
df_processed.drop(columns=['sub_category','order_date_time','customer_city','agent_name','supervisor'], inplace = True)

In [None]:
df_processed.info()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Techniques Used:
1. Filled customer_city with "Missing_City"
→ A meaningful placeholder that preserves categorical integrity.

2. Filled customer_remarks with empty string ''
→ The column is text, so an empty string represents the absence of remarks without harming modeling.

3. Filled product_category with "No_Product_Context"
→ Retains the record instead of dropping the row, while indicating the lack of category info.

4. Filled item_price with 0
→ Assumes an unknown price is negligible in the absence of other information.

5. Filled connected_handling_time with 0
→ Reasonable if connection time wasn’t recorded or the interaction didn’t require it.

6. Created binary flags: has_order_date, is_connected_call
→ Preserves the signal from missingness, which might be useful for modeling.

7. Dropped:
- order_date_time (too sparse)
- sub_category, customer_city, agent_name, supervisor (low utility/high cardinality or already captured indirectly)

Why These Techniques?
- The fill techniques preserve rows rather than deleting them.
- Logical replacements for categorical/textual fields keep those records usable for modeling.
- Binary flags help retain useful information from missingness patterns.
- Dropping highly sparse/complex columns avoids overfitting/noise.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
lower_bound_price = df_processed['item_price'].quantile(0.01)
upper_bound_price = df_processed['item_price'].quantile(0.99)
df_processed['item_price'] = np.clip(df_processed['item_price'], lower_bound_price, upper_bound_price)
print(f"Outliers in 'item_price' clipped to between {lower_bound_price:.2f} and {upper_bound_price:.2f}.")

##### What all outlier treatment techniques have you used and why did you use those techniques?

Techniques Used:
- Clipping (also known as Winsorizing):
item_price was clipped between the 1st percentile and 99th percentile values.

Why This Technique?
- item_price is highly right-skewed with extreme outliers, which can distort model performance.
- Clipping helps reduce model sensitivity to large outliers without losing the rows.
- It retains the overall distribution shape while bounding extreme influence.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
df_processed.shape

In [None]:
# Identify all columns with 'object' dtype
object_columns = df_processed.select_dtypes(include='object').columns.tolist()
object_columns.remove('customer_remarks')

In [None]:
object_columns

In [None]:
# Perform One-hot encoding on identified columns.
df_encoded = pd.get_dummies(df_processed, columns=object_columns, drop_first=True, dtype=int)
df_encoded.drop(columns = 'customer_remarks', inplace = True)

In [None]:
df_encoded.info()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Techniques Used:
- One-hot encoding with pd.get_dummies (drop_first=True, dtype=int):
applied to every remaining object column except customer_remarks, which is free text and is dropped after encoding.

Why This Technique?
- The encoded columns are treated as nominal, so one-hot encoding avoids imposing an artificial order on their values.
- drop_first=True removes one dummy column per feature, which avoids perfectly redundant columns.




### 4. Feature Manipulation & Selection

In [None]:
# Select your features wisely to avoid overfitting
sns.heatmap(df_encoded.corr(), cbar=True, cmap='viridis')

In [None]:
correlation_matrix = df_encoded.corr()
csat_correlations = correlation_matrix['csat_score']

top_correlated_features = csat_correlations.abs().sort_values(ascending=False)
top_correlated_features = top_correlated_features.drop('csat_score')

In [None]:
num_top_features_to_display = 20
print(top_correlated_features.head(num_top_features_to_display))

In [None]:
# Define the exact list of 20 features you want to use, plus the target variable
selected_20_features_list = [
    'item_price',
    'has_order_date',
    'product_category_No_Product_Context',
    'category_Returns',
    'category_Order Related',
    'product_category_Mobile',
    'product_category_Home Appliences',
    'tenure_bucket_On Job Training',
    'issue_reported_at',        # Note: This is a raw datetime column (numerical timestamp)
    'product_category_Electronics',
    'agent_shift_Morning',
    'product_category_Furniture',
    'issue_responded',          # Note: This is a raw datetime column (numerical timestamp)
    'survey_response_date',     # Note: This is a raw datetime column (numerical timestamp)
    'category_Product Queries',
    'product_category_Books & General merchandise',
    'category_Cancellation',
    'manager_William Kim',
    'manager_Jennifer Nguyen',
    'agent_shift_Split',
    'csat_score' # Don't forget your target variable!
]

In [None]:
df_filtered_for_model = df_encoded[
    [col for col in selected_20_features_list if col in df_encoded.columns]
].copy()

In [None]:
df_filtered_for_model.head()

In [None]:
for dt_col in ['issue_reported_at', 'issue_responded', 'survey_response_date']:
    if dt_col in df_filtered_for_model.columns:
        # These columns may already be datetime64 (converted during wrangling), so do not rely on an 'object' dtype check
        dt_values = pd.to_datetime(df_filtered_for_model[dt_col], errors='coerce')
        # Seconds since the Unix epoch; NaT becomes NaN instead of a bogus integer
        unix_seconds = (dt_values - pd.Timestamp('1970-01-01')).dt.total_seconds()
        # Impute missing timestamps with the median
        df_filtered_for_model[dt_col] = unix_seconds.fillna(unix_seconds.median())
        print(f"Converted '{dt_col}' to numerical timestamp.")


In [None]:
print(f"DataFrame shape with selected 20 features: {df_filtered_for_model.shape}")

##### What all feature selection methods have you used  and why?

Methods Used:
1. Correlation-Based Feature Selection:
- Used df_encoded.corr() to calculate pairwise correlations with the target csat_score.
- Selected top 20 features based on absolute correlation strength.

2. One-Hot Encoding (OHE):
- Before correlation, categorical features were encoded, which made them analyzable by correlation matrix.

3. Manual Sanity Check:
- After correlation ranking, a domain-informed filtering step ensured inclusion of interpretable and non-redundant features (e.g., dropped multicollinear ones or low business value fields).

4. Datetime Conversion:
- Converted datetime columns (issue_reported_at, issue_responded, survey_response_date) into Unix timestamp to use in modeling.
- Imputed any resulting NaNs with median.
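
The correlation-based ranking described above can be sketched as follows. This is a minimal illustration on synthetic data: the helper name `top_k_correlated` and the toy column names are assumptions made for the example, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def top_k_correlated(df, target, k):
    """Return the k feature names with the largest absolute Pearson correlation to `target`."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target.abs().sort_values(ascending=False).head(k).index.tolist()


# Synthetic stand-in for df_encoded (hypothetical columns)
rng = np.random.default_rng(0)
n_rows = 500
toy = pd.DataFrame({
    "strong_signal": rng.normal(size=n_rows),
    "weak_signal": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
toy["csat_score"] = (
    2.0 * toy["strong_signal"]
    + 0.5 * toy["weak_signal"]
    + rng.normal(scale=0.5, size=n_rows)
)

selected = top_k_correlated(toy, "csat_score", k=2)
print(selected)
```

Pearson correlation only captures linear, pairwise relationships with the target, so a ranking like this is best treated as a first filter rather than a final verdict on feature usefulness.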

##### Which all features you found important and why?

Key Features and Reasoning:

| Feature                           | Reason                                                                        |
| --------------------------------- | ----------------------------------------------------------------------------- |
| `item_price`                      | Direct customer investment, likely to affect CSAT                             |
| `has_order_date`                  | Indicates real order context, influences satisfaction                         |
| `product_category_*`              | Different product types may have varying service expectations                 |
| `category_*`                      | Nature of issue (Returns, Cancellations) highly correlated to CSAT            |
| `tenure_bucket_On Job Training`   | Agent experience affects customer perception                                  |
| `agent_shift_*`                   | Service quality may vary by shift (e.g., morning = better staff availability) |
| `manager_*`                       | Manager-level differences may show in agent/team performance                  |
| `issue_*`, `survey_response_date` | Time-based features might capture response lags or delay patterns             |
| `csat_score`                      | Target variable for classification                                            |



### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

| Transformation                                     | Columns                                                        | Reason                                                                                                                                      |
| -------------------------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **Datetime to Unix timestamp**                     | `issue_reported_at`, `issue_responded`, `survey_response_date` | ML models can’t interpret datetime strings — converting to **numeric timestamps** enables meaningful analysis (like response delay).        |
| **Binary classification transformation of target** | `csat_score → y = 1 if score ≥ 4 else 0`                       | Reframed the problem as a **binary classification task**: satisfied vs. not satisfied — simplifies modeling and aligns with business goals. |
| **Imputation of NaTs from datetime**               | Filled with column median                                      | Prevents nulls from breaking model training — median ensures minimal skew influence.                                                        |
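
As a hedged sketch of the datetime transformation in the table above, the snippet below converts string timestamps to Unix seconds, imputes unparseable values with the median, and derives a response delay. The helper `to_unix_seconds`, the `toy` frame, and the `response_delay_hours` name are illustrative assumptions, not columns from the dataset.

```python
import pandas as pd


def to_unix_seconds(series):
    """Parse a string column to Unix seconds (float), imputing bad values with the median."""
    parsed = pd.to_datetime(series, errors="coerce")  # unparseable -> NaT
    # Subtracting the epoch and dividing by one second yields floats, so NaT becomes NaN
    seconds = (parsed - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
    return seconds.fillna(seconds.median())


# Toy data with one unparseable response time (hypothetical values)
toy = pd.DataFrame({
    "issue_reported_at": ["2023-08-01 10:00:00", "2023-08-01 11:00:00", "2023-08-01 12:00:00"],
    "issue_responded": ["2023-08-01 10:30:00", "not a date", "2023-08-01 14:00:00"],
})

reported = to_unix_seconds(toy["issue_reported_at"])
responded = to_unix_seconds(toy["issue_responded"])

# Response delay in hours; often more informative than either raw timestamp alone
response_delay_hours = (responded - reported) / 3600.0
print(response_delay_hours.tolist())
```

Working in floats avoids the pitfall of casting `NaT` to `int64`, where the missing value is either rejected or silently turned into a large negative number that a later `fillna` no longer recognizes as missing.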


In [None]:
# 1. Separate features (X) and target (y)
X = df_filtered_for_model.drop('csat_score', axis=1)
y = df_filtered_for_model['csat_score']

In [None]:
numerical_features_for_ops = [
    'item_price',
    # Note: 'connected_handling_time' and 'response_time_in_hours' are NOT in your top 20 list.
    # If they were intended to be transformed/scaled, they need to be in the 20 features list.
    # The datetime features ('issue_reported_at', 'issue_responded', 'survey_response_date')
    # will also be treated as numerical for scaling if they remain.
    'issue_reported_at',
    'issue_responded',
    'survey_response_date'
]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
y = df_filtered_for_model['csat_score'].apply(lambda score: 1 if score >= 4 else 0)

### 6. Data Scaling

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

##### What data splitting ratio have you used and why?

- Splitting Ratio: 80% Train / 20% Test  
#### Why this ratio?
- Standard best practice in classification tasks with moderately large datasets.
- Ensures enough data for training while keeping a fair test set to evaluate generalization.
- Stratified split was used via stratify=y to maintain class balance between train and test sets.
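
To illustrate what `stratify=y` buys, here is a small sketch on synthetic labels with an assumed 80/20 class ratio (the real class ratio in this dataset may differ). All `*_demo` names are hypothetical and chosen so they do not overwrite notebook variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 800 satisfied (1) and 200 not satisfied (0)
y_demo = np.array([1] * 800 + [0] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, the share of class 1 is preserved in both subsets
train_rate = y_demo_train.mean()
test_rate = y_demo_test.mean()
print(f"Class 1 share: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, the minority share in a 20% test set can drift by chance, which makes per-class recall and precision harder to compare between runs.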

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nFinal Data Shapes:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


In [None]:
X_train.dtypes

In [None]:
datetime_cols_to_convert = [
    'issue_reported_at',
    'issue_responded',
    'survey_response_date'
]

In [None]:
# Note: this modifies X only; X_train and X_test created by the earlier split are not updated
for col in datetime_cols_to_convert:
    if col not in X.columns:
        print(f"Warning: Column '{col}' not found in DataFrame. Skipping conversion.")
        continue

    if pd.api.types.is_numeric_dtype(X[col]):
        # Already numeric (converted earlier); re-parsing would misread the values as nanoseconds
        print(f"'{col}' is already numeric. Skipping conversion.")
        continue

    # 1. Ensure the column is a datetime object (errors='coerce' turns unparseable dates into NaT)
    parsed = pd.to_datetime(X[col], errors='coerce')

    # 2. Convert datetime to Unix timestamp (seconds since 1970-01-01) as floats,
    #    so that NaT becomes NaN instead of an invalid integer
    seconds = (parsed - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)

    # 3. Fill NaNs with the median of the column's numerical values
    X[col] = seconds.fillna(seconds.median())

    print(f"Converted '{col}' to numerical timestamp and filled NaNs with median.")


### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

In [None]:
# Import packages for data modeling
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance

### ML Model - 1 Logistic Regression

In [None]:
# Construct a logistic regression model and fit it to the training set
log_clf = LogisticRegression(random_state=0,max_iter=500).fit(X_train,y_train)


In [None]:
# Use the logistic regression model to get predictions on the encoded testing set
y_pred = log_clf.predict(X_test)

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
# Get the feature names from the model and the model coefficients (which represent log-odds ratios)
# Place into a DataFrame for readability
pd.DataFrame(data={"Feature Name":log_clf.feature_names_in_ , "Model Coefficient":log_clf.coef_[0]})

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

##### Why Logistic Regression?
- A strong baseline model for binary classification problems.
- Offers interpretability through feature coefficients (log-odds).
- Simple, fast, and useful to evaluate before deploying complex models.


##### Interpretation:
- High accuracy and strong performance for class 1 (Satisfied).
- Very poor recall for class 0 (Not satisfied) – model fails to catch dissatisfied users.
- This is due to class imbalance — Logistic Regression is biased toward majority class.
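
One common way to counter this majority-class bias is class weighting. The sketch below is an illustration on synthetic, overlapping one-dimensional data with an assumed 20/80 class split; it is not a result from the Flipkart dataset, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: minority class 0 centred at -0.5, majority class 1 centred at +0.5
rng = np.random.default_rng(0)
X_demo = np.concatenate(
    [rng.normal(-0.5, 1.0, 400), rng.normal(0.5, 1.0, 1600)]
).reshape(-1, 1)
y_demo = np.array([0] * 400 + [1] * 1600)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Unweighted model versus a model that reweights classes inversely to their frequency
plain = LogisticRegression(max_iter=500).fit(X_demo_train, y_demo_train)
balanced = LogisticRegression(max_iter=500, class_weight="balanced").fit(
    X_demo_train, y_demo_train
)

recall_plain = recall_score(y_demo_test, plain.predict(X_demo_test), pos_label=0)
recall_balanced = recall_score(y_demo_test, balanced.predict(X_demo_test), pos_label=0)
print(f"Recall (class 0): plain={recall_plain:.2f}, balanced={recall_balanced:.2f}")
```

Class weighting typically raises minority recall at the cost of more false alarms, so precision for class 0 should be checked alongside it.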

### ML Model - 2 Random Forest

In [None]:
# Split the data into training and testing sets
# Note: this overwrites the X_test and y_test created by the earlier 80/20 split
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [None]:
# Split the training data into training and validation sets (60/20/20 overall)
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0, stratify=y_tr)

In [None]:
# Get shape of each training, validation, and testing set
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape

In [None]:
# Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=0)

# Create a dictionary of hyperparameters to tune
cv_params = {'max_depth': [5, 7, None],
             'max_features': [0.3, 0.6],
            #  'max_features': 'auto'
             'max_samples': [0.7],
             'min_samples_leaf': [1,2],
             'min_samples_split': [2,3],
             'n_estimators': [75,100,200],
             }

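# Define the list of scoring metrics to capture
# Note: 'precision', 'recall', and 'f1' are computed for the positive class (label 1 = satisfied)
scorings = ['accuracy', 'precision', 'recall', 'f1']

# Instantiate the GridSearchCV object; the best model is refit using positive-class recall
rf_cv = GridSearchCV(rf, cv_params, scoring=scorings, cv=5, refit='recall')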

In [None]:
%%time
rf_cv.fit(X_train, y_train)

In [None]:
# Examine best recall score
rf_cv.best_score_

In [None]:
# Examine best parameters
rf_cv.best_params_

In [None]:
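# Create a confusion matrix to visualize the results of the classification model

# Predict on the validation set with the tuned random forest
# (the earlier y_pred came from logistic regression on a different test set)
y_pred = rf_cv.best_estimator_.predict(X_val)

# Compute values for confusion matrix
rf_cm = confusion_matrix(y_val, y_pred)

# Create display of confusion matrix
rf_disp = ConfusionMatrixDisplay(confusion_matrix=rf_cm, display_labels=None)

# Plot confusion matrix
rf_disp.plot()

# Display plot
plt.show()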

In [None]:
print(classification_report(y_val, rf_cv.best_estimator_.predict(X_val)))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Why Random Forest?
- A powerful ensemble classifier that combines multiple decision trees to reduce overfitting and boost accuracy.
- Captures non-linear relationships and feature interactions that Logistic Regression cannot model directly, and can be adapted to imbalanced classes through class weighting.

#### 2. Feature-Importance

In [None]:

importances = rf_cv.best_estimator_.feature_importances_
rf_importances = pd.Series(importances, index=X_test.columns)

fig, ax = plt.subplots()
rf_importances.plot.bar(ax=ax)
ax.set_title('Feature importances')
ax.set_ylabel('Mean decrease in impurity')
fig.tight_layout()

In [None]:
# Predict on the test set with the tuned random forest
y_pred = rf_cv.best_estimator_.predict(X_test)

# Compute values for confusion matrix
rf_cm = confusion_matrix(y_test, y_pred)

# Create display of confusion matrix
rf_disp = ConfusionMatrixDisplay(confusion_matrix=rf_cm, display_labels=None)

# Plot confusion matrix
rf_disp.plot()

# Display plot
plt.title('Random forest - test set')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

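I used GridSearchCV with 5-fold cross-validation. \
**Reason:** It exhaustively searches through a defined parameter grid and evaluates each combination across multiple folds.
Refitting was done on 'recall', with the aim of prioritizing dissatisfied customers (class 0), who are rarer but more critical from a business perspective. Note that scikit-learn's built-in 'recall' scorer measures recall for the positive label (class 1, satisfied), so this setup actually selects the model with the best recall for satisfied customers; targeting class 0 requires a custom scorer.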
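
A minimal sketch of a grid search that refits on recall for class 0 is shown below. It uses `make_scorer` with `pos_label=0` on synthetic data; the tiny parameter grid, the dataset, and the scorer key `recall_class_0` are assumptions chosen to keep the example fast, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: roughly 20% class 0, 80% class 1
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.2, 0.8], random_state=0
)

# Custom scorer: recall computed with class 0 treated as the positive label
recall_class_0 = make_scorer(recall_score, pos_label=0)
scoring = {"recall_class_0": recall_class_0, "accuracy": "accuracy"}

# Deliberately small grid so the sketch runs quickly
param_grid = {"max_depth": [3, None], "n_estimators": [20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=scoring,
    cv=3,
    refit="recall_class_0",  # pick the model with the best minority-class recall
)
search.fit(X_demo, y_demo)

best_recall_0 = search.best_score_
print(search.best_params_, round(best_recall_0, 3))
```

With multi-metric scoring, `refit` must name one of the keys in the `scoring` dictionary, and `best_score_` then reports the mean cross-validated value of that metric.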

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

| Metric                  | Logistic Regression | Random Forest |
| ----------------------- | ------------------- | ------------- |
| **Accuracy**            | 82%                 | 82%           |
| **Recall (Class 0)**    | 0.02                | **0.01**      |
| **Precision (Class 0)** | 0.47                | **0.18**      |
| **F1-Score (Class 0)**  | 0.05                | **0.02**      |
| **Recall (Class 1)**    | 0.99                | **0.99**      |
| **F1-Score (Class 1)**  | 0.90                | **0.90**      |


Observation:
- Recall for class 1 (satisfied) remains very strong.
- Class 0 detection is still weak: recall stayed near zero (0.02 to 0.01) and precision dropped from 0.47 to 0.18.
- Feature importance plot helps interpret the model: item_price, has_order_date, and No_Product_Context are top contributors.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

1. **Accuracy:**
- Measures how many predictions were correct.
- Business Use: General health of model — but misleading under class imbalance.

2. **Precision (Class 0):**
- Of all cases predicted as "not satisfied", how many were actually correct?
- Business Impact: High precision = fewer false alarms → avoids unnecessary service escalations.

3. **Recall (Class 0):**
- Of all truly dissatisfied customers, how many did we catch?
- Business Impact: High recall is crucial for preventing churn and reputational damage.

4. **F1-Score:**
- Harmonic mean of precision and recall — good for imbalanced datasets.
- Business Impact: Helps balance missed dissatisfied customers vs. over-alerting.
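
The four metrics above can be worked through by hand from a confusion matrix. The counts below are made up for illustration (they are not this notebook's results), and class 0 is treated as the class of interest.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 20 dissatisfied (0) and 80 satisfied (1) customers
y_true_demo = [0] * 20 + [1] * 80
# The model flags 5 of the 20 dissatisfied customers, and wrongly flags 5 satisfied ones
y_pred_demo = [0] * 5 + [1] * 15 + [0] * 5 + [1] * 75

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows = actual, columns = predicted
caught, missed = cm[0]            # actual class 0: predicted 0, predicted 1
false_alarms, correct_ok = cm[1]  # actual class 1: predicted 0, predicted 1

accuracy = (caught + correct_ok) / cm.sum()
precision_0 = caught / (caught + false_alarms)  # flagged cases that were truly dissatisfied
recall_0 = caught / (caught + missed)           # dissatisfied customers that were caught
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Cross-check the hand calculation against scikit-learn
sk_precision_0 = precision_score(y_true_demo, y_pred_demo, pos_label=0)
sk_recall_0 = recall_score(y_true_demo, y_pred_demo, pos_label=0)
sk_f1_0 = f1_score(y_true_demo, y_pred_demo, pos_label=0)

print(accuracy, precision_0, recall_0, round(f1_0, 3))
```

In this toy case accuracy is 0.80 even though only a quarter of dissatisfied customers are caught, which is the pattern that makes accuracy misleading under class imbalance.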



### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered the following metrics:
| Metric                              | Why it matters for business                                                                                                    |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| **Recall (Class 0: Not Satisfied)** | **Most important** — helps us identify dissatisfied customers. Missing these can lead to churn, complaints, and loss of trust. |
| **Precision (Class 0)**             | Important to avoid **false alarms** — unnecessarily escalating satisfied cases adds cost and agent load.                       |
| **F1-Score (Class 0)**              | Balances recall and precision — ideal for **imbalanced data**.                                                                 |
| **Accuracy**                        | Secondary metric — gives an overall model performance, but not suitable alone due to class imbalance.                          |


**Business Goal:** Minimize customer churn and negative reviews by maximizing recall for dissatisfied customers, while keeping false positives in check.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the **Random Forest model** as the **final prediction model.**                       
### Why?
- Better interpretability with feature importances.
- Handles non-linearities, missing values, and categorical splits better than Logistic Regression.
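- Accuracy matches Logistic Regression (82%), although precision for class 0 dropped (from 0.47 to 0.18) and recall for class 0 stayed near zero, so this choice rests on flexibility and tuning headroom rather than on the current minority-class metrics.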
- Model explainability through impurity-based feature ranking is easier and clearer.
- Can be tuned further with class weights, threshold tuning, or ensemble strategies to improve minority class recall.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used Random Forest, an ensemble learning method that constructs multiple decision trees and aggregates their predictions to improve performance and reduce overfitting.
- It uses bagging (bootstrap aggregation) to train each tree on a different random subset.
- Final prediction is based on majority vote (for classification tasks like ours).
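
The bagging and majority-vote idea can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not how the tuned model above was built; note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted probabilities rather than counting hard votes. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
trees = []
for seed in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bootstrap: draw training rows with replacement
    idx = rng.integers(0, len(X_demo_train), size=len(X_demo_train))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_demo_train[idx], y_demo_train[idx]))

# Each row of `votes` holds one tree's predictions for every test sample
votes = np.stack([tree.predict(X_demo_test) for tree in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)

vote_accuracy = (majority == y_demo_test).mean()
print(f"Majority-vote accuracy: {vote_accuracy:.2f}")
```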

| Top Features Identified               | Business Interpretation                                                |
| ------------------------------------- | ---------------------------------------------------------------------- |
| `item_price`                          | Higher order value → more sensitive CSAT impact                        |
| `has_order_date`                      | Orders with missing date context lead to uncertainty or bad experience |
| `product_category_No_Product_Context` | Absence of product context reduces agent effectiveness                 |
| `category_Returns`                    | Return cases handled well → drive positive CSAT                        |
| `category_Order Related`              | Delivery/fulfillment issues correlate with dissatisfaction             |


**Visualization:** I used rf_cv.best_estimator_.feature_importances_ to generate a bar plot showing top drivers of model predictions, aiding business teams in targeting problem areas.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the best model with pickle
import pickle

# Save the model
with open('best_model.pkl', 'wb') as file:
    pickle.dump(rf_cv.best_estimator_, file)

print("Model saved successfully as 'best_model.pkl'")


In [None]:
import joblib

# Save the model
joblib.dump(rf_cv.best_estimator_, 'best_model.joblib')

print("Model saved successfully as 'best_model.joblib'")


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
import pickle

# Load the model from file
with open('best_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

print("Model loaded successfully from 'best_model.pkl'")


In [None]:
# Assuming X_test is your unseen/test data from earlier
y_pred_loaded = loaded_model.predict(X_test)

# Print first 10 predictions
print("Sample Predictions:", y_pred_loaded[:10])

In [None]:
from sklearn.metrics import classification_report

print("\nEvaluation on Test Data:")
print(classification_report(y_test, y_pred_loaded))


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

### **Final Conclusion**

In this project, we developed a robust machine learning pipeline to predict **Flipkart customer service satisfaction (CSAT)** using interaction-level data across various support channels.

* We started with a **thorough understanding** of the business problem and performed **extensive EDA**, discovering key trends like:

  
  * Most support comes via Inbound calls
  * Certain categories like **Returns** and **App/Website issues** drive higher CSAT
  * **Email support** tends to receive lower satisfaction scores


* We handled missing data and outliers with thoughtful strategies and created **new informative features** like:
  * `has_order_date`
  * `is_connected_call`


* Two models were implemented:

  * **Logistic Regression** (baseline)
  * **Random Forest** (tuned with GridSearchCV)


* **Random Forest** was selected as the final model for its flexibility and tuning headroom; both models still show very low **recall for the minority class (dissatisfied customers)**, which remains the main gap to close.

* The model was saved using both **Pickle and Joblib**, and successfully reloaded for prediction, ensuring it's **deployment-ready**.

---

### **Business Impact**

This model can help Flipkart:

* **Proactively identify unhappy customers**
* Improve agent training and ticket routing
* Monitor and optimize **category-wise and channel-wise service quality**

With further tuning (e.g., threshold adjustment, SMOTE for imbalance), this system can be integrated into real-time support workflows to improve customer retention and brand loyalty.
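
As a hedged sketch of the threshold adjustment mentioned above, the snippet below flags a customer as dissatisfied whenever the predicted probability of class 0 reaches a chosen threshold. The synthetic data and the 0.3 threshold are illustrative assumptions; in practice the threshold would be chosen on a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 20% class 0 (dissatisfied), 80% class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.2, 0.8], random_state=0
)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=0)
demo_model.fit(X_demo_train, y_demo_train)

# predict_proba columns follow demo_model.classes_, so column 0 is P(class 0)
p_dissatisfied = demo_model.predict_proba(X_demo_test)[:, 0]


def recall_for_class_0_at(threshold):
    """Recall for class 0 when flagging every sample with P(class 0) >= threshold."""
    flagged = np.where(p_dissatisfied >= threshold, 0, 1)
    return recall_score(y_demo_test, flagged, pos_label=0)


recall_default = recall_for_class_0_at(0.5)
recall_lowered = recall_for_class_0_at(0.3)
print(f"Recall (class 0): threshold 0.5 -> {recall_default:.2f}, 0.3 -> {recall_lowered:.2f}")
```

Lowering the threshold can only keep or increase recall for class 0, but it also flags more satisfied customers, so the precision trade-off needs to be weighed against escalation cost.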

---

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***