# **CSAT Intel: Forecasting Flipkart Customer Satisfaction**



##### **Project Type**    : Classification
##### **Contribution**    : Individual
##### **- R.KAMALI**      

# **Project Summary**

Customer satisfaction is a critical performance metric in the fast-paced world of e-commerce today. Knowing how users feel after interacting with support is essential for a top platform like Flipkart in order to retain users and increase operational efficiency. Direct information about customer sentiment can be obtained from the **CSAT (Customer Satisfaction) Score**, which is usually obtained through feedback surveys and ranges from 1 to 5. However, not all customers complete the survey, and waiting for responses delays action. This project uses machine learning to try to close that gap.

"**CSAT Intel**" is a predictive analytics tool that uses classification models to forecast customer satisfaction ratings from support interaction data. The goal is to build a system that can reliably categorize CSAT scores (1–5) based on a variety of operational characteristics, including issue type, agent profile, shift timing, and interaction response time.


#### **KEY COLUMNS:**
The dataset used contains over **85,000** historical support records.

**Unique id** – Unique identifier for each support ticket or customer interaction.

**channel_name** – Type of interaction channel (e.g., Inbound, Outcall, Self-service).

**category** – High-level classification of the issue (e.g., Product Queries, Returns, Order Related).

**Sub-category** – More specific sub-type of issue within the main category (e.g., Life Insurance, Installation/demo).

**Customer Remarks** – Text remarks or complaints provided by the customer.

**Order_id** – Unique Flipkart order ID associated with the support interaction.

**order_date_time** – Date and time when the customer placed the order.

**Issue_reported at** – Timestamp indicating when the issue was first raised.

**issue_responded** – Timestamp indicating when the agent responded to the issue.

**Survey_response_Date** – Date when the customer completed the CSAT feedback survey.

**Customer_City** – City from which the customer contacted Flipkart support.

**Product_category** – The category of the product involved in the support ticket.

**Item_price** – Price of the item mentioned in the order (in INR).

**connected_handling_time** – Total connected time (in minutes) the agent spent handling the issue.

**Agent_name** – Name of the customer service agent handling the ticket.

**Supervisor** – Supervisor who oversees the performance of the agent.

**Manager** – Manager responsible for the agent’s overall performance.

**Tenure Bucket** – Experience range of the agent (e.g., 0–30 days, 31–90 days, >90 days).

**Agent Shift** – Shift timing when the agent handled the ticket (Morning, Evening, Night).

**CSAT Score** – Customer Satisfaction Score, ranging from 1 (Very Dissatisfied) to 5 (Very Satisfied) — [ target variable for prediction ]

#### **CLASSIFICATION MODELS USED:**
1. Random Forest Classifier:

Robust, easy to tune, works well with categorical and numerical data, and handles imbalanced classes better than many others.
Low risk of overfitting, good performance without much parameter tuning.

2. Logistic Regression:

Simple, interpretable, and fast. Great for establishing a baseline performance.
Helps understand the impact of individual features on prediction. Useful for explaining model predictions to non-technical stakeholders.

#### **EVALUATION METRICS:**
**Accuracy** – Measures how often the model predicts correctly overall.

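**Precision, Recall, F1-Score** – Precision is the share of predicted members of a class that truly belong to it, Recall is the share of actual members of a class that the model finds, and F1 is the harmonic mean of the two.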

**Confusion Matrix** – Shows actual vs. predicted values to understand model errors.

**Cross-Validation** – Tests model reliability by training on multiple data splits.

**Hyperparameter Tuning (GridSearchCV)** – Finds the best model settings for improved performance.
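
As a small illustration of the per-class metrics listed above, the sketch below computes precision, recall, and F1 on a handful of invented labels. The labels are assumptions made up for this example and are not taken from the Flipkart dataset.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for illustration only (not taken from the dataset)
y_true = ["High", "High", "High", "Low", "Low", "Neutral"]
y_pred = ["High", "High", "Low", "Low", "High", "Neutral"]

# Fix the label order so the output arrays are easy to read
labels = ["Low", "Neutral", "High"]
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

for name, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f"{name:8s} precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

Reading the metrics per class matters here because a model can score well on the majority class while missing most dissatisfied customers.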

#### **TARGET VARIABLE:**
The CSAT values were optionally divided into three segments to help make the insights more actionable:

Low (1–2): Dissatisfied customers who are at risk

Neutral (3): Customers at moderate risk

High (4–5): Satisfied customers
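
The three segments above can be sketched with `pd.cut`, as one possible vectorized way to do the mapping. The scores below are hypothetical placeholders; in the notebook they would come from the `CSAT Score` column.

```python
import pandas as pd

# Hypothetical scores for illustration; the real values come from the "CSAT Score" column
scores = pd.Series([1, 2, 3, 4, 5], name="CSAT Score")

# Bins are right-inclusive: (0, 2] -> Low, (2, 3] -> Neutral, (3, 5] -> High
segments = pd.cut(scores, bins=[0, 2, 3, 5], labels=["Low", "Neutral", "High"])
print(segments.tolist())
```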

# **GitHub Link:**

https://github.com/Kamali-836/CSAT-Intel-Forecasting-Flipkart-Customer-Satisfaction

# **Problem Statement**


To predict Customer Satisfaction (CSAT) scores using Flipkart’s customer support data by analyzing operational features such as issue category, agent profile, and response time, with the goal of identifying dissatisfied customers early and enabling timely service improvements.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd                 # For working with dataframes
import numpy as np                  # For numerical operations

import matplotlib.pyplot as plt     # For basic plotting
import seaborn as sns               # For statistical visualizations
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")   # Suppress warnings for cleaner output

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV  # For splitting and model validation
from sklearn.preprocessing import OneHotEncoder, StandardScaler                      # For encoding and scaling
from sklearn.pipeline import Pipeline                                                # For combining preprocessing + model
from sklearn.compose import ColumnTransformer                                        # For column-wise preprocessing
from scipy import stats                                                              # For hypothesis testing
from scipy.stats import spearmanr
from sklearn.decomposition import PCA                                                # Dimensionality Reduction

from sklearn.linear_model import LogisticRegression              # Baseline model
from sklearn.ensemble import RandomForestClassifier              # Tree-based model
from xgboost import XGBClassifier                                # Boosted tree model

import joblib

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # For measuring model performance

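# Upload the CSV when running in Google Colab; skipped in other environments
try:
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    print("google.colab not available; expecting Customer_support_data.csv in the working directory.")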

### Dataset Loading

In [None]:
try:
    df = pd.read_csv("Customer_support_data.csv")
    print("Dataset loaded !")
except FileNotFoundError:
    print("File not found. Please check the file name or path.")
except Exception as e:
    print("An unexpected error occurred:", str(e))

### Dataset First View

In [None]:
# first 5 rows of the dataset

df.head()

### Dataset Rows & Columns count

In [None]:
# Total number of rows and columns in the dataset

print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

### Dataset Information

In [None]:
# Display data types, non-null counts, and memory usage

df.info()

#### Duplicate Values

In [None]:
# Duplicate rows in the dataset

dcount = df.duplicated().sum()
print("Number of duplicate rows:", dcount)

#### Missing Values/Null Values

In [None]:
# Count of missing/null values in each column

miss = df.isnull().sum()
miss = miss[miss > 0].sort_values()

# Display the columns with missing values

print("Number of columns with missing values:", len(miss))
print("\nColumns with missing values:\n")
print(miss)


In [None]:
# Bar chart showing count of missing values per column

plt.figure(figsize=(10, 5))
miss.plot(kind='barh', color='coral')
plt.title("Missing Values Count per Column")
plt.xlabel("Number of Missing Values")
plt.ylabel("Columns")
plt.show()


### What did you know about your dataset?

The dataset includes 85,907 rows and 20 columns, without any duplicate records. The target variable is CSAT Score, which ranges from 1 to 5 and has no missing values. Seven columns contain missing values; for example, connected_handling_time is missing in more than 99% of the rows. Most features are categorical. The Issue_reported at and issue_responded columns can be used to derive response time. The dataset is suitable for classification, but it needs handling of missing values, encoding, and feature engineering before modeling.
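
To judge which columns are too sparse to keep, it can help to look at the share of missing rows rather than raw counts. The sketch below shows the idea on a tiny made-up frame; the values are invented and only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the support data
toy = pd.DataFrame({
    "CSAT Score": [5, 4, 1, 5],
    "Item_price": [499.0, np.nan, np.nan, 1299.0],
    "connected_handling_time": [np.nan, np.nan, np.nan, 3.5],
})

# isnull().mean() gives the fraction of missing rows per column
missing_pct = (toy.isnull().mean() * 100).round(1)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```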

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns.tolist()

In [None]:
# Dataset Describe

df.describe()

### Check Unique Values for each variable.

In [None]:
# Display the number of unique values in each column

ucount = df.nunique().sort_values()
print("Number of unique values per column:\n")
print(ucount)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Handling Missing Values & Missing Value Imputation

# 1. Drop highly sparse columns

df = df.drop(columns=[
    'connected_handling_time',  # 99.7% missing
    'Customer Remarks',         # 66% missing
    'order_date_time'           # 80% missing
])


# 2. Drop rows with missing values in important features

df = df.dropna(subset = [
    'Order_id', 'Product_category', 'Item_price',
    'Customer_City'
])




# Handling Outliers & Outlier Treatments (here: deriving response time and removing invalid values)

# 1. Convert date columns to datetime format

df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], errors='coerce')
df['issue_responded'] = pd.to_datetime(df['issue_responded'], errors='coerce')


# 2. Create a new feature — response time in minutes

df['response_time_mins'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / 60


# 3. Drop rows with invalid or negative response times

df = df[df['response_time_mins'] >= 0]





# Reset index after cleaning

df.reset_index(drop=True, inplace=True)


print("Data is cleaned and ready for analysis.")
print("New shape:", df.shape)
miss = df.isnull().sum()
print(miss)

### What all manipulations have you done and insights you found?

**Manipulations Done:**

1. Removed connected_handling_time, Customer Remarks, and order_date_time because most of their values are missing (about 99.7%, 66%, and 80% respectively).

2. Removed records with nulls in Order_id, Product_category, Item_price, and Customer_City.

3. Converted date columns to datetime format. Parsed Issue_reported at and issue_responded to enable time-based analysis.

4. A new feature created, response_time_mins = time taken (in minutes) for the agent to respond to the issue.

5. Filtered out rows where response time was negative or invalid.

6. Reindexed the cleaned dataset for consistency.

**Insights Found:**

1. Several columns were not usable due to excessive missing data, confirming the need for targeted data collection.

2. Response time is now a measurable metric, which can be analyzed against CSAT scores.

3. The dataset after cleaning is now free of nulls in critical fields, enabling smooth EDA and modeling.

4. Many columns are categorical, and will require encoding before modeling.

5. The dataset is now in a structured, analysis-ready state with consistent formatting and well-defined features.
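
The response-time steps described above can be sketched on a few invented rows. The day-first format string used here is an assumption about how the raw timestamps are written and should be checked against the actual file; passing an explicit format avoids silent day/month swaps.

```python
import pandas as pd

# Made-up timestamps; the day-first format string is an assumption about the raw data
toy = pd.DataFrame({
    "Issue_reported at": ["01/08/2023 11:13", "01/08/2023 12:52", "not a date"],
    "issue_responded": ["01/08/2023 11:47", "01/08/2023 12:50", "01/08/2023 13:00"],
})

fmt = "%d/%m/%Y %H:%M"
reported = pd.to_datetime(toy["Issue_reported at"], format=fmt, errors="coerce")
responded = pd.to_datetime(toy["issue_responded"], format=fmt, errors="coerce")

# Difference in minutes; unparseable timestamps become NaN
toy["response_time_mins"] = (responded - reported).dt.total_seconds() / 60

# NaN >= 0 evaluates to False, so unparseable rows are dropped along with negative durations
clean = toy[toy["response_time_mins"] >= 0]
print(clean)
```

Because the `>= 0` filter also removes rows whose timestamps failed to parse, it is worth comparing the row count before and after to see how much data this step discards.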

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# UNIVARIATE ANALYSIS

# CSAT Score Distribution

sns.countplot(x='CSAT Score', data=df, palette='pastel')
plt.title("CSAT Score Distribution")
plt.xlabel("CSAT Score")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A countplot is great for visualizing the frequency of each CSAT score so that we can see the class imbalance within the target variable.

##### 2. What is/are the insight(s) found from the chart?

Most customers gave a score of 5 (very satisfied), while scores of 1 to 3 occurred far less often, which is a strong sign of class imbalance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this helps us focus more on the customers who gave low scores. If we don’t fix the imbalance, the model may always guess score 5, which can miss unhappy customers — and that can hurt the business.
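
To make the "always guess score 5" risk concrete, the sketch below computes the accuracy of a majority-class guess. The class counts are invented for illustration; in the notebook, `df["CSAT Score"]` would take the place of the toy series.

```python
import pandas as pd

# Invented class counts for illustration; substitute df["CSAT Score"] in the notebook
scores = pd.Series([5] * 70 + [4] * 12 + [3] * 3 + [2] * 2 + [1] * 13)

# Share of each score, and the accuracy obtained by always predicting the most common one
shares = scores.value_counts(normalize=True).sort_index()
majority_class = shares.idxmax()
baseline_accuracy = shares.max()

print(shares)
print(f"Always predicting {majority_class} gives {baseline_accuracy:.0%} accuracy")
```

Any trained model should be compared against this baseline, since a high accuracy figure alone says little when one class dominates.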

#### Chart - 2

In [None]:
# UNIVARIATE ANALYSIS

# Issue Category Distribution

sns.countplot(y='category', data=df, order=df['category'].value_counts().index, palette='Set2')
plt.title("Support Issue Categories")
plt.xlabel("Count")
plt.ylabel("Category")
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal countplot is perfect to show how many tickets fall under each issue category, especially when the labels are long.

##### 2. What is/are the insight(s) found from the chart?

Most support issues are about Returns and Order Related problems. Other categories like Refunds and Product Queries are much less frequent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, knowing that Returns and Order issues are the most common helps Flipkart focus its resources, improve those processes, and train agents better.
If these high-volume categories are not handled well, it can lead to more low CSAT scores and negative business impact.

#### Chart - 3

In [None]:
# UNIVARIATE ANALYSIS

# Response Time Distribution

sns.histplot(df['response_time_mins'], bins=50, kde=True, color='skyblue')
plt.title("Distribution of Agent Response Time (minutes)")
plt.xlabel("Response Time (minutes)")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE helps us understand how response times are spread out and whether there are outliers or long delays.

##### 2. What is/are the insight(s) found from the chart?

Most tickets were responded to quickly (under 10,000 minutes), but a few took extremely long — which causes a skewed distribution.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The long response time outliers highlight inefficiencies. Fixing these can improve CSAT scores. If ignored, they may hurt customer trust and satisfaction (negative impact).

#### Chart - 4

In [None]:
# BIVARIATE ANALYSIS

# CSAT Score vs Channel Name (Categorical + Categorical)

sns.countplot(x='channel_name', hue='CSAT Score', data=df, palette='Set2')
plt.title("CSAT Score by Support Channel")
plt.xlabel("Support Channel")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A grouped countplot with hue='CSAT Score' helps compare CSAT levels across different support channels.

##### 2. What is/are the insight(s) found from the chart?

The Inbound channel has the highest number of responses, both good and bad. Most 5-star scores also come from Inbound, while Outcall and Email channels receive fewer ratings overall.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — this tells Flipkart that improving the Inbound channel can impact CSAT the most. However, neglecting Email or Outcall channels may leave some customers unsatisfied, risking negative growth from overlooked segments.

#### Chart - 5

In [None]:
# BIVARIATE ANALYSIS

# Item Price vs Response Time (Numerical + Numerical)

sns.scatterplot(x='Item_price', y='response_time_mins', data=df, hue='CSAT Score', palette='tab10')
plt.title("Item Price vs Response Time (colored by CSAT Score)")
plt.xlabel("Item Price (INR)")
plt.ylabel("Response Time (minutes)")
plt.show()

##### 1. Why did you pick the specific chart?

A scatterplot is the best way to observe the relationship between two numeric variables (in this case, price and response time) while also showing CSAT scores using color.

##### 2. What is/are the insight(s) found from the chart?

There is no strong visible correlation between item price and response time. Most values are clustered near zero response time, but some outliers exist with very long wait times — even for expensive items.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — this helps identify that long delays are not tied to price, meaning all orders need equal attention. Ignoring these high-delay outliers, especially for premium orders, could reduce satisfaction and damage trust.

#### Chart - 6

In [None]:
# MULTIVARIATE ANALYSIS

# Avg Response Time by Category and Shift

pivot = df.pivot_table(values='response_time_mins', index='category', columns='Agent Shift', aggfunc='mean')
sns.heatmap(pivot, annot=True, fmt=".1f", cmap="YlGnBu")
plt.title("Avg Response Time by Category & Shift")
plt.xlabel("Agent Shift")
plt.ylabel("Issue Category")
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap with a pivot table helps visualize the average response time across two dimensions (issue categories and agent shifts), making it a true multivariate view.

##### 2. What is/are the insight(s) found from the chart?

Some categories, such as "Product Queries and Payments", show extremely high response times during specific shifts, while others are handled faster in shifts like Morning or Split. "Offers & Cashback" at Night also stands out with a very long delay.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — identifying which categories and shifts have slower response times helps Flipkart optimize shift scheduling and staffing. If not corrected, these delays can lower customer satisfaction and lead to negative reviews or churn.

#### Chart - 7

In [None]:
# MULTIVARIATE ANALYSIS

# CSAT by Agent Shift and Tenure Bucket

pivot = df.pivot_table(index='Agent Shift', columns='Tenure Bucket', values='CSAT Score', aggfunc='count')
pivot.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='tab20')
plt.title("CSAT Count by Agent Shift and Tenure Bucket")
plt.xlabel("Agent Shift")
plt.ylabel("Number of Tickets")
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart helps compare how many CSAT scores came from different agent experience levels across all shifts, showing 3 variables at once.

##### 2. What is/are the insight(s) found from the chart?

The Morning and Evening shifts have the highest number of tickets. Most are handled by agents with >90 days experience, but there’s also a strong presence of less experienced agents, especially in Morning shifts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — it shows which shifts are busiest and which tenure levels are carrying the load. If new agents are overburdened or not trained enough during high-traffic shifts, it can lead to errors and lower CSAT, which may hurt customer trust.

#### Chart - 8 - Correlation Heatmap

In [None]:
# Correlation Matrix

plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix of Numeric Features")
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap of correlation values is the clearest way to show how strongly numerical features are related to each other (positive or negative).

##### 2. What is/are the insight(s) found from the chart?

All correlations are weak. CSAT Score has a slight negative correlation with both item price and response time, meaning longer delays or higher prices may reduce satisfaction — but the effect is minimal.

#### Chart - 9 - Pair Plot

In [None]:
# Pair Plot – Numeric Feature Relationships
sns.pairplot(df[['CSAT Score', 'Item_price', 'response_time_mins']], hue='CSAT Score', palette='Set2')
plt.suptitle("Pair Plot of Key Numeric Features", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is great for exploring relationships between multiple numeric features at once, while using color (hue) to show CSAT Score.

##### 2. What is/are the insight(s) found from the chart?

There’s no clear linear relationship between CSAT and either price or response time, but most data is concentrated at lower values. It also shows how different CSAT scores are spread across price and time ranges.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

*Average response time is the same across all agent shifts.*

**Null Hypothesis (H₀)**: There is no difference in the average agent response time across different agent shifts.

**Alternate Hypothesis (H₁)**: At least one agent shift has a significantly different average response time.

#### 2. Perform an appropriate statistical test.

In [None]:
# Grouping response times by agent shift

groups = [
    df[df['Agent Shift'] == shift]['response_time_mins'].dropna()
    for shift in df['Agent Shift'].unique()
]

# Perform One-way ANOVA

f_stat, p_val = stats.f_oneway(*groups)
print(f"F-Statistic: {f_stat:.2f}")
print(f"P-Value: {p_val:.4f}")


##### Which statistical test have you done to obtain P-Value?

One-way ANOVA (Analysis of Variance)

##### Why did you choose the specific statistical test?

One-way ANOVA is appropriate because:

1.   The dependent variable (response_time_mins) is numerical.
2.   The independent variable (Agent Shift) is categorical with more than 2 groups.
3.   We are testing if the means differ across those groups.
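
One-way ANOVA assumes roughly normal groups with similar variances, while response times are strongly right-skewed. As a hedged cross-check (not part of the original analysis), the Kruskal-Wallis test compares groups by rank and makes no normality assumption. The sketch below runs both tests on synthetic data; the shift names and distributions are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic, right-skewed response times (minutes) for three hypothetical shifts
morning = rng.exponential(scale=10, size=500)
evening = rng.exponential(scale=10, size=500)
night = rng.exponential(scale=30, size=500)

# Parametric test on means versus rank-based test on distributions
f_stat, p_anova = stats.f_oneway(morning, evening, night)
h_stat, p_kruskal = stats.kruskal(morning, evening, night)

print(f"ANOVA:          F={f_stat:.2f}, p={p_anova:.4g}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kruskal:.4g}")
```

If the two tests disagree on the real data, the rank-based result is usually the safer one to report for skewed durations.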

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

*The distribution of CSAT scores is independent of the issue category.*

**Null Hypothesis (H₀)**: CSAT scores are independent of the issue category.

**Alternate Hypothesis (H₁)**: CSAT scores are associated with the issue category.

#### 2. Perform an appropriate statistical test.

In [None]:
# Create a contingency table (Issue Category × CSAT Score)

ctable = pd.crosstab(df['category'], df['CSAT Score'])

# Perform Chi-square test of independence

chi2_stat, p_val, dof, expected = stats.chi2_contingency(ctable)

print(f"Chi-Square Statistic: {chi2_stat:.2f}")
print(f"P-Value: {p_val:.4f}")


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test of Independence

##### Why did you choose the specific statistical test?

Because both variables (category and CSAT Score) are categorical, and we are checking whether there is an association between them. The Chi-square test is designed exactly for this kind of comparison.
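
With tens of thousands of rows, a Chi-square test can return a very small p-value even for a weak association. A common complement is Cramér's V, an effect size between 0 and 1. The sketch below computes it on a made-up contingency table; the counts are invented and the category names are only examples.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up contingency table (rows: issue category, columns: CSAT segment)
table = pd.DataFrame(
    [[30, 10, 60], [20, 15, 65], [50, 10, 40]],
    index=["Product Queries", "Returns", "Order Related"],
    columns=["Low", "Neutral", "High"],
)

chi2, p_val, dof, expected = stats.chi2_contingency(table)

# Cramér's V rescales chi-square to the range [0, 1]
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.2f}, p={p_val:.4f}, Cramér's V={cramers_v:.3f}")
```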

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

*There is no correlation between response time and CSAT score.*

**Null Hypothesis (H₀)**: There is no monotonic correlation between response time (response_time_mins) and CSAT score.

**Alternate Hypothesis (H₁)**: There is a monotonic correlation between response time and CSAT score.

#### 2. Perform an appropriate statistical test.

In [None]:
# Drop missing values

subset = df[['response_time_mins', 'CSAT Score']].dropna()

# Spearman Rank Correlation

corr, p_val = spearmanr(subset['response_time_mins'], subset['CSAT Score'])

print(f"Spearman Correlation: {corr:.2f}")
print(f"P-Value: {p_val:.4f}")


##### Which statistical test have you done to obtain P-Value?

Spearman Rank Correlation Test

##### Why did you choose the specific statistical test?

CSAT Score is an ordinal variable (1 to 5). Response time is numeric, but skewed with many outliers. We are interested in checking whether there is a monotonic (not necessarily linear) relationship. Spearman's rank correlation suits this case because it works on ranks, makes no normality assumption, and is less sensitive to outliers than Pearson's correlation.
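
The difference between a monotonic and a linear relationship can be seen in a small synthetic example: the data below is strictly decreasing but curved, so Spearman reports a perfect rank correlation while Pearson does not. The values are invented and unrelated to the dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, strictly decreasing but non-linear relationship
x = np.arange(1, 21, dtype=float)
y = 1.0 / x

rho, _ = spearmanr(x, y)   # rank-based: sensitive to monotonic trends
r, _ = pearsonr(x, y)      # linear: penalizes the curvature
print(f"Spearman rho={rho:.3f}, Pearson r={r:.3f}")
```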

## ***6. Feature Engineering & Data Pre-processing***

### 1. Categorical Encoding

In [None]:
# Select categorical features for encoding
categorical_features = ['channel_name', 'category', 'Sub-category', 'Agent Shift', 'Tenure Bucket', 'Product_category', 'Customer_City']
categorical_features = [col for col in categorical_features if col in df.columns]

# Select numerical features
numerical_features = ['Item_price', 'response_time_mins']
numerical_features = [col for col in numerical_features if col in df.columns]

# Encode target variable CSAT Score into 3 categories for better handling of imbalance
def categorize_csat(score):
    if score <= 2:
        return 'Low'
    elif score == 3:
        return 'Neutral'
    else:
        return 'High'
df['CSAT_Category'] = df['CSAT Score'].apply(categorize_csat)

# Prepare features and target
X = df[categorical_features + numerical_features].copy()  # copy so later in-place transforms do not modify df or trigger SettingWithCopyWarning
y = df['CSAT_Category']

print("Categorical features for encoding:", categorical_features)
print("Numerical features for scaling:", numerical_features)
print("Target variable categories:", y.unique())

#### What all categorical encoding techniques have you used & why did you use those techniques?

I chose OneHotEncoder because:

It handles nominal categorical features without implying any order.

It ensures machine learning models interpret categories correctly as separate entities.

It avoids data distortion and works well with algorithms like Logistic Regression, Random Forest, and XGBoost.

### 2. Data Transformation

In [None]:
# Log transformation for response_time_mins to handle skewness
if 'response_time_mins' in X.columns:
    X['response_time_mins'] = np.log1p(X['response_time_mins'])
    print("Applied log transformation to response_time_mins to reduce skewness.")

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, the data needed transformation due to skewness in the response_time_mins feature. I have used log transformation (np.log1p) to reduce skewness and normalize the distribution. This helps improve model performance by making the data more symmetric and less sensitive to extreme values.
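
The effect of `np.log1p` on a right-skewed variable can be checked numerically with a skewness statistic. The sketch below uses synthetic log-normal values as a stand-in for response times; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed "response times"; not drawn from the real data
raw = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
transformed = np.log1p(raw)  # log(1 + x), defined at x = 0

skew_before = skew(raw)
skew_after = skew(transformed)
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```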

### 3. Data Scaling

In [None]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ])

# Apply preprocessing
X_processed = preprocessor.fit_transform(X)

print("Data preprocessing completed. Shape after preprocessing:", X_processed.shape)

### 4. Dimensionality Reduction

In [None]:
# Apply PCA to reduce dimensionality
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_processed)

print(f"PCA reduced dimensions from {X_processed.shape[1]} to {X_pca.shape[1]} components")
print(f"Explained variance ratio: {sum(pca.explained_variance_ratio_):.2f}")

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction was needed to reduce the high number of features after encoding, which can lead to overfitting and increased computation time.


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I used PCA (Principal Component Analysis) to retain 95% of the variance while reducing dimensions.
PCA was chosen because it removes redundant features, improves model efficiency, and helps in handling multicollinearity.
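
Passing a float between 0 and 1 as `n_components` tells scikit-learn's PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below demonstrates this on synthetic data built from three hidden factors; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

pca_demo = PCA(n_components=0.95)  # float in (0, 1) = variance fraction to retain
X_reduced = pca_demo.fit_transform(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"{X_demo.shape[1]} features -> {pca_demo.n_components_} components")
print("cumulative explained variance:", cumulative.round(3))
```

One trade-off to keep in mind is that principal components are linear mixes of the encoded columns, so feature-level interpretability is lost after this step.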

### 5. Data Splitting

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42, stratify=y
)

print("Data splitting completed:")
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
print(f"Class distribution in training set:\n{y_train.value_counts(normalize=True)}")
print(f"Class distribution in testing set:\n{y_test.value_counts(normalize=True)}")

##### What data splitting ratio have you used and why?

I have used an 80:20 train-test split ratio, meaning 80% of the data was used for training and 20% for testing.
This is a standard and balanced choice that ensures the model has enough data to learn patterns while reserving sufficient data to evaluate its performance.
Stratification was also used to maintain the class distribution across both sets, which is important for imbalanced datasets like CSAT categories.
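
In the cells above, the scaler, encoder, and PCA are fitted on the full dataset before the split, so statistics from the test rows influence the transformed training data. One common alternative, sketched below under the assumption of a synthetic stand-in dataset, is to split first and wrap preprocessing and model in a single `Pipeline` so every transformer is fitted on training rows only. The column values and the choice of classifier here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the cleaned data; values are random, not real
demo = pd.DataFrame({
    "Agent Shift": rng.choice(["Morning", "Evening", "Night"], size=n),
    "Item_price": rng.uniform(100, 5000, size=n),
    "response_time_mins": rng.exponential(30, size=n),
})
target = pd.Series(rng.choice(["Low", "Neutral", "High"], size=n, p=[0.2, 0.1, 0.7]))

# Split first, so the test rows never influence the fitted transformers
X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.2, random_state=42, stratify=target
)

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["Item_price", "response_time_mins"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Agent Shift"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

model.fit(X_tr, y_tr)  # scaler and encoder are fitted on training rows only
print("test accuracy:", model.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `cross_val_score` or `GridSearchCV`, which then refit the preprocessing inside each fold.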

### 6. Handling Imbalanced Dataset

In [None]:
# Apply SMOTE to handle class imbalance
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("After SMOTE resampling:")
print(f"Training set shape: {X_train_resampled.shape}")
print(f"Class distribution after resampling:\n{y_train_resampled.value_counts(normalize=True)}")

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the dataset is imbalanced because some CSAT categories (like 'Low' or 'Neutral') occur much less frequently than others.
An imbalanced dataset can cause the model to be biased toward the majority class and perform poorly on minority classes.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

To handle this, SMOTE (Synthetic Minority Over-sampling Technique) is used, which generates synthetic samples for minority classes to balance the dataset and improve model performance.
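
The models in the next section also pass `class_weight='balanced'`, which is a separate, reweighting-based way to address imbalance. The sketch below shows what that option computes, using invented label counts. Note that once SMOTE has equalized the training classes, these weights come out close to 1, so the two techniques overlap rather than compound.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented label counts for illustration
y_demo = np.array(["High"] * 8 + ["Low"] * 1 + ["Neutral"] * 1)
classes = np.unique(y_demo)

# "balanced" weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = dict(zip(classes, weights))
print({name: round(float(w), 3) for name, w in class_weights.items()})
```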

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Initialize and train Logistic Regression model

log_reg = LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced')
log_reg.fit(X_train_resampled, y_train_resampled)

# Cross-validation

cv_scores = cross_val_score(log_reg, X_train_resampled, y_train_resampled, cv=5, scoring='f1_weighted')
print(f"Logistic Regression Cross-Validation F1 Score: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

# Hyperparameter tuning
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid_search = GridSearchCV(
    LogisticRegression(random_state=42, class_weight='balanced'),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid_search.fit(X_train_resampled, y_train_resampled)
best_log_reg = grid_search.best_estimator_
print(f"Best Logistic Regression parameters: {grid_search.best_params_}")
print(f"Best F1 Score: {grid_search.best_score_:.4f}")

# Evaluate on test set
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = best_log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Logistic Regression Test Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Low', 'Neutral', 'High'], yticklabels=['Low', 'Neutral', 'High'])
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### ML Model - 2

In [None]:
# ML Model - 2: Random Forest Classifier (optimized for speed)
# Initialize with 100 trees and train in parallel on all CPU cores
rf = RandomForestClassifier(
    n_estimators=100,  # same as the scikit-learn default
    random_state=42,
    class_weight='balanced',
    n_jobs=-1
)
rf.fit(X_train_resampled, y_train_resampled)

In [None]:
# Cross-validation with fewer folds
cv_scores = cross_val_score(rf, X_train_resampled, y_train_resampled, cv=3, scoring='f1_weighted')
print(f"Random Forest Cross-Validation F1 Score: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

In [None]:
# Hyperparameter tuning with RandomizedSearchCV (much faster)
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 150],  # Reduced options
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt']
}

random_search = RandomizedSearchCV(
    rf,
    param_dist,
    n_iter=10,  # Test only 10 random combinations
    cv=3,       # 3-fold instead of 5-fold
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search.fit(X_train_resampled, y_train_resampled)
best_rf = random_search.best_estimator_
print(f"Best Random Forest parameters: {random_search.best_params_}")
print(f"Best F1 Score: {random_search.best_score_:.4f}")

In [None]:
# Evaluate on test set
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Random Forest Test Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
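# Persist the tuned model for deployment
import joblib
joblib.dump(best_rf, 'random_forest_model.joblib')
print("✅ random_forest_model.joblib created successfully!")

# Download the saved file (assumes the notebook runs in Google Colab)
from google.colab import files
files.download('random_forest_model.joblib')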

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Low', 'Neutral', 'High'],
            yticklabels=['Low', 'Neutral', 'High'])
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Model performance was evaluated with accuracy, weighted precision, weighted recall, weighted F1 score, and the confusion matrix.
Precision helps reduce costly false positives, while recall minimizes missed important cases (false negatives).
The F1 score balances precision and recall, which is crucial for imbalanced, multi-class problems.
This approach aligns with business needs by capturing risks across all classes, especially critical ones like “High.”
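
To make the weighted metrics concrete, the sketch below derives them by hand from a confusion matrix. The matrix values are invented for illustration and are not the project's results.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm = np.array([
    [30,  5,  15],   # Low
    [ 4,  6,  10],   # Neutral
    [20, 10, 200],   # High
])

tp = np.diag(cm)                 # correct predictions per class
support = cm.sum(axis=1)         # number of actual samples per class
precision = tp / cm.sum(axis=0)  # TP / everything predicted as that class
recall = tp / support            # TP / everything that truly is that class
f1 = 2 * precision * recall / (precision + recall)

# "Weighted" averaging weights each class by its share of the actual samples
weights = support / support.sum()
weighted_precision = float(np.sum(weights * precision))
weighted_recall = float(np.sum(weights * recall))
weighted_f1 = float(np.sum(weights * f1))
accuracy = float(tp.sum() / cm.sum())

for name, p, r, f in zip(["Low", "Neutral", "High"], precision, recall, f1):
    print(f"{name:8s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"Weighted precision={weighted_precision:.3f}, recall={weighted_recall:.3f}, "
      f"F1={weighted_f1:.3f}, accuracy={accuracy:.3f}")
```

Note that weighted recall is always equal to accuracy, so the per-class rows are usually more informative than the weighted averages when minority classes matter.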

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Random Forest model because it captures non-linear relationships and feature interactions that Logistic Regression cannot.
It achieved higher F1 scores during testing, meaning its predictions were better balanced between precision and recall.
Random Forest combines many decision trees, which reduces the overfitting that a single tree is prone to.
With `class_weight='balanced'` and SMOTE-resampled training data, it also gives the minority classes more weight instead of favoring the majority class.
For these reasons, it is the best choice for making reliable business decisions.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Random Forest is an ensemble of decision trees that work together to make better predictions.
Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split, and the final decision is based on the majority vote.
This helps the model handle complex patterns and reduce errors.
The most important features can be identified from the model’s feature importance scores.
Tools like SHAP can also explain how each feature affects the model’s predictions in detail.
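
A minimal sketch of reading feature importance scores from a fitted Random Forest is shown below. It uses synthetic data from `make_classification` and placeholder feature names, so the ranking it prints says nothing about the real CSAT features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the real feature matrix (assumption)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

# Impurity-based importances: one non-negative score per feature, summing to 1
importances = model.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In the notebook, the same attribute would be read from `best_rf`, paired with the column names of the training features.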

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# LABMENTIX/
# ├── app.py
# ├── random_forest_model.joblib
# └── svd_transformer.joblib


### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
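import requests
from IPython.display import Image

url = 'https://github.com/Kamali-836/CSAT-Intel-Forecasting-Flipkart-Customer-Satisfaction/blob/main/Screenshot%202025-07-11%20015304.png'

# The URL points to GitHub's HTML page for the file; raw=true requests the image bytes instead
response = requests.get(url, params={'raw': 'true'})
response.raise_for_status()  # stop early if the download failed

with open('your_image.png', 'wb') as handler:
    handler.write(response.content)

# Display the file that was just saved
Image(filename='your_image.png')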
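
The sanity check described in this step can be sketched as follows: save a model with `joblib`, load it back, and confirm that the reloaded copy gives the same predictions on rows it has never seen. The example trains a small stand-in model on synthetic data and writes to a temporary directory; in the notebook, `best_rf` and real held-out rows would take their place.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the project data (assumption for this sketch)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42
)
X_seen, y_seen, X_unseen = X[:250], y[:250], X[250:]

model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(X_seen, y_seen)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_model.joblib")
    joblib.dump(model, path)      # save the fitted model
    loaded = joblib.load(path)    # load it back from disk

# The reloaded model should reproduce the original predictions exactly
original_pred = model.predict(X_unseen)
loaded_pred = loaded.predict(X_unseen)
same = bool(np.array_equal(original_pred, loaded_pred))
print("Predictions match after reload:", same)
```

Any preprocessing object saved alongside the model (such as the `svd_transformer.joblib` file listed above) would need to be loaded and applied to the unseen rows in the same way before calling `predict`.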


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project predicts CSAT scores using machine learning on Flipkart support data, so that customer satisfaction levels can be estimated even when survey responses are missing. Important features such as issue category, agent shift, and response time were used. One-hot encoding, log transformation, and PCA were applied during preprocessing, and SMOTE was used to handle class imbalance. The CSAT score was categorized into Low, Neutral, and High for actionable insights. Logistic Regression and Random Forest models were trained and tuned, then evaluated with accuracy, F1 score, and the confusion matrix. The final model is **Random Forest**, with an accuracy of approximately **0.72**. The model helps Flipkart prioritize customer support and retain unhappy users. Overall, the system supports customer satisfaction tracking and service quality at scale.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***