# **Project Name**    - Flipkart Customer Service Satisfaction



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**

The Flipkart Customer Service Satisfaction Classification Project analyzes customer support data to uncover key drivers of customer satisfaction, measured by CSAT scores. Because post-purchase service is vital in digital marketplaces, the project follows a complete pipeline (data cleaning, structured UBM-based visual analysis, hypothesis testing, and feature engineering) to explore the impact of factors such as channel type, agent performance, and tenure. It trains machine learning models (Logistic Regression and Random Forest) to predict satisfaction levels. The project concludes with actionable business insights, showing how optimizing factors like agent shifts and response times can improve customer experience, and ends with a recommended model for deployment.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**"To analyze and model Flipkart’s customer support data to identify the key drivers of customer satisfaction and provide actionable insights that can help the business improve CSAT scores, reduce churn, and enhance support service efficiency."**

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from scipy import stats
from scipy.stats import chi2_contingency, ttest_ind

### Dataset Loading

In [None]:
# Load the dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(uploaded['Customer_support_data (1).csv']))
print(df)

### Dataset First View

In [None]:
# Preview the first few records
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset row and column count
print(f"Total Rows: {df.shape[0]}")
print(f"Total Columns: {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Checking for duplicates
duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {duplicates}")


#### Missing Values/Null Values

In [None]:
# Missing values count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

### Insights about the dataset

- The dataset includes categorical and numerical fields such as `channel_name`, `category`, `Agent_name`, and `CSAT Score`.
- There are some missing values and no duplicate records.
- The structure suggests transactional support logs, suitable for classification modeling.
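The missing-value and duplicate checks above can be condensed into a per-column percentage summary. The sketch below uses a small made-up frame (`toy`, with illustrative values only) rather than the real dataset; the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the support data (values are illustrative only)
toy = pd.DataFrame({
    "channel_name": ["inbound", "outcall", "email", "inbound"],
    "Item_price": [499.0, np.nan, np.nan, 1299.0],
    "CSAT Score": [5, 4, 1, 5],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)

# Number of fully duplicated rows
n_duplicates = int(toy.duplicated().sum())

print(missing_pct.sort_values(ascending=False))
print("Duplicate rows:", n_duplicates)
```

A percentage view makes it easier to decide, column by column, between imputing and dropping.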

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all').T

### Variables Description

Below is a brief description of the key variables:

- `Unique id`: Unique identifier for each support request.
- `Issue_reported at`: Time when the support request was created.
- `channel_name`: The channel through which the support request was received (e.g., inbound, outcall, email).
- `category`: Category of the issue (e.g., order issues, return issues).
- `Agent_name`: Name of the customer support agent handling the ticket.
- `Supervisor`: Name of the agent’s supervisor.
- `customer_sentiment`: Customer's sentiment (positive, neutral, negative).
- `response_time`: Time taken by the agent to respond to the ticket (in minutes).
- `resolution_time`: Time taken to resolve the issue (in minutes).
- `first_contact_resolution`: Whether the issue was resolved in the first contact (Yes/No).
- `CSAT Score`: Customer Satisfaction Score (1 to 5); this is the target variable.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:


# Convert 'Issue_reported at' to datetime
df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], format="%d/%m/%Y %H:%M")

# Create derived time-based features using the correct timestamp column
df['day_of_week'] = df['Issue_reported at'].dt.day_name()
df['hour_of_day'] = df['Issue_reported at'].dt.hour

# Standardize column values (e.g., lowercase for consistency)
df['channel_name'] = df['channel_name'].str.lower()
df['category'] = df['category'].str.lower()

# Handle missing values in supervisor or agent columns (replace with 'Unknown')
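# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas
df['Agent_name'] = df['Agent_name'].fillna('Unknown')
df['Supervisor'] = df['Supervisor'].fillna('Unknown')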

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Confirm the cleaning
df.info()

### Manipulations done and insights found?

✔ Converted timestamps into datetime format for time-based analysis.

✔ Normalized categorical variables like `channel_name` and `category` to lowercase to ensure consistency.

✔ Created new time-based variables (`day_of_week`, `hour_of_day`) for time-series insights.

✔ Mapped `first_contact_resolution` from categorical to numerical for model readiness.

✔ Filled missing values in `Agent_name` and `Supervisor` with a placeholder ('Unknown') to avoid data loss.

✔ Removed duplicates to maintain data quality.

🎯 These transformations help us structure the dataset better and extract more business-friendly insights during visualization and modeling.
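As a small illustration of the time-based features, the sketch below parses two made-up day-first timestamps with the same explicit format used in the wrangling cell. Passing `format` matters here: a string such as `01/08/2023` could otherwise be read as 8 January instead of 1 August.

```python
import pandas as pd

# Two illustrative timestamps in day/month/year order (not real tickets)
toy = pd.DataFrame({"Issue_reported at": ["01/08/2023 11:13", "15/08/2023 23:05"]})

# An explicit format removes the day-first vs month-first ambiguity
ts = pd.to_datetime(toy["Issue_reported at"], format="%d/%m/%Y %H:%M")

toy["day_of_week"] = ts.dt.day_name()
toy["hour_of_day"] = ts.dt.hour
toy["month"] = ts.dt.month

print(toy)
```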

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

📊 Chart - 1: CSAT Score Distribution (Univariate)

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df['CSAT Score'], bins=5, kde=True)
plt.title('Distribution of Customer Satisfaction (CSAT) Scores')
plt.xlabel('CSAT Score')
plt.ylabel('Frequency')
plt.show()

**Reason to pick the specific chart:**
A histogram is ideal for observing the distribution of a numerical variable like CSAT score.

**Insight(s) found from the chart:**
Most customer satisfaction scores fall in the 3.5-5 range, indicating moderate satisfaction. Very low or very high ratings are less frequent.

**Potential of gained insights to help create a positive business impact:**
Understanding where the bulk of customers lie in terms of satisfaction helps target improvement areas (e.g., push more customers from CSAT 3 → 4+).

📊 Chart - 2:  Support Channel Usage (Univariate)

In [None]:

plt.figure(figsize=(7,5))
# Use the updated column name 'channel_name'
df['channel_name'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Support Channel Distribution')
plt.ylabel('Number of Tickets')
plt.xlabel('Support Channel')
plt.xticks(rotation=45)
plt.show()

**Reason to pick the specific chart:**
Bar plots are perfect for categorical frequency distributions.

**Insight(s) found from the chart:**
Inbound and outcall are the most used support channels, while email is least used.

**Potential of gained insights to help create a positive business impact:**
Resource allocation and agent training can be prioritized toward high-volume channels like Inbound and outcall.

📊 Chart - 3: Issue Category Distribution (Univariate)

In [None]:
plt.figure(figsize=(10,6))
# Use the updated column name 'category'
sns.countplot(data=df, y='category', order=df['category'].value_counts().index, palette='coolwarm')
plt.title('Top Issue Categories')
plt.xlabel('Count')
plt.ylabel('Issue Category')
plt.show()

**Reason to pick the specific chart:**
Countplot is ideal for ranked categorical comparisons.

**Insight(s) found from the chart:**
'order issues' and 'return issues' dominate customer concerns.

**Potential of gained insights to help create a positive business impact:**
Reducing issues in these categories can directly reduce ticket volume and improve satisfaction.

📊 Chart - 4: Average CSAT Score per Support Channel (Bivariate)

In [None]:
plt.figure(figsize=(20,20))
# Use the correct column names 'channel_name' and 'CSAT Score'
sns.barplot(data=df, x='channel_name', y='CSAT Score', ci=None)
plt.title('Avg CSAT Score by Support Channel')
plt.ylabel('Average CSAT Score')
plt.xlabel('Support Channel')
plt.show()

**Reason to pick the specific chart:**
A bar chart effectively compares means across categories.

**Insight(s) found from the chart:**
Phone support has the highest satisfaction; Email is slightly lower.

**Potential of gained insights to help create a positive business impact:**
Email agents may need quality training to match Phone performance.

📊 Chart - 5: Issue Category vs Avg CSAT (Bivariate)

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=df, y='category', x='CSAT Score', ci=None, order=df.groupby('category')['CSAT Score'].mean().sort_values().index)
plt.title('Average CSAT by Issue Category')
plt.xlabel('Average CSAT Score')
plt.ylabel('Issue Category')
plt.show()

**Reason to pick the specific chart:**
To identify which issue categories are associated with higher satisfaction.

**Insight(s) found from the chart:**
Technical and website related issues correlate with higher CSAT.

**Potential of gained insights to help create a positive business impact:**
Fixing technical processes may boost customer happiness.

📊 Chart 6: CSAT Score by Day of Week (Bivariate)

In [None]:
df['day_of_week'] = df['Issue_reported at'].dt.day_name()

plt.figure(figsize=(20,20))
sns.boxplot(data=df, x='day_of_week', y='CSAT Score', order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
plt.title('CSAT Scores by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('CSAT Score')
plt.show()

**Reason to pick the specific chart:**
To identify if certain weekdays exhibit consistently better or worse support experiences.

**Insight(s) found from the chart:**
No significant weekday pattern is visible in the CSAT scores.

**Potential of gained insights to help create a positive business impact:**
None identified from this chart.

📊 Chart 7: Pair Plot for Numeric Variables (Multivariate)

In [None]:
sns.pairplot(df[['CSAT Score', 'response_time', 'resolution_time']], diag_kind='kde')
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()

**Reason to pick the specific chart:**
To visually explore multivariate linear patterns or clustering among numerical variables.

**Insight(s) found from the chart:**
Reinforces negative trends between time metrics and CSAT scores.

**Potential of gained insights to help create a positive business impact:**
Confirms need to prioritize speed in handling customer tickets.

📊 Chart 8: CSAT by Agent (Bivariate)

In [None]:
top_agents = df['Agent_name'].value_counts().head(10).index
plt.figure(figsize=(10,5))
sns.barplot(data=df[df['Agent_name'].isin(top_agents)], x='Agent_name', y='CSAT Score', ci=None)
plt.xticks(rotation=45)
plt.title('Avg CSAT by Top 10 Agents')
plt.xlabel('Agent')
plt.ylabel('CSAT Score')
plt.show()

**Reason to pick the specific chart:**
To identify top and bottom-performing agents based on customer feedback.

**Insight(s) found from the chart:**
Significant variation in CSAT between agents, highlighting performance gaps.

**Potential of gained insights to help create a positive business impact:**
Support targeted training and recognition programs to improve agent-level service quality.

📊 Chart 9: Supervisor-wise CSAT (Bivariate)

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x='Supervisor', y='CSAT Score')
plt.xticks(rotation=45)
plt.title('CSAT Distribution by Supervisor')
plt.show()

**Reason to pick the specific chart:**
To assess the effect of supervisor leadership on overall team CSAT performance.

**Insight(s) found from the chart:**
Teams under certain supervisors perform better consistently.

**Potential of gained insights to help create a positive business impact:**
Train supervisors who manage lower-performing teams or replicate strategies from high performers.

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1
**Null Hypothesis (H₀):**
There is no significant difference in customer satisfaction (CSAT) scores across different support channels.
(Mean CSAT for Email = Chat = Phone = ...)

**Alternate Hypothesis (H₁):**
There is a significant difference in customer satisfaction (CSAT) scores for at least one of the support channels.

Statistical Test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Grouping CSAT by Support Channel
grouped = df.groupby('channel_name')['CSAT Score'].apply(list)

# Performing One-Way ANOVA
f_stat, p_value = stats.f_oneway(*grouped)

print("F-Statistic:", f_stat)
print("P-Value:", p_value)

Statistical Test Used: One-Way ANOVA

Why this test? Because we are comparing the mean CSAT of more than two groups.

Conclusion: If the P-value < 0.05, we reject the null hypothesis and conclude that CSAT scores differ significantly across support channels; otherwise we fail to reject it.
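The decision rule can be sketched on a few hand-made scores (the `groups` values below are illustrative, not taken from the dataset). Because CSAT is an ordinal 1 to 5 rating, the sketch also runs the rank-based Kruskal-Wallis test as a cross-check; this is an optional addition and not part of the original analysis.

```python
from scipy import stats

# Hand-made CSAT scores per channel (illustrative only)
groups = {
    "inbound": [5, 5, 4, 5, 4, 5],
    "outcall": [4, 5, 4, 4, 5, 4],
    "email":   [2, 1, 3, 2, 1, 2],
}

# One-way ANOVA compares the group means
f_stat, p_anova = stats.f_oneway(*groups.values())

# Kruskal-Wallis compares rank distributions and does not assume normality
h_stat, p_kruskal = stats.kruskal(*groups.values())

alpha = 0.05
decision = "reject H0" if p_anova < alpha else "fail to reject H0"

print(f"ANOVA p={p_anova:.4g}, Kruskal-Wallis p={p_kruskal:.4g} -> {decision}")
```

If the two tests disagree on real data, that is usually a sign that the normality or equal-variance assumptions behind ANOVA deserve a closer look.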

### Hypothetical Statement - 2

**Null Hypothesis (H₀):**
The average CSAT scores of teams led by different supervisors are equal.
(Supervisors do not affect customer satisfaction.)

**Alternate Hypothesis (H₁):**
At least one supervisor’s team has a significantly different CSAT average, indicating a supervisor effect.

Statistical Test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Grouping CSAT by Supervisor
grouped_sup = df.groupby('Supervisor')['CSAT Score'].apply(list)

# Performing One-Way ANOVA
f_stat, p_value = stats.f_oneway(*grouped_sup)

print("F-Statistic:", f_stat)
print("P-Value:", p_value)

Statistical Test Used: One-Way ANOVA

Why this test? Because we are comparing the mean CSAT of more than two groups.

Conclusion: If the P-value < 0.05, we reject the null hypothesis and conclude that at least one supervisor's team has a significantly different average CSAT; otherwise we fail to reject it.
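With many tickets, even very small differences between groups can produce a P-value below 0.05, so it may help to report an effect size next to the test. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares); `eta_squared` is a hypothetical helper written for this illustration, not a SciPy function.

```python
import numpy as np

def eta_squared(groups):
    """Share of total variance explained by group membership (0 to 1)."""
    all_values = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_values.mean()
    # Between-group sum of squares: group size times squared distance of group mean from grand mean
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    # Total sum of squares around the grand mean
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Illustrative scores for three supervisors' teams (not real data)
teams = [[5, 4, 5, 4], [4, 4, 5, 5], [3, 4, 4, 5]]
print("eta squared:", round(eta_squared(teams), 3))
```

On the real data this could be called as `eta_squared(list(grouped_sup))`, assuming `grouped_sup` from the cell above is still in scope.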

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Assign results back instead of using inplace=True on a selected column
df['Customer Remarks'] = df['Customer Remarks'].fillna("No Remarks")
df['Order_id'] = df['Order_id'].fillna("Unknown")
df['order_date_time'] = df['order_date_time'].fillna(df['order_date_time'].mode()[0])
df['Customer_City'] = df['Customer_City'].fillna("Unknown")
df['Product_category'] = df['Product_category'].fillna("Miscellaneous")
df['Item_price'] = df['Item_price'].fillna(df['Item_price'].median())
df['connected_handling_time'] = df['connected_handling_time'].fillna(df['connected_handling_time'].median())

#### Missing value imputation techniques used and reason to use those techniques?

**Mode Imputation:** Used for order_date_time as it likely has repeated values.

**Constant Imputation:** "Unknown" for Order_id and Customer_City, "Miscellaneous" for Product_category, and "No Remarks" for Customer Remarks to handle large text gaps.

**Median Imputation:** For Item_price and connected_handling_time to reduce the effect of outliers on imputation.
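The three strategies can be compared on a tiny made-up frame. The values below are illustrative; note how the single extreme price would pull a mean-based fill far above the typical value, while the median stays at 200.

```python
import numpy as np
import pandas as pd

# Illustrative rows with gaps (not real data)
toy = pd.DataFrame({
    "Item_price": [100.0, 200.0, np.nan, 10000.0],
    "Customer_City": ["Pune", None, "Delhi", "Pune"],
    "order_date_time": ["01/08/2023 10:00", "01/08/2023 10:00", None, "02/08/2023 09:30"],
})

mean_price = toy["Item_price"].mean()      # distorted by the 10000 outlier
median_price = toy["Item_price"].median()  # robust to the outlier

# Median for skewed numeric data, a constant for free-form categories, mode for repeated values
toy["Item_price"] = toy["Item_price"].fillna(median_price)
toy["Customer_City"] = toy["Customer_City"].fillna("Unknown")
toy["order_date_time"] = toy["order_date_time"].fillna(toy["order_date_time"].mode()[0])

print(f"mean={mean_price:.1f}, median={median_price:.1f}")
print(toy)
```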

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import numpy as np

# Capping for Item_price
q_low = df["Item_price"].quantile(0.01)
q_hi  = df["Item_price"].quantile(0.99)
df["Item_price"] = np.clip(df["Item_price"], q_low, q_hi)

##### Outlier treatment techniques used and reason to use those techniques?

Quantile-based Capping: Used for Item_price to handle extreme outliers while preserving the bulk of the distribution. Clipping at 1st and 99th percentiles.
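A minimal sketch of the capping step on made-up prices, using `Series.clip` (equivalent to `np.clip` for this purpose). It shows that capping keeps every row and leaves the middle of the distribution untouched; only the tails are pulled in.

```python
import numpy as np
import pandas as pd

# Illustrative prices: 1..99 plus one extreme value (not real data)
prices = pd.Series(np.append(np.arange(1, 100), 1_000_000), dtype=float)

# 1st and 99th percentiles act as the lower and upper caps
q_low, q_hi = prices.quantile([0.01, 0.99])
capped = prices.clip(lower=q_low, upper=q_hi)

print("max before:", prices.max(), "max after:", capped.max())
print("rows before:", len(prices), "rows after:", len(capped))
```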



### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

label_cols = ['channel_name', 'category', 'Sub-category', 'Customer_City',
              'Product_category', 'Agent_name', 'Supervisor', 'Manager',
              'Tenure Bucket', 'Agent Shift']
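# Keep one fitted encoder per column so the integer codes can be mapped back to labels later
encoders = {}
for col in label_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])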

#### Categorical encoding techniques used & reason to use those techniques?

**Label Encoding:** Used for model readiness and to maintain interpretability in tree-based models.

**Avoided One-Hot Encoding:** Due to high cardinality of some columns (like Agent names), which could cause dimensionality explosion.
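The dimensionality trade-off can be seen on a handful of made-up agent names: label encoding always yields one column, while one-hot encoding yields one column per distinct value. One caveat worth keeping in mind is that integer codes imply an ordering, which tree models tolerate but linear models such as Logistic Regression may misread.

```python
import pandas as pd

# Illustrative agent names (not real data)
agents = pd.Series(["A", "B", "C", "A", "D"], name="Agent_name")

# Label encoding: a single integer column
label_codes = agents.astype("category").cat.codes

# One-hot encoding: one column per distinct agent
one_hot = pd.get_dummies(agents)

print("label-encoded columns:", 1)
print("one-hot columns:", one_hot.shape[1])
```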

### 4. Textual Data Preprocessing (Customer Remarks)
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
!pip install contractions
import contractions
df['Customer Remarks'] = df['Customer Remarks'].apply(lambda x: contractions.fix(x))

#### 2. Lower Casing

In [None]:
# Lower Casing
df['Customer Remarks'] = df['Customer Remarks'].str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
df['Customer Remarks'] = df['Customer Remarks'].str.translate(str.maketrans('', '', string.punctuation))

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs and any words that contain digits
import re
df['Customer Remarks'] = df['Customer Remarks'].apply(lambda x: re.sub(r'http\S+|www\S+|\w*\d\w*', '', x))

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

df['Customer Remarks'] = df['Customer Remarks'].apply(lambda x: " ".join([word for word in x.split() if word not in stop]))
df['Customer Remarks'] = df['Customer Remarks'].apply(lambda x: x.strip())

#### 7. Tokenization

In [None]:
# Tokenization
import nltk
try:
    nltk.data.find('tokenizers/punkt') # Check for the main punkt resource
except LookupError:
    nltk.download('punkt') # Download if not found

try:
    nltk.data.find('tokenizers/punkt_tab') # Check for the specific punkt_tab resource
except LookupError:
    nltk.download('punkt_tab') # Download if not found (as indicated by the error)


from nltk.tokenize import word_tokenize
df['tokens'] = df['Customer Remarks'].apply(word_tokenize)

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer
import nltk # Ensure nltk is imported here as well for the download

# Download the 'wordnet' resource if not already present
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

# Download the 'omw-1.4' resource which is often needed by WordNetLemmatizer
try:
    nltk.data.find('corpora/omw-1.4')
except LookupError:
    nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
df['tokens'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

##### Text normalization technique used and reason

**Lemmatization:** Chosen over stemming for better grammatical accuracy and readability of tokens.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk
# Download the necessary POS tagger resource, explicitly requesting the English version
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    nltk.download('averaged_perceptron_tagger') # This might download the general resource

# Explicitly download the English version as suggested by the traceback
try:
    nltk.data.find('taggers/averaged_perceptron_tagger_eng')
except LookupError:
    nltk.download('averaged_perceptron_tagger_eng')


df['pos_tags'] = df['tokens'].apply(nltk.pos_tag)

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=300)
tfidf_matrix = tfidf.fit_transform(df['Customer Remarks'])

##### Text vectorization technique used and reason

**TF-IDF:** Captures word importance while down-weighting common words. Suitable for short texts like customer remarks.
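The TF-IDF matrix only influences a model if it is joined to the other features. The sketch below shows one possible way to do that with `scipy.sparse.hstack`, on made-up remarks and a made-up numeric column; it is an illustration of the idea and not a step the notebook performs.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative cleaned remarks and one numeric feature per ticket (not real data)
remarks = ["good service", "late delivery bad service", "refund delayed", "good quick refund"]
numeric = np.array([[5.0], [120.0], [45.0], [3.0]])

tfidf = TfidfVectorizer(max_features=300)
text_matrix = tfidf.fit_transform(remarks)  # sparse matrix: tickets x vocabulary terms

# Stack numeric and text features side by side, keeping the result sparse
combined = hstack([csr_matrix(numeric), text_matrix]).tocsr()

print("vocabulary size:", len(tfidf.vocabulary_))
print("combined shape:", combined.shape)
```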

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Convert timestamp columns
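df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], errors='coerce')  # already datetime after wrangling
# Assumes 'issue_responded' uses the same day-first format as 'Issue_reported at'; without it, day-first dates may be coerced to NaT
df['issue_responded'] = pd.to_datetime(df['issue_responded'], format="%d/%m/%Y %H:%M", errors='coerce')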
df['response_time_mins'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / 60

#### 2. Feature Selection

In [None]:
# Feature Selection
from sklearn.ensemble import RandomForestClassifier
import pandas as pd # Ensure pandas is imported if not already in the scope

# List columns to drop
columns_to_drop = [
    'CSAT Score', # Target variable
    'Customer Remarks', # Original text column
    'tokens', # Tokenized text
    'pos_tags', # POS tagged tokens
    'Unique id', # Identifier
    'timestamp', # Original timestamp
    'Issue_reported at', # Datetime object
    'issue_responded', # Datetime object
    'order_date_time', # Original order date/time (could contain strings)
    'Order_id', # Identifier (could contain strings)
    'Customer_City', # Already label encoded
    'Agent_name', # Already label encoded
    'Supervisor', # Already label encoded
    'Manager', # Already label encoded
    'Tenure Bucket', # Already label encoded
    'Agent Shift', # Already label encoded
    'Survey_response_Date', # Non-numeric date column
    'day_of_week', # Non-numeric day name
    # Add any other columns that are not intended to be numerical features
]

# Filter out columns that don't exist in the DataFrame to avoid errors
columns_to_drop = [col for col in columns_to_drop if col in df.columns]

# Check remaining dtypes before dropping
print("Data types before dropping:", df.dtypes)


X = df.drop(columns=columns_to_drop)
y = df['CSAT Score']

# Print columns in X to verify all are numerical before fitting
print("\nColumns in X after dropping:", X.columns)
print("\nData types in X after dropping:\n", X.dtypes)

# Verify all columns in X are numerical before fitting
if not all(pd.api.types.is_numeric_dtype(X[col]) for col in X.columns):
    non_numeric_cols = [col for col in X.columns if not pd.api.types.is_numeric_dtype(X[col])]
    raise ValueError(f"Non-numeric columns found in X: {non_numeric_cols}")


model = RandomForestClassifier(random_state=42)  # fixed seed so the importance ranking is reproducible
model.fit(X, y)
important_features = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
important_features.head(10)

##### Feature selection methods used  and reason

Tree-based Feature Importance: Used RandomForest to rank features based on predictive power.

##### Features found to be important and reason

Features like category, Sub-category, Agent_name, and response_time_mins showed high importance due to their direct link with CSAT performance.
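To show what the ranking measures, the sketch below fits a forest on synthetic data where only one column (`signal`, a made-up name) determines the label. Impurity-based importances sum to 1, and the informative column should come out on top. These importances can favour high-cardinality features, so they are best read as a rough guide.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features: one informative column and three pure-noise columns
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(400, 4)),
    columns=["signal", "noise_1", "noise_2", "noise_3"],
)
y_demo = (X_demo["signal"] > 0).astype(int)  # label depends only on 'signal'

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Rank features by impurity-based importance
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```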

### 5. Data Transformation

In [None]:
# Transform Your data
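# Apply log transformation to response time; clip negatives to 0 so log1p stays defined
df['response_time_mins'] = np.log1p(df['response_time_mins'].clip(lower=0))
# X was built before this step, so copy the transformed column into the feature matrix as well
X['response_time_mins'] = df['response_time_mins']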

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(X)

##### Method used to scale data and reason

**StandardScaler:** Chosen to normalize numeric features to mean=0, std=1, which helps in model convergence (especially for distance-based models).

### 7. Dimensionality Reduction

##### Is there a need for dimensionality reduction and reason.

No, since most features are either categorical or engineered, and feature count is already manageable (~20 columns).

### 8. Data Splitting

In [None]:
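from sklearn.model_selection import train_test_split

# Split the unscaled features first, then fit the scaler on the training rows only to avoid test-set leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)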

##### Data Splitting ratio used and reason

An 80:20 split is used to ensure enough data for both training and testing; it is stratified to maintain the CSAT score distribution in both sets.
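The effect of `stratify` can be checked on synthetic labels with a deliberately skewed class mix (70% score 5, 10% score 4, 20% score 1; made-up proportions). The sketch also fits the scaler on the training rows only, a common way to keep test-set statistics out of preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and an imbalanced 3-class target (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(1000, 3))
y_demo = np.array([5] * 700 + [4] * 100 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit scaling parameters on the training rows, then reuse them for the test rows
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

share_train = np.mean(y_tr == 5)
share_test = np.mean(y_te == 5)
print(f"share of score 5: train={share_train:.2f}, test={share_test:.2f}")
```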

## ***7. ML Model Implementation***

ML Model - 1: Logistic Regression

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Fit the Algorithm
lr = LogisticRegression(max_iter=1000)
# Use the already defined X_train and y_train from the data splitting section
lr.fit(X_train, y_train)

# Predict
y_pred_lr = lr.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred_lr))
print("Accuracy:", accuracy_score(y_test, y_pred_lr))

Model Used:

Logistic Regression was selected for baseline modeling because:

*   It’s interpretable and works well on binary or multiclass classification problems.
*   Fast training and gives us a performance benchmark.
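When one CSAT score dominates, accuracy alone can look strong even for a model that learns nothing. A majority-class baseline gives a floor to compare against. The sketch below uses made-up labels (70% score 5) and scikit-learn's `DummyClassifier`; it is an optional sanity check, not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Illustrative imbalanced target: 70 fives, 10 fours, 20 ones
y_demo = np.array([5] * 70 + [4] * 10 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

acc = accuracy_score(y_demo, pred)
bal_acc = balanced_accuracy_score(y_demo, pred)  # mean per-class recall

print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

A trained model is only informative to the extent that it beats this floor, which is why the per-class rows of the classification report deserve as much attention as the overall accuracy.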


ML Model - 2: Random Forest Classifier

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and fit
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred_rf))
print("Accuracy:", accuracy_score(y_test, y_pred_rf))

Why Random Forest?

*   Needs no feature scaling and copes with label-encoded categorical variables better than linear models.
*   Ensemble method, reduces variance.
*   Gives feature importance insights.

BEST MODEL - LOGISTIC REGRESSION


CONCLUSION

This project successfully analyzed Flipkart's customer support dataset to identify key factors influencing customer satisfaction (CSAT). Through comprehensive data cleaning, exploratory data analysis, and statistical hypothesis testing, we gained valuable insights into customer support operations.

The data wrangling process effectively handled missing values, outliers, and categorical variables, preparing the dataset for modeling. Text preprocessing on customer remarks provided additional textual features, although their impact on the final models was not explicitly evaluated as part of the reported results.

Several machine learning models, including Logistic Regression and Random Forest, were implemented and evaluated. Based on the provided classification reports and accuracy scores, the Logistic Regression model demonstrated comparable or slightly better performance on the test set compared to the Random Forest model in this instance. This suggests that a simpler, more interpretable model like Logistic Regression can be effective for this dataset and problem.

Key drivers of CSAT identified through the analysis and feature importance included category, Sub-category, Agent_name, and response_time_mins. These findings align with the business intuition that the nature of the issue, the agent handling it, and the speed of resolution are crucial to customer satisfaction.

Hypothesis testing indicated statistically significant differences in CSAT scores across support channels, and possibly across supervisors, highlighting areas for targeted improvement and training.

Overall Business Impact:

The insights gained from this project can empower Flipkart to make data-driven decisions to enhance customer support:

- Targeted Agent Training: Focus training on agents and supervisors whose teams exhibit lower CSAT scores, emphasizing best practices observed in high-performing teams.
- Process Optimization: Prioritize efforts to reduce response times and resolution times, particularly for high-volume and low-CSAT categories.
- Channel Strategy: Evaluate the performance of different support channels and consider optimizing resources and strategies based on their impact on CSAT.
- Proactive Issue Resolution: Address the root causes of common issue categories like "order issues" and "return issues" to reduce overall ticket volume and improve customer experience.

While Logistic Regression served as a solid baseline and performed well, further exploration with other models and hyperparameter tuning could potentially yield even better results. However, the current findings provide actionable insights that Flipkart can leverage to improve customer satisfaction and potentially reduce churn.
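As a starting point for the tuning mentioned above, a cross-validated grid search could look like the sketch below. It runs on synthetic data from `make_classification`; the parameter grid and the `f1_macro` scoring choice are assumptions for illustration, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared feature matrix and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Small illustrative grid; real searches would usually cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="f1_macro",  # treats all classes equally, useful when classes are imbalanced
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best CV score:", round(search.best_score_, 3))
```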