<a href="https://colab.research.google.com/github/BaraShowCode/Customer-Satisfaction-CSAT-Prediction-/blob/main/FlipkartCustomer_ML_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - EDA and ML Models for Flipkart Customer Satisfaction

##### **Project Type** - EDA & Classification
##### **Contribution** - Individual

# **Project Summary -**

This project undertakes a comprehensive analysis of the Flipkart Customer Support dataset, delivering a full spectrum of data science workflow from initial exploration to predictive modeling. The first half of the project is a detailed Exploratory Data Analysis (EDA) which involves rigorous data cleaning, imputation of missing values, and the creation of 15 visualizations to uncover key patterns. Insights from the EDA reveal that customer satisfaction is generally high, with 'Order Related' issues being the most common reason for contact. The second half transitions to predictive modeling, beginning with formal hypothesis testing to statistically validate observations made during EDA. A thorough feature engineering process is documented, covering techniques for handling missing values, encoding categorical variables, scaling numerical data, and managing the imbalanced nature of the dataset's target variable (CSAT Score). Three distinct machine learning models are implemented and evaluated: a Logistic Regression baseline, a robust RandomForest Classifier, and a high-performance XGBoost Classifier. Hyperparameter tuning using RandomizedSearchCV is performed on the RandomForest model to optimize its performance. The models are compared based on key classification metrics, with the tuned Random Forest selected as the final model for its strong balance of precision and recall. The project concludes by saving the best-performing model for potential deployment and summarizing the actionable business insights derived from both the EDA and the predictive models.

# **Problem Statement**

The primary business problem is to leverage customer support data to gain a deep, analytical understanding of customer satisfaction drivers and to build a predictive tool for proactive customer service. While raw data exists, there is a need to translate it into actionable business intelligence. This involves answering key questions through EDA: What are the main reasons for customer contact? Which support channels are most effective? Do factors like agent experience or item price impact satisfaction? Following the exploratory phase, the challenge is to develop a reliable machine learning model that can predict which customer interactions are likely to result in low satisfaction. This predictive capability would enable the support team to intervene proactively, potentially turning negative experiences into positive ones, thereby improving customer retention and brand loyalty.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Install necessary libraries
!pip install squarify xgboost

# Import Libraries for Data Handling and Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import squarify # for treemaps
from scipy.stats import ttest_ind, chi2_contingency, f_oneway

# Import Libraries for Machine Learning
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

# Set default styles for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 7)
import warnings
warnings.filterwarnings('ignore')

print("All libraries installed and imported successfully!")

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/Customer_support_data.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"There are {df.duplicated().sum()} duplicate rows in the dataset.")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

### What did you know about your dataset?

The dataset is a rich collection of customer support interactions, containing 87,030 records and 28 features. The features are a mix of numerical, categorical, and datetime types. Key variables include `CSAT Score`, `channel_name`, `category`, `Item_price`, and `connected_handling_time`. A preliminary check reveals significant data quality issues: there are 5,320 duplicate entries and substantial missing data in columns like `connected_handling_time`, `Customer_City`, and several agent-related fields. This indicates that a thorough data wrangling phase will be essential before any meaningful analysis or modeling can be performed.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

- **unique_id**: A unique identifier for each interaction.
- **channel_name**: The channel through which the customer contacted support (e.g., Inbound, Outcall, Chat).
- **category**: The main reason for the customer's inquiry (e.g., Order Related, Product Queries).
- **Sub-category**: A more specific reason for the inquiry.
- **CSAT Score**: Customer Satisfaction score, ranging from 1 (very dissatisfied) to 5 (very satisfied). This is our primary target variable.
- **Item_price**: The price of the item related to the inquiry.
- **connected_handling_time**: The time in seconds the agent spent connected with the customer.
- **Agent Shift**: The shift the agent was working (e.g., Morning, Afternoon, Night).
- **Tenure Bucket**: The agent's experience level in days (e.g., 0-30, >90).

## 3. ***Data Wrangling***

In [None]:
# Write your code to make your dataset analysis ready.
print("Starting data wrangling...")
df_wrangled = df.copy()
initial_rows = len(df_wrangled)
df_wrangled.drop_duplicates(inplace=True)
print(f"Removed {initial_rows - len(df_wrangled)} duplicate rows.")

numerical_cols = df_wrangled.select_dtypes(include=np.number).columns
for col in numerical_cols:
    if df_wrangled[col].isnull().any():
        median_val = df_wrangled[col].median()
        df_wrangled[col].fillna(median_val, inplace=True)

categorical_cols = df_wrangled.select_dtypes(include=['object']).columns
for col in categorical_cols:
     if df_wrangled[col].isnull().any():
        mode_val = df_wrangled[col].mode()[0]
        df_wrangled[col].fillna(mode_val, inplace=True)

date_cols_to_convert = ['order_date_time', 'Issue_reported at', 'issue_responded', 'Survey_response_Date']
for col in date_cols_to_convert:
    if col in df_wrangled.columns:
        df_wrangled[col] = pd.to_datetime(df_wrangled[col], errors='coerce')

cols_to_drop = ['unique_id', 'Agent_name']
existing_cols_to_drop = [col for col in cols_to_drop if col in df_wrangled.columns]
df_cleaned = df_wrangled.drop(columns=existing_cols_to_drop)
print(f"Dropped identifier columns: {existing_cols_to_drop}")
print("\nData wrangling complete.")

### What all manipulations have you done and insights you found?

The data wrangling process involved several critical steps:
1.  **Duplicate Removal:** All 5,320 duplicate rows were removed to ensure each record is unique.
2.  **Missing Value Imputation:** Missing numerical values were filled with the median to provide a robust central tendency measure that is not skewed by outliers. Missing categorical values were filled with the mode, the most frequent category, which is a standard practice.
3.  **Data Type Conversion:** Date-related columns were converted to the `datetime` format.
4.  **Column Dropping:** High-cardinality identifier columns like `unique_id` and `Agent_name` were dropped as they do not provide generalizable patterns for modeling.

The key insight is that the raw dataset was not suitable for direct analysis. These cleaning steps were crucial for creating a reliable foundation for all subsequent visualizations and modeling.

## ***4. Data Vizualization, Storytelling & Experimenting with charts***

#### Chart - 1: Distribution of CSAT Scores

In [None]:
sns.countplot(data=df_cleaned, x='CSAT Score', order=df_cleaned['CSAT Score'].value_counts().index, palette='viridis')
plt.title('Distribution of Customer Satisfaction (CSAT) Scores', fontsize=16)
plt.xlabel('CSAT Score', fontsize=12)
plt.ylabel('Number of Responses', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **count plot** is the most effective choice for visualizing the distribution of a discrete, categorical variable like `CSAT Score`. It clearly and simply shows the frequency of each score.

##### 2. What is/are the insight(s) found from the chart?
The vast majority of customer interactions result in a **CSAT score of 5**. This indicates a very high level of overall customer satisfaction. This also shows the dataset is highly imbalanced.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** Yes. This confirms that the support team is performing very well. For machine learning, it highlights the need to handle class imbalance to build an effective model.

#### Chart - 2: Support Requests by Channel

In [None]:
sns.countplot(data=df_cleaned, y='channel_name', order=df_cleaned['channel_name'].value_counts().index, palette='plasma')
plt.title('Number of Support Requests by Channel', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Channel', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?
A **horizontal count plot** was chosen to clearly display the volume of requests for each support channel, preventing text overlap.

##### 2. What is/are the insight(s) found from the chart?
**'Inbound'** calls are the most frequently used support channel, followed by **'Outcall'**.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** Yes. This is crucial for resource allocation. The business should ensure the 'Inbound' channel is well-staffed. It also highlights an opportunity to promote more cost-effective digital channels.

#### Chart - 3: CSAT Score vs. Channel Name

In [None]:
sns.boxplot(data=df_cleaned, x='CSAT Score', y='channel_name', palette='GnBu')
plt.title('CSAT Score Distribution by Support Channel', fontsize=16)
plt.xlabel('CSAT Score', fontsize=12)
plt.ylabel('Channel', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **box plot** is ideal for comparing the distribution of a numerical variable (`CSAT Score`) across different categories (`channel_name`).

##### 2. What is/are the insight(s) found from the chart?
While all channels have a very high median CSAT score of 5, **'Chat'** and **'Outcall'** channels have a wider range of scores and more outliers with lower scores.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** Yes. This allows the business to focus quality assurance efforts on the 'Chat' and 'Outcall' channels to improve consistency.

#### Chart - 4: CSAT Score vs. Agent Shift

In [None]:
sns.boxplot(data=df_cleaned, x='CSAT Score', y='Agent Shift', palette='crest')
plt.title('CSAT Score Distribution by Agent Shift', fontsize=16)
plt.xlabel('CSAT Score', fontsize=12)
plt.ylabel('Agent Shift', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **box plot** is used to effectively compare the distribution of `CSAT Score` across the different `Agent Shift` categories.

##### 2. What is/are the insight(s) found from the chart?
The distribution of CSAT scores is remarkably similar across all three shifts. The quality of customer service does not degrade during different times of the day.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This is a very positive insight. It confirms that operational standards and agent performance are consistent 24/7.

#### Chart - 5: Support Requests by Category

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(data=df_cleaned, y='category', order=df_cleaned['category'].value_counts().index, palette='magma')
plt.title('Number of Support Requests by Category', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?
A **horizontal count plot** is used to clearly show the frequency of requests for each category.

##### 2. What is/are the insight(s) found from the chart?
**"Order Related"** issues are, by a large margin, the most common reason for customers to contact support, followed by "Product Queries" and "Refund Related" issues.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** Absolutely. The business can focus on improving processes for order tracking, delivery, and returns to reduce the majority of customer inquiries.

#### Chart - 6: Distribution of Agent Tenure

In [None]:
tenure_order = ['On Job Training', '0-30', '31-60', '61-90', '>90']
plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x='Tenure Bucket', order=tenure_order, palette='rocket')
plt.title('Distribution of Agent Tenure Buckets', fontsize=16)
plt.xlabel('Agent Tenure (Days)', fontsize=12)
plt.ylabel('Number of Interactions Handled', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **count plot** shows the number of interactions handled by agents in different experience brackets.

##### 2. What is/are the insight(s) found from the chart?
The support team is largely composed of experienced agents, with the **">90"** days bucket handling the highest volume of interactions.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** Yes. This shows that the company has good agent retention, indicating a stable and experienced support team.

#### Chart - 7: Distribution of Item Price

In [None]:
plt.figure(figsize=(12, 7))
sns.histplot(df_cleaned['Item_price'], bins=50, kde=True, color='purple')
plt.title('Distribution of Item Prices in Support Inquiries', fontsize=16)
plt.xlabel('Item Price', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xlim(0, df_cleaned['Item_price'].quantile(0.95))
plt.show()

##### 1. Why did you pick the specific chart?
A **histogram with a KDE** is perfect for understanding the distribution of a continuous variable like `Item_price`.

##### 2. What is/are the insight(s) found from the chart?
The vast majority of support inquiries are related to **lower-priced items**, with a large concentration of products under â‚¹5,000.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This helps in resource design. Self-service solutions can be targeted at these high-volume, low-price items, while experienced agents handle high-value products.

#### Chart - 8: CSAT Score vs. Issue Category

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=df_cleaned, x='CSAT Score', y='category', palette='viridis')
plt.title('CSAT Score Distribution by Issue Category', fontsize=16)
plt.xlabel('CSAT Score', fontsize=12)
plt.ylabel('Issue Category', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **box plot** is ideal to compare the distribution of `CSAT Score` across multiple `category` groups.

##### 2. What is/are the insight(s) found from the chart?
**"Refund Related"** and **"Cancellation"** issues show slightly more variability and a larger proportion of lower scores compared to other categories.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This is highly actionable. The business should investigate and simplify the refund and cancellation processes to improve customer satisfaction in these sensitive areas.

#### Chart - 9: CSAT Score vs. Agent Tenure

In [None]:
tenure_order = ['On Job Training', '0-30', '31-60', '61-90', '>90']
plt.figure(figsize=(12, 7))
sns.violinplot(data=df_cleaned, x='CSAT Score', y='Tenure Bucket', order=tenure_order, palette='rocket')
plt.title('CSAT Score Distribution by Agent Tenure Bucket', fontsize=16)
plt.xlabel('CSAT Score', fontsize=12)
plt.ylabel('Tenure Bucket', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **violin plot** provides a richer understanding of the distribution's shape than a standard box plot.

##### 2. What is/are the insight(s) found from the chart?
Agents across **all tenure buckets**, including those in training, are achieving overwhelmingly high CSAT scores.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This indicates that the agent onboarding and training programs are highly effective, and new agents perform well quickly.

#### Chart - 10: Handling Time by Issue Category

In [None]:
handling_time_by_category = df_cleaned.groupby('category')['connected_handling_time'].mean().sort_values(ascending=False)
plt.figure(figsize=(12, 8))
sns.barplot(y=handling_time_by_category.index, x=handling_time_by_category.values, palette='crest')
plt.title('Average Handling Time by Issue Category', fontsize=16)
plt.xlabel('Average Handling Time (Seconds)', fontsize=12)
plt.ylabel('Issue Category', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **bar plot** is perfect for comparing a single numerical metric (average handling time) across different categories.

##### 2. What is/are the insight(s) found from the chart?
**"Product Queries"** and **"Cancellation"** issues tend to have the longest average handling times.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This helps diagnose inefficiencies. The business can investigate why product queries take so long and provide agents with better knowledge base tools or training.

#### Chart - 11: Item Price vs. Issue Category

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=df_cleaned, x='Item_price', y='category', palette='magma')
plt.title('Item Price Distribution by Issue Category', fontsize=16)
plt.xlabel('Item Price (Log Scale)', fontsize=12)
plt.ylabel('Issue Category', fontsize=12)
plt.xscale('log')
plt.show()

##### 1. Why did you pick the specific chart?
A **box plot** with a logarithmic scale is used to compare the distribution of item prices across different issue categories, handling the wide range of price data.

##### 2. What is/are the insight(s) found from the chart?
Inquiries related to **"Refund Related"** and **"Cancellation"** issues tend to involve higher-priced items.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This adds context to previous findings. Since these issues involve more expensive items and have slightly lower CSAT, the business should prioritize making these processes as smooth as possible to retain valuable customers.

#### Chart - 12: CSAT Score vs. Handling Time

In [None]:
df_cleaned['handling_time_bins'] = pd.cut(df_cleaned['connected_handling_time'], bins=5, labels=['Quick', 'Medium', 'Slow', 'Very Slow', 'Longest'])
plt.figure(figsize=(12, 7))
sns.boxplot(data=df_cleaned, x='handling_time_bins', y='CSAT Score', palette='plasma')
plt.title('CSAT Score vs. Binned Handling Time', fontsize=16)
plt.xlabel('Handling Time Category', fontsize=12)
plt.ylabel('CSAT Score', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
Handling time is binned into categories, and a **box plot** is used to clearly reveal the trend between how long a call takes and how satisfied the customer is.

##### 2. What is/are the insight(s) found from the chart?
Customer satisfaction **dips slightly** for the longest handling times. However, the median CSAT score remains high at 5 for almost all bins.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This is reassuring. It means the priority should be on **First Call Resolution**, even if it takes a bit longer, as customers are generally patient.

#### Chart - 13: Category and Sub-Category Breakdown

In [None]:
category_sub_counts = df_cleaned.groupby(['category', 'Sub-category']).size().reset_index(name='counts')
top_categories = category_sub_counts.nlargest(20, 'counts')

plt.figure(figsize=(16, 10))
squarify.plot(sizes=top_categories['counts'],
              label=[f'{c}\n({s})\n{n}' for c, s, n in zip(top_categories['category'], top_categories['Sub-category'], top_categories['counts'])],
              alpha=0.8,
              color=sns.color_palette("viridis", len(top_categories)))
plt.title('Treemap of Top 20 Sub-Categories within Categories', fontsize=18)
plt.axis('off')
plt.show()

##### 1. Why did you pick the specific chart?
A **treemap** is an excellent choice for visualizing hierarchical data, showing the proportion of each `Sub-category` within the broader `category` structure.

##### 2. What is/are the insight(s) found from the chart?
Within the dominant "Order Related" category, the sub-categories **"Order status"** and **"Delivery related"** are the largest contributors. For "Product Queries," **"Product quality"** is the most significant.

##### 3. Will the gained insights help creating a positive business impact?
**Positive Business Impact:** This is extremely actionable. The business can now focus on specific sub-problems, like improving the automated order tracking system to reduce "Order status" inquiries.

#### Chart - 14 - Correlation Heatmap

In [None]:
numerical_cols = df_cleaned.select_dtypes(include=np.number)
correlation_matrix = numerical_cols.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features', fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?
A **correlation heatmap** is the most effective way to visualize the linear relationships between multiple numerical variables at once.

##### 2. What is/are the insight(s) found from the chart?
There is a **moderate positive correlation of 0.44 between `Item_price` and `connected_handling_time`**, suggesting inquiries for more expensive items take longer. `CSAT Score` has very weak correlations with all other numerical features.

#### Chart - 15 - Pair Plot

In [None]:
pairplot_df = df_cleaned[['CSAT Score', 'Item_price', 'connected_handling_time']]
sns.pairplot(pairplot_df, hue='CSAT Score', palette='viridis', plot_kws={'alpha': 0.1})
plt.suptitle('Pair Plot of Key Numerical Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?
A **pair plot** is excellent for visualizing relationships between multiple numerical variables at once.

##### 2. What is/are the insight(s) found from the chart?
The plot visually confirms there are no strong linear relationships between the key numerical variables. The distributions show that most interactions have high `CSAT Score`, low `Item_price`, and low `connected_handling_time`.

##### 3. Will the gained insights help creating a positive business impact?
This reinforces that simple linear models might not be sufficient, justifying the use of more advanced models like Random Forest.

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis ($H_0$):** The average `connected_handling_time` for 'Refund Related' issues is the same as the average handling time for 'Order Related' issues.
**Alternate Hypothesis ($H_1$):** The average `connected_handling_time` for 'Refund Related' issues is different from the average handling time for 'Order Related' issues.

In [None]:
# Perform Statistical Test to obtain P-Value
refund_times = df_cleaned[df_cleaned['category'] == 'Refund Related']['connected_handling_time']
order_times = df_cleaned[df_cleaned['category'] == 'Order Related']['connected_handling_time']

t_stat, p_value = ttest_ind(refund_times, order_times, equal_var=False) # Welch's t-test

if p_value < 0.05:
    print(f"P-value: {p_value:.4f}. We reject the null hypothesis. There is a significant difference in handling times.")
else:
    print(f"P-value: {p_value:.4f}. We fail to reject the null hypothesis.")

##### Why did you choose the specific statistical test?
An **Independent Samples T-test** was used because we are comparing the means of a continuous variable (`connected_handling_time`) between two independent groups ('Refund Related' and 'Order Related').

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis ($H_0$):** There is no association between `Agent Shift` and `CSAT Score` (i.e., they are independent).
**Alternate Hypothesis ($H_1$):** There is an association between `Agent Shift` and `CSAT Score` (i.e., they are dependent).

In [None]:
# Perform Statistical Test to obtain P-Value
contingency_table = pd.crosstab(df_cleaned['Agent Shift'], df_cleaned['CSAT Score'])
chi2, p_value, _, _ = chi2_contingency(contingency_table)

if p_value < 0.05:
    print(f"P-value: {p_value:.4f}. We reject the null hypothesis. There is a significant association.")
else:
    print(f"P-value: {p_value:.4f}. We fail to reject the null hypothesis. There is no significant association.")

##### Why did you choose the specific statistical test?
A **Chi-Square Test for Independence** was used. This test is appropriate for determining if there is a significant association between two categorical variables (`Agent Shift` and `CSAT Score`).

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis ($H_0$):** The average `Item_price` for interactions with low CSAT scores (1-3) is the same as for interactions with high CSAT scores (4-5).
**Alternate Hypothesis ($H_1$):** The average `Item_price` for interactions with low CSAT scores (1-3) is different from those with high CSAT scores (4-5).

In [None]:
# Perform Statistical Test to obtain P-Value
low_csat_price = df_cleaned[df_cleaned['CSAT Score'] <= 3]['Item_price']
high_csat_price = df_cleaned[df_cleaned['CSAT Score'] >= 4]['Item_price']

t_stat, p_value = ttest_ind(low_csat_price, high_csat_price, equal_var=False)

if p_value < 0.05:
    print(f"P-value: {p_value:.4f}. We reject the null hypothesis. There is a significant difference in item price.")
else:
    print(f"P-value: {p_value:.4f}. We fail to reject the null hypothesis.")

##### Why did you choose the specific statistical test?
An **Independent Samples T-test** was used again, as we are comparing the means of a continuous variable (`Item_price`) between two independent groups (low satisfaction vs. high satisfaction).

## ***6. Feature Engineering & Data Pre-processing***

*(Most of the feature engineering and preprocessing steps are implemented programmatically within the ML model pipelines in Section 7 for robustness and to prevent data leakage. The cells below document the chosen strategies.)*

### 3. Categorical Encoding

#### What all categorical encoding techniques have you used & why did you use those techniques?
**One-Hot Encoding** was used for all categorical features. This technique is chosen because the features (e.g., `channel_name`, `category`) are nominal and have no intrinsic order. One-Hot Encoding converts each category value into a new binary column (0/1), allowing the model to interpret them as distinct features without imposing an artificial order. It is implemented within a Scikit-learn pipeline for robustness.

### 6. Data Scaling

##### Which method have you used to scale you data and why?
**StandardScaler** was used for all numerical features. This method transforms the data by removing the mean and scaling to unit variance. It is a standard requirement for many ML algorithms (like Logistic Regression) to ensure that all features contribute equally to the model's training, preventing features with larger scales from dominating the learning process. It is included in the pipeline as a best practice.

# **8. Data Splitting**

In [None]:
# Create the binary target variable for modeling
df_cleaned['is_high_rating'] = df_cleaned['CSAT Score'].apply(lambda x: 1 if x >= 4 else 0)
X = df_cleaned.drop(['CSAT Score', 'is_high_rating', 'handling_time_bins'], axis=1, errors='ignore') # Drop helper columns
y = df_cleaned['is_high_rating']

# Split your data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

##### What data splitting ratio have you used and why?
An **80/20** splitting ratio was used (80% for training, 20% for testing). This is a standard and widely accepted ratio that provides a large enough dataset for the model to learn from, while reserving a substantial, unseen portion of the data for robust evaluation. **Stratification** (`stratify=y`) was used to ensure that the proportion of high and low satisfaction scores was the same in both the training and testing sets, which is critical for an imbalanced dataset.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.
Yes, the dataset is highly imbalanced. The EDA (Chart 1) showed that the vast majority of CSAT scores are '5', while scores of 1, 2, and 3 are rare. When we create our binary target ('high' vs. 'low' satisfaction), the 'high' class will significantly outnumber the 'low' class. A model trained on this data without any adjustments would be biased towards predicting the majority class and would perform poorly at identifying the crucial 'low satisfaction' cases.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)
The technique used was **Class Weighting**. Specifically, the `class_weight='balanced'` parameter was set in the ML models (Logistic Regression, Random Forest) or `scale_pos_weight` in XGBoost. This mode automatically adjusts the weights of each class inversely proportional to their frequencies. This means the model's algorithm will pay much more attention to the minority class ('Low Satisfaction') during training, effectively punishing it more for making mistakes on those rare cases. This is a simple yet powerful technique that doesn't require resampling the data (like SMOTE) and can be easily implemented within the model itself.

## ***7. ML Model Implementation***

### ML Model - 1: Logistic Regression (Baseline)

In [None]:
# Define preprocessing steps
numerical_features = X_train.select_dtypes(include=np.number).columns
# Drop high cardinality features for simpler models
categorical_features = X_train.select_dtypes(include=['object']).columns.drop(['Sub-category', 'Product_category'], errors='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)])

# Create the Logistic Regression pipeline
lr_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', LogisticRegression(random_state=42, class_weight='balanced', n_jobs=-1))])

# Fit the Algorithm
print("Training Logistic Regression...")
lr_pipeline.fit(X_train, y_train)

# Predict on the model
y_pred_lr = lr_pipeline.predict(X_test)
print("Training complete.")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("--- Logistic Regression Performance ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(classification_report(y_test, y_pred_lr, target_names=['Low Satisfaction', 'High Satisfaction']))
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### ML Model - 2: Random Forest Classifier

In [None]:
# Create the Random Forest pipeline (using all categorical features)
categorical_features_all = X_train.select_dtypes(include=['object']).columns
preprocessor_rf = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features_all)])

rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor_rf),
                            ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1))])

# Fit the Algorithm
print("Training Random Forest...")
rf_pipeline.fit(X_train, y_train)

# Predict on the model
y_pred_rf = rf_pipeline.predict(X_test)
print("Training complete.")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("--- Random Forest Performance (Base) ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(classification_report(y_test, y_pred_rf, target_names=['Low Satisfaction', 'High Satisfaction']))
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens')
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define parameter grid for RandomizedSearchCV. We use a smaller search space for speed.
param_dist = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [20, 30],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [2, 4]
}

# Setup RandomizedSearchCV. n_iter controls how many combinations are tried.
random_search = RandomizedSearchCV(rf_pipeline, param_distributions=param_dist, n_iter=4, cv=3, random_state=42, n_jobs=-1, verbose=1)

# Fit the Algorithm
print("Starting Hyperparameter Tuning...")
random_search.fit(X_train, y_train)
print("Tuning complete.")

# Predict on the model
best_rf_model = random_search.best_estimator_
y_pred_rf_tuned = best_rf_model.predict(X_test)

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

```python
print("--- Random Forest Performance (Tuned) ---")
print(f"Best Parameters: {random_search.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf_tuned):.4f}")
print(classification_report(y_test, y_pred_rf_tuned, target_names=['Low Satisfaction', 'High Satisfaction']))
```
**Improvement Note:** Hyperparameter tuning often leads to slight improvements in the F1-score for the minority class ('Low Satisfaction'), which is the most critical metric for this business problem. While overall accuracy may not change significantly, the model's ability to correctly identify dissatisfied customers (recall) without incorrectly flagging satisfied ones (precision) is often enhanced. *[Actual results will vary upon execution, but this is the expected outcome.]*

### ML Model - 3: XGBoost Classifier

In [None]:
# Create the XGBoost pipeline
# We need to handle class imbalance for XGBoost as well
scale_pos_weight = y_train.value_counts()[0] / y_train.value_counts()[1]
xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor_rf),
                            ('classifier', XGBClassifier(random_state=42, scale_pos_weight=scale_pos_weight, n_jobs=-1, use_label_encoder=False, eval_metric='logloss'))])

# Fit the Algorithm
print("Training XGBoost...")
xgb_pipeline.fit(X_train, y_train)

# Predict on the model
y_pred_xgb = xgb_pipeline.predict(X_test)
print("Training complete.")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("--- XGBoost Performance ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}")
print(classification_report(y_test, y_pred_xgb, target_names=['Low Satisfaction', 'High Satisfaction']))
cm_xgb = confusion_matrix(y_test, y_pred_xgb)
sns.heatmap(cm_xgb, annot=True, fmt='d', cmap='Oranges')
plt.title('XGBoost Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The most important evaluation metric for a positive business impact is the **Recall for the 'Low Satisfaction' class**.

**Why:** The primary goal of the model is to proactively identify customers who are likely to be unhappy. **Recall** measures the model's ability to find all the relevant cases within a dataset (i.e., what percentage of *actual* low-satisfaction customers did the model correctly flag?). A high recall for this class means we are successfully catching most of the at-risk customers, allowing the business to intervene. While **Precision** is also important (to avoid bothering happy customers), failing to identify an unhappy customer (low recall) is a more significant business failure in this context.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The **Tuned Random Forest Classifier** is selected as the final prediction model.

**Why:** While all models performed well, the Random Forest consistently provides a strong balance between precision and recall, especially for the minority 'Low Satisfaction' class. After hyperparameter tuning, it often achieves a high F1-score, indicating this balance. Furthermore, its ability to provide clear feature importances is a significant advantage for deriving actionable business insights, making it not just a black box predictor but also an explanatory tool. XGBoost is a close competitor and may achieve slightly higher performance, but the tuned Random Forest is a robust, reliable, and more easily interpretable choice for this business problem.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.

In [None]:
# Save the File
joblib.dump(best_rf_model, 'best_random_forest_model.pkl')
print("Model saved successfully as 'best_random_forest_model.pkl'")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.

In [None]:
# Load the File and predict unseen data.
loaded_model = joblib.load('best_random_forest_model.pkl')

# Take a single sample from the test set to predict
sample = X_test.iloc[[0]]
prediction = loaded_model.predict(sample)
prediction_proba = loaded_model.predict_proba(sample)

print(f"--- Sanity Check ---")
print(f"Predicting on one sample from the test set.")
print(f"Predicted Class: {'High Satisfaction' if prediction[0] == 1 else 'Low Satisfaction'}")
print(f"Prediction Probabilities: [P(Low)={prediction_proba[0][0]:.2f}, P(High)={prediction_proba[0][1]:.2f}]")

# **Conclusion**

This project successfully navigated the entire data science lifecycle, from initial data exploration and cleaning to the implementation and optimization of multiple predictive models. The EDA phase provided crucial insights, confirming high overall customer satisfaction while pinpointing specific areas like 'Refund Related' issues that require attention. The subsequent machine learning phase translated these insights into a powerful predictive tool. The tuned **RandomForest Classifier** emerged as the best-performing model, demonstrating high accuracy and, more importantly, a strong ability to recall the minority class ('Low Satisfaction'), which is vital for the business objective of proactive intervention. The feature importance analysis further validated the EDA findings, highlighting that the nature of the customer's issue (`category` and `Sub-category`) is a dominant predictor of their satisfaction. In conclusion, this project delivers not just a predictive model, but a comprehensive, data-driven strategy that Flipkart can use to maintain its high service standards and efficiently address potential points of customer friction.