<a href="https://colab.research.google.com/github/Kesanisaicharan/-Generative-Adversarial-Network/blob/main/Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Flipkart Customer Service Satisfaction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** KESANI. SAI CHARAN
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

Project Summary -

This project focuses on analyzing Flipkart e-commerce data to gain valuable insights into customer behavior, product trends, and pricing patterns. With the rapid growth of online shopping platforms, understanding customer preferences and product performance has become essential for improving business strategies and enhancing user experience. The main objective of this project is to explore Flipkart product data and apply data analytics and machine learning techniques to extract meaningful information and build predictive models that support better decision-making.

The project begins with data collection and preprocessing, where raw product data is cleaned by handling missing values, removing duplicate entries, and transforming categorical attributes into numerical formats suitable for analysis. This ensures data consistency and improves model accuracy. After preprocessing, exploratory data analysis (EDA) is performed to understand product distributions across categories, pricing variations, discount trends, and customer rating patterns. Visualizations such as bar charts, histograms, and scatter plots are used to identify relationships between variables like price, discount, and ratings, helping uncover hidden trends in customer purchasing behavior.

Following data analysis, feature engineering techniques are applied to select important attributes that influence product pricing and customer engagement. A machine learning model is then developed to predict product prices or customer ratings based on selected features such as category, brand, discount percentage, and user ratings. Algorithms such as Linear Regression or Logistic Regression are used due to their simplicity, interpretability, and effectiveness for structured data. The dataset is divided into training and testing sets to evaluate the model’s performance objectively.
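
The train/test division mentioned above can be sketched without any library. This is only an illustration of the idea; the notebook itself imports scikit-learn's `train_test_split` for this job. The function name `split_indices`, the 20% test share, and the seed are assumptions made for the example.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed gives the same split on every run
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train, test)

train_idx, test_idx = split_indices(10, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

Keeping the test rows completely separate from the training rows is what makes the later evaluation an estimate of performance on unseen data.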

The trained model is evaluated using standard performance metrics such as accuracy, mean absolute error, root mean square error, and R² score, depending on whether the task is classification or regression. These metrics help measure how well the model generalizes to unseen data. The project demonstrates that machine learning can effectively capture pricing trends and customer behavior patterns within Flipkart’s product ecosystem.
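
To make these metrics concrete, the sketch below computes them from scratch on made-up labels and values. The helper names (`toy_accuracy`, `toy_mae`, `toy_rmse`, `toy_r2`) and the numbers are illustrative assumptions, not results from this project.

```python
import math

def toy_accuracy(y_true, y_pred):
    """Share of predictions that exactly match the true labels (classification)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def toy_mae(y_true, y_pred):
    """Mean absolute error (regression)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_rmse(y_true, y_pred):
    """Root mean square error (regression)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def toy_r2(y_true, y_pred):
    """R² score: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

labels_true, labels_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
values_true, values_pred = [3.0, 5.0, 2.0, 7.0], [2.5, 5.0, 3.0, 6.5]

print(toy_accuracy(labels_true, labels_pred))  # 0.8
print(toy_mae(values_true, values_pred))       # 0.5
print(toy_rmse(values_true, values_pred))
print(toy_r2(values_true, values_pred))
```

Accuracy applies to a classification target, while MAE, RMSE, and R² apply to a numeric target.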

The final stage of the project involves result interpretation and deployment readiness. The findings from data analysis and model predictions can assist businesses in optimizing product pricing strategies, improving inventory management, and enhancing customer satisfaction. Additionally, the project provides a foundation for building recommendation systems, sentiment analysis tools, or real-time dashboards for e-commerce platforms.

In conclusion, this Flipkart data analysis project successfully demonstrates how data science techniques can be applied to real-world e-commerce datasets to derive actionable insights and build predictive systems. It highlights the importance of data-driven decision-making in today’s competitive digital marketplace. Future enhancements may include integrating real-time product data scraping, applying deep learning models for better prediction accuracy, and deploying the solution as a web-based application for business users. Overall, this project offers a practical and scalable approach to understanding and improving e-commerce operations through analytics and machine learning.

# **GitHub Link -**

Provide your GitHub Link here.



# **Problem Statement**

In today’s competitive e-commerce environment, platforms like Flipkart handle large volumes of customer interactions and service requests. Managing customer issues efficiently while maintaining high satisfaction levels is a major challenge. Raw customer support data alone does not provide clear insights into customer behavior, service performance, or potential dissatisfaction risks. Therefore, there is a need for a data-driven system that can analyze customer support records and predict customer satisfaction outcomes. This project aims to apply machine learning techniques to customer support data to identify key factors influencing satisfaction and improve service quality and operational efficiency.

GOAL - The main goal of this project is to analyze customer support data and build a machine learning model that predicts customer satisfaction based on ticket attributes such as issue type, support channel, response time, and resolution status. This helps organizations improve customer service performance, reduce response times, enhance customer experience, and make informed business decisions using data-driven insights.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart is followed by answers in the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following points are answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
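# Load Dataset
# In Colab, upload the CSV interactively; elsewhere, place it next to the notebook.
try:
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    pass  # Not running in Colab: the file is read from the working directory

data = pd.read_csv("Customer_support_data.csv")
data.head()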

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
data.isnull().sum().plot(kind='bar', figsize=(8,4))
plt.title("Missing Values Count per Column")
plt.ylabel("Number of Missing Values")
plt.xlabel("Columns")
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

df = data.copy()

# Handling missing values

# Drop columns that are almost entirely empty (roughly 80% missing or more)
# connected_handling_time: 85665 missing out of 85907 (99.7% missing)
# order_date_time: 68693 missing out of 85907 (about 80% missing)
df.drop(columns=['connected_handling_time', 'order_date_time'], inplace=True)

# Impute numerical column Item_price with its median.
# Assign the result back: fillna(inplace=True) on a selected column triggers a
# chained-assignment warning in recent pandas and is not guaranteed to update df.
df['Item_price'] = df['Item_price'].fillna(df['Item_price'].median())

# Impute categorical/object columns with explicit placeholder labels
# Customer Remarks: 57165 missing. Fill with 'No Remarks'
df['Customer Remarks'] = df['Customer Remarks'].fillna('No Remarks')
# Order_id: 18232 missing. Fill with 'Unknown'
df['Order_id'] = df['Order_id'].fillna('Unknown')
# Customer_City: 68828 missing. Fill with 'Unknown'
df['Customer_City'] = df['Customer_City'].fillna('Unknown')
# Product_category: 68711 missing. Fill with 'Unknown'
df['Product_category'] = df['Product_category'].fillna('Unknown')

# Convert date columns to datetime objects for easier manipulation
df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], format='%d/%m/%Y %H:%M')
df['issue_responded'] = pd.to_datetime(df['issue_responded'], format='%d/%m/%Y %H:%M')
# Survey_response_Date format is different, handle with errors='coerce'
df['Survey_response_Date'] = pd.to_datetime(df['Survey_response_Date'], errors='coerce')

# Display the info after wrangling to show changes
print("DataFrame Info after Data Wrangling:")
df.info()
print("\nMissing values count after wrangling:")
print(df.isnull().sum())
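
The drop-or-impute decision in the cell above rests on the share of missing values per column. A minimal sketch of that check follows, assuming a cut-off of 80%; the helper names `missing_share` and `columns_to_drop` and the toy frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_share(frame):
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

def columns_to_drop(frame, threshold=0.8):
    """List columns whose missing share is at or above the (assumed) threshold."""
    share = missing_share(frame)
    return share[share >= threshold].index.tolist()

# Toy data: one column 90% empty, one 50% empty, one complete
toy = pd.DataFrame({
    "mostly_empty": [None] * 9 + [1.0],
    "half_empty": [1.0, None] * 5,
    "complete": list(range(10)),
})

print(missing_share(toy))
print(columns_to_drop(toy))
```

On the real data the same two calls would be run on `data`; columns just under or just over the cut-off deserve a manual look rather than a purely mechanical rule.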

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visuals
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10,6)

# Count tickets by category
category_counts = df['category'].value_counts()

# Plot
sns.barplot(x=category_counts.index, y=category_counts.values, palette='viridis')

# Add labels and title
plt.title('Ticket Count by Category', fontsize=16)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Number of Tickets', fontsize=12)
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visuals
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12,6)

# Count tickets by sub-category and select top 10
subcat_counts = df['Sub-category'].value_counts().head(10)

# Plot horizontal bar chart
sns.barplot(x=subcat_counts.values, y=subcat_counts.index, palette='magma')

# Add labels and title
plt.title('Top 10 Sub-Categories by Ticket Count', fontsize=16)
plt.xlabel('Number of Tickets', fontsize=12)
plt.ylabel('Sub-Category', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visuals
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12,6)

# Count tickets by city and select top 10
city_counts = df['Customer_City'].value_counts().head(10)

# Plot horizontal bar chart
sns.barplot(x=city_counts.values, y=city_counts.index, palette='coolwarm')

# Add labels and title
plt.title('Top 10 Cities by Ticket Volume', fontsize=16)
plt.xlabel('Number of Tickets', fontsize=12)
plt.ylabel('Customer City', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate response time in days
df['response_time'] = df['issue_responded'] - df['Issue_reported at']
df['response_time_days'] = df['response_time'].dt.total_seconds() / (3600 * 24)

# Set style for better visuals
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Boxplot of response time
sns.boxplot(x=df['response_time_days'], color='skyblue')

# Add labels and title
plt.title('Distribution of Response Time (Days)', fontsize=16)
plt.xlabel('Response Time (Days)', fontsize=12)
plt.show()
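
The response-time calculation above can be checked by hand on a single ticket. The sketch below uses only the standard library and the same timestamp format string that the wrangling step passes to `pd.to_datetime`; the two example tickets are invented.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"  # day/month/year hour:minute

def response_time_in_days(reported_at, responded_at):
    """Elapsed time between report and response, expressed in days."""
    reported = datetime.strptime(reported_at, TIMESTAMP_FORMAT)
    responded = datetime.strptime(responded_at, TIMESTAMP_FORMAT)
    return (responded - reported).total_seconds() / (3600 * 24)

print(response_time_in_days("01/08/2023 22:00", "02/08/2023 10:00"))  # 0.5 (12 hours)
print(response_time_in_days("01/08/2023 11:13", "01/08/2023 11:47"))  # 34 minutes, a small fraction of a day
```

A negative result would indicate a response timestamp earlier than the report timestamp, which is worth checking for before plotting.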

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10,6)

# Count CSAT scores
csat_counts = df['CSAT Score'].value_counts().sort_index()

# Plot
sns.barplot(x=csat_counts.index, y=csat_counts.values, palette='Set2')

# Labels and title
plt.title('Distribution of CSAT Scores', fontsize=16)
plt.xlabel('CSAT Score', fontsize=12)
plt.ylabel('Number of Responses', fontsize=12)
plt.show()
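
Because the CSAT score is the quantity this project aims to predict, the bar heights above also describe class balance. A small sketch of turning counts into shares follows; the scores listed are invented, and `class_shares` is a hypothetical helper.

```python
from collections import Counter

def class_shares(scores):
    """Return each distinct score's share of all responses, ordered by score."""
    counts = Counter(scores)
    total = sum(counts.values())
    return {score: counts[score] / total for score in sorted(counts)}

toy_scores = [5, 5, 5, 4, 1, 5, 3, 5, 1, 5]
shares = class_shares(toy_scores)
print(shares)
```

If one score dominates, plain accuracy can look high even for a model that always predicts that score, so the shares are a useful baseline to keep in mind.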

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10,6)

# Count tickets by shift
shift_counts = df['Agent Shift'].value_counts()

# Plot
sns.barplot(x=shift_counts.index, y=shift_counts.values, palette='pastel')

# Labels and title
plt.title('Ticket Volume by Agent Shift', fontsize=16)
plt.xlabel('Agent Shift', fontsize=12)
plt.ylabel('Number of Tickets', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10,6)

# Boxplot: CSAT vs Response Time
sns.boxplot(x='CSAT Score', y='response_time_days', data=df, palette='Set3')

# Labels and title
plt.title('CSAT Score vs Response Time', fontsize=16)
plt.xlabel('CSAT Score', fontsize=12)
plt.ylabel('Response Time (Days)', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12,6)

# Calculate average CSAT per agent and select top 10
agent_csat = df.groupby('Agent_name')['CSAT Score'].mean().sort_values(ascending=False).head(10)

# Plot
sns.barplot(x=agent_csat.values, y=agent_csat.index, palette='viridis')

# Labels and title
plt.title('Top 10 Agents by Average CSAT Score', fontsize=16)
plt.xlabel('Average CSAT Score', fontsize=12)
plt.ylabel('Agent Name', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12,6)

# Calculate average CSAT per category
category_csat = df.groupby('category')['CSAT Score'].mean().sort_values(ascending=False)

# Plot
sns.barplot(x=category_csat.index, y=category_csat.values, palette='coolwarm')

# Labels and title
plt.title('Average CSAT Score by Category', fontsize=16)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Average CSAT Score', fontsize=12)
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:

# Chart - 10 visualization code
import matplotlib.pyplot as plt

# Get top 6 product categories
product_counts = df['Product_category'].value_counts().head(6)

# Plot pie chart
plt.figure(figsize=(8,8))
plt.pie(product_counts.values, labels=product_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Ticket Distribution by Product Category')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Top 10 product categories
product_counts = df['Product_category'].value_counts().head(10).sort_values()

# Set style
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))

# Lollipop chart
plt.hlines(y=product_counts.index, xmin=0, xmax=product_counts.values, color='skyblue', alpha=0.7, linewidth=3)
plt.plot(product_counts.values, product_counts.index, "o", markersize=10, color='orange')

# Labels and title
plt.title('Top 10 Product Categories by Ticket Volume', fontsize=16)
plt.xlabel('Number of Tickets', fontsize=12)
plt.ylabel('Product Category', fontsize=12)
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import matplotlib.pyplot as plt

# Calculate average response time and ticket count per category
category_stats = df.groupby('category').agg({'response_time_days':'mean', 'category':'count'})
category_stats.rename(columns={'category':'ticket_count'}, inplace=True)
category_stats = category_stats.sort_values('response_time_days')

# Scatter/Bubble plot
plt.figure(figsize=(12,6))
plt.scatter(category_stats['response_time_days'], category_stats.index,
            s=category_stats['ticket_count']*0.5,  # scale bubble size
            color='skyblue', alpha=0.7, edgecolor='black')

# Labels and title
plt.title('Average Response Time by Category (Bubble Size = Ticket Count)', fontsize=16)
plt.xlabel('Average Response Time (Days)', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# CSAT Score by product category
plt.figure(figsize=(12,8)) # fixes the size of the visualization
# plot the boxplot
sns.boxplot(x='Product_category', y = 'CSAT Score', data=df, palette='Set1')
# title and labels
plt.title('CSAT Score by Product Category')
plt.xlabel('Product category')
plt.ylabel('CSAT Score')
plt.xticks(rotation=90) # For better readability
# show the plot
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select numeric columns for correlation
numeric_cols = ['Item_price', 'response_time_days', 'CSAT Score']

# Compute correlation matrix
corr_matrix = df[numeric_cols].corr()

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Variables', fontsize=16)
plt.show()
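
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below spells out that formula in plain Python on invented numbers; `pearson_r` is a name chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented example: slower responses paired with lower scores
toy_response_days = [0.1, 0.5, 1.0, 2.0, 3.0]
toy_csat = [5, 4, 4, 2, 1]
print(round(pearson_r(toy_response_days, toy_csat), 3))
```

The coefficient only captures linear association, so a value near zero does not rule out a non-linear relationship.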

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select numeric columns for pair plot
numeric_cols = ['Item_price', 'response_time_days', 'CSAT Score']

# Plot pair plot
# `fill` replaces the deprecated `shade` keyword of seaborn's KDE plots
sns.pairplot(df[numeric_cols], kind='scatter', diag_kind='kde', plot_kws={'alpha':0.6, 's':50, 'edgecolor':'k'}, diag_kws={'fill':True})
plt.suptitle('Pair Plot of Numeric Variables', fontsize=16, y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-value
import pandas as pd
from scipy.stats import pearsonr

# Calculate ticket count and average CSAT per supervisor
supervisor_stats = df.groupby('Supervisor').agg(
    ticket_count=('Supervisor', 'count'),
    avg_csat=('CSAT Score', 'mean')
).reset_index()

# Perform Pearson correlation
corr_coeff, p_value = pearsonr(supervisor_stats['ticket_count'], supervisor_stats['avg_csat'])

print("Correlation coefficient:", corr_coeff)
print("P-value:", p_value)
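
The printed p-value still has to be compared with a significance level to reach a conclusion. A minimal sketch of that decision rule follows, assuming the conventional level of 0.05 (the notebook does not state one); `interpret_p_value` is a hypothetical helper.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Translate a p-value into a reject / fail-to-reject statement at level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(interpret_p_value(0.003))  # reject the null hypothesis
print(interpret_p_value(0.27))   # fail to reject the null hypothesis
```

Failing to reject is not the same as proving the null hypothesis; it only means the data do not provide enough evidence against it at the chosen level.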

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Recompute response_time_days so this cell also runs on its own
df['response_time_days'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / (3600*24)

# Drop rows with missing values in CSAT or response time
df_test = df.dropna(subset=['CSAT Score', 'response_time_days'])

# Perform Pearson correlation
corr_coeff, p_value = pearsonr(df_test['response_time_days'], df_test['CSAT Score'])

print("Correlation coefficient:", corr_coeff)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Calculate response time in days if not already done
df['response_time_days'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / (3600*24)

# Drop missing values
df_test = df.dropna(subset=['response_time_days', 'Product_category'])

# Separate Electronics and Other categories
electronics = df_test[df_test['Product_category'] == 'Electronics']['response_time_days']
others = df_test[df_test['Product_category'] != 'Electronics']['response_time_days']

# Perform independent t-test
t_stat, p_value = ttest_ind(electronics, others, equal_var=False)  # Welch's t-test

print("T-statistic:", t_stat)
print("P-value:", p_value)
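
For reference, the statistic that `ttest_ind(..., equal_var=False)` reports can be written out directly. The sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom with the standard library on two invented samples; it does not compute the p-value, and `welch_t` is a name chosen for this example.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard errors, using sample variances (n - 1 in the denominator)
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat_toy, dof_toy = welch_t(group_a, group_b)
print(t_stat_toy, dof_toy)
```

Unlike the pooled-variance t-test, this form does not assume the two groups share a variance, which is why it suits groups of very different sizes.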

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Note: restarting from `data` discards the datetime conversions and
# response_time_days created in the wrangling and chart cells above.
df = data.copy()

# Drop columns with very high missing values
df.drop(columns=['connected_handling_time', 'order_date_time'], inplace=True)

# Fill missing numerical values with median
# (assign the result back instead of calling fillna(inplace=True) on a selected column)
df['Item_price'] = df['Item_price'].fillna(df['Item_price'].median())

# Fill missing categorical values
df['Customer Remarks'] = df['Customer Remarks'].fillna('No Remarks')
df['Order_id'] = df['Order_id'].fillna('Unknown')
df['Customer_City'] = df['Customer_City'].fillna('Unknown')
df['Product_category'] = df['Product_category'].fillna('Unknown')

# Check remaining missing values
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Define the cap_outliers function
def cap_outliers(series, lower_bound_factor=1.5, upper_bound_factor=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (lower_bound_factor * IQR)
    upper_bound = Q3 + (upper_bound_factor * IQR)
    return series.clip(lower=lower_bound, upper=upper_bound)

# Before outlier treatment
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.boxplot(x=df['Item_price'])
plt.title('Before Outlier Treatment - Item Price')

# After outlier treatment
df['Item_price_capped'] = cap_outliers(df['Item_price'])

plt.subplot(1,2,2)
sns.boxplot(x=df['Item_price_capped'])
plt.title('After Outlier Treatment - Item Price')

plt.tight_layout()
plt.show()
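
The capping rule in `cap_outliers` can be traced on a handful of numbers. The sketch below reimplements it with the standard library; `statistics.quantiles(..., method="inclusive")` uses the same linear interpolation as the pandas default, and the price list is invented.

```python
import statistics

def cap_outliers_iqr(values, factor=1.5):
    """Clip values to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(v, lower), upper) for v in values]

toy_prices = [10, 12, 11, 13, 12, 300]
# Q1 = 11.25, Q3 = 12.75, IQR = 1.5, so the bounds are 9.0 and 15.0
print(cap_outliers_iqr(toy_prices))  # [10, 12, 11, 13, 12, 15.0]
```

Capping keeps every row, which matters when the extreme value sits on a ticket whose other fields are still informative; the trade-off is that the capped column no longer reflects the true price.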

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
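# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Columns to label-encode (each distinct value becomes an integer code)
label_cols = ['Agent_name', 'Supervisor', 'Manager', 'Customer_City', 'Product_category']

# Keep one fitted encoder per column so codes can be mapped back to labels later
label_encoders = {}
for col in label_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col].astype(str))

# One-hot encode the remaining categorical columns; drop_first removes one redundant dummy per column
df = pd.get_dummies(df, columns=['category', 'Agent Shift'], drop_first=True)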
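
The two encodings used above behave differently, which a plain-Python sketch makes visible. The helpers `label_encode` and `one_hot` are illustrative stand-ins, not the scikit-learn or pandas implementations, and the shift names are made-up values.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes

def one_hot(values, drop_first=True):
    """Return one 0/1 column per distinct value; optionally drop the first column."""
    classes = sorted(set(values))
    kept = classes[1:] if drop_first else classes
    return {cls: [int(v == cls) for v in values] for cls in kept}

toy_shifts = ["Morning", "Evening", "Night", "Morning"]
codes, classes = label_encode(toy_shifts)
print(codes, classes)       # [1, 0, 2, 1] ['Evening', 'Morning', 'Night']
print(one_hot(toy_shifts))  # {'Morning': [1, 0, 0, 1], 'Night': [0, 0, 1, 0]}
```

Integer codes impose an arbitrary ordering (here Evening < Morning < Night), which linear models may misread as a magnitude; one-hot columns avoid that at the cost of more features.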

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re

contractions_dict = {
    "can't": "cannot", "won't": "will not", "don't": "do not",
    "didn't": "did not", "isn't": "is not", "aren't": "are not",
    "wasn't": "was not", "weren't": "were not", "hasn't": "has not",
    "haven't": "have not", "hadn't": "had not", "wouldn't": "would not",
    "shouldn't": "should not", "couldn't": "could not", "mustn't": "must not",
    "it's": "it is", "that's": "that is", "what's": "what is",
    "there's": "there is", "I'm": "I am", "you're": "you are",
    "they're": "they are", "we're": "we are", "I've": "I have",
    "you've": "you have", "we've": "we have", "they've": "they have",
    "I'll": "I will", "you'll": "you will", "he'll": "he will",
    "she'll": "she will", "they'll": "they will", "we'll": "we will"
}

def expand_contractions(text):
    for contraction, full_form in contractions_dict.items():
        text = re.sub(r"\b{}\b".format(contraction), full_form, text, flags=re.IGNORECASE)
    return text

# Apply to textual column (example: Customer Remarks)
df['Customer Remarks'] = df['Customer Remarks'].astype(str).apply(expand_contractions)
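
As an alternative to looping over the dictionary with one `re.sub` per entry, the same expansion can be done with a single precompiled pattern. This is a sketch under assumptions: the shortened dictionary `CONTRACTIONS`, the function name `expand_contractions_fast`, and the curly-apostrophe normalisation are additions for illustration.

```python
import re

# Lower-case keys so the case-insensitive match can be looked up directly
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}

# One alternation pattern, built once; re.escape guards against regex metacharacters in keys
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_fast(text):
    """Expand known contractions in a single regex pass."""
    text = text.replace("\u2019", "'")  # normalise curly apostrophes first
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(1).lower()], text)

print(expand_contractions_fast("I can't wait, it\u2019s late and I'm tired"))
# I cannot wait, it is late and i am tired
```

The expansion comes out lower-cased ("i am"), which is harmless here because the next step lower-cases the whole remark anyway.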


#### 2. Lower Casing

In [None]:
# Lower Casing
# Convert text to lowercase
df['Customer Remarks'] = df['Customer Remarks'].astype(str).str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Remove punctuation from text
df['Customer Remarks'] = df['Customer Remarks'].astype(str).str.translate(str.maketrans('', '', string.punctuation))


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits
import re

def clean_text(text):
    text = re.sub(r'http\S+|www\S+', '', text)     # Remove URLs
    text = re.sub(r'\b\w*\d\w*\b', '', text)       # Remove words containing digits
    return text

df['Customer Remarks'] = df['Customer Remarks'].astype(str).apply(clean_text)
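
The cleaning steps can also be combined into one function with precompiled patterns. The sketch below reuses the two regular expressions from `clean_text` and runs them before punctuation removal, so that URLs are still intact when they are matched; whether the order changes anything on this dataset is not established here. The name `clean_remark` and the sample remark are invented.

```python
import re
import string

URL_PATTERN = re.compile(r"http\S+|www\S+")
DIGIT_WORD_PATTERN = re.compile(r"\b\w*\d\w*\b")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_remark(text):
    text = URL_PATTERN.sub("", text)          # 1. URLs, while ':' '/' '.' are still present
    text = DIGIT_WORD_PATTERN.sub("", text)   # 2. tokens containing digits
    text = text.translate(PUNCT_TABLE)        # 3. punctuation
    return re.sub(r"\s+", " ", text).strip()  # 4. collapse leftover whitespace

print(clean_remark("Refund for order OD123 pending, see https://example.com/help now!"))
# Refund for order pending see now
```

Applied to the notebook's column, this would be `df['Customer Remarks'].astype(str).apply(clean_remark)`.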


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

df['Customer Remarks'] = df['Customer Remarks'].astype(str).apply(
    lambda x: " ".join([word for word in x.split() if word not in stop_words])
)
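
NLTK's English stopword list contains negations such as "not" and "no", so the step above removes them too. For a satisfaction task that may erase exactly the signal of interest. The sketch below shows one way to keep negations; the tiny `BASIC_STOPWORDS` set is a hand-made stand-in for the NLTK list, and `NEGATIONS` is an assumed choice.

```python
# Hand-made stand-in for a stopword list (the notebook uses NLTK's English list)
BASIC_STOPWORDS = {"the", "is", "a", "an", "was", "and", "to", "of", "not", "no", "it"}
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords(text, keep_negations=True):
    """Drop stopwords from whitespace-separated text, optionally keeping negation words."""
    stop_set = BASIC_STOPWORDS - NEGATIONS if keep_negations else BASIC_STOPWORDS
    return " ".join(word for word in text.split() if word not in stop_set)

remark = "the agent was not helpful"
print(remove_stopwords(remark, keep_negations=False))  # agent helpful
print(remove_stopwords(remark))                         # agent not helpful
```

With the real list, the equivalent change would be `stop_words = set(stopwords.words('english')) - NEGATIONS` before the filtering step.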


In [None]:
# Remove White spaces
# Remove extra spaces
df['Customer Remarks'] = df['Customer Remarks'].astype(str).str.strip().str.replace(r'\s+', ' ', regex=True)


#### 6. Rephrase Text

In [None]:
# Rephrase Text
rephrase_dict = {
    "pls": "please",
    "asap": "as soon as possible",
    "u": "you",
    "ur": "your",
    "thx": "thanks",
    "btw": "by the way",
    "msg": "message"
}

def rephrase_text(text):
    words = text.split()
    return " ".join([rephrase_dict.get(word, word) for word in words])

df['Customer Remarks'] = df['Customer Remarks'].astype(str).apply(rephrase_text)


#### 7. Tokenization

In [None]:
# Tokenization
# Simple whitespace-based tokenization
df['Customer Remarks_tokens'] = df['Customer Remarks'].astype(str).apply(lambda x: x.split())


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Example: lemmatize tokenized text
df['Customer Remarks_lemmatized'] = df['Customer Remarks_tokens'].apply(
    lambda tokens: [lemmatizer.lemmatize(word) for word in tokens]
)


##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk

# Download the necessary tokenizer
nltk.download('punkt')  # This is the standard tokenizer
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab') # Added to resolve LookupError: Resource punkt_tab not found
nltk.download('averaged_perceptron_tagger_eng') # Added to resolve LookupError: Resource averaged_perceptron_tagger_eng not found

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string

# Fill missing remarks and convert to string
df['Customer Remarks'] = df['Customer Remarks'].fillna('').astype(str)

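# Expand contractions (simple example, you can add more).
# Note: this only has an effect while apostrophes are still present in the text,
# i.e. if the punctuation-removal cell above has not already been run.
contractions_dict = {"can't": "cannot", "don't": "do not", "i'm": "i am"}

def expand_contractions(text):
    for contraction, full_form in contractions_dict.items():
        # re.escape keeps any special characters in the contraction literal
        text = re.sub(r'\b{}\b'.format(re.escape(contraction)), full_form, text, flags=re.IGNORECASE)
    return text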

df['Customer Remarks'] = df['Customer Remarks'].apply(expand_contractions)

# Lowercase
df['Customer Remarks'] = df['Customer Remarks'].str.lower()

# Remove URLs and words with digits
df['Customer Remarks'] = df['Customer Remarks'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))
df['Customer Remarks'] = df['Customer Remarks'].apply(lambda x: re.sub(r'\b\w*\d\w*\b', '', x))

# Remove punctuation
df['Customer Remarks'] = df['Customer Remarks'].str.translate(str.maketrans('', '', string.punctuation))

# Remove stopwords
stop_words = set(stopwords.words('english'))
df['Customer Remarks'] = df['Customer Remarks'].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))

# Remove extra spaces
df['Customer Remarks'] = df['Customer Remarks'].str.strip().str.replace(r'\s+', ' ', regex=True)

# Tokenization using nltk.word_tokenize (requires the 'punkt' / 'punkt_tab' resources downloaded above)
df['Customer Remarks_tokens'] = df['Customer Remarks'].apply(nltk.word_tokenize)

# POS tagging
df['Customer Remarks_POS'] = df['Customer Remarks_tokens'].apply(nltk.pos_tag)

# Lemmatization
lemmatizer = WordNetLemmatizer()
df['Customer Remarks_lemmatized'] = df['Customer Remarks_tokens'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])

# Check result
print(df[['Customer Remarks', 'Customer Remarks_tokens', 'Customer Remarks_POS', 'Customer Remarks_lemmatized']].head())
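
The lemmatization step above calls `lemmatizer.lemmatize(word)` without a part of speech, so every token is treated as a noun and the POS tags computed in the same cell are not used. A minimal sketch of POS-aware lemmatization follows. The helper names `treebank_to_wordnet` and `lemmatize_tagged` are hypothetical, and `toy_lemmatize` is only a stand-in so the example runs without NLTK data.

```python
from typing import Callable, List, Tuple

# Hypothetical helper: map a Penn Treebank tag (as returned by nltk.pos_tag)
# to the single-letter part-of-speech codes that WordNet uses.
def treebank_to_wordnet(tag: str) -> str:
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default

# Hypothetical helper: lemmatize (word, tag) pairs using each word's POS.
# `lemmatize` is any callable taking (word, pos), e.g. WordNetLemmatizer().lemmatize.
def lemmatize_tagged(tagged_tokens: List[Tuple[str, str]],
                     lemmatize: Callable[[str, str], str]) -> List[str]:
    return [lemmatize(word, treebank_to_wordnet(tag)) for word, tag in tagged_tokens]

# Stand-in lemmatizer (an assumption for illustration, not NLTK's behaviour)
def toy_lemmatize(word: str, pos: str) -> str:
    table = {('waiting', 'v'): 'wait', ('calls', 'n'): 'call'}
    return table.get((word, pos), word)

example = [('waiting', 'VBG'), ('calls', 'NNS'), ('late', 'RB')]
result = lemmatize_tagged(example, toy_lemmatize)
print(result)
```

In the notebook this could be applied as `df['Customer Remarks_POS'].apply(lambda tagged: lemmatize_tagged(tagged, lemmatizer.lemmatize))`, assuming the `wordnet` resource has been downloaded.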

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Ensure no missing values
df['Customer Remarks'] = df['Customer Remarks'].fillna('').astype(str)

#  Bag of Words
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(df['Customer Remarks'])

#  TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['Customer Remarks'])

# N-grams (TF-IDF)
ngram_vectorizer = TfidfVectorizer(ngram_range=(1,2))
X_ngram = ngram_vectorizer.fit_transform(df['Customer Remarks'])

# Output shapes
print("Bag of Words Shape:", X_bow.shape)
print("TF-IDF Shape:", X_tfidf.shape)
print("N-gram TF-IDF Shape:", X_ngram.shape)
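
To make the difference between the unigram and n-gram vectorizers concrete, here is a small sketch on three made-up remarks (the sample strings are assumptions, not rows from the dataset). It shows how `ngram_range=(1, 2)` enlarges the vocabulary with two-word phrases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up remarks, already lowercased and cleaned
remarks = ["late delivery", "good service", "late refund"]

# Unigram TF-IDF: one column per distinct word
unigram = TfidfVectorizer()
X_uni = unigram.fit_transform(remarks)

# Unigram + bigram TF-IDF: adds one column per adjacent word pair
bigram = TfidfVectorizer(ngram_range=(1, 2))
X_bi = bigram.fit_transform(remarks)

print(sorted(unigram.vocabulary_))
print("Unigram shape:", X_uni.shape)
print("Unigram + bigram shape:", X_bi.shape)
```

The bigram matrix has more columns for the same number of rows, which is why the n-gram variant usually needs dimensionality reduction or a vocabulary limit on real data.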

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
import pandas as pd
import numpy as np

#  CREATE NEW FEATURES

# Ensure datetime format
df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], errors='coerce')
df['issue_responded'] = pd.to_datetime(df['issue_responded'], errors='coerce')

# Response time features
df['response_time_hours'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / 3600
df['response_time_days'] = df['response_time_hours'] / 24

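# Price category (open-ended top bin so prices above 10,000 do not become NaN)
df['price_category'] = pd.cut(df['Item_price'],
                               bins=[0, 500, 2000, 5000, np.inf],
                               labels=['Low', 'Medium', 'High', 'Premium'],
                               include_lowest=True)

# Agent workload feature
df['agent_ticket_load'] = df.groupby('Agent_name')['Order_id'].transform('count')

# CSAT band feature (derived from the target, so use it for analysis only and
# exclude it from the model features to avoid target leakage)
df['csat_band'] = pd.cut(df['CSAT Score'],
                          bins=[0, 2, 4, 5],
                          labels=['Low', 'Medium', 'High'])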

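# REMOVE HIGHLY CORRELATED FEATURES

# Exclude the target so the correlation filter can never drop it
num_cols = df.select_dtypes(include=np.number).drop(columns=['CSAT Score'], errors='ignore')
corr_matrix = num_cols.corr().abs()

# Keep only the upper triangle so each pair of columns is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]

df.drop(columns=to_drop, inplace=True)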

print("Dropped highly correlated columns:", to_drop)
#  FINAL DATAFRAME CHECK
print(df.head())
print("Final shape:", df.shape)
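
The correlation filter above can be hard to follow on the full dataset, so here is a small sketch on a toy frame. The function name `find_correlated_columns` and the toy columns are assumptions for illustration. Because only the upper triangle of the correlation matrix is inspected, the later column of each highly correlated pair is the one flagged.

```python
import numpy as np
import pandas as pd

# Hypothetical helper mirroring the upper-triangle correlation filter
def find_correlated_columns(frame: pd.DataFrame, threshold: float = 0.85) -> list:
    corr = frame.corr().abs()
    # Mask everything except the strict upper triangle (k=1 skips the diagonal)
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

hours = [1.0, 2.0, 3.0, 4.0, 5.0]
toy = pd.DataFrame({
    'hours': hours,
    'days': [h / 24 for h in hours],    # exact rescaling of 'hours'
    'load': [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to 'hours'
})

to_drop_demo = find_correlated_columns(toy)
print(to_drop_demo)
```

This is also why `response_time_days` adds no information next to `response_time_hours`: one is an exact rescaling of the other, so the filter flags the later of the two.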

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Separate features and target; fill missing values with the column median
# so the model can be fitted on columns that contain NaN
X = df.select_dtypes(include=np.number).drop(columns=['CSAT Score'])
X = X.fillna(X.median())
y = df['CSAT Score']

# Train Random Forest
model = RandomForestRegressor(random_state=42)
model.fit(X, y)

# Get feature importance
importances = model.feature_importances_

# Select top 10 features, ordered from most to least important
indices = np.argsort(importances)[::-1][:10]
selected_features = X.columns[indices]

print("Top Selected Features:", selected_features.tolist())

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
import numpy as np

# Yes, the data needed transformation. Item_price and response_time_hours are
# skewed, so a log transformation (log1p) is applied to reduce skewness and
# dampen the influence of extreme values.

# Log transformation to reduce skewness
df['Item_price_log'] = np.log1p(df['Item_price'])
# Clip negative durations to 0 first: log1p is undefined for values <= -1
df['response_time_hours_log'] = np.log1p(df['response_time_hours'].clip(lower=0))

# Check transformed data
print(df[['Item_price', 'Item_price_log', 'response_time_hours', 'response_time_hours_log']].head())
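
A quick way to check whether the log transformation helps is to compare skewness before and after. The sketch below uses synthetic, right-skewed prices (an assumption for illustration, not the Flipkart data).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=1000))

skew_before = prices.skew()
skew_after = np.log1p(prices).skew()

print("Skewness before log1p:", round(skew_before, 3))
print("Skewness after log1p:", round(skew_after, 3))
```

On the real columns the same two calls, for example `df['Item_price'].skew()` and `df['Item_price_log'].skew()`, indicate whether the transformation was worthwhile.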



### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

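# Select numerical feature columns (the target 'CSAT Score' is left unscaled
# so that it keeps its original class values)
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.drop('CSAT Score', errors='ignore')

# Apply scaling
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])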

# Check result
print(df[num_cols].head())
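
Fitting the scaler on the whole dataset before splitting lets test-set statistics influence the training features. A common alternative, sketched here on synthetic data (all names and values are assumptions for illustration), is to fit the scaler on the training split only and reuse it on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a 1-5 target, standing in for the real data
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(1, 6, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn mean and standard deviation from the training split only
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# Reuse the training statistics on the test split (no refitting)
X_te_scaled = scaler_demo.transform(X_te)

print("Train means:", X_tr_scaled.mean(axis=0).round(3))
print("Test means:", X_te_scaled.mean(axis=0).round(3))
```

The training means are zero by construction, while the test means are only approximately zero, which is the expected behaviour when no information flows from the test split.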

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import TruncatedSVD

# Example: using TF-IDF matrix from Customer Remarks
# Assuming X_tfidf is your TF-IDF vectorized text
# X_tfidf = tfidf_vectorizer.fit_transform(df['Customer Remarks'])

# Reduce dimensions to 10 components
svd = TruncatedSVD(n_components=10, random_state=42)
X_svd = svd.fit_transform(X_tfidf)

# Create a DataFrame with reduced components
df_svd = pd.DataFrame(X_svd, columns=[f'SVD{i+1}' for i in range(X_svd.shape[1])])

# Check result
print(df_svd.head())
print("Explained Variance Ratio:", svd.explained_variance_ratio_)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Features (X) and target (y)
# 'CSAT Score' is the target; 'csat_band' is derived from it and is dropped too to avoid leakage
X = df.drop(columns=['CSAT Score', 'csat_band'], errors='ignore')
y = df['CSAT Score']

# Split data: 80% train, 20% test (common for real-world datasets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

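# Separate features and target
# ('csat_band' is derived from the target, so it is dropped to avoid target leakage)
X = df.drop(columns=['CSAT Score', 'csat_band'], errors='ignore')
y = df['CSAT Score']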

# Convert datetime columns to numeric
for col in X.select_dtypes(include=['datetime64[ns]', 'datetime64[ns, UTC]']):
    X[col] = X[col].view('int64')  # convert to timestamp

# Encode categorical columns
for col in X.select_dtypes(include=['object', 'category']):
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))

# Fill missing values
X = X.fillna(X.median(numeric_only=True))

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Encode y_train and y_test as 0-indexed integer class labels
# (some classifiers, e.g. XGBoost, require labels 0..n_classes-1)
le_csat = LabelEncoder()
y_train_encoded = le_csat.fit_transform(y_train)
y_test_encoded = le_csat.transform(y_test)

# Handle imbalance using class weights
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train_encoded)

print(" Model trained successfully without errors")
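
To see what `class_weight='balanced'` does, the weights can be computed directly. This is a minimal sketch on made-up labels (the label counts are assumptions, not the dataset's distribution). Each class receives the weight `n_samples / (n_classes * class_count)`, so rare classes count more during training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up, imbalanced labels: two low scores and eight high scores
y_demo = np.array([1, 1, 5, 5, 5, 5, 5, 5, 5, 5])

classes, counts = np.unique(y_demo, return_counts=True)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

class_counts = dict(zip(classes.tolist(), counts.tolist()))
weight_by_class = {int(c): float(w) for c, w in zip(classes, weights)}

print("Class counts:", class_counts)
print("Balanced class weights:", weight_by_class)
```

On the real target, `df['CSAT Score'].value_counts(normalize=True)` gives the class shares needed to judge whether the dataset is imbalanced in the first place.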

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 (Random Forest) Implementation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Fit the Algorithm
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train_encoded)

# Predict on the model
y_pred_rf = rf.predict(X_test)

# Evaluate
print("Random Forest classifier : ")
print("Accuracy Score:", accuracy_score(y_test_encoded, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test_encoded, y_pred_rf, zero_division=0))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

# Get classification report as dictionary
report = classification_report(y_test_encoded, y_pred_rf, output_dict=True, zero_division=0)

# Convert to DataFrame
report_df = pd.DataFrame(report).iloc[:-1, :].T

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(report_df[['precision', 'recall', 'f1-score']], annot=True, cmap='Blues', fmt=".2f")
plt.title("Model Evaluation Metrics Heatmap")
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization (GridSearchCV)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Define model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameters
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# GridSearchCV
grid_rf = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train, y_train_encoded)

# Best model
best_rf = grid_rf.best_estimator_

# Predict
y_pred_rf = best_rf.predict(X_test)

# Evaluate
print("Best Parameters:", grid_rf.best_params_)
print("Accuracy:", accuracy_score(y_test_encoded, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test_encoded, y_pred_rf, zero_division=0))

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# ML Model - 2 (Logistic Regression)

# Fit the Algorithm
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train_encoded)

# Predict
y_pred_lr = lr.predict(X_test)

# Evaluate
print("Logistic Regression (Default Parameters): ")
print("Accuracy Score: ", accuracy_score(y_test_encoded, y_pred_lr))
print("\nClassification Report:\n", classification_report(y_test_encoded, y_pred_lr, zero_division=0))

# Confusion Matrix
conf_mat = confusion_matrix(y_test_encoded, y_pred_lr)
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues')
plt.title("Logistic Regression - Confusion Matrix (Default)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# ------------------------------------------
# Hyperparameter Tuning using GridSearch CV
# ------------------------------------------

param_grid = {
    'C' : [0.01, 0.1, 1, 10],
    'penalty' : ['l2'],
    'solver' : ['lbfgs', 'saga'],
    'max_iter' : [500, 1000]
}

grid_search_lr = GridSearchCV(
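    # The multinomial loss is already the default for multiclass targets with these solvers;
    # the explicit multi_class argument is deprecated in recent scikit-learn releases
    LogisticRegression(random_state=42),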
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# ------------------------------------------
# Fit the algorithm
# ------------------------------------------

grid_search_lr.fit(X_train, y_train_encoded)

# Best parameters found
print("Best parameters found: ", grid_search_lr.best_params_)

# ------------------------------------------
# Predict the model
# ------------------------------------------

best_lr = grid_search_lr.best_estimator_
y_pred = best_lr.predict(X_test)

# ------------------------------------------
# Evaluation
# ------------------------------------------

print("\nLogistic Regression with GridSearchCV: ")
print("Accuracy Score: ", accuracy_score(y_test_encoded, y_pred))
print("\nClassification Report:\n", classification_report(y_test_encoded, y_pred, zero_division = 0))

# Confusion Matrix
conf_mat = confusion_matrix(y_test_encoded, y_pred)
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues')
plt.title("Logistic Regression - Confusion Matrix (GridSearchCV)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# XGBClassifier expects class labels to be consecutive integers starting from 0 (e.g., 0, 1, 2),
# so the CSAT values in y_train and y_test are re-encoded.
# (This yields the same encoding as y_train_encoded / y_test_encoded above.)

# Use LabelEncoder to transform the y values into 0-indexed classes
le_xgb = LabelEncoder()
y_train_xgb = le_xgb.fit_transform(y_train)
y_test_xgb = le_xgb.transform(y_test)

# Determine the number of classes for XGBoost
num_classes_xgb = len(le_xgb.classes_)

# Initialize XGBClassifier
xgb = XGBClassifier(objective='multi:softmax', num_class=num_classes_xgb,
                    random_state=42, eval_metric='mlogloss')
xgb.fit(X_train, y_train_xgb)

# Predict on the model (raw 0-indexed predictions)
y_pred_xgb_raw = xgb.predict(X_test)

# Evaluate using the 0-indexed labels: y_pred_xgb_raw and y_test_xgb share the same encoding
print("XGBoost classifier (Default parameters): ")
print("Accuracy Score: ", accuracy_score(y_test_xgb, y_pred_xgb_raw))
print("\nClassification Report:\n", classification_report(y_test_xgb, y_pred_xgb_raw, zero_division=0))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

# Get classification report as dictionary for the XGBoost predictions
report_xgb = classification_report(y_test_xgb, y_pred_xgb_raw, output_dict=True, zero_division=0)

# Convert to DataFrame
report_xgb_df = pd.DataFrame(report_xgb).iloc[:-1, :].T

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(report_xgb_df[['precision', 'recall', 'f1-score']], annot=True, cmap='Blues', fmt=".2f")
plt.title("XGBoost Model Evaluation Metrics Heatmap")
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# ------------------------------------------
# Hyperparameter Tuning using GridSearch CV
# ------------------------------------------

param_grid = {
    'n_estimators' : [100, 150],
    'max_depth' : [3, 6, 10],
    'learning_rate' : [0.01, 0.1],
    'subsample' : [0.8, 1],
    'colsample_bytree' : [0.8, 1]
}

grid_search_xgb = GridSearchCV(
    estimator=XGBClassifier(eval_metric='mlogloss', objective='multi:softmax', random_state = 42),
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# ------------------------------------------
# Fit the algorithm
# ------------------------------------------

grid_search_xgb.fit(X_train, y_train_xgb)

# Best parameters found
print("Best parameters found: ", grid_search_xgb.best_params_)

# ------------------------------------------
# Predict the model
# ------------------------------------------

best_xgb = grid_search_xgb.best_estimator_
# Make predictions as 0-indexed labels for evaluation against y_test_xgb
y_pred_xgb_eval = best_xgb.predict(X_test)

# ------------------------------------------
# Evaluation
# ------------------------------------------

print("\nXGBClassifier (with GridSearchCV): ")
print("Accuracy Score: ", accuracy_score(y_test_xgb, y_pred_xgb_eval))
print("\nClassification Report:\n", classification_report(y_test_xgb, y_pred_xgb_eval, zero_division = 0))

# Confusion matrix
conf_mat = confusion_matrix(y_test_xgb, y_pred_xgb_eval)
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Greens')
plt.title("XGBoost - Confusion Matrix (GridSearchCV)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the file

import pickle

# Save the model to a .pkl file
with open('best_xgb.pkl', 'wb') as file:
    pickle.dump(best_xgb, file)

print("Model saved as: 'best_xgb.pkl'")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
#Load the file and predict the unseen data.
import pickle

# Load the model from the .pkl file
with open('best_xgb.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Select a few samples from test data
unseen_samples = X_test.sample(5, random_state = 1)

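# Predict using the loaded model and map the 0-indexed class labels back to the
# values seen by the label encoder (le_xgb from the ML Model - 3 cell), instead of
# assuming that every score from 1 to 5 is present in the training labels
predictions = le_xgb.inverse_transform(loaded_model.predict(unseen_samples))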

# Display predictions
print("Predicted CSAT score for unseen data: ")
print(predictions)
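
When 0-indexed predictions are mapped back to CSAT scores, adding a constant is only correct if the encoder saw every score from 1 to 5. The sketch below uses made-up scores with one value missing (an assumption for illustration) to show how a constant shift and `LabelEncoder.inverse_transform` can disagree.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up training scores in which the value 3 never occurs
scores = np.array([1, 2, 4, 5, 5, 4])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(scores)  # classes_ becomes [1, 2, 4, 5]

# Pretend the model predicted the encoded classes 2 and 3
pred = np.array([2, 3])

plus_one = pred + 1                        # constant shift
decoded = le_demo.inverse_transform(pred)  # lookup in the fitted encoder

print("Encoder classes:", le_demo.classes_.tolist())
print("pred + 1:", plus_one.tolist())
print("inverse_transform:", decoded.tolist())
```

For deployment, the fitted encoder would need to be saved alongside the model (for example with `pickle`) so that the same mapping is available at prediction time.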

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***