Project Title------>DeepCSAT - Ecommerce Customer Satisfaction Score Prediction
-------------------------------------------------------------------------------

Project Summary----->> 
======================
The project DeepCSAT focuses on predicting customer satisfaction scores in the e-commerce domain using advanced data analytics and deep learning techniques. In today's highly competitive online marketplace, understanding customer sentiment and predicting satisfaction levels are crucial for improving service quality, customer retention, and business growth.

The objective of this project is to develop a predictive model that can estimate a customer's satisfaction score based on various factors such as order experience, delivery time, product quality, pricing, customer support, and feedback text. The project utilizes a deep learning approach, integrating natural language processing (NLP) for text-based reviews and numerical data analysis for structured features.

The workflow includes data collection, preprocessing, and feature extraction, followed by model training using neural network architectures like Deep Neural Networks (DNNs) or Recurrent Neural Networks (RNNs). The model's performance is evaluated using Mean Squared Error (MSE) and the R² score, which together measure how closely the predictions track the actual scores.
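To make the two evaluation metrics concrete, here is a minimal sketch that computes MSE and R² by hand. The four score pairs are invented illustration values, not results from this project, and the helper names `mse` and `r2` are hypothetical.

```python
# Toy illustration of the evaluation metrics named above.
# The scores below are invented for the example; they are not project results.

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [5, 4, 2, 5]
y_pred = [4.5, 4.0, 3.0, 5.0]
print(f"MSE: {mse(y_true, y_pred):.4f}  R2: {r2(y_true, y_pred):.4f}")
```

In the notebook itself the same quantities come from `mean_squared_error` and `r2_score` in scikit-learn.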

The outcome of DeepCSAT is a system capable of automatically predicting satisfaction levels and identifying potential problem areas, enabling e-commerce companies to make data-driven decisions to enhance the overall customer experience.
                       

Key Highlights====>
-------------------

>Domain: E-commerce & Customer Analytics

>Tech Stack: Python, TensorFlow/Keras, Scikit-learn, Pandas, NLP (TextBlob or BERT)

>Model Used: Deep Learning (DNN/RNN)

>Goal: Predict customer satisfaction score accurately

>Impact: Improved customer retention, service quality, and business insights

🎯 Project Goal----->
----------------------

The primary goal of the DeepCSAT project is to predict customer satisfaction scores in the e-commerce sector using Artificial Neural Networks (ANNs) and data-driven insights.

This project aims to develop an intelligent predictive system that can automatically estimate how satisfied a customer is likely to be based on various factors such as:

  >Product quality and pricing

  >Delivery performance

  >Customer support interaction

  >User feedback and review sentiment

  >Overall shopping experience

By accurately predicting the Customer Satisfaction (CSAT) Score, the model helps e-commerce platforms:

  >Identify dissatisfied customers early

  >Understand key pain points affecting satisfaction

  >Improve decision-making in product quality, logistics, and service

  >Enhance customer retention and build long-term loyalty
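As a rough sketch of how predicted scores could feed the first point above (identifying dissatisfied customers early), the snippet below flags customers whose predicted CSAT falls under a cutoff. The helper `flag_at_risk`, the cutoff of 3.0, the customer ids, and the scores are all assumptions made for illustration; a 1-5 score scale is assumed as well.

```python
# Hypothetical helper: pick out customers whose predicted CSAT falls below a cutoff.
# The cutoff of 3.0 and the record layout are assumptions for illustration.

def flag_at_risk(predictions, threshold=3.0):
    """Return (customer_id, score) pairs below the threshold, lowest score first."""
    at_risk = [(cid, score) for cid, score in predictions.items() if score < threshold]
    return sorted(at_risk, key=lambda pair: pair[1])

# Invented predicted scores on an assumed 1-5 CSAT scale
predicted = {"cust_a": 4.7, "cust_b": 2.1, "cust_c": 3.4, "cust_d": 1.8}
for customer_id, score in flag_at_risk(predicted):
    print(customer_id, score)
```

Sorting by score puts the most likely detractors first, which is one simple way to prioritise follow-up.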


In [1]:
#import used libraries
# deepcsat_full_pipeline.py
# Run as a notebook or script. It inspects the CSV, trains baseline XGBoost and a Keras multimodal model.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import joblib
from datetime import datetime
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# ML libraries
import xgboost as xgb
import shap

# Deep learning
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, optimizers

In [2]:
#import the raw data
df=pd.read_csv('c:/labmentix/5th project/eCommerce_Customer_support_data.csv')
df

Unnamed: 0,Unique id,channel_name,category,Sub-category,Customer Remarks,Order_id,order_date_time,Issue_reported at,issue_responded,Survey_response_Date,Customer_City,Product_category,Item_price,connected_handling_time,Agent_name,Supervisor,Manager,Tenure Bucket,Agent Shift,CSAT Score
0,7e9ae164-6a8b-4521-a2d4-58f7c9fff13f,Outcall,Product Queries,Life Insurance,,c27c9bb4-fa36-4140-9f1f-21009254ffdb,,1/8/2023 11:13,1/8/2023 11:47,1-Aug-23,,,,,Richard Buchanan,Mason Gupta,Jennifer Nguyen,On Job Training,Morning,5
1,b07ec1b0-f376-43b6-86df-ec03da3b2e16,Outcall,Product Queries,Product Specific Information,,d406b0c7-ce17-4654-b9de-f08d421254bd,,1/8/2023 12:52,1/8/2023 12:54,1-Aug-23,,,,,Vicki Collins,Dylan Kim,Michael Lee,>90,Morning,5
2,200814dd-27c7-4149-ba2b-bd3af3092880,Inbound,Order Related,Installation/demo,,c273368d-b961-44cb-beaf-62d6fd6c00d5,,1/8/2023 20:16,1/8/2023 20:38,1-Aug-23,,,,,Duane Norman,Jackson Park,William Kim,On Job Training,Evening,5
3,eb0d3e53-c1ca-42d3-8486-e42c8d622135,Inbound,Returns,Reverse Pickup Enquiry,,5aed0059-55a4-4ec6-bb54-97942092020a,,1/8/2023 20:56,1/8/2023 21:16,1-Aug-23,,,,,Patrick Flores,Olivia Wang,John Smith,>90,Evening,5
4,ba903143-1e54-406c-b969-46c52f92e5df,Inbound,Cancellation,Not Needed,,e8bed5a9-6933-4aff-9dc6-ccefd7dcde59,,1/8/2023 10:30,1/8/2023 10:32,1-Aug-23,,,,,Christopher Sanchez,Austin Johnson,Michael Lee,0-30,Morning,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85902,505ea5e7-c475-4fac-ac36-1d19a4cb610f,Inbound,Refund Related,Refund Enquiry,,1b5a2b9c-a95f-405f-a42e-5b1b693f3dc9,,30/08/2023 23:20,31/08/2023 07:22,31-Aug-23,,,,,Brandon Leon,Ethan Tan,William Kim,On Job Training,Morning,4
85903,44b38d3f-1523-4182-aba2-72917586647c,Inbound,Order Related,Seller Cancelled Order,Supported team customer executive good,d0e8a817-96d5-4ace-bb82-adec50398e22,,31/08/2023 08:15,31/08/2023 08:17,31-Aug-23,,,,,Linda Foster,Noah Patel,Emily Chen,>90,Morning,5
85904,723bce2c-496c-4aa8-a64b-ca17004528f0,Inbound,Order Related,Order status enquiry,need to improve with proper details.,bdefe788-ccec-4eda-8ca4-51045e68db8a,,31/08/2023 18:57,31/08/2023 19:02,31-Aug-23,,,,,Kimberly Martinez,Aiden Patel,Olivia Tan,On Job Training,Evening,5
85905,707528ee-6873-4192-bfa9-a491f1c08ab5,Inbound,Feedback,UnProfessional Behaviour,,a031ec28-0c5e-450e-95b2-592342c40bc4,,31/08/2023 19:59,31/08/2023 20:00,31-Aug-23,,,,,Daniel Martin,Olivia Suzuki,Olivia Tan,>90,Morning,4


In [7]:
#use command for checking the datatypes of the columns
df.dtypes


Unique id                   object
channel_name                object
category                    object
Sub-category                object
Customer Remarks            object
Order_id                    object
order_date_time             object
Issue_reported at           object
issue_responded             object
Survey_response_Date        object
Customer_City               object
Product_category            object
Item_price                 float64
connected_handling_time    float64
Agent_name                  object
Supervisor                  object
Manager                     object
Tenure Bucket               object
Agent Shift                 object
CSAT Score                   int64
dtype: object
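All timestamp columns load as `object` (plain strings), so no time-based feature is available until they are parsed. The notebook does not do this; the sketch below is one possible way to derive a response time in minutes from `Issue_reported at` and `issue_responded`. The day/month/year format is an assumption inferred from values such as `30/08/2023 23:20`, and `response_minutes` is a hypothetical helper.

```python
from datetime import datetime

# Assumed format: day/month/year hour:minute, as in "30/08/2023 23:20".
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"

def response_minutes(reported, responded):
    """Minutes between the issue being reported and the response."""
    start = datetime.strptime(reported, TIMESTAMP_FORMAT)
    end = datetime.strptime(responded, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 60.0

# Timestamps copied from two rows of the raw-data preview
print(response_minutes("1/8/2023 11:13", "1/8/2023 11:47"))
print(response_minutes("30/08/2023 23:20", "31/08/2023 07:22"))
```

For a whole column, `pd.to_datetime` with `dayfirst=True` is the usual pandas route, though the mixed zero-padding in this file may need extra care.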

In [8]:
#calculate total null values in the raw data
df.isnull().sum()

Unique id                      0
channel_name                   0
category                       0
Sub-category                   0
Customer Remarks           57165
Order_id                   18232
order_date_time            68693
Issue_reported at              0
issue_responded                0
Survey_response_Date           0
Customer_City              68828
Product_category           68711
Item_price                 68701
connected_handling_time    85665
Agent_name                     0
Supervisor                     0
Manager                        0
Tenure Bucket                  0
Agent Shift                    0
CSAT Score                     0
dtype: int64
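The counts above show that a few columns are null in most rows (`connected_handling_time` in 85,665 of them), so dropping every incomplete row keeps very little data. The sketch below compares that with one possible alternative: removing very sparse columns first and only then dropping incomplete rows. The rows are invented toy data, the 0.5 cutoff is an assumption, and the helper names are hypothetical; this is not what the notebook does.

```python
# Toy rows (invented) with None standing in for NaN; column names follow the dataset.
rows = [
    {"category": "Returns", "Customer Remarks": "Good", "Item_price": 899.0, "connected_handling_time": 367.0},
    {"category": "Returns", "Customer Remarks": None, "Item_price": None, "connected_handling_time": None},
    {"category": "Feedback", "Customer Remarks": "Not good", "Item_price": 699.0, "connected_handling_time": None},
    {"category": "Cancellation", "Customer Remarks": None, "Item_price": 799.0, "connected_handling_time": None},
]

def missing_fraction(rows):
    """Share of missing values per column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

def drop_sparse_columns(rows, max_missing=0.5):
    """Remove columns whose missing share exceeds max_missing (an assumed cutoff)."""
    frac = missing_fraction(rows)
    keep = [c for c in rows[0] if frac[c] <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Keep only rows with no missing value, similar to DataFrame.dropna()."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Rows left when dropping incomplete rows directly
print(len(drop_incomplete_rows(rows)))
# Rows left when sparse columns are removed first
print(len(drop_incomplete_rows(drop_sparse_columns(rows))))
```

In pandas, `df.isna().mean()` gives the per-column missing share, and `df.dropna(subset=[...])` restricts the row filter to chosen columns.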

In [9]:
#drop all duplicate rows present in the raw data
df=df.drop_duplicates()
df

Unnamed: 0,Unique id,channel_name,category,Sub-category,Customer Remarks,Order_id,order_date_time,Issue_reported at,issue_responded,Survey_response_Date,Customer_City,Product_category,Item_price,connected_handling_time,Agent_name,Supervisor,Manager,Tenure Bucket,Agent Shift,CSAT Score
0,7e9ae164-6a8b-4521-a2d4-58f7c9fff13f,Outcall,Product Queries,Life Insurance,,c27c9bb4-fa36-4140-9f1f-21009254ffdb,,1/8/2023 11:13,1/8/2023 11:47,1-Aug-23,,,,,Richard Buchanan,Mason Gupta,Jennifer Nguyen,On Job Training,Morning,5
1,b07ec1b0-f376-43b6-86df-ec03da3b2e16,Outcall,Product Queries,Product Specific Information,,d406b0c7-ce17-4654-b9de-f08d421254bd,,1/8/2023 12:52,1/8/2023 12:54,1-Aug-23,,,,,Vicki Collins,Dylan Kim,Michael Lee,>90,Morning,5
2,200814dd-27c7-4149-ba2b-bd3af3092880,Inbound,Order Related,Installation/demo,,c273368d-b961-44cb-beaf-62d6fd6c00d5,,1/8/2023 20:16,1/8/2023 20:38,1-Aug-23,,,,,Duane Norman,Jackson Park,William Kim,On Job Training,Evening,5
3,eb0d3e53-c1ca-42d3-8486-e42c8d622135,Inbound,Returns,Reverse Pickup Enquiry,,5aed0059-55a4-4ec6-bb54-97942092020a,,1/8/2023 20:56,1/8/2023 21:16,1-Aug-23,,,,,Patrick Flores,Olivia Wang,John Smith,>90,Evening,5
4,ba903143-1e54-406c-b969-46c52f92e5df,Inbound,Cancellation,Not Needed,,e8bed5a9-6933-4aff-9dc6-ccefd7dcde59,,1/8/2023 10:30,1/8/2023 10:32,1-Aug-23,,,,,Christopher Sanchez,Austin Johnson,Michael Lee,0-30,Morning,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85902,505ea5e7-c475-4fac-ac36-1d19a4cb610f,Inbound,Refund Related,Refund Enquiry,,1b5a2b9c-a95f-405f-a42e-5b1b693f3dc9,,30/08/2023 23:20,31/08/2023 07:22,31-Aug-23,,,,,Brandon Leon,Ethan Tan,William Kim,On Job Training,Morning,4
85903,44b38d3f-1523-4182-aba2-72917586647c,Inbound,Order Related,Seller Cancelled Order,Supported team customer executive good,d0e8a817-96d5-4ace-bb82-adec50398e22,,31/08/2023 08:15,31/08/2023 08:17,31-Aug-23,,,,,Linda Foster,Noah Patel,Emily Chen,>90,Morning,5
85904,723bce2c-496c-4aa8-a64b-ca17004528f0,Inbound,Order Related,Order status enquiry,need to improve with proper details.,bdefe788-ccec-4eda-8ca4-51045e68db8a,,31/08/2023 18:57,31/08/2023 19:02,31-Aug-23,,,,,Kimberly Martinez,Aiden Patel,Olivia Tan,On Job Training,Evening,5
85905,707528ee-6873-4192-bfa9-a491f1c08ab5,Inbound,Feedback,UnProfessional Behaviour,,a031ec28-0c5e-450e-95b2-592342c40bc4,,31/08/2023 19:59,31/08/2023 20:00,31-Aug-23,,,,,Daniel Martin,Olivia Suzuki,Olivia Tan,>90,Morning,4


In [10]:
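#drop every row that has a null in any column
#note: connected_handling_time is null in 85,665 rows, so only fully populated rows remain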
df=df.dropna()
df

Unnamed: 0,Unique id,channel_name,category,Sub-category,Customer Remarks,Order_id,order_date_time,Issue_reported at,issue_responded,Survey_response_Date,Customer_City,Product_category,Item_price,connected_handling_time,Agent_name,Supervisor,Manager,Tenure Bucket,Agent Shift,CSAT Score
13565,08c6a929-a403-4f14-810f-2275fe591230,Outcall,Returns,Return request,Good,0258d703-8287-428d-9e48-446e29eec3e5,29/07/2023 03:44,5/8/2023 11:11,5/8/2023 11:50,5-Aug-23,BETTIAH,Electronics,899.0,367.0,Katelyn Horton,Mason Gupta,Olivia Tan,0-30,Morning,5
13603,eae37bb3-91ec-4154-ba3c-7e4a6309a946,Outcall,Returns,Return request,Not good,d5d704c1-7d8e-4573-8e9d-aec29e6d3b40,31/07/2023 20:21,4/8/2023 12:04,5/8/2023 7:02,5-Aug-23,PATNA,Electronics,899.0,604.0,Michael Ruiz,Madison Kim,John Smith,>90,Morning,2
18445,296c5c93-6a4f-4fa5-b276-3feecfeaf636,Outcall,Returns,Return request,Call was helpful.,6a318938-1b06-4394-82d7-83a0d7c18f29,29/07/2023 22:52,9/8/2023 12:40,9/8/2023 12:43,9-Aug-23,AGRA,Electronics,868.0,233.0,Dillon Miller,Mason Gupta,Olivia Tan,0-30,Morning,4
19678,5155d369-7bc3-4c40-9e26-6dcee58ace99,Outcall,Returns,Return request,Good service,feb6c5f8-5418-4abc-aec4-494ce0df2791,1/8/2023 13:35,8/8/2023 12:35,8/8/2023 12:39,8-Aug-23,Birbhum,Electronics,899.0,458.0,Dillon Miller,Mason Gupta,Olivia Tan,0-30,Morning,5
20080,ddaae943-3535-452e-857f-591d4b9ae0c8,Outcall,Cancellation,Return cancellation,Thanks,68b1ae3c-a0f9-48c7-96f0-2e17a460814c,3/8/2023 13:42,7/8/2023 13:55,8/8/2023 9:21,8-Aug-23,BAREILLY,Electronics,799.0,362.0,Patricia Cross,Mason Gupta,Olivia Tan,0-30,Afternoon,5
20363,00c52c83-d40a-4b0f-80ea-fcffe23cef76,Outcall,Returns,Return request,Good job,8f14a636-094a-4810-8d61-c14b2478c9bd,30/07/2023 18:10,8/8/2023 11:13,8/8/2023 16:01,8-Aug-23,BASTI,Electronics,699.0,384.0,Daniel Ball,Ethan Nakamura,Olivia Tan,>90,Morning,5
20439,5b3a9dfe-8c26-4c5f-9824-02db5daf7ecb,Outcall,Returns,Return request,Super customer care,fe85b0a1-5abb-4f02-8750-61937e039bfd,4/8/2023 0:03,8/8/2023 11:03,8/8/2023 11:07,8-Aug-23,MADHUGIRI,Electronics,899.0,655.0,Evelyn Petersen,Olivia Wang,Olivia Tan,0-30,Morning,5
22514,8d093faf-8359-494d-9955-224cf8a5de8d,Outcall,Returns,Return request,Very genuine person ??????,9c7ac0e3-2c31-42d6-903e-ad11439ed523,3/8/2023 15:52,8/8/2023 15:29,8/8/2023 15:32,8-Aug-23,BALESHWAR,Electronics,899.0,460.0,Christopher Anderson,Mason Gupta,Olivia Tan,0-30,Afternoon,5
26301,150a36e8-48bc-47d2-bce6-436b65d2fab8,Outcall,Returns,Return request,Product is not best.delivery good.costmor care...,f89b47a1-c44e-48bd-98d8-037db0ee5d53,21/07/2023 20:12,11/8/2023 19:38,11/8/2023 19:53,11-Aug-23,GOHANA,Electronics,899.0,429.0,Patricia Cross,Mason Gupta,Olivia Tan,0-30,Afternoon,4
26387,1c0c98aa-2fdc-4899-af96-de7cbec1977c,Outcall,Returns,Return request,Shopzilla Is the Best,9e94dd00-4e52-4389-91c2-646027ad8680,3/8/2023 11:00,11/8/2023 12:52,11/8/2023 12:56,11-Aug-23,BERHAMPORE,Electronics,799.0,217.0,Michael Dunlap,Charlotte Suzuki,Jennifer Nguyen,On Job Training,Evening,5


In [11]:
#save the cleaned file (to_csv returns None when given a path, so do not reassign df)
df.to_csv("c:/labmentix/5th project/clean_data1.csv", index=False)
df

In [12]:
#import cleaned file
df=pd.read_csv("c:/labmentix/5th project/clean_data1.csv")
df

Unnamed: 0,Unique id,channel_name,category,Sub-category,Customer Remarks,Order_id,order_date_time,Issue_reported at,issue_responded,Survey_response_Date,Customer_City,Product_category,Item_price,connected_handling_time,Agent_name,Supervisor,Manager,Tenure Bucket,Agent Shift,CSAT Score
0,08c6a929-a403-4f14-810f-2275fe591230,Outcall,Returns,Return request,Good,0258d703-8287-428d-9e48-446e29eec3e5,29/07/2023 03:44,5/8/2023 11:11,5/8/2023 11:50,5-Aug-23,BETTIAH,Electronics,899.0,367.0,Katelyn Horton,Mason Gupta,Olivia Tan,0-30,Morning,5
1,eae37bb3-91ec-4154-ba3c-7e4a6309a946,Outcall,Returns,Return request,Not good,d5d704c1-7d8e-4573-8e9d-aec29e6d3b40,31/07/2023 20:21,4/8/2023 12:04,5/8/2023 7:02,5-Aug-23,PATNA,Electronics,899.0,604.0,Michael Ruiz,Madison Kim,John Smith,>90,Morning,2
2,296c5c93-6a4f-4fa5-b276-3feecfeaf636,Outcall,Returns,Return request,Call was helpful.,6a318938-1b06-4394-82d7-83a0d7c18f29,29/07/2023 22:52,9/8/2023 12:40,9/8/2023 12:43,9-Aug-23,AGRA,Electronics,868.0,233.0,Dillon Miller,Mason Gupta,Olivia Tan,0-30,Morning,4
3,5155d369-7bc3-4c40-9e26-6dcee58ace99,Outcall,Returns,Return request,Good service,feb6c5f8-5418-4abc-aec4-494ce0df2791,1/8/2023 13:35,8/8/2023 12:35,8/8/2023 12:39,8-Aug-23,Birbhum,Electronics,899.0,458.0,Dillon Miller,Mason Gupta,Olivia Tan,0-30,Morning,5
4,ddaae943-3535-452e-857f-591d4b9ae0c8,Outcall,Cancellation,Return cancellation,Thanks,68b1ae3c-a0f9-48c7-96f0-2e17a460814c,3/8/2023 13:42,7/8/2023 13:55,8/8/2023 9:21,8-Aug-23,BAREILLY,Electronics,799.0,362.0,Patricia Cross,Mason Gupta,Olivia Tan,0-30,Afternoon,5
5,00c52c83-d40a-4b0f-80ea-fcffe23cef76,Outcall,Returns,Return request,Good job,8f14a636-094a-4810-8d61-c14b2478c9bd,30/07/2023 18:10,8/8/2023 11:13,8/8/2023 16:01,8-Aug-23,BASTI,Electronics,699.0,384.0,Daniel Ball,Ethan Nakamura,Olivia Tan,>90,Morning,5
6,5b3a9dfe-8c26-4c5f-9824-02db5daf7ecb,Outcall,Returns,Return request,Super customer care,fe85b0a1-5abb-4f02-8750-61937e039bfd,4/8/2023 0:03,8/8/2023 11:03,8/8/2023 11:07,8-Aug-23,MADHUGIRI,Electronics,899.0,655.0,Evelyn Petersen,Olivia Wang,Olivia Tan,0-30,Morning,5
7,8d093faf-8359-494d-9955-224cf8a5de8d,Outcall,Returns,Return request,Very genuine person ??????,9c7ac0e3-2c31-42d6-903e-ad11439ed523,3/8/2023 15:52,8/8/2023 15:29,8/8/2023 15:32,8-Aug-23,BALESHWAR,Electronics,899.0,460.0,Christopher Anderson,Mason Gupta,Olivia Tan,0-30,Afternoon,5
8,150a36e8-48bc-47d2-bce6-436b65d2fab8,Outcall,Returns,Return request,Product is not best.delivery good.costmor care...,f89b47a1-c44e-48bd-98d8-037db0ee5d53,21/07/2023 20:12,11/8/2023 19:38,11/8/2023 19:53,11-Aug-23,GOHANA,Electronics,899.0,429.0,Patricia Cross,Mason Gupta,Olivia Tan,0-30,Afternoon,4
9,1c0c98aa-2fdc-4899-af96-de7cbec1977c,Outcall,Returns,Return request,Shopzilla Is the Best,9e94dd00-4e52-4389-91c2-646027ad8680,3/8/2023 11:00,11/8/2023 12:52,11/8/2023 12:56,11-Aug-23,BERHAMPORE,Electronics,799.0,217.0,Michael Dunlap,Charlotte Suzuki,Jennifer Nguyen,On Job Training,Evening,5


In [None]:
# Path to your file (as uploaded)
CSV_PATH = "C:/labmentix/5th project/clean_data1.csv"

# ------------- Utilities: load and inspect ----------------
def load_inspect(csv_path):
    print("Loading:", csv_path)
    df = pd.read_csv(csv_path)
    print("Shape:", df.shape)
    display_cols = list(df.columns[:30])
    print("Columns (first 30):", display_cols)
    print("\nSample rows:")
    print(df.head(3).T)
    print("\nColumn dtypes and missingness:")
    print(pd.DataFrame({'dtype': df.dtypes, 'missing': df.isna().mean()}).sort_values('missing', ascending=False).head(50))
    return df
    
df = load_inspect(CSV_PATH)

# ------------- Auto-detect label and feature types -------------
# Heuristic: look for csat, satisfaction, rating, score
label_candidates = [c for c in df.columns if 'csat' in c.lower() or 'satisf' in c.lower() or 'rating' in c.lower() or 'score' in c.lower()]
if len(label_candidates) == 0:
    raise ValueError("No label column found automatically. Please rename your CSAT column to include 'csat' or 'satisfaction' or 'rating' or 'score'. Columns found: " + ", ".join(df.columns))
label_col = label_candidates[0]
print("Using label column:", label_col)

# Identify text columns heuristically
text_cols = [c for c in df.columns if df[c].dtype == object and df[c].nunique() > 20]  # non-trivial textual columns
# remove likely id columns
id_like = [c for c in df.columns if 'id' in c.lower()]
text_cols = [c for c in text_cols if c not in id_like and c != label_col]
print("Detected potential text columns:", text_cols)

# Numeric and categorical features
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Remove label from numeric list
if label_col in num_cols:
    num_cols.remove(label_col)
cat_cols = [c for c in df.columns if c not in num_cols and c not in text_cols and c != label_col]

# remove id-like and timestamp-like
cat_cols = [c for c in cat_cols if 'id' not in c.lower() and 'time' not in c.lower() and 'date' not in c.lower()]

print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)

# ------------- Simple feature engineering -------------
# Example derived features depending on availability
if 'order_value' in df.columns and 'delivery_time' in df.columns:
    df['value_per_hour'] = df['order_value'] / (df['delivery_time'] + 1e-6)
    num_cols.append('value_per_hour')

# Fill missing basics now (we'll build pipelines later too)
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
for c in cat_cols:
    df[c] = df[c].fillna('MISSING')

# ------------- Split ----------------
X_all = df.drop(columns=[label_col])
y_all = df[label_col].astype(float)

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
print("Train/Test split:", X_train.shape, X_test.shape)

# ------------- Baseline: XGBoost on tabular (and TF-IDF if text present) -------------
# Build preprocessing for tabular
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# For categorical, do simple label encoding per column using mapping saved (since OneHot can explode)
def label_encode_df(df_train, df_valid, cat_columns):
    encoders = {}
    for col in cat_columns:
        le = LabelEncoder()
        le.fit(list(df_train[col].astype(str).values) + list(df_valid[col].astype(str).values))
        encoders[col] = le
        df_train[col] = le.transform(df_train[col].astype(str).values)
        df_valid[col] = le.transform(df_valid[col].astype(str).values)
    return df_train, df_valid, encoders

# For text, build TF-IDF features (top 500 features)
use_text = len(text_cols) > 0
tfidf = None
if use_text:
     # combine all text columns into a single column for simplicity
    TEXT_FIELD = "combined_review_text"
    X_train[TEXT_FIELD] = X_train[text_cols].fillna("").agg(" ".join, axis=1)
    X_test[TEXT_FIELD] = X_test[text_cols].fillna("").agg(" ".join, axis=1)
    tfidf = TfidfVectorizer(max_features=500, ngram_range=(1,2))
    # Fit the vocabulary and IDF weights on the training split only, so test text does not leak into the features
    tfidf.fit(X_train[TEXT_FIELD])
    X_train_tfidf = tfidf.transform(X_train[TEXT_FIELD]).toarray()
    X_test_tfidf = tfidf.transform(X_test[TEXT_FIELD]).toarray()
    print("TF-IDF shape:", X_train_tfidf.shape)

# Encode categorical cols
X_train_enc = X_train.copy()
X_test_enc = X_test.copy()
X_train_enc, X_test_enc, encoders = label_encode_df(X_train_enc, X_test_enc, cat_cols)

# Scale numeric columns
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train_enc[num_cols])
X_test_num = scaler.transform(X_test_enc[num_cols])

# Build final tabular matrix
if use_text:
    X_train_tab = np.hstack([X_train_num, X_train_enc[cat_cols].values, X_train_tfidf])
    X_test_tab = np.hstack([X_test_num, X_test_enc[cat_cols].values, X_test_tfidf])
else:
    X_train_tab = np.hstack([X_train_num, X_train_enc[cat_cols].values])
    X_test_tab = np.hstack([X_test_num, X_test_enc[cat_cols].values])

print("Tabular matrix shapes:", X_train_tab.shape, X_test_tab.shape)

# Train XGBoost regressor
xgb_model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6, subsample=0.8, colsample_bytree=0.8, random_state=42)
xgb_model.fit(X_train_tab, y_train, eval_set=[(X_test_tab, y_test)], verbose=50)

# Evaluate baseline
pred_xgb = xgb_model.predict(X_test_tab)
def to_numpy(y):
    """Convert pandas Series / list to numpy 1d float array."""
    if hasattr(y, "to_numpy"):
        arr = y.to_numpy()
    else:
        arr = np.array(y)
    # flatten and convert to float
    arr = np.asarray(arr).astype(float).ravel()
    return arr
    
def print_metrics(y_true, y_pred, prefix=""):
    """
    Print MAE, RMSE and R2 in a version-safe way.
    Handles pandas Series and numpy arrays.
    """
    y_true = to_numpy(y_true)
    y_pred = to_numpy(y_pred)

    # ensure same length
    if y_true.shape[0] != y_pred.shape[0]:
        raise ValueError(f"y_true and y_pred length mismatch: {y_true.shape[0]} vs {y_pred.shape[0]}")

    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)   # returns MSE
    rmse = np.sqrt(mse)                        # version-safe RMSE
    r2 = r2_score(y_true, y_pred)

    print(f"{prefix} MAE: {mae:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")

print_metrics(y_test, pred_xgb, prefix="XGBoost Baseline")

# Save baseline artifacts
joblib.dump(scaler, "deepcsat_scaler.joblib")
joblib.dump(encoders, "deepcsat_cat_encoders.joblib")
if use_text: joblib.dump(tfidf, "deepcsat_tfidf.joblib")
xgb_model.save_model("deepcsat_xgb.json")

# ------------- SHAP explainability for baseline -------------
print("Computing SHAP values (TreeExplainer); this may take a few minutes...")

# Ensure input is numpy array (SHAP needs array-like)
X_sample = np.array(X_test_tab[:500])   # use a sample for speed

# Use TreeExplainer (explicit for XGBoost)
# after training xgb_model
booster = xgb_model.get_booster()
booster.save_model("deepcsat_xgb.bin")      # binary model, not JSON
xgb_model = xgb.XGBRegressor()
xgb_model.load_model("deepcsat_xgb.bin")    # reload it

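# Compute SHAP values with TreeExplainer on the sampled rows
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_sample)

# Plot SHAP summary (show=False so the title can be set before rendering)
shap.summary_plot(shap_values, X_sample, feature_names=None, max_display=20, show=False)
plt.title("SHAP Feature Importance (Top Features)")
plt.show()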

# ------------- Multimodal deep model (tabular + text) -------------
# We'll build a Keras model that takes tabular numeric+cat features and TF-IDF text vector (dense) or a text embedding

def build_deep_model(tab_input_dim, text_input_dim=None):
    # Tabular branch
    tab_in = layers.Input(shape=(tab_input_dim,), name="tab_input")
    x = layers.Dense(128, activation='relu')(tab_in)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(64, activation='relu')(x)
    tab_out = layers.Dense(32, activation='relu')(x)

    if text_input_dim is not None:
        text_in = layers.Input(shape=(text_input_dim,), name="text_input")
        t = layers.Dense(128, activation='relu')(text_in)
        t = layers.Dropout(0.2)(t)
        t = layers.Dense(64, activation='relu')(t)
        text_out = layers.Dense(32, activation='relu')(t)
        merged = layers.concatenate([tab_out, text_out])
        input_list = [tab_in, text_in]
    else:
        merged = tab_out
        input_list = [tab_in]
    y = layers.Dense(64, activation='relu')(merged)
    y = layers.Dropout(0.2)(y)
    y = layers.Dense(32, activation='relu')(y)
    out = layers.Dense(1, activation='linear')(y)  # regression

    model = models.Model(inputs=input_list, outputs=out)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3), loss='mse', metrics=['mae'])
    return model

# For deep model, we'll use same tabular numeric+cat features (no TF-IDF expansion to keep dims reasonable)
# We already combined num + categorical indices into arrays; let's form a compact tab input
tab_train = np.hstack([X_train_num, X_train_enc[cat_cols].values])
tab_test = np.hstack([X_test_num, X_test_enc[cat_cols].values])
if use_text:
    # Option A: Use TF-IDF dense (already computed); can be high-dim
    text_train = X_train_tfidf
    text_test = X_test_tfidf
    deep_model = build_deep_model(tab_train.shape[1], text_train.shape[1])
    history = deep_model.fit([tab_train, text_train], y_train, validation_data=([tab_test, text_test], y_test),
                             epochs=50, batch_size=64, callbacks=[callbacks.EarlyStopping(patience=6, restore_best_weights=True)])
else:
    deep_model = build_deep_model(tab_train.shape[1], text_input_dim=None)
    history = deep_model.fit(tab_train, y_train, validation_data=(tab_test, y_test),
                             epochs=50, batch_size=64, callbacks=[callbacks.EarlyStopping(patience=6, restore_best_weights=True)])
# Evaluate deep model
if use_text:
    pred_deep = deep_model.predict([tab_test, text_test]).flatten()
else:
    pred_deep = deep_model.predict(tab_test).flatten()

print_metrics(y_test, pred_deep, prefix="Deep Model")

# Plot loss curves
plt.figure(figsize=(8,4))
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend(); plt.title('Deep Model Training Loss'); plt.xlabel('Epoch'); plt.show()

# Save deep model
deep_model.save("deepcsat_deep_model.h5")
joblib.dump({'num_cols': num_cols, 'cat_cols': cat_cols, 'text_cols': text_cols, 'label_col': label_col}, "deepcsat_metadata.joblib")

# ------------- Analysis: predictions vs actual and group analysis -------------
res = X_test.copy()
res['actual'] = y_test
res['xgb_pred'] = pred_xgb
res['deep_pred'] = pred_deep
res['xgb_err'] = np.abs(res['actual'] - res['xgb_pred'])
res['deep_err'] = np.abs(res['actual'] - res['deep_pred'])

print("Overall MAE (XGB):", mean_absolute_error(res['actual'], res['xgb_pred']))
print("Overall MAE (Deep):", mean_absolute_error(res['actual'], res['deep_pred']))

# Mean actual vs. predicted score by category, if Product_category exists (column name as it appears in the dataset)
if 'Product_category' in res.columns:
    cat_summary = res.groupby('Product_category')[['actual', 'xgb_pred', 'deep_pred']].mean()
    print(cat_summary.sort_values('actual').head())
    
# Save results to CSV for review
res.to_csv("deepcsat_test_results.csv", index=False)

# ------------- Quick local Flask server for predictions -------------
# Save final artifacts for serving: scaler, encoders, tfidf, deep model file already saved.
print("Artifacts saved: deepcsat_deep_model.h5, deepcsat_xgb.json, deepcsat_scaler.joblib, deepcsat_cat_encoders.joblib")
print("You can now run a Flask app (example below) to serve predictions.")
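The text branch in the cell above is fed TF-IDF vectors built from the combined remark text. As a rough, standard-library-only sketch of what such a vector contains, the snippet below weights each term by its count times a smoothed inverse document frequency and then L2-normalises each row, which is the convention scikit-learn documents as the `TfidfVectorizer` default. Tokenisation is simplified to whitespace splitting with no bigrams, so the numbers will not match the notebook's features; `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with smoothed IDF and L2-normalised rows."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        # Guard against an all-zero row (empty document)
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors

# Short remarks taken from the cleaned sample rows
remarks = ["Good", "Not good", "Good service", "Good job"]
vocab, vecs = tfidf_vectors(remarks)
for remark, vec in zip(remarks, vecs):
    print(remark, [round(w, 3) for w in vec])
```

Because "good" appears in every remark, it receives the lowest IDF, so rarer words such as "not" carry more weight within a remark.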

In [None]:
# serve_deepcsat.py
from flask import Flask, request, jsonify
import joblib
import numpy as np
import tensorflow as tf
import xgboost as xgb
import json

app = Flask(__name__)

# Load artifacts
scaler = joblib.load("deepcsat_scaler.joblib")
encoders = joblib.load("deepcsat_cat_encoders.joblib")
meta = joblib.load("deepcsat_metadata.joblib")
use_text = False
try:
    tfidf = joblib.load("deepcsat_tfidf.joblib")
    use_text = True
except FileNotFoundError:
    # No TF-IDF artifact was saved, so the models were trained without text features
    tfidf = None
    
xgb_model = xgb.XGBRegressor()
xgb_model.load_model("deepcsat_xgb.json")
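# compile=False: the model is only used for inference here, so the training loss and optimizer state are not needed
deep_model = tf.keras.models.load_model("deepcsat_deep_model.h5", compile=False)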

num_cols = meta['num_cols']
cat_cols = meta['cat_cols']
text_cols = meta['text_cols']

def preprocess_single(record):
    # record: dict of feature_name -> value
    # Build DataFrame-like row processing
    import pandas as pd
    row = {}
    for c in num_cols:
        row[c] = float(record.get(c, 0.0))
    for c in cat_cols:
        row[c] = str(record.get(c, 'MISSING'))
    df_row = pd.DataFrame([row])
    # Encode categorical
    for c in cat_cols:
        le = encoders[c]
        # The fitted LabelEncoder cannot transform categories it has not seen
        val = df_row.at[0,c]
        if val in le.classes_:
            df_row[c] = le.transform([val])[0]
        else:
            # Unseen category: fall back to code 0 (the first known class) as a crude default
            df_row[c] = 0
    # scale numeric
    X_num = scaler.transform(df_row[num_cols])
    X_tab = np.hstack([X_num, df_row[cat_cols].values])
    if use_text and text_cols:
        combined_text = " ".join([str(record.get(c, "")) for c in text_cols])
        text_vec = tfidf.transform([combined_text]).toarray()
        return X_tab, text_vec
    return X_tab, None

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.json
    X_tab, text_vec = preprocess_single(payload)
    # XGBoost
    xgb_pred = xgb_model.predict(np.hstack([X_tab, text_vec]) if text_vec is not None else X_tab)[0]
    # Deep
    if text_vec is not None:
        deep_pred = deep_model.predict([X_tab, text_vec])[0][0]
    else:
        deep_pred = deep_model.predict(X_tab)[0][0]
    return jsonify({'xgb_pred': float(xgb_pred), 'deep_pred': float(deep_pred)})

if __name__ == '__main__':
    app.run(port=5000, debug=False)
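Once the server is running, a client can call `/predict` with a JSON body. The sketch below builds such a request with the standard library only. The URL assumes the default local host with the port from `app.run(port=5000)`; `build_predict_request` and `send` are hypothetical helpers, and which fields the server actually reads depends on the column lists stored in `deepcsat_metadata.joblib`.

```python
import json
import urllib.request

# Assumed local address; matches app.run(port=5000) on the default host.
PREDICT_URL = "http://127.0.0.1:5000/predict"

def build_predict_request(record, url=PREDICT_URL):
    """Build a JSON POST request for the /predict endpoint without sending it."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

def send(request):
    """Send the request and decode the JSON reply (requires the server to be running)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Field names follow the dataset columns; the values come from a sample row.
record = {
    "Item_price": 899.0,
    "connected_handling_time": 367.0,
    "channel_name": "Outcall",
    "category": "Returns",
    "Customer Remarks": "Good",
}
request = build_predict_request(record)
print(request.get_method(), request.full_url)
# reply = send(request)  # the endpoint returns the keys 'xgb_pred' and 'deep_pred'
```

The `Content-Type: application/json` header matters because the endpoint reads the payload through `request.json`.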
