# Customer Satisfaction Prediction
 
*By Namitha S Paragi – Data Science Project*


Objective:
Analyze customer support ticket data and build a machine learning model to predict Customer Satisfaction Rating (1–5) based on customer demographics, product type, ticket details, and service response times.

Goal:  
To help the company identify patterns that lead to low satisfaction and improve customer service quality.

Tech Stack:  
- Python (Pandas, NumPy, Matplotlib, Seaborn)  
- Scikit-learn (RandomForest, preprocessing, pipelines)  
- vaderSentiment (sentiment scores for ticket descriptions)  
- Jupyter Notebook


In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [5]:
# Load dataset
data = pd.read_csv("customer_support_tickets.csv")

# Print number of rows and columns
print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])

data.head()



Number of rows: 8469
Number of columns: 17


Unnamed: 0,Ticket ID,Customer Name,Customer Email,Customer Age,Customer Gender,Product Purchased,Date of Purchase,Ticket Type,Ticket Subject,Ticket Description,Ticket Status,Resolution,Ticket Priority,Ticket Channel,First Response Time,Time to Resolution,Customer Satisfaction Rating
0,1,Marisa Obrien,carrollallison@example.com,32,Other,GoPro Hero,2021-03-22,Technical issue,Product setup,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Social media,2023-06-01 12:15:36,,
1,2,Jessica Rios,clarkeashley@example.com,42,Female,LG Smart TV,2021-05-22,Technical issue,Peripheral compatibility,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Chat,2023-06-01 16:45:38,,
2,3,Christopher Robbins,gonzalestracy@example.com,48,Other,Dell XPS,2020-07-14,Technical issue,Network problem,I'm facing a problem with my {product_purchase...,Closed,Case maybe show recently my computer follow.,Low,Social media,2023-06-01 11:14:38,2023-06-01 18:05:38,3.0
3,4,Christina Dillon,bradleyolson@example.org,27,Female,Microsoft Office,2020-11-13,Billing inquiry,Account access,I'm having an issue with the {product_purchase...,Closed,Try capital clearly never color toward story.,Low,Social media,2023-06-01 07:29:40,2023-06-01 01:57:40,3.0
4,5,Alexander Carroll,bradleymark@example.com,67,Female,Autodesk AutoCAD,2020-02-04,Billing inquiry,Data loss,I'm having an issue with the {product_purchase...,Closed,West decision evidence bit.,Low,Email,2023-06-01 00:12:42,2023-06-01 19:53:42,1.0


# Initial Exploration
Below we explore the structure of the dataset, its columns, and missing values.


In [None]:
# Check info and missing values
data.info()
data.isna().sum()
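
The raw counts from `isna().sum()` are easier to compare when shown next to percentages. The following is a minimal sketch on a small made-up frame (the `toy` rows are illustrative, not taken from the full dataset); the same pattern should apply to `data`.

```python
import pandas as pd

# Toy frame standing in for the ticket data (values are illustrative only)
toy = pd.DataFrame({
    "Ticket Status": ["Open", "Closed", "Closed", "Pending Customer Response"],
    "Time to Resolution": [None, "2023-06-01 18:05:38", "2023-06-01 01:57:40", None],
    "Customer Satisfaction Rating": [None, 3.0, 3.0, None],
})

# Count and percentage of missing values per column, most incomplete first
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

Columns that are only filled in for closed tickets, such as the rating, would show up at the top of this summary.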


In [None]:
# Data Cleaning and Time Feature Engineering

# Work on a copy so the raw `data` frame stays available for the cells below
df = data.copy()

# Convert time columns to datetime
df['Date of Purchase'] = pd.to_datetime(df['Date of Purchase'], errors='coerce')
df['First Response Time'] = pd.to_datetime(df['First Response Time'], errors='coerce')
df['Time to Resolution'] = pd.to_datetime(df['Time to Resolution'], errors='coerce')

# Calculate response and resolution durations.
# Note: the columns shown above include no ticket-creation timestamp, so the
# "response duration" is measured from the purchase date.
df['Response_Duration_Minutes'] = (df['First Response Time'] - df['Date of Purchase']).dt.total_seconds() / 60
df['Resolution_Duration_Hours'] = (df['Time to Resolution'] - df['First Response Time']).dt.total_seconds() / 3600

# Clip negatives and fill missing with median
# (assign the result back instead of using inplace=True on a column)
df['Response_Duration_Minutes'] = df['Response_Duration_Minutes'].clip(lower=0)
df['Resolution_Duration_Hours'] = df['Resolution_Duration_Hours'].clip(lower=0)
df['Response_Duration_Minutes'] = df['Response_Duration_Minutes'].fillna(df['Response_Duration_Minutes'].median())
df['Resolution_Duration_Hours'] = df['Resolution_Duration_Hours'].fillna(df['Resolution_Duration_Hours'].median())

# Add ticket day/hour features (derived from the purchase date; the sample rows
# show dates without a time of day, so the hour is likely 0 for every row)
df['Ticket_Day_of_Week'] = df['Date of Purchase'].dt.day_name()
df['Ticket_Hour_of_Day'] = df['Date of Purchase'].dt.hour

print("✅ Time features created successfully!")
df[['Response_Duration_Minutes','Resolution_Duration_Hours','Ticket_Day_of_Week','Ticket_Hour_of_Day']].head()
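
To see what the clipping step does, here is a small sketch that uses the `First Response Time` and `Time to Resolution` values of the three closed tickets from the preview output above. The column names `raw_hours`, `clipped_hours`, and `resolved_before_response` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Timestamps of the three closed tickets shown in the data preview
sample = pd.DataFrame({
    "First Response Time": ["2023-06-01 11:14:38", "2023-06-01 07:29:40", "2023-06-01 00:12:42"],
    "Time to Resolution": ["2023-06-01 18:05:38", "2023-06-01 01:57:40", "2023-06-01 19:53:42"],
})
for col in ["First Response Time", "Time to Resolution"]:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Resolution duration in hours, before and after clipping negatives to 0
raw_hours = (sample["Time to Resolution"] - sample["First Response Time"]).dt.total_seconds() / 3600
sample["raw_hours"] = raw_hours
sample["clipped_hours"] = raw_hours.clip(lower=0)

# Flag rows where the resolution timestamp precedes the first response
sample["resolved_before_response"] = raw_hours < 0

print(sample[["raw_hours", "clipped_hours", "resolved_before_response"]])
```

The second ticket has a resolution timestamp earlier than its first response, so its duration is negative before clipping. Keeping a flag like this next to the clipped value is one way to let a model distinguish inconsistent timestamps from genuinely fast resolutions.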


In [None]:
# Convert date columns
date_cols = ['Date of Purchase', 'First Response Time', 'Time to Resolution']
for col in date_cols:
    data[col] = pd.to_datetime(data[col], errors='coerce')


In [None]:
# Remove rows where target (Customer Satisfaction Rating) is missing
data = data.dropna(subset=['Customer Satisfaction Rating'])


In [None]:
# Fill missing values
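# Assign the result back: calling fillna(..., inplace=True) on a selected
# column may not update `data` in recent pandas versions
for col in data.select_dtypes(include=['number']).columns:
    data[col] = data[col].fillna(data[col].median())

for col in data.select_dtypes(include=['object']).columns:
    data[col] = data[col].fillna("missing")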


In [None]:
# Label encode categorical variables
le = LabelEncoder()
for col in data.select_dtypes(include=['object']).columns:
    data[col] = le.fit_transform(data[col].astype(str))
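
Reusing a single `LabelEncoder` across columns works for encoding, but the mapping of every column except the last is lost. A minimal sketch of one alternative, assuming a dictionary named `encoders` (a hypothetical helper, not part of the original notebook) that keeps one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns (values are illustrative)
toy = pd.DataFrame({
    "Ticket Priority": ["Critical", "Low", "Low", "High"],
    "Ticket Channel": ["Social media", "Chat", "Email", "Chat"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# Codes follow the sorted order of the labels, not any business ordering
priority_labels = list(encoders["Ticket Priority"].classes_)
print(priority_labels)

# The stored encoder can translate codes back to the original labels
decoded = encoders["Ticket Priority"].inverse_transform(encoded["Ticket Priority"])
print(list(decoded))
```

Because the codes are assigned alphabetically, "Critical" receives a lower code than "Low", so the integers should not be read as a severity scale.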


In [None]:
# Drop unnecessary columns
X = data.drop(['Ticket ID', 'Customer Name', 'Customer Email', 'Customer Satisfaction Rating'], axis=1)
y = data['Customer Satisfaction Rating']

# Create Binary Target Column (Satisfied vs Unsatisfied)

# 1 if rating >= 4 (Satisfied), else 0 (Unsatisfied)
# .copy() avoids pandas warnings about assigning to a filtered view
df = df[df['Customer Satisfaction Rating'].notna()].copy()
df['is_satisfied'] = (df['Customer Satisfaction Rating'] >= 4).astype(int)

print(df['is_satisfied'].value_counts())
print("Binary target created successfully!")



In [None]:
# Convert datetime columns to numeric
for col in X.select_dtypes(include=['datetime64']).columns:
    # astype replaces Series.view, which is deprecated in recent pandas versions
    X[col] = X[col].astype('int64') // 10**9
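
The integer cast above assumes nanosecond timestamps with no missing values. As a hedged alternative, the sketch below converts timestamps to seconds since the Unix epoch by subtracting the epoch and dividing by a one-second `Timedelta`; the `stamps` values are illustrative.

```python
import pandas as pd

# Example timestamps, including one missing value
stamps = pd.Series(pd.to_datetime(
    ["1970-01-02 00:00:00", "2023-06-01 12:15:36", None], errors="coerce"
))

# Seconds since the Unix epoch; a missing timestamp stays missing (NaN)
# rather than being cast to an integer
epoch_seconds = (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Round trip back to timestamps to confirm the conversion
restored = pd.to_datetime(epoch_seconds, unit="s")

print(epoch_seconds)
print(restored)
```

This form does not depend on the stored time resolution, and it leaves the decision about how to fill missing timestamps explicit.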


# Exploratory Data Analysis (EDA)
Let's visualize key patterns and relationships in the dataset.


In [None]:
# Distribution of Satisfaction Ratings
sns.countplot(x='Customer Satisfaction Rating', data=data, palette='viridis')
plt.title('Distribution of Customer Satisfaction Ratings')
plt.show()


In [None]:
plt.figure(figsize=(10,6))
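# numeric_only=True skips the datetime columns, which cannot be correlated directly
sns.heatmap(data.corr(numeric_only=True), cmap='coolwarm')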
plt.title('Correlation Heatmap')
plt.show()


In [None]:
# Age vs Satisfaction
sns.boxplot(x='Customer Satisfaction Rating', y='Customer Age', data=data, palette='magma')
plt.title('Customer Age vs Satisfaction')
plt.show()


In [None]:
# Ticket Priority vs Satisfaction
sns.barplot(x='Ticket Priority', y='Customer Satisfaction Rating', data=data)
plt.title('Ticket Priority vs Average Satisfaction Rating')
plt.show()


In [None]:
# Sentiment Score from Ticket Description

!pip install vaderSentiment --quiet
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Apply sentiment analysis to the description
df['description_sentiment'] = df['Ticket Description'].fillna('').apply(
    lambda x: analyzer.polarity_scores(x)['compound']
)

print(" Sentiment scores added successfully!")
df[['Ticket Description','description_sentiment']].head()


In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)


In [None]:
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [None]:
# Train Random Forest model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)



In [None]:
# Evaluate model
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
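
An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic ratings (the `y_true` values are randomly generated for illustration, not the notebook's data) to show what always predicting the most frequent class scores when five classes are roughly balanced.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic ratings 1..5, roughly uniform (stand-in for the real target)
rng = np.random.default_rng(42)
y_true = rng.integers(1, 6, size=1000)

# DummyClassifier ignores the features, so a placeholder column is enough
X_dummy = np.zeros((len(y_true), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))

print("Baseline accuracy:", baseline_acc)
```

In the notebook itself, the same baseline could be fitted on `X_train` and `y_train` and scored on `X_test`, so that the Random Forest accuracy is compared against a number from the same split.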




# Key Insights from Model
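- The Random Forest model's accuracy on the five-class rating task is low (about 0.18, see the Conclusion), close to the 0.20 that random guessing would give if the five ratings are roughly evenly distributed. This points to high class overlap in the available features.
- Time to Resolution and First Response Time appear most influential in the feature importance plot, though importances from a model performing near chance should be read with caution.
- With better predictive performance, a model of this kind could help identify which tickets may lead to low satisfaction.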


In [None]:
# Feature importance plot
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.nlargest(10).plot(kind='barh', color='skyblue')
plt.title('Top 10 Important Features')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()


In [None]:
# Model Pipeline & Evaluation

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Select features
num_cols = ['Customer Age','Response_Duration_Minutes','Resolution_Duration_Hours','description_sentiment']
cat_cols = ['Ticket Type','Ticket Priority','Ticket Channel','Ticket_Day_of_Week']
text_col = 'Ticket Description'

# Split data
X = df[num_cols + cat_cols + [text_col]]
y = df['is_satisfied']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.2)

# Pipelines
num_pipe = Pipeline([('scaler', StandardScaler())])
cat_pipe = Pipeline([('ohe', OneHotEncoder(handle_unknown='ignore'))])
text_pipe = Pipeline([('tfidf', TfidfVectorizer(max_features=3000, stop_words='english'))])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols),
    ('text', text_pipe, text_col)
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('clf', RandomForestClassifier(class_weight='balanced', random_state=42))
])

# Train
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))

print("✅ Model trained and evaluated successfully!")


# Conclusion

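- Built a Random Forest model to predict Customer Satisfaction Ratings (1–5) from support ticket data.
- The five-class Random Forest model achieved an accuracy of about 0.18 (0.1805), close to the 0.20 expected from random guessing if the five ratings are roughly evenly distributed, so the features carry little signal for the exact rating.
- Feature importance ranked Time to Resolution, First Response Time, and Ticket Priority highest, although at this accuracy level the ranking should be interpreted with caution.
- Added a sentiment score, duration features, and a binary satisfied/unsatisfied target (rating >= 4) as a second modeling approach intended to improve on the five-class results.
- With stronger predictive performance, a model like this could help the support team focus on the areas that drive dissatisfaction.

- Future improvements:
  - Apply more advanced NLP to ticket descriptions, beyond the TF-IDF and sentiment features used here.
  - Apply regression or deep learning methods for better accuracy.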
