<a href="https://colab.research.google.com/github/Akash-singh45/AI_Support_Ticket_classifier/blob/main/AI_Support_Ticket_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Load and Explore the Dataset


In [30]:
# Install dependencies
!pip install openpyxl nltk textblob

# Import libraries
import pandas as pd
import numpy as np
import re
import nltk
from textblob import TextBlob

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [31]:
# Load the Excel file
file_path = '/content/ai_dev_assignment_tickets_complex_1000.xls'
df = pd.read_excel(file_path)

# Display first few rows
df.head()


Unnamed: 0,ticket_id,ticket_text,issue_type,urgency_level,product
0,1,Payment issue for my SmartWatch V2. I was unde...,Billing Problem,Medium,SmartWatch V2
1,2,Can you tell me more about the UltraClean Vacu...,General Inquiry,,UltraClean Vacuum
2,3,I ordered SoundWave 300 but got EcoBreeze AC i...,Wrong Item,Medium,SoundWave 300
3,4,Facing installation issue with PhotoSnap Cam. ...,Installation Issue,Low,PhotoSnap Cam
4,5,Order #30903 for Vision LED TV is 13 days late...,Late Delivery,,Vision LED TV


In [32]:
# Data Exploration
# Dataset shape and column info
print("Shape:", df.shape)
df.info()

# Check for nulls
print("\nMissing values:")
print(df.isnull().sum())

# View label distributions
print("\nIssue Type distribution:")
print(df['issue_type'].value_counts())

print("\nUrgency Level distribution:")
print(df['urgency_level'].value_counts())


Shape: (1000, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ticket_id      1000 non-null   int64 
 1   ticket_text    945 non-null    object
 2   issue_type     924 non-null    object
 3   urgency_level  948 non-null    object
 4   product        1000 non-null   object
dtypes: int64(1), object(4)
memory usage: 39.2+ KB

Missing values:
ticket_id         0
ticket_text      55
issue_type       76
urgency_level    52
product           0
dtype: int64

Issue Type distribution:
issue_type
Billing Problem       146
General Inquiry       146
Account Access        143
Installation Issue    142
Product Defect        121
Wrong Item            114
Late Delivery         112
Name: count, dtype: int64

Urgency Level distribution:
urgency_level
High      330
Medium    319
Low       299
Name: count, dtype: int64


In [33]:
# View some ticket text examples
for i in range(3):
    print(f"\nTicket {i+1}:")
    print(df.loc[i, 'ticket_text'])



Ticket 1:
Payment issue for my SmartWatch V2. I was underbilled for order #29224.

Ticket 2:
Can you tell me more about the UltraClean Vacuum warranty? Also, is it available in white?

Ticket 3:
I ordered SoundWave 300 but got EcoBreeze AC instead. My order number is #36824.


# Step 2: Data Cleaning & Preprocessing
We’ll do the following:

Lowercase the text

Remove special characters and numbers

Tokenize

Remove stopwords

Lemmatize


In [34]:
# Import NLP tools
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


In [35]:
# Define Preprocessing function
def preprocess_text(text):
    if pd.isnull(text):
        return ""

    # Lowercase
    text = text.lower()

    # Remove special characters and digits
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)


In [36]:
# Apply the preprocessing function
df['clean_text'] = df['ticket_text'].apply(preprocess_text)

# Show before and after
df[['ticket_text', 'clean_text']].head()


Unnamed: 0,ticket_text,clean_text
0,Payment issue for my SmartWatch V2. I was unde...,payment issue smartwatch underbilled order
1,Can you tell me more about the UltraClean Vacu...,tell ultraclean vacuum warranty also available...
2,I ordered SoundWave 300 but got EcoBreeze AC i...,ordered soundwave got ecobreeze instead order ...
3,Facing installation issue with PhotoSnap Cam. ...,facing installation issue photosnap cam setup ...
4,Order #30903 for Vision LED TV is 13 days late...,order vision led day late ordered march also c...


In [37]:
# Handle Missing Data

# Drop rows where label columns are missing
df.dropna(subset=['issue_type', 'urgency_level'], inplace=True)

# Fill missing product values with a placeholder
df['product'] = df['product'].fillna('unknown')


In [38]:
print(df['urgency_level'].value_counts(normalize=True))
# Check the urgency_level distribution to decide how to handle missing labels.
# If Medium had covered more than 50 percent of rows, I would have filled the missing values with it,
# preserving data quantity while minimizing potential label distortion.

urgency_level
High      0.345890
Medium    0.339041
Low       0.315068
Name: proportion, dtype: float64
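
The rule described in the comments above (fill missing labels with the most frequent class only when that class covers more than half of the labeled rows) can be sketched as a small helper. The name `fill_if_dominant` and its `threshold` parameter are illustrative assumptions, not part of the notebook. Since rows with missing labels were already dropped in In [37], a helper like this would have to run before that `dropna` call.

```python
import pandas as pd

def fill_if_dominant(labels: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Fill missing labels with the mode only if the mode's share exceeds threshold.

    Hypothetical helper for illustration; otherwise the series is returned unchanged.
    """
    # value_counts ignores missing values and sorts shares in descending order
    shares = labels.value_counts(normalize=True)
    if shares.empty or shares.iloc[0] <= threshold:
        return labels
    return labels.fillna(shares.index[0])

# Toy data (not taken from the ticket dataset)
balanced = pd.Series(["High", "Low", "Medium", None])
skewed = pd.Series(["Medium", "Medium", "Medium", "Low", None])

balanced_filled = fill_if_dominant(balanced)  # no dominant class: left as-is
skewed_filled = fill_if_dominant(skewed)      # Medium covers 75%: gap is filled
```

With the near-uniform distribution printed above, no class crosses the threshold, so the helper would leave the labels untouched.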


In [39]:
# Save the Cleaned Dataset.
df.to_csv('cleaned_support_tickets.csv', index=False)


# Step 3: Feature Engineering

We'll extract both:

Text-based features (like TF-IDF vectors)

Custom features (like ticket length and sentiment)

In [40]:
# 3.1: TF-IDF Vectorization (Text Features)
# We will convert the cleaned ticket text (clean_text) into numeric vectors using TF-IDF.


from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1,2))

# Fit and transform the clean_text
X_tfidf = tfidf_vectorizer.fit_transform(df['clean_text'])

# Convert to DataFrame
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


In [41]:
# 3.2: Ticket Length

# Add ticket length feature
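# Missing tickets get length 0 (str(NaN) would otherwise count as the one-word string "nan")
df['ticket_length'] = df['ticket_text'].apply(lambda x: len(x.split()) if isinstance(x, str) else 0)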


In [42]:
#3.3: Sentiment Score
# We will use TextBlob to get sentiment polarity (range: -1 to 1)

def get_sentiment(text):
    # Missing or non-string tickets get a neutral score
    if not isinstance(text, str):
        return 0.0
    return TextBlob(text).sentiment.polarity

df['sentiment'] = df['ticket_text'].apply(get_sentiment)


In [43]:
# 3.4: Combine All Features

# We now combine:
# TF-IDF features
# Ticket length and sentiment

from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Scale ticket_length and sentiment
scaler = StandardScaler()
numerical_features = scaler.fit_transform(df[['ticket_length', 'sentiment']])

# Combine TF-IDF + numerical features
X_final = hstack([X_tfidf, csr_matrix(numerical_features)])


In [44]:
# 3.5: Prepare Labels

# Encode labels
from sklearn.preprocessing import LabelEncoder

issue_encoder = LabelEncoder()
urgency_encoder = LabelEncoder()

y_issue = issue_encoder.fit_transform(df['issue_type'])
y_urgency = urgency_encoder.fit_transform(df['urgency_level'])


X_final = Features for model training

y_issue = Labels for issue type

y_urgency = Labels for urgency level

In [45]:
# Print first 10 values
print("y_issue (first 10):", y_issue[:10])
print("y_urgency (first 10):", y_urgency[:10])

y_issue (first 10): [1 6 3 2 5 3 5 4 5 4]
y_urgency (first 10): [2 2 1 2 1 0 0 2 0 0]
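
The integers printed above are only meaningful together with the encoder's class order. `LabelEncoder` sorts the class names, so the mapping can be read from its `classes_` attribute. The sketch below fits a fresh encoder on the three urgency labels instead of reusing the notebook's `urgency_encoder`; the names `demo_encoder` and `code_to_label` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Fit a throwaway encoder on the three urgency labels
demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(["Medium", "Low", "High", "Medium"])

# classes_ holds the sorted class names; the position is the integer code
code_to_label = {int(code): str(label) for code, label in enumerate(demo_encoder.classes_)}
print(code_to_label)
```

Under this ordering, the first ticket's `y_urgency` value of 2 corresponds to Medium, which matches the `urgency_level` shown for that ticket in Step 1.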


# Step 4: Model Training for Multi-Task Classification

We’ll train two separate classifiers:

Issue Type Classifier (y_issue)

Urgency Level Classifier (y_urgency)

We'll use:

Random Forest as a strong baseline

5-fold cross-validation
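
One caveat when cross-validating text features: if the vectorizer is fitted on all rows before splitting, each validation fold influences the vocabulary and IDF weights. A common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that both are refitted inside every fold. The sketch below uses toy tickets and leaves out the length and sentiment features; the texts, labels, and variable names are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy tickets (not from the dataset): 10 per class so 5 stratified folds are possible
toy_texts = (
    [f"charged twice for order {i}" for i in range(10, 20)]
    + [f"order {i} arrived late" for i in range(10, 20)]
)
toy_labels = ["Billing Problem"] * 10 + ["Late Delivery"] * 10

# The vectorizer is part of the estimator, so it only ever sees training folds
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

toy_scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=5, scoring="accuracy")
print("Fold accuracies:", toy_scores)
```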

In [46]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Initialize the classifiers
issue_clf = RandomForestClassifier(n_estimators=100, random_state=42)
urgency_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Perform 5-Fold CV for Issue Type
cv_scores_issue = cross_val_score(issue_clf, X_final, y_issue, cv=5, scoring='accuracy')
issue_clf.fit(X_final, y_issue)
print("🔹 Issue Type - CV Scores:", cv_scores_issue)
print("🔹 Issue Type - Mean Accuracy:", cv_scores_issue.mean())

# Perform 5-Fold CV for Urgency Level
cv_scores_urgency = cross_val_score(urgency_clf, X_final, y_urgency, cv=5, scoring='accuracy')
urgency_clf.fit(X_final, y_urgency)
print("\n🔹 Urgency Level - CV Scores:", cv_scores_urgency)
print("🔹 Urgency Level - Mean Accuracy:", cv_scores_urgency.mean())








🔹 Issue Type - CV Scores: [0.96590909 0.95428571 0.94857143 0.94857143 0.92571429]
🔹 Issue Type - Mean Accuracy: 0.9486103896103895

🔹 Urgency Level - CV Scores: [0.36931818 0.29714286 0.37714286 0.33142857 0.29142857]
🔹 Urgency Level - Mean Accuracy: 0.3332922077922078
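
For context on the urgency scores: with three nearly balanced classes, always predicting the most frequent class already scores about one third. The sketch below measures that baseline with scikit-learn's `DummyClassifier`. The class counts are an assumption, reconstructed from the proportions printed in In [38] for 876 labeled rows (303 High, 276 Low, 297 Medium), and the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Assumed class counts, encoded as 0=High, 1=Low, 2=Medium
y_demo = np.array([0] * 303 + [1] * 276 + [2] * 297)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

# Always predict the majority class seen in the training folds
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")
baseline_mean = baseline_scores.mean()
print("Majority-class baseline:", round(baseline_mean, 3))
```

A mean accuracy close to this baseline, as in the urgency results above, suggests that the current features carry little signal about urgency in this dataset.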


# Step 5: Entity Extraction from Ticket Text
We'll extract three things from ticket_text:

Product Names

Dates

Complaint Keywords like broken, error, late, etc.



In [47]:
# 5.1: Define Lists and Regex Patterns

complaint_keywords = [
    'broken', 'late', 'error', 'issue', 'not working', 'damaged',
    'faulty', 'defective', 'missing', 'not delivered', 'not received',
    'never arrived', 'cancel', 'didn’t come', 'no delivery',
    'shattered', 'not turning on', 'wrong item', 'charged twice',
    'refund', 'missing parts', 'didn’t include', 'delayed', 'replacement'
]



# Create product list from your dataset (unique entries except 'unknown')
product_list = df['product'].unique().tolist()
product_list = [p.lower() for p in product_list if p.lower() != 'unknown']

date_pattern = r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b'
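
The pattern above only matches numeric dates such as 12/03/2024 or 2024-03-12. If tickets also write dates with month names (the cleaned-text preview in Step 2 contains "ordered march"), a broader pattern would be needed. The sketch below is one possible extension; `MONTHS`, `extended_date_pattern`, and `find_dates` are illustrative names, and the date formats that actually occur in the dataset are an assumption.

```python
import re

# Month names with optional abbreviations (all groups are non-capturing)
MONTHS = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|"
    r"jul(?:y)?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)

extended_date_pattern = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"                                         # 2024-03-12
    r"|\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"                            # 12/03/2024, 12-03-24
    rf"|\d{{1,2}}(?:st|nd|rd|th)?\s+{MONTHS}(?:\s+\d{{4}})?"     # 12 march, 12th march 2024
    rf"|{MONTHS}\s+\d{{1,2}}(?:st|nd|rd|th)?(?:,?\s+\d{{4}})?"   # march 12, march 12, 2024
    r")\b",
    re.IGNORECASE,
)

def find_dates(text: str) -> list[str]:
    """Return every date-like substring found in text."""
    return extended_date_pattern.findall(text)

print(find_dates("I ordered on 12 March and it is late."))
```

A pattern like this can still produce false positives (for example "may 5" in "you may 5-star this"), so it is worth checking the matches against real tickets.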



In [48]:
# 5.2: Define Extraction Function
def extract_entities(text):
    if not isinstance(text, str):  # Defensive check
        return {
            'products': [],
            'dates': [],
            'complaints': []
        }

    text = text.lower()

    # Extract complaint keywords
    keywords_found = []
    for word in complaint_keywords:
        if word in text:
            keywords_found.append(word)

    # Extract dates using regex (re is already imported in Step 1)
    dates_found = re.findall(date_pattern, text)

    # Extract product names
    products_found = []
    for prod in product_list:
        if prod in text:
            products_found.append(prod)

    return {
        'products': products_found,
        'dates': dates_found,
        'complaints': keywords_found
    }
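
Because the loop above uses plain substring tests (`word in text`), a keyword such as 'late' also fires inside unrelated words like 'related'. One possible refinement, sketched below with an abbreviated keyword list, compiles the keywords into a single regex with word boundaries. The names `KEYWORDS`, `keyword_pattern`, and `find_complaints` are illustrative and not part of the notebook.

```python
import re

# Abbreviated keyword list for illustration
KEYWORDS = ["broken", "late", "error", "not working", "wrong item", "refund"]

# Longest first so multi-word phrases win over any shorter overlapping keyword
_alternatives = "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True))
keyword_pattern = re.compile(rf"\b(?:{_alternatives})\b", re.IGNORECASE)

def find_complaints(text: str) -> list[str]:
    """Return distinct keywords in order of first appearance."""
    seen = []
    for match in keyword_pattern.finditer(text):
        keyword = match.group(0).lower()
        if keyword not in seen:
            seen.append(keyword)
    return seen

print(find_complaints("Delivery is late and the screen is broken. Late again!"))
```

The trade-off is that word boundaries also stop matching inflected forms such as "refunded", which the substring test would catch, so the keyword list may need those variants added explicitly.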




In [49]:
# 5.3: Test the Entity Extractor

# Example usage on one ticket
sample_text = df['ticket_text'].iloc[0]
entities = extract_entities(sample_text)

print("Ticket Text:\n", sample_text)
print("\nExtracted Entities:\n", entities)

Ticket Text:
 Payment issue for my SmartWatch V2. I was underbilled for order #29224.

Extracted Entities:
 {'products': ['smartwatch v2'], 'dates': [], 'complaints': ['issue']}


In [50]:
# 5.4: Apply to Entire Dataset (Optional Preview)

df['extracted_entities'] = df['ticket_text'].apply(extract_entities)
df[['ticket_text', 'extracted_entities']].head()

Unnamed: 0,ticket_text,extracted_entities
0,Payment issue for my SmartWatch V2. I was unde...,"{'products': ['smartwatch v2'], 'dates': [], '..."
2,I ordered SoundWave 300 but got EcoBreeze AC i...,"{'products': ['soundwave 300', 'ecobreeze ac']..."
3,Facing installation issue with PhotoSnap Cam. ...,"{'products': ['photosnap cam'], 'dates': [], '..."
5,Can you tell me more about the PhotoSnap Cam w...,"{'products': ['photosnap cam'], 'dates': [], '..."
6,is malfunction. It stopped working after just...,"{'products': [], 'dates': [], 'complaints': []}"


# Step 6: Integration Function
We'll build a single function that takes raw ticket_text and returns:

Predicted issue type

Predicted urgency level

Extracted entities (product names, dates, complaint keywords)



In [51]:
# 6.1: Full predict_ticket() Function

def predict_ticket(text):
    # Step 1: Preprocess input text
    clean = preprocess_text(text)
    length = len(text.split())
    sentiment = get_sentiment(text)

    # Step 2: Transform with TF-IDF
    tfidf_vector = tfidf_vectorizer.transform([clean])

    # Step 3: Scale numeric features
    scaled_numeric = scaler.transform([[length, sentiment]])

    # Step 4: Combine all features (hstack is already imported in Step 3)
    final_input = hstack([tfidf_vector, scaled_numeric])

    # Step 5: Predict issue type and urgency
    predicted_issue = issue_encoder.inverse_transform(issue_clf.predict(final_input))[0]
    predicted_urgency = urgency_encoder.inverse_transform(urgency_clf.predict(final_input))[0]

    # Step 6: Extract entities
    entities = extract_entities(text)

    # Step 7: Return result
    return {
        'predicted_issue_type': predicted_issue,
        'predicted_urgency_level': predicted_urgency,
        'extracted_entities': entities
    }


In [52]:
# 6.2: Test the Function

sample_text = df['ticket_text'].iloc[0]
result = predict_ticket(sample_text)

print("Raw Ticket:\n", sample_text)
print("\n Prediction & Extraction:")
print(result)

Raw Ticket:
 Payment issue for my SmartWatch V2. I was underbilled for order #29224.

 Prediction & Extraction:
{'predicted_issue_type': 'Billing Problem', 'predicted_urgency_level': 'Medium', 'extracted_entities': {'products': ['smartwatch v2'], 'dates': [], 'complaints': ['issue']}}




# Step 7: Gradio Web App Setup

In [None]:

import gradio as gr

# Wrap your existing function
def gradio_interface(ticket_text):
    result = predict_ticket(ticket_text)

    return (
        result['predicted_issue_type'],
        result['predicted_urgency_level'],
        result['extracted_entities']
    )


# Create Gradio app
iface = gr.Interface(
    fn=gradio_interface,
    inputs=gr.Textbox(lines=5, placeholder="Paste ticket text here..."),
    outputs=[
        gr.Text(label="Predicted Issue Type"),
        gr.Text(label="Predicted Urgency Level"),
        gr.JSON(label="Extracted Entities")
    ],
    title="📩 Ticket Classifier & Entity Extractor",
    description="Paste a support ticket. Get issue type, urgency prediction, and extracted entities.",
)

# Launch app
iface.launch(debug=True)


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://503660beb2fd5886d1.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


