●	Module 1: Data Collection and Preprocessing

○	Collect a diverse dataset of past user issues and their corresponding structured ticket data.

○	Clean and preprocess the text data to handle inconsistencies, spelling errors, and irrelevant information.

○	Annotate the dataset with labels for ticket categories, priorities, and named entities.

	Milestone 1: Data Preparation & Annotation
○	Objective: Collect and prepare a clean, annotated dataset.

○	Tasks: Collect a diverse set of historical ticket data; clean and normalize text; manually annotate a portion of the data for training.


In [1]:
%pip install torch --quiet

import pandas as pd
import re
import torch 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

Note: you may need to restart the kernel to use updated packages.


In [2]:
# --- STEP 1: LOAD DATA ---
# Load the IT tickets CSV (~47.8k records; columns: Document, Topic_group)
df = pd.read_csv(r'D:\AI-Powered Ticket Creation & Categorization\Kaggle Dataset\all_tickets_processed_improved_v3.csv')
display(df.head())

Unnamed: 0,Document,Topic_group
0,connection with icon icon dear please setup ic...,Hardware
1,work experience user work experience user hi w...,Access
2,requesting for meeting requesting meeting hi p...,Hardware
3,reset passwords for external accounts re expir...,Access
4,mail verification warning hi has got attached ...,Miscellaneous


In [3]:
df.shape

(47837, 2)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47837 entries, 0 to 47836
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Document     47837 non-null  object
 1   Topic_group  47837 non-null  object
dtypes: object(2)
memory usage: 747.6+ KB


In [5]:
df.describe(include='all')

Unnamed: 0,Document,Topic_group
count,47837,47837
unique,47837,8
top,connection with icon icon dear please setup ic...,Hardware
freq,1,13617


In [6]:
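# --- STEP 2: LIGHT CLEANING ---
# BERT benefits from sentence structure, so we strip only noise:
# HTML tags, email addresses, URLs, and symbols other than ! ? .
def bert_cleaning(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)        # HTML tags (replaced by a space so adjacent words stay apart)
    text = re.sub(r'\S+@\S+', '', text)         # email addresses
    text = re.sub(r'http\S+', '', text)         # URLs
    text = re.sub(r'[^a-z0-9!?.\s]', '', text)  # keep letters, digits, ! ? . and whitespace
    return re.sub(r'\s+', ' ', text).strip()    # collapse tabs, newlines, and repeated spaces

df['clean_text'] = df['Document'].apply(bert_cleaning)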
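
As a quick sanity check, the cleaning steps can be exercised on a single made-up ticket. The snippet below is a self-contained sketch: `clean_ticket_text` is a hypothetical stand-in that mirrors the regex steps above, and the sample text, email address, and URL are invented for illustration.

```python
import re

def clean_ticket_text(text):
    """Hypothetical stand-in mirroring the light-cleaning steps (an assumption for this demo)."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)        # drop HTML tags
    text = re.sub(r'\S+@\S+', ' ', text)        # drop email addresses
    text = re.sub(r'http\S+', ' ', text)        # drop URLs
    text = re.sub(r'[^a-z0-9!?.\s]', '', text)  # keep letters, digits, ! ? . and whitespace
    return re.sub(r'\s+', ' ', text).strip()    # collapse whitespace

# Invented sample ticket containing markup, an address, and a link
sample = ("Hi <b>Team</b>, VPN is DOWN!  Contact john.doe@example.com "
          "or see https://intranet.example.com/help?id=42 ASAP.")
cleaned = clean_ticket_text(sample)
print(cleaned)  # hi team vpn is down! contact or see asap.
```

Note that the comma is removed while `!` and `.` survive, which matches the intent of keeping only the basic sentence marks.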

In [7]:
# --- STEP 3: LABEL ENCODING ---
# Convert 'Hardware', 'Access', etc., into numbers 0-7
df['label'] = df['Topic_group'].astype('category').cat.codes
num_labels = df['label'].nunique()
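
Because `cat.codes` returns only integers, it can help to keep the code-to-name mapping around for decoding predictions later. The sketch below uses a small invented Series rather than the real dataset; `id2label` and `label2id` are illustrative names, not variables from the notebook. For string labels, pandas orders the categories alphabetically, which is why 'Access' maps to 0 in the output further down.

```python
import pandas as pd

# Toy stand-in for df['Topic_group'] (invented values)
topics = pd.Series(["Hardware", "Access", "Miscellaneous", "Access"])
as_category = topics.astype("category")

# Categories are sorted alphabetically; the code is the position in that order
id2label = dict(enumerate(as_category.cat.categories))
label2id = {name: idx for idx, name in id2label.items()}
codes = as_category.cat.codes.tolist()

print(id2label)  # {0: 'Access', 1: 'Hardware', 2: 'Miscellaneous'}
print(codes)     # [1, 0, 2, 0]
```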

In [8]:
# --- STEP 4: CLASS WEIGHTS (COMPENSATE FOR CLASS IMBALANCE) ---
y_labels = df['label'].values
weights = compute_class_weight(class_weight='balanced', 
                               classes=np.unique(y_labels), 
                               y=y_labels)

# Use the GPU when available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class_weights = torch.tensor(weights, dtype=torch.float).to(device)

print(f"Milestone 1 running on: {device}")

Milestone 1 running on: cpu
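
To make the 'balanced' setting less of a black box: scikit-learn documents it as n_samples / (n_classes * count_of_class), so rare classes receive larger weights. The following is a minimal pure-Python sketch of that formula on invented labels; `balanced_class_weights` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}

# Invented, deliberately imbalanced labels: 6 of class 0, 3 of class 1, 1 of class 2
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = balanced_class_weights(toy_labels)

for cls, weight in toy_weights.items():
    print(cls, round(weight, 3))
```

In a PyTorch training loop, a weight tensor like `class_weights` is typically passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on rare classes contribute more to the loss.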


In [9]:
# --- STEP 5: SPLIT DATA ---
train_df, test_df = train_test_split(df, test_size=0.15, random_state=42)
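
With imbalanced classes, a plain random split can leave rare categories under-represented in the test set. One option, shown here only as a sketch on invented data, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented toy data: 80 tickets of class 0 and 20 of class 1
texts = [f"ticket {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.15, random_state=42, stratify=labels
)

# Both parts keep the 80/20 class ratio
print(Counter(train_y), Counter(test_y))
```

In this notebook, the equivalent change would be passing `stratify=df['label']` to the existing call; whether it changes the final metrics noticeably would need to be checked on the real data.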

In [10]:
display(df.head())

Unnamed: 0,Document,Topic_group,clean_text,label
0,connection with icon icon dear please setup ic...,Hardware,connection with icon icon dear please setup ic...,3
1,work experience user work experience user hi w...,Access,work experience user work experience user hi w...,0
2,requesting for meeting requesting meeting hi p...,Hardware,requesting for meeting requesting meeting hi p...,3
3,reset passwords for external accounts re expir...,Access,reset passwords for external accounts re expir...,0
4,mail verification warning hi has got attached ...,Miscellaneous,mail verification warning hi has got attached ...,5


In [11]:
print(''' 
Traditional steps like stemming and stop-word 
removal were intentionally skipped because BERT is a contextual model.
Keeping the full sentence structure allows the model to 
leverage its pre-trained linguistic knowledge more 
effectively than if we had reduced the text to just keywords.
      ''')

 
Traditional steps like stemming and stop-word 
removal were intentionally skipped because BERT is a contextual model.
Keeping the full sentence structure allows the model to 
leverage its pre-trained linguistic knowledge more 
effectively than if we had reduced the text to just keywords.
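
A small illustration of why keyword-style reduction can hurt: with a typical stop-word list, a ticket and its negation collapse to the same text. The stop-word set below is a short hypothetical list chosen for the demo, not taken from any particular library.

```python
# Short hypothetical stop-word list (real lists are much longer)
STOP_WORDS = {"i", "can", "not", "the", "to", "is", "a"}

def remove_stop_words(text):
    """Drop every word that appears in STOP_WORDS."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)

working = remove_stop_words("I can access the VPN")
failing = remove_stop_words("I can not access the VPN")

print(working)  # access vpn
print(failing)  # access vpn
```

Both sentences reduce to the same two keywords, so the information that distinguishes a working system from a failing one is gone before the model ever sees it.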
      


In [12]:
print('''
Preprocessing steps intentionally skipped, and why:

Removing Stop Words

Reason: BERT uses attention to model the relationships
between all words in a sentence. Removing stop words destroys
grammatical context, hindering BERT's ability to
grasp the user's true intent.

Stemming or Lemmatization

Reason: BERT uses WordPiece tokenization, which
natively breaks words into sub-units (e.g., "flicker" and "##ing").
Manual reduction is unnecessary and loses the linguistic
nuance of the original input.

Heavy Punctuation Removal

Reason: Basic marks like ! and ? were preserved because they
can signal sentiment and urgency
(e.g., distinguishing a "critical" from a "general" query).

SMOTE (Synthetic Over-sampling)

Reason: Rather than generating synthetic data, class weights
were applied to the loss function. This is generally a better
fit for high-dimensional text and avoids overfitting to
artificial examples.
      ''')


Preprocessing steps intentionally skipped, and why:

Removing Stop Words

Reason: BERT uses attention to model the relationships
between all words in a sentence. Removing stop words destroys
grammatical context, hindering BERT's ability to
grasp the user's true intent.

Stemming or Lemmatization

Reason: BERT uses WordPiece tokenization, which
natively breaks words into sub-units (e.g., "flicker" and "##ing").
Manual reduction is unnecessary and loses the linguistic
nuance of the original input.

Heavy Punctuation Removal

Reason: Basic marks like ! and ? were preserved because they
can signal sentiment and urgency
(e.g., distinguishing a "critical" from a "general" query).

SMOTE (Synthetic Over-sampling)

Reason: Rather than generating synthetic data, class weights
were applied to the loss function. This is generally a better
fit for high-dimensional text and avoids overfitting to
artificial examples.
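
The WordPiece behavior mentioned above can be sketched in a few lines. This is a simplified greedy longest-match-first splitter over a tiny invented vocabulary; `wordpiece_tokenize` and `toy_vocab` are hypothetical, and a real BERT tokenizer with its full vocabulary may split the same words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return [unk_token]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Tiny invented vocabulary for the demo
toy_vocab = {"flicker", "screen", "##ing", "##s", "##ed"}

print(wordpiece_tokenize("flickering", toy_vocab))  # ['flicker', '##ing']
print(wordpiece_tokenize("screens", toy_vocab))     # ['screen', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Because inflected forms already share a stem piece, a separate stemming pass adds little and only discards the suffix information.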
      
