# Incident ticket classification
## Multinomial classification with supervised machine learning

### The problem
When a user submits an incident ticket, it enters the support system via email, phone, or an embedded portal. Each ticket contains a short text description of the problem or request, and based on this information it **should be** routed to the correct assignment team, where the ticket gets resolved.
However, roughly 30-40% of incident tickets are not routed to the right team, which results in delays, higher costs, and dissatisfied users.

### The solution

We will build an automated ticket classification system that takes a ticket as input and predicts its category and, therefore, the team it should be routed to. For this, we will use a small dataset of 3000 tickets, each labeled with a category. We will extract the text and category from each ticket and train a model to predict the category from the text.

In [None]:
import os
import string
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
from matplotlib import pyplot as plt

# NLTK for pre-processing
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Spacy for pre-processing
import spacy
from spacy.lang.en import English

# Import classifiers
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
# Feature extractors
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
# ENGLISH_STOP_WORDS is exposed publicly here; avoid the private _stop_words module
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.base import TransformerMixin
# Performance evaluation and helper functions
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

In [None]:
import warnings
warnings.filterwarnings('ignore') 

Uncomment if you need to upload the data to Google Colab.

In [None]:
# from google.colab import files

# uploaded = files.upload()

# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

## Read and explore the data

In [None]:
tickets = pd.read_csv('ticket_data.csv')

In [None]:
tickets.head()

In [None]:
tickets['Category'].unique()

In [None]:
# Eliminate categories with fewer than 100 tickets
column = "Category"
min_tickets = 100
# transform('size') gives each row the ticket count of its own category
category_sizes = tickets.groupby(column)[column].transform('size')
ticket_categories = tickets[category_sizes >= min_tickets]
# Print the number of remaining categories and the shape of the filtered data
print("Categories:", ticket_categories[column].nunique())
print("Shape:", ticket_categories.shape)

# Plot the number of tickets per category
category_counts = ticket_categories[column].value_counts()
fig = plt.figure(figsize=(10,6))
sns.barplot(x=category_counts.index, y=category_counts.values)
plt.xticks(rotation=20)
plt.show()
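
The intended filter (keep only categories with enough tickets) can be sketched on a toy DataFrame. The category names, descriptions, and the threshold `toy_min` below are invented for illustration and are not taken from `ticket_data.csv`.

```python
import pandas as pd

# Hypothetical toy data: category names and threshold are assumptions.
toy = pd.DataFrame({
    "Category": ["Network"] * 3 + ["Hardware"] * 2 + ["Billing"],
    "Description": ["a", "b", "c", "d", "e", "f"],
})
toy_min = 2

# transform("size") broadcasts each category's row count back to its rows,
# so the result can be used directly as a boolean row mask.
sizes = toy.groupby("Category")["Category"].transform("size")
kept = toy[sizes >= toy_min]

print(kept["Category"].value_counts())
```

Here the single "Billing" ticket is dropped because its category has fewer than `toy_min` rows, while all "Network" and "Hardware" rows are kept.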

## Pre-processing

The ticket descriptions are natural language text. We cannot fit a machine learning model on raw text. Instead, we have to convert each ticket description to a vector of numbers and then fit the model on those vectors. Converting text into a vector is called *vectorization*. Before vectorization, it is often useful to clean up the text by, for example, removing common but uninformative words (aka stopwords) and converting inflected word forms into their base forms (aka lemmatization). While this type of text preparation is not strictly required, it tends to improve the performance of the final model.
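
To make vectorization concrete, here is a minimal sketch that turns three made-up ticket descriptions into TF-IDF vectors. The example texts and the variable names (`docs`, `toy_vectorizer`, `X_toy`) are assumptions for illustration only; the notebook's own vectorizer settings are defined later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical ticket descriptions, invented for illustration only.
docs = [
    "printer is not printing",
    "cannot connect to the printer",
    "password reset request",
]

# Remove English stopwords, then map each document to a TF-IDF vector.
toy_vectorizer = TfidfVectorizer(stop_words="english")
X_toy = toy_vectorizer.fit_transform(docs)

# Recover the vocabulary in column order (one column per remaining word).
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)
print(X_toy.shape)               # (number of documents, vocabulary size)
print(X_toy.toarray().round(2))  # one row of weights per document
```

Each row is the numeric representation of one ticket. Stopwords such as "the" do not get a column, which is the effect the clean-up step aims for.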

In [None]:
nltk.download('stopwords')
parser = English()
nlp = spacy.load('en_core_web_sm')
punctuations = string.punctuation

STOPLIST = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS))
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-", "…", "“", "”"]

# Function to cleanup the text in increments
def cleanup_text(docs, logging=False):
    texts = []
    counter = 1
    for doc in docs:
        if counter % 1000 == 0 and logging:    
            print("Processed %d out of %d documents." % (counter, len(docs)))

        counter += 1
    
        doc = nlp(doc, disable=['parser', 'ner'])
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        tokens = [tok for tok in tokens if tok not in STOPLIST and tok not in SYMBOLS]
        tokens = ' '.join(tokens)
        texts.append(tokens)
      
    return pd.Series(texts)

# Function to find common words for each defined category
def find_common_words_by_category(data, categories, target_column, N):
  
    for category in categories:

        category_text = [text for text in data[data[target_column] == category]['text']]
        cleanup_category_text = cleanup_text(category_text)
        cleanup_category_text = ' '.join(cleanup_category_text).split()
        category_counter = Counter(cleanup_category_text)

        # Query the counter once, then split the (word, count) pairs
        most_common = category_counter.most_common(N)
        common_words_by_category = [word for word, _ in most_common]
        word_count_by_category = [count for _, count in most_common]

        word_statement = f"{category} top {N} words: {common_words_by_category} (counts: {word_count_by_category})"
        print(word_statement)

In [None]:
cleaned = cleanup_text(tickets['Description'], logging=True)

In [None]:
data = pd.DataFrame(index=cleaned.index, columns=['text', 'target'])
data['text'] = cleaned
data['target'] = tickets['Category']

In [None]:
data.head()

In [None]:
tickets.head()

Store the cleaned data.

In [None]:
data.to_csv('cleaned_data.csv')

## Fit the model

We will fit and compare several different learners (aka learning algorithms). Based on their relative performance, we will select the best model. This is very easy with sklearn, a great Python package for machine learning.

### Split the data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['text'],
                                                    data['target'],
                                                    test_size=.3,
                                                    shuffle=True,
                                                    stratify=data['target'],
                                                    random_state=3)

### Model fitting

In [None]:
# Instantiate the models that we will compare
model_dict = {'Dummy': DummyClassifier(random_state=3),
              # loss='log_loss' (scikit-learn >= 1.1) replaces the deprecated loss='log'
              'Stochastic Gradient Descent': SGDClassifier(random_state=3, loss='log_loss'),
              'Random Forest': RandomForestClassifier(random_state=3),
              'Decision Tree': DecisionTreeClassifier(random_state=3),
              'AdaBoost': AdaBoostClassifier(random_state=3),
              'Gaussian Naive Bayes': GaussianNB(),
              'Multinomial Naive Bayes': MultinomialNB(),
              'K Nearest Neighbor': KNeighborsClassifier(),
              'Logistic Regression': LogisticRegression()}

# Vectorizer for converting from text to vectors of numbers.
vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2), stop_words='english')

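# Helper to fix an issue with sparse arrays: some classifiers (e.g. GaussianNB)
# need dense input, while the vectorizer outputs a sparse matrix.
from sklearn.base import BaseEstimator

# Inheriting from BaseEstimator as well follows scikit-learn's estimator conventions.
class DenseTransformer(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray; todense() returns np.matrix,
        # which recent scikit-learn versions reject.
        return X.toarray()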

# Function that fits and evaluates the models in model dict
def fit_and_evaluate(model_dict):
    model_name, ac_score_list, p_score_list, r_score_list, f1_score_list = [], [], [], [], []

    for name, clf in model_dict.items():
        print('Fitting', name)
        model_name.append(name)
        pipe = Pipeline([('vectorizer', vectorizer), 
                         ('to_dense', DenseTransformer()),
                         ('clf', clf)])
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        ac_score_list.append(metrics.accuracy_score(y_test, y_pred))
        p_score_list.append(metrics.precision_score(y_test, y_pred, average='macro'))
        r_score_list.append(metrics.recall_score(y_test, y_pred, average='macro'))
        f1_score_list.append(metrics.f1_score(y_test, y_pred, average='macro'))
    # Build the comparison table once, after all models have been evaluated
    model_comparison_df = pd.DataFrame({'model_name': model_name,
                                        'accuracy_score': ac_score_list,
                                        'precision_score': p_score_list,
                                        'recall_score': r_score_list,
                                        'f1_score': f1_score_list})
    model_comparison_df = model_comparison_df.sort_values(by='f1_score', ascending=False)

    return model_comparison_df

In [None]:
model_comparison_df = fit_and_evaluate(model_dict)

### Look at the model comparison

In [None]:
model_comparison_df

### Fit a single model

In [None]:
vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2), stop_words='english')
clf = LogisticRegression()
pipe = Pipeline([('vectorizer', vectorizer), ('to_dense', DenseTransformer()), ('clf', clf)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
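
Accuracy alone can hide weak performance on individual categories. As a minimal sketch, a per-category report and a confusion matrix can be produced as shown below. The toy texts, the labels "Hardware" and "Access", and the `toy_`-prefixed names are invented for illustration; in the notebook, the same calls would be applied to `y_test` and `y_pred`.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy tickets and labels, invented for illustration only.
toy_train_text = ["printer jam paper", "printer toner low",
                  "reset password account", "password expired login"]
toy_train_y = ["Hardware", "Hardware", "Access", "Access"]
toy_test_text = ["printer paper stuck", "login password reset"]
toy_test_y = ["Hardware", "Access"]

# Same pipeline idea as above: vectorize the text, then classify.
toy_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("clf", LogisticRegression())])
toy_pipe.fit(toy_train_text, toy_train_y)
toy_pred = toy_pipe.predict(toy_test_text)

# Precision, recall, and F1 for each category.
print(metrics.classification_report(toy_test_y, toy_pred, zero_division=0))

# Rows are true categories, columns are predicted categories.
toy_cm = metrics.confusion_matrix(toy_test_y, toy_pred, labels=toy_pipe.classes_)
print(toy_cm)
```

Off-diagonal entries in the confusion matrix show which categories are confused with each other, which maps directly to tickets being routed to the wrong team.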

In [None]:
doc = cleanup_text(['dear modules report report cost thank regard'], logging=False)
category = pipe.predict(doc)
print('Predicted category:', category[0])