# Great Learning - Capstone Project - NLP | Automated Ticket Assignment
* **Submitted By**: Gaurav, Karishma, Lavanya, Pallavi and Swati
* **Status** : In-Progress (EDA, Feature Engineering & Selection)
* **Date of Submission** : TBD
* **Dataset** : https://drive.google.com/drive/u/0/folders/1xOCdNI2R5hiodskIJbj-QySMQs6ccehL

# Problem Statement
One of the key activities of any IT function is to ensure, through the Incident Management process, that Business operations are not impacted. An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service, that affects the Users and the Business.

The main goal of the Incident Management process is to provide a quick fix, workaround or solution that resolves the interruption and restores the service to full capacity, so that the business is not impacted.

These incidents are created by various stakeholders (Business Users, IT Users and Monitoring Tools) within the IT Service Management Tool and are assigned to Service Desk teams (L1 / L2 teams).

**The goal of this project is to build a classifier that can classify the incidents by analysing text**.


# Solution
The solution is to build a classification model that analyses the ticket text and assigns each ticket to the appropriate Service Desk team.

# Approach


*   Analyse and Understand the structure of data
*   Visualize data
*   Text preprocessing
*   Create word vocabulary and Tokens
*   Build a Classification model
*   Train the model
*   Test the Model 

## Get Required Files from Drive

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# changing present working directory
import os
os.chdir("/content/drive/My Drive/Capstone Project")
os.getcwd()

'/content/drive/My Drive/Capstone Project'

In [0]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
import seaborn as sns
sns.set(style="ticks", color_codes=True)
sns.set_palette("Spectral")
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
import DataPreprocessor as DP

from pprint import pprint
from sklearn import preprocessing 

import warnings
warnings.filterwarnings(action='ignore')

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [0]:
# NLTK Stop words
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
words = set(nltk.corpus.words.words())
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['received from', 'hi', 'hello','i','am','cc','sir','good morning','gentles','dear','kind','best','please',''])
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer 
from nltk.stem import WordNetLemmatizer
from gensim.utils import tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


# Reading and Exploring Data

## Reading Data "Input Data Synthetic (created but not used in our project).xlsx". Exploring the data and getting some insights about the data.

In [0]:
# Read Dataset
file_name = "Ticket_Data.xlsx" 
df = pd.read_excel(file_name)  # no encoding argument: recent pandas versions reject it for read_excel
df = df.rename(columns = {"Short description": "Short_description",
                          "Assignment group": "Group"})
DELETE_CALLER = False

df.head()

Unnamed: 0,Short_description,Description,Caller,Group
0,login issue,-verified user details.(employee# & manager na...,spxjnwir pjlcoqds,GRP_0
1,outlook,\r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail...,hmjdrvpb komuaywn,GRP_0
2,cant log in to vpn,\r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail...,eylqgodm ybqkwiam,GRP_0
3,unable to access hr_tool page,unable to access hr_tool page,xbkucsvz gcpydteq,GRP_0
4,skype error,skype error,owlgqjme qhcozdfx,GRP_0


In [0]:
# Checking Shape of the data
print("Data shape:", df.shape)
print("Data Description:")
df.describe()

Data shape: (8500, 4)
Data Description:


Unnamed: 0,Short_description,Description,Caller,Group
count,8492,8499,8500,8500
unique,7481,7817,2950,74
top,password reset,the,bpctwhsn kzqsbmtp,GRP_0
freq,38,56,810,3976
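
The `describe()` output above lists `GRP_0` as the most frequent group, with 3976 of the 8500 raw tickets. As a rough reference point for any later model, the sketch below computes the accuracy that a trivial classifier would reach by always predicting `GRP_0`. The variable names are illustrative and not part of the notebook, and the figure applies to the raw data before any de-duplication.

```python
# Counts taken from the describe() output above (raw data).
total_tickets = 8500
top_group_count = 3976  # frequency of GRP_0

# Accuracy of a hypothetical classifier that always predicts the most frequent group.
majority_baseline = top_group_count / total_tickets
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should beat this number by a clear margin to be considered useful.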


## Drop "Caller" column based on flag set by DELETE_CALLER. It seems to be anonymised data (usernames/ids). 

In [0]:
df_v1 = df.copy()  # work on a copy so the raw frame stays untouched
if DELETE_CALLER:
  df_v1 = df.drop('Caller',axis=1)
else:
  df_v1['Caller'] =  df_v1['Caller'].apply(lambda x: x.replace(" ", "_"))
df_v1.head(20)

Unnamed: 0,Short_description,Description,Caller,Group
0,login issue,-verified user details.(employee# & manager na...,spxjnwir_pjlcoqds,GRP_0
1,outlook,\r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail...,hmjdrvpb_komuaywn,GRP_0
2,cant log in to vpn,\r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail...,eylqgodm_ybqkwiam,GRP_0
3,unable to access hr_tool page,unable to access hr_tool page,xbkucsvz_gcpydteq,GRP_0
4,skype error,skype error,owlgqjme_qhcozdfx,GRP_0
5,unable to log in to engineering tool and skype,unable to log in to engineering tool and skype,eflahbxn_ltdgrvkz,GRP_0
6,event: critical:HostName_221.company.com the v...,event: critical:HostName_221.company.com the v...,jyoqwxhz_clhxsoqy,GRP_1
7,ticket_no1550391- employment status - new non-...,ticket_no1550391- employment status - new non-...,eqzibjhw_ymebpoih,GRP_0
8,unable to disable add ins on outlook,unable to disable add ins on outlook,mdbegvct_dbvichlg,GRP_0
9,ticket update on inplant_874773,ticket update on inplant_874773,fumkcsji_sarmtlhy,GRP_0


In [0]:
# Drop duplicate rows
df_v1 = df_v1.drop_duplicates(keep='first', inplace=False)

# Fetch rows with same data in "Short_description" & "Description"
df_v1[df_v1['Short_description'] == df_v1['Description']].count()

(8417, 4)
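
To make the two checks in the cell above explicit, here is a minimal sketch on a made-up frame (the rows are illustrative, not taken from the dataset). It drops exact duplicate rows and then counts how many of the remaining rows carry the same text in both description columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "Short_description": ["skype error", "skype error", "login issue", "outlook"],
    "Description": ["skype error", "skype error", "cannot log in", "outlook"],
    "Group": ["GRP_0", "GRP_0", "GRP_0", "GRP_0"],
})

# Keep the first copy of every fully identical row.
deduped = toy.drop_duplicates(keep="first")

# A boolean comparison summed over rows gives the number of matches.
same_text = (deduped["Short_description"] == deduped["Description"]).sum()

print("Shape after de-duplication:", deduped.shape)
print("Rows with identical descriptions:", same_text)
```

Summing the boolean mask returns a single number, whereas calling `.count()` on the filtered frame returns one count per column.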

## Finding & Imputing Null values in Short Description & Description columns

In [0]:
# Check for number of null values in each columns
print("Total Null Values in data:", df_v1.isnull().sum().sum())
print("\nNull Values across columns:\n", df_v1.isnull().sum())
print("\nData with 'Null' Short Description")
df_v1.loc[df_v1['Short_description'].isnull()]

Total Null Values in data: 9

Null Values across columns:
 Short_description    8
Description          1
Caller               0
Group                0
dtype: int64

Data with 'Null' Short Description


Unnamed: 0,Short_description,Description,Caller,Group
2604,,\r\n\r\nreceived from: ohdrnswl.rezuibdt@gmail...,ohdrnswl_rezuibdt,GRP_34
3383,,\r\n-connected to the user system using teamvi...,qftpazns_fxpnytmk,GRP_0
3906,,-user unable tologin to vpn.\r\n-connected to...,awpcmsey_ctdiuqwe,GRP_0
3910,,-user unable tologin to vpn.\r\n-connected to...,rhwsmefo_tvphyura,GRP_0
3915,,-user unable tologin to vpn.\r\n-connected to...,hxripljo_efzounig,GRP_0
3921,,-user unable tologin to vpn.\r\n-connected to...,cziadygo_veiosxby,GRP_0
3924,,name:wvqgbdhm fwchqjor\nlanguage:\nbrowser:mic...,wvqgbdhm_fwchqjor,GRP_0
4341,,\r\n\r\nreceived from: eqmuniov.ehxkcbgj@gmail...,eqmuniov_ehxkcbgj,GRP_0


In [0]:
print("\nData with 'Null' Description")
df_v1.loc[df_v1['Description'].isnull()]


Data with 'Null' Description


Unnamed: 0,Short_description,Description,Caller,Group
4395,i am locked out of skype,,viyglzfo_ajtfzpkb,GRP_0


In [0]:
# Impute missing values
df_v1['Short_description'] = df_v1['Short_description'].fillna('the') # replacing null values with stopword 'the'
df_v1['Description'] = df_v1['Description'].fillna('the') # replacing null values with stopword 'the'

print("Null values imputed")
print("Null Values in data after imputation:", df_v1.isnull().sum().sum())

Null values imputed
Null Values in data after imputation: 0
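
The same imputation can be sketched on a made-up two-row frame. This version assigns the result of `fillna` back to the column instead of relying on `inplace=True` on a column selection, which is the more predictable form in pandas. The frame and its contents are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Short_description": [None, "skype error"],
    "Description": ["user unable to login to vpn", None],
})

# Replace missing text with the placeholder stop word used in the notebook.
for column in ["Short_description", "Description"]:
    toy[column] = toy[column].fillna("the")

remaining_nulls = int(toy.isnull().sum().sum())
print("Null values after imputation:", remaining_nulls)
```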


## For "Assignment Group" values where the number of tickets is below the specified frequency threshold, we relabel the tickets as "GRP_Manual". All "GRP_Manual" tickets should be triaged manually until the model has enough data to categorise them automatically.

In [0]:
# Reset Assignment Group for group types with less data
Frequency_Threshold = 50
count = df_v1['Group'].value_counts(ascending=True)
idx = count[count.lt(Frequency_Threshold)].index
df_v1.loc[df_v1['Group'].isin(idx), 'Group'] = 'GRP_Manual'
print("Updated unique group types",df_v1['Group'].nunique())
df_v1['Group'].value_counts(ascending=True)


Updated unique group types 25


GRP_26          56
GRP_34          62
GRP_7           68
GRP_17          68
GRP_31          69
GRP_16          85
GRP_18          88
GRP_29          97
GRP_4          100
GRP_33         107
GRP_25         116
GRP_14         118
GRP_5          128
GRP_10         140
GRP_13         145
GRP_6          183
GRP_3          200
GRP_19         215
GRP_2          241
GRP_9          252
GRP_12         257
GRP_24         285
GRP_8          645
GRP_Manual     758
GRP_0         3934
Name: Group, dtype: int64
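
The relabelling rule can also be written with the standard library alone. The sketch below is an illustration on made-up labels; `bucket_rare_groups`, `GRP_70` and `GRP_71` are hypothetical names, and the notebook itself does the same job with `value_counts` and `.loc`.

```python
from collections import Counter

def bucket_rare_groups(labels, threshold, fallback="GRP_Manual"):
    """Relabel every group that occurs fewer than `threshold` times as `fallback`."""
    counts = Counter(labels)
    return [label if counts[label] >= threshold else fallback for label in labels]

labels = ["GRP_0"] * 5 + ["GRP_8"] * 3 + ["GRP_70", "GRP_71"]
bucketed = bucket_rare_groups(labels, threshold=3)
print(Counter(bucketed))
```

Groups that reach the threshold keep their label; the two single-ticket groups are merged into one manual-triage bucket.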

## Text Cleaning

In [0]:
# Clean both 'Short_description' and 'Description'
df_v1.Short_description = DP.text_preprocessing(df_v1.Short_description)
df_v1.Description = DP.text_preprocessing(df_v1.Description)
df_v1.head()

Unnamed: 0,Short_description,Description,Caller,Group
0,login issue,verified user details. checked the user name ...,spxjnwir_pjlcoqds,GRP_0
1,outlook,received from hello team my meetings skype me...,hmjdrvpb_komuaywn,GRP_0
2,cant log in to vpn,received from hi i cannot log on to vpn best,eylqgodm_ybqkwiam,GRP_0
3,unable to access hr tool page,unable to access hr tool page,xbkucsvz_gcpydteq,GRP_0
4,skype error,skype error,owlgqjme_qhcozdfx,GRP_0
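
The source of `DP.text_preprocessing` is not shown in this notebook. As an assumption about the kind of work such a step does, the hypothetical `clean_text` below reproduces a few effects visible in the output above: e-mail addresses disappear, underscores become spaces, and whitespace is collapsed. It is a sketch only and differs from the real module; for example, it also removes full stops, which the output above keeps.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleaner: lowercase, drop e-mail addresses, keep letters only."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)    # drop e-mail addresses
    text = re.sub(r"[^a-z\s]", " ", text)   # replace digits, punctuation and underscores
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.strip()

print(clean_text("unable to access hr_tool page"))
print(clean_text("\r\n\r\nreceived from: abc.def@gmail.com\r\n\r\nhello team"))
```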


## Concatenating "Short Description" and "Description" to get "Summary" Tickets

In [0]:
df_v1["Summary"] = df_v1['Short_description'].str.cat(df_v1['Description'], sep = ". ")
if not DELETE_CALLER:
  df_v1["Summary"] = df_v1['Summary'].str.cat(df_v1['Caller'], sep = ". ")
  df_v1 = df_v1.drop(['Caller'],axis=1)
df_v2 = df_v1.drop(['Short_description','Description'],axis=1)
df_v2 = df_v2[['Summary','Group']]
df_v2.head(20) 

Unnamed: 0,Summary,Group
0,login issue. verified user details. checked t...,GRP_0
1,outlook. received from hello team my meetings...,GRP_0
2,cant log in to vpn. received from hi i cannot...,GRP_0
3,unable to access hr tool page. unable to acces...,GRP_0
4,skype error . skype error,GRP_0
5,unable to log in to engineering tool and skype...,GRP_0
6,event the value of mountpoint threshold for . ...,GRP_Manual
7,employment status new non employee . employm...,GRP_0
8,unable to disable add ins on outlook. unable t...,GRP_0
9,ticket update on . ticket update on,GRP_0


In [0]:
# Word tokenisation, removal of stop words and removal of gibberish words (typos, anonymised names)

# Remove stopwords
df_v2['Summary'] = df_v2['Summary'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

# Remove words not in the English dictionary (typos, anonymised names)
df_v2['Summary'] = df_v2['Summary'].apply(lambda x: ' '.join([word for word in x.split() if word in (words)]))

# Tokenise 'Summary' column
data = df_v2.Summary.values.tolist()
data = [list(tokenize(sentences)) for sentences in data]

# Remove repeated words within each ticket, keeping first-occurrence order
temp = []
for eachrow in data:
    unique_words = list(dict.fromkeys(eachrow))
    temp.append(unique_words)
data = temp

# Lemmatise words (as noun, then verb, then adjective)
# porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()
temp = []
for eachrow in data:
    lemma_words = []
    for eachword in eachrow:
        if len(eachword) > 1:
            # eachword = porter_stemmer.stem(eachword)  # words were being over-stemmed
            eachword = wordnet_lemmatizer.lemmatize(eachword, pos="n")
            eachword = wordnet_lemmatizer.lemmatize(eachword, pos="v")
            eachword = wordnet_lemmatizer.lemmatize(eachword, pos="a")
            lemma_words.append(eachword)
    temp.append(lemma_words)
data = temp

data = [(" ".join(sentence))  for sentence in data]
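
Two details of the cell above are easy to miss. Stop words are matched against single whitespace-separated tokens, so multi-word entries such as `'received from'` can never match, and `dict.fromkeys` removes repeated words while keeping the order of first occurrence. The standard-library sketch below shows both on a made-up sentence with a made-up stop list.

```python
# Illustrative stop list; note the multi-word entry.
stop_words = {"hi", "please", "received from"}

tokens = "received from hi please reset password password".split()

# Token-level filtering: "received" and "from" survive because only the
# two-word phrase is in the stop list, and no single token equals it.
kept = [token for token in tokens if token not in stop_words]

# dict.fromkeys keeps the first occurrence of each word and preserves order.
deduped = list(dict.fromkeys(kept))

print(kept)
print(deduped)
```

If phrases like "received from" should be removed, one option is to list their individual words in the stop list instead.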

In [0]:
maxlen = 0
for sentence in data:
    if (maxlen < sentence.count(' ')+1 ):
        maxlen = sentence.count(' ')+1

    
# Build TF-IDF features; the vocabulary is capped at maxlen (the token count of the longest ticket)
tfidf_vectors = TfidfVectorizer(min_df=3,max_features= maxlen)
tfidf_db = tfidf_vectors.fit_transform(data).toarray()
tfidf_db = pd.DataFrame(tfidf_db)

le = preprocessing.LabelEncoder() 
df_v2['Group']= le.fit_transform(df_v2['Group']) # LabelEncode 'Groups'
df_v2.head(20)
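
In the cell above, `min_df=3` discards words that appear in fewer than three tickets, and `max_features` caps the vocabulary size (here at `maxlen`, the token count of the longest ticket). The sketch below shows how the two parameters shrink the vocabulary on a made-up four-ticket corpus; the corpus and the parameter values are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "password reset request",
    "password reset vpn",
    "vpn login issue",
    "login issue outlook",
]

# min_df=2 drops words seen in fewer than two tickets ("request", "outlook");
# max_features=3 then keeps only three of the remaining words.
vectorizer = TfidfVectorizer(min_df=2, max_features=3)
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)                    # (number of tickets, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the words that survived both filters
```

A small cap keeps the feature matrix compact but can discard words that distinguish one group from another, so the cap is worth treating as a tunable parameter.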

In [0]:
X = tfidf_db
y = df_v2['Group']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)

svm_model = SVC(kernel='linear',C=10)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
print('Training Accuracy:', svm_model.score(X_train , y_train))
print('Test Accuracy:',svm_model.score(X_test , y_test))


Training Accuracy: 0.7104950495049505
Test Accuracy: 0.5785565785565786
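
In the notebook, the TF-IDF vectoriser is fitted on all tickets before the train/test split, so the vocabulary and IDF weights are also informed by the test rows. One possible alternative, sketched below on made-up tickets with made-up labels (`GRP_A`, `GRP_B`), is to wrap the vectoriser and the classifier in a scikit-learn `Pipeline` so that both are fitted on the training split only. The `stratify` argument is an additional assumption here; it keeps group proportions similar in both splits, which can matter when one group dominates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

texts = [
    "password reset", "reset my password", "password expired", "unlock password",
    "vpn not connecting", "vpn login failed", "cannot reach vpn", "vpn drops often",
]
labels = ["GRP_A"] * 4 + ["GRP_B"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectoriser is fitted inside the pipeline, on the training texts only.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear", C=10)),
])
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

With this layout, `model.predict` accepts raw ticket text directly, because the fitted vectoriser travels with the classifier.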
