# Classifying the Appropriate Assignment Group Based on Problem Description
**Pradyun Magal, 2023 Summer**

This notebook takes you through the entire process of creating the model, from start to finish.


# **Part 1**
# *Installing Dependencies and Obtaining Data*

Each step is explained in more detail in the code comments.

In [None]:
!pip install tensorflow_text
!pip install tensorflow_hub #For our BERT preprocessing
!pip install spacy # For Name Entity Recognition
!python -m spacy download en_core_web_lg # Used the large English library to increase accuracy

Collecting tensorflow_text
  Downloading tensorflow_text-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
Collecting tensorflow<2.14,>=2.13.0 (from tensorflow_text)
  Downloading tensorflow-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (524.1 MB)

In [None]:
#Install and Import Libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra toolkit
import matplotlib.pyplot as plt
import re # Data Cleaning
import spacy
spc = spacy.load("en_core_web_lg") # For NER detection and data cleaning
from spacy import displacy
from bs4 import BeautifulSoup # For text parsing
import tensorflow as tf # The ML Library I will be using to create NN
device_name = tf.test.gpu_device_name()
# if device_name != '/device:GPU:0':
#   raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
# ^ This set of lines is for debugging purposes to make sure that tensorflow can
#   recognize the GPU on Colab
from tensorflow import keras # Layers for NN
import tensorflow_hub as hub # Dependencies Needed for BERT encoder
import tensorflow_text as text
from sklearn import preprocessing # For Label Encoding
from sklearn.model_selection import train_test_split # For Splitting Data
from sklearn.preprocessing import OneHotEncoder
from keras import layers

Found GPU at: /device:GPU:0


**Convert From CSV to DataFrame**

In [None]:
from google.colab import drive
# Bring in our CSV File from the drive
drive.mount('/content/drive')
df = pd.read_csv("/content/drive/MyDrive/GMSCRFDump.csv", encoding="windows-1252", encoding_errors="replace")
df.head()
df.info()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29356 entries, 0 to 29355
Data columns (total 29 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   ID                            29356 non-null  int64 
 1   Open DateTime                 29356 non-null  object
 2   Ticket Id                     29356 non-null  object
 3   Title                         27157 non-null  object
 4   Description                   29350 non-null  object
 5   Staff Name                    29356 non-null  object
 6   Close DateTime                26872 non-null  object
 7   Expected Response DateTime    29356 non-null  object
 8   Expected Resolution DateTime  29356 non-null  object
 9   Response Violated             29356 non-null  bool  
 10  Response Violation Reason     171 non-null    object
 11  Res

**Gather Columns We Need**

In this case, Description is our Feature and Group is our Label

In [None]:
mainDf = df[["Description","Group"]]
mainDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29356 entries, 0 to 29355
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  29350 non-null  object
 1   Group        29356 non-null  object
dtypes: object(2)
memory usage: 458.8+ KB


# **Part 2**
# *Begin Data Prep*

*Includes splitting, cleaning and sorting of our data set*

I also analyze the dataset for things like imbalances and biases.

In [None]:
# Start off by simply dropping Null Rows
mainDf = mainDf.dropna()
mainDf.info()
mainDf.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29350 entries, 0 to 29355
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  29350 non-null  object
 1   Group        29350 non-null  object
dtypes: object(2)
memory usage: 687.9+ KB


                                         Description             Group
0  Kindly Provide Bookmark Access to Below User a...  MESSAGING DOMINO
1  KIndly add user ID in Ludhiana RO & Ludhiana G...  MESSAGING DOMINO
2  Kindly Provide Bookmark Access For Freelook Ca...  MESSAGING DOMINO
3  Please add user in all HO & GO group. Vaibhav ...  MESSAGING DOMINO
4  to please modify (Add & Delete) some ID form d...  MESSAGING DOMINO




As the `info()` output above shows, only 6 of the 29,356 rows have a null `Description`. This means we can safely drop the null rows without any meaningful effect on the data.

**Data Imbalance**

We noticed a large imbalance in how the data is spread across labels, as seen below. Some labels in the dataset make up less than 0.1% of the total data.



In [None]:
print(mainDf["Group"].value_counts())
# Look at how dataset balance affects accuracy, eg most common labels bias model
# Network Management and Below


I-Prompt                       9229
SERVER MANAGEMENT              5031
MESSAGING DOMINO               3116
DB MANAGEMENT - ORACLE         3022
SECURITY MANAGEMENT            1978
Server Management - AIX        1287
STORAGE SUPPORT                1098
Server Management - Linux       974
DB MANAGEMENT - SQL             941
BACKUP MANAGEMENT               808
NETWORK MANAGEMENT              562
Server Management-AVPM          472
Infosec                         311
Server Management - SOLARIS     228
SERVER MANAGEMENT - UNIX        183
AS400                            38
SIM ANALYSIS                     33
MLIC                             26
DB2 - DATABASE                   13
Name: Group, dtype: int64
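
To put numbers on the imbalance described above, the printed counts can be turned into percentage shares. The following is a minimal sketch in plain Python that uses the counts copied from the output; the variable names are only for illustration. In pandas, `value_counts(normalize=True)` gives the same fractions directly.

```python
# Counts copied from the value_counts() output above
group_counts = {
    "I-Prompt": 9229, "SERVER MANAGEMENT": 5031, "MESSAGING DOMINO": 3116,
    "DB MANAGEMENT - ORACLE": 3022, "SECURITY MANAGEMENT": 1978,
    "Server Management - AIX": 1287, "STORAGE SUPPORT": 1098,
    "Server Management - Linux": 974, "DB MANAGEMENT - SQL": 941,
    "BACKUP MANAGEMENT": 808, "NETWORK MANAGEMENT": 562,
    "Server Management-AVPM": 472, "Infosec": 311,
    "Server Management - SOLARIS": 228, "SERVER MANAGEMENT - UNIX": 183,
    "AS400": 38, "SIM ANALYSIS": 33, "MLIC": 26, "DB2 - DATABASE": 13,
}

total = sum(group_counts.values())

# Share of each label as a percentage of all rows
shares = {name: 100 * n / total for name, n in group_counts.items()}

for name, pct in shares.items():
    print(f"{name:<28} {pct:6.2f}%")
```

The largest label accounts for roughly a third of all rows, while the four smallest labels each fall well under 1%.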


With this in mind, we decided to do the following so that labels that show up very infrequently don't affect our results:



```
 - Labels with 1000+ appearances stay in the main model
 - Labels with fewer than 1000 appearances go into a "mini" (smaller) model
 - Labels with fewer than 100 appearances get dropped completely

 We will also remove a random 40% of the I-Prompt rows and save them in another dataframe for testing.
```

The ruleset above means we need to split our data into two different datasets. I chose the names `mainDf` and `miniDf` for them.
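
The thresholds in the ruleset can also be written as a small function, which avoids typing the label lists by hand. This is only a sketch: `assign_bucket` and its parameter names are hypothetical, and the sample counts are a few values taken from the `value_counts()` output.

```python
def assign_bucket(count, main_min=1000, drop_below=100):
    """Return 'main', 'mini', or 'drop' for a label with `count` rows."""
    if count >= main_min:
        return "main"
    if count >= drop_below:
        return "mini"
    return "drop"

# A few counts taken from the value_counts() output
sample_counts = {
    "STORAGE SUPPORT": 1098,
    "Server Management - Linux": 974,
    "SERVER MANAGEMENT - UNIX": 183,
    "AS400": 38,
}

buckets = {name: assign_bucket(n) for name, n in sample_counts.items()}
print(buckets)
```

Labels that land in the `mini` bucket correspond to `includeList` in the next cells, and labels in the `drop` bucket correspond to `excludeList`.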


In [None]:
# First we drop all the excess rows in I-Prompt, save it to ipromptDf
ipromptDf = mainDf[mainDf["Group"] == 'I-Prompt'].sample(frac=.4)
mainDf = mainDf.drop(ipromptDf.index,inplace=False)

In [None]:
ipromptDf.head() # Just to see what it looks like

                                             Description     Group
3470   Call logged by Iprompt for the task :: 'Need t...  I-Prompt
20171  Call logged by Iprompt for the task :: 'SAN Sw...  I-Prompt
24966  Call logged by Iprompt for the task :: 'Need t...  I-Prompt
6771   Call logged by Iprompt for the task :: 'Need t...  I-Prompt
25017  Call logged by Iprompt for the task :: 'Need t...  I-Prompt


In [None]:
includeList = ["Server Management - Linux","DB MANAGEMENT - SQL","BACKUP MANAGEMENT","NETWORK MANAGEMENT","Server Management-AVPM","Infosec","Server Management - SOLARIS","SERVER MANAGEMENT - UNIX"]
excludeList = ["AS400","SIM ANALYSIS","MLIC","DB2 - DATABASE"]
miniDf = mainDf[mainDf["Group"].isin(includeList)].copy() # Mini DF includes whatever is in the list; copy() so later column assignments work on an independent frame
mainDf = mainDf[~mainDf["Group"].isin(excludeList)] # Main Df drops the labels we tell it to
print(miniDf["Group"].value_counts())
print(mainDf["Group"].value_counts())
print(len(miniDf)) # This value will match with the submodel length (later lines below)
print(len(mainDf))

Server Management - Linux      974
DB MANAGEMENT - SQL            941
BACKUP MANAGEMENT              808
NETWORK MANAGEMENT             562
Server Management-AVPM         472
Infosec                        311
Server Management - SOLARIS    228
SERVER MANAGEMENT - UNIX       183
Name: Group, dtype: int64
I-Prompt                       5537
SERVER MANAGEMENT              5031
MESSAGING DOMINO               3116
DB MANAGEMENT - ORACLE         3022
SECURITY MANAGEMENT            1978
Server Management - AIX        1287
STORAGE SUPPORT                1098
Server Management - Linux       974
DB MANAGEMENT - SQL             941
BACKUP MANAGEMENT               808
NETWORK MANAGEMENT              562
Server Management-AVPM          472
Infosec                         311
Server Management - SOLARIS     228
SERVER MANAGEMENT - UNIX        183
Name: Group, dtype: int64
4479
25548


In [None]:
# Adjust all of the labels
rDict = dict.fromkeys(includeList, "submodel")
print(rDict)
mainDf = mainDf.replace(rDict)
print(mainDf["Group"].value_counts())
# 4479 In submodel, which is the total amount in miniDf meaning the split was successful

{'Server Management - Linux': 'submodel', 'DB MANAGEMENT - SQL': 'submodel', 'BACKUP MANAGEMENT': 'submodel', 'NETWORK MANAGEMENT': 'submodel', 'Server Management-AVPM': 'submodel', 'Infosec': 'submodel', 'Server Management - SOLARIS': 'submodel', 'SERVER MANAGEMENT - UNIX': 'submodel'}
I-Prompt                   5537
SERVER MANAGEMENT          5031
submodel                   4479
MESSAGING DOMINO           3116
DB MANAGEMENT - ORACLE     3022
SECURITY MANAGEMENT        1978
Server Management - AIX    1287
STORAGE SUPPORT            1098
Name: Group, dtype: int64


**Label Encoding**

We need to convert our `str` label classes into numerical categories. The easiest way to do this with a pandas dataframe is sklearn's `preprocessing` module.

In [None]:
from sklearn import preprocessing
nameList = list(mainDf["Group"].unique()) # We will use this list to form a dictionary that converts the number label back to the string label
print(mainDf["Group"].unique())
label_encoder = preprocessing.LabelEncoder()
mainDf["Group"] = label_encoder.fit_transform(mainDf["Group"]) # Encodes the Group column
encodeList = list(mainDf["Group"].unique()) # Obtain the unique values after encoding
print(mainDf["Group"].unique())
deDict = dict(zip(encodeList, nameList))
# deDict is correct because pandas' unique() returns values in order of appearance,
# and since we didn't change the row order of the frame, the two lists line up

['MESSAGING DOMINO' 'SERVER MANAGEMENT' 'submodel' 'STORAGE SUPPORT'
 'Server Management - AIX' 'I-Prompt' 'SECURITY MANAGEMENT'
 'DB MANAGEMENT - ORACLE']
[2 4 7 5 6 1 3 0]
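
An alternative way to build the decoding dictionary is to read it from the fitted encoder itself. `LabelEncoder` stores the class names in sorted order in its `classes_` attribute, and the position of each name is its integer code. The sketch below uses a short made-up list of labels rather than the real dataframe, and the names `encoder` and `decode` are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# A short stand-in for the Group column
labels = ["MESSAGING DOMINO", "SERVER MANAGEMENT", "submodel",
          "I-Prompt", "MESSAGING DOMINO"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and the index of each class is its integer code
decode = {i: str(name) for i, name in enumerate(encoder.classes_)}
print(decode)

# inverse_transform maps codes back to the original strings
restored = list(encoder.inverse_transform(codes))
print(restored)
```

This does not rely on row order, so it stays valid even if the dataframe is shuffled between encoding and decoding.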


Below I did the same process on the `miniDf`

In [None]:
# Encode the miniDF
label_encoder = preprocessing.LabelEncoder()
miniNameList = list(miniDf["Group"].unique())
miniDf["Group"] = label_encoder.fit_transform(miniDf["Group"])
miniEncodeList = list(miniDf["Group"].unique())
print(miniDf["Group"].unique())
miniDict = dict(zip(miniEncodeList, miniNameList))
print(miniDict)

[5 1 3 0 2 6 7 4]
{5: 'Server Management - Linux', 1: 'DB MANAGEMENT - SQL', 3: 'NETWORK MANAGEMENT', 0: 'BACKUP MANAGEMENT', 2: 'Infosec', 6: 'Server Management - SOLARIS', 7: 'Server Management-AVPM', 4: 'SERVER MANAGEMENT - UNIX'}


In [None]:
ipromptDf["Group"] = 1
# The line above sets every label to 1, which is the encoding for I-Prompt in mainDf

In [None]:
print(deDict)
print(mainDf["Group"].value_counts())
# This cell is just for debugging, but it gives us further reassurance that the labels are correct. For example, I-Prompt still has the highest count

{2: 'MESSAGING DOMINO', 4: 'SERVER MANAGEMENT', 7: 'submodel', 5: 'STORAGE SUPPORT', 6: 'Server Management - AIX', 1: 'I-Prompt', 3: 'SECURITY MANAGEMENT', 0: 'DB MANAGEMENT - ORACLE'}
1    5537
4    5031
7    4479
2    3116
0    3022
3    1978
6    1287
5    1098
Name: Group, dtype: int64


**Text Cleaning**

I experimented with many cleaning strategies, including:

*   Regex removal of special characters
*   Regex replacement of certain characters with spaces
*   Parsing with the Beautiful Soup module to extract text from any potential HTML
*   Removal of stopwords using the NLTK library

All of the above were attempts to make it easier for the [Tensorflow BERT Encoder](https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1) to tokenize the words. The idea is to remove any unneeded characters that can throw off the encodings; for example, we don't want the model to think '/' or '@' has something to do with the final encoding.
But after a lot of trial and error, we decided to let the [BERT Preprocess Model on TF Hub](https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3) do the work for us and to remove only the names in the tickets instead.




In [None]:
def removeWeirdCharacters(text):
    # soup = BeautifulSoup(text,"html.parser")
    # text = soup.get_text()
    doc = spc(text)
    newText = text
    for word in doc.ents:
        if word.label_ == "PERSON":
            newText = newText.replace(word.text,'')
    text = newText
    # text = re.sub(r"[()@.,-]+",'',text)
    # text = re.sub(r"[&;:/]",' ',text)
    # text = re.sub(r"\d",'',text)
    # stop_words = set(stopwords.words('english'))
    # word_tokens = word_tokenize(text)
    # filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
    # text = ' '.join(filtered_sentence)
    text = text.lower()
    return text

As seen above, I left all of the initial attempts at cleaning the text data by hand as comments. It is a little harder to read, but it serves as a personal reminder of which methods I've tried.
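
One caveat with `newText.replace(word.text, '')` is that it removes every occurrence of the entity string, including matches elsewhere in the ticket. A possible alternative is to cut by character offsets instead. The sketch below is plain Python and takes `(start, end)` pairs; with spaCy these could come from each entity's `start_char` and `end_char` attributes. The helper name `remove_spans` and the sample sentence are hypothetical.

```python
def remove_spans(text, spans):
    """Remove the given (start, end) character spans from text."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already removed
            continue
        pieces.append(text[cursor:start])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

# Hypothetical ticket text; only the first "Amit Rao" is treated as a person
sample = "Please add user Amit Rao to the Amit Rao Associates group"
start = sample.index("Amit Rao")
cleaned = remove_spans(sample, [(start, start + len("Amit Rao"))])
print(cleaned)

# For comparison, str.replace removes both occurrences
print(sample.replace("Amit Rao", ""))
```

Whether this matters in practice depends on how often names reappear inside other phrases in the tickets.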

In [None]:
mainDf["Description"] = mainDf["Description"].apply(removeWeirdCharacters)
miniDf["Description"] = miniDf["Description"].apply(removeWeirdCharacters)
ipromptDf["Description"] = ipromptDf["Description"].apply(removeWeirdCharacters)
# Apply the function

In [None]:
print(mainDf.head())
print(mainDf["Description"][100])
print("---------------------")
print(miniDf.head())
print(miniDf["Description"][70])
# See roughly what the cleaning looks like

                                         Description  Group
0  kindly provide bookmark access to below user a...      2
1  kindly add user id in ludhiana ro & ludhiana g...      2
2  kindly provide bookmark access for freelook ca...      2
3  please add user in all ho & go group.  (emp id...      2
4  to please modify (add & delete) some id form d...      2
increase mail quota size for below user. ad2835 
---------------------
                                           Description  Group
68   kindly provide server right for below user ser...      5
70   kindly replicate all the db access from aj2929...      1
71   please provide the structure of view name -- v...      1
72   need to execute dbcc checkdb on gursrv0345 ser...      1
144  kindly enable the ports and ether channel cabl...      3
kindly replicate all the db access from aj29296 to os42439. 


**Data Splitting**

I used sklearn's `train_test_split` function on both `mainDf` and `miniDf`. With no `test_size` given, it holds out the default 25% of the rows for testing.
I made sure to pass `stratify` so that the train and test sets keep the same label proportions as the full dataset. We want to make sure we test our model on data it has never seen during training.

In [None]:

X_train, x_test, Y_train, y_test = train_test_split(mainDf['Description'],mainDf["Group"], stratify=mainDf["Group"])
miniX_train, minix_test, miniY_train, miniy_test = train_test_split(miniDf['Description'],miniDf["Group"], stratify=miniDf["Group"])
# The following checks make sure everything is proper: the x and y for each train and test split need to be the same length
print("TESTS")
print(len(X_train))
print(len(Y_train))
print(len(x_test))
print(len(y_test))
print("MINIS")
print(len(miniX_train))
print(len(miniY_train))
print(len(minix_test))
print(len(miniy_test))

TESTS
19161
19161
6387
6387
MINIS
3359
3359
1120
1120
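
To illustrate what `stratify` does, here is a small sketch on made-up data with an 80/20 label imbalance. The `random_state=42` argument is an addition for reproducibility and is not used in the cell above; the toy texts and labels are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 100 tickets, 80 of class 0 and 20 of class 1
texts = [f"ticket {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.25,      # same as the default used above
    stratify=labels,     # keep the 80/20 ratio in both splits
    random_state=42,     # assumption: fixed seed so the split is repeatable
)

train_counts = dict(Counter(y_tr))
test_counts = dict(Counter(y_te))
print(train_counts, test_counts)
```

Both splits keep the same 4:1 ratio between the two classes, and no ticket appears in both.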


Let's have a look at the series created from the split:

In [None]:
print(miniX_train.head())

26924    backup team : kindly confirm the latest tape b...
22971    vlingursrv0651, vm tools not installed, so ple...
7033     backup team : kindly confirm the latest tape b...
6826     kindly activate the 1 lan port . loginid mkptn...
26023    kindly share the below details asap 1. current...
Name: Description, dtype: object


**One Hot Encoding**
The loss function I used for the neural network is Keras's [CategoricalCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy), which requires the labels to be one hot encoded. I therefore needed to one hot encode the labels before they could be passed into the neural network.

I used pandas `get_dummies` to one hot encode the categorical data.

In [None]:
newytrain = pd.get_dummies(Y_train, prefix='group')
print(len(newytrain)) # Just to make sure
unhotytrain = Y_train # For analysis and testing, I made sure to keep copies of the dataframe columns before they were one hot encoded to be safe
Y_train = newytrain

newminiytrain = pd.get_dummies(miniY_train,prefix='group')
print(len(newminiytrain)) # Can never be too safe
miniytrainnohot = miniY_train
miniY_train = newminiytrain

19161
3359


Same thing on the Test splits

In [None]:
#test splits
newytest = pd.get_dummies(y_test, prefix='group')
unhotytest = y_test
y_test = newytest

mininewytest = pd.get_dummies(miniy_test, prefix='group')
miniunhotytest = miniy_test
miniy_test = mininewytest


In [None]:
print(miniY_train.head()) # Seeing what our result looks like

       group_0  group_1  group_2  group_3  group_4  group_5  group_6  group_7
26924        1        0        0        0        0        0        0        0
22971        0        0        0        0        0        1        0        0
7033         1        0        0        0        0        0        0        0
6826         0        0        0        1        0        0        0        0
26023        0        0        0        0        0        0        1        0
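
One thing to keep in mind is that `get_dummies` only creates columns for the values present in the series it is given. With stratified splits every class should appear in both train and test, but if a split ever missed a class, the number of columns would no longer match the 8 outputs of the model. A sketch of a fixed-width alternative using NumPy follows; the helper name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(codes, num_classes):
    """One hot encode integer codes into a fixed number of columns."""
    codes = np.asarray(codes)
    # Row i of the identity matrix is the one hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[codes]

# The codes of the five rows shown in the output above
encoded = one_hot([0, 5, 0, 3, 6], num_classes=8)
print(encoded)
```

The result always has `num_classes` columns, regardless of which codes happen to be present.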


# **The Neural Network**
# *Building and Training it*

The first two layers handle the text: it goes through `bert_preprocess`, which tokenizes it, and then `bert_encoder`, which turns the tokens into an embedding that the rest of the model can process.
It then goes through a dropout layer to help prevent overfitting. The dropout rate is very low (0.02 for the main model and 0.01 for the mini model), but the layer is necessary.
The final and most important layer is the `tf.keras.layers.Dense` layer, which uses a sigmoid function for activation.
This layer has as many neurons as there are labels to predict; both the mini model and the main model have 8 label classes.

The returned result is an array of 8 scores between 0 and 1, one per label. The label with the highest score is the model's overall prediction.
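
For readers comparing activations, the sketch below contrasts sigmoid with softmax, which is another common choice for single-label classification. It uses NumPy and made-up raw scores, not outputs from the models in this notebook. Sigmoid scores each label independently, so the 8 values do not have to sum to 1, while softmax produces a distribution that does.

```python
import numpy as np

def sigmoid(z):
    """Squash each score independently into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn scores into a probability distribution that sums to 1."""
    shifted = z - np.max(z)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

# Made-up raw scores for 8 labels
logits = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.0, -0.5])

sig = sigmoid(logits)
soft = softmax(logits)

print(sig.sum(), soft.sum())            # sigmoid sum is not constrained to 1
print(np.argmax(sig), np.argmax(soft))  # both pick the same label
```

Since both functions are monotonic in each score, taking the argmax selects the same label either way; the difference lies in how the individual values should be interpreted.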

In [None]:
#What nn Would Look like:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

#Inputs
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

#Heart of NN
l = tf.keras.layers.Dropout(0.02, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(8, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])

Same for the mini model:

In [None]:
#mini model
bert_preprocess2 = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder2 = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

#Inputs
text_input2 = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text2 = bert_preprocess2(text_input2)
outputs2 = bert_encoder2(preprocessed_text2)

#Heart of NN
l2 = tf.keras.layers.Dropout(0.01, name="dropout")(outputs2['pooled_output'])
l2 = tf.keras.layers.Dense(8, activation='sigmoid', name="output")(l2)

# Use inputs and outputs to construct a final model
minimodel = tf.keras.Model(inputs=[text_input2], outputs = [l2])

In [None]:
print(model.summary())
print(minimodel.summary())

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 text (InputLayer)           [(None,)]                    0         []                            
                                                                                                  
 keras_layer (KerasLayer)    {'input_word_ids': (None,    0         ['text[0][0]']                
                             128),                                                                
                              'input_type_ids': (None,                                            
                             128),                                                                
                              'input_mask': (None, 128)                                           
                             }                                                                

**Metrics**

For the metrics I decided to go with:
*   Categorical Accuracy to see how good the model is at predicting overall
*   Precision to see how many of the model's positive predictions are correct (it is lowered by false positives)
*   Recall to see how many of the actual positives the model finds (it is lowered by false negatives)

I compiled the models with the `adam` optimizer, as it is one of the most widely used optimizers.
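
As a reminder of what precision and recall measure, here is a small sketch in plain Python that computes both for one class from true and predicted labels. The function name and the toy labels are hypothetical.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: class 1 has 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 2]
y_pred = [1, 1, 0, 1, 0, 2]

p, r = precision_recall(y_true, y_pred, positive=1)
print(p, r)
```

Note that `tf.keras.metrics.Precision` and `tf.keras.metrics.Recall` threshold each output value at 0.5 by default rather than taking the argmax, so their values can differ from per-class figures computed on argmax predictions.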

In [None]:
METRICS = [
      tf.keras.metrics.CategoricalAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=METRICS)

minimodel.compile(optimizer='adam',loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=METRICS)

This is where the waiting game starts. I decided to start with 30 epochs on the main model and 25 on the mini model, and then see how many I needed to add afterwards.

It is tricky not to overtrain the model in the hope of maximizing accuracy, as there's usually an appropriate threshold of epochs that one has to find by experimenting.

Final Amount of epochs:

Main 45

Mini 40
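
Finding the epoch threshold by hand can also be approximated with an early stopping rule: stop once the validation loss has not improved for a few epochs. The sketch below shows that logic in plain Python on made-up loss values; it is not what was used in this notebook, and `best_epoch` is a hypothetical helper. Keras offers the same idea as the `tf.keras.callbacks.EarlyStopping` callback, which requires validation data to be passed to `fit`.

```python
def best_epoch(val_losses, patience=3):
    """Return the 1-based epoch with the lowest validation loss, stopping
    once `patience` consecutive epochs fail to improve on it."""
    best_loss = float("inf")
    best_idx = 0
    waited = 0
    for idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_idx, waited = loss, idx, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx

# Made-up validation losses, one per epoch
made_up = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
print(best_epoch(made_up))
```

With a patience of 3, training stops after epoch 6, so the later improvement at epoch 7 is never seen. This is the trade-off when choosing the patience value.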

In [None]:
model.fit(X_train, Y_train, epochs=40)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.src.callbacks.History at 0x789c8a6447c0>

In [None]:
minimodel.fit(miniX_train,miniY_train,epochs=35)


Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.src.callbacks.History at 0x789c893b7e20>

In [None]:
model.save("/content/drive/MyDrive/PradyunfinNoNer5.h5")

  saving_api.save_model(


In [None]:
minimodel.save("/content/drive/MyDrive/PradyufinMiniNoNer5.h5")

In [None]:
model.evaluate(x_test, y_test)



[0.8266290426254272,
 0.6896821856498718,
 0.2646704614162445,
 0.9632065296173096]

In [None]:
minimodel.evaluate(minix_test,miniy_test)



[1.0200402736663818,
 0.6633928418159485,
 0.2815934121608734,
 0.9151785969734192]
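
The lists returned by `evaluate` contain the loss first, followed by the metrics in the order they were passed to `compile`. A minimal sketch that pairs the values printed above with their names is shown below; the variable names are only for illustration. Passing `return_dict=True` to `evaluate` is another way to get labelled values.

```python
# Loss first, then the metrics in the order given in METRICS
metric_names = ["loss", "accuracy", "precision", "recall"]

# Values copied from the evaluate() outputs above
main_results = [0.8266290426254272, 0.6896821856498718,
                0.2646704614162445, 0.9632065296173096]
mini_results = [1.0200402736663818, 0.6633928418159485,
                0.2815934121608734, 0.9151785969734192]

main_labelled = dict(zip(metric_names, main_results))
mini_labelled = dict(zip(metric_names, mini_results))

for title, labelled in (("main", main_labelled), ("mini", mini_labelled)):
    print(title, {name: round(value, 3) for name, value in labelled.items()})
```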

# Diagnostics

Load Models made above

In [None]:
mainModel = tf.keras.models.load_model("/content/drive/MyDrive/PradyunfinNoNer11.h5",custom_objects={'KerasLayer':hub.KerasLayer})
miniModel = tf.keras.models.load_model("/content/drive/MyDrive/PradyufinMiniNoNer11.h5",custom_objects={'KerasLayer':hub.KerasLayer})

KeyboardInterrupt: ignored

In [None]:
miniModel.save("/content/drive/MyDrive/PradyufinMiniNoNer11.h5")

In [None]:
mainModel.save("/content/drive/MyDrive/PradyunfinNoNer11.h5")

Run Metrics on Larger Model

In [None]:
newy_test = y_test.reset_index(inplace=False,drop=True)
newx_test = x_test.reset_index(inplace=False,drop=True)
y_hat = mainModel.predict(newx_test)
y_pred = np.argmax(y_hat,axis=1)



In [None]:
metricsytest = unhotytest.to_numpy()
metricsytest

array([7, 4, 2, ..., 7, 3, 1])

In [None]:
from sklearn.metrics import classification_report
print(classification_report(metricsytest, y_pred))
print(deDict)

              precision    recall  f1-score   support

           0       0.88      0.88      0.88       755
           1       1.00      1.00      1.00      1384
           2       0.69      0.73      0.71       779
           3       0.58      0.56      0.57       494
           4       0.61      0.70      0.65      1258
           5       0.42      0.60      0.49       275
           6       0.86      0.60      0.71       322
           7       0.64      0.51      0.57      1120

    accuracy                           0.74      6387
   macro avg       0.71      0.70      0.70      6387
weighted avg       0.75      0.74      0.74      6387

{2: 'MESSAGING DOMINO', 4: 'SERVER MANAGEMENT', 7: 'submodel', 5: 'STORAGE SUPPORT', 6: 'Server Management - AIX', 1: 'I-Prompt', 3: 'SECURITY MANAGEMENT', 0: 'DB MANAGEMENT - ORACLE'}
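
To avoid looking up each row number in `deDict`, the report can print the label names directly through the `target_names` parameter of `classification_report`. The sketch below uses a short made-up pair of label lists, not the real predictions; the names must be ordered by their integer code.

```python
from sklearn.metrics import classification_report

# Decoding dictionary copied from the output above
deDict = {2: 'MESSAGING DOMINO', 4: 'SERVER MANAGEMENT', 7: 'submodel',
          5: 'STORAGE SUPPORT', 6: 'Server Management - AIX', 1: 'I-Prompt',
          3: 'SECURITY MANAGEMENT', 0: 'DB MANAGEMENT - ORACLE'}

# Names sorted by integer code so they line up with the report rows
target_names = [deDict[code] for code in sorted(deDict)]

# Made-up true and predicted codes covering all 8 classes
y_true_demo = [0, 1, 2, 3, 4, 5, 6, 7, 1, 7]
y_pred_demo = [0, 1, 2, 3, 4, 5, 6, 4, 1, 7]

report = classification_report(y_true_demo, y_pred_demo,
                               target_names=target_names, zero_division=0)
print(report)
```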


In [None]:
mininewy_test = miniy_test.reset_index(inplace=False,drop=True)
mininewx_test = minix_test.reset_index(inplace=False,drop=True)
miniy_hat = miniModel.predict(mininewx_test)




In [None]:
ipromptHat = mainModel.predict(ipromptDf["Description"])



In [None]:
miniy_pred = np.argmax(miniy_hat,axis=1)
miniy_pred

array([3, 1, 5, ..., 5, 1, 0])

In [None]:
minimetricsytest = miniunhotytest.to_numpy()
minimetricsytest


array([0, 1, 4, ..., 5, 1, 0])

In [None]:
print(classification_report(minimetricsytest, miniy_pred))
print(miniDict)


              precision    recall  f1-score   support

           0       0.74      0.83      0.78       202
           1       0.92      0.87      0.89       235
           2       1.00      0.88      0.94        78
           3       0.78      0.77      0.78       140
           4       0.49      0.59      0.53        46
           5       0.69      0.82      0.75       244
           6       0.69      0.51      0.59        57
           7       0.70      0.46      0.55       118

    accuracy                           0.77      1120
   macro avg       0.75      0.72      0.73      1120
weighted avg       0.77      0.77      0.76      1120

{5: 'Server Management - Linux', 1: 'DB MANAGEMENT - SQL', 3: 'NETWORK MANAGEMENT', 0: 'BACKUP MANAGEMENT', 2: 'Infosec', 6: 'Server Management - SOLARIS', 7: 'Server Management-AVPM', 4: 'SERVER MANAGEMENT - UNIX'}
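
The two models are evaluated separately above. One way they might be chained for a single ticket is sketched below: the main model predicts first, and only when it answers `submodel` is the mini model consulted. This routing is an assumption about how the models could be combined, not code from the notebook. The function `route` and the stand-in predictors are hypothetical; in practice the scores would come from `mainModel.predict` and `miniModel.predict`.

```python
# Decoding dictionaries copied from the outputs above
deDict = {2: 'MESSAGING DOMINO', 4: 'SERVER MANAGEMENT', 7: 'submodel',
          5: 'STORAGE SUPPORT', 6: 'Server Management - AIX', 1: 'I-Prompt',
          3: 'SECURITY MANAGEMENT', 0: 'DB MANAGEMENT - ORACLE'}
miniDict = {5: 'Server Management - Linux', 1: 'DB MANAGEMENT - SQL',
            3: 'NETWORK MANAGEMENT', 0: 'BACKUP MANAGEMENT', 2: 'Infosec',
            6: 'Server Management - SOLARIS', 7: 'Server Management-AVPM',
            4: 'SERVER MANAGEMENT - UNIX'}

def argmax(scores):
    """Index of the highest score in a plain list."""
    return max(range(len(scores)), key=scores.__getitem__)

def route(text, main_predict, mini_predict):
    """Return a group name, deferring to the mini model for 'submodel'."""
    main_name = deDict[argmax(main_predict(text))]
    if main_name != "submodel":
        return main_name
    return miniDict[argmax(mini_predict(text))]

# Stand-in predictors returning made-up scores for 8 labels
def fake_main(text):
    return [0.1, 0.0, 0.2, 0.1, 0.3, 0.0, 0.1, 0.9]   # highest at code 7

def fake_mini(text):
    return [0.2, 0.8, 0.0, 0.1, 0.0, 0.3, 0.0, 0.1]   # highest at code 1

result = route("need to execute dbcc checkdb", fake_main, fake_mini)
print(result)
```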


In [None]:
ipromptPred = np.argmax(ipromptHat,axis=1)
print(classification_report(ipromptDf["Group"], ipromptPred))
# Should be 100%

              precision    recall  f1-score   support

           1       1.00      1.00      1.00      3692

    accuracy                           1.00      3692
   macro avg       1.00      1.00      1.00      3692
weighted avg       1.00      1.00      1.00      3692

