# Linear SVC to predict whether an incident met SLA

The Incident Management dataset has about 141,712 records covering 24,918 incidents. Each state an incident passes through is captured as a separate record, and in a few cases the closed state of an incident is recorded more than once. The code below loads and cleans the data so that each incident is represented by a single record holding its final closed state.

------------------------------------------------------------------------------------------------------------------------------

##### Create a Spark session and load the Incident Management dataset

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('IMMLSVC').getOrCreate()

In [3]:
df = spark.read.csv('incident_event_log.csv',inferSchema=True,header=True)

------------------------------------------------------------------------------------------------------------------------------

##### Data pre-processing

In [4]:
# Import the required libraries

from pyspark.sql.functions import datediff,date_format,to_date,to_timestamp

In [5]:
import pyspark.sql.functions as f

In [6]:
# Parse the timestamp attributes, which are stored as strings, into new timestamp and date columns
# Cast the target column made_sla to an integer (made_sla_int)
# Compute two durations in days (opened to resolved, opened to closed) as candidate independent variables

df=df.withColumn('resolved_ts',to_timestamp(df.resolved_at, 'dd/MM/yyyy HH:mm')).\
        withColumn('opened_ts',to_timestamp(df.opened_at, 'dd/MM/yyyy HH:mm')).\
        withColumn('closed_ts',to_timestamp(df.closed_at, 'dd/MM/yyyy HH:mm')).\
        withColumn('resolved',to_date(df.resolved_at, 'dd/MM/yyyy HH:mm')).\
        withColumn('opened',to_date(df.opened_at, 'dd/MM/yyyy HH:mm')).\
        withColumn('closed',to_date(df.closed_at, 'dd/MM/yyyy HH:mm')).\
        withColumn('knowledge', f.col('knowledge').cast('string')).\
        replace(['TRUE'], 'True', subset='knowledge').\
        replace(['FALSE'], 'False', subset='knowledge').\
        withColumn('resolved_duration',datediff(to_date(df.resolved_at, 'dd/MM/yyyy HH:mm'),\
                                                to_date(df.opened_at, 'dd/MM/yyyy HH:mm'))).\
        withColumn('closed_duration',datediff(to_date(df.closed_at, 'dd/MM/yyyy HH:mm'),\
                                                to_date(df.opened_at, 'dd/MM/yyyy HH:mm'))).\
        withColumn('made_sla_int',df.made_sla.cast('integer'))
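
The two duration columns count whole calendar days, because `datediff` is applied to the `to_date` values rather than to the timestamps. The sketch below mirrors that arithmetic in plain Python so the granularity is easy to see. The helper name `day_diff` and the sample timestamps are invented for illustration and are not taken from the dataset; `%d/%m/%Y %H:%M` is the `strptime` counterpart of the Spark pattern `dd/MM/yyyy HH:mm`.

```python
from datetime import datetime

# strptime equivalent of the Spark pattern 'dd/MM/yyyy HH:mm'
FMT = "%d/%m/%Y %H:%M"


def day_diff(end_str, start_str):
    """Whole calendar days from start to end, ignoring the time of day."""
    end = datetime.strptime(end_str, FMT).date()
    start = datetime.strptime(start_str, FMT).date()
    return (end - start).days


# Hypothetical sample values, not rows from the dataset
same_day = day_diff("29/02/2016 11:29", "29/02/2016 01:16")
over_midnight = day_diff("01/03/2016 00:10", "29/02/2016 23:50")
five_days = day_diff("05/03/2016 12:00", "29/02/2016 01:16")

print(same_day, over_midnight, five_days)
```

An incident resolved ten hours after opening gets a duration of 0, while one resolved twenty minutes later but across midnight gets 1, so these features are coarse for incidents that close quickly.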

In [7]:
# The dataset has multiple states (New, Active, Awaiting user info, Resolved, Closed, etc.) per incident.
# The command below keeps one Closed record per incident, intended to be the one with the highest sys_mod_count.

df_unique_incidents=df.filter("incident_state=='Closed'").sort("sys_mod_count",ascending=False).dropDuplicates(["number"])
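
One caveat on the cell above: as far as we know, Spark does not guarantee that `dropDuplicates` keeps the first row of a preceding `sort`, so the surviving Closed record may not always be the one with the highest `sys_mod_count`. The plain-Python sketch below spells out the intended selection rule on a few invented records (the incident numbers and counts are hypothetical). In Spark the same rule is commonly expressed with `row_number()` over a window partitioned by `number` and ordered by `sys_mod_count` descending.

```python
# Hypothetical records, shaped like rows of the incident log
records = [
    {"number": "INC001", "incident_state": "Active", "sys_mod_count": 2},
    {"number": "INC001", "incident_state": "Closed", "sys_mod_count": 4},
    {"number": "INC001", "incident_state": "Closed", "sys_mod_count": 5},
    {"number": "INC002", "incident_state": "Closed", "sys_mod_count": 1},
    {"number": "INC003", "incident_state": "Resolved", "sys_mod_count": 3},
]


def latest_closed_per_incident(rows):
    """Keep, for each incident, the Closed row with the highest sys_mod_count."""
    latest = {}
    for row in rows:
        if row["incident_state"] != "Closed":
            continue  # only closed states are candidates
        best = latest.get(row["number"])
        if best is None or row["sys_mod_count"] > best["sys_mod_count"]:
            latest[row["number"]] = row
    return latest


unique_incidents = latest_closed_per_incident(records)
for number, row in sorted(unique_incidents.items()):
    print(number, row["sys_mod_count"])
```

Incidents that never reach the Closed state (INC003 here) are dropped entirely, which matches the filter in the original cell.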

In [8]:
# Selecting the dependent and the independent variables that are identified as most useful attributes to make predictions

data=df_unique_incidents.select(['sys_mod_count','opened_by','location','category','priority','assignment_group',
                                 'knowledge','resolved_duration','closed_duration','made_sla_int'])

In [9]:
data=data.dropna()

In [10]:
# Create a 70-30 train-test split (no seed is set, so the split and the metrics below vary between runs)

train_data,test_data=data.randomSplit([0.7,0.3])

------------------------------------------------------------------------------------------------------------------------------

### Building the Linear SVC model

In [12]:
# Import the required libraries

from pyspark.ml.classification import LinearSVC
from pyspark.ml.feature import VectorAssembler,StringIndexer,StandardScaler
from pyspark.ml import Pipeline

In [13]:
# Use StringIndexer to convert the categorical columns to hold numerical data

opened_by_indexer = StringIndexer(inputCol='opened_by',outputCol='opened_by_index',handleInvalid='keep')
location_indexer = StringIndexer(inputCol='location',outputCol='location_index',handleInvalid='keep')
category_indexer = StringIndexer(inputCol='category',outputCol='category_index',handleInvalid='keep')
priority_indexer = StringIndexer(inputCol='priority',outputCol='priority_index',handleInvalid='keep')
assignment_group_indexer = StringIndexer(inputCol='assignment_group',outputCol='assignment_group_index',handleInvalid='keep')
knowledge_indexer = StringIndexer(inputCol='knowledge',outputCol='knowledge_index',handleInvalid='keep')
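
To make the indexing step concrete, here is a minimal plain-Python sketch of what `StringIndexer` does with its default `frequencyDesc` ordering and `handleInvalid='keep'`: the most frequent label gets index 0, and any label not seen during fitting is mapped to one extra index equal to the number of known labels. The function names and category values are made up for illustration, and ties are broken alphabetically here as a simplifying assumption.

```python
from collections import Counter


def fit_frequency_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; alphabetical tie-break is an assumption
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_keep_invalid(mapping, values):
    """Encode values; unseen labels share one extra index (handleInvalid='keep')."""
    unseen_index = float(len(mapping))
    return [mapping.get(v, unseen_index) for v in values]


# Hypothetical training labels
train = ["Category 26", "Category 26", "Category 26",
         "Category 53", "Category 53", "Category 9"]
mapping = fit_frequency_index(train)

# "Category 77" does not occur in the training labels
encoded = transform_keep_invalid(mapping, ["Category 9", "Category 26", "Category 77"])

print(mapping)
print(encoded)
```

The resulting indices are ordinal numbers, so a linear model treats "Category 9" as numerically larger than "Category 26". One-hot encoding the indices is a common alternative when that ordering has no meaning.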

In [19]:
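# VectorAssembler combines the indexed columns into a single vector of input features
# Note: sys_mod_count, resolved_duration and closed_duration were selected earlier but are not included here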

assembler = VectorAssembler(inputCols=['opened_by_index','location_index','category_index',
                                       'priority_index','assignment_group_index','knowledge_index'],
                            outputCol="unscaled_features")

In [20]:
# StandardScaler scales the assembled features before they are passed to the linear SVC

scaler = StandardScaler(inputCol="unscaled_features",outputCol="features")
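
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. The short sketch below reproduces that on a single made-up column of index values; it only illustrates the arithmetic and is not the Spark implementation.

```python
from statistics import mean, stdev

# Hypothetical values of one indexed feature
column = [0.0, 1.0, 2.0, 3.0]

# Sample (n - 1) standard deviation, which is what the default scaling divides by
sigma = stdev(column)

# Scale to unit standard deviation without centering
scaled = [x / sigma for x in column]

print(round(stdev(scaled), 6))  # unit standard deviation after scaling
print(mean(scaled) > 0)         # the mean is not shifted to zero
```

Because the mean is left untouched, zeros stay zeros, which keeps sparse feature vectors sparse.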

In [21]:
# Create an object for the Linear SVC model

svc_model = LinearSVC(labelCol='made_sla_int')

In [22]:
# Pipeline chains the indexers, assembler, scaler and model so that each stage runs in sequence.
# It also ensures the test data is pre-processed in the same way as the training data.

pipe = Pipeline(stages=[opened_by_indexer,location_indexer,category_indexer,priority_indexer,
                        assignment_group_indexer,knowledge_indexer,assembler,scaler,svc_model])

In [23]:
# Training the model took around 30 minutes

fit_model=pipe.fit(train_data)

In [24]:
# Store the results in a dataframe

results = fit_model.transform(test_data)

In [25]:
results.select(['made_sla_int','prediction']).show()

+------------+----------+
|made_sla_int|prediction|
+------------+----------+
|           0|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
|           1|       0.0|
|           1|       1.0|
|           1|       1.0|
|           1|       1.0|
+------------+----------+
only showing top 20 rows



-------------------------------------------------------------------------------------------------------------------------------

### Evaluating the model

##### 1. Area under the ROC curve

In [26]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [27]:
AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='made_sla_int',metricName='areaUnderROC')

In [28]:
AUC = AUC_evaluator.evaluate(results)

In [29]:
print("The area under the curve is {}".format(AUC))

The area under the curve is 0.64819823595459


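An area under the ROC curve of about 0.65 is better than the 0.5 of a random classifier, but only modestly so. Note that the evaluator was given the hard 0/1 `prediction` column rather than `rawPrediction`, so this value describes a single operating point instead of a sweep over all decision thresholds.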
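
Because the evaluator above receives the 0/1 `prediction` column, the ROC curve it integrates has only one interior point. The sketch below recomputes that single-point area from the confusion-matrix counts printed in section 4 of this notebook and shows that it coincides with the balanced accuracy (the mean of the true positive rate and the true negative rate). The variable names are our own.

```python
# Counts from the confusion matrix in section 4 (rows: true label, columns: prediction)
tn, fp = 951, 1708   # incidents that did not meet the SLA
fn, tp = 271, 4153   # incidents that met the SLA

tpr = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)  # false positive rate

# ROC curve built from hard predictions: (0, 0) -> (fpr, tpr) -> (1, 1)
roc_points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]

# Trapezoidal rule over consecutive points
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(roc_points, roc_points[1:]))

balanced_accuracy = (tpr + (1 - fpr)) / 2

print(auc, balanced_accuracy)
```

Passing `rawPredictionCol='rawPrediction'` to `BinaryClassificationEvaluator` would instead rank incidents by the model's raw score, which is the usual way to obtain a threshold-independent AUC.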

------------------------------------------------------------------------------------------------------------------------------

##### 2. Area under the PR curve

In [30]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='made_sla_int',metricName='areaUnderPR')

In [31]:
PR = PR_evaluator.evaluate(results)

In [32]:
print("The area under the PR curve is {}".format(PR))

The area under the PR curve is 0.7060097342722701


------------------------------------------------------------------------------------------------------------------------------

##### 3. Accuracy

In [33]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [34]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="made_sla_int", predictionCol="prediction", metricName="accuracy")

In [35]:
accuracy = ACC_evaluator.evaluate(results)

In [36]:
print("The accuracy of the model is {}".format(accuracy))

The accuracy of the model is 0.720598616405478


------------------------------------------------------------------------------------------------------------------------------

##### 4. Confusion Matrix

In [37]:
from sklearn.metrics import confusion_matrix

In [39]:
y_true = results.select("made_sla_int")
y_true = y_true.toPandas()

y_pred = results.select("prediction")
y_pred = y_pred.toPandas()

cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix: \n {}".format(cnf_matrix))

Below is the confusion matrix: 
 [[ 951 1708]
 [ 271 4153]]
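
In scikit-learn's layout the rows are the true labels (0, then 1) and the columns are the predicted labels, so the matrix reads as [[TN, FP], [FN, TP]] with "met SLA" as the positive class. As a quick worked check, the sketch below derives accuracy, precision and per-class recall from these four counts; the variable names are ours.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[951, 1708],
      [271, 4153]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp

accuracy = (tn + tp) / total       # share of all incidents classified correctly
precision_met = tp / (tp + fp)     # of incidents predicted to meet SLA, share that did
recall_met = tp / (tp + fn)        # of incidents that met SLA, share predicted as such
recall_missed = tn / (tn + fp)     # of incidents that missed SLA, share predicted as such

print("accuracy      {:.4f}".format(accuracy))
print("precision (1) {:.4f}".format(precision_met))
print("recall (1)    {:.4f}".format(recall_met))
print("recall (0)    {:.4f}".format(recall_missed))
```

The accuracy matches the value reported in section 3. The model recovers most incidents that met the SLA but only about 36% of those that missed it, which is worth keeping in mind if SLA breaches are the cases of interest.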
