# Linear Regression to estimate duration

The Incident Management dataset has about 141,712 records covering 24,918 incidents. Each state an incident passes through is captured as a separate record, and in a few cases the closed state of an incident is recorded more than once. The code below loads and cleans the data so that each incident is represented by a single record, the one for its truly closed state.

------------------------------------------------------------------------------------------------------------------------------

##### Create a Spark session and load the Incident Management dataset

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('IMMLLR2').getOrCreate()

In [3]:
df = spark.read.csv('incident_event_log.csv',inferSchema=True,header=True)

------------------------------------------------------------------------------------------------------------------------------

##### Data pre-processing

In [4]:
# Import the required libraries

from pyspark.sql.functions import datediff,date_format,to_date,to_timestamp

In [5]:
import pyspark.sql.functions as f

In [6]:
# Parse the attributes whose timestamps are stored as strings into new timestamp and date columns
# Cast the boolean 'knowledge' column to string and normalise its values to 'True'/'False'
# Create the duration column (number of days between the date the incident was opened and the date it was resolved)

df=df.withColumn('resolved_ts',to_timestamp(df.resolved_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('opened_ts',to_timestamp(df.opened_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('sys_created_ts',to_timestamp(df.sys_created_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('sys_updated_ts',to_timestamp(df.sys_updated_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('closed_ts',to_timestamp(df.closed_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('resolved',to_date(df.resolved_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('opened',to_date(df.opened_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('sys_created',to_date(df.sys_created_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('sys_updated',to_date(df.sys_updated_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('closed',to_date(df.closed_at, 'dd/MM/yyyy HH:mm')).\
                withColumn('knowledge', f.col('knowledge').cast('string')).\
                replace(['TRUE',], 'True', subset='knowledge').\
                replace(['FALSE'], 'False', subset='knowledge').\
                withColumn('duration',datediff(to_date(df.resolved_at, 'dd/MM/yyyy HH:mm'),to_date(df.opened_at, 'dd/MM/yyyy HH:mm')))
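
To make the `duration` definition concrete, here is a minimal plain-Python sketch of the same calendar-day arithmetic. It assumes the `dd/MM/yyyy HH:mm` pattern used above corresponds to `%d/%m/%Y %H:%M` in `datetime.strptime`; the helper name `duration_days` and the sample timestamps are hypothetical. Because both timestamps are reduced to dates before subtracting, an incident resolved a few minutes after midnight still counts as one day.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"  # assumed equivalent of the 'dd/MM/yyyy HH:mm' pattern

def duration_days(opened_at: str, resolved_at: str) -> int:
    """Calendar days between opening and resolution (hypothetical helper)."""
    opened = datetime.strptime(opened_at, FMT).date()
    resolved = datetime.strptime(resolved_at, FMT).date()
    return (resolved - opened).days

print(duration_days("29/02/2016 08:00", "29/02/2016 17:30"))  # 0: same calendar day
print(duration_days("29/02/2016 23:50", "01/03/2016 00:10"))  # 1: crosses midnight
print(duration_days("29/02/2016 08:00", "04/03/2016 09:00"))  # 4
```

A duration of 0 therefore means "resolved on the day it was opened", not "resolved instantly".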

In [7]:
# The dataset records multiple states of each incident (New, Active, Awaiting user info, Resolved, Closed, etc.).
# The command below keeps one record per incident: a Closed-state record, preferring the highest sys_mod_count.
# Note: Spark does not guarantee which row dropDuplicates retains after a sort, so this preference is not strict.

df_unique_incidents=df.filter("incident_state=='Closed'").sort("sys_mod_count",ascending=False).dropDuplicates(["number"])
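
The selection rule that this filter aims for can be sketched in plain Python, outside Spark, to make the intent explicit: among the `Closed` records of each incident, keep the one with the largest `sys_mod_count`. The function name `latest_closed_records` and the sample rows are hypothetical; only the field names come from the dataset.

```python
def latest_closed_records(rows):
    """Return one Closed record per incident: the one with the highest sys_mod_count."""
    best = {}
    for row in rows:
        if row["incident_state"] != "Closed":
            continue  # ignore every non-closed state
        current = best.get(row["number"])
        if current is None or row["sys_mod_count"] > current["sys_mod_count"]:
            best[row["number"]] = row
    return list(best.values())

# Hypothetical sample: INC001 has its closed state recorded twice.
rows = [
    {"number": "INC001", "incident_state": "New", "sys_mod_count": 0},
    {"number": "INC001", "incident_state": "Closed", "sys_mod_count": 3},
    {"number": "INC001", "incident_state": "Closed", "sys_mod_count": 4},
    {"number": "INC002", "incident_state": "Resolved", "sys_mod_count": 1},
    {"number": "INC002", "incident_state": "Closed", "sys_mod_count": 2},
]

for record in latest_closed_records(rows):
    print(record["number"], record["sys_mod_count"])
```

In Spark, ranking the rows of each `number` with a window function and keeping the top-ranked row would be one way to express the same rule explicitly.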

In [8]:
# Select the dependent variable (duration) and the independent variables identified as most useful for estimating it

data=df_unique_incidents.select(['reassignment_count','reopen_count','sys_mod_count','opened_by',
                                 'location','category','subcategory','priority','assignment_group',
                                 'assigned_to','knowledge','resolved_by','duration'])

In [9]:
data.count()

24918

In [10]:
# Drop the rows that have a null value in any of the selected columns

data=data.dropna()

In [11]:
data.count()

23361

In [12]:
# Create a 70/30 train/test split (the proportions are approximate because rows are assigned at random)

train_data,test_data=data.randomSplit([0.7,0.3])
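
`randomSplit` assigns rows to the splits at random, so the 70/30 proportions are only approximate, and without a seed the split differs from run to run. The plain-Python sketch below illustrates both points; `random_split` is a hypothetical helper, not Spark's implementation, and the seed value is arbitrary. In PySpark, `randomSplit` accepts an optional `seed` argument for the same purpose.

```python
import random

def random_split(rows, train_fraction=0.7, seed=None):
    """Assign each row independently to train or test (illustrative helper)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))  # close to 700 and 300, but rarely exact
print(train_a == train_b)         # True: same seed, same split
```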

------------------------------------------------------------------------------------------------------------------------------

### Building the Linear Regression model

In [13]:
# Import the required libraries

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler,StringIndexer
from pyspark.ml import Pipeline

In [14]:
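# Use StringIndexer to encode each categorical column as numeric indices.
# handleInvalid='keep' assigns labels that were not seen during fitting to an extra index instead of raising an error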

opened_by_indexer = StringIndexer(inputCol='opened_by',outputCol='opened_by_index',handleInvalid='keep')
location_indexer = StringIndexer(inputCol='location',outputCol='location_index',handleInvalid='keep')
category_indexer = StringIndexer(inputCol='category',outputCol='category_index',handleInvalid='keep')
subcategory_indexer = StringIndexer(inputCol='subcategory',outputCol='subcategory_index',handleInvalid='keep')
priority_indexer = StringIndexer(inputCol='priority',outputCol='priority_index',handleInvalid='keep')
assignment_group_indexer = StringIndexer(inputCol='assignment_group',outputCol='assignment_group_index',handleInvalid='keep')
assigned_to_indexer = StringIndexer(inputCol='assigned_to',outputCol='assigned_to_index',handleInvalid='keep')
knowledge_indexer = StringIndexer(inputCol='knowledge',outputCol='knowledge_index',handleInvalid='keep')
resolved_by_indexer = StringIndexer(inputCol='resolved_by',outputCol='resolved_by_index',handleInvalid='keep')
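
As a rough illustration of what each indexer does, the plain-Python sketch below mimics `StringIndexer`'s default behaviour: labels are ordered by descending frequency, and with `handleInvalid='keep'` any label not seen during fitting maps to one extra index equal to the number of known labels. The helper names `fit_index` and `to_index`, the alphabetical tie-break, and the sample categories are assumptions made for the example.

```python
from collections import Counter

def fit_index(values):
    """Map each label to an index, most frequent first (alphabetical tie-break is an assumption)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def to_index(mapping, value):
    """Look up a label; unseen labels share one extra index, as with handleInvalid='keep'."""
    return mapping.get(value, float(len(mapping)))

# Hypothetical training values for one categorical column
train_categories = ["Category 26", "Category 42", "Category 26",
                    "Category 9", "Category 26", "Category 42"]
mapping = fit_index(train_categories)

print(mapping)                           # {'Category 26': 0.0, 'Category 42': 1.0, 'Category 9': 2.0}
print(to_index(mapping, "Category 42"))  # 1.0
print(to_index(mapping, "Category 99"))  # 3.0 (label not seen during fitting)
```

The indices are category codes rather than measured quantities, so a linear model that receives them directly treats them as if they were ordered.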

In [15]:
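# VectorAssembler combines the indexed columns into a single vector of input features.
# The numeric columns selected earlier (reassignment_count, reopen_count, sys_mod_count) are not included.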

assembler = VectorAssembler(inputCols=["opened_by_index",'location_index','category_index',
                                       'subcategory_index','priority_index','assignment_group_index',
                                       'assigned_to_index','knowledge_index','resolved_by_index'],
                            outputCol="features")

In [16]:
# The Pipeline passes the data through the indexers and the assembler in sequence. Fitting it once on the train data
# ensures that the test data is pre-processed in the same way as the train data

pipe = Pipeline(stages=[opened_by_indexer,location_indexer,category_indexer,subcategory_indexer,
                        priority_indexer,assignment_group_indexer,assigned_to_indexer,
                        knowledge_indexer,resolved_by_indexer,assembler])

In [17]:
fitted_pipe=pipe.fit(train_data)

In [18]:
train_data=fitted_pipe.transform(train_data)

In [19]:
# Create an object for the Linear Regression model

lr_model = LinearRegression(labelCol='duration')

In [20]:
# Fit the model on the train data

fit_model = lr_model.fit(train_data.select(['features','duration']))

In [21]:
# Apply the fitted pipeline to the test data so that it is indexed and assembled in the same way as the train data

test_data=fitted_pipe.transform(test_data)

In [22]:
# Use the fitted model to predict the duration for the test data and store the results in a DataFrame

results = fit_model.transform(test_data)

In [23]:
results.select(['duration','prediction']).show()

+--------+------------------+
|duration|        prediction|
+--------+------------------+
|       0| 16.34774670264543|
|       0|2.4465833664231917|
|       0| 4.385528391442999|
|       0|0.9160042743982437|
|       0| 4.709146325552941|
|       0| 5.162949904284861|
|       0| 4.824785067866328|
|       0| 4.731471422608705|
|       0| 6.936210150757579|
|       0|0.5288913728714648|
|       0| 4.867279976491825|
|       8|  17.7661347794221|
|       0| 5.831660171358101|
|       2| 5.742697595201046|
|       0| 4.308612166666301|
|       0| 3.870950806504972|
|       0| 5.384829499506154|
|       0|13.249085982661269|
|       4| 9.218928562704779|
|       0| 6.080209095904267|
+--------+------------------+
only showing top 20 rows



-------------------------------------------------------------------------------------------------------------------------------

##### Evaluating the model

In [24]:
test_results = fit_model.evaluate(test_data)

In [25]:
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
| -16.34774670264543|
|-2.4465833664231917|
| -4.385528391442999|
|-0.9160042743982437|
| -4.709146325552941|
| -5.162949904284861|
| -4.824785067866328|
| -4.731471422608705|
| -6.936210150757579|
|-0.5288913728714648|
| -4.867279976491825|
|   -9.7661347794221|
| -5.831660171358101|
|-3.7426975952010464|
| -4.308612166666301|
| -3.870950806504972|
| -5.384829499506154|
|-13.249085982661269|
| -5.218928562704779|
| -6.080209095904267|
+-------------------+
only showing top 20 rows



In [26]:
test_results.rootMeanSquaredError

20.92734461854175

The root mean squared error of about 20.9 days is very high relative to the durations shown above, most of which are 0 days, indicating that the model's predictions are poor.

In [27]:
test_results.r2

0.09146143422372677

The r-squared value implies that the model explains only about 9% of the variance in duration.
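
For reference, both metrics can be computed by hand from labels and predictions. The sketch below uses the standard definitions, with the residual taken as label minus prediction (the same sign convention as the residuals table above). The function names and the toy numbers are hypothetical and are not taken from the model's output.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((label - prediction)^2))."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(labels, predictions):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Toy data, chosen so the results are easy to check by hand
labels = [0, 2, 4]
predictions = [1, 2, 3]

print(round(rmse(labels, predictions), 4))  # 0.8165
print(r_squared(labels, predictions))       # 0.75
```

An R² of 0.75 means the predictions account for 75% of the variance in the toy labels; the model above reaches about 0.09 on the test data.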