# K Means to cluster the incidents into 4 groups (Priority levels)

The Incident Management dataset has about 141,712 records covering 24,918 incidents. Each state of an incident is captured as an individual record, with a few exceptions where the closed state of an incident is recorded more than once. The code below loads and cleans the Incident Management data so that each incident is represented by a single record: the one for its truly closed state.

------------------------------------------------------------------------------------------------------------------------------

##### Create a spark session and load the Incident Management Data set

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('IMMLKM2').getOrCreate()

In [3]:
df = spark.read.csv('incident_event_log.csv',inferSchema=True,header=True)

------------------------------------------------------------------------------------------------------------------------------

##### Data pre-processing

In [4]:
# The data set holds multiple states (New, Active, Awaiting User Info, Resolved, Closed, etc.) for each incident.
# The command below is intended to keep one record per incident: the closed-state record with the highest sys_mod_count.

df_unique_incidents=df.filter("incident_state=='Closed'").sort("sys_mod_count",ascending=False).dropDuplicates(["number"])
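The rule we want from the cell above is "for each incident number, keep the closed record with the highest `sys_mod_count`". As far as we know, Spark does not document that `dropDuplicates` retains the first row of a preceding `sort`, so it helps to spell the intended rule out. The plain-Python sketch below illustrates that rule on a few hypothetical rows; the sample values and the helper name `latest_closed_per_incident` are assumptions for illustration and are not part of the notebook.

```python
def latest_closed_per_incident(records):
    """Keep, per incident number, the Closed record with the highest sys_mod_count."""
    best = {}
    for rec in records:
        if rec["incident_state"] != "Closed":
            continue  # ignore every non-closed state
        current = best.get(rec["number"])
        if current is None or rec["sys_mod_count"] > current["sys_mod_count"]:
            best[rec["number"]] = rec
    return best


# Hypothetical sample rows: INC001 has its closed state recorded twice.
sample = [
    {"number": "INC001", "incident_state": "New", "sys_mod_count": 0},
    {"number": "INC001", "incident_state": "Closed", "sys_mod_count": 3},
    {"number": "INC001", "incident_state": "Closed", "sys_mod_count": 4},
    {"number": "INC002", "incident_state": "Closed", "sys_mod_count": 2},
]

unique = latest_closed_per_incident(sample)
for number, rec in sorted(unique.items()):
    print(number, rec["sys_mod_count"])
```

In Spark, the same rule can be expressed deterministically with a window partitioned by `number`, ordered by `sys_mod_count` descending, and filtered to `row_number() == 1`.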

In [5]:
# Select the attributes identified as most useful: six input features plus the priority column

data=df_unique_incidents.select(['opened_by','location','category','subcategory',
                                 'u_symptom','assignment_group','priority'])

In [6]:
data=data.dropna()

------------------------------------------------------------------------------------------------------------------------------

### Building the K Means model

In [7]:
# Import the required libraries

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler,StringIndexer
from pyspark.ml import Pipeline

In [8]:
# Use StringIndexer to convert the categorical columns to hold numerical data

opened_by_indexer = StringIndexer(inputCol='opened_by',outputCol='opened_by_index',handleInvalid='keep')
location_indexer = StringIndexer(inputCol='location',outputCol='location_index',handleInvalid='keep')
category_indexer = StringIndexer(inputCol='category',outputCol='category_index',handleInvalid='keep')
subcategory_indexer = StringIndexer(inputCol='subcategory',outputCol='subcategory_index',handleInvalid='keep')
u_symptom_indexer = StringIndexer(inputCol='u_symptom',outputCol='u_symptom_index',handleInvalid='keep')
assignment_group_indexer = StringIndexer(inputCol='assignment_group',outputCol='assignment_group_index',handleInvalid='keep')
priority_indexer = StringIndexer(inputCol='priority',outputCol='priority_index',handleInvalid='keep')
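By default, `StringIndexer` orders labels by descending frequency, so the most frequent label gets index 0.0, and `handleInvalid='keep'` places labels unseen during fitting in one extra index after the known ones. The plain-Python sketch below mimics that behaviour on hypothetical category values; the helper names and the alphabetical tie-break are assumptions, and this is not Spark's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_label(mapping, label):
    """Look up a label; unseen labels share one extra index, like handleInvalid='keep'."""
    return mapping.get(label, float(len(mapping)))


# Hypothetical values for the 'category' column.
categories = ["Category 26", "Category 53", "Category 26",
              "Category 42", "Category 26", "Category 53"]

mapping = fit_string_index(categories)
print(mapping)
print(transform_label(mapping, "Category 99"))  # label not seen during fitting
```

Because the indices reflect only label frequency, the numeric distance between two indices carries no meaning about how similar the underlying categories are.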

In [9]:
# Vector assembler is used to create a vector of input features

assembler = VectorAssembler(inputCols=['opened_by_index','location_index','category_index',
                                       'subcategory_index','u_symptom_index','assignment_group_index'],
                            outputCol="features")

In [10]:
# A Pipeline passes the data through the indexers and the assembler in sequence. It also lets us pre-process the test data
# in the same way as the training data.

pipe = Pipeline(stages=[opened_by_indexer,location_indexer,category_indexer,subcategory_indexer,
                        u_symptom_indexer,assignment_group_indexer,priority_indexer,assembler])

In [11]:
# It took 5 minutes for this step to complete execution

final_data=pipe.fit(data).transform(data)

In [12]:
# Create a K-Means estimator with k=4 clusters

kmeans_model = KMeans(k=4)

In [13]:
fit_model = kmeans_model.fit(final_data)

In [14]:
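# computeCost returns the within-set sum of squared errors (WSSSE).
# Note: computeCost is deprecated since Spark 2.4 and removed in Spark 3.0.
wssse = fit_model.computeCost(final_data)
print("The within set sum of squared error of the model is {}".format(wssse))

The within set sum of squared error of the model is 37759716.044208884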
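The within-set sum of squared errors is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster centre. The plain-Python sketch below computes it for a tiny hypothetical data set; the points, centres, and function names are assumptions chosen so the arithmetic is easy to follow.

```python
def squared_distance(point, center):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(point, center))


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical 2-D points forming two tight groups, with one centre per group.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 1.0), (10.0, 11.0)]

# Every point lies exactly 1 unit from its nearest centre, so the total is 4.0.
print(wssse(points, centers))
```

The value depends on the scale of the features, so it is mainly useful for comparing models fitted on the same data, for example with different values of k.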


In [15]:
centers = fit_model.clusterCenters()

In [16]:
print("Cluster Centers")
index=1
for cluster in centers:
    print("Centroid {}: {}".format(index,cluster))
    index+=1

Cluster Centers
Centroid 1: [9.88831181 9.16459567 5.29759247 5.9773323  4.14687894 6.00058247]
Centroid 2: [ 23.28323699  11.81310212  10.38150289  71.05009634 238.61464355
  17.89017341]
Centroid 3: [18.04699248 10.39520677 11.2612782  82.88909774  8.93609023 13.58317669]
Centroid 4: [13.18154584 10.6003595   7.63151588 34.09646495 87.1312163  14.5817855 ]


In [17]:
# Store the results in a dataframe

results = fit_model.transform(final_data)
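`transform` adds a `prediction` column holding the index of the nearest cluster centre for each row. The plain-Python sketch below reproduces that assignment for one row, using the centroid values printed above and the index values of the first row in the results table. The helper name `nearest_center` is an assumption, and the sketch only illustrates the nearest-centre rule under squared Euclidean distance.

```python
# Centroids copied from the "Cluster Centers" output above.
centers = [
    [9.88831181, 9.16459567, 5.29759247, 5.9773323, 4.14687894, 6.00058247],
    [23.28323699, 11.81310212, 10.38150289, 71.05009634, 238.61464355, 17.89017341],
    [18.04699248, 10.39520677, 11.2612782, 82.88909774, 8.93609023, 13.58317669],
    [13.18154584, 10.6003595, 7.63151588, 34.09646495, 87.1312163, 14.5817855],
]


def nearest_center(point, centers):
    """Return the 0-based index of the centre closest to the point."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return distances.index(min(distances))


# Feature order: opened_by, location, category, subcategory, u_symptom, assignment_group.
row = [14.0, 4.0, 4.0, 4.0, 1.0, 0.0]
print(nearest_center(row, centers))  # 0
```

Note that `prediction` is 0-based, so prediction 0 corresponds to the line labelled "Centroid 1" in the printed output.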

In [18]:
results.select(['opened_by_index','location_index','category_index','subcategory_index',
                'u_symptom_index','assignment_group_index','prediction']).show()

+---------------+--------------+--------------+-----------------+---------------+----------------------+----------+
|opened_by_index|location_index|category_index|subcategory_index|u_symptom_index|assignment_group_index|prediction|
+---------------+--------------+--------------+-----------------+---------------+----------------------+----------+
|           14.0|           4.0|           4.0|              4.0|            1.0|                   0.0|         0|
|            9.0|           4.0|           1.0|              2.0|            1.0|                   0.0|         0|
|            2.0|           0.0|          10.0|              3.0|           41.0|                   7.0|         0|
|            3.0|           5.0|           8.0|             11.0|            1.0|                  11.0|         0|
|           29.0|          22.0|          10.0|              3.0|           12.0|                   7.0|         0|
|           17.0|          18.0|           5.0|              3.0|       

In [19]:
results.groupby('prediction').count().sort('prediction').show()

+----------+-----+
|prediction|count|
+----------+-----+
|         0|20602|
|         1|  519|
|         2| 2128|
|         3| 1669|
+----------+-----+
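To judge whether the four clusters line up with the four priority levels, one option is to count how often each cluster co-occurs with each priority. The plain-Python sketch below builds such a cross-tabulation from hypothetical (prediction, priority) pairs; the pairs, the priority labels, and the helper name `cross_tab` are assumptions for illustration and are not results from this notebook.

```python
from collections import Counter


def cross_tab(pairs):
    """Count how many rows fall into each (cluster, priority) combination."""
    return Counter(pairs)


# Hypothetical (prediction, priority) pairs, not taken from the real results.
pairs = [
    (0, "3 - Moderate"), (0, "3 - Moderate"), (0, "4 - Low"),
    (1, "2 - High"), (2, "3 - Moderate"), (3, "4 - Low"),
]

table = cross_tab(pairs)
for (cluster, priority), count in sorted(table.items()):
    print(cluster, priority, count)
```

In Spark, `results.crosstab('prediction', 'priority')` produces a comparable table directly from the results DataFrame.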



-------------------------------------------------------------------------------------------------------------------------------