# CLASSIFICATION

**Data Description**: 

**Domain**: 

**Context**: 

**Attribute Information**
* 1. **`number`**: incident identifier (24,918 different values);
* 2. **`incident_state`**: eight levels controlling the incident management process transitions from opening until closing the case;
* 3.  **`active`**: boolean attribute that shows whether the record is active or closed/canceled;
* 4.  **`reassignment_count`**: number of times the incident's assigned group or support analyst was changed;
* 5.  **`reopen_count`**: number of times the incident resolution was rejected by the caller;
* 6.  **`sys_mod_count`**: number of incident updates until that moment;
* 7.  **`made_sla`**: boolean attribute that shows whether the incident exceeded the target SLA;
* 8.  **`caller_id`**: identifier of the user affected;
* 9.  **`opened_by`**: identifier of the user who reported the incident;
* 10. **`opened_at`**: incident user opening date and time;
* 11. **`sys_created_by`**: identifier of the user who registered the incident;
* 12. **`sys_created_at`**: incident system creation date and time;
* 13. **`sys_updated_by`**: identifier of the user who updated the incident and generated the current log record;
* 14. **`sys_updated_at`**: incident system update date and time;
* 15. **`contact_type`**: categorical attribute that shows by what means the incident was reported;
* 16. **`location`**: identifier of the location of the place affected;
* 17. **`category`**: first-level description of the affected service;
* 18. **`subcategory`**: second-level description of the affected service (related to the first level description, i.e., to category);
* 19. **`u_symptom`**: description of the user perception about service availability;
* 20. **`cmdb_ci`**: (configuration item) identifier used to report the affected item (not mandatory);
* 21. **`impact`**: description of the impact caused by the incident (values: 1–High; 2–Medium; 3–Low);
* 22. **`urgency`**: description of the urgency informed by the user for the incident resolution (values: 1–High; 2–Medium; 3–Low);
* 23. **`priority`**: calculated by the system based on 'impact' and 'urgency';
* 24. **`assignment_group`**: identifier of the support group in charge of the incident;
* 25. **`assigned_to`**: identifier of the user in charge of the incident;
* 26. **`knowledge`**: boolean attribute that shows whether a knowledge base document was used to resolve the incident;
* 27. **`u_priority_confirmation`**: boolean attribute that shows whether the priority field has been double-checked;
* 28. **`notify`**: categorical attribute that shows whether notifications were generated for the incident;
* 29. **`problem_id`**: identifier of the problem associated with the incident;
* 30. **`rfc`**: (request for change) identifier of the change request associated with the incident;
* 31. **`vendor`**: identifier of the vendor in charge of the incident;
* 32. **`caused_by`**: identifier of the RFC responsible for the incident;
* 33. **`closed_code`**: identifier of the resolution of the incident;
* 34. **`resolved_by`**: identifier of the user who resolved the incident;
* 35. **`resolved_at`**: incident user resolution date and time (dependent variable);
* 36. **`closed_at`**: incident user close date and time (dependent variable).

**Learning Outcomes**
* 
* 
* 
* 

### INSTALLING NECESSARY PACKAGES FOR THIS PROJECT

In [1]:
!pip install -U imbalanced-learn

# Importing packages - Pandas, Numpy, Seaborn, Scipy
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns, sys
import matplotlib.style as style; style.use('fivethirtyeight')
from scipy.stats import zscore, norm

np.random.seed(0)

# Modelling - LR, KNN, metrics
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Oversampling
from imblearn.over_sampling import SMOTE

# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
pd.options.display.max_rows = 4000



### Task 1: Retrieving and Preparing the Data

In [14]:
# Read the data into a DataFrame and display the first five rows
incident_data = pd.read_csv('incident_event_log.csv')
incident_data.head()

Unnamed: 0,number,incident_state,active,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,opened_at,...,u_priority_confirmation,notify,problem_id,rfc,vendor,caused_by,closed_code,resolved_by,resolved_at,closed_at
0,INC0000045,New,True,0,0,0,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
1,INC0000045,Resolved,True,0,0,2,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
2,INC0000045,Resolved,True,0,0,3,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
3,INC0000045,Closed,False,0,0,4,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
4,INC0000047,New,True,0,0,0,True,Caller 2403,Opened by 397,29/2/2016 04:40,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,6/3/2016 10:00


#### Observation 1 - Checking for missing values represented as '?' and replacing them

In [22]:
# Replace '?' with 'unknown information' for missing values
incident_data.replace({'?': 'unknown information'}, inplace=True)
incident_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141712 entries, 0 to 141711
Data columns (total 36 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   number                   141712 non-null  object
 1   incident_state           141712 non-null  object
 2   active                   141712 non-null  bool  
 3   reassignment_count       141712 non-null  int64 
 4   reopen_count             141712 non-null  int64 
 5   sys_mod_count            141712 non-null  int64 
 6   made_sla                 141712 non-null  bool  
 7   caller_id                141712 non-null  object
 8   opened_by                141712 non-null  object
 9   opened_at                141712 non-null  object
 10  sys_created_by           141712 non-null  object
 11  sys_created_at           141712 non-null  object
 12  sys_updated_by           141712 non-null  object
 13  sys_updated_at           141712 non-null  object
 14  contact_type        
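
Because every `'?'` is swapped for a text label, `info()` reports all 141712 rows as non-null even in columns that are mostly placeholders. The sketch below, on a small made-up frame (`sample` and its values are illustrative, not taken from the dataset), shows one way to count the placeholders first and, as an alternative, to convert them to `NaN` so pandas' missing-value tools can see them.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a slice of the incident log (values are made up).
sample = pd.DataFrame({
    "number": ["INC0000045", "INC0000047", "INC0000057"],
    "problem_id": ["?", "?", "Problem ID 2"],
    "vendor": ["?", "Vendor 1", "?"],
})

# Count '?' placeholders per column before replacing them.
placeholder_counts = (sample == "?").sum()

# Option A (as in the cell above): keep a readable text label.
labelled = sample.replace({"?": "unknown information"})

# Option B: convert to NaN so isna(), dropna() and fillna() recognise the gaps.
as_nan = sample.replace({"?": np.nan})
missing_counts = as_nan.isna().sum()

print(placeholder_counts)
print(missing_counts)
```

Option A keeps the column usable as a category level; Option B is the better fit when the missing share per column needs to be reported or filtered on.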

#### Observation 2 - Removing white spaces from string columns

In [25]:
# Removing white spaces from string columns
string_columns = incident_data.select_dtypes(include='object').columns
incident_data[string_columns] = incident_data[string_columns].applymap(lambda x: x.strip() if isinstance(x, str) else x)
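
An alternative to the element-wise lambda is the vectorised `Series.str.strip()`, applied one column at a time. This is a minimal sketch on a made-up frame (`sample` is illustrative); note that, unlike the `isinstance` check above, `.str.strip()` turns non-string entries in an object column into `NaN`, so it assumes the columns hold strings only.

```python
import pandas as pd

# Toy frame with stray leading/trailing spaces (values are made up).
sample = pd.DataFrame({
    "caller_id": ["  Caller 2403", "Caller 2403  "],
    "contact_type": [" Phone ", "Phone"],
    "reopen_count": [0, 1],
})

string_columns = sample.select_dtypes(include="object").columns

# Strip whitespace column by column; numeric columns are not selected.
for col in string_columns:
    sample[col] = sample[col].str.strip()

print(sample)
```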

#### Observation 3 - Converting date columns to datetime format

In [27]:
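# Converting date columns to datetime format
# Note: the raw timestamps are day-first (e.g. 29/2/2016 01:16). Without dayfirst=True or an
# explicit format, ambiguous values such as 5/3/2016 12:00 are read month-first (2016-05-03).
datetime_columns = ['opened_at', 'sys_created_at', 'sys_updated_at', 'resolved_at', 'closed_at']
incident_data[datetime_columns] = incident_data[datetime_columns].apply(pd.to_datetime, errors='coerce')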
incident_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141712 entries, 0 to 141711
Data columns (total 36 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   number                   141712 non-null  object        
 1   incident_state           141712 non-null  object        
 2   active                   141712 non-null  bool          
 3   reassignment_count       141712 non-null  int64         
 4   reopen_count             141712 non-null  int64         
 5   sys_mod_count            141712 non-null  int64         
 6   made_sla                 141712 non-null  bool          
 7   caller_id                141712 non-null  object        
 8   opened_by                141712 non-null  object        
 9   opened_at                141712 non-null  datetime64[ns]
 10  sys_created_by           141712 non-null  object        
 11  sys_created_at           88636 non-null   datetime64[ns]
 12  sys_updated_by  
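
The raw preview shows day-first timestamps (`29/2/2016 01:16`), so naming the format explicitly removes any ambiguity for values such as `5/3/2016`. A minimal sketch, assuming the `%d/%m/%Y %H:%M` layout seen in the first rows holds for the whole file (not verified here); `raw` is an illustrative series, not dataset content.

```python
import pandas as pd

# Two timestamps in the layout shown in the raw preview, plus a non-date placeholder.
raw = pd.Series(["29/2/2016 11:29", "5/3/2016 12:00", "unknown information"])

# Explicit day-first format; values that do not match become NaT.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", errors="coerce")

# For comparison: the default (dayfirst=False) reads an ambiguous date month-first.
month_first = pd.to_datetime("5/3/2016 12:00")

print(parsed)
print(month_first)
```

With the explicit format, `5/3/2016` is read as 5 March 2016, whereas the default parse reads it as 3 May 2016. Non-date entries become `NaT`, which would also account for non-null counts below 141712 in columns that contained placeholders.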

#### Observation 4 - Standardizing text data

In [29]:
# Standardising text data: lower-case all string columns
text_columns = incident_data.select_dtypes(include='object').columns
incident_data[text_columns] = incident_data[text_columns].apply(lambda x: x.str.lower())

#### Observation 5 - Feature engineering for incident duration (Preparation for Task 2: Feature Engineering)

In [33]:
# Feature engineering for incident duration
incident_data['duration_hours'] = (incident_data['closed_at'] - incident_data['opened_at']).dt.total_seconds() / 3600
incident_data['duration_hours'] = incident_data['duration_hours'].fillna(incident_data['duration_hours'].median())
incident_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141712 entries, 0 to 141711
Data columns (total 37 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   number                   141712 non-null  object        
 1   incident_state           141712 non-null  object        
 2   active                   141712 non-null  bool          
 3   reassignment_count       141712 non-null  int64         
 4   reopen_count             141712 non-null  int64         
 5   sys_mod_count            141712 non-null  int64         
 6   made_sla                 141712 non-null  bool          
 7   caller_id                141712 non-null  object        
 8   opened_by                141712 non-null  object        
 9   opened_at                141712 non-null  datetime64[ns]
 10  sys_created_by           141712 non-null  object        
 11  sys_created_at           88636 non-null   datetime64[ns]
 12  sys_updated_by  
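
The duration step can be traced on a few hand-made rows. The sketch below uses an illustrative `events` frame (made-up incident numbers and timestamps) to show the subtraction, the conversion to hours, and the median fill for a row whose `closed_at` is missing.

```python
import pandas as pd

# Toy event log: inc1 appears twice, inc3 has no close time (values are made up).
events = pd.DataFrame({
    "number": ["inc1", "inc1", "inc2", "inc3"],
    "opened_at": pd.to_datetime(
        ["2016-02-29 01:16", "2016-02-29 01:16", "2016-02-29 04:40", "2016-03-01 08:00"]
    ),
    "closed_at": pd.to_datetime(
        ["2016-03-05 12:00", "2016-03-05 12:00", "2016-03-06 10:00", None]
    ),
})

# Timedelta -> hours; rows with a missing timestamp yield NaN.
events["duration_hours"] = (
    events["closed_at"] - events["opened_at"]
).dt.total_seconds() / 3600

n_missing = events["duration_hours"].isna().sum()
median_hours = events["duration_hours"].median()  # NaN values are skipped

# Fill the gaps with the median, assigning the result back to the column.
events["duration_hours"] = events["duration_hours"].fillna(median_hours)

print(events[["number", "duration_hours"]])
```

Since the log holds several rows per incident, the median here is taken over event rows rather than over incidents, so incidents with many updates carry more weight. Whether that is desirable depends on the modelling goal.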

#### Observation 6 - Conducting Exploratory Data Analysis (EDA)

In [34]:
print("Descriptive Statistics for Numerical Columns:")
print(incident_data.describe())

Descriptive Statistics for Numerical Columns:
       reassignment_count   reopen_count  sys_mod_count   
count       141712.000000  141712.000000  141712.000000  \
mean             1.104197       0.021918       5.080946   
min              0.000000       0.000000       0.000000   
25%              0.000000       0.000000       1.000000   
50%              1.000000       0.000000       3.000000   
75%              1.000000       0.000000       6.000000   
max             27.000000       8.000000     129.000000   
std              1.734673       0.207302       7.680652   

                           opened_at                 sys_created_at   
count                         141712                          88636  \
mean   2016-04-12 22:19:09.100852736  2016-04-08 16:21:18.906539008   
min              2016-02-29 01:16:00            2016-02-29 01:23:00   
25%              2016-03-16 15:24:00            2016-03-14 16:13:00   
50%              2016-04-07 16:27:00            2016-04-01 20:08:00
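
The statistics above are computed over event rows, and the attribute list gives 24,918 distinct incident numbers for the 141712 rows. A per-incident view can complement them; this is a sketch on a made-up `events` frame, and treating the row with the highest `sys_mod_count` as each incident's final state is an assumption.

```python
import pandas as pd

# Toy event log mirroring the structure of the first preview rows.
events = pd.DataFrame({
    "number": ["inc0000045"] * 4 + ["inc0000047"],
    "incident_state": ["new", "resolved", "resolved", "closed", "new"],
    "sys_mod_count": [0, 2, 3, 4, 0],
    "reassignment_count": [0, 0, 0, 0, 0],
})

# How many distinct incidents, and how many log rows each one has.
n_incidents = events["number"].nunique()
rows_per_incident = events.groupby("number").size()

# Keep one row per incident: the one with the highest update counter.
latest = events.sort_values("sys_mod_count").drop_duplicates("number", keep="last")

print(n_incidents)
print(rows_per_incident)
print(latest)
```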

#### Observation 7 - Data normalisation/standardisation

In [35]:
# Standardise duration_hours to z-scores (zero mean, unit standard deviation)
incident_data['duration_hours'] = (incident_data['duration_hours'] - incident_data['duration_hours'].mean()) / incident_data['duration_hours'].std()
incident_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141712 entries, 0 to 141711
Data columns (total 37 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   number                   141712 non-null  object        
 1   incident_state           141712 non-null  object        
 2   active                   141712 non-null  bool          
 3   reassignment_count       141712 non-null  int64         
 4   reopen_count             141712 non-null  int64         
 5   sys_mod_count            141712 non-null  int64         
 6   made_sla                 141712 non-null  bool          
 7   caller_id                141712 non-null  object        
 8   opened_by                141712 non-null  object        
 9   opened_at                141712 non-null  datetime64[ns]
 10  sys_created_by           141712 non-null  object        
 11  sys_created_at           88636 non-null   datetime64[ns]
 12  sys_updated_by  
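
The manual formula can be checked against `scipy.stats.zscore`, which is already imported at the top of the notebook. One detail to watch: pandas' `Series.std()` uses the sample standard deviation (`ddof=1`), while `zscore` defaults to `ddof=0`. The sketch below uses an illustrative `hours` series.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up durations in hours.
hours = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Manual z-score as in the cell above: pandas std() is the sample std (ddof=1).
manual = (hours - hours.mean()) / hours.std()

# SciPy's default uses the population std (ddof=0); ddof=1 matches pandas.
scipy_default = zscore(hours.to_numpy())
scipy_sample = zscore(hours.to_numpy(), ddof=1)

print(manual.mean(), manual.std())
print(np.allclose(manual.to_numpy(), scipy_sample))
```

On a column with 141712 rows the two conventions are nearly identical; the difference only matters for small samples.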

In [38]:
# Print the preprocessed data
print("\nPreprocessed Data:")
incident_data.head()


Preprocessed Data:


Unnamed: 0,number,incident_state,active,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,opened_at,...,notify,problem_id,rfc,vendor,caused_by,closed_code,resolved_by,resolved_at,closed_at,duration_hours
0,inc0000045,new,True,0,0,0,True,caller 2403,opened by 8,2016-02-29 01:16:00,...,do not notify,unknown information,unknown information,unknown information,unknown information,code 5,resolved by 149,2016-02-29 11:29:00,2016-05-03 12:00:00,-0.077114
1,inc0000045,resolved,True,0,0,2,True,caller 2403,opened by 8,2016-02-29 01:16:00,...,do not notify,unknown information,unknown information,unknown information,unknown information,code 5,resolved by 149,2016-02-29 11:29:00,2016-05-03 12:00:00,-0.077114
2,inc0000045,resolved,True,0,0,3,True,caller 2403,opened by 8,2016-02-29 01:16:00,...,do not notify,unknown information,unknown information,unknown information,unknown information,code 5,resolved by 149,2016-02-29 11:29:00,2016-05-03 12:00:00,-0.077114
3,inc0000045,closed,False,0,0,4,True,caller 2403,opened by 8,2016-02-29 01:16:00,...,do not notify,unknown information,unknown information,unknown information,unknown information,code 5,resolved by 149,2016-02-29 11:29:00,2016-05-03 12:00:00,-0.077114
4,inc0000047,new,True,0,0,0,True,caller 2403,opened by 397,2016-02-29 04:40:00,...,do not notify,unknown information,unknown information,unknown information,unknown information,code 5,resolved by 81,2016-03-01 09:52:00,2016-06-03 10:00:00,0.386277
