# Comparison of Different ML Algorithms to Predict CPU Burst Times of Processes

The goal of the project is to select the most significant attributes of a process and use them to predict its CPU burst time. We also analyze the relationship between the process attributes and the CPU burst time, and compare different approaches using the selected attributes.

# Step - 1 : Frame The Problem

In the course we have learnt several CPU scheduling algorithms used in modern operating systems. CPU scheduling decides which process should use the CPU next while another process is in the waiting state because a resource, such as I/O, is unavailable. While scheduling, each process gets to use the CPU for its slice; the slice that it gets is called the CPU burst. In simple terms, the CPU burst time is the duration for which a process has control of the CPU.

Of the known scheduling algorithms, Shortest-Job-First (SJF) and Shortest-Remaining-Time-First (SRTF) depend on knowing the length of the next CPU burst of each process in the ready queue. Various ways have been proposed to estimate this length. One traditional way is the exponential averaging (EA) method, which predicts the next CPU burst of a process as a weighted average of its most recent measured burst and its previous prediction: $\tau_{n+1} = \alpha t_n + (1 - \alpha)\tau_n$, with $0 \le \alpha \le 1$. However, such methods may not give accurate or reliable predictions. Also, in the recent past ML algorithms have been reported to be effective at predicting application resource consumption. Inspired by these past studies, we propose an ML approach that predicts CPU burst times from the attributes that contribute most to the burst time of a particular process.
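
As a point of reference for the ML models, the exponential averaging baseline can be sketched in a few lines. This is a minimal illustration, not code from the project: the function name, the smoothing factor `alpha=0.5`, and the initial guess of 10 time units are all assumptions chosen for the example.

```python
def exponential_average(bursts, alpha=0.5, initial_guess=10.0):
    """Return the running burst predictions tau_0 .. tau_n.

    Each new prediction blends the burst just observed with the
    previous prediction: tau[n+1] = alpha * t[n] + (1 - alpha) * tau[n].
    """
    predictions = [initial_guess]
    for observed in bursts:
        previous = predictions[-1]
        predictions.append(alpha * observed + (1 - alpha) * previous)
    return predictions


# Hypothetical burst history for one process (time units are arbitrary).
observed_bursts = [6, 4, 6, 4, 13, 13, 13]
print(exponential_average(observed_bursts))
```

With `alpha=1.0` the prediction collapses to "same as the last burst", and with `alpha=0.0` the history is ignored entirely, which is why the method can lag behind sudden changes in behaviour.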

The scheduling algorithms covered in the course can work more efficiently if the CPU burst times of processes are known in advance. Hence, we present different approaches for predicting the burst times and compare them with each other.
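
To see why knowing burst times helps, the sketch below compares the average waiting time of the same jobs run in arrival order and in shortest-first order. It assumes all jobs arrive at time 0 and run without preemption; the burst values and the helper name are made up for illustration.

```python
def average_waiting_time(burst_times):
    """Average time jobs spend waiting when run back to back in the given order."""
    total_wait = 0
    clock = 0
    for burst in burst_times:
        total_wait += clock   # this job waited until the CPU became free
        clock += burst        # the CPU is busy for the length of the burst
    return total_wait / len(burst_times)


bursts = [6, 8, 7, 3]                              # hypothetical burst times
fcfs_wait = average_waiting_time(bursts)           # arrival order
sjf_wait = average_waiting_time(sorted(bursts))    # shortest job first
print(fcfs_wait, sjf_wait)
```

Shortest-first ordering is only possible if the burst lengths are known or predicted, which is the gap the models in this notebook try to fill.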



# Step - 2 : Obtain the Data

## Import Libraries

In [0]:
!pwd

In [0]:
!pip install -q  missingno

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
%matplotlib inline

In [0]:
!ls -l

Pandas provides two important data types, Series and DataFrame, with built-in functions that give extensive capabilities for handling data.

Pandas provides functions to read data from various sources, such as read_csv, read_excel, and read_html. The data is read and stored as a DataFrame.
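
A small, self-contained illustration of these two types follows. The CSV text is inlined and its values are invented for the example; only `read_csv`, `DataFrame`, and `Series` come from pandas itself.

```python
import io

import pandas as pd

# A tiny inline CSV standing in for a real file (values are made up).
csv_text = "JobID,RunTime,NProcs\n1,120,4\n2,45,1\n3,300,8\n"

frame = pd.read_csv(io.StringIO(csv_text))   # read_csv returns a DataFrame
run_time = frame["RunTime"]                  # selecting one column gives a Series

print(type(frame).__name__, frame.shape)
print(type(run_time).__name__, run_time.tolist())
```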

In [0]:
!wget https://www.dropbox.com/s/ikyxo0zew514a0b/processes_datasets.csv

In [0]:
!ls -l


In [0]:
data = pd.read_csv('processes_datasets.csv')

In [0]:
data.head()

In [0]:
#to get the last 5 entries of the data
data.tail()

In [0]:
type(data)

In [0]:
data.shape

In [0]:
data.info()

In [0]:
data.sum()

In [0]:
data.info()

In [0]:
data.describe()

# Step - 3 : Analyse the Data

In [0]:
ms.matrix(data)

In [0]:
data.info()

We can observe that there are missing values in the following columns: 'JobStructure', 'JobStructureParams', 'UsedNetwork', 'UsedLocalDiskSpace', 'UsedResources', 'ReqPlatform', 'ReqNetwork', 'ReqLocalDiskSpace', 'ReqResources', 'VOID', and 'ProjectID'. Let's continue.



In [0]:
data.info()

# Step - 4 : Feature Engineering

## Feature Engineering

We want to fill the missing values of ReqMemory in the dataset with the average ReqMemory value for each of the classes. This is called data imputation.
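
One possible way to do per-class mean imputation with pandas is sketched below. Which column defines "the classes" is not stated here, so `QueueID` is used purely as a stand-in, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy data: two groups, each with one missing ReqMemory value.
toy = pd.DataFrame({
    "QueueID": ["q1", "q1", "q2", "q2", "q2"],   # stand-in grouping column
    "ReqMemory": [100.0, np.nan, 400.0, 600.0, np.nan],
})

# Mean of each group, aligned row by row with the original frame
# (missing values are skipped when the mean is computed).
group_mean = toy.groupby("QueueID")["ReqMemory"].transform("mean")

# Fill only the missing entries with their group's mean.
toy["ReqMemory"] = toy["ReqMemory"].fillna(group_mean)
print(toy)
```

If an entire group has no recorded value, its mean is also missing and those rows stay empty, so a global fallback may still be needed.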

In [0]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Used Memory',y='RunTime',data=data,palette='winter')

In [0]:
data.columns

In [0]:
data['UserID'].value_counts()

Let's re-check the missing-value matrix.

In [0]:
ms.matrix(data)

Let's drop the columns identified above as having missing values.

In [0]:
data.info()

In [0]:
data.drop(['JobStructure','JobStructureParams','UsedNetwork','UsedLocalDiskSpace','UsedResources','ReqPlatform','ReqNetwork',
          'ReqLocalDiskSpace','ReqResources','VOID','ProjectID'],axis=1).head()

In [0]:
data.info()

In [0]:
data.drop(['JobStructure','JobStructureParams','UsedNetwork','UsedLocalDiskSpace','UsedResources','ReqPlatform','ReqNetwork',
          'ReqLocalDiskSpace','ReqResources','VOID','ProjectID'],axis=1,inplace=True)

In [0]:
data.head()

In [0]:
data.info()

## Converting Categorical Features

We need to convert the categorical features to numbers using pandas; otherwise our machine learning algorithms won't be able to take those features as inputs directly. Here each ID column is mapped to integer codes with a dense rank rather than expanded into dummy variables.
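
The difference between integer codes and dummy variables is easy to see on a toy column. The sketch below uses invented `QueueID` values; `rank(method='dense')` and `get_dummies` are standard pandas calls.

```python
import pandas as pd

toy = pd.DataFrame({"QueueID": ["Q3", "Q1", "Q3", "Q2"]})   # made-up IDs

# Dense rank: every distinct ID gets one integer, with no gaps.
# ascending=False means the largest ID receives code 1.
toy["QueueNO"] = toy["QueueID"].rank(method="dense", ascending=False).astype(int)

# Dummy (one-hot) encoding: one indicator column per distinct ID.
dummies = pd.get_dummies(toy["QueueID"], prefix="Queue")

print(toy)
print(dummies)
```

Integer codes keep the frame narrow but imply an ordering between IDs that distance-based models such as KNN will take literally; dummy variables avoid that at the cost of one column per category.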

In [0]:
data['QueueNO'] = data['QueueID'].rank(method='dense', ascending=False).astype(int)
data['GroupNO'] = data['GroupID'].rank(method='dense', ascending=False).astype(int)
data['ExecutableNO'] = data['ExecutableID'].rank(method='dense', ascending=False).astype(int)
data['OrigSiteNO'] = data['OrigSiteID'].rank(method='dense', ascending=False).astype(int)
data['LastRunSiteNO'] = data['LastRunSiteID'].rank(method='dense', ascending=False).astype(int)
data.head()

In [0]:
data['USERNO'] = data['UserID'].rank(method='dense', ascending=False).astype(int)
data.head()

In [0]:
data['USERNO']

In [0]:
data['UserID']

In [0]:
data.drop(['UserID','QueueID','GroupID','ExecutableID','OrigSiteID','LastRunSiteID'],axis=1,inplace=True)

In [0]:
data.head()

In [0]:
data.describe()

In [0]:
data.info()

In [0]:
ms.matrix(data)

In [0]:
data.info()

In [0]:
data.loc[data['ReqMemory']== -1].index

In [0]:
data.loc[data['ReqTime: ']== -1].index
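
The two lookups above search for the value -1. Assuming -1 is how this dataset marks a value that was not recorded (an assumption, not something confirmed in this notebook), the sketch below counts such entries per column and converts them to NaN so that tools like missingno can see them. The toy values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ReqMemory": [512, -1, 1024, -1],
    "ReqTime: ": [3600, 7200, -1, 600],   # column name kept as it appears in the data
})

# How many -1 entries does each column contain?
sentinel_counts = (toy == -1).sum()

# Replace the assumed sentinel with NaN so it is treated as missing.
cleaned = toy.replace(-1, np.nan)

print(sentinel_counts)
print(cleaned.isna().sum())
```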

# Step - 5 : Model Selection

## Building a Logistic Regression Model



In [0]:
from sklearn.linear_model import LogisticRegression

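# Build the model.
# Note: X_train and y_train are created by the train_test_split cell in the KNN section; run that cell first.
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train) # this is where training happens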

In [0]:
logmodel.coef_

In [0]:
logmodel.intercept_

In [0]:
logmodel.verbose

In [0]:
predict =  logmodel.predict(X_test)
predict[:5]

In [0]:
y_test[:5]

In [0]:
sum(predict-y_test)

## Building a KNN model

In [0]:
X = data.drop('RunTime ',axis=1)
y = data['RunTime ']

In [0]:
data.columns

In [0]:
y.head()

In [0]:
X.head()

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X1, X2, y1, y2 = train_test_split(X,y 
                                                    , test_size=0.98,random_state=32)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y 
                                                    , test_size=0.30,random_state=32)

In [0]:
len(X_train)

In [0]:
121253/400000

In [0]:
X.head()

In [0]:
X_train.head()

In [0]:
y_train.head()

In [0]:
y_test.head()

In [0]:
y_pred[:5]

In [0]:
from sklearn.neighbors import KNeighborsClassifier

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=4)

#Train the model using the training sets
knn.fit(X_train, y_train)

knn.score(X_test, y_test)

#Predict the response for test dataset
y_pred = knn.predict(X_test)

In [0]:
knn.score(X_test, y_test)

In [0]:
knn.score(X_train, y_train)

In [0]:
y_pred

In [0]:
y_pred[:20]

In [0]:
y_test[:20]

In [0]:
sum(y_pred-y_test)

Let's build two more models before moving on to evaluation.

## Building a Linear Regression Model

In [0]:
from sklearn.linear_model import LinearRegression

In [0]:
reg = LinearRegression().fit(X_train, y_train)

In [0]:
y_pred_lin_reg = reg.predict(X_test)

## SVC Algorithm

Prediction

In [0]:
X = data.drop('RunTime ',axis=1)
y = data['RunTime ']

In [0]:
X.head()

In [0]:
y.head()

In [0]:
from sklearn import preprocessing

In [0]:
min_max_scaler = preprocessing.MinMaxScaler()

In [0]:
data.head()


In [0]:
X_minmax = min_max_scaler.fit_transform(X)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_minmax,y 
                                                    , test_size=0.30,random_state=50)

In [0]:
from sklearn.svm import SVC

In [0]:
clf = SVC(kernel='linear')

In [0]:
clf.fit(X_train,y_train)

In [0]:
y_pred = clf.predict(X_test)

# Step - 6 : Evaluation

## Evaluation

We can check precision, recall, and F1-score using the classification report.
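
For reference, the three metrics can be computed by hand from the counts of true positives, false positives, and false negatives for one class. The helper name and the counts below are invented for illustration, and the sketch assumes none of the denominators is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for a single class from raw counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 4))
```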

#### Accuracy and Correlation

In [0]:
from sklearn.metrics import  accuracy_score

In [0]:
print(accuracy_score(y_test,y_pred))

In [0]:
X.corrwith(y)

In [0]:
from sklearn.metrics import  accuracy_score, r2_score

In [0]:
np.corrcoef(y_pred,y_test)

In [0]:
r2_score(y_test, y_pred)  # the signature is r2_score(y_true, y_pred)
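
The coefficient of determination is not symmetric in its arguments: the variance in the denominator is taken over the true values, so the true targets must be passed first. The hand-written version below mirrors the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$; the function name and the numbers are illustrative only.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured on the true values."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]

correct_order = r_squared(actual, predicted)
swapped_order = r_squared(predicted, actual)
print(correct_order, swapped_order)
```

The two calls give different scores on the same data, so swapping the arguments silently changes the reported result.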

In [0]:
from sklearn.metrics import confusion_matrix, classification_report

## Confusion Matrix

| | Predicted negative | Predicted positive |
|---|---|---|
| **Actual negative** | True negative | False positive |
| **Actual positive** | False negative | True positive |
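
The layout of a binary confusion matrix (rows are actual labels, columns are predicted labels) can be reproduced by hand. The sketch below assumes labels are encoded as 0 for negative and 1 for positive; the label lists are made up.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels encoded as 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row index = actual label, column index = predicted label.
        matrix[actual][predicted] += 1
    return matrix


actual_labels = [0, 0, 1, 1, 1, 0]
predicted_labels = [0, 1, 1, 0, 1, 0]
matrix = binary_confusion_matrix(actual_labels, predicted_labels)
print(matrix)
```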

In [0]:
print(confusion_matrix(y_test, predict))

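In [0]:
from sklearn.metrics import precision_score

In [0]:
# Assuming RunTime takes more than two distinct values, the default average='binary'
# would raise an error; 'weighted' is one reasonable choice.
print(precision_score(y_test,predict,average='weighted'))

In [0]:
from sklearn.metrics import f1_score

In [0]:
print(f1_score(y_test,predict,average='weighted'))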

To get all the above metrics at one go, use the following function:

In [0]:
from sklearn.metrics import classification_report

In [0]:
print(classification_report(y_test,predict))

---
                                                     THE END