<a href="https://colab.research.google.com/github/Khuks/Prediction-of-network-traffic-using-a-machine-learning-model/blob/main/Prediction_of_network_traffic_using_a_machine_learning_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using Machine Learning Models To Predict Kwaluseni Campus Network Traffic

### 1. Data Preparation/ Data Preprocessing 
<br>
<br>

#### a) Import Libraries

In [None]:
import pandas as pd # For data manipulation and data wrangling
import matplotlib.pyplot as plt
import seaborn as sns
import io
from google.colab import files
import math
import numpy as np

#### b) Load the dataset

In [None]:
uploaded=files.upload()
network_data=pd.read_excel(io.BytesIO(uploaded['KWS_Network_data.xlsx']))
print(network_data)

#### c) Print the first five rows of the dataset

In [None]:
network_data.head()

#### d) Check the information of the dataset

In [None]:
network_data.info()

#### e) Check Descriptive Statistics 

In [None]:
network_data.describe()

#### f) Check the shape of the dataset

In [None]:
network_data.shape

#### g) Check for Null Values

In [None]:
network_data.isnull().sum()

#### h) Drop Null Values

In [None]:
network_data=network_data.dropna()

In [None]:
# Check the total number of null values
network_data.isnull().sum()

There are no null values within the dataset.

#### i) Check For Duplicates

In [None]:
network_data.duplicated().sum()

There are no duplicates within the rows of the dataset.

### 2. Exploratory Data Analysis and Data Visualization

##### a) Plot the amount of data being transmitted against traffic code

In [None]:
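# Total Value(bps) per traffic code
df_groups=network_data.groupby('Traffic Code')['Value(bps)'].sum()
# Pass figsize to the plot itself; calling plt.figure() afterwards only opens a new, empty figure
df_groups.plot(kind='bar',figsize=(15, 10))

plt.xlabel("Traffic Code")
plt.ylabel("Value (bps) e+09")
plt.title("Amount Of Data Being Transmitted Against Traffic Code",pad=40)
plt.show()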

###### b) Plot the Distribution Of The Dataset

In [None]:
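# Two panels side by side: distribution on the left, boxplot on the right
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# sns.distplot is deprecated; histplot with kde=True draws the histogram plus a density curve
sns.histplot(network_data["Value(bps)"], kde=True, ax=axes[0])
axes[0].set_title("Normal Distribution Curve For Value(bps)")

network_data["Value(bps)"].plot.box(ax=axes[1])
axes[1].set_title("Determine Outliers For Value (bps)")
axes[1].set_ylabel("Value(bps)")
plt.show()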

It can be inferred that most of the data in the distribution of Value (bps) is concentrated towards the left, which means that it is right-skewed rather than normally distributed. We will try to make it closer to normal in later sections, as many algorithms work better when the data is approximately normally distributed.
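The skew described above can be quantified, and a log transform is one common way to reduce it. The following is a minimal sketch on synthetic data, not the campus dataset; the choice of `np.log1p` is an assumption, since the notebook does not say which transform it applies later.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for "Value(bps)" (illustrative only)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=5000), name="Value(bps)")

# Positive skewness means a long right tail, i.e. most values sit on the left
skew_before = values.skew()

# log1p compresses large values and usually pulls the shape closer to normal
log_values = np.log1p(values)
skew_after = log_values.skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness near zero after the transform suggests the transformed column is closer to symmetric; predictions made on the log scale would need `np.expm1` to return to bps.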

The boxplot confirms the presence of a lot of outliers/extreme values.
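The points a boxplot draws as outliers are usually those beyond 1.5 times the interquartile range from the quartiles. A small sketch of that rule follows, using made-up readings; the helper name `iqr_outlier_mask` is hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the boxplot whiskers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Toy data: nine ordinary readings and one extreme value (illustrative only)
sample = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 500], name="Value(bps)")
mask = iqr_outlier_mask(sample)
print(sample[mask])
```

On the real data, `iqr_outlier_mask(network_data["Value(bps)"]).sum()` would give a count of the extreme values that the boxplot shows.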

###### d) Plot the Timestamp against Traffic Code

In [None]:
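# Sum of the timestamp column per traffic code (the column name has a leading space)
df_groups=network_data.groupby('Traffic Code')[' Timestamp'].sum()
# Pass figsize to the plot itself; calling plt.figure() afterwards only opens a new, empty figure
df_groups.plot(kind='bar',figsize=(15, 10))
plt.xlabel("Traffic Code",fontsize = 10)
plt.ylabel("Timestamp e+09",fontsize = 10)
plt.title("The Timestamp Against Traffic Code",pad=40,fontsize = 15)
plt.show()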

###### e) Plot the Distribution Of The Timestamp

In [None]:
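# Two panels side by side: distribution on the left, boxplot on the right
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
# sns.distplot is deprecated; histplot with kde=True draws the histogram plus a density curve
sns.histplot(network_data[" Timestamp"], kde=True, ax=axes[0])
axes[0].set_title("Normal Distribution Curve For Timestamp")

network_data[" Timestamp"].plot.box(ax=axes[1])
axes[1].set_title("Determine Outliers For Timestamp")
axes[1].set_ylabel("Timestamp")
plt.show()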

It can be inferred that most of the data in the distribution of Value (bps) is concentrated towards the left, which means that it is right-skewed rather than normally distributed. We will try to make it closer to normal in later sections, as many algorithms work better when the data is approximately normally distributed.

The boxplot confirms the presence of a lot of outliers/extreme values.

### 3. Feature Selection

##### a) Plot the correlation heatmap to Determine the relationship between variables

In [None]:
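# Restrict the correlation to numeric columns; recent pandas versions raise an error on text columns
heatmap_data=network_data.corr(numeric_only=True)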
f,ax=plt.subplots(figsize=(10,12))
sns.heatmap(heatmap_data,vmax=.8,square=True,cmap="BuPu",annot=True);
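For feature selection, the heatmap column that matters most is the one for the target. The sketch below ranks features by the absolute value of their correlation with the target. It runs on synthetic data: `strong_feature` and `noise_feature` are hypothetical column names, and only `Value(bps)` mirrors a name from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
strong = rng.normal(size=n)
# Synthetic frame: one feature drives the target, the other is pure noise
toy = pd.DataFrame({
    "strong_feature": strong,
    "noise_feature": rng.normal(size=n),
    "Value(bps)": 3.0 * strong + rng.normal(scale=0.5, size=n),
})

# Correlation of every numeric column with the target, ranked by absolute strength
target_corr = (
    toy.corr(numeric_only=True)["Value(bps)"]
    .drop("Value(bps)")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value here does not rule a feature out for tree-based models.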

#### b) Pair Plot

In [None]:
sns.pairplot(network_data)
plt.tight_layout()

## 4. Building the Model

#### a) Splitting data into training and testing 

In [None]:
#training =network_data.iloc[:math.ceil(len(network_data)*0.8)].copy()
#testing=network_data.iloc[math.ceil(len(network_data)*0.8):].copy()'
from sklearn import metrics
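The commented-out lines in the cell above point to a chronological 80/20 split, which is a common choice for time-ordered traffic data because the test rows then come strictly after the training rows. A minimal sketch of that idea follows. It assumes the rows can be ordered by a time column; the helper name and the toy values are hypothetical.

```python
import math
import pandas as pd

def chronological_split(df: pd.DataFrame, time_col: str, train_fraction: float = 0.8):
    """Sort by time and cut once, so every test row is later than every training row."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cut = math.ceil(len(ordered) * train_fraction)
    return ordered.iloc[:cut].copy(), ordered.iloc[cut:].copy()

# Toy frame with made-up values; the column names are placeholders
toy = pd.DataFrame({"Timestamp": [5, 1, 4, 2, 3, 7, 6, 9, 8, 10],
                    "Value(bps)": range(10)})
training, testing = chronological_split(toy, "Timestamp")
print(len(training), len(testing))
```

Unlike a random split, this avoids training on observations that occur after the ones being predicted.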

With the data prepared, we can continue the model building process. The target, Value(bps), is continuous, so this is a regression task: we will start with a Random Forest regressor and then move on to other models.

We will build the following models in this section.

i) Random Forest

ii) SVM

iii) Long Short-Term Memory (LSTM)

Let’s prepare the data for feeding into the models.

We will use scikit-learn (sklearn), an open source library for Python, to build the models. It is an efficient tool with many built-in functions for modeling.

Sklearn requires the target variable in a separate dataset. So, we will drop our target variable from the train dataset and save it in another dataset.

Drop the target variable "Value(bps)":

In [None]:
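# Name the columns explicitly; a positional axis argument is no longer accepted by recent pandas
X=network_data.drop(columns="Value(bps)")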

In [None]:
X.head(2)

Save the target variable "Value(bps)" in another dataset:

In [None]:
y=network_data[["Value(bps)"]]

In [None]:
y.head(2)

Now we will make dummy variables for the categorical variables. A dummy variable turns a categorical variable into a series of 0s and 1s, making it a lot easier to quantify and compare.

Let us understand the process of dummies first:
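As a small illustration of what `pd.get_dummies` does, the sketch below encodes a toy frame. "Traffic Code" is a column name from the notebook, but the category values here are invented, and treating that column as categorical is an assumption.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (values are invented)
toy = pd.DataFrame({
    "Traffic Code": ["A", "B", "A", "C"],
    "Timestamp": [1, 2, 3, 4],
})

# Each category becomes its own indicator column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Every distinct category value adds a column, so a feature with many distinct values can make `X` very wide.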

In [None]:
X = pd.get_dummies(X)

In [None]:
X.head(3)

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
# accuracy_score and f1_score are classification metrics and are not used for this regression task

# Random Forest Regression

#### a) Splitting the dataset into training and testing 

In [None]:
x_train,x_cv,y_train,y_cv=train_test_split(X,y,test_size=0.2,random_state=1)

#### b) Building the model

In [None]:
forest_model = RandomForestRegressor(random_state=1,max_depth=10,n_estimators=50)

#### c) Model Training 

In [None]:
forest_model.fit(x_train,y_train.values.ravel())

## Model Evaluation

#### a) Model Evaluation on Training Dataset

In [None]:
random_forest_train_pred=forest_model.predict(x_train)
random_forest_train_pred

In [None]:
# Wrap the predictions in a single-column DataFrame with a descriptive name
random_forest_train_pred = pd.DataFrame({"Y_Train_Prediction": random_forest_train_pred})

In [None]:
# Reset the index so the actual values line up row by row with the predictions
y_train= y_train.reset_index(drop=True)
y_train

In [None]:
random_forest_train_pred

In [None]:
# Model Evaluation on Training Dataset
print('R^2:',metrics.r2_score(y_train,random_forest_train_pred))
# Adjusted R^2 uses the number of predictors, i.e. the feature columns in x_train
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, random_forest_train_pred))*(len(y_train)-1)/(len(y_train)-x_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train,random_forest_train_pred))
print('MSE:',metrics.mean_squared_error(y_train,random_forest_train_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train,random_forest_train_pred)))
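The five metrics printed above can be written out directly, which makes the role of the number of predictors in adjusted R^2 explicit. This is a sketch with NumPy only; the function name `regression_report` and the toy numbers are hypothetical and are not results from the notebook.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, MAE, MSE and RMSE from plain arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    n = len(y_true)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises R^2 by the number of predictors p = n_features
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    mse = np.mean(residuals ** 2)
    return {"R^2": r2, "Adjusted R^2": adj_r2,
            "MAE": np.mean(np.abs(residuals)), "MSE": mse, "RMSE": np.sqrt(mse)}

# Made-up actual and predicted values, with two assumed features
report = regression_report([3.0, 5.0, 7.0, 9.0, 11.0],
                           [2.5, 5.0, 7.5, 9.0, 11.0], n_features=2)
print(report)
```

With the notebook's variables, the call would look like `regression_report(y_train, random_forest_train_pred, x_train.shape[1])`.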

In [None]:
# Visualizing the differences between actual and predicted values
plt.scatter(y_train,random_forest_train_pred)
plt.xlabel("Value (bps)")
plt.ylabel("Predicted Value(bps)")
plt.title("Training Value(bps) vs Predicted Train Value(bps)-Random Forest")
plt.show()

#### b) Model Evaluation on Testing Dataset

In [None]:
random_forest_test_pred=forest_model.predict(x_cv)
random_forest_test_pred

In [None]:
# Wrap the predictions in a single-column DataFrame with a descriptive name
random_forest_test_pred = pd.DataFrame({"Y_Test_Prediction": random_forest_test_pred})

In [None]:
# Reset the index so the actual values line up row by row with the predictions
y_cv= y_cv.reset_index(drop=True)
y_cv

In [None]:
random_forest_test_pred

In [None]:
# Model Evaluation on Testing Dataset
print('R^2:',metrics.r2_score(y_cv,random_forest_test_pred))
# Adjusted R^2 uses the number of predictors, i.e. the feature columns in x_cv
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_cv, random_forest_test_pred))*(len(y_cv)-1)/(len(y_cv)-x_cv.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_cv,random_forest_test_pred))
print('MSE:',metrics.mean_squared_error(y_cv,random_forest_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_cv,random_forest_test_pred)))

In [None]:
# Visualizing the differences between actual and predicted values
plt.scatter(y_cv,random_forest_test_pred)
plt.xlabel("Value (bps)")
plt.ylabel("Test Predicted Value(bps)")
plt.title("Testing Value(bps) vs Predicted Test Value(bps)-Random Forest")
plt.show()

In [None]:
random_forest_test_pred.shape

In [None]:
y_cv.shape

# XGBoost Regressor

In [None]:
import xgboost as xgb
# Import XGBoost Regressor
from xgboost import XGBRegressor

In [None]:
#Create a XGBoost Regressor
reg = XGBRegressor()



#### a) Model Evaluation on Training Dataset (XGBoost)

In [None]:
# Train the model using the training sets 
reg.fit(x_train,y_train.values.ravel())

In [None]:
xgb_pred_train = reg.predict(x_train)

In [None]:
xgb_pred_train

In [None]:
# Wrap the training predictions in a single-column DataFrame with a descriptive name
xgb_pred_train = pd.DataFrame({"Y_Train_Prediction": xgb_pred_train})
xgb_pred_train

In [None]:
# Model Evaluation on Training Dataset
print('R^2:',metrics.r2_score(y_train,xgb_pred_train))
# Adjusted R^2 uses the number of predictors, i.e. the feature columns in x_train
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train,xgb_pred_train))*(len(y_train)-1)/(len(y_train)-x_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train,xgb_pred_train))
print('MSE:',metrics.mean_squared_error(y_train,xgb_pred_train))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train,xgb_pred_train)))

In [None]:
# Visualizing the differences between actual and predicted values
plt.scatter(y_train,xgb_pred_train)
plt.xlabel("Value (bps)")
plt.ylabel("Train Predicted Value(bps)")
plt.title("Train Value(bps) vs Predicted Train Value(bps)- XGBoost")
plt.show()

#### b) Model Evaluation on Testing Dataset - XGBoost

In [None]:
xgb_pred_test = reg.predict(x_cv)

In [None]:
xgb_pred_test

In [None]:
# Wrap the test predictions in a single-column DataFrame with a descriptive name
xgb_pred_test = pd.DataFrame({"Y_Test_Prediction": xgb_pred_test})
xgb_pred_test

In [None]:
# Model Evaluation on Testing Dataset
print('R^2:',metrics.r2_score(y_cv,xgb_pred_test))
# Adjusted R^2 uses the number of predictors, i.e. the feature columns in x_cv
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_cv,xgb_pred_test))*(len(y_cv)-1)/(len(y_cv)-x_cv.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_cv,xgb_pred_test))
print('MSE:',metrics.mean_squared_error(y_cv,xgb_pred_test))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_cv,xgb_pred_test)))

In [None]:
# Visualizing the differences between actual and predicted values
plt.scatter(y_cv,xgb_pred_test)
plt.xlabel("Value (bps)")
plt.ylabel("Test Predicted Value(bps)")
plt.title("Test Value(bps) vs Predicted Test Value(bps)- XGBoost")
plt.show()
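Since both models are scored on the same held-out set, their test metrics can be placed side by side. The sketch below shows one possible way to do that; `compare_models` is a hypothetical helper, and the numbers are invented solely to demonstrate the call.

```python
import numpy as np
import pandas as pd

def compare_models(y_true, predictions):
    """Build one row of test metrics per model from a {name: predictions} mapping."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rows = {}
    for name, y_pred in predictions.items():
        err = y_true - np.asarray(y_pred, dtype=float).ravel()
        rows[name] = {
            "R^2": 1.0 - np.sum(err ** 2) / ss_tot,
            "MAE": np.mean(np.abs(err)),
            "RMSE": np.sqrt(np.mean(err ** 2)),
        }
    # Lower RMSE first, so the best-fitting model is the top row
    return pd.DataFrame.from_dict(rows, orient="index").sort_values("RMSE")

# Made-up numbers purely to show the call; they are not results from the notebook
actual = [10.0, 20.0, 30.0, 40.0]
summary = compare_models(actual, {
    "model_a": [11.0, 19.0, 31.0, 39.0],
    "model_b": [10.0, 20.0, 30.0, 44.0],
})
print(summary)
```

In the notebook this could be called as `compare_models(y_cv, {"Random Forest": random_forest_test_pred, "XGBoost": xgb_pred_test})`.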