
## Supervised Linear Regression Checkpoint
In this checkpoint, you are going to work on the '5G-Energy consumption' dataset provided by the International Telecommunication Union (ITU) in 2023 as part of a global competition that challenged data scientists around the world to model 5G energy consumption using machine learning techniques.

The competition took place from 2023-07-05 to 2023-09-30. For more information, click here.

Checkpoint problem statement: Network operational expenditure (OPEX) already accounts for around 25 percent of a telecom operator's total cost, and 90 percent of it is spent on large energy bills. More than 70 percent of this energy is estimated to be consumed by the radio access network (RAN), particularly by the base stations (BSs). The objective is therefore to build and train an ML model that estimates the energy consumed by different 5G base stations, taking into account the impact of various engineering configurations, traffic conditions, and energy-saving methods.

Dataset description: This dataset is derived from the original competition data and simplified for learning purposes. It includes cell-level traffic statistics of 4G/5G sites collected on different days.

➡️ Dataset link

https://i.imgur.com/Agu9zeP.jpg



## Step 1: Importing Required Libraries

In [2]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Loading the dataset

In [3]:
# load the dataset
data = pd.read_csv("/content/5G_energy_consumption_dataset.csv")
data.head()

|   | Time | BS | Energy | load | ESMODE | TXpower |
|---|---|---|---|---|---|---|
| 0 | 20230101 010000 | B_0 | 64.275037 | 0.487936 | 0.0 | 7.101719 |
| 1 | 20230101 020000 | B_0 | 55.904335 | 0.344468 | 0.0 | 7.101719 |
| 2 | 20230101 030000 | B_0 | 57.698057 | 0.193766 | 0.0 | 7.101719 |
| 3 | 20230101 040000 | B_0 | 55.156951 | 0.222383 | 0.0 | 7.101719 |
| 4 | 20230101 050000 | B_0 | 56.053812 | 0.175436 | 0.0 | 7.101719 |


In [4]:
data.describe()

|   | Energy | load | ESMODE | TXpower |
|---|---|---|---|---|
| count | 92629.0 | 92629.0 | 92629.0 | 92629.0 |
| mean | 28.138997 | 0.244705 | 0.081361 | 6.765427 |
| std | 13.934645 | 0.234677 | 0.382317 | 0.309929 |
| min | 0.747384 | 0.0 | 0.0 | 5.381166 |
| 25% | 18.236173 | 0.05737 | 0.0 | 6.427504 |
| 50% | 24.06577 | 0.16555 | 0.0 | 6.875934 |
| 75% | 35.724963 | 0.363766 | 0.0 | 6.875934 |
| max | 100.0 | 0.993957 | 4.0 | 8.375336 |


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92629 entries, 0 to 92628
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Time     92629 non-null  object 
 1   BS       92629 non-null  object 
 2   Energy   92629 non-null  float64
 3   load     92629 non-null  float64
 4   ESMODE   92629 non-null  float64
 5   TXpower  92629 non-null  float64
dtypes: float64(4), object(2)
memory usage: 4.2+ MB
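
The `info()` output lists `Time` as `object`, so pandas is treating the timestamps as plain strings. The sketch below shows one way to parse them, assuming the format is `YYYYMMDD HHMMSS` (inferred from the sample rows, not confirmed by the dataset documentation). The `sample` frame and the derived column names `hour` and `dayofweek` are illustrative and are not part of the notebook.

```python
import pandas as pd

# Two rows copied from the head() output; the same calls would apply
# to the full `data` frame in the notebook.
sample = pd.DataFrame({
    "Time": ["20230101 010000", "20230101 020000"],
    "Energy": [64.275037, 55.904335],
})

# Assumed format: YYYYMMDD HHMMSS (inferred from the sample values).
sample["Time"] = pd.to_datetime(sample["Time"], format="%Y%m%d %H%M%S")

# Hypothetical derived features; the names are illustrative only.
sample["hour"] = sample["Time"].dt.hour
sample["dayofweek"] = sample["Time"].dt.dayofweek  # Monday=0 ... Sunday=6

print(sample.dtypes)
```

Once parsed, features such as the hour of day could be passed to the model instead of dropping `Time` entirely.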


In [33]:
# checking for distinct values in the BS column
distinct_BS = data['BS'].unique()
print(distinct_BS)

['B_0' 'B_1' 'B_2' 'B_3' 'B_4' 'B_5' 'B_6' 'B_7' 'B_8' 'B_9' 'B_10' 'B_11'
 'B_12' 'B_14' 'B_15' 'B_16' 'B_17' 'B_18' 'B_19' 'B_20' 'B_21' 'B_22'
 'B_23' 'B_24' 'B_25' 'B_26' 'B_27' 'B_28' 'B_29' 'B_30' 'B_31' 'B_32'
 'B_33' 'B_34' 'B_35' 'B_36' 'B_37' 'B_38' 'B_39' 'B_40' 'B_41' 'B_42'
 'B_43' 'B_44' 'B_45' 'B_46' 'B_47' 'B_48' 'B_49' 'B_50' 'B_51' 'B_52'
 'B_53' 'B_54' 'B_55' 'B_56' 'B_57' 'B_58' 'B_59' 'B_60' 'B_61' 'B_62'
 'B_63' 'B_64' 'B_65' 'B_66' 'B_67' 'B_68' 'B_69' 'B_70' 'B_71' 'B_72'
 'B_73' 'B_74' 'B_75' 'B_76' 'B_77' 'B_78' 'B_79' 'B_80' 'B_81' 'B_82'
 'B_83' 'B_84' 'B_85' 'B_86' 'B_87' 'B_88' 'B_89' 'B_90' 'B_91' 'B_92'
 'B_93' 'B_94' 'B_95' 'B_96' 'B_97' 'B_98' 'B_99' 'B_100' 'B_101' 'B_102'
 'B_103' 'B_104' 'B_105' 'B_106' 'B_107' 'B_108' 'B_109' 'B_110' 'B_111'
 'B_112' 'B_113' 'B_114' 'B_115' 'B_116' 'B_117' 'B_118' 'B_119' 'B_120'
 'B_121' 'B_122' 'B_123' 'B_124' 'B_125' 'B_126' 'B_127' 'B_128' 'B_129'
 'B_130' 'B_131' 'B_132' 'B_133' 'B_134' 'B_135' 'B_136' 'B_137'

In [5]:
#data profiling to have an overview of the data set
!pip install ydata-profiling


In [6]:
# generate a ydata-profiling report and save it as an HTML file
from ydata_profiling import ProfileReport
report = ProfileReport(data)
report.to_file("report.html")




In [13]:
# checking for duplicate rows
duplicates = data.duplicated()
print(data[duplicates])


Empty DataFrame
Columns: [Time, BS, Energy, load, ESMODE, TXpower]
Index: []


In [35]:
# encoding categorical feature 'BS' using label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['BS'] = label_encoder.fit_transform(data['BS'])
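
`LabelEncoder` assigns integer codes in sorted label order, so a regression model reads the base station IDs as an ordered numeric quantity. One possible alternative, shown here only as a hedged sketch and not as the notebook's method, is one-hot encoding via `OneHotEncoder` inside a `ColumnTransformer`. The `toy` frame below is made-up illustration data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the real dataset (values are illustrative).
toy = pd.DataFrame({
    "BS": ["B_0", "B_1", "B_10", "B_0"],
    "load": [0.49, 0.34, 0.05, 0.19],
})

# One-hot encode 'BS' and pass the remaining numeric columns through unchanged.
encoder = ColumnTransformer(
    transformers=[("bs", OneHotEncoder(handle_unknown="ignore"), ["BS"])],
    remainder="passthrough",
)
encoded = encoder.fit_transform(toy)

# Three distinct stations plus the 'load' column give four output columns.
print(encoded.shape)
```

With many distinct base stations this adds one column per station, which matters if polynomial features are generated afterwards.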


In [38]:
data.tail(15)

|   | Time | BS | Energy | load | ESMODE | TXpower |
|---|---|---|---|---|---|---|
| 92614 | 20230102 030000 | 10 | 13.153961 | 0.046981 | 0.0 | 7.325859 |
| 92615 | 20230102 070000 | 10 | 13.303438 | 0.048942 | 0.0 | 7.325859 |
| 92616 | 20230102 080000 | 10 | 13.452915 | 0.053596 | 0.0 | 7.325859 |
| 92617 | 20230102 100000 | 10 | 14.947683 | 0.099885 | 0.0 | 7.325859 |
| 92618 | 20230102 110000 | 10 | 13.901345 | 0.064846 | 0.0 | 7.325859 |
| 92619 | 20230102 120000 | 10 | 14.050822 | 0.063731 | 0.0 | 7.325859 |
| 92620 | 20230102 130000 | 10 | 14.798206 | 0.088788 | 0.0 | 7.325859 |
| 92621 | 20230102 140000 | 10 | 14.798206 | 0.0855 | 0.0 | 7.325859 |
| 92622 | 20230102 150000 | 10 | 14.349776 | 0.059885 | 0.0 | 7.325859 |
| 92623 | 20230102 160000 | 10 | 15.09716 | 0.088692 | 0.0 | 7.325859 |


## Selecting target variable and the features

In [39]:
# features (X): every column except the target 'Energy' and the raw 'Time' string
X = data.drop(['Energy', 'Time'], axis=1)
# target variable (y)
y = data['Energy']

## Splitting the dataset into training and test sets

In [40]:
# Splitting the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [41]:
print(X_train.dtypes)

BS           int64
load       float64
ESMODE     float64
TXpower    float64
dtype: object


## Selecting an ML regression algorithm and training it on the training set

In [57]:
# select an ML regression algorithm and train it on the training set
# (polynomial regression: degree-3 polynomial features fed into a linear model)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)

# model = LinearRegression()
# model.fit(X_train, y_train)
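
To get a sense of the model's size, it helps to count the terms that `PolynomialFeatures(degree=3)` generates from the four inputs (`BS`, `load`, `ESMODE`, `TXpower`). With the default `include_bias=True`, the number of output columns is the binomial coefficient C(n + d, d). The following is a small self-contained check on dummy data, not part of the original notebook.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 4  # BS, load, ESMODE, TXpower
degree = 3

# Fit on a dummy row; only the number of input columns matters here.
poly = PolynomialFeatures(degree=degree)
poly.fit(np.zeros((1, n_inputs)))

# Closed-form count of monomials up to the given degree, bias term included.
expected = comb(n_inputs + degree, degree)

print(poly.n_output_features_, expected)
```

Both numbers agree, so the linear regression in the pipeline is fitted on 35 columns rather than 4.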

## Assessing the model performance on the test set using relevant evaluation metrics

In [55]:
# import the MAE metric (MSE and R-squared were imported in Step 1)
from sklearn.metrics import mean_absolute_error

In [56]:
#Assess your model performance on the test set using relevant evaluation metric
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Absolute Error: 6.293546992018813
Mean Squared Error: 73.32500294706345
R-squared: 0.6136849864858178
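
MSE is expressed in squared units, so taking its square root (RMSE) gives an error in the same units as `Energy`. The sketch below derives RMSE from the reported MSE and also recovers the test-set standard deviation implied by R-squared, using the identity R² = 1 − MSE / Var(y_test). The two input values are copied from the output above; the variable names are illustrative.

```python
import math

# Values copied from the printed metrics.
mse = 73.32500294706345
r2 = 0.6136849864858178

# RMSE is in the same units as the target.
rmse = math.sqrt(mse)

# R^2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R^2)
implied_test_variance = mse / (1 - r2)
implied_test_std = math.sqrt(implied_test_variance)

print(round(rmse, 2), round(implied_test_std, 2))
```

The RMSE comes out at roughly 8.6, compared with an implied test-set standard deviation a little under 14, which is in line with the `std` of `Energy` reported by `describe()`.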


## Ways to Improve the Model

In [None]:
# Improving the model
# An R-squared of about 0.61 means the model explains only about 61% of the variance
# in energy consumption on the test set, so there is clear room for improvement.
# To improve accuracy, I will consider better feature selection and engineering,
# and regularization (Lasso/Ridge).
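
A minimal sketch of how Ridge and Lasso regularization could be compared, assuming the same degree-3 polynomial pipeline with an added scaling step. It runs on synthetic data (`X_demo`, `y_demo`) so that it is self-contained. The `alpha` values are arbitrary placeholders rather than tuned settings, and the scores say nothing about the 5G dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (NOT the 5G dataset), used only to make the sketch runnable.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 20 + 30 * X_demo[:, 0] + 5 * X_demo[:, 1] ** 2 + rng.normal(0, 1, 200)

# Placeholder regularization strengths; in practice these would be tuned.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1, max_iter=10_000),
}

scores = {}
for name, estimator in candidates.items():
    # Scaling matters for regularized models because the penalty depends on feature scale.
    pipeline = make_pipeline(
        PolynomialFeatures(degree=3, include_bias=False),
        StandardScaler(),
        estimator,
    )
    # scikit-learn returns negated MAE so that higher is better; flip the sign back.
    cv_mae = -cross_val_score(
        pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    scores[name] = cv_mae.mean()
    print(name, round(scores[name], 3))
```

In the notebook, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the cross-validated MAE could be compared against the unregularized pipeline.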