<a href="https://colab.research.google.com/github/Mychoyce/Gomycode-Checkpoints/blob/main/Supervised_Learning_Regression_Checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What You're Aiming For

In this checkpoint, you will work on the '5G-Energy consumption' dataset, provided by the International Telecommunication Union (ITU) in 2023 as part of a global competition that challenged data scientists to model 5G energy consumption using machine learning techniques.

The competition is taking place from 2023-07-05 to 2023-09-30. For more information, click here.

Checkpoint problem statement: Network operational expenditure (OPEX) already accounts for around 25 percent of a telecom operator’s total cost, and 90 percent of it is spent on large energy bills. More than 70 percent of this energy is estimated to be consumed by the radio access network (RAN), particularly by the base stations (BSs). The objective is therefore to build and train an ML model that estimates the energy consumed by different 5G base stations, taking into account the impact of various engineering configurations, traffic conditions, and energy-saving methods.

Dataset description: This dataset is derived from the original and simplified for learning purposes. It includes cell-level traffic statistics of 4G/5G sites collected on different days.

➡️ Dataset link

https://i.imgur.com/Agu9zeP.jpg

Instructions

Import your data and perform a basic data exploration phase
Display general information about the dataset
Create a pandas profiling report to gain insights into the dataset
Handle missing and corrupted values
Remove duplicates, if they exist
Handle outliers, if they exist
Encode categorical features
Select your target variable and the features
Split your dataset into training and test sets
Based on your data exploration phase, select an ML regression algorithm and train it on the training set
Assess your model's performance on the test set using relevant evaluation metrics
Discuss with your cohort alternative ways to improve your model's performance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import metrics




In [1]:
# Load the dataset
url = ("/content/sample_data/5G_energy_consumption_dataset.csv")
print(url)


/content/sample_data/5G_energy_consumption_dataset.csv


In [3]:
import pandas as pd
df= pd.read_csv("/content/sample_data/5G_energy_consumption_dataset.csv")
df

Unnamed: 0,Time,BS,Energy,load,ESMODE,TXpower
0,20230101 010000,B_0,64.275037,0.487936,0.0,7.101719
1,20230101 020000,B_0,55.904335,0.344468,0.0,7.101719
2,20230101 030000,B_0,57.698057,0.193766,0.0,7.101719
3,20230101 040000,B_0,55.156951,0.222383,0.0,7.101719
4,20230101 050000,B_0,56.053812,0.175436,0.0,7.101719
...,...,...,...,...,...,...
92624,20230102 170000,B_1018,14.648729,0.087538,0.0,7.325859
92625,20230102 180000,B_1018,14.648729,0.082635,0.0,7.325859
92626,20230102 210000,B_1018,13.452915,0.055538,0.0,7.325859
92627,20230102 220000,B_1018,13.602392,0.058077,0.0,7.325859
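
The Time column is stored as text such as `20230101 010000`. The following is a minimal sketch of how it could be parsed into timestamps, assuming the layout is year-month-day followed by hour-minute-second. The toy frame and the `hour` column are illustrative names only and are not part of the dataset.

```python
import pandas as pd

# Toy rows shaped like the table above; in the notebook the real frame is `df`
sample = pd.DataFrame({
    "Time": ["20230101 010000", "20230101 020000", "20230102 170000"],
    "BS": ["B_0", "B_0", "B_1018"],
})

# Assumed layout: YYYYMMDD HHMMSS
sample["Time"] = pd.to_datetime(sample["Time"], format="%Y%m%d %H%M%S")

# Hypothetical extra feature: hour of day (0-23)
sample["hour"] = sample["Time"].dt.hour
print(sample)
```

An hour-of-day feature like this could be kept as a model input instead of dropping Time altogether, which may matter if consumption follows a daily pattern.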


In [4]:
# Installing pandas-profiling also pulls in ydata-profiling, the package's current name
!pip install pandas-profiling
# Import the installed ydata_profiling library
import ydata_profiling as ydp
# ydp.ProfileReport builds an HTML or notebook report from a DataFrame



Collecting pandas-profiling
  Downloading pandas_profiling-3.6.6-py2.py3-none-any.whl (324 kB)
Collecting ydata-profiling (from pandas-profiling)
  Downloading ydata_profiling-4.8.3-py2.py3-none-any.whl (359 kB)
Collecting visions[type_image_path]<0.7.7,>=0.7.5 (from ydata-profiling->pandas-profiling)
  Downloading visions-0.7.6-py3-none-any.whl (104 kB)
Collecting htmlmin==0.1.12 (from ydata-profiling->pandas-profiling)
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting phik<0.13,>=0.11.1 (from ydata-profiling->pandas-profiling)
  Downloading phik-0.12.4-cp310-c

# Data preparation

In [5]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92629 entries, 0 to 92628
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Time     92629 non-null  object 
 1   BS       92629 non-null  object 
 2   Energy   92629 non-null  float64
 3   load     92629 non-null  float64
 4   ESMODE   92629 non-null  float64
 5   TXpower  92629 non-null  float64
dtypes: float64(4), object(2)
memory usage: 4.2+ MB


In [6]:
# Count the missing values in each column
missing_values = df.isnull().sum()
missing_values

Time       0
BS         0
Energy     0
load       0
ESMODE     0
TXpower    0
dtype: int64

In [7]:
# Drop duplicate rows (the result is stored in a new DataFrame; the cells below keep using df)
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates.head())

              Time   BS     Energy      load  ESMODE   TXpower
0  20230101 010000  B_0  64.275037  0.487936     0.0  7.101719
1  20230101 020000  B_0  55.904335  0.344468     0.0  7.101719
2  20230101 030000  B_0  57.698057  0.193766     0.0  7.101719
3  20230101 040000  B_0  55.156951  0.222383     0.0  7.101719
4  20230101 050000  B_0  56.053812  0.175436     0.0  7.101719
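
The instructions also ask for outliers to be handled. One common approach is the interquartile-range (IQR) rule, sketched below on toy values. The `iqr_bounds` helper, the factor of 1.5, and the sample numbers are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values, not taken from the dataset; 500.0 is a planted outlier
toy = pd.DataFrame({"Energy": [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0]})

low, high = iqr_bounds(toy["Energy"])
mask = toy["Energy"].between(low, high)  # True for rows inside the fences
cleaned = toy[mask]
print(f"kept {len(cleaned)} of {len(toy)} rows")
```

Whether flagged rows should be dropped, clipped, or kept depends on whether they are measurement errors or genuine high-consumption periods, so the rule is best treated as a screening step.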


Data transformation


In [9]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [10]:
# Encode the categorical BS column (station identifiers) as integer labels
for column in ['BS']:
    df[column] = label_encoder.fit_transform(df[column])
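
LabelEncoder assigns integer codes in the sorted order of the labels, so string identifiers are ordered lexicographically, and a linear model will read those codes as a numeric magnitude. The sketch below shows that ordering on four toy labels and, as one possible alternative, one-hot encoding with `pd.get_dummies`. The toy frame is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy station identifiers, not the full dataset
toy = pd.DataFrame({"BS": ["B_0", "B_1", "B_2", "B_10"]})

# LabelEncoder sorts the distinct labels, so "B_10" is placed before "B_2"
encoder = LabelEncoder()
codes = encoder.fit_transform(toy["BS"])
print(list(encoder.classes_))  # sorted labels
print(list(codes))             # integer code assigned to each row

# Alternative: one indicator column per station, with no implied ordering
one_hot = pd.get_dummies(toy["BS"], prefix="BS")
print(one_hot.columns.tolist())
```

One-hot encoding adds one column per station, which can become wide when there are many stations, so it is a trade-off rather than a drop-in improvement.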

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92629 entries, 0 to 92628
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Time     92629 non-null  object 
 1   BS       92629 non-null  int64  
 2   Energy   92629 non-null  float64
 3   load     92629 non-null  float64
 4   ESMODE   92629 non-null  float64
 5   TXpower  92629 non-null  float64
dtypes: float64(4), int64(1), object(1)
memory usage: 4.2+ MB


Data separation as X and y

In [11]:
y = df['Energy']
y

0        64.275037
1        55.904335
2        57.698057
3        55.156951
4        56.053812
           ...    
92624    14.648729
92625    14.648729
92626    13.452915
92627    13.602392
92628    13.303438
Name: Energy, Length: 92629, dtype: float64

In [12]:
x = df.drop(['Energy','Time'],axis=1)
x

Unnamed: 0,BS,load,ESMODE,TXpower
0,0,0.487936,0.0,7.101719
1,0,0.344468,0.0,7.101719
2,0,0.193766,0.0,7.101719
3,0,0.222383,0.0,7.101719
4,0,0.175436,0.0,7.101719
...,...,...,...,...
92624,10,0.087538,0.0,7.325859
92625,10,0.082635,0.0,7.325859
92626,10,0.055538,0.0,7.325859
92627,10,0.058077,0.0,7.325859


Data splitting

In [13]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)

# Print the x_train.dtypes and y_train.dtypes
print("X Train dtypes:", x_train.dtypes)
print("Y Train dtypes :", y_train.dtypes)
print("X Test dtypes:", x_test.dtypes)
print("Y Test dtypes:", y_test.dtypes)


X Train dtypes: BS           int64
load       float64
ESMODE     float64
TXpower    float64
dtype: object
Y Train dtypes : float64
X Test dtypes: BS           int64
load       float64
ESMODE     float64
TXpower    float64
dtype: object
Y Test dtypes: float64
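
A random row-level split places readings from the same base station in both the training and the test set. If the goal were to estimate energy for stations not seen during training, a group-aware split would be one option. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy arrays; the station names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 12 rows from 4 hypothetical stations, 3 rows each
groups = np.repeat(["B_0", "B_1", "B_2", "B_3"], 3)
X = np.arange(24).reshape(12, 2)
y = np.arange(12, dtype=float)

# Hold out 25% of the stations (not 25% of the rows)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=100)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
print("train stations:", sorted(train_groups))
print("test stations:", sorted(test_groups))
```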


In [14]:
x_train

Unnamed: 0,BS,load,ESMODE,TXpower
63549,533,0.034620,0.000000,6.427504
36105,259,0.064800,0.000000,6.427504
16256,59,0.311160,0.000000,6.427504
67275,571,0.388883,0.000000,6.875934
11441,13,0.037440,0.000000,6.427504
...,...,...,...,...
65615,553,0.215870,0.000000,7.100897
77655,674,0.011300,1.896944,6.875934
79683,695,0.295630,0.000000,6.875934
56088,460,0.070530,0.000000,7.100897


In [15]:
x_test

Unnamed: 0,BS,load,ESMODE,TXpower
75490,651,0.019420,0.0,6.875934
42274,320,0.043646,0.0,6.875934
49027,389,0.088530,0.0,6.875934
5576,459,0.112239,0.0,6.427504
50777,406,0.353620,0.0,6.427504
...,...,...,...,...
88383,782,0.040560,0.0,6.427504
1783,84,0.168855,0.0,7.100897
14013,38,0.025372,0.0,7.100897
67916,577,0.796777,0.0,6.875934


Model Building

Linear regression

In [16]:
from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares model on the training split
lr = LinearRegression()
lr.fit(x_train, y_train)


Applying the model to make a prediction

In [17]:
# Predict on the training set and on the test set, keeping the two results separate
y_lr_train_pred = lr.predict(x_train)
y_lr_test_pred = lr.predict(x_test)

In [18]:
y_lr_test_pred

array([22.77289731, 24.36466773, 25.61016096, ..., 28.12500374,
       47.51288724,  6.65339742])

Evaluate model performance

In [19]:
from sklearn.metrics import mean_squared_error, r2_score

# Score the test-set predictions against the held-out targets
lr_test_mse = mean_squared_error(y_test, y_lr_test_pred)
lr_test_r2 = r2_score(y_test, y_lr_test_pred)

print(f"Test Mean Squared Error is: {lr_test_mse}")
print(f"Test R Squared is: {lr_test_r2}")


Test Mean Squared Error is: 86.55842858733565
Test R Squared is: 0.5572643465808441


In [22]:
# Collect the test metrics in a results DataFrame
lr_results = pd.DataFrame({'Method': ['Linear regression'], 'Test MSE': [lr_test_mse], 'Test R2': [lr_test_r2]})

In [24]:
lr_results

Unnamed: 0,Method,Test MSE,Test R2
0,Linear regression,86.558429,0.557264
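
The two metrics reported above have simple definitions: the mean squared error is the average of the squared residuals, and R squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes both by hand on toy arrays, invented for illustration, and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions, invented for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)             # average squared error
ss_res = np.sum(residuals ** 2)                  # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because the MSE is in squared units of the target, taking its square root gives an error in the same units as Energy, which is often easier to interpret.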


Training the model
 Model comparison

In [None]:
df_models = pd.concat([lr_results],axis=0).reset_index(drop=True)
df_models

Unnamed: 0,0
0,Linear regression
1,86.558429
2,0.557264
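
A comparison table becomes informative once it holds more than one model. The following sketch fits two regressors on synthetic data and stacks one result row per model with `pd.concat`. The synthetic data, the `evaluate` helper, and the choice of `DecisionTreeRegressor` as the second model are assumptions for illustration, not part of the checkpoint.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, nonlinear data standing in for the real features and target
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = 20 * X[:, 0] ** 2 + 5 * X[:, 1] + rng.normal(0, 0.5, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=100)

def evaluate(name, model):
    """Fit on the training split and return one row of train/test metrics."""
    model.fit(X_tr, y_tr)
    train_pred, test_pred = model.predict(X_tr), model.predict(X_te)
    return pd.DataFrame({
        "Method": [name],
        "Training MSE": [mean_squared_error(y_tr, train_pred)],
        "Training R2": [r2_score(y_tr, train_pred)],
        "Test MSE": [mean_squared_error(y_te, test_pred)],
        "Test R2": [r2_score(y_te, test_pred)],
    })

# One row per model, stacked into a single comparison table
comparison = pd.concat(
    [
        evaluate("Linear regression", LinearRegression()),
        evaluate("Decision tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ],
    axis=0,
).reset_index(drop=True)
print(comparison)
```

Reporting training and test metrics side by side makes overfitting visible: a large gap between the two columns for a model suggests it has memorised the training split.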


DATA VISUALISATION OF PREDICTION RESULTS

In [None]:
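import matplotlib.pyplot as plt
import numpy as np

# Recompute the training-set predictions so they line up with y_train
y_lr_train_pred = lr.predict(x_train)

plt.figure(figsize=(5,5))
plt.scatter(x=y_train, y=y_lr_train_pred, c="#7CAE00")

# Fit a first-degree trend line through the (actual, predicted) points
z = np.polyfit(y_train, y_lr_train_pred, 1)
p = np.poly1d(z)

plt.plot(y_train, p(y_train), '#F8766D')
plt.ylabel('Predicted energy')
plt.xlabel('Experimental energy')

Text(0.5, 0, 'Experimental energy')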