<a href="https://colab.research.google.com/github/Mychoyce/Gomycode-Checkpoints/blob/main/Supervised_Learning_Regression_Checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What You're Aiming For

In this checkpoint, you are going to work on the '5G-Energy consumption' dataset that was provided by the international telecommunication union (ITU) in 2023 as part of a global challenge or competition for data scientists all over the world to solve the 5G energy consumption modelling using machine learning techniques.

The competition is taking place from 2023-07-05 to 2023-09-30. Fore more information click here.

Checkpoint problematic : Network operational expenditure (OPEX) already accounts for around 25 percent of the total telecom operator’s cost, and 90 percent of it is spent on large energy bills. More than 70 percent of this energy is estimated to be consumed by the radio access network (RAN), particularly by the base stations (BSs). Thus, the objective is to build and train a ML model to estimate the energy consumed by different 5G base stations taking into consideration the impact of various engineering configurations, traffic conditions, and energy-saving methods.

Dataset description : This dataset is derived from the original copy and simplified for learning purposes. It includes cell-level traffic statistics of 4G/5G sites collected on different days.

➡️ Dataset link

https://i.imgur.com/Agu9zeP.jpg

Instructions

Import you data and perform basic data exploration phase
Display general information about the dataset
Create a pandas profiling reports to gain insights into the dataset
Handle Missing and corrupted values
Remove duplicates, if they exist
Handle outliers, if they exist
Encode categorical features
Select your target variable and the features
Split your dataset to training and test sets
Based on your data exploration phase select a ML regression algorithm and train it on the training set
Assess your model performance on the test set using relevant evaluation metrics
Discuss with your cohort alternative ways to improve your model performance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import metrics



In [None]:
import pandas as pd

In [None]:
# Load the dataset
url = ("/content/5G_energy_consumption_dataset.csv")
print(url)


In [None]:
import pandas as pd
df= pd.read_csv('/content/5G_energy_consumption_dataset.csv')
df

Unnamed: 0,Time,BS,Energy,load,ESMODE,TXpower
0,20230101 010000,B_0,64.275037,0.487936,0.0,7.101719
1,20230101 020000,B_0,55.904335,0.344468,0.0,7.101719
2,20230101 030000,B_0,57.698057,0.193766,0.0,7.101719
3,20230101 040000,B_0,55.156951,0.222383,0.0,7.101719
4,20230101 050000,B_0,56.053812,0.175436,0.0,7.101719
...,...,...,...,...,...,...
92624,20230102 170000,B_1018,14.648729,0.087538,0.0,7.325859
92625,20230102 180000,B_1018,14.648729,0.082635,0.0,7.325859
92626,20230102 210000,B_1018,13.452915,0.055538,0.0,7.325859
92627,20230102 220000,B_1018,13.602392,0.058077,0.0,7.325859


In [None]:
!pip install pandas-profiling
! install Ydata Profiling
import ydata_profiling as ydp
# Import the Ydata_profiling library that have been installedimport profile
# Imports profile for creating html or notebook from ydata_profiling result



# data preparation

In [None]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92629 entries, 0 to 92628
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Time     92629 non-null  object 
 1   BS       92629 non-null  object 
 2   Energy   92629 non-null  float64
 3   load     92629 non-null  float64
 4   ESMODE   92629 non-null  float64
 5   TXpower  92629 non-null  float64
dtypes: float64(4), object(2)
memory usage: 4.2+ MB


In [None]:
#sum null values
missing_values=df.isnull().sum()
missing_values

In [None]:
#droping the duplicates
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates.head())

Data transformation


In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [None]:
#changing the data type string to integer
for column in ['BS']:
  df[column]=label_encoder.fit_transform(df[column])

In [None]:
df

Unnamed: 0,Time,BS,Energy,load,ESMODE,TXpower
0,20230101 010000,0,64.275037,0.487936,0.0,7.101719
1,20230101 020000,0,55.904335,0.344468,0.0,7.101719
2,20230101 030000,0,57.698057,0.193766,0.0,7.101719
3,20230101 040000,0,55.156951,0.222383,0.0,7.101719
4,20230101 050000,0,56.053812,0.175436,0.0,7.101719
...,...,...,...,...,...,...
92624,20230102 170000,10,14.648729,0.087538,0.0,7.325859
92625,20230102 180000,10,14.648729,0.082635,0.0,7.325859
92626,20230102 210000,10,13.452915,0.055538,0.0,7.325859
92627,20230102 220000,10,13.602392,0.058077,0.0,7.325859


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92629 entries, 0 to 92628
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Time     92629 non-null  object 
 1   BS       92629 non-null  int64  
 2   Energy   92629 non-null  float64
 3   load     92629 non-null  float64
 4   ESMODE   92629 non-null  float64
 5   TXpower  92629 non-null  float64
dtypes: float64(4), int64(1), object(1)
memory usage: 4.2+ MB


Data seperation as X and Y

In [None]:
y = df['Energy']
y

In [None]:
x = df.drop(['Energy','Time'],axis=1)
x

Data spitting

In [66]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)

# Print the x_train.dtypes and y_train.dtypes
print("X Train dtypes:", x_train.dtypes)
print("Y Train dtypes :", y_train.dtypes)
print("X Test dtypes:", x_test.dtypes)
print("Y Test dtypes:", y_test.dtypes)


X Train dtypes: BS           int64
load       float64
ESMODE     float64
TXpower    float64
dtype: object
Y Train dtypes : float64
X Test dtypes: BS           int64
load       float64
ESMODE     float64
TXpower    float64
dtype: object
Y Test dtypes: float64


In [None]:
x_train

In [None]:
x_test

Model Building

Linear regression

In [59]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
lr = LinearRegression()
lr.fit(x_train,y_train)


applying the model to make a prediction

In [None]:
y_lr_train_pred = lr.predict(x_train)
y_lr_train_pred = lr.predict(x_test)

In [60]:
y_lr_train_pred

array([22.77289731, 24.36466773, 25.61016096, ..., 28.12500374,
       47.51288724,  6.65339742])

evaluate model performance

In [None]:
!pip install scikit-learn
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
lr_train_mse = mean_squared_error(y_test,y_lr_train_pred)
lr_train_r2 =r2_score(y_test,y_lr_train_pred)

print(f"Mean Square Error is: {lr_train_mse}")
print(f"R Square is: {lr_train_r2}")


In [None]:
lr_results = pd.DataFrame({'Method': ['Linear regression'], 'Training MSE': [lr_train_mse], 'Training R2': [lr_train_r2], 'Test MSE': [lr_test_mse]})

In [84]:
lr_results

Unnamed: 0,0
method,Linear regression
training mse,86.558429
training r2,0.557264


Training the model
 Model comparison

In [86]:
df_models = pd.concat([lr_results],axis=0).reset_index(drop=True)
df_models

Unnamed: 0,0
0,Linear regression
1,86.558429
2,0.557264


DATA VISUALISATION OF PREDICTION RESULTS

In [None]:
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(5,5))
plt.scatter(x=y_train, y=y_lr_train_pred, c="#7CAE00" )

z = np.polyfit(y_train, y_lr_train_pred, 1)
p = np.poly1d(z)

plt.plot(y_train, p(y_train), '#F8766D')
plt.ylabel('predict Energy')
plt.xlabel('experimental Eenergy')
TEST(0.5,0,'experimental Eenergy')