### ML Pipeline (NWDAF)

This notebook contains a machine learning (ML) pipeline that trains a model to detect anomalies in a 5G network. The pipeline uses a published dataset of 5G network load records labeled with anomalies.

Paper: https://s3.ap-northeast-2.amazonaws.com/journal-home/journal/jcn/fullText/421/10.pdf <br>
Dataset: https://github.com/sevgicansalih/nwdaf_data/blob/master/nwdaf_data.csv

The Network Data Analytics Function (NWDAF) is a network function standardized by 3GPP for 5G. It collects data from user equipment (UE), network functions (NFs), and operations, administration, and maintenance (OAM) systems across the 5G Core, cloud, and edge networks. This data feeds 5G analytics that support better insights and actions to improve the end-user experience.

The Model Training Logical Function (MTLF) trains ML models and exposes new training services, such as an ML model provisioning service.

The Analytics Logical Function (AnLF) performs inference with ML models and provides the analytics results to service consumers (e.g., 5G network functions, application functions, and OAM).

An Implementation Study of Network Data Analytic Function in 5G: https://ieeexplore.ieee.org/document/9730290 <br>
Repo with NWDAF implementation example: https://github.com/net-ty/mnc_NWDAF/tree/main/

### Import libraries

In [1]:
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

### Set the parameters

In [2]:
SEED = 42

### Load data

In [3]:
df = pd.read_csv('../data/nwdaf_data.csv')
df.head()

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly
0,0,0,0,0,4.997074,0
1,0,0,0,1,16.004322,0
2,0,0,0,2,52.985386,0
3,0,0,0,3,0.999767,0
4,0,0,0,4,5.000597,0


### Feature Engineering

According to the feature importance test performed by the paper's authors, the most important features are last2_mean and all of the percentage-change features of the data rate.

Selected features:
- last2_mean: Average data rate of the last two ∆t
- per_change_last2: Percentage of the data rate difference between the last two ∆t
- per_change_last3: Percentage of the data rate difference between the t − ∆t and t − 3 × ∆t
- per_change_last4: Percentage of the data rate difference between the t − ∆t and t − 4 × ∆t

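The feature engineering process needs to consider the group columns cell_id (5 values), cat_id (3 values), and pe_id (5 values), which gives 5 x 3 x 5 = 75 groups. Each feature is computed within a single group, so values from different cells, categories, or pe_ids are never mixed.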
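The group count can be checked with a few lines of arithmetic. This is a minimal sketch: the ID ranges (0 to 4, 0 to 2, 0 to 4) are assumed from the counts stated above, and the total of 1,296,000 rows is the figure that `df.info()` reports for the raw dataset.

```python
from itertools import product

# Assumed ID ranges, based on the counts quoted in the text:
# 5 cell_ids, 3 cat_ids, 5 pe_ids
cell_ids = range(5)
cat_ids = range(3)
pe_ids = range(5)

groups = list(product(cell_ids, cat_ids, pe_ids))
n_groups = len(groups)

total_rows = 1_296_000  # row count reported by df.info() for the raw data
rows_per_group, remainder = divmod(total_rows, n_groups)

print(n_groups, rows_per_group, remainder)  # 75 17280 0
```

Each of the 75 groups therefore holds 17,280 time steps, with no rows left over.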

In [4]:
df[df['t'] == 0]

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly
0,0,0,0,0,4.997074,0
1,0,0,0,1,16.004322,0
2,0,0,0,2,52.985386,0
3,0,0,0,3,0.999767,0
4,0,0,0,4,5.000597,0
...,...,...,...,...,...,...
70,0,4,2,0,3.000244,0
71,0,4,2,1,20.071541,0
72,0,4,2,2,90.101004,0
73,0,4,2,3,0.999545,0


In [5]:
# 1,296,000 rows = 75 groups x 17,280 time steps
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296000 entries, 0 to 1295999
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   t            1296000 non-null  int64  
 1   cell_id      1296000 non-null  int64  
 2   cat_id       1296000 non-null  int64  
 3   pe_id        1296000 non-null  int64  
 4   load         1296000 non-null  float64
 5   has_anomaly  1296000 non-null  int64  
dtypes: float64(1), int64(5)
memory usage: 59.3 MB


In [6]:
# last2_mean: Average data rate of the last two ∆t
def compute_mean(df_raw, cell_id, cat_id, pe_id, num_dt):
    df_temp = df_raw[(df_raw['cell_id'] == cell_id) & (df_raw['cat_id'] == cat_id) & (df_raw['pe_id'] == pe_id)].copy()

    column_name = f'last{num_dt}_mean'
    df_temp[column_name] = df_temp['load'].rolling(window=num_dt, closed='left').mean()

    return df_temp

In [7]:
compute_mean(df,0,0,1,2)

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly,last2_mean
1,0,0,0,1,16.004322,0,
76,1,0,0,1,16.063114,0,
151,2,0,0,1,16.232611,1,16.033718
226,3,0,0,1,16.202982,1,16.147863
301,4,0,0,1,16.314013,1,16.217797
...,...,...,...,...,...,...,...
1295626,17275,0,0,1,15.844138,0,15.789519
1295701,17276,0,0,1,15.834498,0,15.791448
1295776,17277,0,0,1,15.884860,0,15.839318
1295851,17278,0,0,1,15.843108,0,15.859679


In [8]:
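# per_change_last3: Percentage of the data rate difference between the t − ∆t and t − 3 × ∆t
# All three arguments are load values: t = load at t, dt = load at t − ∆t, ndt = load at t − n × ∆t
def compute_single_per_change(t, dt, ndt):
    d1 = t - dt   # change in load over the last ∆t
    d2 = t - ndt  # change in load over the last n × ∆t
    
    return (d1 / d2) - 1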

In [9]:
def compute_per_change(df_raw, cell_id, cat_id, pe_id, num_dt):
    df_temp = df_raw[(df_raw['cell_id'] == cell_id) & (df_raw['cat_id'] == cat_id) & (df_raw['pe_id'] == pe_id)].copy()
    
    column_name = f'per_change_last{num_dt}'
    df_temp[column_name] = df_temp['load'].rolling(window=num_dt+1).apply(lambda x: compute_single_per_change(x.iloc[num_dt], x.iloc[num_dt-1], x.iloc[0]))

    return df_temp

In [10]:
compute_single_per_change(16.232611, 16.063114, 16.004322)

-0.25753321447814104

In [11]:
compute_single_per_change(16.202982, 16.232611, 16.063114)

-1.211835444847999
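`rolling(...).apply` with a Python lambda calls the function once per row, which can be slow on roughly 1.3 million rows. The same quantity, (x_t - x_{t-1}) / (x_t - x_{t-n}) - 1, can be expressed with `Series.shift`. The sketch below is an assumed alternative, not the notebook's method; `per_change_vectorized` is a hypothetical helper name, and the sample loads are the first five values of the group (cell_id=0, cat_id=0, pe_id=1).

```python
import pandas as pd

def per_change_vectorized(load: pd.Series, num_dt: int) -> pd.Series:
    """(x_t - x_{t-1}) / (x_t - x_{t-n}) - 1 for one group's load series."""
    d1 = load - load.shift(1)        # change over the last step
    d2 = load - load.shift(num_dt)   # change over the last num_dt steps
    return d1 / d2 - 1

# First five loads of group (cell_id=0, cat_id=0, pe_id=1)
load = pd.Series([16.004322, 16.063114, 16.232611, 16.202982, 16.314013])
result = per_change_vectorized(load, num_dt=2)
print(result)
```

The first `num_dt` entries are NaN, as with the rolling version, and the remaining values agree with the two manual checks above up to the rounding of the inputs.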

In [12]:
compute_per_change(df,0,0,1,2).head(20)

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly,per_change_last2
1,0,0,0,1,16.004322,0,
76,1,0,0,1,16.063114,0,
151,2,0,0,1,16.232611,1,-0.257534
226,3,0,0,1,16.202982,1,-1.211836
301,4,0,0,1,16.314013,1,0.363985
376,5,0,0,1,16.27773,1,-1.485411
451,6,0,0,1,16.257736,1,-0.644727
526,7,0,0,1,16.213825,1,-0.312867
601,8,0,0,1,16.328005,1,0.624905
676,9,0,0,1,16.271383,1,-1.983749


In [13]:
compute_per_change(df,0,0,1,3).head(20)

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly,per_change_last3
1,0,0,0,1,16.004322,0,
76,1,0,0,1,16.063114,0,
151,2,0,0,1,16.232611,1,
226,3,0,0,1,16.202982,1,-1.149144
301,4,0,0,1,16.314013,1,-0.557468
376,5,0,0,1,16.27773,1,-1.804178
451,6,0,0,1,16.257736,1,-1.365155
526,7,0,0,1,16.213825,1,-0.561713
601,8,0,0,1,16.328005,1,1.271108
676,9,0,0,1,16.271383,1,-5.149244


In [14]:
# The feature engineering process needs to consider the group columns cell_id (5), cat_id (3), and pe_id (5), i.e., 5 x 3 x 5 = 75 groups.
cell_id_list = df['cell_id'].unique()
cat_id_list = df['cat_id'].unique()
pe_id_list = df['pe_id'].unique()
df_fe_list = []

# generate the features
for cell_id in cell_id_list:
    for cat_id in cat_id_list:
        for pe_id in pe_id_list:
            # lastn_mean 
            df_fe = compute_mean(df_raw=df, cell_id=cell_id, cat_id=cat_id, pe_id=pe_id, num_dt=2)

            # per_change_lastn
            for n in [2,3,4]:
                df_per_change = compute_per_change(df_raw=df, cell_id=cell_id, cat_id=cat_id, pe_id=pe_id, num_dt=n)
                df_fe = df_fe.merge(df_per_change, how='inner', on=['t', 'cell_id', 'cat_id', 'pe_id', 'load', 'has_anomaly'])

            # append to the list
            df_fe_list.append(df_fe)

# concat all the dfs
df_ml = pd.concat(df_fe_list)

# assert the number of lines
len_df_raw = len(df)
len_df_ml = len(df_ml)
assert len_df_raw == len_df_ml
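The triple loop filters the full DataFrame once per group and per feature. A possible alternative, sketched here and not benchmarked against the loop, is to let `groupby` handle the grouping. `add_features` is a hypothetical helper, and the toy data is made up for illustration.

```python
import pandas as pd

GROUP_COLS = ["cell_id", "cat_id", "pe_id"]

def add_features(df_raw: pd.DataFrame) -> pd.DataFrame:
    # Sort so that shift() walks back in time inside each group
    df_out = df_raw.sort_values(GROUP_COLS + ["t"]).copy()
    load = df_out.groupby(GROUP_COLS)["load"]

    # Mean of the two previous loads (same window as rolling(2, closed='left'))
    df_out["last2_mean"] = (load.shift(1) + load.shift(2)) / 2

    # (x_t - x_{t-1}) / (x_t - x_{t-n}) - 1, computed within each group
    d1 = df_out["load"] - load.shift(1)
    for n in [2, 3, 4]:
        df_out[f"per_change_last{n}"] = d1 / (df_out["load"] - load.shift(n)) - 1
    return df_out

# Toy data: two groups (cell_id 0 and 1), five time steps each, interleaved by t
toy = pd.DataFrame({
    "t":       [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "cell_id": [0, 1] * 5,
    "cat_id":  [0] * 10,
    "pe_id":   [0] * 10,
    "load":    [1.0, 10.0, 2.0, 30.0, 4.0, 20.0, 8.0, 50.0, 16.0, 40.0],
})
features = add_features(toy)
print(features.dropna())
```

Because the shifts are applied per group, the first rows of every group come out as NaN rather than borrowing values from the previous group, which matches the NaN pattern the loop produces.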

In [15]:
df_ml[(df_ml['cell_id'] == 4) & (df_ml['cat_id'] == 2) & (df_ml['pe_id'] == 4)].head(10)

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly,last2_mean,per_change_last2,per_change_last3,per_change_last4
0,0,4,2,4,5.999293,0,,,,
1,1,4,2,4,5.998804,0,,,,
2,2,4,2,4,6.008583,1,5.999049,0.052582,,
3,3,4,2,4,6.0064,1,6.003694,-1.287391,-1.307144,
4,4,4,2,4,6.019954,1,6.007491,0.191968,-0.359135,-0.343983
5,5,4,2,4,6.019482,1,6.013177,-1.03611,-1.043343,-1.022846
6,6,4,2,4,6.045635,1,6.019718,0.018394,-0.333417,-0.294146
7,7,4,2,4,6.044321,1,6.032558,-1.052914,-1.05394,-1.03466
8,8,4,2,4,6.089671,1,6.044978,0.029847,-0.353887,-0.349509
9,9,4,2,4,6.091241,1,6.066996,-0.966544,-0.965579,-0.978124


In [16]:
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1296000 entries, 0 to 17279
Data columns (total 10 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   t                 1296000 non-null  int64  
 1   cell_id           1296000 non-null  int64  
 2   cat_id            1296000 non-null  int64  
 3   pe_id             1296000 non-null  int64  
 4   load              1296000 non-null  float64
 5   has_anomaly       1296000 non-null  int64  
 6   last2_mean        1295850 non-null  float64
 7   per_change_last2  1295850 non-null  float64
 8   per_change_last3  1295775 non-null  float64
 9   per_change_last4  1295700 non-null  float64
dtypes: float64(5), int64(5)
memory usage: 108.8 MB


In [17]:
df_ml = df_ml.dropna()
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1295700 entries, 4 to 17279
Data columns (total 10 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   t                 1295700 non-null  int64  
 1   cell_id           1295700 non-null  int64  
 2   cat_id            1295700 non-null  int64  
 3   pe_id             1295700 non-null  int64  
 4   load              1295700 non-null  float64
 5   has_anomaly       1295700 non-null  int64  
 6   last2_mean        1295700 non-null  float64
 7   per_change_last2  1295700 non-null  float64
 8   per_change_last3  1295700 non-null  float64
 9   per_change_last4  1295700 non-null  float64
dtypes: float64(5), int64(5)
memory usage: 108.7 MB


In [18]:
df_ml.to_csv('../data/nwdaf_data_processed.csv', index=False)

In [18]:
# at this moment, the minimum t is 4, because of the dropna command
df_ml[(df_ml['cell_id'] == 2) & (df_ml['cat_id'] == 2) & (df_ml['pe_id'] == 4)].head(10)

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly,last2_mean,per_change_last2,per_change_last3,per_change_last4
4,4,2,2,4,6.036855,1,6.018675,0.05267,-0.489219,-0.539958
5,5,2,2,4,6.038776,1,6.027532,-0.906596,-0.902158,-0.950007
6,6,2,2,4,6.062036,1,6.037815,-0.076291,-0.469283,-0.45774
7,7,2,2,4,6.065494,1,6.050406,-0.870578,-0.87926,-0.926872
8,8,2,2,4,6.114099,1,6.063765,-0.066415,-0.354704,-0.370752
9,9,2,2,4,6.111879,1,6.089797,-1.047872,-1.044551,-1.030376
10,10,2,2,4,6.208824,1,6.112989,0.023442,-0.323625,-0.339558
11,11,2,2,4,6.204995,1,6.160352,-1.041128,-1.042133,-1.027453
12,12,2,2,4,6.39586,1,6.206909,0.020476,-0.327894,-0.322597
13,13,2,2,4,6.404012,1,6.300427,-0.959039,-0.958235,-0.972095


In [19]:
df_ml['t'].min()

4

In [20]:
df_ml['t'].max()

17279

In [21]:
# creating the train and test datasets, considering the timestamp t to avoid data leakage among the subsets
# each group has 17,276 records (t = 4 to 17279)
# 80% for training and 20% for testing
test_start_t = 13824
df_ml_train = df_ml[df_ml['t'] < test_start_t]
df_ml_test = df_ml[df_ml['t'] >= test_start_t]
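The threshold 13824 corresponds to 80% of the 17,280 time steps (0.8 x 17280 = 13824). A small helper can derive it from a fraction instead of hard-coding it. This is a sketch: `split_by_time` is a hypothetical name, and it assumes that `t` originally starts at 0 and has no gaps.

```python
import pandas as pd

def split_by_time(df: pd.DataFrame, train_fraction: float = 0.8, time_col: str = "t"):
    """Split on a time threshold so that every test row is later than every train row."""
    n_steps = int(df[time_col].max()) + 1  # assumes t starts at 0 with no gaps
    test_start_t = int(round(train_fraction * n_steps))
    train = df[df[time_col] < test_start_t]
    test = df[df[time_col] >= test_start_t]
    return train, test, test_start_t

# Toy data: two groups with ten time steps each
toy = pd.DataFrame({
    "t": list(range(10)) * 2,
    "cell_id": [0] * 10 + [1] * 10,
    "load": [float(i) for i in range(20)],
})
train, test, test_start_t = split_by_time(toy, train_fraction=0.8)
print(test_start_t, len(train), len(test))  # 8 16 4
```

Splitting on `t` rather than shuffling rows keeps every group's later time steps out of the training set.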

In [22]:
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1295700 entries, 4 to 17279
Data columns (total 10 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   t                 1295700 non-null  int64  
 1   cell_id           1295700 non-null  int64  
 2   cat_id            1295700 non-null  int64  
 3   pe_id             1295700 non-null  int64  
 4   load              1295700 non-null  float64
 5   has_anomaly       1295700 non-null  int64  
 6   last2_mean        1295700 non-null  float64
 7   per_change_last2  1295700 non-null  float64
 8   per_change_last3  1295700 non-null  float64
 9   per_change_last4  1295700 non-null  float64
dtypes: float64(5), int64(5)
memory usage: 108.7 MB


In [23]:
df_ml_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1036500 entries, 4 to 13823
Data columns (total 10 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   t                 1036500 non-null  int64  
 1   cell_id           1036500 non-null  int64  
 2   cat_id            1036500 non-null  int64  
 3   pe_id             1036500 non-null  int64  
 4   load              1036500 non-null  float64
 5   has_anomaly       1036500 non-null  int64  
 6   last2_mean        1036500 non-null  float64
 7   per_change_last2  1036500 non-null  float64
 8   per_change_last3  1036500 non-null  float64
 9   per_change_last4  1036500 non-null  float64
dtypes: float64(5), int64(5)
memory usage: 87.0 MB


In [24]:
df_ml_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259200 entries, 13824 to 17279
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   t                 259200 non-null  int64  
 1   cell_id           259200 non-null  int64  
 2   cat_id            259200 non-null  int64  
 3   pe_id             259200 non-null  int64  
 4   load              259200 non-null  float64
 5   has_anomaly       259200 non-null  int64  
 6   last2_mean        259200 non-null  float64
 7   per_change_last2  259200 non-null  float64
 8   per_change_last3  259200 non-null  float64
 9   per_change_last4  259200 non-null  float64
dtypes: float64(5), int64(5)
memory usage: 21.8 MB


In [25]:
df_ml_train[(df_ml_train['cell_id'] == 2) & (df_ml_train['cat_id'] == 2) & (df_ml_train['pe_id'] == 4)]

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly,last2_mean,per_change_last2,per_change_last3,per_change_last4
4,4,2,2,4,6.036855,1,6.018675,0.052670,-0.489219,-0.539958
5,5,2,2,4,6.038776,1,6.027532,-0.906596,-0.902158,-0.950007
6,6,2,2,4,6.062036,1,6.037815,-0.076291,-0.469283,-0.457740
7,7,2,2,4,6.065494,1,6.050406,-0.870578,-0.879260,-0.926872
8,8,2,2,4,6.114099,1,6.063765,-0.066415,-0.354704,-0.370752
...,...,...,...,...,...,...,...,...,...,...
13819,13819,2,2,4,6.385369,1,6.382395,-0.538064,-0.652786,-2.178726
13820,13820,2,2,4,6.388382,1,6.384429,-0.384160,-0.574540,-0.642421
13821,13821,2,2,4,6.399217,1,6.386875,-0.217587,-0.311092,-0.395273
13822,13822,2,2,4,6.397797,1,6.393799,-1.150834,-1.114264,-1.099253


In [26]:
df_ml_test[(df_ml_test['cell_id'] == 2) & (df_ml_test['cat_id'] == 2) & (df_ml_test['pe_id'] == 4)]

Unnamed: 0,t,cell_id,cat_id,pe_id,load,has_anomaly,last2_mean,per_change_last2,per_change_last3,per_change_last4
13824,13824,2,2,4,6.208496,1,6.398694,0.009478,0.001962,0.062311
13825,13825,2,2,4,6.207048,1,6.304043,-0.992478,-0.992407,-0.992463
13826,13826,2,2,4,6.111695,1,6.207772,-0.014962,-0.668794,-0.666717
13827,13827,2,2,4,6.110011,1,6.159371,-0.982646,-0.982901,-0.994185
13828,13828,2,2,4,6.065333,1,6.110853,-0.036323,-0.684732,-0.687921
...,...,...,...,...,...,...,...,...,...,...
17275,17275,2,2,4,5.976979,0,5.981692,0.402706,-1.985990,-3.211737
17276,17276,2,2,4,5.980999,0,5.979730,-3.710031,40.615455,-0.581319
17277,17277,2,2,4,5.977824,0,5.978989,-4.753949,-0.318474,0.031386
17278,17278,2,2,4,5.978668,0,5.979412,-1.361957,-0.500586,-1.221186


In [27]:
ml_columns = ['cell_id', 
              'cat_id', 
              'pe_id',
              'load',
              'last2_mean', 
              'per_change_last2', 
              'per_change_last3', 
              'per_change_last4', 
              'has_anomaly']

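# .copy() avoids pandas' SettingWithCopyWarning when the categorical columns are cast later
df_ml_train = df_ml_train[ml_columns].copy()
df_ml_test = df_ml_test[ml_columns].copy()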

### Model Training

The authors classified the current status of a network cell to detect anomalies, using logistic regression and extreme gradient boosting (XGBoost), a widely used tree-based ML algorithm. XGBoost performed best in their experiments.

In [28]:
df_ml_train.columns

Index(['cell_id', 'cat_id', 'pe_id', 'load', 'last2_mean', 'per_change_last2',
       'per_change_last3', 'per_change_last4', 'has_anomaly'],
      dtype='object')

In [29]:
categorical_features = ['cell_id', 'cat_id', 'pe_id']

for categorical_feature in categorical_features:
    df_ml_train[categorical_feature] = df_ml_train[categorical_feature].astype('category')
    df_ml_test[categorical_feature] = df_ml_test[categorical_feature].astype('category')
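`astype('category')` infers the categories from the values present in each DataFrame. That works here because the train and test sets both contain every ID, but a frame holding only a subset of IDs (for example, a single record) gets different integer codes for the same ID. One way to guard against this, sketched below as an assumption rather than something the notebook does, is to cast with an explicit `CategoricalDtype`. The ID range 0 to 4 is taken from the `cell_id` values seen in the data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Fixed category list, assumed from the cell_id values 0..4 in the dataset
cell_dtype = CategoricalDtype(categories=[0, 1, 2, 3, 4])

single_record = pd.Series([2], name="cell_id")

inferred = single_record.astype("category")  # categories inferred: [2]
explicit = single_record.astype(cell_dtype)  # categories fixed: [0, 1, 2, 3, 4]

print(inferred.cat.codes.iloc[0])  # 0
print(explicit.cat.codes.iloc[0])  # 2
```

With the explicit dtype, the code for `cell_id` 2 stays 2 regardless of which other IDs appear in the frame.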

In [30]:
df_ml_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1036500 entries, 4 to 13823
Data columns (total 9 columns):
 #   Column            Non-Null Count    Dtype   
---  ------            --------------    -----   
 0   cell_id           1036500 non-null  category
 1   cat_id            1036500 non-null  category
 2   pe_id             1036500 non-null  category
 3   load              1036500 non-null  float64 
 4   last2_mean        1036500 non-null  float64 
 5   per_change_last2  1036500 non-null  float64 
 6   per_change_last3  1036500 non-null  float64 
 7   per_change_last4  1036500 non-null  float64 
 8   has_anomaly       1036500 non-null  int64   
dtypes: category(3), float64(5), int64(1)
memory usage: 58.3 MB


In [31]:
df_ml_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259200 entries, 13824 to 17279
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   cell_id           259200 non-null  category
 1   cat_id            259200 non-null  category
 2   pe_id             259200 non-null  category
 3   load              259200 non-null  float64 
 4   last2_mean        259200 non-null  float64 
 5   per_change_last2  259200 non-null  float64 
 6   per_change_last3  259200 non-null  float64 
 7   per_change_last4  259200 non-null  float64 
 8   has_anomaly       259200 non-null  int64   
dtypes: category(3), float64(5), int64(1)
memory usage: 14.6 MB


In [32]:
target_feature = 'has_anomaly'
features = list(df_ml_train.columns)
features.remove(target_feature)

In [33]:
X_train = df_ml_train[features]
y_train = df_ml_train[target_feature]
X_test = df_ml_test[features]
y_test = df_ml_test[target_feature]

In [34]:
X_train

Unnamed: 0,cell_id,cat_id,pe_id,load,last2_mean,per_change_last2,per_change_last3,per_change_last4
4,0,0,0,5.009872,5.003355,0.119096,-0.430738,-0.462153
5,0,0,0,5.008530,5.006431,-1.242180,-1.279072,-1.124841
6,0,0,0,5.030774,5.009201,0.064207,-0.199441,-0.177762
7,0,0,0,5.032476,5.019652,-0.928902,-0.924681,-0.942263
8,0,0,0,5.072980,5.031625,-0.040338,-0.371544,-0.358180
...,...,...,...,...,...,...,...,...
13819,4,2,4,6.415692,6.414208,-0.329457,-0.458678,-0.769493
13820,4,2,4,6.415900,6.415096,-0.851872,-0.895578,-0.913970
13821,4,2,4,6.413151,6.415796,0.081560,1.037010,2.599204
13822,4,2,4,6.413153,6.414525,-1.000532,-1.000576,-1.001085


In [35]:
y_train

4        1
5        1
6        1
7        1
8        1
        ..
13819    1
13820    1
13821    1
13822    1
13823    1
Name: has_anomaly, Length: 1036500, dtype: int64

In [36]:
X_test

Unnamed: 0,cell_id,cat_id,pe_id,load,last2_mean,per_change_last2,per_change_last3,per_change_last4
13824,0,0,0,5.173411,5.333229,0.017839,0.038877,0.026057
13825,0,0,0,5.173505,5.254027,-1.000583,-1.000593,-1.000605
13826,0,0,0,5.092449,5.173458,0.001160,-0.665324,-0.661373
13827,0,0,0,5.089812,5.132977,-0.968494,-0.968459,-0.989230
13828,0,0,0,5.049533,5.091131,-0.061440,-0.675092,-0.674846
...,...,...,...,...,...,...,...,...
17275,4,2,4,6.015589,6.012935,1.086498,-0.042541,-0.076923
17276,4,2,4,6.014108,6.013795,-1.702608,-7.195608,-1.653195
17277,4,2,4,6.013451,6.014849,-0.692440,-1.453651,0.570809
17278,4,2,4,6.018021,6.013779,0.168124,0.879562,-0.240850


In [37]:
y_test

13824    1
13825    1
13826    1
13827    1
13828    1
        ..
17275    0
17276    0
17277    0
17278    0
17279    0
Name: has_anomaly, Length: 259200, dtype: int64

In [38]:
# baseline model
xgb_model = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic', tree_method='hist', enable_categorical=True, random_state=SEED)
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)

In [39]:
y_pred

array([1, 1, 1, ..., 0, 1, 1])

In [40]:
confusion_matrix(y_test, y_pred)

array([[  9308,  44467],
       [  9970, 195455]], dtype=int64)

In [41]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.48      0.17      0.25     53775
           1       0.81      0.95      0.88    205425

    accuracy                           0.79    259200
   macro avg       0.65      0.56      0.57    259200
weighted avg       0.75      0.79      0.75    259200
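
The per-class numbers in the report follow directly from the baseline confusion matrix, where rows are true labels and columns are predicted labels (scikit-learn's convention). A short worked check in plain Python:

```python
# Baseline confusion matrix from the cell above: rows = true class, columns = predicted class
tn, fp = 9308, 44467    # true class 0 (no anomaly)
fn, tp = 9970, 195455   # true class 1 (anomaly)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(round(precision_0, 2), round(recall_0, 2))  # 0.48 0.17
print(round(precision_1, 2), round(recall_1, 2))  # 0.81 0.95
print(round(accuracy, 2))                         # 0.79
```

The accuracy of 0.79 is about what always predicting class 1 would achieve (205425 / 259200 ≈ 0.79), so the baseline's weakness shows mainly in the 0.17 recall for class 0.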



In [42]:
# HPO
xgb_model_cv = XGBClassifier(objective='binary:logistic', enable_categorical=True, random_state=SEED)

parameters = {
    'max_depth': range(2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05],
    'tree_method': ['approx', 'hist']
}

# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
grid_search = GridSearchCV(
    estimator=xgb_model_cv,
    param_grid=parameters,
    scoring='f1',
    n_jobs=10,
    cv=10,
    verbose=True
)

grid_search.fit(X_train, y_train)

grid_search.best_estimator_

Fitting 10 folds for each of 192 candidates, totalling 1920 fits


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=True, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=9, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=180, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)
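
The log line above ("192 candidates, totalling 1920 fits") is the size of the Cartesian product of the grid multiplied by the number of folds. The count can be reproduced without running the search:

```python
from itertools import product

# Same grid as in the HPO cell
parameters = {
    'max_depth': range(2, 10, 1),          # 8 values
    'n_estimators': range(60, 220, 40),    # 60, 100, 140, 180
    'learning_rate': [0.1, 0.01, 0.05],    # 3 values
    'tree_method': ['approx', 'hist'],     # 2 values
}
cv = 10

n_candidates = len(list(product(*parameters.values())))
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 192 1920
```

Estimating the number of fits this way before launching the search helps judge whether the grid is affordable on a training set of about one million rows.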

In [43]:
xgb_model = grid_search.best_estimator_
y_pred = xgb_model.predict(X_test)

In [44]:
confusion_matrix(y_test, y_pred)

array([[ 35629,  18146],
       [ 12458, 192967]], dtype=int64)

In [45]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.66      0.70     53775
           1       0.91      0.94      0.93    205425

    accuracy                           0.88    259200
   macro avg       0.83      0.80      0.81    259200
weighted avg       0.88      0.88      0.88    259200



In [46]:
xgb_model.save_model('../model_files/xgb_model.json')

### Inference

In [4]:
xgb_model = XGBClassifier(objective='binary:logistic', enable_categorical=True, random_state=SEED)
xgb_model.load_model('../model_files/xgb_model.json')

In [5]:
xgb_model

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=True, eval_metric=None,
              feature_types=['c', 'c', 'c', 'float', 'float', 'float', 'float',
                             'float'],
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)

In [49]:
# take a record from df_ml_test WITHOUT an anomaly (cell_id=2, cat_id=2, pe_id=4, t=17275) for inference

X_inference1 = pd.DataFrame({
    'cell_id': [2],
    'cat_id': [2],
    'pe_id': [4],
    'load': [5.976979],
    'last2_mean': [5.981692],
    'per_change_last2': [0.402706],
    'per_change_last3': [-1.985990],
    'per_change_last4': [-3.211737]
})

X_inference1

Unnamed: 0,cell_id,cat_id,pe_id,load,last2_mean,per_change_last2,per_change_last3,per_change_last4
0,2,2,4,5.976979,5.981692,0.402706,-1.98599,-3.211737


In [50]:
y_pred = xgb_model.predict(X_inference1)
y_pred

array([0])

In [51]:
# take a record from df_ml_test WITH an anomaly (cell_id=2, cat_id=2, pe_id=4, t=13824) for inference

X_inference2 = pd.DataFrame({
    'cell_id': [2],
    'cat_id': [2],
    'pe_id': [4],
    'load': [6.208496],
    'last2_mean': [6.398694],
    'per_change_last2': [0.009478],
    'per_change_last3': [0.001962],
    'per_change_last4': [0.062311]
})

X_inference2

Unnamed: 0,cell_id,cat_id,pe_id,load,last2_mean,per_change_last2,per_change_last3,per_change_last4
0,2,2,4,6.208496,6.398694,0.009478,0.001962,0.062311


In [52]:
y_pred = xgb_model.predict(X_inference2)
y_pred

array([1])
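
The inference records above were copied from the test set with their features already computed. For a live record, the features would have to be derived from the most recent raw loads of the same group. The sketch below mirrors the formulas from the feature engineering section in plain Python; `build_feature_row` is a hypothetical helper, not part of the notebook.

```python
def build_feature_row(cell_id, cat_id, pe_id, loads):
    """loads: the five most recent load values of one group, oldest first."""
    if len(loads) != 5:
        raise ValueError("expected exactly five load values")
    current, previous = loads[-1], loads[-2]

    def per_change(n):
        # Raises ZeroDivisionError if the current load equals the load n steps back
        return (current - previous) / (current - loads[-1 - n]) - 1

    return {
        'cell_id': cell_id,
        'cat_id': cat_id,
        'pe_id': pe_id,
        'load': current,
        'last2_mean': (loads[-2] + loads[-3]) / 2,  # mean of the two previous loads
        'per_change_last2': per_change(2),
        'per_change_last3': per_change(3),
        'per_change_last4': per_change(4),
    }

# First five loads of group (cell_id=0, cat_id=0, pe_id=1) from the feature engineering section
row = build_feature_row(0, 0, 1, [16.004322, 16.063114, 16.232611, 16.202982, 16.314013])
print(row)
```

Wrapping the result as `pd.DataFrame([row])` gives a single-row frame with the same column order as `X_inference1`, which could then be passed to `xgb_model.predict`.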