# Anomaly detection in cellular networks

## Introduction

The purpose of this homework is to solve a classification problem posed as a competition on the Kaggle InClass platform, in which each two-member team tries to achieve the highest score. You can apply any of the concepts and techniques studied in class for exploratory data analysis, feature selection, and classification.

## Goal

The objective of the network optimization team is to analyze traces of past activity, which will be used to train an ML system capable of classifying samples of current activity as:
##### • 0 (normal): current activity corresponds to the normal behavior of any working day. Therefore, no reconfiguration or redistribution of resources is needed.
##### • 1 (unusual): current activity slightly differs from the behavior usually observed for that time of the day (e.g. due to a strike, demonstration, sports event, etc.), which should trigger a reconfiguration of the base station.

## Import Packages

In [228]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pickle

## Loading the data

In [229]:
train_data = pd.read_csv('ML-MATT-CompetitionQT1920_train.csv', encoding='latin')
train_data

Unnamed: 0,Time,CellName,PRBUsageUL,PRBUsageDL,meanThr_DL,meanThr_UL,maxThr_DL,maxThr_UL,meanUE_DL,meanUE_UL,maxUE_DL,maxUE_UL,maxUE_UL+DL,Unusual
0,10:45,3BLTE,11.642,1.393,0.370,0.041,15.655,0.644,1.114,1.025,4.0,3.0,7,1
1,9:45,1BLTE,21.791,1.891,0.537,0.268,10.273,1.154,1.353,1.085,6.0,4.0,10,1
2,7:45,9BLTE,0.498,0.398,0.015,0.010,0.262,0.164,0.995,0.995,1.0,1.0,2,1
3,2:45,4ALTE,1.891,1.095,0.940,0.024,60.715,0.825,1.035,0.995,2.0,2.0,4,1
4,3:30,10BLTE,0.303,0.404,0.016,0.013,0.348,0.168,1.011,1.011,2.0,1.0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36899,21:30,6ALTE,7.579,0.808,0.164,0.037,7.980,0.375,1.101,0.010,4.0,3.0,7,0
36900,9:45,8CLTE,9.095,1.213,0.189,0.030,19.510,1.583,1.122,1.031,4.0,2.0,6,0
36901,13:30,9BLTE,4.378,0.896,0.341,0.030,12.037,0.540,1.065,1.005,4.0,3.0,7,1
36902,12:30,3CLTE,13.339,2.728,0.559,0.065,28.187,0.894,1.223,1.061,5.0,4.0,9,0


## Understanding the data

##### • Time : hour of the day (in the format hh:mm) when the sample was generated.
##### • CellName: text string used to uniquely identify the cell that generated the current sample. CellName is in the form xαLTE, where x identifies the base station, and α the cell within that base station (see the example in the right figure).
##### • PRBUsageUL and PRBUsageDL: level of resource utilization in that cell, measured as the portion of Physical Resource Blocks (PRB) that were in use (%) in the previous 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
##### • meanThr_DL and meanThr_UL: average carried traffic (in Mbps) during the past 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
##### • maxThr_DL and maxThr_UL: maximum carried traffic (in Mbps) measured in the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
##### • meanUE_DL and meanUE_UL: average number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
##### • maxUE_DL and maxUE_UL: maximum number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
##### • maxUE_UL+DL: maximum number of user equipment (UE) devices that were active simultaneously in the last 15 minutes, regardless of UL and DL.
##### • Unusual: labels for supervised learning. A value of 0 indicates that the sample corresponds to normal operation; a value of 1 identifies unusual behavior.
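As a small illustration of the `CellName` format described above, the following sketch splits an identifier into its base-station number and cell letter. The helper name `parse_cell_name` and the regular expression are assumptions made for this example; the notebook itself drops the column instead of parsing it.

```python
import re

# Assumed pattern: digits (base station), one uppercase letter (cell), literal "LTE".
CELL_PATTERN = re.compile(r"^(\d+)([A-Z])LTE$")


def parse_cell_name(cell_name):
    """Split e.g. '10BLTE' into (10, 'B'); raise ValueError on any other shape."""
    match = CELL_PATTERN.match(cell_name)
    if match is None:
        raise ValueError(f"unexpected CellName format: {cell_name!r}")
    return int(match.group(1)), match.group(2)


for name in ["3BLTE", "10BLTE", "4ALTE"]:
    print(name, parse_cell_name(name))
```

Features derived this way (station and cell) could be encoded and kept as inputs, which may or may not help the classifier.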

## Handling Unnecessary Features

In [230]:
train_data = train_data.drop(['CellName', 'Time'], axis=1)
train_data

Unnamed: 0,PRBUsageUL,PRBUsageDL,meanThr_DL,meanThr_UL,maxThr_DL,maxThr_UL,meanUE_DL,meanUE_UL,maxUE_DL,maxUE_UL,maxUE_UL+DL,Unusual
0,11.642,1.393,0.370,0.041,15.655,0.644,1.114,1.025,4.0,3.0,7,1
1,21.791,1.891,0.537,0.268,10.273,1.154,1.353,1.085,6.0,4.0,10,1
2,0.498,0.398,0.015,0.010,0.262,0.164,0.995,0.995,1.0,1.0,2,1
3,1.891,1.095,0.940,0.024,60.715,0.825,1.035,0.995,2.0,2.0,4,1
4,0.303,0.404,0.016,0.013,0.348,0.168,1.011,1.011,2.0,1.0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...
36899,7.579,0.808,0.164,0.037,7.980,0.375,1.101,0.010,4.0,3.0,7,0
36900,9.095,1.213,0.189,0.030,19.510,1.583,1.122,1.031,4.0,2.0,6,0
36901,4.378,0.896,0.341,0.030,12.037,0.540,1.065,1.005,4.0,3.0,7,1
36902,13.339,2.728,0.559,0.065,28.187,0.894,1.223,1.061,5.0,4.0,9,0


In [231]:
# train_data['Time'] = [int(time.split(':')[0]) for time in train_data['Time']]
# train_data
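The commented-out line above would keep only the hour. One possible alternative, sketched here under the assumption that `Time` strings always look like `h:mm` or `hh:mm`, converts the time to minutes since midnight and encodes it cyclically so that 23:45 and 0:00 end up close together. The function names are hypothetical and not part of the notebook.

```python
import math

MINUTES_PER_DAY = 24 * 60


def time_to_minutes(time_str):
    """Convert 'h:mm' or 'hh:mm' to minutes since midnight."""
    hours, minutes = time_str.split(":")
    return int(hours) * 60 + int(minutes)


def cyclic_time_features(time_str):
    """Return (sin, cos) of the time of day mapped onto a 24-hour circle."""
    angle = 2 * math.pi * time_to_minutes(time_str) / MINUTES_PER_DAY
    return math.sin(angle), math.cos(angle)


print(time_to_minutes("10:45"))      # 645
print(cyclic_time_features("6:00"))  # approximately (1.0, 0.0)
```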

## Checking for null values

In [232]:
train_data.isnull().value_counts()

PRBUsageUL  PRBUsageDL  meanThr_DL  meanThr_UL  maxThr_DL  maxThr_UL  meanUE_DL  meanUE_UL  maxUE_DL  maxUE_UL  maxUE_UL+DL  Unusual
False       False       False       False       False      False      False      False      False     False     False        False      36815
                                                                                            True      True      False        False         84
                                                                                                                True         False          5
Name: count, dtype: int64
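`isnull().value_counts()` counts distinct row patterns of missingness, which is why the output above lists combinations of `True`/`False`. A per-column count is often easier to read. The sketch below shows that variant on a small made-up frame; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "maxUE_DL": [4.0, np.nan, 2.0],
    "maxUE_UL": [3.0, np.nan, 1.0],
    "Unusual": [1, 0, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)
```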

In [233]:
# Dropping the null values
train_data.dropna(inplace=True)

In [234]:
train_data.isnull().value_counts()

PRBUsageUL  PRBUsageDL  meanThr_DL  meanThr_UL  maxThr_DL  maxThr_UL  meanUE_DL  meanUE_UL  maxUE_DL  maxUE_UL  maxUE_UL+DL  Unusual
False       False       False       False       False      False      False      False      False     False     False        False      36815
Name: count, dtype: int64

All rows containing null values have been removed.

## Checking if the dataset is balanced

In [235]:
train_data['Unusual'].value_counts()

Unusual
0    26648
1    10167
Name: count, dtype: int64
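The counts above show an imbalanced dataset: roughly 72% of the samples are normal and 28% unusual (before duplicates are removed). A minimal sketch of how the class shares and "balanced" class weights could be derived from these counts follows. Passing such a dictionary as `class_weight` to `model.fit` is one possible mitigation and is an assumption here, not something this notebook does.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 26648, 1: 10167}
total = sum(counts.values())

# Share of each class in the cleaned training data.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" weights: total / (number_of_classes * class_count).
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

print(shares)
print(class_weight)
```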

## One-hot encoding the Time column

In [236]:
# from sklearn.preprocessing import OneHotEncoder
# onehot_encoder = OneHotEncoder()
# time_encoder = onehot_encoder.fit_transform(train_data[['Time']])
# time_encoder

In [237]:
# time_encoder_df = pd.DataFrame(time_encoder.toarray(), columns=onehot_encoder.get_feature_names_out(['Time']))
# time_encoder_df

In [238]:
# ## Combine one hot encoded columns with original data
# train_data = pd.concat([train_data.drop('Time', axis=1), time_encoder_df], axis=1)
# train_data

In [239]:
# Dropping the null values
train_data.dropna(inplace=True)

In [240]:
train_data

Unnamed: 0,PRBUsageUL,PRBUsageDL,meanThr_DL,meanThr_UL,maxThr_DL,maxThr_UL,meanUE_DL,meanUE_UL,maxUE_DL,maxUE_UL,maxUE_UL+DL,Unusual
0,11.642,1.393,0.370,0.041,15.655,0.644,1.114,1.025,4.0,3.0,7,1
1,21.791,1.891,0.537,0.268,10.273,1.154,1.353,1.085,6.0,4.0,10,1
2,0.498,0.398,0.015,0.010,0.262,0.164,0.995,0.995,1.0,1.0,2,1
3,1.891,1.095,0.940,0.024,60.715,0.825,1.035,0.995,2.0,2.0,4,1
4,0.303,0.404,0.016,0.013,0.348,0.168,1.011,1.011,2.0,1.0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...
36899,7.579,0.808,0.164,0.037,7.980,0.375,1.101,0.010,4.0,3.0,7,0
36900,9.095,1.213,0.189,0.030,19.510,1.583,1.122,1.031,4.0,2.0,6,0
36901,4.378,0.896,0.341,0.030,12.037,0.540,1.065,1.005,4.0,3.0,7,1
36902,13.339,2.728,0.559,0.065,28.187,0.894,1.223,1.061,5.0,4.0,9,0


## Save the encoder

In [241]:
# with open('onehot_encoder_time.pkl', 'wb') as file:
#     pickle.dump(onehot_encoder, file)  # save the fitted encoder, not the encoded matrix

## Checking for any duplicated data

In [242]:
train_data.duplicated().value_counts()

False    36596
True       219
Name: count, dtype: int64

In [243]:
## Dropping duplicated records
train_data.drop_duplicates(inplace=True)

In [244]:
train_data.duplicated().value_counts()

False    36596
Name: count, dtype: int64

## Dividing the dataset into independent and dependent features

In [245]:
X = train_data.drop('Unusual', axis=1)
y = train_data['Unusual']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
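The cell above produces two partitions, and the same `X_test` is later passed as `validation_data` during training, so the final score is measured on data that also steered early stopping. A hedged alternative is a three-way, stratified split. The sketch below uses synthetic data, and every name ending in `_demo` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (1000 rows, 11 features, about 28% positives).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 11))
y_demo = (rng.random(1000) < 0.28).astype(int)

# First carve out 20% as the final test set, then 20% of the rest for validation.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42, stratify=y_rest
)

# Fit the scaler on the training part only, then reuse its statistics.
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_s = scaler_demo.transform(X_train_demo)
X_val_s = scaler_demo.transform(X_val_demo)
X_test_s = scaler_demo.transform(X_test_demo)

print(X_train_s.shape, X_val_s.shape, X_test_s.shape)
```

With this layout the validation part would drive early stopping, and the test part would be used only once for the final score.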

In [246]:
X

Unnamed: 0,PRBUsageUL,PRBUsageDL,meanThr_DL,meanThr_UL,maxThr_DL,maxThr_UL,meanUE_DL,meanUE_UL,maxUE_DL,maxUE_UL,maxUE_UL+DL
0,11.642,1.393,0.370,0.041,15.655,0.644,1.114,1.025,4.0,3.0,7
1,21.791,1.891,0.537,0.268,10.273,1.154,1.353,1.085,6.0,4.0,10
2,0.498,0.398,0.015,0.010,0.262,0.164,0.995,0.995,1.0,1.0,2
3,1.891,1.095,0.940,0.024,60.715,0.825,1.035,0.995,2.0,2.0,4
4,0.303,0.404,0.016,0.013,0.348,0.168,1.011,1.011,2.0,1.0,3
...,...,...,...,...,...,...,...,...,...,...,...
36899,7.579,0.808,0.164,0.037,7.980,0.375,1.101,0.010,4.0,3.0,7
36900,9.095,1.213,0.189,0.030,19.510,1.583,1.122,1.031,4.0,2.0,6
36901,4.378,0.896,0.341,0.030,12.037,0.540,1.065,1.005,4.0,3.0,7
36902,13.339,2.728,0.559,0.065,28.187,0.894,1.223,1.061,5.0,4.0,9


In [247]:
X_train

array([[-0.92120248, -0.58169799, -0.70455384, ..., -1.26005869,
        -1.51206656, -1.4202881 ],
       [-0.57246643, -0.76249208, -0.76247725, ..., -1.26005869,
        -0.78259486, -1.08840196],
       [-0.84909942,  0.14192587, -0.16945182, ..., -0.11798726,
        -0.05312316, -0.09274356],
       ...,
       [-0.92132146, -0.72042613, -0.71834513, ..., -0.68902297,
        -0.78259486, -0.75651583],
       [-0.76486022,  0.18712439, -0.42321154, ...,  0.45304845,
        -0.05312316,  0.23914257],
       [ 0.3893835 , -0.35525787, -0.32943077, ...,  0.45304845,
        -0.05312316,  0.23914257]])

In [248]:
X_train.shape

(29276, 11)

In [249]:
X_test.shape

(7320, 11)

In [250]:
## Save the scaler
with open('scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)
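The saved scaler is meant to be reloaded at prediction time so that new samples are standardized with the training statistics. The sketch below shows that round trip with made-up numbers. It goes through in-memory bytes rather than `scaler.pkl` to stay self-contained, and the `_demo` names are assumptions.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on made-up data; in the notebook this is the fitted `scaler`.
train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler_demo = StandardScaler().fit(train_demo)

# Round-trip through pickle bytes (a file opened in 'rb' mode works the same way).
restored = pickle.loads(pickle.dumps(scaler_demo))

# New samples are transformed, never re-fitted, so the training statistics apply.
new_samples = np.array([[2.0, 20.0]])
print(restored.transform(new_samples))
```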

## ANN Implementation

In [251]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping

In [252]:
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [253]:
model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_19 (Dense)            (None, 128)               1536      
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_20 (Dense)            (None, 64)                8256      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_21 (Dense)            (None, 32)                2080      
                                                                 
 dense_22 (Dense)            (None, 1)                 33        
                                                                 
Total params: 11,905
Trainable params: 11,905
Non-trainable params: 0
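The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `Dropout` layers add no parameters. A short sketch of that arithmetic:

```python
# Layer widths: 11 input features followed by the four Dense layers.
layer_sizes = [11, 128, 64, 32, 1]

# A Dense layer has (inputs * units) weights plus one bias per unit.
params = [n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(params)       # [1536, 8256, 2080, 33]
print(sum(params))  # 11905
```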

## Compiling the model

In [254]:
## compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [255]:
## Set up Early Stopping
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
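To make the role of `patience` concrete, here is a simplified pure-Python imitation of patience-based early stopping. It is a sketch of the idea only, not Keras's actual implementation; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss seen so far; otherwise it runs through all given epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up loss curve: improves until epoch 3, then stalls.
losses = [0.60, 0.55, 0.50, 0.51, 0.52, 0.53]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

With `restore_best_weights=True`, the model would end up with the weights from the best epoch (epoch 3 in this toy curve) rather than from the last one.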

## Training the model

In [256]:
## Train the model
history = model.fit(
    X_train, y_train, validation_data=(X_test, y_test), epochs=100, callbacks=[early_stopping_callback], batch_size=32
)

Epoch 1/100
...
Epoch 66/100


## Saving the model

In [257]:
model.save('model.h5')

## Checking the f1 score of the model

In [258]:
# Predict probabilities on the held-out test split
y_pred_prob = model.predict(X_test)

# Convert predicted probabilities to binary class labels using a 0.5 threshold
y_pred = np.where(y_pred_prob > 0.5, 1, 0)

In [259]:
# Calculate the F1 score on the test split
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)

F1 Score: 0.6001936108422071
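The score above is obtained with a fixed 0.5 threshold. On imbalanced data the F1 score can be sensitive to that choice, so one option is to compare a few thresholds, ideally on a validation split rather than on the test data. The sketch below illustrates the idea with made-up labels and probabilities; none of these numbers come from the trained model.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.45, 0.55, 0.90])

# Evaluate F1 at a few candidate thresholds.
thresholds = [0.3, 0.5, 0.7]
scores = [f1_score(y_true, (y_prob > t).astype(int)) for t in thresholds]

best = int(np.argmax(scores))
print(f"best threshold: {thresholds[best]}, F1: {scores[best]:.3f}")
```

In this toy example the lowest threshold wins because it recovers every positive at the cost of two false positives; real data may behave differently.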
