# Detecção de anomalias de dataframe pelo método Weigth Moving Average(WMA)
_(fonte: DALMAZO, Bruno L. et al. Performance Analysis of Network Traffic Predictors in the Cloud. 1. ed. Pinhal de Marrocos, Coimbra, Portuga: Springer Science+Business Media, 2016. v. 1.)_ 


Através da biblioteca pandas, foi analisado um dataframe importado de um repositório do GitHub de alguns dados de navegação.

In [1]:
##importar biblioteca
import pandas

##definir dataframe
df = pandas.read_csv('/content/drive/MyDrive/Furg/2022/2° Semestre/Algoritmos Computacionais II/anomalias.csv')

##definir coluna do dataframe
c1 = df['Average Packet Size']
c2 = df['Flow IAT Std']
c3 = df['Flow Bytes/s']
a = df['target']


$$WMA = \frac{\sum_{i=1}^{n}(w_i*v_i)}{\sum_{i=1}^{n}(w_i)}$$

Onde:

$w_i$: vetor de peso

$v_i$: vetor dos dados

In [2]:
##definir tamanho janela móvel
janela = 10

##criar vetor de pesos para a WMA
w = []

for i in range(len(c1)):

  for j in range(janela):

    w.append((j+1)*(1/sum(list(range(1,janela+1)))))


In [None]:
##criar vetor de modelo
vPModel1 = []
vPModel2 = []
vPModel3 = []

##criar vetor do valor vezes o peso
WV1 = []
WV2 = []
WV3 = []

for i in range (len(c1) - janela + 1):

  for j in range(janela+1):

    WV1.append(c1[j]*w[j])
    WV2.append(c2[j]*w[j])
    WV3.append(c3[j]*w[j])


##exponential moving average      
for i in range (len(c1) - janela + 1):

  somW = w[i : i + janela]

  somWV1 = WV1[i : i + janela]
  somWV2 = WV2[i : i + janela]
  somWV3 = WV3[i : i + janela]

  EMA1 = sum(somW)/sum(somWV1)
  EMA2 = sum(somW)/sum(somWV2)
  EMA3 = sum(somW)/sum(somWV3)
  
  vPModel1.append(EMA1)
  vPModel2.append(EMA2)
  vPModel3.append(EMA3)


In [None]:
p1 = []
p2 = []
p3 = []

for i in range (len(c1) - janela + 1):

  nextValue1 = 0
  nextValue2 = 0
  nextValue3 = 0

  for j in range (janela):

    if (i-j>=0):

      tmp1 = vPModel1[j] * c1[i-j] 
      tmp2 = vPModel2[j] * c2[i-j]
      tmp3 = vPModel3[j] * c3[i-j]

    nextValue1 = (nextValue1 + tmp1)  
    nextValue2 = (nextValue2 + tmp2) 
    nextValue3 = (nextValue3 + tmp3) 

  p1.append(nextValue1)  
  p2.append(nextValue2)
  p3.append(nextValue3)


$$MAPE = \left(\frac{1}{N} \sum_{t=1}^N \frac{\left|X_t-\hat{X}_t\right|}{|\bar{X}|}\right) \times 100$$

Onde: 

$N$= tamanho do vetor

$X_t$ = valor real

$\hat{X}$ = valor esperado

$\bar{X}$ = média do vetor

In [None]:
somatorio1 = []
somatorio2 = []
somatorio3 = []

m1 = sum(c1)/len(c1)
m2 = sum(c2)/len(c2)
m3 = sum(c3)/len(c3)

for i in range(len(c1)-janela):

  soma1 = abs(c1[i]-p1[i])/m1
  soma2 = abs(c2[i]-p2[i])/m2
  soma3 = abs(c3[i]-p3[i])/m3

  somatorio1.append(soma1)
  somatorio2.append(soma2)
  somatorio3.append(soma3)


MAPE1 = (sum(somatorio1)/len(c1))*100
MAPE2 = (sum(somatorio2)/len(c2))*100
MAPE3 = (sum(somatorio3)/len(c3))*100


print(f"MAPE Average Packet Size: {MAPE1}\nMAPE Fwd Header Length: {MAPE2}\nMAPE Flow Bytes/s: {MAPE3}")

MAPE Average Packet Size: 2037.71712602556
MAPE Fwd Header Length: 2102.8513808918524
MAPE Flow Bytes/s: 2060.2344870098686


$$N M S E=\frac{1}{\sigma^2} \frac{1}{N} \sum_{t=1}^N\left(X_t-\hat{X}_t\right)^2$$

Onde: 

$\sigma^2$ = desvio padrão ao quadrado

$N$= tamanho do vetor

$X_t$ = valor real

$\hat{X}$ = valor esperado

In [None]:
dqm1 = []
dqm2 = []
dqm3 = []

for i in range(len(c1)):
  dqm1.append(pow((c1[i]-m1),(2)))
  dqm2.append(pow((c2[i]-m2),(2)))
  dqm3.append(pow((c3[i]-m3),(2)))

dp1 = sum(dqm1)/len(c1)
dp2 = sum(dqm2)/len(c2)
dp3 = sum(dqm3)/len(c3)

dqp1 = []
dqp2 = []
dqp3 = []

for i in range(len(c1)-10):
  dqp1.append(pow((c1[i]-p1[i]),(2)))
  dqp2.append(pow((c2[i]-p2[i]),(2)))
  dqp3.append(pow((c3[i]-p3[i]),(2)))

NMSE1 = sum(dqp1)/(len(c1)*dp1)
NMSE2 = sum(dqp2)/(len(c1)*dp2)
NMSE3 = sum(dqp3)/(len(c1)*dp3)  


print(f"NMSE Average Packet Size: {NMSE1}\nNMSE Fwd Header Length: {NMSE2}\nNMSE Flow Bytes/s: {NMSE3}")

NMSE Average Packet Size: 579.2721649511924
NMSE Fwd Header Length: 572.2290410349236
NMSE Flow Bytes/s: 576.7504666869852


# Análise dados

In [None]:
print("Dado 1 || Predição 1|| Dado 2 || Predição 2 || Dado 3 || Predição 3 ")
for i in range (20):
  print(f"{c1[i]:.2f}   ||   {p1[i]:.2f}     || {c2[i]:.2f}   ||   {p2[i]:.2f}    || {c3[i]:.2f}   ||  {p3[i]:.2f}   ")

Dado 1 || Predição 1|| Dado 2 || Predição 2 || Dado 3 || Predição 3 
0.24   ||   4.87     || 0.20   ||   4.14    || 0.22   ||  4.65   
0.27   ||   4.82     || 0.27   ||   4.19    || 0.27   ||  4.63   
0.16   ||   4.69     || 0.16   ||   4.14    || 0.16   ||  4.53   
0.24   ||   4.70     || 0.20   ||   4.15    || 0.22   ||  4.52   
0.24   ||   4.75     || 0.18   ||   4.15    || 0.22   ||  4.55   
0.16   ||   4.61     || 0.16   ||   4.08    || 0.16   ||  4.43   
0.22   ||   4.56     || 0.22   ||   4.14    || 0.22   ||  4.42   
0.24   ||   4.60     || 0.20   ||   4.17    || 0.22   ||  4.45   
1.00   ||   6.18     || 1.00   ||   5.88    || 1.00   ||  6.07   
1.00   ||   7.88     || 1.00   ||   7.70    || 1.00   ||  7.82   
1.00   ||   9.44     || 1.00   ||   9.44    || 1.00   ||  9.44   
1.00   ||   10.81     || 1.00   ||   10.84    || 1.00   ||  10.82   
0.14   ||   10.86     || 0.14   ||   10.84    || 0.14   ||  10.84   
0.14   ||   10.70     || 0.14   ||   10.75    || 0.14   ||  10.72  

## Através da análise foi considerado anomalia se:

$$ \hat{X}_{t+1}  - X_t > \hat{X}_t$$

 

In [None]:

anop = 1
anor = 0

for i in range(len(c1) - janela ):
  
  if (p3[i+1] - c3[i+1]) > p3[i] :

     if a[i] == 'UDP':

      anor += 1

     anop += 1
  
  
print(f"Anomalias prevista: {anop}\nAnomalias reais: {anor}\nTaxa de acerto: {(anor/anop)*100:.2f}%")

Anomalias prevista: 34888
Anomalias reais: 14093
Taxa de acerto: 40.39%


|--|MAPE|NMSE|Taxa de acerto|
|--|--|--|--|
|Average Packet Size|2037.71|579.27|40.42%|
|Fwd Header Length|2102.85|572.22|40.37%|
|Flow Bytes/s|2060.23|576.75|40.39%|