# Anomaly detection for a datacenter

You have collected logs from virtual machines (VMs) running in datacenters; the dataset is real data from a number of real VMs. The logs record the CPU load and memory (RAM) load of every VM. Your goal is to build a system that detects abnormal VM behavior so that the system administrator can notice it and take action.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from matplotlib.colors import LogNorm

In [None]:
from matplotlib.patches import Ellipse

In [None]:
%matplotlib inline

## Setting up the environment

The log file you are going to work with is `system-load.csv`. You may want to open it in a text editor or Excel to examine its structure.

In [None]:
input_filename = "data/datacenter/system-load.csv"

## Loading the data

In [None]:
df_load = pd.read_csv(input_filename)
type(df_load)

In [None]:
X = df_load.values

Let us take a look at the dataset.

In [None]:
df_load.head(5)

In [None]:
X[:5]

## Training Gaussian mixture model

Train a Gaussian mixture model on the datacenter data.

In [None]:
clf = mixture.GaussianMixture(covariance_type="full")

clf.fit(X)
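To make clearer what the fitted model contains, here is a minimal sketch on synthetic data (the array `X_demo` and its parameters are assumptions for illustration, not the real `system-load.csv` values). With the default single component, `score_samples` should agree with the log-density of an ordinary multivariate Gaussian built from the fitted mean and covariance.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the (cpu_load, ram_usage) columns; not the real data.
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal(
    [0.5, 0.4], [[0.010, 0.004], [0.004, 0.008]], size=1000
)

# Default n_components is 1, as in the cell above.
gmm = GaussianMixture(covariance_type="full", random_state=0).fit(X_demo)

print("weights:", gmm.weights_)          # mixing proportions, one per component
print("means:", gmm.means_)              # shape (n_components, n_features)
print("covariances:", gmm.covariances_)  # shape (n_components, n_features, n_features)

# score_samples returns the log-density of each point under the fitted mixture.
log_density = gmm.score_samples(X_demo)

# For a single component this is the log-pdf of one multivariate normal.
reference = multivariate_normal(gmm.means_[0], gmm.covariances_[0]).logpdf(X_demo)
print("matches a single Gaussian:", np.allclose(log_density, reference))
```

Low values of `score_samples` therefore mean "unlikely under the fitted Gaussian", which is what the thresholding step relies on.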

## Setting up model parameters

Set up the number of Gaussians and the abnormality threshold. Note that there are no labels for the points in this dataset: we do not know which servers, if any, are behaving abnormally. Think about how you would decide on the threshold. Visualizing the results could help.

In [None]:
scores = clf.score_samples(X)  # per-sample log-likelihood under the fitted model
plt.plot(np.sort(scores))
threshold = np.quantile(scores, .05)  # flag the 5% least likely points

In [None]:
# A boolean mask keeps the selected rows aligned with df_load
outliers = df_load[scores <= threshold]
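A quantile threshold always flags roughly the chosen fraction of points, whether or not anything is actually abnormal, so the fraction itself is a judgment call. The sketch below illustrates this on synthetic data with ten hand-injected anomalies; the data, the injected points, and the candidate quantiles are all assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# 990 "normal" points around (0.5, 0.5) plus 10 injected anomalies near (0.95, 0.95).
normal_points = rng.normal(loc=0.5, scale=0.05, size=(990, 2))
injected = rng.uniform(low=0.9, high=1.0, size=(10, 2))
X_demo = np.vstack([normal_points, injected])  # injected points are the last 10 rows

gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X_demo)
scores = gmm.score_samples(X_demo)

# Compare how many points each candidate quantile flags.
for q in (0.01, 0.05, 0.10):
    threshold = np.quantile(scores, q)
    flagged = scores <= threshold
    caught = flagged[-10:].sum()
    print(f"q={q:.2f}: flagged {flagged.sum()} points, {caught} of 10 injected anomalies")
```

On unlabeled data there is no injected ground truth to count, so a plot of the sorted scores (as in the cell above) is one way to look for a visible drop that suggests where to cut.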


##### Model parameters:
The default model parameters are a reasonable starting point, but it is important to analyze the data and estimate the number of components.

In [None]:
plt.scatter(df_load["cpu_load"], df_load["ram_usage"], c="b")

- `n_components`: the number of mixture components to fit. Since the data forms roughly a single cluster, `n_components` can stay at its default value of 1.
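The scatter plot is one way to judge the number of components; a numeric check is to compare the Bayesian information criterion (BIC) across several values of `n_components` and prefer the lowest. The sketch below uses synthetic data that deliberately contains two clusters (an assumption for illustration only), so it shows what a multi-cluster result looks like. If the real data is indeed a single cluster, the lowest BIC would be expected at 1.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters; not the real datacenter data.
cluster_a = rng.normal(loc=[0.2, 0.3], scale=0.05, size=(300, 2))
cluster_b = rng.normal(loc=[0.7, 0.8], scale=0.05, size=(300, 2))
X_demo = np.vstack([cluster_a, cluster_b])

# Fit one model per candidate component count and record its BIC (lower is better).
bic_by_k = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X_demo)
    bic_by_k[k] = gmm.bic(X_demo)

best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k)
print("lowest BIC at n_components =", best_k)
```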

## Plotting the results

Visualize all the points from the dataset and density estimation of your model over them. Draw all abnormal points (falling below the threshold) in red.
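One possible way to show the density estimate is to evaluate `score_samples` on a regular grid and draw filled contours under the scatter plot. This is a sketch on synthetic data; `X_demo`, the grid resolution, and the 5% quantile are assumptions for the example, and the axis labels simply reuse the column names from the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the (cpu_load, ram_usage) columns; not the real data.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=[0.5, 0.4], scale=[0.10, 0.08], size=(500, 2))

gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X_demo)
scores = gmm.score_samples(X_demo)
is_outlier = scores <= np.quantile(scores, 0.05)

# Evaluate the model's log-density on a grid covering the data.
x_min, y_min = X_demo.min(axis=0) - 0.1
x_max, y_max = X_demo.max(axis=0) + 0.1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
log_density = gmm.score_samples(grid).reshape(xx.shape)

fig, ax = plt.subplots()
contours = ax.contourf(xx, yy, log_density, levels=20, cmap="viridis", alpha=0.6)
fig.colorbar(contours, ax=ax, label="log-density")
ax.scatter(X_demo[~is_outlier, 0], X_demo[~is_outlier, 1], s=5, c="b")
ax.scatter(X_demo[is_outlier, 0], X_demo[is_outlier, 1], s=8, c="r")  # below threshold
ax.set_xlabel("cpu_load")
ax.set_ylabel("ram_usage")
plt.show()
```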

In [None]:
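def make_ellipses(gmm, ax):
    """Draw the covariance ellipse of the first mixture component on ax."""
    covariances = gmm.covariances_[0][:2, :2]
    # eigh returns eigenvalues in ascending order; eigenvectors are the columns of w
    v, w = np.linalg.eigh(covariances)
    u = w[:, 0] / np.linalg.norm(w[:, 0])
    angle = np.arctan2(u[1], u[0])
    angle = 180 * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)  # axis lengths derived from the eigenvalues
    # angle is passed by keyword; recent matplotlib releases do not accept it positionally
    ell = Ellipse(gmm.means_[0, :2], v[0], v[1], angle=180 + angle, color="g")
    ell.set_clip_box(ax.bbox)
    ell.set_alpha(0.5)
    ax.add_artist(ell)
    ax.set_aspect('equal', 'datalim')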
    

In [None]:
fig, ax = plt.subplots()  # explicit axes, so that `ax` exists for make_ellipses
ax.scatter(df_load["cpu_load"], df_load["ram_usage"], c="b")
ax.scatter(outliers["cpu_load"], outliers["ram_usage"], c="r")  # abnormal points

# make_ellipses(clf, ax)
plt.show()

My attempts to draw the ellipse were unsuccessful. Sorry =(
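For reference, here is a minimal sketch of one way to draw a covariance ellipse for a single-component model. It creates the axes explicitly with `plt.subplots()`, takes the eigenvectors from the columns of the `eigh` result, and passes `angle` to `Ellipse` by keyword. The data is synthetic and the 2-sigma ellipse size is an arbitrary choice; both are assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the (cpu_load, ram_usage) columns; not the real data.
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal(
    [0.5, 0.4], [[0.010, 0.006], [0.006, 0.008]], size=500
)

gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X_demo)

# Eigen-decomposition of the 2x2 covariance: eigenvalues ascending, eigenvectors in columns.
eigvals, eigvecs = np.linalg.eigh(gmm.covariances_[0])
major = eigvecs[:, -1]  # direction of largest variance
angle_deg = np.degrees(np.arctan2(major[1], major[0]))

# Full axis lengths of a 2-sigma ellipse: 4 * sqrt(eigenvalue), major axis first.
width, height = 4.0 * np.sqrt(eigvals[::-1])

fig, ax = plt.subplots()
ax.scatter(X_demo[:, 0], X_demo[:, 1], s=5, c="b")
ell = Ellipse(tuple(gmm.means_[0]), width, height, angle=angle_deg, color="g", alpha=0.5)
ax.add_patch(ell)
ax.set_aspect("equal", "datalim")
plt.show()
```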