In [1]:
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

- Isolation Forest is an unsupervised machine learning algorithm for anomaly detection. It identifies anomalies by isolating outliers in the data.



- Isolation Forest explicitly identifies anomalies instead of profiling normal data points.




- Isolation Forest is based on decision trees. Each tree isolates observations by randomly selecting a feature from the given set of features and then randomly selecting a split value between the minimum and maximum values of that feature. Because anomalies are few and differ from the rest of the data, this random partitioning separates them in fewer splits, so they end up closer to the root of the tree. The path length of an observation is the number of edges it passes in the tree going from the root to the terminal node; a short average path length across the trees distinguishes anomalous data points from the rest of the data.
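The random partitioning described above can be illustrated with a small NumPy sketch. It is a simplification and not scikit-learn's implementation: it uses synthetic data, skips subsampling, and follows only the branch that contains one chosen point. The helper name `isolation_depth` and all data values are assumptions made for illustration.

```python
import numpy as np

def isolation_depth(X, point, rng, max_depth=100):
    """Count the random splits needed to isolate `point` (a row of X)."""
    current = X
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo = current.min(axis=0)
        hi = current.max(axis=0)
        splittable = np.flatnonzero(hi > lo)
        if splittable.size == 0:
            break  # remaining rows are identical and cannot be separated
        feature = rng.choice(splittable)
        split = rng.uniform(lo[feature], hi[feature])
        # Keep only the side of the split that contains `point`.
        if point[feature] < split:
            current = current[current[:, feature] < split]
        else:
            current = current[current[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(256, 2))  # "normal" points
outlier = np.array([8.0, 8.0])                           # far from the cluster
X = np.vstack([cluster, outlier])

n_trees = 200
outlier_depth = np.mean([isolation_depth(X, outlier, rng) for _ in range(n_trees)])
inlier_depth = np.mean([isolation_depth(X, cluster[0], rng) for _ in range(n_trees)])
print(f"mean depth, outlier: {outlier_depth:.1f}")
print(f"mean depth, inlier:  {inlier_depth:.1f}")
```

The point placed far from the cluster should need noticeably fewer splits on average than a point inside the cluster, which is the property the anomaly score is built on.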

As with other outlier detection methods, an anomaly score is required for decision making. Each observation is given an anomaly score, and the following decisions can be made on its basis:
- A score close to 1 indicates an anomaly
- A score much smaller than 0.5 indicates a normal observation
- If all scores are close to 0.5, the sample does not appear to contain clearly distinct anomalies
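These thresholds refer to the score as defined in the original Isolation Forest paper. scikit-learn reports scores on a different scale: `score_samples` returns the opposite of the paper's score, and `decision_function` subtracts the fitted `offset_` from it, so negative values mark anomalies. The sketch below checks that relationship on synthetic data; the data and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic data with one planted outlier as the last row.
X = np.vstack([rng.normal(size=(500, 2)), [[6.0, 6.0]]])

forest = IsolationForest(n_estimators=50, contamination=0.05, random_state=0).fit(X)

paper_score = -forest.score_samples(X)  # score on the paper's scale, in (0, 1]
decision = forest.decision_function(X)  # score_samples(X) - offset_; negative means anomaly
labels = forest.predict(X)              # -1 for anomalies, 1 for normal observations

print(paper_score[-1], decision[-1], labels[-1])
```

This is why the `scores` column produced later with `decision_function` contains small positive and negative values instead of values between 0 and 1.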

In [9]:
df = pd.read_csv('df_what_if_serbia_5m.csv').drop(columns='Unnamed: 0')
#df = df_orig.drop(columns='period_start_time')

In [12]:
df['cell_id'] = df['cell_id'].astype(str)

In [13]:
df[df['cell_id'] == '0']


Unnamed: 0,period_start_time,cell_id,dlchbw,w_end_user_avg_thp,w_dl_prb_utilization,w_pdcp_sdu_volume_dl,avg_act_ues_dl,freq_band
0,2020-03-01,0,100.0,15.900526,5.257143,1329.986571,0.13,3
1,2020-03-03,0,100.0,17.177806,5.328571,1384.068429,0.24,3
2,2020-03-04,0,100.0,17.664947,5.171429,1300.318857,0.08,3
3,2020-03-12,0,100.0,20.373883,5.557143,1382.601000,0.11,3
4,2020-03-27,0,100.0,25.490997,2.242857,472.384857,0.08,3
...,...,...,...,...,...,...,...,...
1177558,2020-04-23,0,100.0,30.305593,3.371429,739.510571,0.07,3
1177561,2020-05-05,0,100.0,23.818647,2.385714,491.505286,0.11,3
1177562,2020-05-09,0,100.0,27.830378,3.985714,894.241143,0.11,3
1177564,2020-05-11,0,100.0,28.141841,4.342857,997.973000,0.11,3


In [3]:
features = ['dlchbw', 'w_end_user_avg_thp',
       'w_dl_prb_utilization', 'w_pdcp_sdu_volume_dl', 'avg_act_ues_dl',
       'freq_band']

Parameters for Isolation Forest:

- Number of estimators: n_estimators refers to the number of base estimators or trees in the ensemble, i.e. the number of trees that will get built in the forest. This is an integer parameter and is optional. The default value is 100.

- Max samples: max_samples is the number of samples drawn to train each base estimator. It can be given as an integer (a number of samples) or a float (a fraction of the samples). If max_samples is more than the number of samples provided, all samples will be used for all trees. The default value is 'auto', which sets max_samples=min(256, n_samples).

- Contamination: contamination is the expected proportion of outliers in the data set, and the algorithm is quite sensitive to it. It is used when fitting to define the threshold on the scores of the samples. The default value is 'auto', in which case the threshold is determined as in the original Isolation Forest paper.

- Max features: max_features is the number (integer) or fraction (float) of features drawn from the data set to train each base estimator or tree. The default value is 1.0, which means that every tree is trained on all features.
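A minimal sketch of how these parameters behave, assuming synthetic Gaussian data in place of the real feature matrix: the share of training rows flagged as anomalies should track `contamination` closely, and with `max_samples='auto'` each tree is trained on min(256, n_samples) rows. The `flagged` dictionary and `random_state=0` are choices made only for this illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))  # synthetic stand-in for the feature matrix

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    forest = IsolationForest(
        n_estimators=50,
        max_samples='auto',
        contamination=contamination,
        max_features=1.0,
        random_state=0,
    ).fit(X)
    # Fraction of training rows labelled -1 (anomaly) at this contamination.
    flagged[contamination] = float(np.mean(forest.predict(X) == -1))
    print(contamination, forest.max_samples_, flagged[contamination])
```

Because the threshold is set from the requested proportion, a fixed contamination flags roughly that share of rows whether or not the data actually contains that many anomalies.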

In [4]:
model = IsolationForest(n_estimators=50, max_samples='auto', contamination=0.05, max_features=1.0)
model.fit(df[features])

IsolationForest(contamination=0.05, n_estimators=50)

In [5]:
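# decision_function: negative scores correspond to anomalies
df['scores'] = model.decision_function(df[features])
# predict: -1 marks an anomaly, 1 marks a normal observation
df['anomaly'] = model.predict(df[features])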
df.head(20)

Unnamed: 0,period_start_time,cell_id,dlchbw,w_end_user_avg_thp,w_dl_prb_utilization,w_pdcp_sdu_volume_dl,avg_act_ues_dl,freq_band,scores,anomaly
0,2020-03-01,0,100.0,15.900526,5.257143,1329.986571,0.13,3,0.084673,1
1,2020-03-03,0,100.0,17.177806,5.328571,1384.068429,0.24,3,0.096355,1
2,2020-03-04,0,100.0,17.664947,5.171429,1300.318857,0.08,3,0.086425,1
3,2020-03-12,0,100.0,20.373883,5.557143,1382.601,0.11,3,0.109553,1
4,2020-03-27,0,100.0,25.490997,2.242857,472.384857,0.08,3,0.07008,1
5,2020-04-03,0,100.0,34.83443,2.95,1438.687,0.04,3,-0.01532,-1
6,2020-04-07,0,100.0,39.489579,3.483333,1577.689714,0.06,3,-0.033089,-1
7,2020-04-16,0,100.0,25.233737,2.628571,637.203143,0.05,3,0.074199,1
8,2020-04-29,0,100.0,30.352474,3.257143,820.757429,0.11,3,0.036413,1
9,2020-05-02,0,100.0,22.990221,2.071429,449.05,0.07,3,0.056638,1


In [6]:
anomaly = df.loc[df['anomaly'] == -1]
anomaly_index = list(anomaly.index)

In [7]:
anomaly.shape

(65220, 10)

In [8]:
anomaly.to_csv('whatif_with_outliers.csv')

In [3]:
def load(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)
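The artifact loaded in the next cells is a dictionary that keeps the fitted estimator under a 'model' key next to its metadata. How that file was written is not shown in this notebook, so the sketch below is only an assumed round trip: `save` is a hypothetical counterpart to `load`, the dictionary is reduced to two keys, and the data is synthetic.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import IsolationForest

def save(obj, name):
    """Hypothetical counterpart to load(): pickle `obj` to '<name>.pkl'."""
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f)

def load(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # synthetic data

# Reduced version of the metadata dictionary; only two keys are kept here.
bundle = {
    'modelID': 'IsolationForest',
    'model': IsolationForest(n_estimators=50, contamination=0.05, random_state=0).fit(X),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example_IsolationForest')
    save(bundle, path)
    restored = load(path)

# The restored estimator should give the same labels as the original.
same = np.array_equal(restored['model'].predict(X), bundle['model'].predict(X))
print(restored['modelID'], same)
```

Pickled estimators are tied to the scikit-learn version that produced them and should only be loaded from trusted sources, since unpickling can execute arbitrary code.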

In [4]:
model = load('/Users/marijamiljkovic/Desktop/outliers-detection-data/playground/marija_OutliersDetection/bb85c12e-e600-4304-8092-6d6d891c82d6/bb85c12e-e600-4304-8092-6d6d891c82d6_IsolationForest')



In [5]:
model

{'modelVersion': '1.0',
 'targetName': 'is_outlier',
 'targetType': 'int',
 'inputFields': [{'FieldName': 'Unnamed: 0', 'FieldType': dtype('int64')},
  {'FieldName': 'period_start_time', 'FieldType': dtype('O')},
  {'FieldName': 'cell_id', 'FieldType': dtype('int64')},
  {'FieldName': 'dlchbw', 'FieldType': dtype('float64')},
  {'FieldName': 'w_end_user_avg_thp', 'FieldType': dtype('float64')},
  {'FieldName': 'w_dl_prb_utilization', 'FieldType': dtype('float64')},
  {'FieldName': 'w_pdcp_sdu_volume_dl', 'FieldType': dtype('float64')},
  {'FieldName': 'avg_act_ues_dl', 'FieldType': dtype('float64')},
  {'FieldName': 'freq_band', 'FieldType': dtype('int64')},
  {'FieldName': 'scores', 'FieldType': dtype('float64')},
  {'FieldName': 'is_outlier', 'FieldType': dtype('int64')}],
 'outputFields': [{'FieldName': 'is_outlier', 'FieldType': 'int'}],
 'modelType': 'outliers_detection',
 'modelID': 'IsolationForest',
 'model': IsolationForest(behaviour='old', contamination=0.05, max_samples=1.0,