# Rule Extraction for Unsupervised Outlier Detection

This notebook shows how to use a library that wraps scikit-learn's unsupervised outlier detection algorithm, OneClassSVM [1], and extracts rules from it that humans can read, so that users can easily understand why a specific data point is labeled as an outlier. The rules are obtained with a method called SVM+Prototypes [2], as described in [3]. To demonstrate its capabilities, the outlier analysis is applied to the Student Performance dataset [4].

In [1]:
# Libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import sys
from lib.unsupervised_rules import ocsvm_rule_extractor

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Dataset
dataset_mat = pd.read_csv('dataset/student-mat.csv', sep=';')
dataset_mat.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [5]:
# Use a subset of columns mixing numerical and categorical features
# (.copy() avoids modifying a view of the original DataFrame later on)
dataset_mat = dataset_mat[["studytime", "G3", "sex", "school"]].copy()
dataset_mat.head()

Unnamed: 0,studytime,G3,sex,school
0,2,6,F,GP
1,2,6,F,GP
2,2,10,F,GP
3,3,15,F,GP
4,2,10,F,GP


In [6]:
# Encode categorical columns
obj_df = dataset_mat.select_dtypes(include=['object']).copy()  # select the categorical (object) columns
print(obj_df.columns)

lb_encoder = LabelEncoder()

for col in obj_df.columns:
    dataset_mat[col] = lb_encoder.fit_transform(dataset_mat[col])
    
dataset_mat.head()

Index(['sex', 'school'], dtype='object')


Unnamed: 0,studytime,G3,sex,school
0,2,6,0,0
1,2,6,0,0
2,2,10,0,0
3,3,15,0,0
4,2,10,0,0
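
The integer codes above are only meaningful if the original categories can be recovered when reading the extracted rules. The following is a minimal sketch, on a toy DataFrame rather than the student dataset, of keeping one `LabelEncoder` per column and building a readable mapping from its `classes_` attribute. The names `encoders` and `mappings` are illustrative and not part of the library used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the categorical columns (illustrative values only)
df = pd.DataFrame({"sex": ["F", "M", "F"], "school": ["GP", "MS", "GP"]})

encoders = {}  # hypothetical helper: one fitted encoder per column
for col in ["sex", "school"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

# LabelEncoder sorts the classes, so the position in classes_ is the integer code
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)
```

In the cell above a single encoder is refitted for every column, so after the loop it only remembers the classes of the last column. Keeping one encoder per column, as in this sketch, preserves every mapping.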


In [7]:
# Lists of columns by type
numerical_cols = ["studytime","G3"]
categorical_cols = [x for x in list(dataset_mat.columns) if x not in numerical_cols]

In [8]:
# Hyperparameters to use
dct_params = {'nu':0.1, 'kernel':"rbf", 'gamma':0.1}
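
The keys in `dct_params` match constructor arguments of scikit-learn's `OneClassSVM`: `nu` is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, `kernel` selects the kernel function, and `gamma` is the RBF kernel coefficient. The sketch below fits a `OneClassSVM` directly with the same values on synthetic data, to illustrate what they control. It assumes the rule extractor forwards these parameters unchanged, which this notebook does not confirm.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic 2-D data, used here only for illustration (not the student dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

dct_params = {"nu": 0.1, "kernel": "rbf", "gamma": 0.1}
model = OneClassSVM(**dct_params).fit(X)

# predict returns +1 for inliers and -1 for outliers
labels = model.predict(X)
outlier_fraction = float(np.mean(labels == -1))
print(f"Fraction of training points flagged as outliers: {outlier_fraction:.2f}")
```

With `nu=0.1`, roughly one training point in ten can be expected to fall outside the learned boundary, although the exact fraction depends on the data and the kernel.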

In [9]:
_, df_result = ocsvm_rule_extractor(dataset_mat, numerical_cols, categorical_cols, dct_params)

Initialization method and algorithm are deterministic. Setting n_init to 1.
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 1, iteration: 1/100, moves: 0, ncost: 588.5942150036827
NOT anomaly...
Rule Nº 1: IF sex = 0 AND school = 0 AND studytime <= 4 AND G3 <= 15 AND studytime >= 1 AND G3 >= 8 
Rule Nº 2: IF sex = 0 AND school = 1 AND studytime <= 2 AND G3 <= 0 AND studytime >= 2 AND G3 >= 0 
Rule Nº 3: IF sex = 1 AND school = 0 AND studytime <= 4 AND G3 <= 13 AND studytime >= 2 AND G3 >= 8 


In [10]:
# Rules obtained
df_result.head()

Unnamed: 0,sex,school,studytime_max,G3_max,studytime_min,G3_min
0,0,0,4,15,1,8
1,0,1,2,0,2,0
2,1,0,4,13,2,8
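
Each row of `df_result` can be read as a hyper-rectangle: fixed values for the categorical features and a `[min, max]` interval for each numerical feature. Since the rules are printed under "NOT anomaly", a point that satisfies none of them would be treated as an outlier under this reading. The sketch below applies the three rules above to new points. The helpers `matches_rule` and `is_covered` are hypothetical and not part of the library.

```python
# Rules copied from the df_result table above, one dict per row
rules = [
    {"sex": 0, "school": 0, "studytime_max": 4, "G3_max": 15, "studytime_min": 1, "G3_min": 8},
    {"sex": 0, "school": 1, "studytime_max": 2, "G3_max": 0, "studytime_min": 2, "G3_min": 0},
    {"sex": 1, "school": 0, "studytime_max": 4, "G3_max": 13, "studytime_min": 2, "G3_min": 8},
]

CATEGORICAL = ("sex", "school")
NUMERICAL = ("studytime", "G3")


def matches_rule(point, rule):
    """Return True if the point lies inside the hyper-rectangle of one rule."""
    same_category = all(point[c] == rule[c] for c in CATEGORICAL)
    in_range = all(
        rule[f"{n}_min"] <= point[n] <= rule[f"{n}_max"] for n in NUMERICAL
    )
    return same_category and in_range


def is_covered(point, rules):
    """Return True if at least one rule covers the point (read as: not an outlier)."""
    return any(matches_rule(point, rule) for rule in rules)


student = {"sex": 0, "school": 0, "studytime": 2, "G3": 10}
print(is_covered(student, rules))
```

A point such as `{"sex": 1, "school": 1, "studytime": 2, "G3": 10}` matches no rule, because no rule was extracted for that combination of `sex` and `school`.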


### References

[1] scikit-learn documentation: `sklearn.svm.OneClassSVM`. https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html

[2] H. Núñez, C. Angulo, and A. Català. Rule extraction from support vector machines. In *European Symposium on Artificial Neural Networks (ESANN)*, pages 107–112, 2002.

[3] D. Martens, J. Huysmans, R. Setiono, J. Vanthienen, and B. Baesens. Rule Extraction from Support Vector Machines: An Overview of Issues and Application in Credit Scoring. 2008. https://pdfs.semanticscholar.org/f4d6/25688d0bd8b73cbc61454c1b701385bea214.pdf

[4] UCI Machine Learning Repository: Student Performance dataset. https://archive.ics.uci.edu/ml/datasets/student+performance