### Exploring the impact of clustering on the quality of SMOTE preprocessing. Comparative analysis
##### Maksym Malicki, Jacek Glapiński
###### Wrocław University of Technology
In this notebook we present a comparative analysis of the impact of clustering using various methods on the quality of SMOTE preprocessing.

#### load_dataset()
This method allows us to load datasets listed in the paper.

In [1]:
import numpy as np

def load_dataset(file_path):
    data = []
    labels = []

    with open(file_path, 'r') as f:
        for line in f:
            if line.startswith('@'):
                continue
            line_data = line.strip().split(',')
            sample_class = line_data[-1].strip().lower().replace(" ", "")
            label = 1 if sample_class == 'positive' else 0
            converted_data = []
            for x in line_data[:-1]:
                try:
                    converted_data.append(float(x))
                except ValueError:
                    converted_data.append(ord(x))
            data.append(converted_data)
            labels.append(label)
    X = np.array(data)
    y = np.array(labels)

    return X, y

#### Implementacja SMOTE_Medium i testy

In [2]:
#https://medium.com/@corymaklin/synthetic-minority-over-sampling-technique-smote-7d419696b88c

from random import randrange, uniform
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score

def SMOTE_ByMedium(sample: np.array, N: int, k: int) -> np.array:
    
    T, num_attrs = sample.shape
    
    # If N is less than 100%, randomize the minority class samples as only a random percent of them will be SMOTEd
    if N < 100:
        T = round(N / 100 * T)
        N = 100
    # The amount of SMOTE is assumed to be in integral multiples of 100
    N = int(N / 100)

    synthetic = np.zeros([T * N, num_attrs])
    new_index = 0
    nbrs = NearestNeighbors(n_neighbors=k+1).fit(sample.values)
    def populate(N, i, nnarray):
        
        nonlocal new_index
        nonlocal synthetic
        nonlocal sample
        while N != 0:
            nn = randrange(1, k+1)
            for attr in range(num_attrs):
                dif = sample.iloc[nnarray[nn]][attr] - sample.iloc[i][attr]
                gap = uniform(0, 1)
                synthetic[new_index][attr] = sample.iloc[i][attr] + gap * dif
            new_index += 1
            N = N - 1
    
    for i in range(T):
        nnarray = nbrs.kneighbors(sample.iloc[i].values.reshape(1, -1), return_distance=False)[0]
        populate(N, i, nnarray)
    
    return synthetic




In [3]:
#Smote test
# tutaj jest tymczasowy test Smote żeby ogarnąć poprawne liczebności
import plotly.express as px
temp_Cluter = np.array([[1,3,2,4,6,2,1,3,6,4,5,1,5,2,6,8,7,9,5,8,6,8,7,9,5,7,4,5,7],
                        [3,1,2,6,2,5,4,3,2,1,3,3,2,1,6,5,7,6,4,8,9,5,8,4,7,5,6,8,4]])
temp_Cluter_y = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1])
print('temp_Cluter shape: ',temp_Cluter.shape)
print('temp_Cluter_y shape: ',temp_Cluter_y.shape)
fig = px.scatter(x=temp_Cluter[0],y=temp_Cluter[1],title="Dane wejściowe")
fig.show()

temp_Cluter shape:  (2, 29)
temp_Cluter_y shape:  (29,)


Liczebność zbioru wejściowego:

In [4]:
import math
# teraz określamy ile chcemy próbek dorobić:
# chcemy mieć 100 próbek
desired = 100
print('Chcemy uzyskać 100 próbek wyjściowych')
#obecnie mamy size próbek
size  = temp_Cluter.shape[1]

todoSamples = int(((desired/size)-1 )*100)
print("Przewidujemy, że dosaniemy próbek wyjściowych: ", (todoSamples/100+1)*len(temp_Cluter_y))
print((todoSamples/100+1))
print(math.floor(todoSamples/100)+1)
temp_out = SMOTE_ByMedium(pd.DataFrame(temp_Cluter.T), todoSamples, 5)

Chcemy uzyskać 100 próbek wyjściowych
Przewidujemy, że dosaniemy próbek wyjściowych:  99.76
3.44
3


In [5]:
temp_out.T
print(len(temp_out.T[0]))
print(len(temp_Cluter[0]))
print(len(temp_out.T[0])+len(temp_Cluter[0]))

58
29
87


In [6]:
temp_Cluter_OUT = [[*temp_Cluter[0],*temp_out.T[0]],[*temp_Cluter[1],*temp_out.T[1]]]
print(temp_Cluter_OUT)

[[1, 3, 2, 4, 6, 2, 1, 3, 6, 4, 5, 1, 5, 2, 6, 8, 7, 9, 5, 8, 6, 8, 7, 9, 5, 7, 4, 5, 7, 1.3690754101997085, 1.0, 3.89770276140223, 3.800373379274607, 1.8890510635080628, 2.225904372697848, 4.0, 2.798169950957382, 5.401230265923091, 4.6980457311653065, 1.9389950734052661, 2.468571818753612, 1.4546281065198574, 1.4502140860743626, 4.468999231683339, 2.018339682955472, 5.467581959128726, 5.782061459580324, 3.981690387918577, 3.2407675172573294, 5.1173438553927975, 5.096084672621962, 1.0, 1.0, 5.900672711264272, 5.0, 1.8807399494490578, 2.6113425483387536, 5.380629243554709, 6.623603485610565, 7.285483912315038, 8.047348869371085, 6.124324175462384, 7.0, 8.975435336958418, 8.837205912765139, 4.156594269835232, 5.0, 7.194562308385896, 6.494892767159826, 6.189738023790144, 6.8453178974537705, 7.968047131947961, 8.379574226401937, 5.74341041050232, 7.961859290107166, 8.892901757903177, 7.777589822540757, 5.370320211202975, 4.612648450406457, 7.832193333788991, 7.0, 4.0, 4.0, 5.27438669593740

In [7]:
fig = px.scatter(x= temp_out.T[0], y= temp_out.T[1],title='Wygenerowane')
fig.show()

In [8]:
fig = px.scatter(x= temp_Cluter_OUT[0], y= temp_Cluter_OUT[1], title= 'Wygenerowane + wejściowe o łącznej liczbie równej desired')
fig.show()
len(temp_Cluter_OUT[0])

87

#### Clustering with SMOTE

In [9]:
from sklearn.cluster import KMeans, MeanShift

def oversample_clustered_data(X, y, X_minority, y_minority, X_majority, y_majority, cluster_labeled_data):
    #wygenerowane próbki mniejszościowe wrzucamy tutaj:
    X_generated = []

    # Do oceny ile próbek trzeba będzie wygenerować trzeba przeliczyć różnicę między liczebnością klasy mniejszościowej i większościowej, będzie to potrzebne później w pętli dla każego z klastrów
    majority_minority_difference =  list(y).count(0) - list(y).count(1)
    if(majority_minority_difference < 0):
        print("[ERR] Monority class has grater count to Majority class")
        exit()
    # Do oceny ile próbek będzie trzeba wygenerować potrzebne też będzie informacja o liczbności klasy mniejszościowej
    minority_count = list(y).count(1)

    # Dla każdego z klastrów:
    cluster_labels = np.unique(cluster_labeled_data) #  pobranie info jakie mamy labele klastrów [0,1,...]

    # Wszystkie wygenerowane syntetycznie obiekty trafią tutaj
    syntetic_data = []
    for cluster in cluster_labels:
        # w pierwszej kolejności filtrujemy indeksy zbioru mniejszościowego aby pozostawić tylko indeksy należące do klastra który w danej iteracji pętli analizujemy
        cluster_samples_indices_minority = np.where(cluster_labeled_data == cluster)[0]
        if len(cluster_samples_indices_minority)<4:
            continue
        # ustalamy ile próbek trzeba wygenerować w tym klastrze
        # ustalamy procentowo liczebność (udział) klastra względem całej klasy mniejszościowej
        percentage_of_count = len(cluster_samples_indices_minority)/minority_count
        # zabezpieczenie jeśli coś na tym etapie szłoby bardzo nie tak
        if(percentage_of_count>1 or percentage_of_count<0):
            print("Kurza twarz.. something is no yes! Procenty nie mogą być większe niż 100 ani mniejsze niż 0")
            exit()
        
        #następnie przemnarzamy udział procentowy klastra przez różnicę między zbiorem większościowym i mniejszościowym uzyskując ilość próbek do augmentacji
        num_of_samples_to_generate_in_this_cluster = percentage_of_count * majority_minority_difference

        # Do samej funkcji augmentacji wymagana jest warość procentowa gdzie 100 to 100% określająca ile próbek ma zostać wygenerowanych względem danych wejściowych
        # w tym celu należy podzielić ilość próbek które chcemy wygenerować przez liczbę próbek które posiadamy już
        todoSamples = 100*(num_of_samples_to_generate_in_this_cluster/len(cluster_samples_indices_minority))

        # Teraz musimy niestety przejść z np.array na dataframe bo w ten sposób działa funkcja ale robimy to już w parametrze funkcji
        # Oversampling... totaj k = 5.. jest to hiperparametr który nie wydaje się być istotny w tej implementacji. Jeśli będą większe zmiany trzeba spojrzeć na niego
        temp_out = SMOTE_ByMedium(pd.DataFrame(X_minority[cluster_samples_indices_minority]), todoSamples, 3)
        syntetic_data.append(temp_out)
        
        #Ta implementacja Smote jeśli tworzy więcej niż 100% próbek z próbek które dostała traci informację o części dziesiętnej procentów - to powoduje, że powstaje mniej resamplowanych próbek
        if todoSamples>100:
            todoSamples_2 = todoSamples%100
            temp_out_2 = SMOTE_ByMedium(pd.DataFrame(X_minority[cluster_samples_indices_minority]),todoSamples_2 , 3)
            syntetic_data.append(temp_out_2)
        
        

    #tutaj wymagane jest połączenie wyjścia syntetic_data razem z X i y
    X_resampled = X
    y_resampled  = y

    for clusterOUT in syntetic_data:
        X_resampled = np.block([[X_resampled], [clusterOUT]])
        y_resampled = [*y_resampled,*np.ones(clusterOUT.shape[0])]# jedynki dla tego, że 1 to klasa mniejszościowa
    return X_resampled, y_resampled

def KMeans_SMOTE(X, y, num_clusters):
    # Wybieramy indeksy klasy mniejszościowej i większościowej
    minority_indices = np.where(y == 1)[0] # dla naszych datasetów klasa o labelu 1 zawsze jest mniejszościowa
    majority_indices = np.where(y == 0)[0]

    # Sortujemy Dane i Labele na minority i majority
    X_minority = X[minority_indices]
    y_minority = y[minority_indices]

    X_majority = X[majority_indices]
    y_majority = y[majority_indices]
    
    # Wykonujemy klasteryzację na klasie mniejszościowej
    kmeans_labels_minority = KMeans(n_clusters=num_clusters, random_state=0, n_init="auto").fit_predict(X_minority)

    # Zwrócona wartość to indeksy odwołujące się jedynie do klasy mniejszościowej! trzeba o tym teraz pamiętać.
    return oversample_clustered_data(X, y, X_minority, y_minority, X_majority, y_majority, kmeans_labels_minority)


def MeanShift_SMOTE(X, y):
    # TAK WIEM,TO POWINNA BYĆ FUNKCJA, BO WIĘKSZOŚĆ LINIJEK MEANSHIFT_SMOTE I KMEANS_SMOTE SIĘ POWTARZA ... może kiedyś tu upiększymy
    # Wybieramy indeksy klasy mniejszościowej i większościowej
    minority_indices = np.where(y == 1)[0] # dla naszych datasetów klasa o labelu 1 zawsze jest mniejszościowa
    majority_indices = np.where(y == 0)[0]

    # Sortujemy Dane i Labele na minority i majority
    X_minority = X[minority_indices]
    y_minority = y[minority_indices]

    X_majority = X[majority_indices]
    y_majority = y[majority_indices]
    
    # Wykonujemy klasteryzację na klasie mniejszościowej
    mean_shift_labels_minority = MeanShift().fit_predict(X_minority)
    return oversample_clustered_data(X, y, X_minority, y_minority, X_majority, y_majority, mean_shift_labels_minority)

#### Sprawdzenie który ze zbiorów jest Większościowy

In [10]:
# tymczasowy algorytm do analizy która z klas jest mniejszościowa
def ReturnsMinorityLabel(file_path):
    labels = []
    with open(file_path, 'r') as f:
        for line in f:
            if line.startswith('@'):
                continue
            line_data = line.strip().split(',')
            sample_class = line_data[-1].strip().lower().replace(" ", "")
            label = 1 if sample_class == 'positive' else 0
            labels.append(label)
    countOf0 = labels.count(0)
    countOf1 = labels.count(1)
    print("---------STATS---------") 
    print('countOf0: ',countOf0)
    print('countOf1: ',countOf1)
    print("----------OUT----------") 
    if countOf0 == countOf1:
        print("!!!!!kurcze zbiorki są zbalansowane!!!!!!")
        return(-1)
    if countOf0 > countOf1:
        print("Więcej klasy 0")
        return(0)
    if countOf0 < countOf1:
        print("Więcej klasy 1")
        return(1)

import os

#sprawdzenie czy działa

directories = ['mild-imbalance', 'high-imbalance']
results_of_test = []
for directory in directories:
    print(f"Processing files in directory: {directory}")
    files = os.listdir(directory)
    
    for file_name in files:
        file_path = os.path.join(directory, file_name)
        print(f"File: {file_path}")
        results_of_test.append(ReturnsMinorityLabel(file_path))
    
print(results_of_test)
print(results_of_test.count(1))

Processing files in directory: mild-imbalance
File: mild-imbalance\page-blocks0.dat
---------STATS---------
countOf0:  4913
countOf1:  559
----------OUT----------
Więcej klasy 0
File: mild-imbalance\pima.dat
---------STATS---------
countOf0:  500
countOf1:  268
----------OUT----------
Więcej klasy 0
File: mild-imbalance\segment0.dat
---------STATS---------
countOf0:  1979
countOf1:  329
----------OUT----------
Więcej klasy 0
File: mild-imbalance\vehicle0.dat
---------STATS---------
countOf0:  647
countOf1:  199
----------OUT----------
Więcej klasy 0
File: mild-imbalance\vehicle1.dat
---------STATS---------
countOf0:  629
countOf1:  217
----------OUT----------
Więcej klasy 0
File: mild-imbalance\vehicle2.dat
---------STATS---------
countOf0:  628
countOf1:  218
----------OUT----------
Więcej klasy 0
File: mild-imbalance\vehicle3.dat
---------STATS---------
countOf0:  634
countOf1:  212
----------OUT----------
Więcej klasy 0
File: mild-imbalance\wisconsin.dat
---------STATS---------
coun

#### Experiment for single dataset

In [11]:
from imblearn.over_sampling import SMOTE, RandomOverSampler, BorderlineSMOTE
from sklearn import svm
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import precision_score, recall_score
from imblearn.metrics import specificity_score

def experiment(X, y):
    preprocessings = {
        "KMeansSMOTE": True,
        "MeansShiftSMOTE": True,
        "SMOTE": SMOTE(),
        "ROS": RandomOverSampler(),
        "BorderlineSMOTE": BorderlineSMOTE(),
    }
    classifier = RandomForestClassifier(random_state=42)
    rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1234)
    result = {}
    #wyeksportuję sobie uzyskane datasety poza funkcję żeby zobaczyć jak wyglądają oversamplowane
    export_X = []
    export_y = []
    for key in preprocessings:
        precision_scores = []
        recall_scores = []
        specifity_scores = []
        for train_index, test_index in rskf.split(X,y):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            if key == "KMeansSMOTE":
                X_train_oversampled, y_train_oversampled = KMeans_SMOTE(X_train, y_train, 10)
            elif key == "MeansShiftSMOTE":
                X_train_oversampled, y_train_oversampled = MeanShift_SMOTE(X_train, y_train)
            else:
                X_train_oversampled, y_train_oversampled = preprocessings[key].fit_resample(X_train, y_train)
            export_X = X_train_oversampled
            export_y = y_train_oversampled
            classifier.fit(X_train_oversampled, y_train_oversampled)
            predict = classifier.predict(X_test)
            precision_scores.append(precision_score(y_test, predict))
            recall_scores.append(recall_score(y_test, predict))
            specifity_scores.append(specificity_score(y_test, predict))
        mean_precision_score = np.mean(precision_scores)
        std_precision_score = np.std(precision_scores)
        mean_recall_score = np.mean(recall_scores)
        std_recall_score = np.std(recall_scores)
        mean_specifity_score = np.mean(specifity_scores)
        std_specifity_score = np.std(specifity_scores)
#         print(f"Precission score {key}: %.3f (%.3f)" % (mean_precision_score, std_precision_score))
#         print(f"Specifity score {key}: %.3f (%.3f)" % (mean_specifity_score, std_specifity_score))
#         print(f"Recall score {key}: %.3f (%.3f)" % (mean_recall_score, std_recall_score))
        result[key] = {
            "precission_scores": precision_scores,
            "recall_scores": recall_scores,
            "specifity_scores": specifity_scores,
            "mean_precission_score": mean_precision_score,
            "mean_recall_scores": mean_recall_score,
            "mean_specifity_scores": mean_specifity_score,
        }
    return result, export_X, export_y

#### Running experiments on the datasets

In [12]:
import os

directories = ['mild-imbalance', 'high-imbalance']
results = {}
oversampled_datasets = pd.DataFrame()
for directory in directories:
    print(f"Processing files in directory: {directory}")
    files = os.listdir(directory)
    results = {}
    for file_name in files:
        file_path = os.path.join(directory, file_name)
        print(f"File: {file_path}")
        X, y = load_dataset(file_path)
        experiment_result, oversampled_X, oversampled_y = experiment(X, y)
        oversampled_datasets[file_path] = [oversampled_X,oversampled_y]
        results[file_name] = experiment_result
# print(results)

Processing files in directory: mild-imbalance
File: mild-imbalance\page-blocks0.dat
File: mild-imbalance\pima.dat
File: mild-imbalance\segment0.dat
File: mild-imbalance\vehicle0.dat
File: mild-imbalance\vehicle1.dat
File: mild-imbalance\vehicle2.dat
File: mild-imbalance\vehicle3.dat
File: mild-imbalance\wisconsin.dat
File: mild-imbalance\yeast1.dat
File: mild-imbalance\yeast3.dat
Processing files in directory: high-imbalance
File: high-imbalance\abalone-17_vs_7-8-9-10.dat



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



File: high-imbalance\abalone19.dat



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



File: high-imbalance\kr-vs-k-three_vs_eleven.dat
File: high-imbalance\kr-vs-k-zero-one_vs_draw.dat
File: high-imbalance\shuttle-2_vs_5.dat
File: high-imbalance\shuttle-c0-vs-c4.dat
File: high-imbalance\yeast-0-2-5-6_vs_3-7-8-9.dat
File: high-imbalance\yeast-0-2-5-7-9_vs_3-6-8.dat
File: high-imbalance\yeast4.dat
File: high-imbalance\yeast5.dat


#### Przechowywanie danych z eksperymentu i podgląd datasetów

In [13]:
# To sobie kopujemy i mamy zapisane rezultaty, w tym miejscu można też odczytać te rezultaty żeby bez wykonywana na swoim kompie eksperymentów mieć dostęp do przeliczonych już danych
print(results)
#results = {'abalone-17_vs_7-8-9-10.dat': {'KMeansSMOTE': {'precission_scores': [0.17647058823529413, 0.21428571428571427, 0.19230769230769232, 0.3181818181818182, 0.3, 0.26666666666666666, 0.3125, 0.25, 0.29411764705882354, 0.11764705882352941], 'recall_scores': [0.25, 0.25, 0.4166666666666667, 0.6363636363636364, 0.2727272727272727, 0.3333333333333333, 0.4166666666666667, 0.3333333333333333, 0.45454545454545453, 0.18181818181818182], 'specifity_scores': [0.9692982456140351, 0.9758771929824561, 0.9539473684210527, 0.9671052631578947, 0.9846491228070176, 0.9758771929824561, 0.9758771929824561, 0.9736842105263158, 0.9736842105263158, 0.9671052631578947], 'mean_precission_score': 0.24421771855595384, 'mean_recall_scores': 0.35454545454545455, 'mean_specifity_scores': 0.9717105263157894}, 'MeansShiftSMOTE': {'precission_scores': [0.22727272727272727, 0.3, 0.2916666666666667, 0.25, 0.3076923076923077, 0.21428571428571427, 0.2916666666666667, 0.3333333333333333, 0.25, 0.16666666666666666], 'recall_scores': [0.4166666666666667, 0.5, 0.5833333333333334, 0.5454545454545454, 0.36363636363636365, 0.25, 0.5833333333333334, 0.5, 0.5454545454545454, 0.2727272727272727], 'specifity_scores': [0.9627192982456141, 0.9692982456140351, 0.9627192982456141, 0.9605263157894737, 0.9802631578947368, 0.9758771929824561, 0.9627192982456141, 0.9736842105263158, 0.9605263157894737, 0.9671052631578947], 'mean_precission_score': 0.26325840825840824, 'mean_recall_scores': 0.45606060606060617, 'mean_specifity_scores': 0.9675438596491229}, 'SMOTE': {'precission_scores': [0.3181818181818182, 0.25, 0.2, 0.30434782608695654, 0.4, 0.4, 0.3333333333333333, 0.3333333333333333, 0.25, 0.2], 'recall_scores': [0.5833333333333334, 0.4166666666666667, 0.3333333333333333, 0.6363636363636364, 0.36363636363636365, 0.3333333333333333, 0.5, 0.5, 0.5454545454545454, 0.36363636363636365], 'specifity_scores': [0.9671052631578947, 0.9671052631578947, 0.9649122807017544, 0.9649122807017544, 0.9868421052631579, 0.9868421052631579, 0.9736842105263158, 0.9736842105263158, 0.9605263157894737, 0.9649122807017544], 'mean_precission_score': 0.29891963109354414, 'mean_recall_scores': 0.45757575757575764, 'mean_specifity_scores': 0.9710526315789474}, 'ROS': {'precission_scores': [0.5, 0.4, 0.6666666666666666, 0.4, 0.3333333333333333, 0.0, 0.5, 0.5, 0.3333333333333333, 0.3333333333333333], 'recall_scores': [0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.18181818181818182, 0.09090909090909091, 0.0, 0.08333333333333333, 0.16666666666666666, 0.18181818181818182, 0.18181818181818182], 'specifity_scores': [0.9956140350877193, 0.993421052631579, 0.9978070175438597, 0.993421052631579, 0.9956140350877193, 1.0, 0.9978070175438597, 0.9956140350877193, 0.9912280701754386, 0.9912280701754386], 'mean_precission_score': 0.39666666666666667, 'mean_recall_scores': 0.13863636363636367, 'mean_specifity_scores': 0.9951754385964913}, 'BorderlineSMOTE': {'precission_scores': [0.2727272727272727, 0.3333333333333333, 0.25, 0.35, 0.5714285714285714, 0.4444444444444444, 0.3333333333333333, 0.38461538461538464, 0.29411764705882354, 0.23529411764705882], 'recall_scores': [0.5, 0.4166666666666667, 0.3333333333333333, 0.6363636363636364, 0.36363636363636365, 0.3333333333333333, 0.4166666666666667, 0.4166666666666667, 0.45454545454545453, 0.36363636363636365], 'specifity_scores': [0.9649122807017544, 0.9780701754385965, 0.9736842105263158, 0.9714912280701754, 0.993421052631579, 0.9890350877192983, 0.9780701754385965, 0.9824561403508771, 0.9736842105263158, 0.9714912280701754], 'mean_precission_score': 0.3469294104588222, 'mean_recall_scores': 0.4234848484848485, 'mean_specifity_scores': 0.9776315789473683}}, 'abalone19.dat': {'KMeansSMOTE': {'precission_scores': [0.14285714285714285, 0.09090909090909091, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0], 'recall_scores': [0.16666666666666666, 0.16666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.14285714285714285, 0.0, 0.0], 'specifity_scores': [0.9927623642943305, 0.9879372738238842, 0.9963768115942029, 0.9891304347826086, 0.9951690821256038, 0.9891435464414958, 0.9891435464414958, 0.9951690821256038, 0.998792270531401, 0.9927536231884058], 'mean_precission_score': 0.04337662337662338, 'mean_recall_scores': 0.047619047619047616, 'mean_specifity_scores': 0.9926378035349034}, 'MeansShiftSMOTE': {'precission_scores': [0.037037037037037035, 0.1111111111111111, 0.0, 0.06451612903225806, 0.0, 0.0, 0.0, 0.09090909090909091, 0.10526315789473684, 0.0], 'recall_scores': [0.16666666666666666, 0.3333333333333333, 0.0, 0.2857142857142857, 0.0, 0.0, 0.0, 0.2857142857142857, 0.2857142857142857, 0.0], 'specifity_scores': [0.9686369119420989, 0.9806996381182147, 0.9879227053140096, 0.964975845410628, 0.9770531400966184, 0.9662243667068757, 0.9734620024125452, 0.9758454106280193, 0.9794685990338164, 0.9746376811594203], 'mean_precission_score': 0.040883652598423394, 'mean_recall_scores': 0.13571428571428573, 'mean_specifity_scores': 0.9748926300822246}, 'SMOTE': {'precission_scores': [0.05555555555555555, 0.09090909090909091, 0.0625, 0.0, 0.0, 0.0, 0.0, 0.16666666666666666, 0.11764705882352941, 0.0], 'recall_scores': [0.16666666666666666, 0.3333333333333333, 0.14285714285714285, 0.0, 0.0, 0.0, 0.0, 0.42857142857142855, 0.2857142857142857, 0.0], 'specifity_scores': [0.9794933655006032, 0.9758745476477684, 0.9818840579710145, 0.9806763285024155, 0.9818840579710145, 0.9806996381182147, 0.9746682750301568, 0.9818840579710145, 0.9818840579710145, 0.9746376811594203], 'mean_precission_score': 0.04932783719548426, 'mean_recall_scores': 0.13571428571428573, 'mean_specifity_scores': 0.9793586067842636}, 'ROS': {'precission_scores': [0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0], 'recall_scores': [0.0, 0.0, 0.14285714285714285, 0.0, 0.0, 0.0, 0.0, 0.0, 0.14285714285714285, 0.0], 'specifity_scores': [1.0, 1.0, 0.998792270531401, 0.998792270531401, 1.0, 1.0, 1.0, 1.0, 1.0, 0.998792270531401], 'mean_precission_score': 0.15, 'mean_recall_scores': 0.02857142857142857, 'mean_specifity_scores': 0.9996376811594203}, 'BorderlineSMOTE': {'precission_scores': [0.0, 0.08333333333333333, 0.125, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0], 'recall_scores': [0.0, 0.16666666666666666, 0.14285714285714285, 0.0, 0.0, 0.0, 0.0, 0.0, 0.14285714285714285, 0.0], 'specifity_scores': [0.9939686369119421, 0.9867310012062727, 0.9915458937198067, 0.998792270531401, 0.998792270531401, 0.9975874547647768, 0.9891435464414958, 0.9951690821256038, 0.998792270531401, 0.9963768115942029], 'mean_precission_score': 0.07083333333333333, 'mean_recall_scores': 0.04523809523809524, 'mean_specifity_scores': 0.9946899238358304}}, 'kr-vs-k-three_vs_eleven.dat': {'KMeansSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8823529411764706], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 0.9882352941176471, 'mean_specifity_scores': 1.0}, 'MeansShiftSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8823529411764706], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 0.9882352941176471, 'mean_specifity_scores': 1.0}, 'SMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8823529411764706], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 0.9882352941176471, 'mean_specifity_scores': 1.0}, 'ROS': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8823529411764706], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 0.9882352941176471, 'mean_specifity_scores': 1.0}, 'BorderlineSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8823529411764706], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 0.9882352941176471, 'mean_specifity_scores': 1.0}}, 'kr-vs-k-zero-one_vs_draw.dat': {'KMeansSMOTE': {'precission_scores': [1.0, 0.9473684210526315, 0.9523809523809523, 0.9523809523809523, 1.0, 0.9523809523809523, 0.8947368421052632, 1.0, 1.0, 1.0], 'recall_scores': [0.9523809523809523, 0.8571428571428571, 0.9523809523809523, 0.9523809523809523, 0.9523809523809523, 0.9523809523809523, 0.8095238095238095, 0.9523809523809523, 0.9523809523809523, 0.8571428571428571], 'specifity_scores': [1.0, 0.998211091234347, 0.998211091234347, 0.998211091234347, 1.0, 0.9982142857142857, 0.9964221824686941, 1.0, 1.0, 1.0], 'mean_precission_score': 0.9699248120300752, 'mean_recall_scores': 0.919047619047619, 'mean_specifity_scores': 0.9989269741886021}, 'MeansShiftSMOTE': {'precission_scores': [1.0, 0.9473684210526315, 0.9523809523809523, 0.9545454545454546, 1.0, 0.9545454545454546, 0.8571428571428571, 1.0, 1.0, 1.0], 'recall_scores': [0.9523809523809523, 0.8571428571428571, 0.9523809523809523, 1.0, 0.9523809523809523, 1.0, 0.8571428571428571, 0.9523809523809523, 0.9523809523809523, 0.8571428571428571], 'specifity_scores': [1.0, 0.998211091234347, 0.998211091234347, 0.998211091234347, 1.0, 0.9982142857142857, 0.9946332737030411, 1.0, 1.0, 1.0], 'mean_precission_score': 0.966598313966735, 'mean_recall_scores': 0.9333333333333333, 'mean_specifity_scores': 0.9987480833120369}, 'SMOTE': {'precission_scores': [1.0, 0.9473684210526315, 0.9090909090909091, 0.9545454545454546, 1.0, 0.9523809523809523, 0.8571428571428571, 1.0, 1.0, 1.0], 'recall_scores': [0.9523809523809523, 0.8571428571428571, 0.9523809523809523, 1.0, 0.9523809523809523, 0.9523809523809523, 0.8571428571428571, 0.9523809523809523, 0.9523809523809523, 0.8571428571428571], 'specifity_scores': [1.0, 0.998211091234347, 0.9964221824686941, 0.998211091234347, 1.0, 0.9982142857142857, 0.9946332737030411, 1.0, 1.0, 1.0], 'mean_precission_score': 0.9620528594212804, 'mean_recall_scores': 0.9285714285714286, 'mean_specifity_scores': 0.9985691924354715}, 'ROS': {'precission_scores': [0.9130434782608695, 0.95, 0.9523809523809523, 0.9545454545454546, 1.0, 0.9545454545454546, 0.8571428571428571, 1.0, 1.0, 0.9473684210526315], 'recall_scores': [1.0, 0.9047619047619048, 0.9523809523809523, 1.0, 0.9523809523809523, 1.0, 0.8571428571428571, 0.9523809523809523, 0.9523809523809523, 0.8571428571428571], 'specifity_scores': [0.9964285714285714, 0.998211091234347, 0.998211091234347, 0.998211091234347, 1.0, 0.9982142857142857, 0.9946332737030411, 1.0, 1.0, 0.998211091234347], 'mean_precission_score': 0.9529026617928219, 'mean_recall_scores': 0.9428571428571428, 'mean_specifity_scores': 0.9982120495783287}, 'BorderlineSMOTE': {'precission_scores': [1.0, 0.9473684210526315, 0.9090909090909091, 0.9545454545454546, 0.9090909090909091, 0.9523809523809523, 0.8636363636363636, 1.0, 1.0, 1.0], 'recall_scores': [0.9047619047619048, 0.8571428571428571, 0.9523809523809523, 1.0, 0.9523809523809523, 0.9523809523809523, 0.9047619047619048, 0.9523809523809523, 0.9523809523809523, 0.8571428571428571], 'specifity_scores': [1.0, 0.998211091234347, 0.9964221824686941, 0.998211091234347, 0.9964221824686941, 0.9982142857142857, 0.9946332737030411, 1.0, 1.0, 1.0], 'mean_precission_score': 0.9536113009797219, 'mean_recall_scores': 0.9285714285714286, 'mean_specifity_scores': 0.998211410682341}}, 'shuttle-2_vs_5.dat': {'KMeansSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'MeansShiftSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'SMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'ROS': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'BorderlineSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}}, 'shuttle-c0-vs-c4.dat': {'KMeansSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'MeansShiftSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'SMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'ROS': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}, 'BorderlineSMOTE': {'precission_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'recall_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'specifity_scores': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'mean_precission_score': 1.0, 'mean_recall_scores': 1.0, 'mean_specifity_scores': 1.0}}, 'yeast-0-2-5-6_vs_3-7-8-9.dat': {'KMeansSMOTE': {'precission_scores': [0.5217391304347826, 0.8, 0.6666666666666666, 0.5909090909090909, 0.5789473684210527, 0.6428571428571429, 0.6666666666666666, 0.6666666666666666, 0.75, 0.6111111111111112], 'recall_scores': [0.6, 0.6, 0.5, 0.65, 0.5789473684210527, 0.45, 0.7, 0.7, 0.6, 0.5789473684210527], 'specifity_scores': [0.9392265193370166, 0.9834254143646409, 0.9723756906077348, 0.9502762430939227, 0.9558011049723757, 0.9723756906077348, 0.9613259668508287, 0.9613259668508287, 0.9779005524861878, 0.9613259668508287], 'mean_precission_score': 0.649556384373318, 'mean_recall_scores': 0.5957894736842105, 'mean_specifity_scores': 0.96353591160221}, 'MeansShiftSMOTE': {'precission_scores': [0.5454545454545454, 0.8, 0.6666666666666666, 0.6086956521739131, 0.6470588235294118, 0.75, 0.6666666666666666, 0.6086956521739131, 0.65, 0.55], 'recall_scores': [0.6, 0.6, 0.5, 0.7, 0.5789473684210527, 0.45, 0.6, 0.7, 0.65, 0.5789473684210527], 'specifity_scores': [0.9447513812154696, 0.9834254143646409, 0.9723756906077348, 0.9502762430939227, 0.9668508287292817, 0.9834254143646409, 0.9668508287292817, 0.9502762430939227, 0.9613259668508287, 0.9502762430939227], 'mean_precission_score': 0.6493238006665116, 'mean_recall_scores': 0.5957894736842105, 'mean_specifity_scores': 0.9629834254143648}, 'SMOTE': {'precission_scores': [0.5217391304347826, 0.7647058823529411, 0.6470588235294118, 0.5416666666666666, 0.6111111111111112, 0.5833333333333334, 0.7, 0.7368421052631579, 0.631578947368421, 0.5263157894736842], 'recall_scores': [0.6, 0.65, 0.55, 0.65, 0.5789473684210527, 0.35, 0.7, 0.7, 0.6, 0.5263157894736842], 'specifity_scores': [0.9392265193370166, 0.9779005524861878, 0.9668508287292817, 0.9392265193370166, 0.9613259668508287, 0.9723756906077348, 0.9668508287292817, 0.9723756906077348, 0.9613259668508287, 0.9502762430939227], 'mean_precission_score': 0.626435178953351, 'mean_recall_scores': 0.5905263157894736, 'mean_specifity_scores': 0.9607734806629834}, 'ROS': {'precission_scores': [0.6470588235294118, 0.9090909090909091, 0.75, 0.6666666666666666, 0.6923076923076923, 0.7777777777777778, 0.7333333333333333, 0.8125, 0.7058823529411765, 0.5789473684210527], 'recall_scores': [0.55, 0.5, 0.45, 0.6, 0.47368421052631576, 0.35, 0.55, 0.65, 0.6, 0.5789473684210527], 'specifity_scores': [0.9668508287292817, 0.994475138121547, 0.9834254143646409, 0.9668508287292817, 0.9779005524861878, 0.988950276243094, 0.9779005524861878, 0.9834254143646409, 0.9723756906077348, 0.9558011049723757], 'mean_precission_score': 0.7273564924068021, 'mean_recall_scores': 0.5302631578947368, 'mean_specifity_scores': 0.9767955801104972}, 'BorderlineSMOTE': {'precission_scores': [0.5454545454545454, 0.7647058823529411, 0.6428571428571429, 0.5454545454545454, 0.6, 0.7, 0.6111111111111112, 0.6, 0.6842105263157895, 0.5882352941176471], 'recall_scores': [0.6, 0.65, 0.45, 0.6, 0.47368421052631576, 0.35, 0.55, 0.6, 0.65, 0.5263157894736842], 'specifity_scores': [0.9447513812154696, 0.9779005524861878, 0.9723756906077348, 0.9447513812154696, 0.9668508287292817, 0.9834254143646409, 0.9613259668508287, 0.9558011049723757, 0.9668508287292817, 0.9613259668508287], 'mean_precission_score': 0.6282029047663722, 'mean_recall_scores': 0.545, 'mean_specifity_scores': 0.96353591160221}}, 'yeast-0-2-5-7-9_vs_3-6-8.dat': {'KMeansSMOTE': {'precission_scores': [0.8823529411764706, 0.9375, 0.75, 0.782608695652174, 0.7272727272727273, 0.75, 1.0, 0.7894736842105263, 0.782608695652174, 0.7272727272727273], 'recall_scores': [0.75, 0.75, 0.9, 0.9, 0.42105263157894735, 0.75, 0.9, 0.75, 0.9, 0.8421052631578947], 'specifity_scores': [0.988950276243094, 0.994475138121547, 0.9668508287292817, 0.9723756906077348, 0.9834254143646409, 0.9723756906077348, 1.0, 0.9779005524861878, 0.9723756906077348, 0.9668508287292817], 'mean_precission_score': 0.8129089471236799, 'mean_recall_scores': 0.7863157894736842, 'mean_specifity_scores': 0.9795580110497237}, 'MeansShiftSMOTE': {'precission_scores': [0.75, 0.8, 0.8181818181818182, 0.782608695652174, 0.8181818181818182, 0.75, 1.0, 0.8235294117647058, 0.7727272727272727, 0.7619047619047619], 'recall_scores': [0.75, 0.8, 0.9, 0.9, 0.47368421052631576, 0.75, 0.85, 0.7, 0.85, 0.8421052631578947], 'specifity_scores': [0.9723756906077348, 0.9779005524861878, 0.9779005524861878, 0.9723756906077348, 0.988950276243094, 0.9723756906077348, 1.0, 0.9834254143646409, 0.9723756906077348, 0.9723756906077348], 'mean_precission_score': 0.8077133778412551, 'mean_recall_scores': 0.7815789473684209, 'mean_specifity_scores': 0.9790055248618783}, 'SMOTE': {'precission_scores': [0.7142857142857143, 0.8333333333333334, 0.8571428571428571, 0.75, 0.9333333333333333, 0.75, 1.0, 0.7777777777777778, 0.8888888888888888, 0.7727272727272727], 'recall_scores': [0.75, 0.75, 0.9, 0.9, 0.7368421052631579, 0.75, 0.8, 0.7, 0.8, 0.8947368421052632], 'specifity_scores': [0.9668508287292817, 0.9834254143646409, 0.9834254143646409, 0.9668508287292817, 0.994475138121547, 0.9723756906077348, 1.0, 0.9779005524861878, 0.988950276243094, 0.9723756906077348], 'mean_precission_score': 0.8277489177489178, 'mean_recall_scores': 0.7981578947368421, 'mean_specifity_scores': 0.9806629834254142}, 'ROS': {'precission_scores': [0.8823529411764706, 0.9375, 0.8571428571428571, 0.8095238095238095, 0.9090909090909091, 0.8235294117647058, 1.0, 0.7894736842105263, 0.9444444444444444, 0.7619047619047619], 'recall_scores': [0.75, 0.75, 0.9, 0.85, 0.5263157894736842, 0.7, 0.8, 0.75, 0.85, 0.8421052631578947], 'specifity_scores': [0.988950276243094, 0.994475138121547, 0.9834254143646409, 0.9779005524861878, 0.994475138121547, 0.9834254143646409, 1.0, 0.9779005524861878, 0.994475138121547, 0.9723756906077348], 'mean_precission_score': 0.8714962819258485, 'mean_recall_scores': 0.7718421052631579, 'mean_specifity_scores': 0.9867403314917127}, 'BorderlineSMOTE': {'precission_scores': [0.8823529411764706, 0.8235294117647058, 0.8947368421052632, 0.8181818181818182, 0.8125, 0.7, 0.9473684210526315, 0.875, 0.75, 0.7619047619047619], 'recall_scores': [0.75, 0.7, 0.85, 0.9, 0.6842105263157895, 0.7, 0.9, 0.7, 0.75, 0.8421052631578947], 'specifity_scores': [0.988950276243094, 0.9834254143646409, 0.988950276243094, 0.9779005524861878, 0.9834254143646409, 0.9668508287292817, 0.994475138121547, 0.988950276243094, 0.9723756906077348, 0.9723756906077348], 'mean_precission_score': 0.826557419618565, 'mean_recall_scores': 0.7776315789473685, 'mean_specifity_scores': 0.9817679558011049}}, 'yeast4.dat': {'KMeansSMOTE': {'precission_scores': [0.5, 0.42857142857142855, 0.3157894736842105, 0.5454545454545454, 0.8, 0.5714285714285714, 1.0, 0.14285714285714285, 0.4444444444444444, 0.5], 'recall_scores': [0.3, 0.3, 0.6, 0.5454545454545454, 0.4, 0.4, 0.4, 0.1, 0.36363636363636365, 0.5], 'specifity_scores': [0.9895470383275261, 0.9860627177700348, 0.9547038327526133, 0.9825174825174825, 0.9965034965034965, 0.9895470383275261, 1.0, 0.9790940766550522, 0.9825174825174825, 0.9825174825174825], 'mean_precission_score': 0.5248545606440344, 'mean_recall_scores': 0.390909090909091, 'mean_specifity_scores': 0.9843010647888697}, 'MeansShiftSMOTE': {'precission_scores': [0.2727272727272727, 0.2222222222222222, 0.3, 0.42857142857142855, 0.4, 0.4, 0.6666666666666666, 0.09090909090909091, 0.4, 0.29411764705882354], 'recall_scores': [0.3, 0.2, 0.6, 0.5454545454545454, 0.6, 0.6, 0.4, 0.1, 0.36363636363636365, 0.5], 'specifity_scores': [0.9721254355400697, 0.975609756097561, 0.9512195121951219, 0.972027972027972, 0.9685314685314685, 0.9686411149825784, 0.9930313588850174, 0.9651567944250871, 0.9790209790209791, 0.958041958041958], 'mean_precission_score': 0.3475214328155504, 'mean_recall_scores': 0.4209090909090909, 'mean_specifity_scores': 0.9703406349747812}, 'SMOTE': {'precission_scores': [0.25, 0.4, 0.35294117647058826, 0.5, 0.38461538461538464, 0.3125, 0.5714285714285714, 0.0, 0.38461538461538464, 0.375], 'recall_scores': [0.3, 0.4, 0.6, 0.5454545454545454, 0.5, 0.5, 0.4, 0.0, 0.45454545454545453, 0.6], 'specifity_scores': [0.9686411149825784, 0.9790940766550522, 0.9616724738675958, 0.9790209790209791, 0.972027972027972, 0.9616724738675958, 0.9895470383275261, 0.9651567944250871, 0.972027972027972, 0.965034965034965], 'mean_precission_score': 0.3531100517129929, 'mean_recall_scores': 0.43, 'mean_specifity_scores': 0.9713895860237324}, 'ROS': {'precission_scores': [0.75, 0.6666666666666666, 0.5, 0.2, 1.0, 0.8, 1.0, 0.0, 0.6666666666666666, 0.5], 'recall_scores': [0.3, 0.2, 0.2, 0.09090909090909091, 0.3, 0.4, 0.1, 0.0, 0.18181818181818182, 0.5], 'specifity_scores': [0.9965156794425087, 0.9965156794425087, 0.9930313588850174, 0.986013986013986, 1.0, 0.9965156794425087, 1.0, 0.9895470383275261, 0.9965034965034965, 0.9825174825174825], 'mean_precission_score': 0.6083333333333333, 'mean_recall_scores': 0.22727272727272724, 'mean_specifity_scores': 0.9937160400575035}, 'BorderlineSMOTE': {'precission_scores': [0.2727272727272727, 0.2222222222222222, 0.21428571428571427, 0.3333333333333333, 0.4, 0.4, 0.5, 0.0, 0.4444444444444444, 0.35714285714285715], 'recall_scores': [0.3, 0.2, 0.3, 0.36363636363636365, 0.4, 0.6, 0.4, 0.0, 0.36363636363636365, 0.5], 'specifity_scores': [0.9721254355400697, 0.975609756097561, 0.9616724738675958, 0.972027972027972, 0.9790209790209791, 0.9686411149825784, 0.9860627177700348, 0.9721254355400697, 0.9825174825174825, 0.9685314685314685], 'mean_precission_score': 0.31441558441558437, 'mean_recall_scores': 0.3427272727272727, 'mean_specifity_scores': 0.9738334835895811}}, 'yeast5.dat': {'KMeansSMOTE': {'precission_scores': [0.5384615384615384, 0.6666666666666666, 0.5555555555555556, 1.0, 0.8571428571428571, 0.8571428571428571, 0.625, 0.8, 0.75, 0.7142857142857143], 'recall_scores': [0.7777777777777778, 0.6666666666666666, 0.5555555555555556, 0.8888888888888888, 0.75, 0.6666666666666666, 0.5555555555555556, 0.8888888888888888, 1.0, 0.625], 'specifity_scores': [0.9791666666666666, 0.9895833333333334, 0.9861111111111112, 1.0, 0.9965277777777778, 0.9965277777777778, 0.9895833333333334, 0.9930555555555556, 0.9895833333333334, 0.9930555555555556], 'mean_precission_score': 0.7364255189255189, 'mean_recall_scores': 0.7375, 'mean_specifity_scores': 0.9913194444444444}, 'MeansShiftSMOTE': {'precission_scores': [0.5, 0.7272727272727273, 0.5, 1.0, 0.7777777777777778, 0.75, 0.625, 0.75, 0.6428571428571429, 0.7142857142857143], 'recall_scores': [0.7777777777777778, 0.8888888888888888, 0.6666666666666666, 0.8888888888888888, 0.875, 0.6666666666666666, 0.5555555555555556, 1.0, 1.0, 0.625], 'specifity_scores': [0.9756944444444444, 0.9895833333333334, 0.9791666666666666, 1.0, 0.9930555555555556, 0.9930555555555556, 0.9895833333333334, 0.9895833333333334, 0.9826388888888888, 0.9930555555555556], 'mean_precission_score': 0.6987193362193362, 'mean_recall_scores': 0.7944444444444445, 'mean_specifity_scores': 0.9885416666666667}, 'SMOTE': {'precission_scores': [0.5, 0.6, 0.5454545454545454, 0.8888888888888888, 0.8, 0.8, 0.6, 0.8181818181818182, 0.6428571428571429, 0.7142857142857143], 'recall_scores': [0.8888888888888888, 0.6666666666666666, 0.6666666666666666, 0.8888888888888888, 1.0, 0.8888888888888888, 0.6666666666666666, 1.0, 1.0, 0.625], 'specifity_scores': [0.9722222222222222, 0.9861111111111112, 0.9826388888888888, 0.9965277777777778, 0.9930555555555556, 0.9930555555555556, 0.9861111111111112, 0.9930555555555556, 0.9826388888888888, 0.9930555555555556], 'mean_precission_score': 0.6909668109668111, 'mean_recall_scores': 0.8291666666666666, 'mean_specifity_scores': 0.9878472222222221}, 'ROS': {'precission_scores': [0.6363636363636364, 0.75, 0.5714285714285714, 1.0, 0.875, 0.8571428571428571, 0.6, 0.8181818181818182, 0.6666666666666666, 0.6666666666666666], 'recall_scores': [0.7777777777777778, 0.6666666666666666, 0.4444444444444444, 0.8888888888888888, 0.875, 0.6666666666666666, 0.3333333333333333, 1.0, 0.8888888888888888, 0.5], 'specifity_scores': [0.9861111111111112, 0.9930555555555556, 0.9895833333333334, 1.0, 0.9965277777777778, 0.9965277777777778, 0.9930555555555556, 0.9930555555555556, 0.9861111111111112, 0.9930555555555556], 'mean_precission_score': 0.7441450216450217, 'mean_recall_scores': 0.7041666666666666, 'mean_specifity_scores': 0.9927083333333334}, 'BorderlineSMOTE': {'precission_scores': [0.5, 0.6666666666666666, 0.45454545454545453, 0.8888888888888888, 0.7272727272727273, 0.7272727272727273, 0.5714285714285714, 0.75, 0.6428571428571429, 0.75], 'recall_scores': [0.8888888888888888, 0.6666666666666666, 0.5555555555555556, 0.8888888888888888, 1.0, 0.8888888888888888, 0.4444444444444444, 1.0, 1.0, 0.75], 'specifity_scores': [0.9722222222222222, 0.9895833333333334, 0.9791666666666666, 0.9965277777777778, 0.9895833333333334, 0.9895833333333334, 0.9895833333333334, 0.9895833333333334, 0.9826388888888888, 0.9930555555555556], 'mean_precission_score': 0.667893217893218, 'mean_recall_scores': 0.8083333333333332, 'mean_specifity_scores': 0.9871527777777779}}}
print(oversampled_datasets)
#oversampled_datasets = {}

{'abalone-17_vs_7-8-9-10.dat': {'KMeansSMOTE': {'precission_scores': [0.2631578947368421, 0.21428571428571427, 0.17857142857142858, 0.2916666666666667, 0.3, 0.2727272727272727, 0.29411764705882354, 0.23529411764705882, 0.25, 0.14285714285714285], 'recall_scores': [0.4166666666666667, 0.25, 0.4166666666666667, 0.6363636363636364, 0.2727272727272727, 0.25, 0.4166666666666667, 0.3333333333333333, 0.45454545454545453, 0.18181818181818182], 'specifity_scores': [0.9692982456140351, 0.9758771929824561, 0.9495614035087719, 0.9627192982456141, 0.9846491228070176, 0.9824561403508771, 0.9736842105263158, 0.9714912280701754, 0.9671052631578947, 0.9736842105263158], 'mean_precission_score': 0.24426778845509492, 'mean_recall_scores': 0.36287878787878786, 'mean_specifity_scores': 0.9710526315789473}, 'MeansShiftSMOTE': {'precission_scores': [0.2, 0.22727272727272727, 0.23076923076923078, 0.2727272727272727, 0.3076923076923077, 0.21428571428571427, 0.3181818181818182, 0.35294117647058826, 0.2608695652

In [19]:
# Wizualizacja naszego modelu i sprawdzenie zbalansowania
for dataset in oversampled_datasets:
    visualize_X, visualize_y = load_dataset(dataset)
    print('-----dataset: {}-----'.format(dataset))
    print('Stats:\n')
    print('Num of probes with label 0 before oversampling:\n {} \n'.format(list(visualize_y).count(0)))
    print('Num of probes with label 1 before oversampling:\n {} \n'.format(list(visualize_y).count(1)))
    print('Num of probes with label 0 after oversampling:\n {} \n'.format(list(oversampled_datasets[dataset][1]).count(0)))
    print('Num of probes with label 1 after oversampling:\n {} \n'.format(list(oversampled_datasets[dataset][1]).count(1)))

-----dataset: mild-imbalance\page-blocks0.dat-----
Stats:

Num of probes with label 0 before oversampling:
 4913 

Num of probes with label 1 before oversampling:
 559 

Num of probes with label 0 after oversampling:
 3931 

Num of probes with label 1 after oversampling:
 3931 

-----dataset: mild-imbalance\pima.dat-----
Stats:

Num of probes with label 0 before oversampling:
 500 

Num of probes with label 1 before oversampling:
 268 

Num of probes with label 0 after oversampling:
 400 

Num of probes with label 1 after oversampling:
 400 

-----dataset: mild-imbalance\segment0.dat-----
Stats:

Num of probes with label 0 before oversampling:
 1979 

Num of probes with label 1 before oversampling:
 329 

Num of probes with label 0 after oversampling:
 1584 

Num of probes with label 1 after oversampling:
 1584 

-----dataset: mild-imbalance\vehicle0.dat-----
Stats:

Num of probes with label 0 before oversampling:
 647 

Num of probes with label 1 before oversampling:
 199 

Num of pro

### Statistically significantly better preprocessings in given datasets with given metrics

In [None]:
from scipy.stats import ttest_rel, wilcoxon, shapiro
from tabulate import tabulate

alfa = .05
methods = ["KMeansSMOTE", "MeansShiftSMOTE", "SMOTE", "ROS", "BorderlineSMOTE"]
metrics = ["precission_scores","recall_scores","specifity_scores"]

for directory in results:
    for metric in metrics:
        w_statistic = np.zeros((len(methods), len(methods)))
        p_value = np.zeros((len(methods), len(methods)))
        test_used = np.empty((len(methods), len(methods)), dtype=object)
        for i, preprocessing_method in enumerate(methods):
            for j, comparison_preprocessing_method in enumerate(methods):
                metric_results_one = results[directory][preprocessing_method][metric]
                metric_results_two = results[directory][comparison_preprocessing_method][metric]
                tmp_p_val = 0
                try:
                    _, tmp_p_val = wilcoxon(metric_results_one, metric_results_two)
                except: 
                    _, tmp_p_val = 0, 1
                mean_metric_one = np.mean(metric_results_one)
                mean_metric_two = np.mean(metric_results_two)
                if  tmp_p_val <= alfa:
                    if mean_metric_one - mean_metric_two > 0:
                        w_statistic[i,j], p_value[i,j] = 1, 1
                    else:
                        w_statistic[i,j], p_value[i,j] = 0, 1
                else:
                    if mean_metric_one - mean_metric_two > 0:
                        w_statistic[i,j], p_value[i,j] = 1, 0
                    else:
                        w_statistic[i,j], p_value[i,j] = 0, 0
        stat_better = w_statistic * p_value
        stat_better_table = tabulate(stat_better, methods)
        print(f"Statistically significantly better {metric}:")
        print(stat_better_table)
        print()
        print()
        print()
        

Statistically significantly better precission_scores:
  KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
-------------  -----------------  -------  -----  -----------------
            0                  0        0      0                  0
            0                  0        0      0                  0
            1                  0        0      0                  0
            1                  1        0      0                  0
            1                  1        1      0                  0



Statistically significantly better recall_scores:
  KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
-------------  -----------------  -------  -----  -----------------
            0                  0        0      1                  0
            1                  0        0      1                  0
            1                  0        0      1                  0
            0                  0        0      0                  0
         

### Statistically significantly better preprocessings for all datasets

In [None]:
from scipy.stats import rankdata, ranksums

methods = ["KMeansSMOTE", "MeansShiftSMOTE", "SMOTE", "ROS", "BorderlineSMOTE"]
metrics = ["mean_precission_score", "mean_recall_scores", "mean_specifity_scores"]
for metric in metrics:
    mean = []
    for directory in results:
        preprocessing_mean = []
        for i, preprocessing_method in enumerate(methods):
            preprocessing_mean.append(results[directory][preprocessing_method][metric])
        mean.append(preprocessing_mean)

    ranks = []
    for mean_score in mean:
        ranks.append(rankdata(mean_score).tolist())
    ranks = np.array(ranks)

    alfa = .05
    w_statistic = np.zeros((len(methods), len(methods)))
    p_value = np.zeros((len(methods), len(methods)))
    for i in range(len(methods)):
        for j in range(len(methods)):
            w_statistic[i, j], p_value[i, j] = ranksums(ranks.T[i], ranks.T[j])
    names_column = np.expand_dims(np.array(list(methods)), axis=1)
    w_statistic_table = np.concatenate((names_column, w_statistic), axis=1)
    w_statistic_table = tabulate(w_statistic_table, methods, floatfmt=".2f")
    p_value_table = np.concatenate((names_column, p_value), axis=1)
    p_value_table = tabulate(p_value_table, methods, floatfmt=".2f")
    advantage = np.zeros((len(methods), len(methods)))
    advantage[w_statistic > 0] = 1
    advantage_table = tabulate(np.concatenate(
    (names_column, advantage), axis=1), methods)
    significance = np.zeros((len(methods), len(methods)))
    significance[p_value <= alfa] = 1
    statisticaly_better = advantage * significance
    statisticaly_better_table = tabulate(np.concatenate(
    (names_column, statisticaly_better), axis=1), methods)
    print(f"Metric: {metric}")
    print("Statistical significance (alpha = 0.05):")
    print(statisticaly_better_table)
    print()
    print()

Metric: mean_precission_score
Statistical significance (alpha = 0.05):
                   KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
---------------  -------------  -----------------  -------  -----  -----------------
KMeansSMOTE                  0                  0        0      0                  0
MeansShiftSMOTE              0                  0        0      0                  0
SMOTE                        0                  0        0      0                  0
ROS                          0                  1        1      0                  1
BorderlineSMOTE              0                  0        0      0                  0


Metric: mean_recall_scores
Statistical significance (alpha = 0.05):
                   KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
---------------  -------------  -----------------  -------  -----  -----------------
KMeansSMOTE                  0                  0        0      0                  0
MeansShif

# Testy parowe
## Cel testów parowych
W poprzedniej części pracy wykonano ocenę klasyfikatorów za pomocą metryk i walidacji krzyżowej. W ten sposób uzyskano osobną ocenę dla przypadków klasyfikatorów gdzie różnicami były: zbiory uczące, różne modele uczenia maszynowego oraz główny cel pracy czyli różne metody balansowania danych (sprawdzić czy niema zmian). Uzyskane wyniki należy w tym momencie porównać aby ocenić czy wyniki uzyskane w przypadkach gdzie wykorzystano metodę SMOTE wykonywaną na klastrach powstałych z danych niezbalansowanych są lepsze od pozostałych metod oversamplingu. Aby ocenić tą zależność można uśrednić uzyskane metryki i ocenić czy średnio model wykorzystujący zaproponowaną metodę oversamplingu uzyskuje lepsze wyniki, jednak test taki nie może zostać uznany za prawidłowy ponieważ mógł wynikać z przypadku. W tym celu należy wziąć pod uwagę również parametr określający jak wyniki wchodzące w skład średniej są od niej oddalone (odchylenie standardowe zbiorów). W tym celu przez badaczy wykorzystywane są testy statystyczne takie jak test T- Studenta oraz test Wilcoxona. Poniżej opisane zostaną testy które mogą zostać wykorzystane w pracy.
## Wkorzystane testy statystyczne 
Testy statystyczne wykorzystuje się dla różńych zbiorów danych aby ocenić czy różnica między nimi jest statystycznie istotna.
W tej pracy celem jest porównanie metryk dla modeli wykorzystujących mechanizm balansowania danych opartych o metodę SMOTE działającą na pojedynczych klastrach danych wejściowych oraz innych popularnie wykorzystywanych narzędzi oversamplingu. W ten sposób dane można podzielić na pary: zaproponowana w pracy metoda i inna metoda oversamplingu. Taki sposób podziału danych determinuje wykorzystanie testów parowych.
### Test T-Studenta
Test T-Studenta to test parametryczny (opierający się o porównanie parametrów populacji takich jak odchylenie standardowe czy średnia )
Warunkiem koniecznym do zastosowania testu T studenta jest założenie, że porównywane zbiory są normalne. Testowanie normalności zbiorów zostanie omówione w dalszej części pracy. Jeśli Testowanie normalności wykazało, że próbki nie są normalne wtedy można wykorzystać inne testy o mniejszej precyzji czyli testy nieparametryczne. Należy jednak napiętać, że jeśli to możliwe powinienny zostać przeprowadzone testy parametryczne takie jest test T studenta lub analiza wariancji.\
Testy statystyczne parametryczne : https://pogotowiestatystyczne.pl/slowniki/testy-parametryczne/#:~:text=Testy%20parametryczne%20to%20rodzaj%20test%C3%B3w,standardowe%20lub%20innych%20statystykach%20opisowych.
### Test Wilcoxona
Test Wilcoxona to test nieparametryczny wykonywany na bazie próbek populacji a nie na jej parametrach. Wykorzystuje on różnicę między próbkami w przypadkach wykorzystania dwóchróżnych hiperparametrów modelu. Różnica każdejz próbek zostaje zakwalifikowana do jednego ze zbiorów $T_{-}$ gdy różnica jest ujemna lub $T_{+}$ gdy jest dodatnia. W ten sposób uzyskano dwa zbiory. Wszystkim różnicom w tym momencie usuwany zostaje znak a przyznana zostaje  $ranga$(tu można walnąć dokładniejszy opis ale teraz trochę małoczasu) a następnie sumowane są wszystkie ranki w zbiorach $T_{-}$ i $T_{+}$. Pod uwagę bierze się mniejszą sumę rang oraz sprawdzana jest ona w tablicy wartości sum wag Wilxocona.
### Test założenia Normalności
Aby ocenić czy próbki pewnej populacji mają określony rozkład można wykorzystać Test Kołmogorowa-Smirnowa. Aby określić czy wyniki badań możemy ocenić za pomocą testu T studenta trzeba spełnić założenie normalności a więc określić czy próbki mogą pochodzić z wałsciwości o charakterze rozkładu normalngo. W tym celu test Kołmogorowa-Smirnowa należy wykonać dla uzyskanych wcześniej wyników i rozkładu normalnego.

biblio: https://www.scirp.org/html/6-1241391_107034.htm
https://onlinelibrary.wiley.com/doi/full/10.1002/9781118445112.stat06558

Do wykonania oceny założenia normalności wykorzystana zostanie funkcja kstest pochodząca z pakiety scipy \
Tutaj jak to robić: https://medium.com/@ricardojaviermartnezsustegui/kolmog%C3%B3rov-smirnov-test-in-python-step-by-step-1b7532021bd2

a jest jeszcze coś takiego
scipy.stats.normaltest