### Exploring the impact of clustering on the quality of SMOTE preprocessing. Comparative analysis
##### Maksym Malicki, Jacek Glapiński
###### Wrocław University of Technology
In this notebook we present a comparative analysis of how clustering the minority class with various methods affects the quality of SMOTE preprocessing.

#### load_dataset()
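This function loads the datasets listed in the paper. Header lines starting with `@` are skipped, the last comma-separated field is the class (`positive` maps to label 1, anything else to 0), and non-numeric feature values are encoded with `ord()`.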

In [1]:
import numpy as np

def load_dataset(file_path):
    data = []
    labels = []

    with open(file_path, 'r') as f:
        for line in f:
            if line.startswith('@'):
                continue
            line_data = line.strip().split(',')
            sample_class = line_data[-1].strip().lower().replace(" ", "")
            label = 1 if sample_class == 'positive' else 0
            converted_data = []
            for x in line_data[:-1]:
                try:
                    converted_data.append(float(x))
                except ValueError:
                    converted_data.append(ord(x))
            data.append(converted_data)
            labels.append(label)
    X = np.array(data)
    y = np.array(labels)

    return X, y
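
The `ord(x)` fallback in `load_dataset` only accepts single-character strings; a longer categorical value such as `'red'` would raise a `TypeError`, which the `except ValueError` clause does not catch. A minimal sketch of one alternative, assuming categorical columns may hold arbitrary strings: assign each distinct string an integer code per column. The helper `encode_value` and the `codes_per_column` structure are hypothetical names, not part of the notebook, and the resulting codes differ from the `ord()` values used above.

```python
def encode_value(raw, column_codes):
    """Convert a raw field to float; non-numeric strings get a per-column integer code."""
    raw = raw.strip()
    try:
        return float(raw)
    except ValueError:
        # The first time a string is seen it receives the next unused code (0, 1, 2, ...)
        return float(column_codes.setdefault(raw, len(column_codes)))

# Hypothetical rows with one numeric and one categorical feature
rows = [["1.5", "red"], ["2.0", "green"], ["3.5", "red"]]
codes_per_column = [{}, {}]
encoded = [
    [encode_value(value, codes_per_column[j]) for j, value in enumerate(row)]
    for row in rows
]
print(encoded)
```

Integer codes impose an arbitrary order on the categories, just as `ord()` does, so this only removes the crash and does not make the feature meaningful for distance-based methods.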

#### SMOTE_ByMedium implementation and tests

In [2]:
#https://medium.com/@corymaklin/synthetic-minority-over-sampling-technique-smote-7d419696b88c

from random import randrange, uniform
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score

def SMOTE_ByMedium(sample: pd.DataFrame, N: float, k: int) -> np.ndarray:
    # sample: minority-class samples (one row per sample); N: amount of SMOTE in percent; k: number of nearest neighbours
    T, num_attrs = sample.shape
    
    # If N is less than 100%, only the first round(N / 100 * T) samples are SMOTEd (the rows are not shuffled here)
    if N < 100:
        T = round(N / 100 * T)
        N = 100
    # The amount of SMOTE is assumed to be an integral multiple of 100; any fractional part is truncated
    N = int(N / 100)

    synthetic = np.zeros([T * N, num_attrs])
    new_index = 0
    nbrs = NearestNeighbors(n_neighbors=k+1).fit(sample.values)
    def populate(N, i, nnarray):
        
        nonlocal new_index
        nonlocal synthetic
        nonlocal sample
        while N != 0:
            nn = randrange(1, k+1)
            for attr in range(num_attrs):
                dif = sample.iloc[nnarray[nn]][attr] - sample.iloc[i][attr]
                gap = uniform(0, 1)
                synthetic[new_index][attr] = sample.iloc[i][attr] + gap * dif
            new_index += 1
            N = N - 1
    
    for i in range(T):
        nnarray = nbrs.kneighbors(sample.iloc[i].values.reshape(1, -1), return_distance=False)[0]
        populate(N, i, nnarray)
    
    return synthetic
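
The core of `populate()` is a linear interpolation between a sample and one of its nearest neighbours. The sketch below isolates that step with NumPy on two hand-picked points (the values and variable names are illustrative assumptions). It shows the per-attribute gap used above, where the result falls inside the axis-aligned box spanned by the two points, next to a single-gap variant, where the result lies on the segment joining them.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([1.0, 3.0])         # a minority sample
neighbor = np.array([3.0, 1.0])  # one of its nearest neighbours

# Per-attribute gap, as in populate(): a separate random number for each attribute
gap_per_attr = rng.uniform(0.0, 1.0, size=x.shape)
synthetic_box = x + gap_per_attr * (neighbor - x)

# Single gap for the whole sample: the point lies on the segment x -> neighbor
single_gap = rng.uniform(0.0, 1.0)
synthetic_segment = x + single_gap * (neighbor - x)

print(synthetic_box, synthetic_segment)
```

Which variant is preferable depends on how much spread around the minority samples is acceptable; the per-attribute version can place points off the line between the two samples.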




In [3]:
# SMOTE test
# temporary SMOTE test to get the sample counts right
import plotly.express as px
temp_Cluter = np.array([[1,3,2,4,6,2,1,3,6,4,5,1,5,2,6,8,7,9,5,8,6,8,7,9,5,7,4,5,7],
                        [3,1,2,6,2,5,4,3,2,1,3,3,2,1,6,5,7,6,4,8,9,5,8,4,7,5,6,8,4]])
temp_Cluter_y = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1])
print('temp_Cluter shape: ',temp_Cluter.shape)
print('temp_Cluter_y shape: ',temp_Cluter_y.shape)
fig = px.scatter(x=temp_Cluter[0],y=temp_Cluter[1],title="Input data")
fig.show()

temp_Cluter shape:  (2, 29)
temp_Cluter_y shape:  (29,)


Size of the input set:

In [4]:
import math
# now decide how many samples we want to end up with:
# we want 100 samples
desired = 100
print('We want 100 output samples')
# currently we have `size` samples
size  = temp_Cluter.shape[1]

todoSamples = int(((desired/size)-1 )*100)
print("Expected number of output samples: ", (todoSamples/100+1)*len(temp_Cluter_y))
print((todoSamples/100+1))
print(math.floor(todoSamples/100)+1)
temp_out = SMOTE_ByMedium(pd.DataFrame(temp_Cluter.T), todoSamples, 5)

We want 100 output samples
Expected number of output samples:  99.76
3.44
3


In [5]:
temp_out.T
print(len(temp_out.T[0]))
print(len(temp_Cluter[0]))
print(len(temp_out.T[0])+len(temp_Cluter[0]))

58
29
87
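
The total of 87 instead of the requested 100 follows from the truncation inside `SMOTE_ByMedium`: the percentage is cut down to a whole multiple of 100. The arithmetic can be reproduced without generating any samples; the variable names below are illustrative.

```python
size = 29       # samples in the test set
desired = 100   # requested total

todo_samples = int(((desired / size) - 1) * 100)  # 244 (percent)
multiplier = int(todo_samples / 100)              # 2, the fractional .44 is dropped
synthetic_count = size * multiplier               # 58 generated samples
total = size + synthetic_count                    # 87 samples overall

# A second pass with the dropped remainder would close the gap:
remainder = todo_samples % 100                    # 44 (percent)
extra = round(remainder / 100 * size)             # round(12.76) = 13
total_with_second_pass = total + extra            # 100

print(todo_samples, multiplier, synthetic_count, total, total_with_second_pass)
# prints: 244 2 58 87 100
```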


In [6]:
temp_Cluter_OUT = [[*temp_Cluter[0],*temp_out.T[0]],[*temp_Cluter[1],*temp_out.T[1]]]
print(temp_Cluter_OUT)

[[1, 3, 2, 4, 6, 2, 1, 3, 6, 4, 5, 1, 5, 2, 6, 8, 7, 9, 5, 8, 6, 8, 7, 9, 5, 7, 4, 5, 7, 1.7241194550736987, 1.018029520976006, 2.014698726968331, 2.0041814183347135, 1.9443397129275697, 2.0, 4.0, 4.463692941961797, 5.431938429820341, 5.094079469958102, 1.1304591920139553, 2.744649286452857, 1.4612429814184549, 1.0, 4.485674711517438, 1.0914211023671598, 5.804206857934645, 5.733878023602682, 4.086423786704449, 4.032056696628841, 5.0, 5.3053134971432145, 1.0, 2.1988006135389693, 5.0, 5.0, 2.3621457663817305, 2.9956417294796642, 4.7816137490297805, 5.064325774144167, 8.0, 7.025268142360551, 7.0, 7.323965542248555, 8.128408245803469, 8.917063933032072, 5.0, 4.518328758834053, 7.649805943748927, 7.068115562414267, 5.679128075734463, 5.718672570139301, 7.64974665703329, 7.99491163997972, 6.354298134782576, 7.0, 8.662332811633943, 8.603491830399527, 4.82577536126675, 4.7024380977097495, 7.07029418704406, 7.224138306210325, 4.0, 4.0427501078064, 4.29605561962639, 4.351956184284541, 8.24128778

In [7]:
fig = px.scatter(x= temp_out.T[0], y= temp_out.T[1],title='Generated')
fig.show()

In [8]:
fig = px.scatter(x= temp_Cluter_OUT[0], y= temp_Cluter_OUT[1], title= 'Generated + input samples, with a combined count intended to equal desired')
fig.show()
len(temp_Cluter_OUT[0])

87

#### Clustering with SMOTE

In [9]:
from sklearn.cluster import KMeans, MeanShift

def oversample_clustered_data(X, y, X_minority, y_minority, X_majority, y_majority, cluster_labeled_data):
    # generated minority samples go here:
    X_generated = []

    # To decide how many samples to generate we need the difference between the majority and minority class counts; it is used later in the loop for each cluster
    majority_minority_difference =  list(y).count(0) - list(y).count(1)
    if(majority_minority_difference < 0):
        raise ValueError("Minority class (label 1) has a greater count than the majority class (label 0)")
    # We also need the size of the minority class to decide how many samples to generate
    minority_count = list(y).count(1)

    # For each cluster:
    cluster_labels = np.unique(cluster_labeled_data) # the cluster labels that are present [0,1,...]

    # All synthetically generated objects go here
    syntetic_data = []
    for cluster in cluster_labels:
        # first keep only the minority-set indices that belong to the cluster analysed in this iteration
        cluster_samples_indices_minority = np.where(cluster_labeled_data == cluster)[0]
        # skip clusters with fewer than 4 samples: SMOTE_ByMedium below queries k + 1 = 4 neighbours
        if len(cluster_samples_indices_minority)<4:
            continue
        # determine how many samples to generate in this cluster
        # share of the cluster in the whole minority class (a fraction between 0 and 1)
        percentage_of_count = len(cluster_samples_indices_minority)/minority_count
        # sanity check in case something went badly wrong at this stage
        if(percentage_of_count>1 or percentage_of_count<0):
            raise ValueError("Cluster share of the minority class must be between 0 and 1")
        
        # then multiply the cluster share by the majority/minority difference to get the number of samples to generate
        num_of_samples_to_generate_in_this_cluster = percentage_of_count * majority_minority_difference

        # The augmentation function expects a percentage (100 means 100%) of samples to generate relative to the input data,
        # so divide the number of samples we want to generate by the number of samples we already have
        todoSamples = 100*(num_of_samples_to_generate_in_this_cluster/len(cluster_samples_indices_minority))

        # The function works on a DataFrame, so the np.array is converted directly in the call
        # Oversampling with k = 3; this hyperparameter does not seem to matter much in this implementation. Revisit it if larger changes are made
        temp_out = SMOTE_ByMedium(pd.DataFrame(X_minority[cluster_samples_indices_minority]), todoSamples, 3)
        syntetic_data.append(temp_out)
        
        # When asked for more than 100%, this SMOTE implementation drops the fractional part of the multiplier, so fewer samples are produced; a second pass generates the remainder
        if todoSamples>100:
            todoSamples_2 = todoSamples%100
            temp_out_2 = SMOTE_ByMedium(pd.DataFrame(X_minority[cluster_samples_indices_minority]),todoSamples_2 , 3)
            syntetic_data.append(temp_out_2)
        
        

    # combine the syntetic_data output with X and y
    X_resampled = X
    y_resampled  = y

    for clusterOUT in syntetic_data:
        X_resampled = np.block([[X_resampled], [clusterOUT]])
        y_resampled = [*y_resampled,*np.ones(clusterOUT.shape[0])] # ones, because 1 is the minority class
    return X_resampled, y_resampled

def KMeans_SMOTE(X, y, num_clusters):
    # Select the indices of the minority and majority classes
    minority_indices = np.where(y == 1)[0] # in our datasets the class labelled 1 is always the minority
    majority_indices = np.where(y == 0)[0]

    # Split data and labels into minority and majority
    X_minority = X[minority_indices]
    y_minority = y[minority_indices]

    X_majority = X[majority_indices]
    y_majority = y[majority_indices]
    
    # Cluster the minority class
    kmeans_labels_minority = KMeans(n_clusters=num_clusters, random_state=0, n_init="auto").fit_predict(X_minority)

    # The returned cluster labels index into the minority class only; keep that in mind.
    return oversample_clustered_data(X, y, X_minority, y_minority, X_majority, y_majority, kmeans_labels_minority)


def MeanShift_SMOTE(X, y):
    # TODO: most lines of MeanShift_SMOTE and KMeans_SMOTE are duplicated; factor them into a shared function
    # Select the indices of the minority and majority classes
    minority_indices = np.where(y == 1)[0] # in our datasets the class labelled 1 is always the minority
    majority_indices = np.where(y == 0)[0]

    # Split data and labels into minority and majority
    X_minority = X[minority_indices]
    y_minority = y[minority_indices]

    X_majority = X[majority_indices]
    y_majority = y[majority_indices]
    
    # Cluster the minority class
    mean_shift_labels_minority = MeanShift().fit_predict(X_minority)
    return oversample_clustered_data(X, y, X_minority, y_minority, X_majority, y_majority, mean_shift_labels_minority)
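
The per-cluster allocation in `oversample_clustered_data` can be traced with plain arithmetic. The sketch below mirrors only the counting logic (cluster share, percentage passed to `SMOTE_ByMedium`, truncation to whole multiples of 100, and the second pass for the remainder) and generates no samples. The helper names `smote_output_count` and `expected_synthetic` and the example cluster sizes are illustrative assumptions, not part of the notebook.

```python
def smote_output_count(n_samples, pct):
    """Number of rows SMOTE_ByMedium returns for n_samples inputs and pct percent."""
    if pct < 100:
        n_samples = round(pct / 100 * n_samples)
        pct = 100
    return n_samples * int(pct / 100)

def expected_synthetic(cluster_sizes, n_majority, min_cluster_size=4):
    """Expected synthetic samples per minority cluster, following the loop above."""
    n_minority = sum(cluster_sizes)
    difference = n_majority - n_minority
    counts = []
    for size in cluster_sizes:
        if size < min_cluster_size:
            counts.append(0)  # the cluster is skipped entirely
            continue
        share = size / n_minority
        # Algebraically this equals 100 * difference / n_minority for every cluster
        pct = 100 * (share * difference / size)
        total = smote_output_count(size, pct)
        if pct > 100:
            total += smote_output_count(size, pct % 100)  # second pass
        counts.append(total)
    return counts

# Hypothetical minority clusters of 8, 5 and 3 samples against 40 majority samples
per_cluster = expected_synthetic([8, 5, 3], n_majority=40)
print(per_cluster, sum(per_cluster))
```

With 16 minority and 40 majority samples the difference is 24, yet only 19 synthetic samples are expected: the 3-sample cluster is skipped, and `round(2.5)` gives 2 in Python. Shortfalls of this kind may explain why the oversampled minority class can end up somewhat smaller than the majority class.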

#### Checking which class is the majority

In [10]:
# temporary helper to check which class is the majority
# NOTE: despite its name, this function returns the label of the majority class (or -1 if the classes are balanced)
def ReturnsMinorityLabel(file_path):
    labels = []
    with open(file_path, 'r') as f:
        for line in f:
            if line.startswith('@'):
                continue
            line_data = line.strip().split(',')
            sample_class = line_data[-1].strip().lower().replace(" ", "")
            label = 1 if sample_class == 'positive' else 0
            labels.append(label)
    countOf0 = labels.count(0)
    countOf1 = labels.count(1)
    print("---------STATS---------") 
    print('countOf0: ',countOf0)
    print('countOf1: ',countOf1)
    print("----------OUT----------") 
    if countOf0 == countOf1:
        print("!!!!! the classes are balanced !!!!!")
        return(-1)
    if countOf0 > countOf1:
        print("More samples of class 0")
        return(0)
    if countOf0 < countOf1:
        print("More samples of class 1")
        return(1)

import os

# check that it works

directories = ['mild-imbalance', 'high-imbalance']
results_of_test = []
for directory in directories:
    print(f"Processing files in directory: {directory}")
    files = os.listdir(directory)
    
    for file_name in files:
        file_path = os.path.join(directory, file_name)
        print(f"File: {file_path}")
        results_of_test.append(ReturnsMinorityLabel(file_path))
    
print(results_of_test)
print(results_of_test.count(1))

Processing files in directory: mild-imbalance
File: mild-imbalance\page-blocks0.dat
---------STATS---------
countOf0:  4913
countOf1:  559
----------OUT----------
More samples of class 0
File: mild-imbalance\pima.dat
---------STATS---------
countOf0:  500
countOf1:  268
----------OUT----------
More samples of class 0
File: mild-imbalance\segment0.dat
---------STATS---------
countOf0:  1979
countOf1:  329
----------OUT----------
More samples of class 0
File: mild-imbalance\vehicle0.dat
---------STATS---------
countOf0:  647
countOf1:  199
----------OUT----------
More samples of class 0
File: mild-imbalance\vehicle1.dat
---------STATS---------
countOf0:  629
countOf1:  217
----------OUT----------
More samples of class 0
File: mild-imbalance\vehicle2.dat
---------STATS---------
countOf0:  628
countOf1:  218
----------OUT----------
More samples of class 0
File: mild-imbalance\vehicle3.dat
---------STATS---------
countOf0:  634
countOf1:  212
----------OUT----------
More samples of class 0
File: mild-imbalance\wisconsin.dat
---------STATS---------
coun

#### Experiment for single dataset

In [11]:
from imblearn.over_sampling import SMOTE, RandomOverSampler, BorderlineSMOTE
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import precision_score, recall_score
from imblearn.metrics import specificity_score

def experiment(X, y):
    preprocessings = {
        "KMeansSMOTE": True,
        "MeansShiftSMOTE": True,
        "SMOTE": SMOTE(),
        "ROS": RandomOverSampler(),
        "BorderlineSMOTE": BorderlineSMOTE(),
    }
    classifier = RandomForestClassifier(random_state=42)
    rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1234)
    result = {}
    # export the resulting datasets out of the function to inspect what the oversampled data look like
    export_experimentData = {}
    for key in preprocessings:
        export_experimentData[key] = {}
        precision_scores = []
        recall_scores = []
        specifity_scores = []
        i = 0 # iterator for loop 
        for train_index, test_index in rskf.split(X,y):
            export_experimentData[key]['split_'+str(i)] = {}
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            if key == "KMeansSMOTE":
                X_train_oversampled, y_train_oversampled = KMeans_SMOTE(X_train, y_train, 10)
            elif key == "MeansShiftSMOTE":
                X_train_oversampled, y_train_oversampled = MeanShift_SMOTE(X_train, y_train)
            else:
                X_train_oversampled, y_train_oversampled = preprocessings[key].fit_resample(X_train, y_train)
            export_experimentData[key]['split_'+str(i)]['train_oversampled'] = [X_train_oversampled, y_train_oversampled]
            export_experimentData[key]['split_'+str(i)]['train'] = [X_train, y_train]
            export_experimentData[key]['split_'+str(i)]['test'] = [X_test, y_test]
            
            classifier.fit(X_train_oversampled, y_train_oversampled)
            predict = classifier.predict(X_test)
            precision_scores.append(precision_score(y_test, predict))
            recall_scores.append(recall_score(y_test, predict))
            specifity_scores.append(specificity_score(y_test, predict))
            i = i+1
        mean_precision_score = np.mean(precision_scores)
        std_precision_score = np.std(precision_scores)
        mean_recall_score = np.mean(recall_scores)
        std_recall_score = np.std(recall_scores)
        mean_specifity_score = np.mean(specifity_scores)
        std_specifity_score = np.std(specifity_scores)
#         print(f"Precission score {key}: %.3f (%.3f)" % (mean_precision_score, std_precision_score))
#         print(f"Specifity score {key}: %.3f (%.3f)" % (mean_specifity_score, std_specifity_score))
#         print(f"Recall score {key}: %.3f (%.3f)" % (mean_recall_score, std_recall_score))
        result[key] = {
            "precission_scores": precision_scores,
            "recall_scores": recall_scores,
            "specifity_scores": specifity_scores,
            "mean_precission_score": mean_precision_score,
            "mean_recall_scores": mean_recall_score,
            "mean_specifity_scores": mean_specifity_score,
        }
    return result, export_experimentData
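
The three metrics collected in `experiment` all derive from the binary confusion matrix, with label 1 (the minority class) treated as positive: precision is TP / (TP + FP), recall is TP / (TP + FN), and specificity is TN / (TN + FP). The following sketch computes them with NumPy on a hand-made example; `binary_metrics` is a hypothetical helper written for illustration, and returning 0.0 for an empty denominator is an assumption made here.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return (precision, recall, specificity) with label 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    # Guard each ratio against an empty denominator
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return precision, recall, specificity

# TP = 1, FN = 1, FP = 1, TN = 2
precision, recall, specificity = binary_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(precision, recall, specificity)
```

Reporting specificity next to recall shows whether gains on the minority class come at the cost of more false positives on the majority class.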

#### Running experiments on the datasets

In [12]:
import os

directories = ['mild-imbalance', 'high-imbalance']
results = {}
export_experimentData = {}
for directory in directories:
    print(f"Processing files in directory: {directory}")
    files = os.listdir(directory)
    for file_name in files:
        file_path = os.path.join(directory, file_name)
        print(f"File: {file_path}")
        X, y = load_dataset(file_path)
        experiment_result, export_experimentData_for_File = experiment(X, y)
        export_experimentData[file_name] = export_experimentData_for_File
        results[file_name] = experiment_result
# print(results)

Processing files in directory: mild-imbalance
File: mild-imbalance\page-blocks0.dat
File: mild-imbalance\pima.dat
File: mild-imbalance\segment0.dat
File: mild-imbalance\vehicle0.dat
File: mild-imbalance\vehicle1.dat
File: mild-imbalance\vehicle2.dat
File: mild-imbalance\vehicle3.dat
File: mild-imbalance\wisconsin.dat
File: mild-imbalance\yeast1.dat
File: mild-imbalance\yeast3.dat
Processing files in directory: high-imbalance
File: high-imbalance\abalone-17_vs_7-8-9-10.dat



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



File: high-imbalance\abalone19.dat



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.


File: high-imbalance\kr-vs-k-three_vs_eleven.dat
File: high-imbalance\kr-vs-k-zero-one_vs_draw.dat
File: high-imbalance\shuttle-2_vs_5.dat
File: high-imbalance\shuttle-c0-vs-c4.dat
File: high-imbalance\yeast-0-2-5-6_vs_3-7-8-9.dat
File: high-imbalance\yeast-0-2-5-7-9_vs_3-6-8.dat
File: high-imbalance\yeast4.dat
File: high-imbalance\yeast5.dat
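
The "Precision is ill-defined" warning in the output above is raised when the classifier predicts no positive samples in a fold, so precision becomes 0 / 0. scikit-learn's `precision_score` accepts a `zero_division` argument that sets the returned value explicitly and suppresses the warning. A minimal sketch on made-up labels:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]  # no positive predictions at all

# zero_division=0 returns 0.0 for the undefined case without emitting a warning
precision_when_undefined = precision_score(y_true, y_pred, zero_division=0)
print(precision_when_undefined)
```

Whether 0 is the right substitute is a judgment call: it penalises folds in which the minority class was never predicted, which matches the default behaviour seen in the log.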


#### Storing the experiment data

In [26]:
import pickle
# DANGEROUS: this overwrites any existing store.pckl
with open('store.pckl', 'wb') as f:
    pickle.dump([results,export_experimentData], f)

In [1]:
import pickle
with open('store.pckl', 'rb') as f:
    results,export_experimentData = pickle.load(f)

#### Overview of the collected data

In [2]:
# Inspect the exported data and check the class balance
for dataset in export_experimentData: #['page-blocks0.dat']['KMeansSMOTE']['split_0']['train']
    for oversampling_memthod in export_experimentData[dataset]:
        for split_number in export_experimentData[dataset][oversampling_memthod]:
            print('-----dataset: {}-----'.format(dataset))
            print('-----oversampling_memthod: {}-----'.format(oversampling_memthod))
            print('-----split_number: {}-----'.format(split_number))
            print('Stats:\n')
            print('Num of probes with label 0 before oversampling:\n {} \n'.format(list(export_experimentData[dataset][oversampling_memthod][split_number]['train'][1]).count(0)))
            print('Num of probes with label 1 before oversampling:\n {} \n'.format(list(export_experimentData[dataset][oversampling_memthod][split_number]['train'][1]).count(1)))
            print('Num of probes with label 0 after oversampling:\n {} \n'.format(list(export_experimentData[dataset][oversampling_memthod][split_number]['train_oversampled'][1]).count(0)))
            print('Num of probes with label 1 after oversampling:\n {} \n'.format(list(export_experimentData[dataset][oversampling_memthod][split_number]['train_oversampled'][1]).count(1)))

-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----split_number: split_0-----
Stats:

Num of probes with label 0 before oversampling:
 3930 

Num of probes with label 1 before oversampling:
 447 

Num of probes with label 0 after oversampling:
 3930 

Num of probes with label 1 after oversampling:
 3852 

-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----split_number: split_1-----
Stats:

Num of probes with label 0 before oversampling:
 3930 

Num of probes with label 1 before oversampling:
 447 

Num of probes with label 0 after oversampling:
 3930 

Num of probes with label 1 after oversampling:
 3829 

-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----split_number: split_2-----
Stats:

Num of probes with label 0 before oversampling:
 3930 

Num of probes with label 1 before oversampling:
 448 

Num of probes with label 0 after oversampling:
 3930 

Num of probes with label 1 after

In [16]:
for dataset in results:
    for method in results[dataset]:
        for metryka in results[dataset][method]:
            print('-----dataset: {}-----'.format(dataset))
            print('-----oversampling_memthod: {}-----'.format(method))
            print('-----metryka: {}-----\n'.format(metryka))
            print(results[dataset][method][metryka])

-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.8095238095238095, 0.7983870967741935, 0.7983870967741935, 0.8225806451612904, 0.8203125, 0.860655737704918, 0.848, 0.7518248175182481, 0.8434782608695652, 0.7954545454545454]
-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.9107142857142857, 0.8839285714285714, 0.8918918918918919, 0.9107142857142857, 0.9375, 0.9375, 0.9464285714285714, 0.9279279279279279, 0.8660714285714286, 0.9375]
-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9755849440488301, 0.9745676500508647, 0.9745676500508647, 0.9775967413441955, 0.9765784114052953, 0.982706002034588, 0.9806714140386572, 0.965412004069176, 0.9816700610997964, 0.9725050916496945]
-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission

##### Overview of the metric distributions in EACH DATASET, metricsForEval = 'precission_scores'

In [3]:
import pandas as pd
import plotly.express as px

metricsForEval = 'precission_scores'

viewDF = pd.DataFrame()
for dataset in results:
    for method in results[dataset]:
        viewDF[method] = results[dataset][method][metricsForEval]
        for metryka in results[dataset][method]:
            print('-----dataset: {}-----'.format(dataset))
            print('-----oversampling_memthod: {}-----'.format(method))
            print('-----metryka: {}-----\n'.format(metryka))
            print(results[dataset][method][metryka])
    fig = px.histogram(viewDF, nbins=100)
    fig.show()

-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.8095238095238095, 0.7983870967741935, 0.7983870967741935, 0.8225806451612904, 0.8203125, 0.860655737704918, 0.848, 0.7518248175182481, 0.8434782608695652, 0.7954545454545454]
-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.9107142857142857, 0.8839285714285714, 0.8918918918918919, 0.9107142857142857, 0.9375, 0.9375, 0.9464285714285714, 0.9279279279279279, 0.8660714285714286, 0.9375]
-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9755849440488301, 0.9745676500508647, 0.9745676500508647, 0.9775967413441955, 0.9765784114052953, 0.982706002034588, 0.9806714140386572, 0.965412004069176, 0.9816700610997964, 0.9725050916496945]
-----dataset: page-blocks0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission

-----dataset: pima.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.6545454545454545, 0.660377358490566, 0.6567164179104478, 0.6119402985074627, 0.6727272727272727, 0.6470588235294118, 0.5538461538461539, 0.625, 0.7142857142857143, 0.7346938775510204]
-----dataset: pima.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.6666666666666666, 0.6481481481481481, 0.8148148148148148, 0.7735849056603774, 0.6981132075471698, 0.8148148148148148, 0.6666666666666666, 0.6481481481481481, 0.7547169811320755, 0.6792452830188679]
-----dataset: pima.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.81, 0.82, 0.77, 0.74, 0.82, 0.76, 0.71, 0.79, 0.84, 0.87]
-----dataset: pima.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.6531191371393504
-----dataset: pima.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_re

-----dataset: segment0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[1.0, 1.0, 0.9705882352941176, 1.0, 0.9848484848484849, 1.0, 1.0, 1.0, 1.0, 0.9552238805970149]
-----dataset: segment0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[1.0, 0.9848484848484849, 1.0, 0.9538461538461539, 0.9848484848484849, 1.0, 1.0, 1.0, 0.9692307692307692, 0.9696969696969697]
-----dataset: segment0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[1.0, 1.0, 0.9949494949494949, 1.0, 0.9974683544303797, 1.0, 1.0, 1.0, 1.0, 0.9924050632911392]
-----dataset: segment0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.9910660600739616
-----dataset: segment0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_recall_scores-----

0.9862470862470861
-----dataset: segment0.dat-----
-----oversampling_memthod: KMeansSMOTE----

-----dataset: vehicle0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.8636363636363636, 0.9069767441860465, 0.8636363636363636, 0.9090909090909091, 0.9444444444444444, 0.9512195121951219, 0.9069767441860465, 0.875, 0.9069767441860465, 0.9743589743589743]
-----dataset: vehicle0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.95, 0.975, 0.95, 1.0, 0.8717948717948718, 0.975, 0.975, 0.875, 0.975, 0.9743589743589743]
-----dataset: vehicle0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9538461538461539, 0.9689922480620154, 0.9534883720930233, 0.9689922480620154, 0.9846153846153847, 0.9846153846153847, 0.9689922480620154, 0.9612403100775194, 0.9689922480620154, 0.9923076923076923]
-----dataset: vehicle0.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.9102316799920317
-----dataset: vehicle0.dat-----
-----oversam

-----dataset: vehicle1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.75, 0.5535714285714286, 0.5714285714285714, 0.5652173913043478, 0.41509433962264153, 0.54, 0.6, 0.5777777777777777, 0.5384615384615384, 0.5535714285714286]
-----dataset: vehicle1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.6136363636363636, 0.7209302325581395, 0.6511627906976745, 0.6046511627906976, 0.5, 0.6136363636363636, 0.7674418604651163, 0.6046511627906976, 0.6511627906976745, 0.7045454545454546]
-----dataset: vehicle1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9285714285714286, 0.8015873015873016, 0.8333333333333334, 0.8412698412698413, 0.752, 0.8174603174603174, 0.8253968253968254, 0.8492063492063492, 0.8095238095238095, 0.8]
-----dataset: vehicle1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.5665122475737734
-----dat

-----dataset: vehicle2.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.9777777777777777, 1.0, 0.9555555555555556, 0.9767441860465116, 1.0, 0.9777777777777777, 0.9545454545454546, 0.9761904761904762, 1.0, 0.9772727272727273]
-----dataset: vehicle2.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[1.0, 0.9534883720930233, 1.0, 0.9545454545454546, 1.0, 1.0, 0.9767441860465116, 0.9534883720930233, 1.0, 0.9772727272727273]
-----dataset: vehicle2.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9920634920634921, 1.0, 0.9841269841269841, 0.992, 1.0, 0.9920634920634921, 0.9841269841269841, 0.9920634920634921, 1.0, 0.992]
-----dataset: vehicle2.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.9795863955166281
-----dataset: vehicle2.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_recall_scores-----


-----dataset: vehicle3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.5116279069767442, 0.5681818181818182, 0.5227272727272727, 0.6170212765957447, 0.5573770491803278, 0.6382978723404256, 0.5365853658536586, 0.6, 0.5454545454545454, 0.48333333333333334]
-----dataset: vehicle3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.5116279069767442, 0.5952380952380952, 0.5476190476190477, 0.6904761904761905, 0.7906976744186046, 0.6976744186046512, 0.5238095238095238, 0.7142857142857143, 0.5714285714285714, 0.6744186046511628]
-----dataset: vehicle3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.8346456692913385, 0.8503937007874016, 0.8346456692913385, 0.8582677165354331, 0.7857142857142857, 0.8661417322834646, 0.8503937007874016, 0.84251968503937, 0.84251968503937, 0.753968253968254]
-----dataset: vehicle3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
---

-----dataset: wisconsin.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.9591836734693877, 0.9791666666666666, 0.96, 0.9019607843137255, 0.94, 1.0, 0.8703703703703703, 1.0, 0.9387755102040817, 0.94]
-----dataset: wisconsin.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.9791666666666666, 0.9791666666666666, 1.0, 0.9787234042553191, 0.9791666666666666, 0.9791666666666666, 0.9791666666666666, 0.9791666666666666, 0.9787234042553191, 0.9791666666666666]
-----dataset: wisconsin.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9775280898876404, 0.9887640449438202, 0.9775280898876404, 0.9438202247191011, 0.9659090909090909, 1.0, 0.9213483146067416, 1.0, 0.9662921348314607, 0.9659090909090909]
-----dataset: wisconsin.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.9489457005024231
-----dataset: wisconsin.dat-----
---

-----dataset: yeast1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.5952380952380952, 0.6122448979591837, 0.5377358490566038, 0.6506024096385542, 0.6, 0.5783132530120482, 0.5638297872340425, 0.6375, 0.6161616161616161, 0.5463917525773195]
-----dataset: yeast1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.5813953488372093, 0.6976744186046512, 0.6627906976744186, 0.627906976744186, 0.6, 0.5581395348837209, 0.6162790697674418, 0.5930232558139535, 0.7093023255813954, 0.6235294117647059]
-----dataset: yeast1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.8388625592417062, 0.8199052132701422, 0.7677725118483413, 0.8625592417061612, 0.8388625592417062, 0.8341232227488151, 0.8056872037914692, 0.8625592417061612, 0.8199052132701422, 0.7914691943127962]
-----dataset: yeast1.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score

-----dataset: yeast3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.6756756756756757, 0.8, 0.6923076923076923, 0.7435897435897436, 0.7894736842105263, 0.8235294117647058, 0.725, 0.7209302325581395, 0.8, 0.7647058823529411]
-----dataset: yeast3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.78125, 0.7272727272727273, 0.8181818181818182, 0.8787878787878788, 0.9375, 0.875, 0.8787878787878788, 0.9393939393939394, 0.7272727272727273, 0.8125]
-----dataset: yeast3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9547169811320755, 0.9772727272727273, 0.9545454545454546, 0.9621212121212122, 0.9696969696969697, 0.9773584905660377, 0.9583333333333334, 0.9545454545454546, 0.9772727272727273, 0.9696969696969697]
-----dataset: yeast3.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.7535212322459424
-----dataset: yeast3.d

-----dataset: abalone-17_vs_7-8-9-10.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.21052631578947367, 0.1875, 0.24, 0.2608695652173913, 0.2727272727272727, 0.25, 0.2777777777777778, 0.26666666666666666, 0.2777777777777778, 0.1875]
-----dataset: abalone-17_vs_7-8-9-10.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.3333333333333333, 0.25, 0.5, 0.5454545454545454, 0.2727272727272727, 0.25, 0.4166666666666667, 0.3333333333333333, 0.45454545454545453, 0.2727272727272727]
-----dataset: abalone-17_vs_7-8-9-10.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9671052631578947, 0.9714912280701754, 0.9583333333333334, 0.9627192982456141, 0.9824561403508771, 0.9802631578947368, 0.9714912280701754, 0.9758771929824561, 0.9714912280701754, 0.9714912280701754]
-----dataset: abalone-17_vs_7-8-9-10.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_p

-----dataset: abalone19.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.16666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0]
-----dataset: abalone19.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.16666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.14285714285714285, 0.0, 0.0]
-----dataset: abalone19.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9939686369119421, 0.9903498190591074, 0.9975845410628019, 0.9915458937198067, 0.9939613526570048, 0.9891435464414958, 0.9879372738238842, 0.9951690821256038, 0.998792270531401, 0.9915458937198067]
-----dataset: abalone19.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.03666666666666667
-----dataset: abalone19.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_recall_scores-----

0.030952380952380953
-----dataset: abalone19.dat-----
-

-----dataset: kr-vs-k-three_vs_eleven.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: kr-vs-k-three_vs_eleven.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8823529411764706]
-----dataset: kr-vs-k-three_vs_eleven.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: kr-vs-k-three_vs_eleven.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

1.0
-----dataset: kr-vs-k-three_vs_eleven.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_recall_scores-----

0.9882352941176471
-----dataset: kr-vs-k-three_vs_eleven.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_specifity_scores-----

1.0
-----dataset: kr-vs-k-three

-----dataset: kr-vs-k-zero-one_vs_draw.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[1.0, 0.9473684210526315, 0.9523809523809523, 0.9523809523809523, 1.0, 0.9523809523809523, 0.9444444444444444, 1.0, 1.0, 1.0]
-----dataset: kr-vs-k-zero-one_vs_draw.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.9523809523809523, 0.8571428571428571, 0.9523809523809523, 0.9523809523809523, 0.9523809523809523, 0.9523809523809523, 0.8095238095238095, 0.9523809523809523, 0.9523809523809523, 0.8095238095238095]
-----dataset: kr-vs-k-zero-one_vs_draw.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[1.0, 0.998211091234347, 0.998211091234347, 0.998211091234347, 1.0, 0.9982142857142857, 0.998211091234347, 1.0, 1.0, 1.0]
-----dataset: kr-vs-k-zero-one_vs_draw.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.9748955722639933
-----dataset:

-----dataset: shuttle-2_vs_5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: shuttle-2_vs_5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: shuttle-2_vs_5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: shuttle-2_vs_5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

1.0
-----dataset: shuttle-2_vs_5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_recall_scores-----

1.0
-----dataset: shuttle-2_vs_5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_specifity_scores-----

1.0
-----dataset: shuttle-2_vs_5.dat-----
-----oversampling_memthod: MeansShiftSMOTE-----
-----metryka: precission_

-----dataset: shuttle-c0-vs-c4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: shuttle-c0-vs-c4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: shuttle-c0-vs-c4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
-----dataset: shuttle-c0-vs-c4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

1.0
-----dataset: shuttle-c0-vs-c4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_recall_scores-----

1.0
-----dataset: shuttle-c0-vs-c4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_specifity_scores-----

1.0
-----dataset: shuttle-c0-vs-c4.dat-----
-----oversampling_memthod: MeansShiftSMOTE-----
-----metryk

-----dataset: yeast-0-2-5-6_vs_3-7-8-9.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.5217391304347826, 0.631578947368421, 0.7142857142857143, 0.6, 0.5789473684210527, 0.6923076923076923, 0.6086956521739131, 0.7222222222222222, 0.6842105263157895, 0.55]
-----dataset: yeast-0-2-5-6_vs_3-7-8-9.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.6, 0.6, 0.5, 0.6, 0.5789473684210527, 0.45, 0.7, 0.65, 0.65, 0.5789473684210527]
-----dataset: yeast-0-2-5-6_vs_3-7-8-9.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9392265193370166, 0.9613259668508287, 0.9779005524861878, 0.9558011049723757, 0.9558011049723757, 0.9779005524861878, 0.9502762430939227, 0.9723756906077348, 0.9668508287292817, 0.9502762430939227]
-----dataset: yeast-0-2-5-6_vs_3-7-8-9.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.6303987253529587
-----d

-----dataset: yeast-0-2-5-7-9_vs_3-6-8.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.7894736842105263, 0.85, 0.8181818181818182, 0.782608695652174, 0.6923076923076923, 0.7368421052631579, 0.9444444444444444, 0.7777777777777778, 0.8181818181818182, 0.7619047619047619]
-----dataset: yeast-0-2-5-7-9_vs_3-6-8.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.75, 0.85, 0.9, 0.9, 0.47368421052631576, 0.7, 0.85, 0.7, 0.9, 0.8421052631578947]
-----dataset: yeast-0-2-5-7-9_vs_3-6-8.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9779005524861878, 0.9834254143646409, 0.9779005524861878, 0.9723756906077348, 0.9779005524861878, 0.9723756906077348, 0.994475138121547, 0.9779005524861878, 0.9779005524861878, 0.9723756906077348]
-----dataset: yeast-0-2-5-7-9_vs_3-6-8.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.79717227

-----dataset: yeast4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.42857142857142855, 0.3333333333333333, 0.2777777777777778, 0.6, 0.6666666666666666, 0.5, 0.8, 0.0, 0.4, 0.4166666666666667]
-----dataset: yeast4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.3, 0.3, 0.5, 0.5454545454545454, 0.4, 0.3, 0.4, 0.0, 0.36363636363636365, 0.5]
-----dataset: yeast4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9860627177700348, 0.9790940766550522, 0.9547038327526133, 0.986013986013986, 0.993006993006993, 0.9895470383275261, 0.9965156794425087, 0.9825783972125436, 0.9790209790209791, 0.9755244755244755]
-----dataset: yeast4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.4423015873015873
-----dataset: yeast4.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_recall_scores-----

0.36090909090

-----dataset: yeast5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: precission_scores-----

[0.5384615384615384, 0.6666666666666666, 0.5, 1.0, 0.8571428571428571, 0.8333333333333334, 0.625, 0.8888888888888888, 0.6923076923076923, 0.75]
-----dataset: yeast5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: recall_scores-----

[0.7777777777777778, 0.6666666666666666, 0.5555555555555556, 0.8888888888888888, 0.75, 0.5555555555555556, 0.5555555555555556, 0.8888888888888888, 1.0, 0.75]
-----dataset: yeast5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: specifity_scores-----

[0.9791666666666666, 0.9895833333333334, 0.9826388888888888, 1.0, 0.9965277777777778, 0.9965277777777778, 0.9895833333333334, 0.9965277777777778, 0.9861111111111112, 0.9930555555555556]
-----dataset: yeast5.dat-----
-----oversampling_memthod: KMeansSMOTE-----
-----metryka: mean_precission_score-----

0.7351800976800977
-----dataset: yeast5.dat-----
-----oversamplin

##### Overview of the statistics per folder for the metric selected in metricsForEval

In [12]:
import os
import pandas as pd
import plotly.express as px

# Metric to plot. Only one assignment should be active; swap the comment to switch.
# metricsForEval = 'precission_scores'
metricsForEval = 'recall_scores'

files_in_midbalance = os.listdir('mild-imbalance')

temp_mid_imbalance = []
temp_high_imbalance = []

# Pool the per-fold scores of every method, split by the imbalance level of the dataset.
for dataset in results:
    target = temp_mid_imbalance if dataset in files_in_midbalance else temp_high_imbalance
    for method in results[dataset]:
        target.extend(results[dataset][method][metricsForEval])

# Wrap the lists in Series so that columns of different lengths are padded with NaN
# instead of raising a length-mismatch error.
viewDF = pd.DataFrame({
    'mild-imbalance': pd.Series(temp_mid_imbalance),
    'high-imbalance': pd.Series(temp_high_imbalance),
})
fig = px.histogram(viewDF, nbins=100, title=metricsForEval)
fig.show()

### Statistically significantly better preprocessings in given datasets with given metrics

In [17]:
from scipy.stats import ttest_rel, wilcoxon, shapiro
from tabulate import tabulate

alfa = .05
methods = ["KMeansSMOTE", "MeansShiftSMOTE", "SMOTE", "ROS", "BorderlineSMOTE"]
metrics = ["precission_scores","recall_scores","specifity_scores"]

for directory in results:
    for metric in metrics:
        w_statistic = np.zeros((len(methods), len(methods)))
        p_value = np.zeros((len(methods), len(methods)))
        test_used = np.empty((len(methods), len(methods)), dtype=object)
        for i, preprocessing_method in enumerate(methods):
            for j, comparison_preprocessing_method in enumerate(methods):
                metric_results_one = results[directory][preprocessing_method][metric]
                metric_results_two = results[directory][comparison_preprocessing_method][metric]
                # wilcoxon may raise ValueError (e.g. when all paired differences
                # are zero, as when a method is compared with itself); such a pair
                # is treated as not significantly different.
                try:
                    _, tmp_p_val = wilcoxon(metric_results_one, metric_results_two)
                except ValueError:
                    tmp_p_val = 1
                mean_metric_one = np.mean(metric_results_one)
                mean_metric_two = np.mean(metric_results_two)
                # w_statistic stores the advantage flag (row mean > column mean),
                # p_value stores the significance flag (p <= alfa).
                w_statistic[i, j] = 1 if mean_metric_one > mean_metric_two else 0
                p_value[i, j] = 1 if tmp_p_val <= alfa else 0
        # 1 only where the row method is better on average and the difference is significant
        stat_better = w_statistic * p_value
        stat_better_table = tabulate(stat_better, methods)
        print(f"Statistically significantly better {metric}:")
        print(stat_better_table)
        print()
        print()
        print()
        

Statistically significantly better precission_scores:
  KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
-------------  -----------------  -------  -----  -----------------
            0                  0        0      0                  0
            0                  0        0      0                  0
            0                  0        0      0                  0
            1                  1        1      0                  1
            0                  0        0      0                  0



Statistically significantly better recall_scores:
  KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
-------------  -----------------  -------  -----  -----------------
            0                  0        0      1                  0
            0                  0        0      1                  0
            0                  1        0      1                  0
            0                  0        0      0                  0
         

### Statistically significantly better preprocessings for all datasets

In [18]:
from scipy.stats import rankdata, ranksums

methods = ["KMeansSMOTE", "MeansShiftSMOTE", "SMOTE", "ROS", "BorderlineSMOTE"]
metrics = ["mean_precission_score", "mean_recall_scores", "mean_specifity_scores"]
for metric in metrics:
    mean = []
    for directory in results:
        preprocessing_mean = []
        for i, preprocessing_method in enumerate(methods):
            preprocessing_mean.append(results[directory][preprocessing_method][metric])
        mean.append(preprocessing_mean)

    ranks = []
    for mean_score in mean:
        ranks.append(rankdata(mean_score).tolist())
    ranks = np.array(ranks)

    alfa = .05
    w_statistic = np.zeros((len(methods), len(methods)))
    p_value = np.zeros((len(methods), len(methods)))
    for i in range(len(methods)):
        for j in range(len(methods)):
            w_statistic[i, j], p_value[i, j] = ranksums(ranks.T[i], ranks.T[j])
    names_column = np.expand_dims(np.array(list(methods)), axis=1)
    w_statistic_table = np.concatenate((names_column, w_statistic), axis=1)
    w_statistic_table = tabulate(w_statistic_table, methods, floatfmt=".2f")
    p_value_table = np.concatenate((names_column, p_value), axis=1)
    p_value_table = tabulate(p_value_table, methods, floatfmt=".2f")
    advantage = np.zeros((len(methods), len(methods)))
    advantage[w_statistic > 0] = 1
    advantage_table = tabulate(
        np.concatenate((names_column, advantage), axis=1), methods)
    significance = np.zeros((len(methods), len(methods)))
    significance[p_value <= alfa] = 1
    statisticaly_better = advantage * significance
    statisticaly_better_table = tabulate(
        np.concatenate((names_column, statisticaly_better), axis=1), methods)
    print(f"Metric: {metric}")
    print("Statistical significance (alpha = 0.05):")
    print(statisticaly_better_table)
    print()
    print()

Metric: mean_precission_score
Statistical significance (alpha = 0.05):
                   KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
---------------  -------------  -----------------  -------  -----  -----------------
KMeansSMOTE                  0                  0        0      0                  0
MeansShiftSMOTE              0                  0        0      0                  0
SMOTE                        0                  0        0      0                  0
ROS                          1                  1        1      0                  1
BorderlineSMOTE              0                  0        0      0                  0


Metric: mean_recall_scores
Statistical significance (alpha = 0.05):
                   KMeansSMOTE    MeansShiftSMOTE    SMOTE    ROS    BorderlineSMOTE
---------------  -------------  -----------------  -------  -----  -----------------
KMeansSMOTE                  0                  0        0      1                  0
MeansShif

# Paired tests
## Purpose of the paired tests
In the previous part of this work, the classifiers were evaluated using metrics and cross-validation. This produced a separate score for each configuration, where the configurations differed in the training dataset, the machine learning model and, as the main subject of this work, the data balancing method (TODO: check that nothing has changed). These results now need to be compared to assess whether applying SMOTE to clusters formed from the imbalanced data gives better results than the other oversampling methods. One could average the metrics and check whether the model using the proposed oversampling method scores better on average, but such a comparison cannot be considered valid, because the difference could have arisen by chance. The comparison must also take into account how far the individual results lie from their mean (the standard deviation of the samples). Researchers use statistical tests such as Student's t-test and the Wilcoxon test for this purpose. The tests that can be used in this work are described below.
## Statistical tests used
Statistical tests are applied to different sets of data to assess whether the difference between them is statistically significant.
In this work the goal is to compare the metrics of models that balance the data with SMOTE applied to individual clusters of the input data against those of other commonly used oversampling tools. The data can therefore be split into pairs: the method proposed in this work and another oversampling method. This pairing determines the use of paired tests.
### Student's t-test
Student's t-test is a parametric test, i.e. one based on comparing population parameters such as the standard deviation or the mean.
A necessary condition for applying Student's t-test is the assumption that the compared samples are normally distributed. Normality testing is discussed later in this work. If normality testing shows that the samples are not normal, other, less powerful tests can be used instead, namely nonparametric tests. It should be remembered, however, that parametric tests such as Student's t-test or analysis of variance should be performed whenever possible.\
Parametric statistical tests: https://pogotowiestatystyczne.pl/slowniki/testy-parametryczne/#:~:text=Testy%20parametryczne%20to%20rodzaj%20test%C3%B3w,standardowe%20lub%20innych%20statystykach%20opisowych.
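
As an illustration of what "based on population parameters" means for paired data, the sketch below computes the paired t statistic from the mean and standard deviation of the per-fold differences. The function name `paired_t_statistic` and the scores are made up for this example. In practice `scipy.stats.ttest_rel`, which the notebook already imports, would be used, since it also returns the p-value.

```python
import math

def paired_t_statistic(first, second):
    """t statistic for paired samples; it has len(first) - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    # Mean difference divided by its standard error.
    return mean_diff / math.sqrt(var_diff / n)

# Made-up scores of two hypothetical methods on the same four folds.
method_a = [3, 5, 7, 9]
method_b = [1, 2, 3, 4]
t_stat = paired_t_statistic(method_a, method_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {len(method_a) - 1}")
```

The statistic grows when the mean difference is large relative to the spread of the differences, which is exactly the information that a plain comparison of averages ignores.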
### Wilcoxon test
The Wilcoxon test is a nonparametric test performed on the samples themselves rather than on population parameters. It uses the differences between paired results obtained with two different model hyperparameters. Each difference is assigned to one of two sets: $T_{-}$ when it is negative or $T_{+}$ when it is positive. The sign of every difference is then dropped and each absolute difference receives a rank (ranks are assigned in ascending order of absolute difference, and tied values share the average of their ranks). The ranks in $T_{-}$ and in $T_{+}$ are then summed. The smaller of the two rank sums is taken and compared against the table of critical values for the Wilcoxon test.
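
The procedure described above can be sketched in plain Python as follows. The helper `signed_rank_sums` and the integer scores are made up for illustration. Dropping zero differences is an assumption that matches the classic form of the test. The notebook itself relies on `scipy.stats.wilcoxon`, which additionally returns the p-value.

```python
def signed_rank_sums(first, second):
    """Return (T_plus, T_minus): rank sums of the positive and negative differences."""
    # Paired differences; zero differences carry no sign and are dropped.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    # Indices ordered by absolute difference, smallest first.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        # Find the run of tied absolute differences starting at pos.
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        # Tied values share the average of the 1-based ranks they span.
        shared_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus

# Made-up scores (in percent) of two hypothetical methods on the same five folds.
method_a = [60, 70, 65, 80, 75]
method_b = [55, 72, 60, 70, 75]
t_plus, t_minus = signed_rank_sums(method_a, method_b)
statistic = min(t_plus, t_minus)
print(t_plus, t_minus, statistic)  # 9.0 1.0 1.0
```

Here the differences are 5, -2, 5, 10 and 0. The zero is dropped, the absolute values 2, 5, 5, 10 receive ranks 1, 2.5, 2.5 and 4, and the smaller rank sum is the one looked up in the table of critical values.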
### Testing the normality assumption
To assess whether samples from a population follow a given distribution, the Kolmogorov-Smirnov test can be used. Before the results can be evaluated with Student's t-test, the normality assumption must be satisfied, i.e. it must be established whether the samples could come from a population with a normal distribution. For this purpose the Kolmogorov-Smirnov test is run on the previously obtained results against the normal distribution.
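
A minimal sketch of the quantity this test is built on: the Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution of the sample and the normal cumulative distribution function. The helper names and the scores below are made up for illustration. Estimating the mean and standard deviation from the same sample is an assumption of this sketch, and in that setting the standard Kolmogorov-Smirnov critical values are only approximate.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of the normal distribution N(mean, std)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic_normal(sample, mean, std):
    """Largest distance between the empirical CDF of sample and the normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = normal_cdf(x, mean, std)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Made-up fold scores of a single hypothetical method.
scores = [0.51, 0.60, 0.55, 0.69, 0.79, 0.70, 0.52, 0.71, 0.57, 0.67]
mean = sum(scores) / len(scores)
std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (len(scores) - 1))
d_stat = ks_statistic_normal(scores, mean, std)
print(f"D = {d_stat:.3f}")
```

A small D means the sample is compatible with the normal distribution. Turning D into a decision requires a p-value or a critical value, which a library routine supplies.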

Bibliography: https://www.scirp.org/html/6-1241391_107034.htm
https://onlinelibrary.wiley.com/doi/full/10.1002/9781118445112.stat06558

The normality assumption will be checked with the kstest function from the scipy package. \
A step-by-step guide: https://medium.com/@ricardojaviermartnezsustegui/kolmog%C3%B3rov-smirnov-test-in-python-step-by-step-1b7532021bd2

There is also an alternative: scipy.stats.normaltest