<center><h1>Classification by regional modeling</h1></center>

Classification by regional modeling consists in a six-step approach:
1. Setting the hyper-parameters. In this step, we specify the number of SOM prototypes $C$. We must also define the maximum number of regions $K_{max}$. Without any prior knowledge, we set $K_{max} = \sqrt{C}$ in this example.


2. SOM training. In order to build regional models, we follow the procedure introduced by Vesanto and Alhoniemi [1]. Thus, the very first step requires training the SOM as usual, with $C$ prototypes.


3. Clustering of the SOM. This step consists in performing clustering over the $C$ SOM prototypes. Although any clustering algorithm may be used for this step, for the sake of simplicity we use the standard K-means algorithm in combination with the Davies–Bouldin (DB) index. The DB index is a clustering validity measure commonly used for finding the optimal number of clusters, but any suitable measure can be used instead (see [2]). Thus, for each $K = 2, \ldots, K_{max}$ we compute a partitioning of the SOM prototypes and the corresponding DB index value (the DB index is undefined for a single cluster). The optimal number of partitions, $K_{opt}$, is then the value of $K$ which minimizes the DB index.


4. Partitioning SOM prototypes into regions. Once $K_{opt}$ is selected, the $r$-th cluster of SOM prototypes, $r = 1, \ldots, K_{opt}$, is composed of all weight vectors $w_i$ that are mapped onto the prototype $p_r$ of the K-means algorithm. More formally, the set of SOM prototypes associated with the $r$-th prototype of the K-means algorithm is defined as:
$$W_r = \{w_i \in \mathbb{R}^{p+q} \mid \|w_i-p_r\| < \|w_i-p_j\|, \ \forall j = 1,\ldots,K_{opt},\ j\neq r \}$$


5. Mapping data points to regions. The fifth step consists in finding $K_{opt}$ data partitions, denoted by $\{X_1\}$, $\{X_2\}$, ... , $\{X_{K_{opt}}\}$, of the training dataset by mapping each data point to a region $r \in \{1, ... , K_{opt}\}$. Let $N_r$ denote the number of data vectors in $\{X_r\}$. Then, the partition $\{X_r\}$ is composed of those input vectors $x_{rμ}$, $μ = 1, ... , N_r$, whose closest SOM prototype belongs to $W_r$.


6. Building classification models over the regions. Finally, once the original dataset has been divided into $K_{opt}$ subsets (one per region), the last step consists in building $K_{opt}$ regional classification models using $X_r$, $r = 1, ... , K_{opt}$.
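
Steps 4 and 5 above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used later in this notebook: the names `nearest` and `map_to_regions` and the toy arrays are hypothetical, and Euclidean distance is assumed throughout.

```python
import numpy as np

def nearest(points, centers):
    """Index of the closest center (Euclidean distance) for each point."""
    # pairwise squared distances, shape (n_points, n_centers)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def map_to_regions(X, som_prototypes, kmeans_centers):
    """Map each data point to the region of its closest SOM prototype."""
    prototype_region = nearest(som_prototypes, kmeans_centers)  # step 4: sets W_r
    winner = nearest(X, som_prototypes)                         # closest SOM prototype
    return prototype_region[winner]                             # step 5: partitions X_r

# Toy example (hypothetical values): four prototypes grouped into two regions
W = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
P = np.array([[0.0, 0.5], [5.0, 5.5]])
X = np.array([[0.1, 0.2], [4.9, 5.8], [0.2, 0.9]])
regions = map_to_regions(X, W, P)
```

Note that a data point is never compared with the K-means prototypes directly: it inherits the region of its winning SOM prototype.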

To test this framework, the datasets below were gathered from the UCI repository:
* Vertebral Column
* Wall-Following
* Parkinson (the one used in the course)

In [1]:
# loading datasets
import pandas as pd

# Vertebral Column
# dataset for classification between Normal (NO) and Abnormal (AB)
vc2c = pd.read_csv('vertebral_column_data/column_2C.dat', delim_whitespace=True, header=None)
# dataset for classification between DH (Disk Hernia), Spondylolisthesis (SL) and Normal (NO)
vc3c = pd.read_csv('vertebral_column_data/column_3C.dat', delim_whitespace=True, header=None)

# Wall-Following
# dataset with all 24 ultrasound sensor readings
wf24f = pd.read_csv('wall_following_data/sensor_readings_24.data', header=None)
# dataset with simplified 4 readings (front, left, right and back)
wf4f  = pd.read_csv('wall_following_data/sensor_readings_4.data',  header=None)
# dataset with simplified 2 readings (front and left)
wf2f  = pd.read_csv('wall_following_data/sensor_readings_2.data',  header=None)

# Parkinson (31 people, 23 with Parkinson's disease (PD))
temp = pd.read_csv('parkinson_data/parkinsons.data')
pk = temp.drop(columns=['name']) # dropping the 'name' column

In [2]:
pk_features = pk.columns.tolist()
pk_features.remove('status')

# datasets with separation between 'features' and 'labels'
datasets = {
    "vc2c":  {"features": vc2c.iloc[:,0:6],  "labels": pd.get_dummies(vc2c.iloc[:,6],  drop_first=True)},
    "vc3c":  {"features": vc3c.iloc[:,0:6],  "labels": pd.get_dummies(vc3c.iloc[:,6],  drop_first=True)},
    "wf24f": {"features": wf24f.iloc[:,0:24],"labels": pd.get_dummies(wf24f.iloc[:,24],drop_first=True)},
    "wf4f":  {"features": wf4f.iloc[:,0:4],  "labels": pd.get_dummies(wf4f.iloc[:,4],  drop_first=True)},
    "wf2f":  {"features": wf2f.iloc[:,0:2],  "labels": pd.get_dummies(wf2f.iloc[:,2],  drop_first=True)},
    "pk":    {"features": pk.loc[:,pk_features], "labels": pk.loc[:,["status"]]}
}

Note: We chose to keep $k-1$ dummy variables when there are $k$ categories, so the dropped category is identified when all dummy variables are zero.
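
As a small illustration of this encoding, the sketch below rebuilds the $k-1$ dummy columns with plain NumPy and decodes them back. The helper names `encode_drop_first` and `decode_drop_first` are hypothetical; the sketch assumes, as `pd.get_dummies(..., drop_first=True)` does, that the dropped category is the first one in sorted order.

```python
import numpy as np

def encode_drop_first(labels):
    """Encode labels as k-1 dummy columns, dropping the first sorted category."""
    labels = np.asarray(labels)
    categories = np.unique(labels)  # sorted unique categories
    dummies = (labels[:, None] == categories[None, 1:]).astype(int)
    return categories, dummies

def decode_drop_first(categories, dummies):
    """Recover labels; an all-zero row means the dropped (first) category."""
    idx = np.where(dummies.any(axis=1), dummies.argmax(axis=1) + 1, 0)
    return categories[idx]

labels = np.array(["NO", "DH", "SL", "DH"])
categories, dummies = encode_drop_first(labels)
decoded = decode_drop_first(categories, dummies)
```

Here the sorted categories are `DH`, `NO`, `SL`, so `DH` is the dropped category and is represented by a row of zeros.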

## Step 1: Setting the hyper-parameters.
## Step 2: SOM training.

The code below implements the class *SOM_2D* (a self-organizing map on a two-dimensional grid) and a function to plot the data and the neurons over the training iterations in the special case where the feature space is also two-dimensional.

In [3]:
import numpy as np
import plotly.offline as plt
import plotly.graph_objs as go
from math import ceil
from random import randint
import ipywidgets as widgets
from IPython.display import clear_output, display
from plotly import tools
from multiprocessing import Pool

plt.init_notebook_mode(connected=True) # enabling plotly inside jupyter notebook

class SOM_2D:
    'Class of Self Organizing Maps connected in a two-dimensional grid.'
    
    def __init__(self, nRows, nColumns, dim): 
        self.nRows    = nRows
        self.nColumns = nColumns
        self.dim = dim   # neurons dimension = features dimension
        self.nEpochs = 0 # number of epochs of trained SOM
        
        self.param = np.zeros((dim, nRows, nColumns))
        self.paramHist = None
        self.ssdHist   = None
        
    def init(self, X): # giving the data, so we can define maximum and minimum in each dimension
        self.paramHist = None # reset paramHist and ssdHist
        self.ssdHist   = None
        
        # Auxiliary random element
        rand_01 = np.random.rand(self.dim, self.nRows, self.nColumns)
        # find min-max for each dimension:
        minimum = np.amin(X, axis=0)
        maximum = np.amax(X, axis=0)
        for dim in range(self.dim):
            self.param[dim,:,:] = (maximum[dim]-minimum[dim])*rand_01[dim,:,:] + minimum[dim]
        
    
    #def update_neuron(self, args):
    #    row, column, winner_idx, alpha, sigma, i = args
    #    h_ik = self.h_neighbor(winner_idx, [row, column], sigma)
    #    self.param[:,row,column] += alpha * h_ik * (X[i] - self.param[:,row,column])
        
    
    def train(self, X, alpha0, sigma0, nEpochs=100, batchSize=100, saveParam=False, saveSSD=True, tol=1e-6,
              verboses=0):
        tau1 = nEpochs/sigma0
        tau2 = nEpochs
        SSD_new = self.SSD(X) # initial SSD, from random parameters
        
        if saveParam: 
            self.paramHist = np.zeros((nEpochs+1, self.dim, self.nRows, self.nColumns))
            self.paramHist[0,:,:,:] = self.param # random parameters
        if saveSSD:
            self.ssdHist = np.zeros((nEpochs+1))
            self.ssdHist[0] = SSD_new # initial SSD, from random parameters
        
        sigma = sigma0
        alpha = alpha0
        inertia = np.inf # initial value of inertia
        batchSize = X.shape[0] if X.shape[0] < batchSize else batchSize # adjusting ill defined batchSize
        for epoch in range(nEpochs):
            # Updating alpha and sigma
            sigma = sigma0*np.exp(-epoch/tau1);
            alpha = alpha0*np.exp(-epoch/tau2);
            
            # shuffled order
            order = np.random.permutation(X.shape[0])
            for i in range(batchSize):
                x = X[order[i]] # sample drawn in shuffled order
                # search for winner neuron
                winner_idx = self.get_winner(x)
                
                # updating neurons weights
                for row in range(self.nRows):
                    for column in range(self.nColumns):
                        h_ik = self.h_neighbor(winner_idx, [row, column], sigma)
                        self.param[:,row,column] += alpha * h_ik * (x - self.param[:,row,column])
                
               
            
            
            self.nEpochs = epoch+1 # saving number of epochs
            if verboses==1:
                print("End of epoch {}".format(self.nEpochs))
            
            SSD_old = SSD_new
            SSD_new = self.SSD(X)
            inertia = abs((SSD_old - SSD_new)/SSD_old)
            
            # Saving if necessary
            if saveParam:
                self.paramHist[epoch+1,:,:,:] = self.param
            if saveSSD:
                self.ssdHist[epoch+1] = SSD_new
                       
            if inertia < tol: # maybe break before nEpochs
                # history cutting
                if saveParam:
                    self.paramHist = self.paramHist[0:epoch+2,:,:,:]
                if saveSSD:
                    self.ssdHist = self.ssdHist[0:epoch+2]
                
                break
            
            
    def SSD(self, X):
        SSD = 0
        for x in X:
            dist_min = np.inf
            for row in range(self.nRows):
                for column in range(self.nColumns):
                    temp = x - self.param[:,row,column]
                    dist = np.dot(temp,temp)
                    if dist < dist_min:
                        dist_min = dist
            SSD += dist_min
        return SSD
        
        
    def get_winner(self, x):
        dist_matrix = np.zeros((self.nRows, self.nColumns)) # norm**2
        for row in range(self.nRows):
            for column in range(self.nColumns):
                aux = x - self.param[:,row,column]
                dist_matrix[row,column] = np.dot(aux,aux)
        # argmin indexes the flattened (row-major) matrix; convert it back to [row, column]
        row, column = np.unravel_index(dist_matrix.argmin(), dist_matrix.shape)
        return [int(row), int(column)]
    
    
    def h_neighbor(self, idx_1, idx_2, sigma):
        aux = np.asarray(idx_1) - np.asarray(idx_2)
        return np.exp( -np.dot(aux,aux)/(2*sigma**2) )
    
    def getLabels(self, X):
        N = len(X)
        labels = np.zeros((N,2))
        #labels = [self.get_winner(X[i,:]) for i in range(len(X))]
        for i in range(N):
            labels[i,:] = self.get_winner(X[i,:])
            
        return labels
    
    def plotSSD(self):
        traceData = go.Scatter(
            x = list(range(len(self.ssdHist))), # epochs (0 = initial random parameters)
            y = self.ssdHist, 
            mode='lines',
            name='SSD')
        data = [traceData]
        layoutData = go.Layout(
            title = "SSD history",
            xaxis=dict(title='Epoch'),
            yaxis=dict(title='SSD')
        )

        fig = go.Figure(data=data, layout=layoutData)
        plt.iplot(fig)

    def paramAsMatrix(self): # return the 3D matrix of param as a 2D matrix as in k-means
        som_clusters = np.zeros((self.nRows*self.nColumns, self.dim))
        count=0
        for r in range(self.nRows):
            for c in range(self.nColumns):
                som_clusters[count] = self.param[:,r,c]
                count+=1
        return som_clusters
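
The nested loops over rows and columns in `train` and `get_winner` can also be written with NumPy broadcasting. The sketch below shows one online update step under the same parameter layout `(dim, nRows, nColumns)` and the same Gaussian neighbourhood; the function name `som_step` and the toy values are hypothetical, and this is not part of the class above.

```python
import numpy as np

def som_step(param, x, alpha, sigma):
    """One online SOM update. param has shape (dim, nRows, nColumns)."""
    # squared distance from x to every neuron, shape (nRows, nColumns)
    d2 = ((param - x[:, None, None]) ** 2).sum(axis=0)
    win_r, win_c = np.unravel_index(d2.argmin(), d2.shape)
    # squared grid distance of every neuron to the winner
    rows, cols = np.indices(d2.shape)
    g2 = (rows - win_r) ** 2 + (cols - win_c) ** 2
    h = np.exp(-g2 / (2 * sigma ** 2))               # Gaussian neighbourhood
    param += alpha * h * (x[:, None, None] - param)  # in-place update of all neurons
    return int(win_r), int(win_c)

rng = np.random.default_rng(0)
param = rng.random((2, 3, 4))   # 3x4 grid of 2-D neurons
x = np.array([0.5, 0.5])
before = param.copy()
winner = som_step(param, x, alpha=0.5, sigma=1.0)
```

The winner has neighbourhood value 1, so it moves a fraction `alpha` of the way towards `x`; every other neuron moves by a smaller fraction that decays with its grid distance to the winner.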

In [4]:
# function to plot SOM in the special case when the feature space is 2D
def plot_SOM(SOM, X): 
    if SOM.paramHist is not None:
        # Int box to change the iteration number
        n_txt = widgets.BoundedIntText(
            value=0,
            min=0,
            max=len(SOM.paramHist)-1,
            step=10,
            description='epoch:'
        )    
        
    # Function to draw the graph
    def atualizarGrafico(change):
        clear_output()

        if SOM.paramHist is not None:
            display(n_txt)    
            n_ = change['new'] # new iteration number
    
        if X is not None:
            datapoints = go.Scatter(
                x = X[:,0], 
                y = X[:,1], 
                mode='markers',
                name='data',
                marker = dict(
                     size = 5,
                     color = '#03A9F4'
                    )
            )

            if SOM.paramHist is not None:
                x = SOM.paramHist[n_,0,:,:].reshape(-1).tolist() 
                y = SOM.paramHist[n_,1,:,:].reshape(-1).tolist()
                name = 'neurons [epoch ='+str(n_)+']'
            else:
                x = SOM.param[0,:,:].reshape(-1).tolist()
                y = SOM.param[1,:,:].reshape(-1).tolist()
                name = 'neurons'
            
            neurons = go.Scatter(x=x, y=y, mode='markers', name=name, 
                                 marker = dict(size=10,color = '#673AB7'))

            data = [datapoints, neurons]

            # each line that connects the neurons
            linhas = [{}]*(2*SOM.nRows*SOM.nColumns - SOM.nRows - SOM.nColumns)
            count=0 # counter to track the current line
            for linha in range(SOM.nRows): # connects from left to right
                for coluna in range(SOM.nColumns): # and from top to bottom
                    try:
                        if SOM.paramHist is not None:
                            x0 = SOM.paramHist[n_,0,linha,coluna]
                            y0 = SOM.paramHist[n_,1,linha,coluna]
                            x1 = SOM.paramHist[n_,0,linha,coluna+1]
                            y1 = SOM.paramHist[n_,1,linha,coluna+1]
                        else:
                            x0 = SOM.param[0,linha,coluna]
                            y0 = SOM.param[1,linha,coluna]
                            x1 = SOM.param[0,linha,coluna+1]
                            y1 = SOM.param[1,linha,coluna+1]
                            
                        linhas[count]= {'type':'line','x0':x0,'y0': y0,'x1':x1,'y1':y1,
                                        'line': {'color': '#673AB7','width': 1,}}
                        count+=1
                    except:
                        pass
                    try:
                        if SOM.paramHist is not None:
                            x0 = SOM.paramHist[n_,0,linha,coluna]
                            y0 = SOM.paramHist[n_,1,linha,coluna]
                            x1 = SOM.paramHist[n_,0,linha+1,coluna]
                            y1 = SOM.paramHist[n_,1,linha+1,coluna]
                        else:
                            x0 = SOM.param[0,linha,coluna]
                            y0 = SOM.param[1,linha,coluna]
                            x1 = SOM.param[0,linha+1,coluna]
                            y1 = SOM.param[1,linha+1,coluna]
                        
                        linhas[count] = {'type': 'line','x0': x0,'y0': y0,'x1': x1,'y1': y1,
                                         'line': {'color': '#673AB7','width': 1}}
                        count+=1
                    except:
                        pass

            layout = go.Layout(
                title = "Data + SOM",
                xaxis=dict(title="$x_1$"),
                yaxis=dict(title="$x_2$"),
                shapes=linhas
            )

            fig = go.Figure(data=data, layout=layout)
            plt.iplot(fig)
    
    if SOM.paramHist is not None:
        n_txt.observe(atualizarGrafico, names='value')
        
    atualizarGrafico({'new': 0})

Function to scale data:

In [5]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Function to rescale the data
def scale_feat(X_train, X_test, scaleType='min-max'):
    if scaleType=='min-max' or scaleType=='std':
        X_tr_norm = np.copy(X_train) # working on a copy so the original stays available
        X_ts_norm = np.copy(X_test)
        scaler = MinMaxScaler() if scaleType=='min-max' else StandardScaler()
        scaler.fit(X_tr_norm)
        X_tr_norm = scaler.transform(X_tr_norm)
        X_ts_norm = scaler.transform(X_ts_norm)
        return (X_tr_norm, X_ts_norm)
    else:
        raise ValueError("Scale type not defined. Use 'min-max' or 'std'.")
        
import datetime
def printDateTime():
    print(datetime.datetime.now())

The code below loads the SOMs previously trained on each dataset:

Note: The number of neurons was chosen as approximately $5\sqrt{N}$, where $N$ is the number of samples in the dataset, and the neurons were arranged in a square grid.
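
As a rough sketch of this rule of thumb, the side of the square grid can be derived from the number of samples as below. The function name `square_grid_side` is hypothetical, and rounding to the nearest integer is an assumption, since the exact rounding used for the trained maps is not stated.

```python
from math import sqrt

def square_grid_side(n_samples):
    """Side of a square SOM grid with roughly 5*sqrt(N) neurons."""
    target = 5 * sqrt(n_samples)         # heuristic number of neurons
    return max(1, round(sqrt(target)))   # rounding rule is an assumption

side = square_grid_side(100)   # 5*sqrt(100) = 50 neurons targeted
n_neurons = side * side        # neurons actually placed on the square grid
```

For 100 samples the target is 50 neurons, which this rounding turns into a 7x7 grid.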

In [6]:
# Loading object:
import pickle
with open('soms.pkl', 'rb') as f: # 'f' avoids shadowing the built-in input()
    soms = pickle.load(f)

In [7]:
for dataset in datasets:
    print(dataset)
    soms[dataset].plotSSD()

[Output: one "SSD history" plot (SSD vs. epoch) per dataset: vc2c, vc3c, wf24f, wf4f, wf2f, pk]


## Step 3: Clustering of the SOM.

Function (and helper functions) implementing the Davies–Bouldin (DB) validation index:

In [8]:
from numpy.linalg import norm

def DB(model, X, q=2, t=2):
    k = len(model.cluster_centers_) # number of clusters
    db = 0
    for i in range(k):
        db += R(i,q,t,model,X)
    db/=k
    return db

def R(i,q,t,model,X):
    js = [j for j in range(len(model.cluster_centers_)) if j!=i] # for j!=i
    R_iqt = 0 # R_iqt is always >= 0
    for j in js:
        temp = (S(i,q,model,X) + S(j,q,model,X)) / d(i,j,t,model)
        R_iqt = temp if temp > R_iqt else R_iqt # searching for the maximum
    return R_iqt

def S(i,q,model,X):
    Vi = X[np.where(model.labels_ == i)] # partition 'i'
    wi = model.cluster_centers_[i]       # cluster center of partition 'i'
    S_iq = 0
    for x in Vi:
        S_iq += norm(x-wi)**q
    
    S_iq = (S_iq/len(Vi)) ** (1/q)
    return S_iq

def d(i,j,t,model):
    wi = model.cluster_centers_[i]
    wj = model.cluster_centers_[j]
    d_ijt = 0
    for m in range(len(wi)):
        d_ijt += (abs(wi[m] - wj[m])) ** t
    d_ijt = d_ijt ** (1/t)
    return d_ijt
#DB(kmeans, som_clusters)
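
For comparison, the same index can be written in vectorized form. The sketch below follows the definitions used in the functions above (within-cluster dispersion of order `q`, Minkowski distance of order `t` between centres), but takes plain arrays instead of a fitted model; the name `davies_bouldin` and the toy data are hypothetical, and every cluster is assumed to be non-empty.

```python
import numpy as np

def davies_bouldin(points, labels, centers, q=2, t=2):
    """Davies-Bouldin index from points, integer labels and cluster centres."""
    k = len(centers)
    # S_i: q-th root of the mean q-th power distance to the centre
    S = np.array([
        np.mean(np.linalg.norm(points[labels == i] - centers[i], axis=1) ** q) ** (1 / q)
        for i in range(k)
    ])
    # d_ij: Minkowski distance of order t between centres
    diff = np.abs(centers[:, None, :] - centers[None, :, :])
    D = (diff ** t).sum(axis=2) ** (1 / t)
    np.fill_diagonal(D, np.inf)            # exclude j == i from the maximum
    R = ((S[:, None] + S[None, :]) / D).max(axis=1)
    return R.mean()

# Toy data: two well-separated clusters
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
db_value = davies_bouldin(points, labels, centers)
```

In this toy case each cluster has dispersion 1 and the centres are 10 apart, so the index is (1 + 1) / 10 = 0.2; lower values indicate more compact, better separated clusters.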

In [9]:
from numpy import trace

def CH(kmeans, points):
    # ks is the vector of cluster labels. Ex: [0 1 2 3]
    # counts is the number of points in each cluster. Ex: [14 10 5 12]
    ks, counts = np.unique(kmeans.labels_, return_counts=True)
    x_bar = np.mean(points, axis=0).reshape(-1,1)
    
    # between-group scatter matrix
    Bk = np.zeros((points.shape[1],points.shape[1]))
    # within-group scatter matrix
    Wk = np.zeros((points.shape[1],points.shape[1]))
    for i, n_i in zip(ks, counts): # for each prototype
        wi = kmeans.cluster_centers_[i].reshape(-1,1) # prototype of cluster 'i'
        temp = wi - x_bar
        Bk += n_i*np.matmul(temp,temp.T)
        
        Vi = points[kmeans.labels_==i].T
        for l in range(Vi.shape[1]): # for each element of partition 'i'
            temp = Vi[:,l].reshape(-1,1) - wi # column vector, same shape as wi
            Wk += np.matmul(temp,temp.T)
            
    N = len(points) # number of points
    K = len(kmeans.cluster_centers_) # number of clusters
    ch = ( trace(Bk)/(K-1) ) / ( trace(Wk)/(N-K) )
    return ch
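
Since only the traces of the two scatter matrices enter the Calinski–Harabasz (CH) index, it can also be computed from sums of squared distances without building the matrices. The sketch below is one such formulation under that observation; the name `calinski_harabasz` and the toy data are hypothetical, and labels are assumed to be integers in `0..K-1`.

```python
import numpy as np

def calinski_harabasz(points, labels, centers):
    """CH index from points, integer labels and cluster centres."""
    n, k = len(points), len(centers)
    counts = np.bincount(labels, minlength=k)   # points per cluster
    x_bar = points.mean(axis=0)                 # global mean
    # trace of the between-group scatter matrix
    between = (counts * ((centers - x_bar) ** 2).sum(axis=1)).sum()
    # trace of the within-group scatter matrix
    within = ((points - centers[labels]) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two well-separated clusters
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
ch_value = calinski_harabasz(points, labels, centers)
```

For this toy data the between-group trace is 100 and the within-group trace is 4, giving (100 / 1) / (4 / 2) = 50; unlike the DB index, larger CH values are better.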

In [10]:
import base # module providing dunn_fast, used below for the Dunn index

Clustering and searching for the optimal $k$ in the range 2 to $\sqrt{C}$, where $C$ is the number of SOM prototypes:

In [11]:
%%time
    
from sklearn.cluster import KMeans

printDateTime()
validation_indices = {
    'DB':   {},
    'Dunn': {},
    'CH':   {}
}

for dataset_name in datasets:
        som = soms[dataset_name]
        som_clusters = np.zeros((som.nRows*som.nColumns, som.dim))

        count=0
        for r in range(som.nRows):
            for c in range(som.nColumns):
                som_clusters[count] = som.param[:,r,c]
                count+=1

        C = som.nRows*som.nColumns
        #ks = [i for i in range(2, C+1)] # range to search for k in k-means
        ks = [i for i in range(2, ceil(C**(1/2)))]


        n_init = 10 # number of independent rounds of initialization
        validation_indices['DB'][dataset_name]   = [0]*len(ks)
        validation_indices['Dunn'][dataset_name] = [0]*len(ks)
        validation_indices['CH'][dataset_name]   = [0]*len(ks)
        for i in range(len(ks)):
            kmeans = KMeans(n_clusters=ks[i], n_init=n_init, init='random').fit(som_clusters)
            # test if number of distinct clusters == number of clusters specified
            centroids = kmeans.cluster_centers_
            if len(centroids) == len(np.unique(centroids,axis=0)):
                validation_indices['DB'][dataset_name][i] = DB(kmeans,som_clusters)
            else:
                validation_indices['DB'][dataset_name][i] = np.inf

            validation_indices['Dunn'][dataset_name][i] = base.dunn_fast(som_clusters, kmeans.labels_)
            validation_indices['CH'][dataset_name][i]   = CH(kmeans, som_clusters)

        print("End of dataset {}".format(dataset_name))

2019-06-14 10:23:16.284788
End of dataset vc2c
End of dataset vc3c
End of dataset wf24f
End of dataset wf4f
End of dataset wf2f
End of dataset pk
CPU times: user 20.8 s, sys: 16.3 s, total: 37.1 s
Wall time: 9.2 s


In [12]:
def plot_validation_indices(dataset_name, validation_indices):
    data = []
    for index_name, results_vec in validation_indices.items():
    #for validation_index in validation_indices:
        #print(index_name)
        #print(results_vec[dataset_name])
        data.append(go.Scatter(
            x=[i for i in range(2, len(results_vec[dataset_name])+2)],
            y=results_vec[dataset_name], 
            mode='lines+markers', 
            name="{} index".format(index_name)))

    
    layout = go.Layout(
        title = "Indices vs k [{} dataset]".format(dataset_name),
        legend=dict(orientation="h", y=-.05),
        xaxis=dict(title="Number of clusters (k)"),
        yaxis=dict(title="Indices values")
    )

    fig = go.Figure(data=data, layout=layout)
    plt.iplot(fig)


for dataset_name in datasets:
    plot_validation_indices(dataset_name, validation_indices)
    #plot_db(db[dataset_name], dataset_name)
    for index_name, results_vec in validation_indices.items():
        results = results_vec[dataset_name]
        k_opt = np.argmin(results) if index_name=='DB' else np.argmax(results)
        k_opt += 2
        print("K_opt for {} dataset using {} index: {}".format(
              dataset_name, index_name, k_opt))
        
    #print("K_opt for {} dataset is: {}".format(dataset_name, np.argmin(db[dataset_name])+2))

K_opt for vc2c dataset using DB index: 6
K_opt for vc2c dataset using Dunn index: 9
K_opt for vc2c dataset using CH index: 2


K_opt for vc3c dataset using DB index: 8
K_opt for vc3c dataset using Dunn index: 9
K_opt for vc3c dataset using CH index: 2


K_opt for wf24f dataset using DB index: 18
K_opt for wf24f dataset using Dunn index: 7
K_opt for wf24f dataset using CH index: 2


K_opt for wf4f dataset using DB index: 5
K_opt for wf4f dataset using Dunn index: 16
K_opt for wf4f dataset using CH index: 2


K_opt for wf2f dataset using DB index: 5
K_opt for wf2f dataset using Dunn index: 15
K_opt for wf2f dataset using CH index: 2


K_opt for pk dataset using DB index: 8
K_opt for pk dataset using Dunn index: 8
K_opt for pk dataset using CH index: 2
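
The selection rule used above (minimum for DB, maximum for Dunn and CH) can be wrapped in a small helper that also skips invalid entries such as the `np.inf` assigned to degenerate clusterings. This is only a sketch: the name `select_k` and the score values are hypothetical.

```python
import numpy as np

def select_k(k_values, scores, minimize):
    """Return the k whose validity index is best, ignoring inf/NaN entries."""
    scores = np.asarray(scores, dtype=float)
    valid = np.isfinite(scores)
    if not valid.any():
        raise ValueError("no valid index value to choose from")
    # invalid entries are replaced by the worst possible value
    masked = np.where(valid, scores, np.inf if minimize else -np.inf)
    best = masked.argmin() if minimize else masked.argmax()
    return k_values[best]

k_values = [2, 3, 4, 5]
db_scores = [0.9, 0.4, np.inf, 0.6]   # hypothetical values
ch_scores = [12.0, 30.0, 25.0, 8.0]   # hypothetical values
k_db = select_k(k_values, db_scores, minimize=True)
k_ch = select_k(k_values, ch_scores, minimize=False)
```

Indexing `k_values` directly avoids the manual `+ 2` offset, which only holds when the search starts at $k = 2$.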


In [13]:
def plot_kmeans(kmeans, X):
    data = []
    if X is not None:
        datapoints = go.Scatter(
            x = X[:,0], 
            y = X[:,1], 
            mode='markers',
            name='data',
            marker = dict(
                 size = 5,
                 color = '#03A9F4'
                )
        )
        data.append(datapoints)
    
    kmeans_clusters = go.Scatter(
        x=kmeans.cluster_centers_[:,0],
        y=kmeans.cluster_centers_[:,1], 
        mode='markers', 
        name='kmeans clusters', 
        marker = dict(size=10,color = '#673AB7')
    )
    data.append(kmeans_clusters)

    layout = go.Layout(
        title = "Data + KMeans clusters",
        xaxis=dict(title="$x_1$"),
        yaxis=dict(title="$x_2$"),
    )

    fig = go.Figure(data=data, layout=layout)
    plt.iplot(fig)

plot_kmeans(kmeans,som_clusters)

## Step 4: Partitioning SOM prototypes into regions.

## Step 5: Mapping data points to regions.

## Step 6: Building classification models over the regions.

Below is the class *RegionalLinearModel*, which implements OLS in a regional fashion.

In [14]:
from copy import deepcopy
from sklearn import linear_model

class RegionalLinearModel:
    'Class of Regional Linear Models.'
    
    def __init__(self, SOM, Cluster, LinearModel): 
        self.SOM    = SOM
        self.Cluster = Cluster
        self.LinearModel = LinearModel
        self.models_ = None
        self.region_labels = None
        self.regional_models = []
        

    def fit(self, X, Y, verboses, SOM_params=None, Cluster_params=None):
        # SOM training
        if SOM_params is not None:
            if verboses==1:
                print("Start of SOM training: {}".format(datetime.datetime.now()))
            self.SOM.init(X)
            self.SOM.train(X=X, **SOM_params)
        
        # Cluster training
        if Cluster_params is not None:
            if verboses==1:
                print("Start of Clustering SOM prototypes: {}".format(datetime.datetime.now()))
            
            som_clusters = self.SOM.paramAsMatrix() # SOM prototypes as a (C, dim) matrix
            
            # Search for k_opt
            if type(Cluster_params['n_clusters']) is dict: # a search is implied:
                eval_function = Cluster_params['n_clusters']['metric']
                find_best     = Cluster_params['n_clusters']['criteria']
                k_values      = Cluster_params['n_clusters']['k_values']
                
                validation_index = [0]*len(k_values)
                for i in range(len(k_values)):
                    kmeans = KMeans(n_clusters=k_values[i],
                                    n_init=10,
                                    init='random').fit(som_clusters)
                    # test if number of distinct clusters == number of clusters specified
                    centroids = kmeans.cluster_centers_
                    if len(centroids) == len(np.unique(centroids,axis=0)):
                        validation_index[i] = eval_function(kmeans,som_clusters)
                    else:
                        validation_index[i] = np.nan
                
                k_opt = k_values[find_best(validation_index)]
                if verboses==1:
                    print("Best k found: {}".format(k_opt))
            else:
                k_opt = Cluster_params['n_clusters']
            
            params = Cluster_params.copy()
            del params['n_clusters'] # n_clusters is passed explicitly below
            self.Cluster = KMeans(n_clusters=k_opt, **params).fit(som_clusters) # real training of clustering algorithm
            
        # number of regions; also valid when an already fitted Cluster was given
        k_opt = len(self.Cluster.cluster_centers_)
        
        # Linear model training
        self.region_labels = self.regionalize(X) # finding labels of datapoints
        if verboses==1:
            print("Start of Linear Model training: {}".format(datetime.datetime.now()))
        
        self.regional_models = [None]*k_opt
        for r in range(k_opt): # for each region
            Xr = X[np.where(self.region_labels == r)[0]]
            Yr = Y[np.where(self.region_labels == r)[0]]

            # fit an independent copy of the base model, so each region keeps its own coefficients
            self.regional_models[r] = deepcopy(self.LinearModel).fit(Xr,Yr)
            
                
    def regionalize(self, X):
        regions = np.zeros(len(X))
        for i in range(len(X)): # for each datapoint
            winner_som_idx = self.SOM.get_winner(X[i]) # find closest neuron
            winner_som_idx = winner_som_idx[0]*self.SOM.nColumns + winner_som_idx[1] # convert to squeezed index
            regions[i] = self.Cluster.labels_[winner_som_idx] # find neuron label index in kmeans
        
        return regions
          
    
    def predict(self, X):
        temp = self.regional_models[0].intercept_
        predictions = np.zeros((X.shape[0],len(temp)))
        
        regions = self.regionalize(X)
        for i in range(len(X)):
            predictions[i,:] = self.regional_models[int(regions[i])].predict(X[i].reshape(1, -1))
        
        return predictions
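
To make the regional idea concrete without the SOM and K-means machinery, the sketch below fits one ordinary least-squares model per region with `np.linalg.lstsq` and predicts each point with the model of its region. It is a simplified stand-in, not the class above: the function names and the toy data are hypothetical, and the region labels are assumed to be given.

```python
import numpy as np

def fit_regional_ols(X, Y, regions, n_regions):
    """Fit one least-squares model (with intercept) per region."""
    models = []
    for r in range(n_regions):
        mask = regions == r
        A = np.hstack([np.ones((mask.sum(), 1)), X[mask]])  # add bias column
        coef, *_ = np.linalg.lstsq(A, Y[mask], rcond=None)
        models.append(coef)   # a separate coefficient array per region
    return models

def predict_regional_ols(models, X, regions):
    """Predict each point with the model of its own region."""
    A = np.hstack([np.ones((len(X), 1)), X])
    return np.array([A[i] @ models[r] for i, r in enumerate(regions)])

# Toy data: y = 2x in region 0 and y = 1 - x in region 1
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
regions = np.array([0, 0, 0, 1, 1, 1])
Y = np.where(regions[:, None] == 0, 2 * X, 1 - X)
models = fit_regional_ols(X, Y, regions, n_regions=2)
pred = predict_regional_ols(models, X, regions)
```

A single global line cannot fit both regimes of this toy data, whereas the two regional models recover each of them exactly.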


Evaluation of regional OLS on the datasets:

In [15]:
# convert dummies to multilabel
def dummie_to_multilabel(X):
    N = len(X)
    X_multi = np.zeros((N,1),dtype='int')
    for i in range(N):
        temp = np.where(X[i]==1)[0] # find where 1 is found in the array
        if temp.size == 0: # an empty array: there is no '1' in the X[i] array
            X_multi[i] = 0 # so we denote this class '0'
        else:
            X_multi[i] = temp[0] + 1 # +1 because class '0' is reserved for the all-zeros row
    return X_multi.T[0]

### Processing results:

In [16]:
results = {}

results['pk'] = pd.read_csv('pk - n_init 100 - 2019-03-13 20:38:58.418614', delim_whitespace=True)
results['vc2c'] = pd.read_csv('vc2c - n_init 100 - 2019-03-13 21:32:31.426084', delim_whitespace=True)
results['vc3c'] = pd.read_csv('vc3c - n_init 100 - 2019-03-13 21:33:59.228209', delim_whitespace=True)
results['wf2f'] = pd.read_csv('wf2f - n_init 100 - 2019-03-18 01:27:46.584924', delim_whitespace=True)
results['wf4f'] = pd.read_csv('wf4f - n_init 100 - 2019-03-18 01:13:44.728115', delim_whitespace=True)
results['wf24f'] = pd.read_csv('wf24f - n_init 100 - 2019-03-18 01:43:55.344967', delim_whitespace=True)
# accuracy, sensitivity, specificity
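
The processing code in this section treats each row of a result file as a flattened square confusion matrix. Under that assumption, the metrics can be sketched as below; the function name `confusion_metrics` and the example matrix are hypothetical, and the sensitivity/specificity formulas assume a binary problem with rows as true classes and class 0 as the negative class.

```python
import numpy as np

def confusion_metrics(flat_cm):
    """Accuracy (and binary sensitivity/specificity) from a flattened confusion matrix."""
    side = int(round(np.sqrt(len(flat_cm))))
    cm = np.asarray(flat_cm, dtype=float).reshape(side, side)
    metrics = {"accuracy": np.trace(cm) / cm.sum()}
    if side == 2:  # binary case, rows taken as true classes (assumption)
        metrics["specificity"] = cm[0, 0] / (cm[0, 0] + cm[0, 1])
        metrics["sensitivity"] = cm[1, 1] / (cm[1, 1] + cm[1, 0])
    return metrics

m = confusion_metrics([8, 2, 1, 9])   # hypothetical 2x2 confusion matrix
```

For multi-class matrices only the accuracy is returned, since sensitivity and specificity would then have to be defined per class.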

In [17]:
# header of the results
cabecalho = ['Mean', 'Median', 'Minimum', 'Maximum', 'Std. dev.', 'Mean sensit.', 'Mean specif.']
df_data   = np.zeros((len(results), len(cabecalho))) # matrix that will hold the numeric results

idx = 0
idx_label = [' ']*len(results)
for ds_name in results:
    data = results[ds_name].values
    length = data.shape[1]
    cm_side = int(np.sqrt(length)) # each row holds a flattened square confusion matrix
    acc = [0]*len(data)
    especificidade = 0 # specificity and sensitivity are not computed here and stay at 0
    sensibilidade  = 0
    for i in range(len(data)):
        cm = np.reshape(data[i], (cm_side,cm_side))
        total=sum(sum(cm))
        for j in range(cm_side):
            acc[i] += cm[j,j] # summing the diagonal
        acc[i]/=total
        #especificidade += cm[0,0]/(cm[0,0]+cm[0,1])
        #sensibilidade  += cm[1,1]/(cm[1,1]+cm[1,0])

    #especificidade/=100 # mean values
    #sensibilidade/=100


    df_data[idx,:] = [np.mean(acc), np.median(acc), min(acc), max(acc), 
                      np.std(acc), sensibilidade, especificidade]
    idx_label[idx] = ds_name
    idx+=1
    
#idx_label = [key for key in results]
df = pd.DataFrame(df_data, columns=cabecalho, index=idx_label)

In [18]:
df

| Dataset | Mean | Median | Minimum | Maximum | Std. dev. | Mean sensit. | Mean specif. |
|---|---|---|---|---|---|---|---|
| pk | 0.735128 | 0.74359 | 0.461538 | 0.974359 | 0.088084 | 0.0 | 0.0 |
| vc2c | 0.740968 | 0.75 | 0.403226 | 0.903226 | 0.090634 | 0.0 | 0.0 |
| vc3c | 0.65871 | 0.677419 | 0.403226 | 0.903226 | 0.146085 | 0.0 | 0.0 |
| wf2f | 0.353059 | 0.427656 | 0.04304 | 0.738095 | 0.231464 | 0.0 | 0.0 |
| wf4f | 0.594112 | 0.611264 | 0.071429 | 0.768315 | 0.124959 | 0.0 | 0.0 |
| wf24f | 0.429121 | 0.444139 | 0.159341 | 0.611722 | 0.108414 | 0.0 | 0.0 |


# Global OLS

In [19]:
resultsGOLS = {}
resultsGOLS['pk']    = pd.read_csv('GOLS - pk - n_init 100 - 2019-03-28 09:25:12.796348', delim_whitespace=True)
resultsGOLS['vc2c']  = pd.read_csv('GOLS - vc2c - n_init 100 - 2019-03-28 09:25:12.652123', delim_whitespace=True)
resultsGOLS['vc3c']  = pd.read_csv('GOLS - vc3c - n_init 100 - 2019-03-28 09:25:12.619445', delim_whitespace=True)
resultsGOLS['wf2f']  = pd.read_csv('GOLS - wf2f - n_init 100 - 2019-03-28 09:25:13.978124', delim_whitespace=True)
resultsGOLS['wf4f']  = pd.read_csv('GOLS - wf4f - n_init 100 - 2019-03-28 09:25:14.218842', delim_whitespace=True)
resultsGOLS['wf24f'] = pd.read_csv('GOLS - wf24f - n_init 100 - 2019-03-28 09:25:14.597113', delim_whitespace=True)
# accuracy, sensitivity, specificity

In [20]:
# header of the results
cabecalho = ['Mean', 'Median', 'Minimum', 'Maximum', 'Std. dev.', 'Mean sensit.', 'Mean specif.']
df_data   = np.zeros((len(resultsGOLS), len(cabecalho))) # matrix that will hold the numeric results

idx = 0
idx_label = [' ']*len(resultsGOLS)
for ds_name in resultsGOLS:
    data = resultsGOLS[ds_name].values
    length = data.shape[1]
    cm_side = int(np.sqrt(length)) # each row holds a flattened square confusion matrix
    acc = [0]*len(data)
    especificidade = 0 # specificity and sensitivity are not computed here and stay at 0
    sensibilidade  = 0
    for i in range(len(data)):
        cm = np.reshape(data[i], (cm_side,cm_side))
        total=sum(sum(cm))
        for j in range(cm_side):
            acc[i] += cm[j,j] # summing the diagonal
        acc[i]/=total
        #especificidade += cm[0,0]/(cm[0,0]+cm[0,1])
        #sensibilidade  += cm[1,1]/(cm[1,1]+cm[1,0])

    #especificidade/=100 # mean values
    #sensibilidade/=100


    df_data[idx,:] = [np.mean(acc), np.median(acc), min(acc), max(acc), 
                      np.std(acc), sensibilidade, especificidade]
    idx_label[idx] = ds_name
    idx+=1
    
#idx_label = [key for key in results]
df_GOLS = pd.DataFrame(df_data, columns=cabecalho, index=idx_label)

In [21]:
df_GOLS

| Dataset | Mean | Median | Minimum | Maximum | Std. dev. | Mean sensit. | Mean specif. |
|---|---|---|---|---|---|---|---|
| pk | 0.877949 | 0.871795 | 0.717949 | 0.974359 | 0.048667 | 0.0 | 0.0 |
| vc2c | 0.827742 | 0.822581 | 0.645161 | 0.951613 | 0.046906 | 0.0 | 0.0 |
| vc3c | 0.773387 | 0.774194 | 0.629032 | 0.903226 | 0.051933 | 0.0 | 0.0 |
| wf2f | 0.720916 | 0.722527 | 0.677656 | 0.751832 | 0.015868 | 0.0 | 0.0 |
| wf4f | 0.725861 | 0.728022 | 0.679487 | 0.754579 | 0.014791 | 0.0 | 0.0 |
| wf24f | 0.636026 | 0.637821 | 0.597985 | 0.664835 | 0.014179 | 0.0 | 0.0 |


# References

[1] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Netw. 11 (2000) 586–600.

[2] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, J. Intell. Inf. Syst. 17 (2001) 107–145.