# Importing libraries...
  * #### Cell #1 imports the essential matplotlib modules for displaying figures outside the Jupyter cell (in a separate Qt window)
  * #### Cell #2 imports the essential numpy and scipy modules for our computations, as well as the tqdm module for tracking the computation time with a progress bar

In [1]:
import PyQt5
import matplotlib.pyplot as plt
from matplotlib import style;  style.use('ggplot')
get_ipython().run_line_magic('matplotlib', 'qt')  # same as the `%matplotlib qt` magic

In [2]:
from tqdm import tqdm
import numpy as np
from scipy.spatial.distance import euclidean

# Importing libraries (cont.)...
  * #### The cell below imports the module "lonely_boy2", which contains our BSAS implementation.

In [3]:
import lib.lonely_boy2 as lb2
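
For readers who don't have lonely_boy2 at hand, here is a minimal sketch of the textbook BSAS (Basic Sequential Algorithmic Scheme) procedure. It is not the code inside lonely_boy2: the function name `bsas`, its signature and the (n_samples, n_features) layout are assumptions made for illustration (note that the cells below pass the transposed array to `lb2.BSAS`).

```python
import numpy as np


def bsas(X, theta, q, order=None):
    """Illustrative BSAS. X has shape (n_samples, n_features).

    theta: dissimilarity threshold, q: maximum number of clusters,
    order: sequence in which the samples are presented (default: as stored).
    """
    X = np.asarray(X, dtype=float)
    if order is None:
        order = np.arange(X.shape[0])

    first = order[0]
    centroids = [X[first].copy()]   # cluster representatives (running means)
    counts = [1]                    # number of samples in each cluster
    labels = np.empty(X.shape[0], dtype=int)
    labels[first] = 0

    for i in order[1:]:
        # Euclidean distance from the current sample to every representative.
        dists = np.linalg.norm(np.asarray(centroids) - X[i], axis=1)
        k = int(np.argmin(dists))
        if dists[k] > theta and len(centroids) < q:
            # Too far from every cluster and there is room left: open a new one.
            centroids.append(X[i].copy())
            counts.append(1)
            labels[i] = len(centroids) - 1
        else:
            # Otherwise join the nearest cluster and update its running mean.
            counts[k] += 1
            centroids[k] += (X[i] - centroids[k]) / counts[k]
            labels[i] = k
    return labels, np.asarray(centroids)
```

Because each sample is seen only once, the result depends on both `theta` and the presentation order, which is why the rest of this notebook searches for a good Theta and keeps track of the order.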

# Loading the Samples...
  * X_minimax contains the samples normalized with the min-max method
  * X_stdscl contains the samples standardized to mean = 0 and std = 1 (standard scaling)

In [4]:
X_minimax = np.load('comp-data/1-preprocessing-comp-data/user-feature-set-minimax.npy')

In [5]:
X_stdscl = np.load('comp-data/1-preprocessing-comp-data/user-feature-set-stdscl.npy')
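
For reference, the two scalings named above can be sketched as follows. This is only an illustration, not the preprocessing code that produced the .npy files: the function names are made up, and since the preprocessing step isn't shown in this notebook, the axis along which the statistics are computed is left as a parameter.

```python
import numpy as np


def minmax_scale(X, axis=0):
    """Rescale values to [0, 1] along the given axis (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=axis, keepdims=True)
    hi = X.max(axis=axis, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against division by zero
    return (X - lo) / span


def standard_scale(X, axis=0):
    """Shift to mean 0 and scale to std 1 along the given axis (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    std = np.where(std > 0, std, 1.0)       # constant slices stay at 0
    return (X - mean) / std
```

With `axis=0` the statistics are taken per feature (column), with `axis=1` per sample (row).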

# Initializing and Running BSAS...
  * #### Part 1: Determining the optimal Theta and max. Cluster Number for each approach

In [6]:
clf = lb2.BSAS()
clf.fit_best(X_minimax.T, first_time=True, n_times=50, dataname='minimax', plot_graph=True)

Computing (Min/Max) Euclidean Distances...: 100%|████████████████████████████████████| 943/943 [00:16<00:00, 57.93it/s]


saved: comp-data/2-bsas-comp-data/min-max-euclidean-distances-minimax.npy


Running BSAS...: 100%|█████████████████████████████████████████████████████████████████| 50/50 [24:38<00:00, 29.57s/it]


saved: comp-data/2-bsas-comp-data/total_clusters.npy
saved: comp-data/2-bsas-comp-data/total_theta.npy


Finding Optimal Cluster...: 100%|██████████████████████████████████████████████████████████████| 50/50 [00:00<?, ?it/s]


In [7]:
clf2 = lb2.BSAS()
clf2.fit_best(X_stdscl.T, first_time=True, n_times=50, dataname='stdscl', plot_graph=True)

Computing (Min/Max) Euclidean Distances...: 100%|████████████████████████████████████| 943/943 [00:15<00:00, 60.64it/s]


saved: comp-data/2-bsas-comp-data/min-max-euclidean-distances-gauss.npy


Running BSAS...: 100%|█████████████████████████████████████████████████████████████████| 50/50 [22:45<00:00, 27.32s/it]


saved: comp-data/2-bsas-comp-data/total_clusters.npy
saved: comp-data/2-bsas-comp-data/total_theta.npy


Finding Optimal Cluster...: 100%|██████████████████████████████████████████████████████████████| 50/50 [00:00<?, ?it/s]
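
The first progress bar above reports a "(Min/Max) Euclidean Distances" step. What fit_best does internally isn't shown here, so the following is only a guess at the idea: find the smallest and largest pairwise distance between samples and sweep Theta between them. The helper names and the evenly spaced grid are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist


def min_max_distances(X):
    """Smallest and largest pairwise Euclidean distance; X is (n_samples, n_features)."""
    d = pdist(np.asarray(X, dtype=float), metric='euclidean')
    return float(d.min()), float(d.max())


def theta_grid(d_min, d_max, n_theta=50):
    """Evenly spaced candidate thresholds between the two extremes (assumed scheme)."""
    return np.linspace(d_min, d_max, n_theta)
```

Any Theta below the minimum distance would put every sample in its own cluster (up to q), and any Theta above the maximum would give a single cluster, so the interesting values lie in between.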


# Initializing and Running BSAS...
  * #### Part 1: Determining the Optimal Theta and Max. Cluster Number for each approach
  
  (This variant is just a shortcut: with first_time=False the prerequisite calculations are loaded from disk instead of being recomputed, which cuts the computation time.)

In [6]:
clf = lb2.BSAS()
clf.fit_best(X_minimax.T, first_time=False, n_times=50, dataname='minimax', plot_graph=True)

loaded: comp-data/2-bsas-comp-data/min-max-euclidean-distances-minimax.npy
loaded: comp-data/2-bsas-comp-data/total_clusters-minimax.npy
loaded: comp-data/2-bsas-comp-data/total_theta-minimax.npy


Finding Optimal Cluster...: 100%|██████████████████████████████████████████████████| 50/50 [00:00<00:00, 102751.20it/s]


In [7]:
clf2 = lb2.BSAS()
clf2.fit_best(X_stdscl.T, first_time=False, n_times=50, dataname='stdscl', plot_graph=True)

loaded: comp-data/2-bsas-comp-data/min-max-euclidean-distances-stdscl.npy
loaded: comp-data/2-bsas-comp-data/total_clusters-stdscl.npy
loaded: comp-data/2-bsas-comp-data/total_theta-stdscl.npy


Finding Optimal Cluster...: 100%|██████████████████████████████████████████████████████████████| 50/50 [00:00<?, ?it/s]


  * #### Part 2: Plotting Cluster Number vs. Theta...

    * minimax_scaled_samples (X_minimax): The plot below shows that we'll use 2 clusters to represent our users, since the cluster count of 2 persists over the widest range of Theta (it has the "longest step").

![bsas-clusters-vs-theta][fig-0]

[fig-0]: figures/bsas-number-of-clusters-vs-theta-minimax.png "#clusters versus Theta"

    * standard_scaled_samples (X_stdscl): The plot below shows that we'll use 4 clusters to represent our users, since the cluster count of 4 persists over the widest range of Theta (it has the "longest step").

![bsas-clusters-vs-theta][fig-1]

[fig-1]: figures/bsas-number-of-clusters-vs-theta-stdscl.png "#clusters versus Theta"
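
The "longest step" rule used in Part 2 can also be applied programmatically. The sketch below is an illustration and not the logic inside fit_best: the function name, the choice of the plateau midpoint as Theta, and the option to skip the trivial one-cluster plateau are all assumptions.

```python
import numpy as np


def longest_plateau(thetas, n_clusters, ignore_single=True):
    """Return (cluster_count, theta_mid) for the widest flat run of the curve.

    thetas: increasing thresholds, n_clusters: cluster count found for each one.
    """
    thetas = np.asarray(thetas, dtype=float)
    n_clusters = np.asarray(n_clusters)

    # Index where each run of identical cluster counts starts and ends (inclusive).
    changed = n_clusters[1:] != n_clusters[:-1]
    starts = np.flatnonzero(np.concatenate(([True], changed)))
    ends = np.concatenate((starts[1:], [len(n_clusters)])) - 1

    widths = thetas[ends] - thetas[starts]
    if ignore_single:
        # A single cluster always wins for large Theta, so leave it out.
        widths = np.where(n_clusters[starts] == 1, -np.inf, widths)

    best = int(np.argmax(widths))
    theta_mid = 0.5 * (thetas[starts[best]] + thetas[ends[best]])
    return int(n_clusters[starts[best]]), float(theta_mid)
```

For example, with thresholds 0 to 9 and cluster counts `[5, 4, 4, 2, 2, 2, 2, 1, 1, 1]`, the widest non-trivial plateau belongs to 2 clusters and its midpoint is Theta = 4.5.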

  * #### Part 3: Fetching the Optimal Theta and Max. Cluster Number...

  * #### MiniMax Approach: opt(Θ) = 1.5903268063654714 | opt(q) = 2

In [8]:
theta_, q_ = clf.specs()
theta_, q_

(1.5903268063654714, 2)

  * #### StandardScaler Approach: opt(Θ) = 4.6747517492864477 | opt(q) = 4

In [9]:
theta2_, q2_ = clf2.specs()
theta2_, q2_

(4.6747517492864477, 4)

# Running BSAS with the Optimal Settings...

  * #### MiniMax Approach: opt(Θ) = 1.5903268063654714 | opt(q) = 2

In [10]:
# Present the samples to BSAS in a random order.
order_minimax = np.random.permutation(X_minimax.shape[0])
# Alternatively, reload the saved order that gave the max. number of clusters:
# order_minimax = np.load('comp-data/2-bsas-comp-data/order-minimax.npy')
clf.fit(X_minimax.T, order_minimax)

  * #### StandardScaler Approach: opt(Θ) = 4.6747517492864477 | opt(q) = 4

In [11]:
# Present the samples to BSAS in a random order.
order = np.random.permutation(X_stdscl.shape[0])
# Alternatively, reload the saved order that gave the max. number of clusters:
# order = np.load('comp-data/2-bsas-comp-data/order-stdscl.npy')
clf2.fit(X_stdscl.T, order)
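
Since BSAS sees each sample only once, the presentation order affects the clusters it finds. The cells above draw an unseeded permutation and rely on the saved .npy file to repeat a run. A seeded generator is another way to get a repeatable order; the helper name `random_order` below is made up for illustration.

```python
import numpy as np


def random_order(n_samples, seed=None):
    """Random presentation order; the same seed always gives the same order."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_samples)
```

A call such as `random_order(X_stdscl.shape[0], seed=0)` could then stand in for the `np.random.permutation(...)` line, and only the seed would need to be remembered.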

# Fetching the Clusters and their respective Representatives (average)...

  * #### MiniMax Approach Clusters & Centroids

In [12]:
clusters, centroids = clf.predict()
clusters, centroids

({0: array([[ 0.        ,  0.5862069 ,  0.24137931, ...,  0.65517241,
           0.10344828,  0.        ],
         [ 0.        ,  0.28571429,  0.        , ...,  0.85714286,
           0.        ,  0.        ],
         [ 0.        ,  0.04761905,  0.0952381 , ...,  0.33333333,
           0.19047619,  0.04761905],
         ..., 
         [ 0.        ,  0.35      ,  0.1       , ...,  0.5       ,
           0.15      ,  0.        ],
         [ 0.        ,  0.6       ,  0.26666667, ...,  0.46666667,
           0.2       ,  0.        ],
         [ 0.        ,  0.51162791,  0.26744186, ...,  0.39534884,
           0.1744186 ,  0.04651163]]),
  1: array([[ 0.        ,  0.66666667,  0.5       , ...,  0.16666667,
           0.33333333,  0.        ],
         [ 0.        ,  0.58974359,  0.42307692, ...,  0.33333333,
           0.17948718,  0.03846154],
         [ 0.        ,  0.875     ,  1.        , ...,  0.5       ,
           0.25      ,  0.        ],
         ..., 
         [ 0.        ,  0.

In [26]:
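# Imported here as well so the cell runs on its own (the notebook imports them in cell 22).
from scipy.spatial import distance
import pandas as pd

# Stack the BSAS centroids (one per cluster key) into a 2-D array.
centroids_minimax = np.array([centroids[key] for key in centroids])

# Assign every sample to its nearest centroid (Euclidean distance).
clusters_minimax = []
for X in X_minimax:
    tmp = distance.cdist([X], centroids_minimax, 'euclidean')
    clusters_minimax.append(int(np.argmin(tmp[0])))

# Recompute each centroid as the mean of its assigned samples.
# Column 19 holds the cluster label, next to the feature columns.
tmp = pd.DataFrame(X_minimax)
tmp[19] = clusters_minimax

new_centroids_minimax = tmp.groupby([19]).mean()
new_centroids_minimax = new_centroids_minimax.values

print(clusters_minimax)
print(new_centroids_minimax)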

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 

  * #### StandardScaler Approach Clusters & Centroids

In [13]:
clusters_, centroids_ = clf2.predict()
clusters_, centroids_

({0: array([[-0.93450873,  1.02162395,  0.41973697, ...,  1.77398268,
          -0.63356524, -0.93450873],
         [-0.65582584,  0.38256507, -0.03279129, ...,  0.17488689,
          -0.03279129, -0.65582584],
         [-0.71855285,  0.79839206, -0.41516387, ...,  0.49500308,
           0.19161409, -0.71855285],
         ..., 
         [-1.01512172,  1.34659003, -0.22788447, ...,  1.74020866,
           0.16573416, -0.62150309],
         [-1.16063204,  1.10111245,  0.4225891 , ...,  0.874938  ,
           0.08332743, -1.04754482],
         [-0.79387749,  1.08348945, -0.017036  , ...,  1.01875266,
          -0.017036  , -0.7291407 ]]),
  1: array([[-0.60859266,  0.48228097, -0.39041793, -0.60859266, -0.60859266,
          -0.1722432 ,  0.48228097, -0.60859266,  0.7004557 , -0.60859266,
          -0.39041793,  3.31855241, -0.60859266, -0.60859266, -0.39041793,
           0.04593152,  1.79132933, -0.60859266, -0.60859266],
         [-0.69860172,  0.32243156, -0.49439506, -0.69860172, -0.

In [22]:
from scipy.spatial import distance
import pandas as pd

In [27]:
# Stack the BSAS centroids (one per cluster key) into a 2-D array.
centroids_stdscl = np.array([centroids_[key] for key in centroids_])

# Assign every sample to its nearest centroid (Euclidean distance).
clusters_stdscl = []
for X in X_stdscl:
    tmp = distance.cdist([X], centroids_stdscl, 'euclidean')
    clusters_stdscl.append(int(np.argmin(tmp[0])))

# Recompute each centroid as the mean of its assigned samples.
# Column 19 holds the cluster label, next to the feature columns.
tmp = pd.DataFrame(X_stdscl)
tmp[19] = clusters_stdscl

new_centroids_stdscl = tmp.groupby([19]).mean()
new_centroids_stdscl = new_centroids_stdscl.values

print(clusters_stdscl)
print(new_centroids_stdscl)

[3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 2, 2, 0, 2, 0, 3, 0, 3, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 2, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 1, 2, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 3, 0, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0, 1, 0, 3, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
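
The two reassignment cells above loop over the samples one by one. The same nearest-centroid assignment and centroid update can be done in a single vectorized step, sketched below. The function name is made up, and unlike the groupby version this sketch keeps the old centroid for a cluster that ends up empty instead of dropping it.

```python
import numpy as np
from scipy.spatial.distance import cdist


def assign_and_recompute(X, centroids):
    """Label each row of X with its nearest centroid, then average per label."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)

    # (n_samples, n_clusters) distance matrix; argmin picks the closest centroid.
    labels = cdist(X, centroids, metric='euclidean').argmin(axis=1)

    # Mean of the members of each cluster; an empty cluster keeps its old centroid.
    new_centroids = np.vstack([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids
```

Called as `assign_and_recompute(X_stdscl, centroids_stdscl)`, it should return the same labels as the loop, since both pick the first centroid at minimum distance.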

# Saving the Computed Results...
  * #### MiniMax Approach: 
    * The Order the Samples were Presented to the BSAS Algorithm (order-minimax.npy)
    * The Clusters and their Respective Centroids (clusters-minimax.npy | centroids-minimax.npy)

In [18]:
np.save('comp-data/2-bsas-comp-data/order-minimax.npy', order_minimax)

In [24]:
np.save('comp-data/2-bsas-comp-data/clusters-minimax.npy', clusters_minimax)
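# Note: this stores the BSAS centroids (centroids_minimax), whereas the
# StandardScaler cell stores the recomputed ones (new_centroids_stdscl).
np.save('comp-data/2-bsas-comp-data/centroids-minimax.npy', centroids_minimax)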

  * #### StandardScaler Approach: 
    * The Order the Samples were Presented to the BSAS Algorithm (order-stdscl.npy)
    * The Clusters and their Respective Centroids (clusters-stdscl.npy | centroids-stdscl.npy)

In [11]:
np.save('comp-data/2-bsas-comp-data/order-stdscl.npy', order)

In [25]:
np.save('comp-data/2-bsas-comp-data/clusters-stdscl.npy', clusters_stdscl)
np.save('comp-data/2-bsas-comp-data/centroids-stdscl.npy', new_centroids_stdscl)

  * Given the benefits of standardization, we use our dataset scaled according to the standard norm, with mean = 0 and variance = 1

(To standardize the samples, subtract from each sample its mean and divide by its standard deviation, not its variance.)

**So the dataset we'll use on k-means is the X_stdscl**

# ~ END OF CHAPTER 2 - SEQUENTIAL CLUSTERING ~