# Factor discretization using the KMeans method (example of using the kmeans_ranges class)

This notebook was created for developing and testing discretization methods for numeric variables.

In [1]:
from copy import deepcopy

import numpy as np
import pandas as pd

from rangers import kmeans_ranges
from sklearn.datasets import make_blobs

import plotly.express as px
import plotly.graph_objects as go

ModuleNotFoundError: No module named 'reangers'

# Let's generate some random ranges to use as test data

Data building

In [None]:
# 3 different data blobs
X,y = make_blobs(
    n_samples = 150, n_features = 1,
    centers = 3, cluster_std = 1.6,
    random_state=1
)

example_data = pd.DataFrame(
    {'var': X.ravel(), 'blob':y})

del X,y

Data visualisation

In [None]:
basic_plot = px.box(example_data, x = 'var', color = 'blob', points = 'all')
basic_plot.show()

# Examples

Creating an instance, fitting it, and plotting the SSE (sum of squared errors) for different numbers of clusters.

In [None]:
my_ranges = kmeans_ranges(
    max_clusters = 5, 
    kmeans_kwarg={'max_iter':50}
)
my_ranges.build_kmeans_instances(
    example_data['var'].to_numpy()[:, np.newaxis])

# my_ranges.fit(example_data['var'].to_numpy()[:, np.newaxis])

px.line(
    x = range(1, len(my_ranges.SSE_arr) + 1), 
    y = my_ranges.SSE_arr)
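To make the SSE curve less of a black box, here is a minimal sketch of how such a curve can be computed with plain scikit-learn. It assumes that the class fits one `KMeans` model per cluster count and stores its `inertia_`; that is a guess about the internals, not a description of them. The names `sse` and `max_clusters`, and the `n_init` and `random_state` values, are chosen here for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the "Data building" cell.
X, _ = make_blobs(
    n_samples=150, n_features=1,
    centers=3, cluster_std=1.6,
    random_state=1
)

max_clusters = 5  # illustrative, mirrors the value used in the notebook
sse = []
for k in range(1, max_clusters + 1):
    # Assumption: one independent KMeans fit per candidate cluster count.
    model = KMeans(n_clusters=k, max_iter=50, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centre.
    sse.append(model.inertia_)

print(sse)
```

For a single cluster the SSE is simply the sum of squared deviations from the mean, and it shrinks as clusters are added.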

After building the instances, you can automatically select the best number of clusters using the elbow method. The method returns the index of the best sklearn.cluster.KMeans instance in the kmeans_insts list.

In [None]:
my_ranges.elbow_choose()
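There are several ways to locate an elbow, and the notebook does not say which one `elbow_choose` uses. As an illustration only, the sketch below picks the point of the SSE curve that lies farthest from the straight line joining its first and last points. The function name `elbow_index` is hypothetical.

```python
import numpy as np

def elbow_index(sse):
    """Index of the SSE value farthest from the first-to-last chord.

    Hypothetical helper; needs at least two values that are not all equal.
    """
    sse = np.asarray(sse, dtype=float)
    x = np.arange(len(sse), dtype=float)
    # Scale both axes to [0, 1] so that neither dominates the distance.
    x_n = x / x[-1]
    y_n = (sse - sse.min()) / (sse.max() - sse.min())
    # Height of the chord between the first and last points at each x.
    y_line = y_n[0] + (y_n[-1] - y_n[0]) * x_n
    # Vertical and perpendicular distances to a straight line differ only
    # by a constant factor, so the argmax is the same for both.
    return int(np.argmax(np.abs(y_line - y_n)))

best = elbow_index([1000, 400, 100, 80, 70])
print(best)  # → 2, i.e. the third entry (3 clusters when counting from 1)
```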

To make the resulting ranges more readable, we convert the numbers that label clusters in sklearn.cluster.KMeans into strings that look like "[4;5)".

We build the bins with the init_bins method. A bin is a point that closes each successive cluster when moving in increasing order of the variable. Let's get and plot the bin for each cluster.

In [None]:
bins = my_ranges.init_bins(
    example_data['var'].to_numpy()[:, np.newaxis])

box_with_bins = deepcopy(basic_plot)
box_with_bins.layout.yaxis2 = \
    go.layout.YAxis(overlaying='y', range=[0, 2], showticklabels=False)


for my_bin in bins:
    box_with_bins.add_scatter(
        x = [my_bin, my_bin], y = [0, 2], mode='lines', yaxis='y2',
        showlegend=False, line=dict(
            dash='dash', color = "firebrick", width = 2
        ))
    
    
box_with_bins.show()
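For one-dimensional data, a point is assigned to the nearest centre, so neighbouring clusters meet halfway between their centres. The sketch below computes such boundaries with plain scikit-learn. It is only one possible way to place the separating points: `init_bins` may place them differently (for example, at the largest value of each cluster), and the names `centers` and `boundaries` are chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the "Data building" cell.
X, _ = make_blobs(
    n_samples=150, n_features=1,
    centers=3, cluster_std=1.6,
    random_state=1
)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# KMeans labels are in arbitrary order, so sort the centres first.
centers = np.sort(model.cluster_centers_.ravel())

# Midpoints between neighbouring centres separate adjacent clusters.
boundaries = (centers[:-1] + centers[1:]) / 2
print(boundaries)
```

With three clusters this gives two boundaries, both of which fall inside the range of the data.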

After calling all the methods described above (they are combined in the fit method), we can use transform, which converts the input data into strings with the corresponding ranges.

Let's see how it works and add the result to the visualisation.

In [None]:
ranged_series = my_ranges.transform(
    example_data['var'].to_numpy()[:, np.newaxis])

ranged_series[:5]
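The mapping from values to range strings can be sketched with `np.digitize`. This is an illustration of the idea rather than the class's actual code: the helper name `to_range_labels` is hypothetical, and the open-ended outer ranges (`-inf` and `inf`) are an assumption about how values beyond the outermost boundaries are labelled.

```python
import numpy as np

def to_range_labels(values, boundaries):
    """Label each value with the half-open range "[low;high)" it falls into."""
    inner = np.sort(np.asarray(boundaries, dtype=float))
    # Assumption: the outermost ranges are unbounded.
    edges = np.concatenate([[-np.inf], inner, [np.inf]])
    # digitize returns i such that inner[i-1] <= value < inner[i],
    # which is exactly the position of the range's lower edge in `edges`.
    idx = np.digitize(values, inner)
    return [f"[{edges[i]:g};{edges[i + 1]:g})" for i in idx]

labels = to_range_labels([3.2, 4.0, 4.5, 7.1], [4, 5])
print(labels)  # → ['[-inf;4)', '[4;5)', '[4;5)', '[5;inf)']
```

Note that a value equal to a boundary goes to the range that starts at that boundary, which matches the "[4;5)" notation.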

In [None]:
box_with_groups = deepcopy(box_with_bins)

example_data["data_groups"] = ranged_series
example_data


for data_group in example_data['data_groups'].unique():
    
    draw_data = example_data.query("data_groups == '" + data_group + "'")
        
    box_with_groups.add_trace(
        go.Scatter(
            x = draw_data['var'], 
            y = np.ones(draw_data.shape[0])*1.8,
            mode = 'markers',
            name = data_group,
            yaxis = 'y2'
        )
    )
    
box_with_groups.show()

Finally, let's check the whole chain of methods in a single line of code.

In [None]:
my_ranges.fit_transform(example_data['var'].to_numpy()[:, np.newaxis])