# CLARANS Clustering of Mixed Data Types

1. Load and transform dataset
 - fill missing values 
2. Calculate Gower distance (dissimilarity between pairs of records)
3. Apply K-Medoids partitioning
4. Apply CLARANS partitioning
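
Step 2 relies on Gower distance, which averages a per-feature dissimilarity: a range-scaled absolute difference for numeric features and a simple match/mismatch for categorical ones. The block below is a minimal hand-rolled sketch of that idea on three made-up records. The function name `gower_distance`, the toy values, and the hard-coded ranges are assumptions for illustration, not part of the `gower` package or of the dataset.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Gower dissimilarity between two mixed-type records (illustrative sketch)."""
    parts = []
    for x, y, cat, rng in zip(a, b, is_categorical, ranges):
        if cat:
            # categorical feature: 0 when the classes match, 1 otherwise
            parts.append(0.0 if x == y else 1.0)
        else:
            # numeric feature: absolute difference scaled by the feature range
            parts.append(abs(x - y) / rng if rng else 0.0)
    # average the per-feature dissimilarities
    return sum(parts) / len(parts)

# Toy records (AGE, INCOME, CAR_TYPE); the values are invented for this example.
records = [
    (30, 40000.0, "SUV"),
    (50, 60000.0, "SUV"),
    (40, 80000.0, "VAN"),
]
is_categorical = [False, False, True]
# Ranges of the numeric features over the toy records; unused for CAR_TYPE.
ranges = [50 - 30, 80000.0 - 40000.0, None]

d01 = gower_distance(records[0], records[1], is_categorical, ranges)
d02 = gower_distance(records[0], records[2], is_categorical, ranges)
print(d01, d02)
```

Because every feature contributes a value between 0 and 1, numeric and categorical columns end up on a comparable scale, which is the reason for using this measure on mixed data.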

In [None]:
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

### 1. Case Study: Auto insurance claims [dataset](https://www.kaggle.com/xiaomengsun/car-insurance-claim-data)

In [None]:
# load data
DATA_PATH = os.path.join(os.getcwd(),'../data')
df = pd.read_csv(os.path.join(DATA_PATH,'car_insurance_claim.csv'),low_memory=False,)

# convert object to numerical
# (strip currency symbols and thousands separators before casting)
money_cols = ['INCOME','HOME_VAL','BLUEBOOK','OLDCLAIM','CLM_AMT']
df[money_cols] = df[money_cols].replace('[^.0-9]', '', regex=True).astype(float).fillna(0)

# clean textual classes
for col in df.columns:
    if df[col].dtype == 'O':
        df[col] = df[col].str.upper().replace('Z_','',regex=True).replace('[^A-Z]','',regex=True)
        
data_types = {f:t for f,t in zip(df.columns,df.dtypes)}

df[:2]

### 2. Feature Encoding & Engineering

***what features do we have?***
After some exploration I found this [data dictionary](https://rpubs.com/data_feelings/msda_data621_hw4) and the following key definitions:
- BLUEBOOK = car resale value
- MVR_PTS = [Motor Vehicle Record (MVR) points](https://www.wnins.com/losscontrolbulletins/MVREvaluation.pdf): a record of an individual's past driving history, indicating violations and accidents over a specified period
- TIF = Time In Force / customer lifetime
- YOJ = years in job
- CLM_FREQ = # of claims in past 5 years
- OLDCLAIM = sum $ of claims in past 5 years

In [None]:
# copy df
tdf = df.copy()

***fill missing & mean fill***

In [None]:
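# assign the result back instead of calling fillna(inplace=True) on a column selection
tdf['OCCUPATION'] = tdf['OCCUPATION'].fillna('OTHER')

# fill numeric gaps with the column mean
for col in ['AGE','YOJ','CAR_AGE']:
    tdf[col] = tdf[col].fillna(tdf[col].mean())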
    
print(tdf.isnull().sum()[tdf.isnull().sum()>0])

In [None]:
feat_id = ['ID']
feat_account = ['KIDSDRIV', 'BIRTH', 'AGE', 'HOMEKIDS', 'YOJ', 'INCOME',
                'PARENT1', 'HOME_VAL', 'MSTATUS', 'GENDER', 'EDUCATION', 'OCCUPATION','URBANICITY','TIF',]
feat_car = [ 'TRAVTIME', 'CAR_USE','MVR_PTS','BLUEBOOK','CAR_TYPE', 'RED_CAR','REVOKED','CAR_AGE',]
feat_claims = ['OLDCLAIM', 'CLM_FREQ', 'CLAIM_FLAG','CLM_AMT',]

data_meta = pd.DataFrame(tdf.nunique(),columns=['num'],index=None).sort_values('num').reset_index()
data_meta.columns = ['name','num']
data_meta[:2]

***transform categorical variables to label encoded***

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
for feat in data_meta.loc[data_meta['num']<=12,'name'].values:
    tdf[feat] = le.fit_transform(tdf[feat])

In [None]:
Xy = tdf[feat_account+feat_car+feat_claims].copy()

In [None]:
import gower

GOWER_PATH = os.path.join(DATA_PATH,'car_insurance_claim_gower_distance.npy')
try:
    # reuse the cached distance matrix when it exists
    gd = np.load(GOWER_PATH)
    print('Gower distances loaded from file.')
except FileNotFoundError:
    print('Calculating Gower distances...5-8 minutes')
    %time gd = gower.gower_matrix(Xy)
    # cache the matrix so later runs can skip the calculation
    np.save(GOWER_PATH, gd)

### 6. CLARANS Clustering

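CLARANS (Clustering Large Applications based upon RANdomized Search) trades off clustering quality against search cost by examining only a random sample of candidate medoid swaps.
First, it randomly selects k objects in the data set as the current medoids. It then randomly selects a current medoid x and an object y that is not one of the current medoids. If swapping x for y lowers the total distance of all objects to their nearest medoid, the swap is kept and the search continues from the new set of medoids. Once maxneighbor consecutive random swaps fail to improve the cost, the current medoids are taken as a local minimum. The whole search is repeated numlocal times and the best local minimum found is returned.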
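
The search described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the pyclustering implementation: the names `clarans_sketch`, `total_cost` and `euclidean`, the six example points, and the parameter values are all invented here, and the cost is recomputed from scratch at every step for readability rather than speed.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids, dist):
    # sum of distances from every point to its nearest medoid
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def clarans_sketch(points, k, numlocal, maxneighbor, dist, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(numlocal):
        # start each local search from k random medoids (stored as indices)
        current = rng.sample(range(len(points)), k)
        current_cost = total_cost(points, current, dist)
        failures = 0
        while failures < maxneighbor:
            # random neighbor: swap one medoid for one non-medoid
            slot = rng.randrange(k)
            candidate = rng.choice(
                [i for i in range(len(points)) if i not in current])
            neighbor = current[:slot] + [candidate] + current[slot + 1:]
            neighbor_cost = total_cost(points, neighbor, dist)
            if neighbor_cost < current_cost:
                # keep the improving swap and reset the failure counter
                current, current_cost = neighbor, neighbor_cost
                failures = 0
            else:
                failures += 1
        # keep the best local minimum seen across all restarts
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost
    return sorted(best_medoids), best_cost

# Two well-separated groups of three points each (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoids, cost = clarans_sketch(points, k=2, numlocal=3,
                               maxneighbor=200, dist=euclidean)
print(medoids, cost)
```

A small maxneighbor stops each local search early and is cheap but may settle on a poor set of medoids, while a large value examines most possible swaps and approaches an exhaustive K-Medoids search.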

In [None]:
n = 11000
sample = np.nan_to_num(tdf.values)
sample.shape

In [None]:
from pyclustering.cluster.clarans import clarans
from pyclustering.utils import timedcall

# The pyclustering CLARANS implementation requires a list of lists
# as its input dataset, so we convert the numpy array to a list.
data = sample.tolist()

# get a glimpse of the dataset
print("A peek into the dataset : ",data[:4])

In [None]:
"""!
@brief Constructor of clustering algorithm CLARANS.
@details The higher the value of maxneighbor, the closer CLARANS is to K-Medoids, and the longer each search for a local minimum takes.
@param[in] data: Input data that is presented as list of points (objects), each point should be represented by list or tuple.
@param[in] number_clusters: amount of clusters that should be allocated.
@param[in] numlocal: the number of local minima obtained (amount of iterations for solving the problem).
@param[in] maxneighbor: the maximum number of neighbors examined.
"""
# choose k clusters
results = dict()
for k in [2,10]:
    # arguments: data, number_clusters=k, numlocal=4, maxneighbor=3
    clarans_instance = clarans(data, k, 4, 3)
    print(k)
    
    # call the clarans method 'process' to run the algorithm
    %time clarans_instance.process()

    # returns the clusters
    clusters = clarans_instance.get_clusters()

    # returns the medoids
    medoids = clarans_instance.get_medoids()
    
    result = {'clusters':clusters,'medoids':medoids}
    
    results[k] = result

In [None]:
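from sklearn.metrics import silhouette_samples, silhouette_score

# list of scores
for k, result in results.items():
    # build one label per row, aligned with the row order of the data
    labels = np.empty(len(sample), dtype=int)
    for label, members in enumerate(result['clusters']):
        labels[members] = label
    # metric='precomputed' expects a square distance matrix, so pass the Gower matrix
    score1 = silhouette_score(gd, labels, metric='precomputed')
    score2 = silhouette_score(Xy, labels, metric='correlation')
    print(f'{k} : {score1} : {score2}')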

In [None]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, n_iter=500)
tsne = tsne_model.fit_transform(sample)

tsne_df = pd.DataFrame(tsne)

tsne_df['cluster'] = np.nan
for e,k in enumerate(clusters):
    print(e,len(k))
    tsne_df.iloc[k,-1] = e
    
groups = tsne_df.groupby('cluster')

fig, ax = plt.subplots(figsize=(15, 10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group[0], group[1], marker='o', linestyle='', label=name)
ax.legend()
plt.show()

## Inspect Cluster Values

In [None]:
Xy['cluster'] = tsne_df['cluster'].copy()

In [None]:
fig,axs = plt.subplots(15,2,figsize=(6,30),sharex=True)

for ax,col in zip(axs.flatten(),Xy.columns):
    Xy.boxplot(column=col,by='cluster',ax=ax)
    
plt.tight_layout()

### Silhouette Scores

In [None]:
# one cluster label per row, aligned with the row order of `sample`
cluster_array = np.empty(len(sample), dtype=int)
for e,k in enumerate(clusters):
    cluster_array[k] = e

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score

silhouette_score(sample, cluster_array)

# *References*

- https://towardsdatascience.com/clustering-on-mixed-type-data-8bbd0a2569c3
- https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9
- https://www.researchgate.net/post/What_is_the_best_way_for_cluster_analysis_when_you_have_mixed_type_of_data_categorical_and_scale
- https://www.google.com/search?client=firefox-b-d&q=python+gower+distance
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
- https://discuss.analyticsvidhya.com/t/clustering-technique-for-mixed-numeric-and-categorical-variables/6753
- https://stackoverflow.com/questions/24196897/r-distance-matrix-and-clustering-for-mixed-and-large-dataset
- https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
- https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
- https://rpubs.com/data_feelings/msda_data621_hw4
- https://pypi.org/project/gower/
- https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
- https://towardsdatascience.com/k-medoids-clustering-on-iris-data-set-1931bf781e05
- https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/pam
- https://github.com/annoviko/pyclustering/issues/499
- https://stats.stackexchange.com/questions/2717/clustering-with-a-distance-matrix
- https://www.kaggle.com/fabiendaniel/customer-segmentation
- https://dkopczyk.quantee.co.uk/claim-prediction/
- https://www.casact.org/pubs/dpp/dpp08/08dpp170.pdf
- https://medium.com/analytics-vidhya/partitional-clustering-using-clarans-method-with-python-example-545dd84e58b4
