# Analyzes cluster similarity within Source-Definition-Item hierarchy

1. Set up analysis environment and load dataframes
1. Create working df as 3-level grouping using only Target=1
1. Compute counts/mean/var aggregations for top levels
1. Analyze those cluster similarity aggregations

# Set Up Environment

## Set Notebook Parameters

In [1]:
# use gDrive if you previously saved train_data, etc.
# otherwise, use pre-generated data from the repo (default)
USE_GDRIVE = False

# save analysis plots if customized
SAVE_PLOT = False

## Import various packages


In [2]:
import pandas as pd
import numpy as np

import os   # use os.path.* explicitly; the name `path` is reused below for the data folder
from time import strftime, localtime
from google.colab import drive

## Clone CVA-SBERT GitHub or mount gDrive

In [3]:
if USE_GDRIVE:
    drive.mount('/content/drive')               # mount YOUR gDrive

    # Path to data -- change for YOUR specific Analysis folder
    path = '/content/drive/MyDrive/CVA-SBERT/Analysis-20221203-190207' ### CHANGE!!!

else:
    !git clone https://github.com/Hackathorn/CVA-SBERT  # clone the repo

    # Path to data in repository
    path = '/content/CVA-SBERT/data/SetUp_Data'

path

Cloning into 'CVA-SBERT'...
remote: Enumerating objects: 375, done.
remote: Counting objects: 100% (213/213), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 375 (delta 139), reused 184 (delta 119), pack-reused 162
Receiving objects: 100% (375/375), 115.89 MiB | 11.54 MiB/s, done.
Resolving deltas: 100% (240/240), done.
Checking out files: 100% (22/22), done.


'/content/CVA-SBERT/data/SetUp_Data'

Load dataframes and create the working dataframe, named simply `df`

In [13]:
# load previous dataframes from SetUp notebook
CVA_df = pd.read_pickle(path + '/CVA_df.pkl')
token_df = pd.read_pickle(path + '/token_df.pkl')

# use only 'good' data
df = CVA_df[CVA_df.Target == 1]
# remove unneeded columns
df = df.drop(columns = ['Target', 'Definition', 'Item', 'is_train'])
# rename columns to short, consistent names
df.rename(columns={"Source": "S", "Def_token": "D", "Item_token": 'I'}, inplace=True)
df.rename(columns={"Cos_Sim": "Csim", "Euc_Sim": "Esim"}, inplace=True)

df

Unnamed: 0,Index,S,D,I,Csim,Esim
0,0,2978,7060,2240,0.185577,1.276263
6,6,3169,5361,5119,0.414065,1.082529
8,8,2367,9846,4760,0.253170,1.222154
13,13,12426,9358,7035,0.488197,1.011734
18,18,13903,7165,4199,0.240013,1.232872
...,...,...,...,...,...,...
28069,28069,1915,2576,2574,0.624945,0.866089
28071,28071,12822,2404,11294,0.169479,1.288814
28072,28072,3350,6839,8420,0.583409,0.912789
28074,28074,2361,6453,10551,0.383094,1.110771


RESULTS...
- Roughly half of the rows remain after eliminating the Target=0 Def-Items
- Nice and compact
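
The "roughly half" observation can be checked by comparing row counts before and after the `Target == 1` filter. A minimal sketch on a small made-up frame; `toy_df` and its values are invented for illustration and are not taken from `CVA_df`:

```python
import pandas as pd

# Toy stand-in for CVA_df; the values are invented for illustration only
toy_df = pd.DataFrame({
    "Source": [1, 1, 2, 2],
    "Target": [1, 0, 1, 0],
    "Cos_Sim": [0.41, 0.12, 0.58, 0.20],
})

# Same filter as the notebook: keep only the 'good' rows
kept = toy_df[toy_df.Target == 1]

# Fraction of rows that survive the filter
fraction_kept = len(kept) / len(toy_df)
print(f"kept {len(kept)} of {len(toy_df)} rows ({fraction_kept:.0%})")
```

On the real data, the same two `len(...)` calls on `df` and `CVA_df` would give the actual fraction.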

# Group as 3-level S-D-I hierarchy

## Next piece of this step

In [None]:
# some code

RESULTS...
- Some insights........
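
The grouping cell in this section is still a placeholder. As a hedged sketch of what the outline describes (count/mean/var aggregations for the top levels of the S-D-I hierarchy), the following uses pandas named aggregation on a toy frame. The frame `toy`, its values, and the output column names (`n_items`, `n_defs`, `csim_mean`, `csim_var`) are assumptions made for illustration, not the notebook's actual code or results:

```python
import pandas as pd

# Toy stand-in for the working df (columns S, D, I, Csim, Esim); values are invented
toy = pd.DataFrame({
    "S":    [1, 1, 1, 2, 2],
    "D":    [10, 10, 11, 20, 20],
    "I":    [100, 101, 102, 200, 201],
    "Csim": [0.2, 0.4, 0.6, 0.5, 0.7],
    "Esim": [1.26, 1.10, 0.89, 1.00, 0.77],
})

# Definition level: one row per (Source, Definition), aggregated over its Items
by_def = toy.groupby(["S", "D"]).agg(
    n_items=("I", "count"),
    csim_mean=("Csim", "mean"),
    csim_var=("Csim", "var"),   # sample variance (ddof=1)
)

# Source level: one row per Source, aggregated over all of its Items
by_source = toy.groupby("S").agg(
    n_defs=("D", "nunique"),
    n_items=("I", "count"),
    csim_mean=("Csim", "mean"),
    csim_var=("Csim", "var"),
)

print(by_def)
print(by_source)
```

Note that `var` returns NaN for any group with a single row, so Definitions with only one Item would need separate handling before their variances are compared.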

# Save analysis results to your gDrive - OPTIONAL

Mount gDrive and create timestamped Experiment Folder

In [None]:
drive.mount('/content/drive')   # ignore warning if already mounted

BASE_PATH = '/content/drive/MyDrive/CVA-SBERT/'
EXP_PATH = BASE_PATH + 'Analysis-' + strftime("%Y%m%d-%H%M%S", localtime())

# `path` was rebound to a string earlier, so avoid path.exists() here;
# makedirs creates BASE_PATH and EXP_PATH as needed and is a no-op if they exist
os.makedirs(EXP_PATH, exist_ok=True)
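
Folder creation can also be wrapped in a small function that does not depend on the name `path`, which the notebook reuses for the data folder string. A minimal sketch, assuming a hypothetical helper `make_experiment_dir` and demonstrated against a temporary directory rather than a mounted gDrive:

```python
import os
import tempfile
from time import strftime, localtime

def make_experiment_dir(base_path, prefix="Analysis-"):
    """Create base_path/<prefix><timestamp> and return its path (hypothetical helper)."""
    stamp = strftime("%Y%m%d-%H%M%S", localtime())
    exp_path = os.path.join(base_path, prefix + stamp)
    # Creates any missing parent folders; does nothing if the folder already exists
    os.makedirs(exp_path, exist_ok=True)
    return exp_path

# Demo against a temporary directory instead of '/content/drive/MyDrive/CVA-SBERT/'
demo_base = os.path.join(tempfile.mkdtemp(), "CVA-SBERT")
demo_exp = make_experiment_dir(demo_base)
print(demo_exp)
```

In Colab, passing `BASE_PATH` to the helper would produce the same kind of folder name as `EXP_PATH` above.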

Save dataframes or other results to Experiment Folder

In [None]:
# save initial two dataframes
CVA_df.to_pickle(EXP_PATH + '/CVA_df.pkl')
token_df.to_pickle(EXP_PATH + '/token_df.pkl')

# ...or save other results, such as plots