# Set up the environment

- The CSV file containing the CVA data is loaded into a dataframe.
- TODO figure out how to insert image from GitHub

## Set notebook hyper-parameters

- **SBERT_MODEL**: The pre-trained LLM fine-tuned from BERT. See the [many models available](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads). The model [`paraphrase-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) is an early pretrained Sentence-Similarity (S-S) model used in many examples. However, a more recent model, [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), has the most Likes and the most Downloads (2M/month!) of almost a thousand S-S models.

- **USE_GDRIVE**: Default is FALSE, which assumes you have not customized the gDrive setup code in the final section. Set to TRUE to save results to your gDrive.

- **CLEAN_TEXT**: Default is TRUE, meaning that the text strings for Definitions and Items are cleaned by removing punctuation characters, leading digits, and leading whitespace.

In [30]:
# set model name to create SBERT model instance
SBERT_MODEL = 'all-MiniLM-L6-v2'

# save results to your gDrive (set to False to skip)
USE_GDRIVE = True

# clean token text of punctuation etc
CLEAN_TEXT = True

## Import SentenceTransformers

In [31]:
!pip install -q sentence_transformers

## Import various packages

- The ```sentence_transformers``` package plus others


In [32]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

import pandas as pd
import numpy as np
from pprint import pprint

## Instantiate SentenceTransformer

- Creates an instance of the `SentenceTransformer` class, which loads the pre-trained HuggingFace model named by `SBERT_MODEL`. The constructor accepts MANY other parameters.
- See [SentenceTransformer](https://www.sbert.net/) documentation, particularly for [its parameters](https://www.sbert.net/docs/package_reference/SentenceTransformer.html) and for [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html).
- TODO - explore these parameters. 

In [33]:
model = SentenceTransformer(SBERT_MODEL)

# Create CVA dataframe

## Clone CVA-SBERT GitHub 

In [34]:
!git clone https://github.com/Hackathorn/CVA-SBERT 

fatal: destination path 'CVA-SBERT' already exists and is not an empty directory.


## Load dataframe from comma-delimited file

- Add ```Index``` column to maintain lineage to original dataset
- Rename ```Item_Text``` and ```SourceId``` columns for naming consistency

In [35]:
# using clone repo data
CVA_FileName = '/content/CVA-SBERT/data/CVA_Training_Data_Allv4_Richard.csv'
CVA_df = pd.read_csv(CVA_FileName)

###### TODO - explore direct URL to CVA data, thus avoiding repo cloning
# data = pd.read_csv("https://github.com/Hackathorn/CVA-SBERT/blob/cc0df10ca1bb2a18f723cfc1f1e62ed79d368eee/data/CVA_Training_Data_Allv4_Richard.csv")
###### gets following error...
###### ParserError: Error tokenizing data. C error: Expected 1 fields in line 28, saw 367

# print original df structure
print("------- Original Structure ---------")
print(CVA_df.info(verbose=True))

# maintain lineage index to original lines in comma-delimited dataset 
CVA_df.insert(loc=0, column='Index', value=CVA_df.index)
# OPTIONAL renaming for name consistency/simplification
CVA_df.rename(columns = {"Item_Text":"Item"}, inplace = True)
CVA_df.rename(columns = {"SourceId":"Source"}, inplace = True)

# print final df structure
print("------- Final Structure ---------")
print(CVA_df.info(verbose=True))

------- Original Structure ---------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   SourceId    28076 non-null  int64 
 1   Target      28076 non-null  int64 
 2   Definition  28076 non-null  object
 3   Item_Text   28076 non-null  object
dtypes: int64(2), object(2)
memory usage: 877.5+ KB
None
------- Final Structure ---------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Index       28076 non-null  int64 
 1   Source      28076 non-null  int64 
 2   Target      28076 non-null  int64 
 3   Definition  28076 non-null  object
 4   Item        28076 non-null  object
dtypes: int64(3), object(2)
memory usage: 1.1+ MB
None
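
The `ParserError` noted in the TODO above is what one would expect if the `/blob/` URL returns the GitHub HTML page rather than the CSV itself. A minimal sketch of converting such a URL to its raw-content form follows; `github_blob_to_raw` is a hypothetical helper, not part of this notebook, and the final `read_csv` call is left commented out because it has not been run here.

```python
from urllib.parse import urlsplit

def github_blob_to_raw(url: str) -> str:
    """Convert a github.com '/blob/' page URL into a raw.githubusercontent.com URL.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # expected layout: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref, *file_path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, ref, *file_path])

blob_url = ("https://github.com/Hackathorn/CVA-SBERT/blob/"
            "cc0df10ca1bb2a18f723cfc1f1e62ed79d368eee/data/CVA_Training_Data_Allv4_Richard.csv")
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
# CVA_df = pd.read_csv(raw_url)   # assumption: needs network access; not verified here
```

If this works as hoped, the repo clone step could be skipped, at the cost of depending on network access at load time.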


## Print various counts/ratios 

- about Source-Definition-Item columns

In [36]:
ni = len(CVA_df)
ns = CVA_df.Source.nunique()
nd = CVA_df.Definition.nunique()
print("-------- Unique Counts and Ratios --------")
print(f"Item count = {ni:,d}")
print(f"Source count = {ns:,d} \n" +
      f"    with Items-per-Source = {(len(CVA_df)/ns):.2f} \n" +
      f"    with Definitions-per-Source = {(nd/ns):.2f}")
print(f"Definition count = {nd:,d} \n" +
      f"    with Items-per-Definition = {(len(CVA_df)/nd):.2f}")
print(f"Target mean = {CVA_df.Target.mean():.4f} \n" +
      f"    with count of ones = {CVA_df.Target.sum():,d}")

-------- Unique Counts and Ratios --------
Item count = 28,076
Source count = 833 
    with Items-per-Source = 33.70 
    with Definitions-per-Source = 3.47
Definition count = 2,887 
    with Items-per-Definition = 9.72
Target mean = 0.4999 
    with count of ones = 14,036


## Explore the CVA dataset

In [37]:
CVA_df        # NOTE: limited to 20K rows

Unnamed: 0,Index,Source,Target,Definition,Item
0,0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...
1,1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.
2,2,9900,0,The telemarketers ranked from 1 (most importan...,To upgrade physical work environments.
3,3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...
4,4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...
...,...,...,...,...,...
28071,28071,12822,1,How characteristic each of the attractiveness ...,Wise.
28072,28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive
28073,28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...
28074,28074,2361,1,Newcomers? belief that good alternative work e...,To what extent have other co-workers influence...


## Clean text string list of Def+Items

- OPTIONALLY depending on hyper-parameter CLEAN_TEXT
- TODO: insert space for each punc? delete extra spaces? 

In [38]:
import string

def clean_text(text):
    """Optionally strip punctuation, leading digits, and leading whitespace."""
    if CLEAN_TEXT:
        # remove all string punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # remove all leading digits
        text = text.lstrip(string.digits)
        # remove all leading spaces
        text = text.lstrip()

    return text

def clean_text_list(text_list):
    """Apply clean_text to each string in text_list."""
    return [clean_text(s) for s in text_list]
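
To make the effect of the cleaning steps concrete, here is a small standalone sketch. `clean_text_demo` is a hypothetical copy of the same three steps with cleaning always enabled, and the sample strings are made up for illustration. Note that deleting punctuation outright merges hyphenated words, which is relevant to the TODO about inserting a space for each punctuation character.

```python
import string

def clean_text_demo(text: str) -> str:
    """Standalone copy of the three cleaning steps (cleaning always on)."""
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    text = text.lstrip(string.digits)                                 # drop leading digits
    return text.lstrip()                                              # drop leading whitespace

samples = [
    "To what extent have other co-workers influenced you?",
    "1. Wise.",
    "Participants' explanations",
]
cleaned = [clean_text_demo(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```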

## Tokenize Definitions & Items text

In [39]:
# collect all text strings from Def & Item
text_list = list(CVA_df.Definition) + list(CVA_df.Item)

# if CLEAN_TEXT parm, remove punc & leading digits/whitespace
text_list = clean_text_list(text_list)

# find unique text strings & sort
token_list = sorted(set(text_list))

len(text_list), len(token_list), token_list[:5] # testing

(56152,
 11507,
 ['A CEOs perceived ability to provide valuable help on strategic matters based on focal CEO responses in models of identification and on responses of potential help recipients in models of the amount of strategic help provided',
  'A behavior syndrome in which an individual adopts an active orientation that goes beyond formal work requirements',
  'A behavioral observation scale for appraising the employees performance',
  'A belief in ones capability to accomplish successfully certain trained tasks',
  'A belief that ability is fixed and unchangeable'])

Note that, instead of 56K text strings to encode, we need to encode only about 11.5K. That is more than a 4x reduction.  
Since we are doing pairwise comparisons, this becomes more than 16x for similarity matrices.
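
The reduction factors can be checked with a few lines of arithmetic, using the two counts printed by the cell above. This is only a worked calculation; the variable names are illustrative.

```python
n_strings = 56_152      # Definition + Item strings (2 x 28,076 rows)
n_unique = 11_507       # unique strings after cleaning

# fewer strings to encode
encode_ratio = n_strings / n_unique
# a full pairwise matrix grows with the square of the string count
pairwise_ratio = (n_strings ** 2) / (n_unique ** 2)

print(f"encoding reduction = {encode_ratio:.2f}x")
print(f"pairwise-matrix reduction = {pairwise_ratio:.1f}x")
```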

# Create Token dataframe 

In [40]:
token_df = pd.DataFrame(token_list, columns=['token_text'], dtype='str')
print(token_df.info(), token_df.dtypes)

# token_df.token_text = token_df.token_text.astype('str')
# print(token_df.info(), token_df.dtypes)

def token2text(token):
    # return(token_df.iloc[token, 'token_text'])
    return(token_df.iloc[token, 0])

def text2token(text):
    text = clean_text(text)
    tokens = token_df.index[token_df['token_text'] == text].tolist()
    assert len(tokens) == 1, "text2token ERROR: Single unique text was not found"
    return(tokens[0])

# testing...
text = token2text(1)
token = text2token(text)
print(token, text)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11507 entries, 0 to 11506
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   token_text  11507 non-null  object
dtypes: object(1)
memory usage: 90.0+ KB
None token_text    object
dtype: object
1 A behavior syndrome in which an individual adopts an active orientation that goes beyond formal work requirements
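
`text2token` scans the whole `token_text` column on every call. As a possible alternative, a plain dictionary gives the same text-to-token mapping with constant-time lookups. The sketch below uses toy strings, and `demo_tokens` and `text_to_token` are hypothetical names, not part of the notebook.

```python
demo_texts = ["b text", "a text", "b text", "c text", "a text"]

# unique, sorted strings; the list position serves as the token id
demo_tokens = sorted(set(demo_texts))

# reverse index: text -> token id (built once, reused for every lookup)
text_to_token = {text: token for token, text in enumerate(demo_tokens)}

token_ids = [text_to_token[t] for t in demo_texts]
print(demo_tokens)
print(token_ids)
```

A missing string raises `KeyError` here, which plays the same role as the assertion in `text2token`.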


In [41]:
token_df

Unnamed: 0,token_text
0,A CEOs perceived ability to provide valuable h...
1,A behavior syndrome in which an individual ado...
2,A behavioral observation scale for appraising ...
3,A belief in ones capability to accomplish succ...
4,A belief that ability is fixed and unchangeable
...,...
11502,the fit between the requirements of ones role ...
11503,this last year I have had opportunities at wor...
11504,very strongly agree
11505,voluntary actual withdrawal from the organization


## Add tokens for Definitions and Items

In [42]:
# add token columns (text2token already applies clean_text, so do not clean twice)
CVA_df['Def_token'] = [text2token(text) for text in CVA_df.Definition]
CVA_df['Item_token'] = [text2token(text) for text in CVA_df.Item]

# OPTIONALLY drop text columns for Def+Item ...but lose **token_list** 
# CVA_df.drop(columns=['Definition', 'Item'])

CVA_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Index       28076 non-null  int64 
 1   Source      28076 non-null  int64 
 2   Target      28076 non-null  int64 
 3   Definition  28076 non-null  object
 4   Item        28076 non-null  object
 5   Def_token   28076 non-null  int64 
 6   Item_token  28076 non-null  int64 
dtypes: int64(5), object(2)
memory usage: 1.5+ MB


In [43]:
CVA_df

Unnamed: 0,Index,Source,Target,Definition,Item,Def_token,Item_token
0,0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...,7060,2240
1,1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.,1900,3550
2,2,9900,0,The telemarketers ranked from 1 (most importan...,To upgrade physical work environments.,9762,10362
3,3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...,1743,2342
4,4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...,2398,5841
...,...,...,...,...,...,...,...
28071,28071,12822,1,How characteristic each of the attractiveness ...,Wise.,2404,11294
28072,28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive,6839,8420
28073,28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...,8996,3955
28074,28074,2361,1,Newcomers? belief that good alternative work e...,To what extent have other co-workers influence...,6453,10551


Testing whether tokens are working...
- If token = ```10551```, text = ```To what extent have other co-workers influence...```

In [44]:
token2text(10551)

'To what extent have other coworkers influenced what you see as most important to learn'

## Encode token_text into embeddings

NOTE: 
- Each encoding is a 384-dim vector in the BERT latent/embedding space
- In CVA_df, there are 28,076 Def/Item pairs, which would require 2x28076 encodings
- There are 11,507 unique Def/Item text strings in token_list
- The result is an array of shape 11507 x 384 indexed by token value  
...instead of a 28076x2x384 array
- Performing these 11.5K encodings in bulk takes only a few minutes - amazing!

In [45]:
token_list = token_df.token_text.tolist()
embeddings = model.encode(token_list, show_progress_bar=True)
embeddings.shape

Batches:   0%|          | 0/360 [00:00<?, ?it/s]

(11507, 384)

Save embeddings as a new ```encoding``` column in token_df

In [46]:
token_df['encoding'] = list(embeddings)

Examine the revised token dataframe

In [47]:
token_df.info(), token_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11507 entries, 0 to 11506
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   token_text  11507 non-null  object
 1   encoding    11507 non-null  object
dtypes: object(2)
memory usage: 179.9+ KB


(None,                                               token_text  \
 0      A CEOs perceived ability to provide valuable h...   
 1      A behavior syndrome in which an individual ado...   
 2      A behavioral observation scale for appraising ...   
 3      A belief in ones capability to accomplish succ...   
 4        A belief that ability is fixed and unchangeable   
 ...                                                  ...   
 11502  the fit between the requirements of ones role ...   
 11503  this last year I have had opportunities at wor...   
 11504                                very strongly agree   
 11505  voluntary actual withdrawal from the organization   
 11506            well below average 5 well above average   
 
                                                 encoding  
 0      [0.03299552, 0.0003685815, 0.042408034, -0.049...  
 1      [0.0138665335, 0.04801447, -0.0328765, 0.03009...  
 2      [0.005355214, 0.052194264, -0.06584072, 0.0142...  
 3      [-0.01332193

## Compute Similarity metrics

NOTE: When using encodings from token_df...
- taking `.values` of the encoding column gives a 1D object array whose elements are 1D arrays
- this is confusing, but easy if you simply use double-indexing [i][j]
```
>>> enc = token_df.encoding.values
>>> type(enc), enc.shape, type(enc[0]), enc[0].shape, enc[0][0]
(numpy.ndarray, (11507,), numpy.ndarray, (384,), 0.03299552)
```
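
If a regular 2D array is more convenient than an array of arrays, `np.stack` can rebuild one. The sketch below uses a small stand-in for `token_df.encoding.values`; the sizes and values are made up for illustration.

```python
import numpy as np

# stand-in for token_df.encoding.values: object array holding 1D float arrays
enc_demo = np.empty(3, dtype=object)
for i in range(3):
    enc_demo[i] = np.full(4, float(i), dtype=np.float32)

# stack the per-row vectors into one (rows, dim) matrix
matrix = np.stack(list(enc_demo))
print(enc_demo.shape, matrix.shape, matrix.dtype)
```

With the 2D form, `matrix[i, j]` replaces the double-indexing `enc[i][j]`.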


In [48]:
# grab embeddings from token_df as list of arrays
embed = token_df.encoding.values

# allocate the metric arrays
cos_sim = np.empty(len(CVA_df),)
euc_sim = np.empty(len(CVA_df),)

# loop thru all Definition/Item pairs in CVA_df
for i, pair in enumerate(CVA_df[['Def_token', 'Item_token']].values):

    # find embeddings for pair & reshape to...
    #   (no of samples = 1, 384-dim embed vector)
    e1 = embed[pair[0]].reshape(1, -1)
    e2 = embed[pair[1]].reshape(1, -1)

    # calculate metrics; each call returns a (1, 1) array, so take its single element
    cos_sim[i] = cosine_similarity(e1, e2)[0, 0]
    euc_sim[i] = euclidean_distances(e1, e2)[0, 0]
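
The per-pair loop can also be written without a Python loop. The following is a sketch with plain NumPy on toy data, not the notebook's method: `emb`, `def_tok`, and `item_tok` are hypothetical stand-ins for the embedding matrix and the two token columns.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))      # stand-in for the (11507, 384) embeddings
def_tok = np.array([0, 2, 5])      # stand-in for CVA_df.Def_token
item_tok = np.array([1, 3, 5])     # stand-in for CVA_df.Item_token

a = emb[def_tok]                   # one embedding row per pair (Definition side)
b = emb[item_tok]                  # one embedding row per pair (Item side)

# row-wise cosine similarity and Euclidean distance
cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
euc = np.linalg.norm(a - b, axis=1)
print(cos.round(3), euc.round(3))
```

The last pair uses the same token on both sides, so its cosine similarity is 1 and its distance is 0, which makes a quick sanity check.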

Insert new columns for metrics
- TODO: save space with int32 & float32
- REMEMBER: Def+Item string in CVA_df != token_text string in token_df (token_text is the cleaned version when CLEAN_TEXT is True)

In [49]:
CVA_df['Cos_Sim'] = cos_sim.tolist()    # append cosine similarity as new column
CVA_df['Euc_Sim'] = euc_sim.tolist()    # append Euclidean distance (smaller = more similar) as new column

embeddings.shape, cos_sim.shape, CVA_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Index       28076 non-null  int64  
 1   Source      28076 non-null  int64  
 2   Target      28076 non-null  int64  
 3   Definition  28076 non-null  object 
 4   Item        28076 non-null  object 
 5   Def_token   28076 non-null  int64  
 6   Item_token  28076 non-null  int64  
 7   Cos_Sim     28076 non-null  float64
 8   Euc_Sim     28076 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 1.9+ MB


((11507, 384), (28076,), None)

# Split into 80/20 Train/Validate

NOTE: This Train/Validate split may NOT be needed b/c Target labels are ignored  
within this pre-trained self-supervised BERT model. Needs discussion!

- split based on Source groups of Def-Items
- create list of unique Sources
- random shuffle of Sources
- find split cut-point at ```split_ratio = 0.8```
- add new column ```is_train``` as True/False or 1/0

In [59]:
# set split ratio
split_ratio = 0.8

# find unique SourceId values
unique_Source = CVA_df.Source.unique()

# find split cut-point
split_Source = int(split_ratio * len(unique_Source))+1

# randomly shuffle & convert array -> list
np.random.shuffle(unique_Source)
SourceId_list = list(unique_Source)

# create list of Sources for train set
train_SourceIds = SourceId_list[:split_Source]

# insert new boolean column: True when the row's Source is in the train set
CVA_df['is_train'] = CVA_df.Source.isin(train_SourceIds)

CVA_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Index       28076 non-null  int64  
 1   Source      28076 non-null  int64  
 2   Target      28076 non-null  int64  
 3   Definition  28076 non-null  object 
 4   Item        28076 non-null  object 
 5   Def_token   28076 non-null  int64  
 6   Item_token  28076 non-null  int64  
 7   Cos_Sim     28076 non-null  float64
 8   Euc_Sim     28076 non-null  float64
 9   is_train    28076 non-null  bool   
dtypes: bool(1), float64(2), int64(5), object(2)
memory usage: 2.0+ MB


## Examine Train/Valid split

- If significantly different from 80/20 ratio, can re-shuffle above
- TODO check for valid results amid the shuffle
- TODO set random seed for reproducible results
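
For the random-seed TODO, one option is to shuffle with a seeded generator so that the split is identical on every run. The sketch below uses only the standard library and toy Source ids; the seed value 42 is arbitrary, and all names are illustrative.

```python
import random

# toy rows: one Source id per data row
row_sources = [101, 101, 102, 103, 103, 103, 104, 105, 105, 106]

unique_sources = sorted(set(row_sources))
rng = random.Random(42)            # fixed seed (arbitrary value) for reproducibility
rng.shuffle(unique_sources)

split_ratio = 0.8
cut = int(split_ratio * len(unique_sources)) + 1   # same cut-point rule as the notebook
train_sources = set(unique_sources[:cut])

# every row inherits the train/valid flag of its Source
is_train = [s in train_sources for s in row_sources]
print(f"{len(train_sources)} of {len(unique_sources)} sources in train; "
      f"{sum(is_train)} of {len(row_sources)} rows in train")
```

Because the split is by Source, all rows of a Source land on the same side, so the row-level ratio can drift from 80/20 even when the Source-level ratio does not.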

In [69]:
ld = len(CVA_df)
ls = CVA_df.Source.nunique()
lst = len(train_SourceIds)
lsv = ls - lst
lt = sum(CVA_df.is_train)
lv = ld - lt

print(f">>> Count of all data rows = {ld:,d}")
print(f">>> Count of unique Sources = {ls} split {lst/(ls):.0%}/{lsv/(ls):.0%} " + 
      f"into Train/Valid of {lst} {lsv}")
print(f">>> Count of Train/Valid rows = {lt:,d} ({lt/(lt+lv):.1%}) and " + 
      f"{lv:,d} ({lv/(lt+lv):.1%}) with total = {lt+lv:,d}")

>>> Count of all data rows = 28,076
>>> Count of unique Sources = 833 split 80%/20% into Train/Valid of 667 166
>>> Count of Train/Valid rows = 22,783 (81.1%) and 5,293 (18.9%) with total = 28,076


# Save results to gDrive

Mount gDrive and create folders

In [71]:
##### Only execute to save results
import os
from time import strftime, localtime

if USE_GDRIVE: 
    from google.colab import drive  # only available inside Colab
    drive.mount('/content/drive')   # ignore warning if already mounted

    BASE_PATH = '/content/drive/MyDrive/CVA-SBERT/'
    EXP_PATH = BASE_PATH + 'Analysis-' + strftime("%Y%m%d-%H%M%S", localtime())

    # create base and experiment folders if they do not exist yet
    os.makedirs(EXP_PATH, exist_ok=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



Save Train/Valid datasets 

In [74]:
# Save dataframes to gDrive ...IF exists EXP_PATH with USE_GDRIVE=TRUE

if USE_GDRIVE and 'EXP_PATH' in globals():

    CVA_df.to_pickle(EXP_PATH + '/CVA_df.pkl')
    token_df.to_pickle(EXP_PATH + '/token_df.pkl')

In [None]:
# Reload data from gDrive ...IF exists EXP_PATH with USE_GDRIVE=TRUE

if USE_GDRIVE and 'EXP_PATH' in globals():

    CVA_df = pd.read_pickle(EXP_PATH + '/CVA_df.pkl')
    token_df = pd.read_pickle(EXP_PATH + '/token_df.pkl')
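
Pickle is a reasonable choice here because the `encoding` column holds NumPy arrays, which a CSV export would turn into strings. The sketch below checks the round trip on a tiny stand-in dataframe written to a temporary folder; `demo_df` and the file location are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# tiny stand-in for token_df: a text column plus an array-valued column
demo_df = pd.DataFrame({"token_text": ["a", "b"]})
demo_df["encoding"] = list(np.arange(6, dtype=np.float32).reshape(2, 3))

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "token_df.pkl")
    demo_df.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# confirm that both the text and the vectors survived unchanged
same_text = restored["token_text"].equals(demo_df["token_text"])
same_vectors = all(np.array_equal(a, b)
                   for a, b in zip(restored["encoding"], demo_df["encoding"]))
print(same_text, same_vectors)
```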