# Setup the environment

- The CSV file containing the CVS data is loaded into a dataframe. 
- TODO figure out how to insert image from GitHub

## Set notebook hyper-parameters

- **SBERT_MODEL**: The pre-trained LLM fine-tuned from BERT. See the [many models available](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads). The model [`paraphrase-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) is an early pretrained Sentence-Similiarity (S-S) model used in many examples. However, a more recent model [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) has most Likes and most Downloads (2M/month!) of almost a thousand S-S models.

- **USE_GDRIVE**: Default is FALSE implying you have not customized the following SetUp code. 

- **CLEAN_TEXT***: Default is TRUE implying that text string for Definitions and Items is changed by removing punctuation chars, eliminating whitespace, etc.

In [1]:
# set model name to create SBERT model instance
SBERT_MODEL = 'all-MiniLM-L6-v2'

# do NOT save to your gDrive
USE_GDRIVE = True

# clean token text of punctuation etc
CLEAN_TEXT = True

## Import SentenceTransformers

In [2]:
!pip install -q sentence_transformers

[K     |████████████████████████████████| 85 kB 2.1 MB/s 
[K     |████████████████████████████████| 5.8 MB 10.1 MB/s 
[K     |████████████████████████████████| 1.3 MB 38.0 MB/s 
[K     |████████████████████████████████| 182 kB 64.9 MB/s 
[K     |████████████████████████████████| 7.6 MB 47.7 MB/s 
[?25h  Building wheel for sentence-transformers (setup.py) ... [?25l[?25hdone


## Import various packages

- ```sentence_transformers``` class plus others


In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

import pandas as pd
import numpy as np
from pprint import pprint

## Instantiate SentenceTransformer

- Creates an instance of HuggingFace pipeline for the `SentenceTransformer` class, based upon the parameters (which are MANY). 
- See [SentenceTransformer](https://www.sbert.net/) documentation, particularly for [its parameters](https://www.sbert.net/docs/package_reference/SentenceTransformer.html) and for [Sentence Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html).
- TODO - explore these parameters. 

In [4]:
model = SentenceTransformer(SBERT_MODEL)

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

# Create CVA dataframe

## Clone CVA-SBERT GitHub 

In [5]:
!git clone https://github.com/Hackathorn/CVA-SBERT 

Cloning into 'CVA-SBERT'...
remote: Enumerating objects: 324, done.[K
remote: Counting objects: 100% (162/162), done.[K
remote: Compressing objects: 100% (133/133), done.[K
remote: Total 324 (delta 104), reused 47 (delta 29), pack-reused 162[K
Receiving objects: 100% (324/324), 88.67 MiB | 23.98 MiB/s, done.
Resolving deltas: 100% (205/205), done.


## Load dataframe from comma-delimited file

- Use 
Add ```DataId``` column to maintain lineage to original dataset
- Rename Item column for naming consistency

In [6]:
# using clone repo data
CVA_FileName = '/content/CVA-SBERT/data/CVA_Training_Data_Allv4_Richard.csv'
CVA_df = pd.read_csv(CVA_FileName)

###### TODO - explore direct URL to CVA data, thus avoiding repo cloning
# data = pd.read_csv("https://github.com/Hackathorn/CVA-SBERT/blob/cc0df10ca1bb2a18f723cfc1f1e62ed79d368eee/data/CVA_Training_Data_Allv4_Richard.csv")
###### gets following error...
###### ParserError: Error tokenizing data. C error: Expected 1 fields in line 28, saw 367

# print original df structure
print("------- Orginal Structure ---------")
print(CVA_df.info(verbose=True))

# maintain lineage index to original lines in comma-delimited dataset 
CVA_df.insert(loc=0, column='Index', value=CVA_df.index)
# OPTIONAL renaming for name consistency/simplication
CVA_df.rename(columns = {"Item_Text":"Item"}, inplace = True)
CVA_df.rename(columns = {"SourceId":"Source"}, inplace = True)

# print final df structure
print("------- Final Structure ---------")
print(CVA_df.info(verbose=True))

------- Orginal Structure ---------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   SourceId    28076 non-null  int64 
 1   Target      28076 non-null  int64 
 2   Definition  28076 non-null  object
 3   Item_Text   28076 non-null  object
dtypes: int64(2), object(2)
memory usage: 877.5+ KB
None
------- Final Structure ---------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Index       28076 non-null  int64 
 1   Source      28076 non-null  int64 
 2   Target      28076 non-null  int64 
 3   Definition  28076 non-null  object
 4   Item        28076 non-null  object
dtypes: int64(3), object(2)
memory usage: 1.1+ MB
None


## Print various counts/ratios 

- about Source-Definition-Item columns

In [7]:
ni = len(CVA_df)
ns = CVA_df.Source.nunique()
nd = CVA_df.Definition.nunique()
print("-------- Unique Counts and Ratios --------")
print(f"Item count = {ni:,d}")
print(f"Source count = {ns:,d} \n" +
      f"    with Items-per-Source = {(len(CVA_df)/ns):.2f} \n" +
      f"    with Definitions-per-Source = {(nd/ns):.2f}")
print(f"Definition count = {nd:,d} \n" +
      f"    with Items-per-Definition = {(len(CVA_df)/nd):.2f}")
print(f"Target mean = {CVA_df.Target.mean():.4f} \n" +
      f"    with count of ones = {CVA_df.Target.sum():,d}")

-------- Unique Counts and Ratios --------
Item count = 28,076
Source count = 833 
    with Items-per-Source = 33.70 
    with Definitions-per-Source = 3.47
Definition count = 2,887 
    with Items-per-Definition = 9.72
Target mean = 0.4999 
    with count of ones = 14,036


## Explore the CVS dataset

In [8]:
CVA_df        # NOTE: limited to 20K rows

Unnamed: 0,Index,Source,Target,Definition,Item
0,0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...
1,1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.
2,2,9900,0,The telemarketers ranked from 1 (most importan...,To upgrade physical work environments.
3,3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...
4,4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...
...,...,...,...,...,...
28071,28071,12822,1,How characteristic each of the attractiveness ...,Wise.
28072,28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive
28073,28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...
28074,28074,2361,1,Newcomers? belief that good alternative work e...,To what extent have other co-workers influence...


## Clean text string list of Def+Items

- OPTIONALLY depending on hyperparm CLEAN_TEXT
- TODO: insert space for each punc? delete extra spaces? 

In [8]:
def clean_text(text):

    if CLEAN_TEXT:
        import string
        # remove all string punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # remove all leading digits
        text = text.lstrip(string.digits)
        # remove all leading spaces
        text = text.lstrip()

    return(text)

def clean_text_list(text_list):

    # clean each text in text_list
    text_list = [clean_text(s) for s in text_list]
        
    return(text_list)

## Tokenize Definitions & Items text

In [9]:
# collect all text strings from Def & Item
text_list = list(CVA_df.Definition) + list(CVA_df.Item)

# if CLEAN_TEXT parm, remove punc & leading digits/whitespace
text_list = clean_text_list(text_list)

# find unique text strings & sort
token_list = sorted(set(text_list))

len(text_list), len(token_list), token_list[:5] # testing

(56152,
 11507,
 ['A CEOs perceived ability to provide valuable help on strategic matters based on focal CEO responses in models of identification and on responses of potential help recipients in models of the amount of strategic help provided',
  'A behavior syndrome in which an individual adopts an active orientation that goes beyond formal work requirements',
  'A behavioral observation scale for appraising the employees performance',
  'A belief in ones capability to accomplish successfully certain trained tasks',
  'A belief that ability is fixed and unchangeable'])

Note that, instead of 56K text string to encode, we need only to encode 12K - That is more than 4x reduction.  
Since we are doing pairwise comparisons, it becomes 16x for similarity matrices

# Create Token dataframe 

In [10]:
token_df = pd.DataFrame(token_list, columns=['token_text'], dtype='str')
print(token_df.info(), token_df.dtypes)

# token_df.token_text = token_df.token_text.astype('str')
# print(token_df.info(), token_df.dtypes)

def token2text(token):
    # return(token_df.iloc[token, 'token_text'])
    return(token_df.iloc[token, 0])

def text2token(text):
    text = clean_text(text)
    tokens = token_df.index[token_df['token_text'] == text].tolist()
    assert len(tokens) == 1, "text2token ERROR: Single unique text was not found"
    return(tokens[0])

# testing...
text = token2text(1)
token = text2token(text)
print(token, text)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11507 entries, 0 to 11506
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   token_text  11507 non-null  object
dtypes: object(1)
memory usage: 90.0+ KB
None token_text    object
dtype: object
1 A behavior syndrome in which an individual adopts an active orientation that goes beyond formal work requirements


In [102]:
token_df

Unnamed: 0,token_text
0,A CEOs perceived ability to provide valuable h...
1,A behavior syndrome in which an individual ado...
2,A behavioral observation scale for appraising ...
3,A belief in ones capability to accomplish succ...
4,A belief that ability is fixed and unchangeable
...,...
11502,the fit between the requirements of ones role ...
11503,this last year I have had opportunities at wor...
11504,very strongly agree
11505,voluntary actual withdrawal from the organization


## Add tokens for Defintions and Items

In [11]:
# add token columns 
CVA_df['Def_token'] = [text2token(clean_text(text)) for text in CVA_df.Definition]
CVA_df['Item_token'] = [text2token(clean_text(text)) for text in CVA_df.Item]

# OPTIONALLY drop text columns for Def+Item ...but lose **token_list** 
# CVA_df.drop(columns=['Definition', 'Item'])

CVA_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Index       28076 non-null  int64 
 1   Source      28076 non-null  int64 
 2   Target      28076 non-null  int64 
 3   Definition  28076 non-null  object
 4   Item        28076 non-null  object
 5   Def_token   28076 non-null  int64 
 6   Item_token  28076 non-null  int64 
dtypes: int64(5), object(2)
memory usage: 1.5+ MB


In [84]:
CVA_df

Unnamed: 0,Index,Source,Target,Definition,Item,Def_token,Item_token
0,0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...,7060,2240
1,1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.,1900,3550
2,2,9900,0,The telemarketers ranked from 1 (most importan...,To upgrade physical work environments.,9762,10362
3,3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...,1743,2342
4,4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...,2398,5841
...,...,...,...,...,...,...,...
28071,28071,12822,1,How characteristic each of the attractiveness ...,Wise.,2404,11294
28072,28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive,6839,8420
28073,28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...,8996,3955
28074,28074,2361,1,Newcomers? belief that good alternative work e...,To what extent have other co-workers influence...,6453,10551


In [12]:
# testing
token2text(10551)

'To what extent have other coworkers influenced what you see as most important to learn'

# Encode token_text into embeddings

- Each encoding is 384-dim vector into the BERT latent/embedding space
- In CVA_df, there are 28,076 Def/Item pairs, requiring 2x28076 encodings
- There are 11,608 unique Def/Item text strings in token_list
- The result is an array of shape 11608 x 384 indexed by token value  
...instead of 28076x2x384 array
- Performing these 12K encodings in bulk takes only a few minutes - amazing!

In [13]:
token_list = token_df.token_text.tolist()
embeddings = model.encode(token_list, show_progress_bar=True)
embeddings.shape

Batches:   0%|          | 0/360 [00:00<?, ?it/s]

(11507, 384)

## Save embeddings into token_df

In [29]:
token_df['encoding'] = list(embeddings)
token_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11507 entries, 0 to 11506
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   token_text  11507 non-null  object
 1   encoding    11507 non-null  object
dtypes: object(2)
memory usage: 179.9+ KB


##Compute Simularity metrics 

- Using CoSin and Euc 
- For all Def/Item pairs

In [40]:
enc = token_df.encoding.values
type(enc), enc.shape, type(enc[0]), enc[0].shape, enc[0][0]

(numpy.ndarray, (11507,), numpy.ndarray, (384,), 0.03299551)

In [None]:
limit = 5
size = len(CVA_df) if limit == 0 else limit

cos_sim = np.empty(size,)
euc_sim = np.empty(size,)

for i, pair in enumerate(CVA_df[['Def_token', 'Item_token']].values):
    if (limit != 0) and (i >= limit): 
        break

    t1 = int(pair[0])
    t2 = int(pair[1])
    print('t1 & t2 = ', t1, t1, pair)
    e1 = embeddings[t1, :]
    e2 = embeddings[t2, :]
    print('e1 = ', e1)

    cos_sim[i] = cosine_similarity(e[0,:].reshape(1, -1), e[1,:].reshape(1, -1))
    euc_sim[i] = euclidean_distances(e[0,:].reshape(1, -1), e[1,:].reshape(1, -1))

    print(i, pair, pair[0])

In [None]:
limit = 0   # limit sample size IF limit>0

size = len(train_data) if limit == 0 else limit
embeddings = np.empty((size, 2, 384))
cos_sim = np.empty(size,)
euc_sim = np.empty(size,)

for i, pair in enumerate(train_data[['Definition', 'Item']].values): # Item_Text?
    if (limit != 0) and (i >= limit): 
        break

    e = model.encode(pair)  # MODEL ENCODER here............
    embeddings[i,:,:] = e
    cos_sim[i] = cosine_similarity(e[0,:].reshape(1, -1), e[1,:].reshape(1, -1))
    euc_sim[i] = euclidean_distances(e[0,:].reshape(1, -1), e[1,:].reshape(1, -1))

train_data['Cos_Sim'] = cos_sim.tolist()    # append COS similarity as new column
train_data['Euc_Sim'] = euc_sim.tolist()    # append EUC similarity as new column

embeddings.shape, cos_sim.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


((22189, 2, 384), (22189,))

### Encode Definition/Item sentences

- Each encoding is 384-dim vector into the BERT latent/embedding space
- There are 833 Definitions and 28,076 Items, each with an encoding
- Result is two arrays: Def_encodings (833x384) and Item_encodings (28076x384)
- Plus... need table to link each Definition to its Items for pairwise comparisons

Procedure: 
- take train_data while ignoring Target
- loop thru df by pairs of Definition+Item_Text string
- do the standard model.encode to generate 384-dim embeddings
- also calculate cosine_similarity
- append similarity to train_data df

NOTE: takes 15-20 minutes for 22K rows in train_data


In [None]:
train_data

Unnamed: 0,DataId,SourceId,Target,Definition,Item,Cos_Sim,Euc_Sim
0,0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...,0.183756,1.277688
1,1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.,0.292208,1.189783
3,3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...,0.322255,1.164255
4,4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...,0.446235,1.052393
6,6,3169,1,Informal rewards and recognition that particip...,"In the last few years (1994 to the present), h...",0.384002,1.109953
...,...,...,...,...,...,...,...
28070,28070,12341,0,The extent to which reputations were observabl...,The project required close working relationshi...,0.213506,1.254188
28071,28071,12822,1,How characteristic each of the attractiveness ...,Wise.,0.147961,1.305403
28072,28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive,0.569600,0.927793
28073,28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...,0.274533,1.204547


# Split into 80/20 Train/Validate 
- based on Source groups of Items

In [None]:
split_ratio = 0.8           # can change from 80/20

unique_Source = data.Source.unique()        # find unique SourceId values
split_Source = int(split_ratio * len(unique_Source))+1

np.random.shuffle(unique_Source)              # randomly shuffe
SourceId_list = list(unique_Source)           # array -> list

train_SourceIds = SourceId_list[:split_Source]    # create index lists
valid_SourceIds = SourceId_list[split_Source:]

train_data = data[data.SourceId.isin(train_SourceIds)]  # split dataset
valid_data = data[data.SourceId.isin(valid_SourceIds)]

train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22189 entries, 0 to 28075
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   DataId      22189 non-null  int64 
 1   SourceId    22189 non-null  int64 
 2   Target      22189 non-null  int64 
 3   Definition  22189 non-null  object
 4   Item        22189 non-null  object
dtypes: int64(3), object(2)
memory usage: 1.0+ MB


## Examine split of Train/Valid datasets

- Note the split in rows below. If sigificantly different from 80/20 ratio, re-shuffe above
- TODO check for valid results amid the shuffle

In [None]:
ld = len(data)
ls = len(unique_SourceId)
lst = len(train_SourceIds)
lsv = len(valid_SourceIds)
lt = len(train_data)
lv = len(valid_data)

print(f">>> Count of all data rows = {ld:,d}")
print(f">>> Count of unique Sources = {ls} split 80/20 into Train/Valid of {lst} {lsv}")
print(f">>> Count of Train/Valid rows = {lt:,d} ({lt/(lt+lv):.1%}) and {lv} ({lv/(lt+lv):.1%}) with total = {lt+lv:,d}")

>>> Count of all data rows = 28,076
>>> Count of unique Sources = 833 split 80/20 into Train/Valid of 667 166
>>> Count of Train/Valid rows = 22,189 (79.0%) and 5887 (21.0%) with total = 28,076


# Save results to gDrive

Mount gDrive and create folders

In [None]:
##### Only execute to save results
import os.path
from os import path
from time import strftime, localtime
from google.colab import drive

if USE_GDRIVE: 
    drive.mount('/content/drive')

    BASE_PATH = '/content/drive/MyDrive/CVA-SBERT-Analyses/'
    EXP_PATH = BASE_PATH + strftime("%Y%m%d-%H%M%S", localtime())

    if path.exists(BASE_PATH) == False:
        os.mkdir(BASE_PATH)
    if path.exists(EXP_PATH) == False:
        os.mkdir(EXP_PATH)

Mounted at /content/drive



Save Train/Valid datasets 

In [None]:
# Save Train/Valid data to gDrive ...IF exists EXP_PATH with USE_GDRIVE=TRUE

if USE_GDRIVE and 'EXP_PATH' in globals():
    train_data.to_csv(EXP_PATH+'/train_data.csv', index=False)
    valid_data.to_csv(EXP_PATH+'/valid_data.csv', index=False)

    np.savez_compressed(EXP_PATH+'/embeddings.npz', embeddings) # BIG->compress!
    np.save(EXP_PATH+'/cos_sim.npy', cos_sim)
    np.save(EXP_PATH+'/euc_sim.npy', euc_sim)    

    # dump copy of original data (if needed)
    #data.to_csv(EXP_PATH+'/all_data.csv', index=False)