# Development Plan

## Revise CVA-SBERT notebook...
- ✅ Create public GitHub for notebooks and dataset
- ✅ Split fullv4 dataset into 80/20 Training/Validation based on Source.
- For all Items, compute latent vectors.
- For all unique Definitions, compute latent vectors.  
- For Training by each Definition, compute pairwise similarities among its Items.

## Try experiments...
- Compute similarity stats. Browse extreme similarities for patterns in text.  
- Based on pairwise equality of Target values, plot similarity (and spread) distributions. Clear classification?
- UMAP hierarchical clustering of latent vectors. May have to use a small sample.


## Research S-BERT...
- Relationship with [HuggingFace Hub](https://www.sbert.net/docs/hugging_face.html)  
- [model comparisons](https://www.sbert.net/docs/pretrained_models.html), like **all-MiniLM-L6-v2** for good quick results
- [unsupervised learning](https://www.sbert.net/examples/unsupervised_learning/README.html) plus [domain adaptation](https://www.sbert.net/examples/domain_adaptation/README.html) by fine tuning on labeled training data  
- [evaluation classes](https://www.sbert.net/docs/package_reference/evaluation.html) like BinaryClassificationEvaluator
- understand parameters for [SentenceTransformer](https://www.sbert.net/docs/package_reference/SentenceTransformer.html) class & encoder method
- understand/test differences between [Cross-Encoders versus Bi-Encoders](https://www.sbert.net/examples/applications/cross-encoder/README.html)
- S-BERT clustering approaches like [topic modeling](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) (w UMAP) and [BERTopic](https://github.com/MaartenGr/BERTopic)


## References

This notebook derives 



# Setup Environment

## Clone CVA-SBERT GitHub 

- from [repository](https://github.com/Hackathorn/CVA-SBERT) and install [dependencies](https://github.com/Hackathorn/CVA-SBERT/blob/master/requirements.txt)

In [1]:
!git clone https://github.com/Hackathorn/CVA-SBERT  # clone
%cd CVA-SBERT
### %pip install -qr requirements.txt  #### TODO

Cloning into 'CVA-SBERT'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 72 (delta 40), reused 10 (delta 3), pack-reused 0
Unpacking objects: 100% (72/72), done.
/content/CVA-SBERT


## Import SentenceTransformers

In [2]:
!pip install -q sentence_transformers

  Building wheel for sentence-transformers (setup.py) ... done


## Import various packages

- ```sentence_transformers``` class plus others


In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

import pandas as pd
import numpy as np
from pprint import pprint

## Instantiate SentenceTransformer

The HuggingFace pipeline `SentenceTransformer` is ...

The model `paraphrase-MiniLM-L6-v2` is ...

In [4]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')      #### TODO find current best model


# Create Train/Valid Datasets

## Create dataframe from CSV file

In [5]:
CSV_FileName = 'CVA Training Data Allv4_Richard.csv'

data = pd.read_csv('/content/CVA-SBERT/data/' + CSV_FileName)

data.info(verbose=True)     # info() prints directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   SourceId    28076 non-null  int64 
 1   Target      28076 non-null  int64 
 2   Definition  28076 non-null  object
 3   Item_Text   28076 non-null  object
dtypes: int64(2), object(2)
memory usage: 877.5+ KB


- Print the count of unique values in the key columns

In [6]:
ni = len(data)
ns = data.SourceId.nunique()
nd = data.Definition.nunique()

print(f"Item count = {ni:,d}")
print(f"SourceId count = {ns:,d} with Items-per-Source = {(ni/ns):.2f}")
print(f"Definition count = {nd:,d} with Items-per-Definition = {(ni/nd):.2f}")
print(f"Target mean = {data.Target.mean():.4f} with count of ones = {data.Target.sum():,d}")

Item count = 28,076
SourceId count = 833 with Items-per-Source = 33.70
Definition count = 2,887 with Items-per-Definition = 9.72
Target mean = 0.4999 with count of ones = 14,036


## Explore with Colab Data Table Display

In [7]:
data        # limited to 20K rows

Unnamed: 0,SourceId,Target,Definition,Item_Text
0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...
1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.
2,9900,0,The telemarketers ranked from 1 (most importan...,To upgrade physical work environments.
3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...
4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...
...,...,...,...,...
28071,12822,1,How characteristic each of the attractiveness ...,Wise.
28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive
28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...
28074,2361,1,Newcomers? belief that good alternative work e...,To what extent have other co-workers influence...


## Clean data

- remove dash at beginning of string
- remove/change question marks within strings
- ???

In [8]:
#### TODO
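
One possible cleaning pass for the two concrete bullets above is sketched here. It is only a sketch under assumptions: the function name `clean_text` is hypothetical, and the rule that a `?` directly after a letter and before more text is a mis-encoded apostrophe (as in "Employees? sense") is a guess from the sample rows, so it can misfire on genuine questions in mid-sentence.

```python
import re

def clean_text(text):
    """Hypothetical cleaner for Definition / Item_Text strings."""
    # Remove one or more dashes (and surrounding spaces) at the start of the string.
    text = re.sub(r"^\s*-+\s*", "", text)
    # ASSUMPTION: "?" right after a letter and before more text is a
    # mis-encoded apostrophe; a "?" at the end of the string is left alone.
    text = re.sub(r"(?<=[A-Za-z])\?(?=\s+[a-z]|[A-Za-z])", "'", text)
    return text.strip()

print(clean_text("- Employees? sense of belongingness at work."))
print(clean_text("Is this clear?"))
```

In the notebook this could be applied with `data['Definition'] = data['Definition'].map(clean_text)` and likewise for `Item_Text`, after checking a sample of the changed rows by eye.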

## Split into 80/20 Train/Validate sets 

- based on SourceId groups of Items

In [9]:
unique_SourceId = data.SourceId.unique()            # find unique SourceId values
split_SourceId = int(0.8 * len(unique_SourceId))+1

np.random.shuffle(unique_SourceId)                  # randomly shuffle (in place)
SourceId_list = list(unique_SourceId)               # array -> list

train_SourceIds = SourceId_list[:split_SourceId]    # create index lists
valid_SourceIds = SourceId_list[split_SourceId:]

train_data = data[data.SourceId.isin(train_SourceIds)]  # split dataset
valid_data = data[data.SourceId.isin(valid_SourceIds)]
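
The shuffle above is unseeded, so every run gives a different split. A reproducible variant might look like the sketch below. The helper name `split_by_group`, its parameters, and the seed value are all hypothetical; it keeps the same idea (all Items of a SourceId land on the same side) but returns boolean row masks.

```python
import numpy as np

def split_by_group(groups, train_frac=0.8, seed=0):
    """Return (train_mask, valid_mask) so that no group is split across sets."""
    groups = np.asarray(groups)
    unique = np.unique(groups)                 # sorted unique group ids
    rng = np.random.default_rng(seed)          # seeded generator -> repeatable split
    rng.shuffle(unique)                        # shuffle group ids in place
    n_train = int(round(train_frac * len(unique)))
    train_mask = np.isin(groups, unique[:n_train])
    return train_mask, ~train_mask

# Tiny illustration with made-up group ids (not the CVA data).
demo_groups = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
train_mask, valid_mask = split_by_group(demo_groups)
print(train_mask.sum(), valid_mask.sum())
```

With the notebook's dataframe this would be `train_mask, valid_mask = split_by_group(data.SourceId)` followed by `data[train_mask]` and `data[valid_mask]`. scikit-learn's `GroupShuffleSplit` offers the same kind of group-aware split if a library routine is preferred.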

## Examine split of Train/Valid datasets

- Note the split in rows below. The split is made on SourceIds, so the row ratio can drift from 80/20; if it is too far off, re-shuffle above

In [10]:
ld = len(data)
ls = len(unique_SourceId)
lst = len(train_SourceIds)
lsv = len(valid_SourceIds)
lt = len(train_data)
lv = len(valid_data)

print(f">>> Count of all data rows = {ld:,d}")
print(f">>> Count of unique Sources = {ls} split 80/20 into Train/Valid of {lst} {lsv}")
print(f">>> Count of Train/Valid rows = {lt:,d} ({lt/(lt+lv):.1%}) and {lv:,d} ({lv/(lt+lv):.1%}) with total = {lt+lv:,d}")

>>> Count of all data rows = 28,076
>>> Count of unique Sources = 833 split 80/20 into Train/Valid of 667 166
>>> Count of Train/Valid rows = 23,589 (84.0%) and 4,487 (16.0%) with total = 28,076


## Save experiment data to gDrive
- OPTIONAL

In [11]:
##### Only execute to save results
import os
from time import strftime, localtime

from google.colab import drive
drive.mount('/content/drive')

BASE_PATH = '/content/drive/MyDrive/CVA-SBERT-Experiments/'
EXP_PATH = BASE_PATH + strftime("%Y%m%d-%H%M%S", localtime())   # one timestamped folder per run

os.makedirs(EXP_PATH, exist_ok=True)    # also creates BASE_PATH if it is missing

Mounted at /content/drive


In [12]:
# Save Train/Valid data ...IF exists EXP_PATH

if 'EXP_PATH' in globals():
    train_data.to_csv(EXP_PATH+'/train_data.csv', index=True)
    valid_data.to_csv(EXP_PATH+'/valid_data.csv', index=True)

# Encode Definition/Item sentences

- Each encoding is a 384-dim vector in the BERT latent/embedding space
- There are 2,887 unique Definitions and 28,076 Items, each with an encoding
- Result is two arrays: Def_encodings (2887x384) and Item_encodings (28076x384)
- Plus... need table to link each Definition to its Items for pairwise comparisons
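
The plan in the bullets above could be sketched as follows. This is a sketch, not the notebook's method: `encode_unique` is a hypothetical helper, and `encode` stands for any callable that maps a list of strings to an (n x 384) array, which is how `model.encode` behaves with its default settings. The link table falls out of `np.unique(..., return_inverse=True)`: `def_index[i]` is the row of the Definition encoding that belongs to Item `i`.

```python
import numpy as np

def encode_unique(definitions, items, encode):
    """Encode each unique Definition once, every Item once, and link them."""
    # unique_defs is sorted; def_index[i] points Item i at its Definition row.
    unique_defs, def_index = np.unique(np.asarray(definitions), return_inverse=True)
    def_encodings = np.asarray(encode(unique_defs.tolist()))
    item_encodings = np.asarray(encode(list(items)))
    return unique_defs, def_encodings, item_encodings, def_index

# Stand-in encoder for illustration only (NOT a real model): constant vectors.
fake_encode = lambda texts: np.ones((len(texts), 384))

u, d, it, idx = encode_unique(["b", "a", "b"], ["x", "y", "z"], fake_encode)
print(d.shape, it.shape, idx)
```

In the notebook the call would be `encode_unique(data.Definition, data.Item_Text, model.encode)`; the Items of Definition `k` are then `Item_encodings[def_index == k]`.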

## Do a simple approach...

- save/commit train_data & valid_data TS to GitHub
- take train_data (23,589 items in the split above; the count varies with each shuffle)
- create list of Sources
- for each Source, create list of Definitions
- for each Definition, create df of Items 
- for each Item, encode both Def & Items sentences into array

In [101]:
limit = 10   # limit sample size if limit > 0; 0 means use all rows

size = limit if limit > 0 else len(train_data)
embeddings = np.empty((size, 2, 384))
similarities = np.empty(size)

for i, pair in enumerate(train_data[['Definition', 'Item_Text']].values):
    if i >= size: break                 # stop once the sample is full
    print('i = ', i)
    e = model.encode(list(pair))        # e has shape (2, 384): Definition, Item
    embeddings[i,:,:] = e
    similarities[i] = cosine_similarity(e[0,:].reshape(1, -1), e[1,:].reshape(1, -1))[0, 0]

i =  0
i =  1
i =  2
i =  3
i =  4
i =  5
i =  6
i =  7
i =  8
i =  9


In [None]:
source_list = list(train_data.SourceId.unique())
source_list.sort()
type(source_list), len(source_list), source_list[:5]

for source in source_list:
    # select this Source's rows with a boolean mask, then keep each Definition once
    definition_list = train_data.loc[train_data.SourceId == source, 'Definition'].unique()
    print(definition_list)
    break

Once `sentence_pairs` (built in the Construct Sentences-Pairs section below) is encoded by the model, the resulting `embeddings` is a list of arrays, one per pair, and each row of an array is a latent vector of shape (384,).

In [None]:
embeddings = []
for pair in sentence_pairs:
    embeddings.append(model.encode(pair[:2]))   # encode only Definition and Item_Text, not the Target label

type(embeddings), len(embeddings), len(embeddings[0]), type(embeddings[0][0]), embeddings[0][0].shape

(list, 4, 2, numpy.ndarray, (384,))

## Construct index table/query of Items-Definitions-Sources

In [None]:
######## TODO

In [None]:
######## TODO
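
One way to fill in this index is a dictionary keyed by (SourceId, Definition) whose values are the row labels of the matching Items. The sketch below assumes the column names shown in `data.info()` above; `build_index` is a hypothetical helper name, and the three-row frame is made-up illustration data.

```python
import pandas as pd

def build_index(df):
    """Map (SourceId, Definition) -> list of row labels of its Items."""
    return {key: list(rows.index)
            for key, rows in df.groupby(["SourceId", "Definition"])}

# Made-up rows for illustration (not taken from the CVA dataset).
demo = pd.DataFrame({
    "SourceId":   [1, 1, 2],
    "Definition": ["def A", "def A", "def B"],
    "Item_Text":  ["item x", "item y", "item z"],
})
index = build_index(demo)
print(index)
```

Looking up the Items for one Definition is then `train_data.loc[index[(source, definition)], 'Item_Text']`.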

I manually scanned the data and picked four Definitions that seem to make sense.

In [None]:
definition_samples = [
    "A combination of temporal planning and temporal reminders modified to be leader-specific.",
    "A behavioral observation scale for appraising the employee's performance",
    "A belief that ability is fixed and unchangeable.",
    "A deep sense of moral obligation associated with animal care.",
]

For the first pass, I chose only the second sample to process.

In [None]:
data_sample = data[data['Definition'] == definition_samples[1]]
data_sample

Unnamed: 0,SourceId,Target,Definition,Item_Text
427,1930,1,A behavioral observation scale for appraising ...,The employee influences others in a way that r...
1634,1930,1,A behavioral observation scale for appraising ...,The employee adapts personal style to the need...
6887,1930,0,A behavioral observation scale for appraising ...,People can substantially change the kind of pe...
18047,1930,0,A behavioral observation scale for appraising ...,"Everyone is a certain kind of person, and ther..."


## Construct Sentences-Pairs

This section takes the `data_sample` from the previous section. It consists of N sentence pairs: the first sentence is the ***Definition*** of a specific topic that a survey instrument studies, while the second is the ***Item*** that the respondent rates.

Each sentence in a pair is encoded into a 384-dim latent/embedding vector, and a cosine similarity is calculated for the pair.

`sentence_pairs` is a list of lists built from the `data_sample` df; each inner list holds the Definition, the Item_Text, and the Target label.

In [None]:
sentence_pairs = data_sample[['Definition', 'Item_Text', 'Target']].values.tolist()
sentence_pairs

[["A behavioral observation scale for appraising the employee's performance",
  'The employee influences others in a way that results in agreement',
  1],
 ["A behavioral observation scale for appraising the employee's performance",
  'The employee adapts personal style to the needs of different situations',
  1],
 ["A behavioral observation scale for appraising the employee's performance",
  'People can substantially change the kind of person they are.',
  0],
 ["A behavioral observation scale for appraising the employee's performance",
  'Everyone is a certain kind of person, and there is not much they can really change about that.',
  0]]

# Compute pairwise similarities

- For Training by each Definition, compute pairwise similarities among its Items.

Using [Cosine Similarity function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)  from sklearn, ...
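
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal NumPy sketch follows; the function names are hypothetical. The second function covers the bullet above, returning all pairwise similarities among the rows of one array (for example, the Item encodings of a single Definition). sklearn's `cosine_similarity(X)` computes the same matrix.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity of two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_cosine(X):
    """All-pairs cosine similarity among the rows of X: (n, d) -> (n, n)."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)   # scale rows to length 1
    return unit @ unit.T

S = pairwise_cosine([[1, 0], [0, 1], [1, 1]])
print(S.round(3))
```

Both functions assume no all-zero vectors, since those would divide by zero.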

In [None]:
for i, pair in enumerate(embeddings):
    print(sentence_pairs[i][0])
    print(sentence_pairs[i][1])
    sim = cosine_similarity(pair[0].reshape(1, -1), pair[1].reshape(1, -1))
    print("   Similarity:", sim, "Target:", sentence_pairs[i][2])
    print()


A behavioral observation scale for appraising the employee's performance
The employee influences others in a way that results in agreement
   Similarity: [[0.59870344]] Target: 1

A behavioral observation scale for appraising the employee's performance
The employee adapts personal style to the needs of different situations
   Similarity: [[0.55898446]] Target: 1

A behavioral observation scale for appraising the employee's performance
People can substantially change the kind of person they are.
   Similarity: [[0.31225926]] Target: 0

A behavioral observation scale for appraising the employee's performance
Everyone is a certain kind of person, and there is not much they can really change about that.
   Similarity: [[0.20095041]] Target: 0
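
In this sample the two Target=1 pairs score higher than the two Target=0 pairs. A rough way to quantify that separation is sketched below, using the four values printed above. The midpoint threshold is a hypothetical choice for illustration, and four pairs are far too few to say whether similarity classifies Target in general.

```python
import numpy as np

# Values copied from the output above: Definition-Item similarity and Target.
similarities = np.array([0.59870344, 0.55898446, 0.31225926, 0.20095041])
targets = np.array([1, 1, 0, 0])

mean_match = similarities[targets == 1].mean()   # mean similarity where Target = 1
mean_other = similarities[targets == 0].mean()   # mean similarity where Target = 0

# HYPOTHETICAL rule: predict Target = 1 above the midpoint of the two means.
threshold = (mean_match + mean_other) / 2
predicted = (similarities >= threshold).astype(int)
accuracy = (predicted == targets).mean()

print(f"mean(Target=1) = {mean_match:.3f}, mean(Target=0) = {mean_other:.3f}")
print(f"threshold = {threshold:.3f}, accuracy on this sample = {accuracy:.0%}")
```

The same arrays can be filled from the full training loop to plot the two similarity distributions and check how much they overlap.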

