# Setup Environment

## Clone the CVA-SBERT GitHub repository

- from [repository](https://github.com/Hackathorn/CVA-SBERT) and install [dependencies](https://github.com/Hackathorn/CVA-SBERT/blob/master/requirements.txt)

In [None]:
!git clone https://github.com/Hackathorn/CVA-SBERT  # clone
%cd CVA-SBERT
# %pip install -qr requirements.txt  # TODO: not yet enabled

Cloning into 'CVA-SBERT'...
remote: Enumerating objects: 127, done.
remote: Counting objects: 100% (127/127), done.
remote: Compressing objects: 100% (120/120), done.
remote: Total 127 (delta 79), reused 19 (delta 6), pack-reused 0
Receiving objects: 100% (127/127), 6.98 MiB | 4.82 MiB/s, done.
Resolving deltas: 100% (79/79), done.
/content/CVA-SBERT


## Import SentenceTransformers

In [None]:
!pip install -q sentence_transformers

  Building wheel for sentence-transformers (setup.py) ... done


## Import various packages

- ```sentence_transformers``` class plus others


In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

import pandas as pd
import numpy as np
from pprint import pprint

## Instantiate SentenceTransformer

The HuggingFace pipeline `SentenceTransformer` is ...

The model [`paraphrase-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) is an early pretrained Sentence-Similarity (S-S) model used in many posted examples.

However, a more recent model, [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), has the most Likes and the most Downloads (2M/month!) among almost a thousand S-S models.
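A sentence-similarity model maps each sentence to a vector, and two sentences are then compared by the cosine of the angle between their vectors. The sketch below shows that scoring step only, using made-up 3-D stand-in vectors; in this notebook the vectors would instead come from `model.encode(...)`. The helper name `cosine_sim` is hypothetical and is not part of any library.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in "embeddings", made up for illustration only.
emb_a = np.array([1.0, 2.0, 0.0])
emb_b = np.array([2.0, 4.0, 0.0])   # same direction as emb_a
emb_c = np.array([0.0, 0.0, 3.0])   # orthogonal to emb_a

print(cosine_sim(emb_a, emb_b))     # close to 1: same direction
print(cosine_sim(emb_a, emb_c))     # close to 0: unrelated directions
```

The score depends only on direction, not on vector length, which is why `emb_a` and `emb_b` count as maximally similar.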

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')


# Setup Train/Valid Datasets

## Create dataframe from CSV file

- Add ```DataId``` column to maintain lineage to original dataset
- Rename the `Item_Text` column to `Item` for naming consistency

In [None]:
CSV_FileName = 'CVA Training Data Allv4_Richard.csv'
data = pd.read_csv('/content/CVA-SBERT/data/' + CSV_FileName)

data.insert(loc=0, column='DataId', value=data.index)       # maintain lineage to original dataset
data.rename(columns = {"Item_Text":"Item"}, inplace = True) # optional for naming consistency

data.info(verbose=True)     # info() prints the summary itself and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28076 entries, 0 to 28075
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   DataId      28076 non-null  int64 
 1   SourceId    28076 non-null  int64 
 2   Target      28076 non-null  int64 
 3   Definition  28076 non-null  object
 4   Item        28076 non-null  object
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


## Print various counts/ratios 

- about Source-Definition-Item columns

In [None]:
ni = len(data)
ns = data.SourceId.nunique()
nd = data.Definition.nunique()
print("-------- Unique Counts of SourceID-Definition-Item_Text --------")
print(f"Item count = {ni:,d}")
print(f"SourceId count = {ns:,d} \n" +
      f"    with Items-per-Source = {(len(data)/ns):.2f} \n" +
      f"    with Definitions-per-Source = {(nd/ns):.2f}")
print(f"Definition count = {nd:,d} \n" +
      f"    with Items-per-Definition = {(len(data)/nd):.2f}")
print(f"Target mean = {data.Target.mean():.4f} \n" +
      f"    with count of ones = {data.Target.sum():,d}")

-------- Unique Counts of SourceID-Definition-Item_Text --------
Item count = 28,076
SourceId count = 833 
    with Items-per-Source = 33.70 
    with Definitions-per-Source = 3.47
Definition count = 2,887 
    with Items-per-Definition = 9.72
Target mean = 0.4999 
    with count of ones = 14,036


## Explore the CSV dataset

In [None]:
data        # NOTE: limited to 20K rows

| | DataId | SourceId | Target | Definition | Item |
|---|---|---|---|---|---|
| 0 | 0 | 2978 | 1 | People whose past behavior is consistent with ... | Have any of your current or previous partners ... |
| 1 | 1 | 1056 | 0 | Facilitation from work to school. | I enjoy being a student on this campus. |
| 2 | 2 | 9900 | 0 | The telemarketers ranked from 1 (most importan... | To upgrade physical work environments. |
| 3 | 3 | 1015 | 0 | Employees? sense of belongingness at work. | Helps others when it is clear their workload i... |
| 4 | 4 | 2988 | 0 | How attracted members were to the crew and the... | Managers rate each crew (low performance/high ... |
| ... | ... | ... | ... | ... | ... |
| 28071 | 28071 | 12822 | 1 | How characteristic each of the attractiveness ... | Wise. |
| 28072 | 28072 | 3350 | 1 | Participants' explanations for why the seller ... | The buyer is persuasive |
| 28073 | 28073 | 13668 | 0 | The extent to which the employee perceived the... | I have been able to express my views and feeli... |
| 28074 | 28074 | 2361 | 1 | Newcomers? belief that good alternative work e... | To what extent have other co-workers influence... |


## Clean data (TODO)

- remove the dash at the beginning of strings
- remove or replace question marks within strings
- alternatively, leave the original data unchanged
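If the cleaning step is adopted, it could look roughly like the sketch below. The function name `clean_text` is hypothetical, and the second rule rests on an assumption: that a `?` sitting directly between a letter and more text (as in `Employees? sense`) is a mis-encoded apostrophe. That assumption would need to be checked against the data, since it also rewrites a genuine mid-sentence question mark followed by a lowercase word.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleaner for Definition/Item strings."""
    # Rule 1: drop a dash (and surrounding whitespace) at the start of the string.
    text = re.sub(r"^\s*-+\s*", "", text)
    # Rule 2 (assumption): a '?' right after a letter and right before a letter,
    # or before a space plus a lowercase letter, is treated as an apostrophe.
    text = re.sub(r"(?<=[A-Za-z])\?(?=[A-Za-z]| [a-z])", "'", text)
    return text

print(clean_text("- Wise."))                                     # Wise.
print(clean_text("Employees? sense of belongingness at work."))  # Employees' sense of belongingness at work.
print(clean_text("Are you sure?"))                               # Are you sure?
```

A question mark at the end of a string is left alone, so ordinary questions in the `Item` column keep their punctuation.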

## Split into 80/20 Train/Validate 
- based on SourceId groups of Items

In [None]:
split_ratio = 0.8           # can change from 80/20

unique_SourceId = data.SourceId.unique()        # find unique SourceId values
split_SourceId = int(split_ratio * len(unique_SourceId))+1

np.random.shuffle(unique_SourceId)              # randomly shuffle (in place)
SourceId_list = list(unique_SourceId)           # array -> list

train_SourceIds = SourceId_list[:split_SourceId]    # create index lists
valid_SourceIds = SourceId_list[split_SourceId:]

train_data = data[data.SourceId.isin(train_SourceIds)]  # split dataset
valid_data = data[data.SourceId.isin(valid_SourceIds)]

train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23359 entries, 0 to 28075
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   DataId      23359 non-null  int64 
 1   SourceId    23359 non-null  int64 
 2   Target      23359 non-null  int64 
 3   Definition  23359 non-null  object
 4   Item        23359 non-null  object
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


## Examine split of Train/Valid datasets

- Note the split in rows below. If it differs significantly from the 80/20 ratio, re-shuffle above
- TODO check for valid results amid the shuffle
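One way to approach the check in the TODO is sketched below on toy data. It assumes a seeded generator (`np.random.default_rng`) so that a split can be repeated, and it tests two properties: no `SourceId` appears on both sides, and no row is lost. The function name `group_split` and the toy frame are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def group_split(df, group_col, ratio=0.8, seed=0):
    """Split df by whole groups; return (train, valid). Hypothetical helper."""
    rng = np.random.default_rng(seed)           # seeded, so the split is repeatable
    groups = rng.permutation(df[group_col].unique())
    n_train = int(ratio * len(groups)) + 1      # same cut-off rule as the cell above
    in_train = df[group_col].isin(groups[:n_train])
    return df[in_train], df[~in_train]

# Toy data (made up): 10 sources holding 1..10 items each, 55 rows in total.
toy = pd.DataFrame({"SourceId": np.repeat(np.arange(10), np.arange(1, 11))})
train, valid = group_split(toy, "SourceId", ratio=0.8, seed=42)

# No SourceId may sit on both sides, and every row must land somewhere.
assert set(train.SourceId).isdisjoint(set(valid.SourceId))
assert len(train) + len(valid) == len(toy)
print(f"train share of rows = {len(train) / len(toy):.1%}")
```

Because sources hold different numbers of items, the row share can drift from the source share, which is the effect the bullet above warns about.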

In [None]:
ld = len(data)
ls = len(unique_SourceId)
lst = len(train_SourceIds)
lsv = len(valid_SourceIds)
lt = len(train_data)
lv = len(valid_data)

print(f">>> Count of all data rows = {ld:,d}")
print(f">>> Count of unique Sources = {ls} split 80/20 into Train/Valid of {lst} {lsv}")
print(f">>> Count of Train/Valid rows = {lt:,d} ({lt/(lt+lv):.1%}) and {lv} ({lv/(lt+lv):.1%}) with total = {lt+lv:,d}")

>>> Count of all data rows = 28,076
>>> Count of unique Sources = 833 split 80/20 into Train/Valid of 667 166
>>> Count of Train/Valid rows = 23,359 (83.2%) and 4717 (16.8%) with total = 28,076


# Setup checkpoint to gDrive (optional)

## Mount gDrive and create folders

In [None]:
##### Only execute to save results
import os
from time import strftime, localtime

from google.colab import drive
drive.mount('/content/drive')

BASE_PATH = '/content/drive/MyDrive/CVA-SBERT-Analyses/'
EXP_PATH = BASE_PATH + strftime("%Y%m%d-%H%M%S", localtime())

# Create the experiment folder; this also creates BASE_PATH if it is missing
os.makedirs(EXP_PATH, exist_ok=True)

Mounted at /content/drive


## Save Train/Valid datasets 

In [None]:
# Save Train/Valid data to gDrive, but only if EXP_PATH was defined above

if 'EXP_PATH' in globals():
    train_data.to_csv(EXP_PATH+'/train_data.csv', index=False)
    valid_data.to_csv(EXP_PATH+'/valid_data.csv', index=False)
    # dump copy of original data (if needed)
    data.to_csv(EXP_PATH+'/all_data.csv', index=False)
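A quick way to confirm that a checkpoint is usable is to read a saved file back and compare it with the frame in memory. The sketch below does this with a temporary folder standing in for `EXP_PATH` and a tiny made-up frame that borrows the notebook's column names; both are illustrative assumptions, not the notebook's data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for EXP_PATH: a temporary folder, not the gDrive path.
exp_path = Path(tempfile.mkdtemp())

# Tiny made-up frame using the notebook's column names.
saved = pd.DataFrame({
    "DataId": [0, 1],
    "SourceId": [2978, 1056],
    "Target": [1, 0],
    "Definition": ["def a", "def b"],
    "Item": ["item a", "item b"],
})
saved.to_csv(exp_path / "train_data.csv", index=False)

# Read the checkpoint back and confirm nothing changed on the way.
loaded = pd.read_csv(exp_path / "train_data.csv")
pd.testing.assert_frame_equal(saved, loaded)
print(list(loaded.columns))
```

Writing with `index=False`, as the cell above does, is what keeps an extra unnamed index column from appearing when the file is read back.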