# Setup Train/Valid Dataset from CSV

- The CSV file containing the CVA data is loaded into a dataframe.

  # TODO figure out how to insert image from GitHub

## Set notebook parameters

- **SBERT_MODEL**: The name of the pre-trained sentence-embedding model, fine-tuned from BERT. See the [many models available](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads). The model [`paraphrase-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) is an early pretrained Sentence-Similarity (S-S) model used in many examples. However, a more recent model, [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), has the most Likes and the most Downloads (2M/month!) of almost a thousand S-S models.

- **USE_GDRIVE**: Default is `False`, which means nothing is saved to your gDrive. Set it to `True` only after customizing the gDrive paths in the checkpoint code.

In [1]:
# set model name to create SBERT model instance
SBERT_MODEL = 'all-MiniLM-L6-v2'

# do NOT save to your gDrive
USE_GDRIVE = False

## Import SentenceTransformers

In [3]:
!pip install -q sentence_transformers



## Import various packages

- The `SentenceTransformer` class from ```sentence_transformers```, plus scikit-learn distance functions, pandas, and NumPy


In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

import pandas as pd
import numpy as np
from pprint import pprint

## Instantiate SentenceTransformer

- Creates an instance of the `SentenceTransformer` class, which downloads the named model from Hugging Face. The constructor accepts many parameters.
- See [SentenceTransformer](https://www.sbert.net/) documentation, particularly for [its parameters](https://www.sbert.net/docs/package_reference/SentenceTransformer.html) and for [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html).
- TODO - explore these parameters. 

In [5]:
model = SentenceTransformer(SBERT_MODEL)


# Create Train/Valid Datasets

## Clone CVA-SBERT GitHub 

- from [repository](https://github.com/Hackathorn/CVA-SBERT) and install [dependencies](https://github.com/Hackathorn/CVA-SBERT/blob/master/requirements.txt)

In [1]:
!git clone https://github.com/Hackathorn/CVA-SBERT  # clone
# %cd CVA-SBERT   # not needed???

fatal: destination path 'CVA-SBERT' already exists and is not an empty directory.


## Load dataframe from CSV file. 

- Use `pd.read_csv` to load the CSV file into a dataframe
- Add ```DataId``` column to maintain lineage to original dataset
- Rename Item column for naming consistency
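
`pd.read_csv` needs the raw file contents, whereas a GitHub `blob` URL serves an HTML page. A minimal sketch of converting one form to the other, assuming GitHub's usual `raw.githubusercontent.com/<user>/<repo>/<ref>/<path>` layout; the helper name `github_blob_to_raw` is hypothetical and not part of any library:

```python
def github_blob_to_raw(blob_url: str) -> str:
    """Convert a github.com '/blob/' URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    # 'user/repo/blob/ref/path' -> 'user/repo' and 'ref/path'
    repo_part, ref_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{repo_part}/{ref_and_path}"


blob_url = ("https://github.com/Hackathorn/CVA-SBERT/blob/"
            "593416a5ab066116b1cda02e31e069588fedc2ee/data/"
            "CVA%20Training%20Data%20Allv4_Richard.csv")
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
```

Assuming the repository is publicly readable, the resulting URL could be passed to `pd.read_csv(raw_url)`. Reading the file from the cloned repository avoids the network round trip.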

In [8]:
# open() only reads local files, so point at the copy inside the cloned repository
CVA_datafile = "/content/CVA-SBERT/data/CVA Training Data Allv4_Richard.csv"

with open(CVA_datafile) as f:
    lines = f.readlines()

In [6]:
# Read the CSV from the cloned repository. A GitHub "blob" URL returns an HTML page,
# which pd.read_csv cannot parse (ParserError), so the local copy is used instead.
CSV_FileName = 'CVA Training Data Allv4_Richard.csv'
data = pd.read_csv('/content/CVA-SBERT/data/' + CSV_FileName)

data.insert(loc=0, column='DataId', value=data.index)       # maintain lineage to original dataset
data.rename(columns={"Item_Text": "Item"}, inplace=True)    # optional for naming consistency

data.info(verbose=True)     # info() prints its summary itself and returns None

## Print various counts/ratios 

- about Source-Definition-Item columns

In [None]:
ni = len(data)
ns = data.SourceId.nunique()
nd = data.Definition.nunique()
print("-------- Unique Counts of SourceID-Definition-Item_Text --------")
print(f"Item count = {ni:,d}")
print(f"SourceId count = {ns:,d} \n" +
      f"    with Items-per-Source = {(len(data)/ns):.2f} \n" +
      f"    with Definitions-per-Source = {(nd/ns):.2f}")
print(f"Definition count = {nd:,d} \n" +
      f"    with Items-per-Definition = {(len(data)/nd):.2f}")
print(f"Target mean = {data.Target.mean():.4f} \n" +
      f"    with count of ones = {data.Target.sum():,d}")

-------- Unique Counts of SourceID-Definition-Item_Text --------
Item count = 28,076
SourceId count = 833 
    with Items-per-Source = 33.70 
    with Definitions-per-Source = 3.47
Definition count = 2,887 
    with Items-per-Definition = 9.72
Target mean = 0.4999 
    with count of ones = 14,036


## Explore the CVA dataset

In [None]:
data        # NOTE: limited to 20K rows

Unnamed: 0,DataId,SourceId,Target,Definition,Item
0,0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...
1,1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.
2,2,9900,0,The telemarketers ranked from 1 (most importan...,To upgrade physical work environments.
3,3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...
4,4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...
...,...,...,...,...,...
28071,28071,12822,1,How characteristic each of the attractiveness ...,Wise.
28072,28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive
28073,28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...
28074,28074,2361,1,Newcomers? belief that good alternative work e...,To what extent have other co-workers influence...


## Clean data --- TODO!!!!

- remove dash at beginning of string
- remove/change question marks within strings
- ???   ...or leave the original data unchanged
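
One possible shape for this cleaning step, as a sketch only. It assumes that a `?` sitting directly after a word character and followed by either a letter or a space plus a lowercase letter is a mis-encoded apostrophe (as in `Employees? sense`), which will not hold for every string. The function name `clean_text` is hypothetical.

```python
import re

def clean_text(text: str) -> str:
    """Strip a leading dash and turn apostrophe-like question marks into apostrophes."""
    # remove a dash (and surrounding spaces) at the beginning of the string
    text = re.sub(r"^\s*-\s*", "", text)
    # ASSUMPTION: '?' after a word character, followed by a letter or by
    # whitespace plus a lowercase letter, stands for a lost apostrophe
    text = re.sub(r"(?<=\w)\?(?=\s+[a-z]|\w)", "'", text)
    return text.strip()


samples = [
    "- Employees? sense of belongingness at work.",
    "Have you left? Why",      # genuine question mark, left untouched
    "Wise.",
]
for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
```

Writing the cleaned strings to new columns rather than overwriting `Definition` and `Item` would keep the original data unchanged, which is the third option listed above.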

## Split into 80/20 Train/Validate 
- based on SourceId groups of Items

In [None]:
split_ratio = 0.8           # can change from 80/20

unique_SourceId = data.SourceId.unique()        # find unique SourceId values
split_SourceId = int(split_ratio * len(unique_SourceId))+1

np.random.shuffle(unique_SourceId)              # randomly shuffle (in place)
SourceId_list = list(unique_SourceId)           # array -> list

train_SourceIds = SourceId_list[:split_SourceId]    # create index lists
valid_SourceIds = SourceId_list[split_SourceId:]

train_data = data[data.SourceId.isin(train_SourceIds)]  # split dataset
valid_data = data[data.SourceId.isin(valid_SourceIds)]

train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22189 entries, 0 to 28075
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   DataId      22189 non-null  int64 
 1   SourceId    22189 non-null  int64 
 2   Target      22189 non-null  int64 
 3   Definition  22189 non-null  object
 4   Item        22189 non-null  object
dtypes: int64(3), object(2)
memory usage: 1.0+ MB


## Examine split of Train/Valid datasets

- Note the split in rows below. If it is significantly different from the 80/20 ratio, re-shuffle above
- TODO check for valid results amid the shuffle
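
A sketch of what such a check might look like, in plain Python on toy ids rather than the dataframe. It seeds the shuffle so a split can be reproduced, verifies that no source lands in both sets, and reports the row fraction. The helper names `split_by_group` and `check_split` and the seed value are assumptions made for illustration.

```python
import random

def split_by_group(group_ids, split_ratio=0.8, seed=42):
    """Shuffle the unique group ids with a fixed seed and split them into two sets."""
    unique_ids = sorted(set(group_ids))            # sorted for a deterministic start order
    random.Random(seed).shuffle(unique_ids)
    cut = int(split_ratio * len(unique_ids)) + 1   # same cut rule as the split cell
    return set(unique_ids[:cut]), set(unique_ids[cut:])

def check_split(group_ids, train_ids, valid_ids):
    """Return the train row fraction after verifying the split is leak-free and complete."""
    assert train_ids.isdisjoint(valid_ids), "a group appears in both sets"
    assert train_ids | valid_ids == set(group_ids), "some groups were dropped"
    n_train = sum(1 for g in group_ids if g in train_ids)
    return n_train / len(group_ids)

# toy stand-in for data.SourceId: 10 sources with 1..10 items each
source_ids = [s for s in range(10) for _ in range(s + 1)]
train_ids, valid_ids = split_by_group(source_ids)
train_fraction = check_split(source_ids, train_ids, valid_ids)
print(f"train sources = {len(train_ids)}, valid sources = {len(valid_ids)}, "
      f"train rows = {train_fraction:.1%}")
```

Because sources contribute different numbers of items, the row fraction can drift from the source fraction; trying a few seeds and keeping one whose row fraction is close to the target is one way to avoid re-shuffling by hand.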

In [None]:
ld = len(data)
ls = len(unique_SourceId)
lst = len(train_SourceIds)
lsv = len(valid_SourceIds)
lt = len(train_data)
lv = len(valid_data)

print(f">>> Count of all data rows = {ld:,d}")
print(f">>> Count of unique Sources = {ls} split 80/20 into Train/Valid of {lst} {lsv}")
print(f">>> Count of Train/Valid rows = {lt:,d} ({lt/(lt+lv):.1%}) and {lv} ({lv/(lt+lv):.1%}) with total = {lt+lv:,d}")

>>> Count of all data rows = 28,076
>>> Count of unique Sources = 833 split 80/20 into Train/Valid of 667 166
>>> Count of Train/Valid rows = 22,189 (79.0%) and 5887 (21.0%) with total = 28,076


# Encode Definition/Item sentences

- Each encoding is a 384-dim vector in the BERT latent/embedding space
- There are 2,887 unique Definitions and 28,076 Items, each with an encoding
- Result is two arrays: Def_encodings (2887x384) and Item_encodings (28076x384)
- Plus... need a table to link each Definition to its Items for pairwise comparisons
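
The linking table mentioned in the last bullet could be as simple as one integer per row pointing into the list of unique Definitions, so that each Definition is encoded only once. A sketch in plain Python; `build_definition_index` is a hypothetical helper, and the strings are shortened stand-ins rather than rows from the dataset:

```python
def build_definition_index(definitions):
    """Return (unique_definitions, row_to_def); row_to_def[i] indexes unique_definitions."""
    unique_definitions = []
    position = {}            # definition text -> index in unique_definitions
    row_to_def = []
    for text in definitions:
        if text not in position:
            position[text] = len(unique_definitions)
            unique_definitions.append(text)
        row_to_def.append(position[text])
    return unique_definitions, row_to_def


# toy (Definition, Item) rows; the first and third share a Definition
rows = [
    ("Sense of belonging at work.", "I feel part of the team."),
    ("Facilitation from work to school.", "I enjoy being a student."),
    ("Sense of belonging at work.", "My coworkers include me."),
]
unique_defs, row_to_def = build_definition_index([d for d, _ in rows])
print(unique_defs)
print(row_to_def)
```

With the two encoding arrays in hand, row `i` would then pair `Def_encodings[row_to_def[i]]` with `Item_encodings[i]`.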

Procedure: 
- take train_data while ignoring Target
- loop through the dataframe by pairs of Definition and Item strings
- run the standard `model.encode` on each pair to generate 384-dim embeddings
- calculate `cosine_similarity` and `euclidean_distances` for each pair
- append both measures to the train_data dataframe

NOTE: takes 15-20 minutes for 22K rows in train_data
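
For reference, the two measures used in the procedure can be written out directly. The sketch below uses plain Python on toy vectors rather than `sklearn`, and also illustrates that for unit-length vectors the Euclidean distance is fixed by the cosine similarity, d = sqrt(2 - 2·cos). That the model's embeddings are unit-length is an assumption here, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance (a distance, so smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


u = normalize([1.0, 2.0, 3.0])
v = normalize([2.0, 0.5, 1.0])
cos_uv = cosine(u, v)
euc_uv = euclidean(u, v)
euc_from_cos = math.sqrt(2.0 - 2.0 * cos_uv)   # identity for unit vectors
print(cos_uv, euc_uv, euc_from_cos)
```

The `Cos_Sim` and `Euc_Sim` values printed for `train_data` further down appear consistent with this relation (for example 0.183756 and 1.277688), which suggests the two columns carry largely the same information for this model.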


In [None]:
limit = 0   # limit sample size IF limit>0

size = len(train_data) if limit == 0 else min(limit, len(train_data))
embeddings = np.empty((size, 2, 384))
cos_sim = np.empty(size)
euc_sim = np.empty(size)

pairs = train_data[['Definition', 'Item']].values[:size]
for i, pair in enumerate(pairs):
    e = model.encode(list(pair))  # MODEL ENCODER here: returns a (2, 384) array
    embeddings[i, :, :] = e
    cos_sim[i] = cosine_similarity(e[0:1], e[1:2])[0, 0]     # scalar from the 1x1 result
    euc_sim[i] = euclidean_distances(e[0:1], e[1:2])[0, 0]

# work on an explicit copy so adding columns does not raise SettingWithCopyWarning
train_data = train_data.iloc[:size].copy()
train_data['Cos_Sim'] = cos_sim    # append cosine similarity as new column
train_data['Euc_Sim'] = euc_sim    # append Euclidean distance as new column

embeddings.shape, cos_sim.shape


((22189, 2, 384), (22189,))

In [None]:
train_data

Unnamed: 0,DataId,SourceId,Target,Definition,Item,Cos_Sim,Euc_Sim
0,0,2978,1,People whose past behavior is consistent with ...,Have any of your current or previous partners ...,0.183756,1.277688
1,1,1056,0,Facilitation from work to school.,I enjoy being a student on this campus.,0.292208,1.189783
3,3,1015,0,Employees? sense of belongingness at work.,Helps others when it is clear their workload i...,0.322255,1.164255
4,4,2988,0,How attracted members were to the crew and the...,Managers rate each crew (low performance/high ...,0.446235,1.052393
6,6,3169,1,Informal rewards and recognition that particip...,"In the last few years (1994 to the present), h...",0.384002,1.109953
...,...,...,...,...,...,...,...
28070,28070,12341,0,The extent to which reputations were observabl...,The project required close working relationshi...,0.213506,1.254188
28071,28071,12822,1,How characteristic each of the attractiveness ...,Wise.,0.147961,1.305403
28072,28072,3350,1,Participants' explanations for why the seller ...,The buyer is persuasive,0.569600,0.927793
28073,28073,13668,0,The extent to which the employee perceived the...,I have been able to express my views and feeli...,0.274533,1.204547


# Checkpoint results to gDrive

Mount gDrive and create folders

In [None]:
##### Only execute to save results
import os
from time import strftime, localtime

if USE_GDRIVE: 
    from google.colab import drive      # only importable inside Colab
    drive.mount('/content/drive')

    BASE_PATH = '/content/drive/MyDrive/CVA-SBERT-Analyses/'
    EXP_PATH = BASE_PATH + strftime("%Y%m%d-%H%M%S", localtime())

    # creates BASE_PATH as well if it does not exist yet
    os.makedirs(EXP_PATH, exist_ok=True)

Mounted at /content/drive



Save Train/Valid datasets 

In [None]:
# Save Train/Valid data to gDrive ...IF exists EXP_PATH with USE_GDRIVE=TRUE

if USE_GDRIVE and 'EXP_PATH' in globals():
    train_data.to_csv(EXP_PATH+'/train_data.csv', index=False)
    valid_data.to_csv(EXP_PATH+'/valid_data.csv', index=False)

    np.savez_compressed(EXP_PATH+'/embeddings.npz', embeddings) # BIG->compress!
    np.save(EXP_PATH+'/cos_sim.npy', cos_sim)
    np.save(EXP_PATH+'/euc_sim.npy', euc_sim)    

    # dump copy of original data (if needed)
    #data.to_csv(EXP_PATH+'/all_data.csv', index=False)