<a href="https://colab.research.google.com/github/Hackathorn/CVA-SBERT/blob/main/notebooks/CVA%20Analysis%20using%20S-Bert%20v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This is a change from Colab**

# Revise CVA-SBERT notebook by:

-   For all Items, compute latent vectors.
-   For all unique Definitions, compute latent vectors.  
-   Split fullv4 dataset into 80/20 Training/Validation based on Source.
-   For Training by each Definition, compute pairwise similarities among its Items.
-   Try... Compute similarity stats. Browse extreme similarities for patterns in text.  
-   Try... Based on pairwise equality of Target values, plot similarity (and spread) distributions. Clear classification?
-   Try... UMAP hierarchical clustering of latent vectors. May have to use a small sample.
-   Create a public GitHub repository for the above.
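
The 80/20 split by Source in the list above could be sketched as follows. This is a minimal sketch, assuming that "Source" corresponds to the `SourceId` column of the dataset; the helper `split_by_source` and the toy data are hypothetical and only illustrate keeping all rows of one source on the same side of the split. It holds out roughly 20% of the sources, which is not necessarily 20% of the rows.

```python
import numpy as np
import pandas as pd


def split_by_source(df, group_col="SourceId", val_fraction=0.2, seed=0):
    """Split rows so that no source appears in both training and validation."""
    rng = np.random.default_rng(seed)
    # Shuffle the distinct source ids, then hold out a fraction of them.
    sources = rng.permutation(df[group_col].unique())
    n_val = max(1, int(round(len(sources) * val_fraction)))
    is_val = df[group_col].isin(sources[:n_val])
    return df[~is_val], df[is_val]


# Toy frame reusing the notebook's column names (values are made up).
toy = pd.DataFrame({
    "SourceId": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Target": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
train, val = split_by_source(toy)
print(len(train), len(val))
```

Splitting on the source rather than on individual rows avoids Items from the same source leaking into both sets, which would likely inflate validation results.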

# Research S-BERT for:

-   Relationship with [HuggingFace Hub](https://www.sbert.net/docs/hugging_face.html)  
-   [model comparisons](https://www.sbert.net/docs/pretrained_models.html), like **all-MiniLM-L6-v2** for good quick results
-   [unsupervised learning](https://www.sbert.net/examples/unsupervised_learning/README.html) plus [domain adaptation](https://www.sbert.net/examples/domain_adaptation/README.html) by fine tuning on labeled training data  
-   [evaluation classes](https://www.sbert.net/docs/package_reference/evaluation.html) like BinaryClassificationEvaluator
-   understand parameters for the [SentenceTransformer](https://www.sbert.net/docs/package_reference/SentenceTransformer.html) class and its `encode` method
-   understand/test differences between [Cross-Encoders versus Bi-Encoders](https://www.sbert.net/examples/applications/cross-encoder/README.html)
-   S-BERT clustering approaches like [topic modeling](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) (with UMAP) and [BERTopic](https://github.com/MaartenGr/BERTopic)

# References

This notebook derives 



# Imports

In [None]:
!pip install -q sentence_transformers



In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

import pandas as pd
from pprint import pprint

# Read dataset & create sample

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/CVA Training Data Allv4_Richard.csv')
data  # explore with Colab Data Table Display

| Unnamed: 0 | SourceId | Target | Definition | Item_Text |
|---|---|---|---|---|
| 0 | 2978 | 1 | People whose past behavior is consistent with ... | Have any of your current or previous partners ... |
| 1 | 1056 | 0 | Facilitation from work to school. | I enjoy being a student on this campus. |
| 2 | 9900 | 0 | The telemarketers ranked from 1 (most importan... | To upgrade physical work environments. |
| 3 | 1015 | 0 | Employees? sense of belongingness at work. | Helps others when it is clear their workload i... |
| 4 | 2988 | 0 | How attracted members were to the crew and the... | Managers rate each crew (low performance/high ... |
| ... | ... | ... | ... | ... |
| 28071 | 12822 | 1 | How characteristic each of the attractiveness ... | Wise. |
| 28072 | 3350 | 1 | Participants' explanations for why the seller ... | The buyer is persuasive |
| 28073 | 13668 | 0 | The extent to which the employee perceived the... | I have been able to express my views and feeli... |
| 28074 | 2361 | 1 | Newcomers? belief that good alternative work e... | To what extent have other co-workers influence... |


I manually scanned the dataset and picked four Definitions that seem to make sense.

In [None]:
definition_samples = [
    "A combination of temporal planning and temporal reminders modified to be leader-specific.",
    "A behavioral observation scale for appraising the employee's performance",
    "A belief that ability is fixed and unchangeable.",
    "A deep sense of moral obligation associated with animal care.",
]

For the first pass, I chose only the second sample to process.

In [None]:
data_sample = data[data['Definition'] == definition_samples[1]]
data_sample

| Unnamed: 0 | SourceId | Target | Definition | Item_Text |
|---|---|---|---|---|
| 427 | 1930 | 1 | A behavioral observation scale for appraising ... | The employee influences others in a way that r... |
| 1634 | 1930 | 1 | A behavioral observation scale for appraising ... | The employee adapts personal style to the need... |
| 6887 | 1930 | 0 | A behavioral observation scale for appraising ... | People can substantially change the kind of pe... |
| 18047 | 1930 | 0 | A behavioral observation scale for appraising ... | Everyone is a certain kind of person, and ther... |


# Model

The HuggingFace pipeline `SentenceTransformer` is ...

The model `paraphrase-MiniLM-L6-v2` is ...

In [None]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


# Sentences

This section takes the `data_sample` from the previous section. It consists of N sentence pairs: the first sentence is the ***Definition*** of a specific topic that a survey instrument studies, and the second is the ***Item*** that the respondent rates.

Each sentence in a pair is encoded into a 384-dimensional latent (embedding) vector, and a cosine similarity is calculated between the two vectors of the pair.
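
The cosine similarity mentioned above is the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch, using a hypothetical helper `cosine_sim` and tiny 2-dimensional vectors in place of real 384-dimensional embeddings:

```python
import numpy as np


def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine_sim([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_sim([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Values near 1 mean the two sentences point in nearly the same direction in the embedding space, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.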

`sentence_pairs` is a list of lists built from the `data_sample` DataFrame; each inner list holds the Definition, the Item text, and the Target label.

In [None]:
sentence_pairs = data_sample[['Definition', 'Item_Text', 'Target']].values.tolist()
sentence_pairs

[["A behavioral observation scale for appraising the employee's performance",
  'The employee influences others in a way that results in agreement',
  1],
 ["A behavioral observation scale for appraising the employee's performance",
  'The employee adapts personal style to the needs of different situations',
  1],
 ["A behavioral observation scale for appraising the employee's performance",
  'People can substantially change the kind of person they are.',
  0],
 ["A behavioral observation scale for appraising the employee's performance",
  'Everyone is a certain kind of person, and there is not much they can really change about that.',
  0]]

Once `sentence_pairs` is encoded by the model, the resulting `embeddings` is a list with one NumPy array per pair; each row of an array is a latent vector of shape (384,).

In [None]:
embeddings = []
for pair in sentence_pairs:
    # Encode only the two sentences (Definition, Item_Text); pair[2] is the integer Target label.
    embeddings.append(model.encode(pair[:2]))

type(embeddings), len(embeddings), len(embeddings[0]), type(embeddings[0][0]), embeddings[0][0].shape

(list, 4, 2, numpy.ndarray, (384,))
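
The loop above encodes the same Definition once for every pair it appears in. One possible way to avoid the repeated work, in line with the plan to compute latent vectors for all unique Definitions, is to encode each distinct text once and map the vectors back to the rows. This is only a sketch: `encode_unique` is a hypothetical helper, and `fake_encode` is a stand-in for `model.encode` that returns dummy 3-dimensional vectors so the example runs without downloading a model.

```python
import numpy as np


def encode_unique(texts, encode_fn):
    """Encode each distinct text once and return one vector per input text."""
    unique_texts = list(dict.fromkeys(texts))       # de-duplicate, keep first-seen order
    vectors = np.asarray(encode_fn(unique_texts))   # shape: (n_unique, dim)
    row_of = {text: i for i, text in enumerate(unique_texts)}
    return vectors[[row_of[t] for t in texts]]      # shape: (len(texts), dim)


calls = []


def fake_encode(batch):
    """Hypothetical stand-in for model.encode: records the batch, returns dummy vectors."""
    calls.append(list(batch))
    return np.array([[len(s), s.count(" "), 1.0] for s in batch], dtype=float)


definitions = ["def A", "def A", "def B", "def A"]
definition_vectors = encode_unique(definitions, fake_encode)
print(definition_vectors.shape, len(calls[0]))
```

With the real model, `model.encode` would be passed as `encode_fn`, once for the Definition column and once for the Item column.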

# Similarity

Using [Cosine Similarity function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)  from sklearn, ...

In [None]:
for i, pair in enumerate(embeddings):
    print(sentence_pairs[i][0])
    print(sentence_pairs[i][1])
    sim = cosine_similarity(pair[0].reshape(1, -1), pair[1].reshape(1, -1))
    print("   Similarity:", sim, "Target:", sentence_pairs[i][2])
    print()


A behavioral observation scale for appraising the employee's performance
The employee influences others in a way that results in agreement
   Similarity: [[0.59870344]] Target: 1

A behavioral observation scale for appraising the employee's performance
The employee adapts personal style to the needs of different situations
   Similarity: [[0.55898446]] Target: 1

A behavioral observation scale for appraising the employee's performance
People can substantially change the kind of person they are.
   Similarity: [[0.31225926]] Target: 0

A behavioral observation scale for appraising the employee's performance
Everyone is a certain kind of person, and there is not much they can really change about that.
   Similarity: [[0.20095041]] Target: 0
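
For these four pairs the Target-1 Items score higher than the Target-0 Items, which suggests trying a simple similarity threshold as a classifier. The sketch below uses the scores and Targets printed above; `best_threshold` is a hypothetical helper that tries the midpoints between neighbouring scores as cut points. Four pairs from a single Definition are far too few to draw conclusions, so this only illustrates the mechanics.

```python
import numpy as np

# Similarities and Targets copied from the output above.
scores = np.array([0.59870344, 0.55898446, 0.31225926, 0.20095041])
targets = np.array([1, 1, 0, 0])


def best_threshold(scores, targets):
    """Return (threshold, accuracy) for the rule 'predict 1 when score >= threshold'."""
    order = np.sort(np.unique(scores))
    # Candidate cut points: midpoints between neighbouring scores.
    candidates = (order[:-1] + order[1:]) / 2
    accuracies = [np.mean((scores >= t).astype(int) == targets) for t in candidates]
    best = int(np.argmax(accuracies))
    return float(candidates[best]), float(accuracies[best])


threshold, accuracy = best_threshold(scores, targets)
print(round(threshold, 3), accuracy)
```

On a held-out validation set, the threshold would be chosen on the training pairs only and then applied unchanged to the validation pairs.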

