# __[:+:]__ SBERT Embeddings Test

In [27]:
from sentence_transformers import SentenceTransformer, util, InputExample, losses
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import plotly.express as px
from umap import UMAP
import pandas as pd
import numpy as np
import umap.plot
import pickle
import json

### Get Data

In [28]:
with open("ideological_corpus.txt", "r") as f:
    corpus = f.readlines()
print("[+] -- Loaded ", len(corpus), ' docs ---------------------------------------------------------------------------------------------|')
for i in range(5):
    print(i, ': ', corpus[i])

with open("transcript.txt", "r") as f:
    transcript = f.readlines()
print("[+] -- Loaded ", len(transcript), ' docs ---------------------------------------------------------------------------------------------|')
for i in range(5):
    print(i, ': ', transcript[i])

# Read csv of generated reference opinions as dataframe
refCluster_df = pd.read_csv('referenceClusters.csv')
print(refCluster_df.head(), '\n')

[+] -- Loaded  70  docs ---------------------------------------------------------------------------------------------|
0 :  Support for Israeli settlements in the West Bank is crucial for security.

1 :  Palestinian statehood should be recognized and supported by the international community.

2 :  Economic cooperation between Israel and Palestine can lead to peace.

3 :  Military action is necessary to protect Israeli borders from threats.

4 :  Human rights abuses against Palestinians must be addressed by global organizations.

[+] -- Loaded  238  docs ---------------------------------------------------------------------------------------------|
0 :  Speaker 1 [0.01s - 26.53s]:  You're saying that you think it's the two narratives. Yeah, but we can't hear you very loudly. Maybe it's much better, much better.

1 :  Speaker 3 [4.60s - 25.21s]:  Yeah, I was just saying that I think it's pretty obvious why Native Americans would side with Palestinians. I didn't really know that there were

### Load Model and Embed 
Choose and load a model from the SBERT pretrained models list: https://sbert.net/docs/sentence_transformer/pretrained_models.html

In [29]:
#[+:]-- Top quality for general purpose
model = SentenceTransformer('all-mpnet-base-v2')

#[+:]-- Smaller, faster, but decent quality
# model = SentenceTransformer('all-MiniLM-L6-v2')

#Embed some examples
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

#Get the cosine similarity score between example sentences
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)


`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884



Cosine-Similarity: tensor([[0.6053]])
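
For reference, the score above is the dot product of the two embeddings divided by the product of their lengths. Below is a minimal NumPy sketch of that formula. The vectors `u` and `v` are made-up 3-dimensional placeholders, not real SBERT embeddings, and `cosine_similarity` is a hypothetical helper rather than part of `sentence_transformers`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (placeholders, not real SBERT embeddings)
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])

print("u vs v:", cosine_similarity(u, v))  # partial overlap
print("u vs u:", cosine_similarity(u, u))  # identical direction
```

Scores close to 1 mean the vectors point in nearly the same direction; scores near 0 mean they are roughly orthogonal.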


In [30]:
# Encode the whole list in one call; returns an array of shape (n_docs, dim)
corpusEmb_arr = model.encode(corpus)
print(len(corpusEmb_arr), len(corpusEmb_arr[0]))


#[+:] Extract only the sentences (without the labels)
sentences = refCluster_df.loc[:,'opinion']
print(sentences.head(), '\n')

#[+:] Embed the extracted sentences
sent_emb = model.encode(sentences.tolist())
print('Embedded Sentences: ', len(sent_emb),'x', len(sent_emb[0]))

70 768
0    Israel has the right to exist as a sovereign J...
1    Israel's security measures are necessary to pr...
2    The Israeli government's efforts to negotiate ...
3    Israel's withdrawal from Gaza in 2005 was a si...
4    Hamas' use of civilian areas for launching att...
Name: opinion, dtype: object 

Embedded Sentences:  100 x 768


### UMAP Dimensionality Reduction & Visualization

#### 2D Projection

In [31]:

#[+] Also try "metric= cosine" arg for UMAP
# umapper_2d= UMAP(n_components= 2, init= 'random', random_state=0, metric='cosine')
# proj_2d= umapper_2d.fit_transform(corpusEmb_arr)
# fig_2d= px.scatter(proj_2d, x=0, y=1)
# fig_2d.show()

#### 3D Projection

In [33]:
umapper_3d= UMAP(n_components= 3, init= 'random', random_state=0, metric='cosine')
proj_3d= umapper_3d.fit_transform(corpusEmb_arr)
fig_3d= px.scatter_3d(proj_3d, x=0, y=1, z=2)
fig_3d.show()




umap_3d = UMAP(n_components=3, init='random', random_state=0, metric='cosine')
proj_3d = umap_3d.fit_transform(sent_emb)
fig_3d = px.scatter_3d(
    proj_3d, x=0, y=1, z=2,
    color=refCluster_df.stance, labels={'color': 'stance'}
)
fig_3d.update_traces(marker_size=5)
fig_3d.show()


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.




n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



## TODO
- [+] Install packages
- [+] Embed corpus lines
- [+] UMAP Dim reduction
- [+] For the embeddings, put a label next to each embedding, save the whole thing as a CSV, and load it as a DataFrame
- [ ] Explore different sentence transformers models
- [ ] Explore clustering methods (read documentation)
- [ ] Embed LLM-generated polarized opinions as references
- [ ] Try other dimensionality reduction techniques
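
The "label next to each embedding" item could look roughly like the sketch below. It is an illustration only: the embeddings are random placeholders, the label values and the `dim_*` column names are hypothetical, and an in-memory buffer stands in for a CSV file on disk.

```python
import io
import numpy as np
import pandas as pd

# Placeholder data: 4 fake 5-dimensional "embeddings" with hypothetical labels
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 5))
labels = ["stance_a", "stance_a", "stance_b", "stance_b"]

# One row per sentence: the label first, then one column per dimension
dim_cols = [f"dim_{i}" for i in range(embeddings.shape[1])]
emb_df = pd.DataFrame(embeddings, columns=dim_cols)
emb_df.insert(0, "label", labels)

# Round-trip through CSV (a StringIO buffer here instead of a real file)
buffer = io.StringIO()
emb_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded_df = pd.read_csv(buffer)

# Recover the numeric matrix for UMAP or similarity search
recovered = loaded_df[dim_cols].to_numpy()
print(loaded_df.head())
```

With real data, `embeddings` would be the output of `model.encode(...)` and `labels` would come from a column such as `refCluster_df.stance`.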


## Possible Ideas
-   Recreate Lupin: Classify embedding within a spectrum between two polar opposite references
-   KNN-based algorithm: instead of using a spectrum, simply classify each opinion based on its nearest neighbors in vector space, or some variation of that
    - Given the sometimes ambiguous and overlapping nature of ideologies, this might be used as an alternative to classifying based on discrete categories
    - e.g., "given this person's opinion, they are most ideologically aligned with politicians x and y"
    - We could also create an embedding of an entire ideology as the average of a collection of quotes, and classify based on the single nearest neighbor

-   Recreate the 2019 Party Embedding method using SBERT
    - Embed each opinion in the dataset with a party/ideology label
    - Classify each new opinion based on its distance to that label (needs more research to confirm this method)
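
The geometric operations behind these ideas can be sketched with plain NumPy. This is a rough illustration under stated assumptions, not a reproduction of Lupin or of the 2019 Party Embedding method: the helper names (`spectrum_position`, `knn_labels`, `nearest_centroid`) are hypothetical, and the 2-dimensional vectors and the "A"/"B" labels are toy placeholders for real SBERT embeddings and ideology labels.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def spectrum_position(emb, ref_a, ref_b):
    """Project emb onto the line from ref_a to ref_b.

    Returns 0.0 at ref_a and 1.0 at ref_b; values outside [0, 1] are possible.
    """
    axis = ref_b - ref_a
    return float(np.dot(emb - ref_a, axis) / np.dot(axis, axis))

def knn_labels(query, embeddings, labels, k=3):
    """Labels of the k reference embeddings most cosine-similar to query."""
    sims = normalize(embeddings) @ normalize(query)
    top = np.argsort(-sims)[:k]
    return [labels[i] for i in top]

def nearest_centroid(query, embeddings, labels):
    """Label whose mean embedding is most cosine-similar to query."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    centroids = np.stack([embeddings[labels == n].mean(axis=0) for n in names])
    sims = normalize(centroids) @ normalize(query)
    return names[int(np.argmax(sims))]

# Toy reference set (placeholders): two quotes per hypothetical ideology
refs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ref_labels = ["A", "A", "B", "B"]
query = np.array([0.8, 0.3])

print("spectrum position:", spectrum_position(query, refs[0], refs[2]))
print("3 nearest labels:", knn_labels(query, refs, ref_labels, k=3))
print("nearest centroid:", nearest_centroid(query, refs, ref_labels))
```

The spectrum version collapses each opinion to a single number between two poles. The neighbor and centroid versions keep the full embedding space, which may suit overlapping ideologies better.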