# Day 4: Word2Vec, Text Similarity, Hugging Face

## Agenda
- Word Embeddings: Word2Vec
- Text Similarity: Cosine Similarity, Jaccard Similarity
- Introduction to HuggingFace Transformers Library

## Word Embeddings
- Humans are good with words. Computers are good with numbers
  - Look at the evolution of programming languages
- A word embedding is a numerical representation of a word (more precisely, a vector representation)
- This allows computers to measure how similar or dissimilar two texts are

- 2 popular techniques: Word2Vec, GloVe
  - Word2Vec: "two words sharing similar contexts also share a similar meaning"
  - GloVe: "Global Vectors for Word Representation"
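
To make "sharing similar contexts" concrete, here is a minimal sketch of how (center word, context word) pairs could be collected with a sliding window, roughly in the spirit of Word2Vec's skip-gram variant. The function name `context_pairs`, the toy sentence, and the window size of 2 are assumptions for illustration; this is not Gensim's implementation.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy sentence, made up for this example
sentence = "i drink coffee from a mug".split()
pairs = context_pairs(sentence, window=2)

# Context words seen around "coffee"
coffee_context = [ctx for center, ctx in pairs if center == "coffee"]
print(coffee_context)
```

Words that keep showing up with the same context words end up with similar vectors after training.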

The Gensim library provides a Word2Vec implementation, which we'll use to explore.

In [None]:
pip install gensim

In [40]:
import gensim.downloader as api

# Pre-trained model using word2vec on google news dataset
w2v = api.load('word2vec-google-news-300')

In [41]:
print(f"Out of total {len(w2v.index_to_key)} words")
for index, word in enumerate(w2v.index_to_key):
    if index <= 10:
        print(f"#{index}: {word}")
    else:
        break

Out of total 3000000 words
#0: </s>
#1: in
#2: for
#3: that
#4: is
#5: on
#6: ##
#7: The
#8: with
#9: said
#10: was


In [42]:
vec_computer = w2v['computer']
print(vec_computer)

[ 1.07421875e-01 -2.01171875e-01  1.23046875e-01  2.11914062e-01
 -9.13085938e-02  2.16796875e-01 -1.31835938e-01  8.30078125e-02
  2.02148438e-01  4.78515625e-02  3.66210938e-02 -2.45361328e-02
  2.39257812e-02 -1.60156250e-01 -2.61230469e-02  9.71679688e-02
 -6.34765625e-02  1.84570312e-01  1.70898438e-01 -1.63085938e-01
 -1.09375000e-01  1.49414062e-01 -4.65393066e-04  9.61914062e-02
  1.68945312e-01  2.60925293e-03  8.93554688e-02  6.49414062e-02
  3.56445312e-02 -6.93359375e-02 -1.46484375e-01 -1.21093750e-01
 -2.27539062e-01  2.45361328e-02 -1.24511719e-01 -3.18359375e-01
 -2.20703125e-01  1.30859375e-01  3.66210938e-02 -3.63769531e-02
 -1.13281250e-01  1.95312500e-01  9.76562500e-02  1.26953125e-01
  6.59179688e-02  6.93359375e-02  1.02539062e-02  1.75781250e-01
 -1.68945312e-01  1.21307373e-03 -2.98828125e-01 -1.15234375e-01
  5.66406250e-02 -1.77734375e-01 -2.08984375e-01  1.76757812e-01
  2.38037109e-02 -2.57812500e-01 -4.46777344e-02  1.88476562e-01
  5.51757812e-02  5.02929

In [43]:
w2v['revature']

KeyError: "Key 'revature' not present"
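
Looking up a word the model never saw raises a `KeyError`, as above. One way to guard against this is a membership test before indexing. The helper `get_vector_or_none` below is a hypothetical name, and the tiny `dict` stands in for `w2v` so the sketch runs without downloading the model; Gensim's `KeyedVectors` supports the same `in` check.

```python
def get_vector_or_none(vectors, word):
    """Return the vector for `word`, or None when it is out of vocabulary."""
    return vectors[word] if word in vectors else None

# Stand-in for w2v: made-up 3-dimensional values, not real Word2Vec vectors
toy_vectors = {"computer": [0.1, -0.2, 0.3]}

for word in ["computer", "revature"]:
    vec = get_vector_or_none(toy_vectors, word)
    print(word, "->", "not in vocabulary" if vec is None else vec)
```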

In [48]:
pairs = [
    # Going from more similar to less similar 
    ('cup', 'cup'),
    ('cup', 'bowl'),  
    ('cup', 'beverage'),
    ('cup', 'cat'),  
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, w2v.similarity(w1, w2)))

'cup'	'cup'	1.00
'cup'	'bowl'	0.40
'cup'	'beverage'	0.27
'cup'	'cat'	0.13


In [50]:
print(w2v.most_similar(positive=['mug'], topn=5))

[('coffee_mug', 0.6172524690628052), ('mugshot', 0.6114740371704102), ('mugs', 0.6063777804374695), ('pint_glass', 0.529857337474823), ('mugshots', 0.5195839405059814)]


In [46]:
print(w2v.doesnt_match(['cup', 'cat', 'mug', 'jar']))

cat
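
An odd-one-out check like this can be sketched as follows: average the unit-length word vectors, then pick the word whose vector is least similar to that average. The 2-dimensional vectors below are made up for illustration, and `odd_one_out` is a hypothetical helper rather than Gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = {w: unit(v) for w, v in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    # For unit vectors, the dot product equals the cosine similarity
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

# Made-up toy vectors: three "container" words point one way, "cat" another
toy = {"cup": [0.9, 0.1], "mug": [0.8, 0.2], "jar": [0.85, 0.3], "cat": [0.1, 0.9]}
print(odd_one_out(toy))
```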


## Text Similarity: Cosine Similarity, Jaccard Similarity

- Techniques like Word2Vec and GloVe give us mathematical representations of texts, but how do we tell how similar or dissimilar two texts are?
- Different Techniques:
    - Euclidean Distance (how far apart are these vectors?)
    - Cosine Similarity (The angle between the vectors)
    - Jaccard Similarity (How many words do these texts share?)
    - and more
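
To see how the first two measures differ, here is a small dependency-free sketch. The vectors are made up: `long_doc` points in the same direction as `short_doc` but is three times as long, as if the same text were repeated three times. The function names are illustrative only.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_doc = [1, 2, 0]
long_doc = [3, 6, 0]  # same direction, three times the length

dist = euclidean_distance(short_doc, long_doc)
cos = cosine(short_doc, long_doc)
print(dist)  # clearly non-zero: the vectors are far apart
print(cos)   # very close to 1: the vectors point the same way
```

Euclidean distance is sensitive to vector length (for example, document length), while cosine similarity only looks at direction.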

### Cosine Similarity
One of the most common techniques. We calculate the cosine of the angle between the two vectors.
![cos_sim](https://memgraph.com/images/blog/cosine-similarity-python-scikit-learn/cosine-similarity.png)
![cos_sim_formula](https://assets-global.website-files.com/5ef788f07804fb7d78a4127a/60dee7e4dec6611dc63cb158_dNiiYIrknDdfDwnqRpJ4n23givOOrrkWvlsBED9hE7qahtn_itdM1ziLQm0YYmqlV2j5q1Kur_icFc_K1jyYKIAcz_PBZ32OjpaFVQGAf41K3O0PhVRnnROFNnb_04jQ36VcX8pF.png)
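
In case the images above do not load, the standard cosine similarity formula for two vectors $A$ and $B$ with $n$ components can be written as:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$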

In [53]:
pip install scikit-learn


In [54]:
pip install numpy


In [55]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# A plain NumPy implementation of cosine similarity
# (scikit-learn also ships one as sklearn.metrics.pairwise.cosine_similarity)
def cosine_similarity(vec1, vec2):
    # Vectors of different lengths cannot be compared
    if len(vec1) != len(vec2):
        return None

    # Compute the dot product between the 2 vectors
    dot_prod = np.dot(vec1, vec2)

    # Compute the norms (lengths) of the 2 vectors
    norm_vec1 = np.sqrt(np.sum(vec1**2))
    norm_vec2 = np.sqrt(np.sum(vec2**2))

    # Compute the cosine similarity:
    # divide the dot product by the product of the two norms
    return dot_prod / (norm_vec1 * norm_vec2)

In [56]:
# Sample texts
text1 = "Natural language processing is fascinating."
text2 = "I'm intrigued by the wonders of natural language processing."

# Tokenize the texts and turn them into word-count vectors
count_vectors = CountVectorizer().fit_transform([text1, text2]).toarray()
print(count_vectors)

# Calculate cosine similarity
cosine_sim = cosine_similarity(count_vectors[0, :], count_vectors[1, :])

# Cosine similarity ranges from -1 to 1; a number closer to 1 means the texts are more similar.
# With word counts, which are never negative, it stays between 0 and 1.
print("Cosine Similarity:")
print(cosine_sim)

[[0 1 0 1 1 1 0 1 0 0]
 [1 0 1 0 1 1 1 1 1 1]]
Cosine Similarity:
0.4743416490252569
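
The printed value can be checked by hand from the two count vectors above: the texts share 3 words, and the vectors contain 5 and 8 ones respectively, so the similarity is 3 / (√5 · √8). A dependency-free sketch of that arithmetic:

```python
import math

# Count vectors copied from the CountVectorizer output above
vec1 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
vec2 = [1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

dot = sum(a * b for a, b in zip(vec1, vec2))  # words present in both texts
sq1 = sum(a * a for a in vec1)                # squared length of vec1
sq2 = sum(b * b for b in vec2)                # squared length of vec2

result = dot / (math.sqrt(sq1) * math.sqrt(sq2))
print(dot, sq1, sq2)
print(result)
```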


### Jaccard Index
The Jaccard Index, or Jaccard similarity coefficient, is used to gauge the similarity between two sets.

![jc_idx](https://wikimedia.org/api/rest_v1/media/math/render/svg/eaef5aa86949f49e7dc6b9c8c3dd8b233332c9e7)

In [58]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

In [63]:
from nltk import word_tokenize

text1 = "natural language processing is fascinating."
text2 = "I'm intrigued by the wonders of natural language processing."

# Tokenize the sentences and turn them into sets
set1 = set(word_tokenize(text1))
set2 = set(word_tokenize(text2))

# Calculate Jaccard similarity
jaccard_sim = jaccard_similarity(set1, set2)

# Print the Jaccard similarity
print(f"Jaccard Similarity: {jaccard_sim}")

Jaccard Similarity: 0.3076923076923077
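
Sets compare tokens exactly, so `Natural` and `natural` would count as different words, and punctuation such as `.` counts as a token too. Below is a minimal sketch of normalizing before comparing. It assumes plain whitespace splitting instead of NLTK's `word_tokenize`, so the score will not match the one above; `normalize_tokens` is a hypothetical helper.

```python
import string

def normalize_tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = {tok.strip(string.punctuation).lower() for tok in text.split()}
    return tokens - {""}  # drop tokens that were pure punctuation

def jaccard_similarity(set1, set2):
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

a = normalize_tokens("Natural language processing is fascinating.")
b = normalize_tokens("I'm intrigued by the wonders of natural language processing.")

print(sorted(a & b))  # shared words
score = jaccard_similarity(a, b)
print(round(score, 4))
```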


## HuggingFace
- A company that focuses on fostering an open-source AI/ML community
- Originally a chatbot company aimed at teenagers (perhaps hence the name and emoji)
- Famous for its massive open-source collection of AI models and its Transformers library
- The platform we'll be using for our LLMs and text embedding models

### Transformers Library
- Library built for easy, low-barrier interaction with various AI models
- "Transformer" refers to a specific type of context-aware neural network architecture (used by models like BERT and GPT), BUT HuggingFace's Transformers library also offers models beyond text generation.
- Benefits: Ease of Use, Flexibility, Simplicity
- Tutorial: https://huggingface.co/learn/nlp-course

In [None]:
# Install HuggingFace Transformer library
pip install transformers

In [None]:
# On a Windows machine and hit an ERROR about long paths not being enabled? Run this in PowerShell (as administrator)
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" `
-Name "LongPathsEnabled" -Value 1 -PropertyType DWORD -Force

In [None]:
pip install tensorflow

In [None]:
pip install torch

In [3]:
from transformers import pipeline

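# Note: microsoft/deberta-large-mnli is fine-tuned for natural language inference (MNLI),
# so it returns CONTRADICTION / NEUTRAL / ENTAILMENT labels rather than POSITIVE / NEGATIVE
classifier = pipeline("sentiment-analysis", model="microsoft/deberta-large-mnli")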
classifier(["I'm so excited for the 3 day weekend!", "I love my cat"])

[{'label': 'NEUTRAL', 'score': 0.9694152474403381},
 {'label': 'NEUTRAL', 'score': 0.9525929689407349}]

Remember tokenizing from the last 2 days?
We can't feed raw text to a model, so we have to tokenize it exactly the way the model expects, which is the way its training data was tokenized.

In [4]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for this weekend this whole month.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 5353, 2023, 2878,
        3204, 1012,  102],
       [ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0,    0,
           0,    0,    0]])>, 'attention_mask': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])>}
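
The zeros at the end of the second row are padding, and the `attention_mask` marks which positions hold real tokens (1) and which hold padding (0). Here is a pure-Python sketch of that bookkeeping, using the token ids from the output above. `pad_batch` is a hypothetical helper, and it assumes a padding id of 0, which is what the output shows for this tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Unpadded token ids, copied from the tokenizer output above
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 5353, 2023, 2878, 3204, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```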


In [5]:
from transformers import pipeline

text_generator = pipeline("text-generation")
generated_text = text_generator("This weekend is three day weekend! I am going to")[0]['generated_text']
print(generated_text)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This weekend is three day weekend! I am going to be taking advantage of the opportunities that are available for me.

This Sunday I am taking advantage of all the opportunities available to me to compete with other athletes in the world. All these challenges


In [6]:
from transformers import pipeline
long_text = """
In recent years, artificial intelligence (AI) has witnessed remarkable advancements, particularly in the field of natural language processing (NLP). Researchers and developers have made significant strides in creating sophisticated models that can understand and generate human-like text. One notable breakthrough is the development of transformer-based architectures, such as BERT and GPT-3, which have set new benchmarks in various NLP tasks.

These models excel in tasks like sentiment analysis, text summarization, and language translation, showcasing their versatility. The widespread adoption of transformer models has led to the emergence of powerful tools and libraries, like the Transformers library by Hugging Face, making it easier for practitioners to leverage state-of-the-art models for their applications.

However, the rapid progress in AI and NLP also raises ethical concerns and challenges. Issues like bias in language models, transparency in decision-making processes, and potential misuse of powerful AI technologies need careful consideration. As the field continues to evolve, striking a balance between innovation and ethical responsibility becomes crucial.

In conclusion, the recent advancements in NLP, driven by transformer models and accessible libraries, have transformed the landscape of AI. While these technologies offer unprecedented capabilities, the ethical implications and responsible use of AI demand ongoing attention from the global community.
"""
summarizer = pipeline("summarization")
summary = summarizer(long_text)[0]['summary_text']
print(summary)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


 Researchers and developers have made significant strides in creating sophisticated models that can understand and generate human-like text . The rapid progress in AI and NLP also raises ethical concerns and challenges . As the field continues to evolve, striking a balance between innovation and ethical responsibility becomes crucial .


### Other ways to use HuggingFace

#### Hugging Face Inference API
The Hugging Face Inference API and Inference Endpoints give us access to models hosted on Hugging Face via simple HTTP calls.
A subtle but annoying difference: the Inference API is the free version, while Inference Endpoints is paid and runs on your own dedicated Hugging Face infrastructure.

https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat

https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
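
As a rough sketch of what "simple HTTP calls" can look like, the snippet below only builds a POST request for the zephyr model linked above and does not send it. The endpoint pattern `https://api-inference.huggingface.co/models/<model id>`, the `{"inputs": ...}` payload, and the placeholder token `hf_xxx` are assumptions based on how the free Inference API has been documented; check the current Hugging Face docs before relying on them. `build_inference_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed endpoint pattern for the hosted Inference API
API_BASE = "https://api-inference.huggingface.co/models/"

def build_inference_request(model_id, prompt, token):
    """Build (but do not send) a POST request carrying the prompt as JSON."""
    payload = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + model_id,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",  # placeholder token, not a real one
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_inference_request(
    "HuggingFaceH4/zephyr-7b-beta",
    "Explain word embeddings in one sentence.",
    "hf_xxx",
)
print(req.get_method(), req.full_url)

# To actually send it (requires a valid token and network access):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```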