# Day 4: Word2Vec, Text Similarity, Hugging Face

## Agenda
- Word Embeddings: Word2Vec
- Text Similarity: Cosine Similarity, Jaccard Similarity
- Introduction to HuggingFace Transformers Library

## Word Embeddings
- Humans are good with words. Computers are good with numbers.
  - Look at the evolution of programming languages (from machine code toward increasingly human-readable syntax)
- A word embedding is a vector representation of a word (more precisely, a dense vector of real numbers)
- This lets a computer measure how similar or dissimilar two texts are

- 2 popular techniques: Word2Vec, GloVe
  - Word2Vec: "two words sharing similar contexts also share a similar meaning"
  - GloVe: "Global Vectors for Word Representation"
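
To make "sharing similar contexts" concrete, here is a minimal sketch that collects the words appearing near each word in a tiny corpus. The corpus, the `context_windows` helper, and the window size are made up for illustration; real Word2Vec trains a neural network on these (word, context) relationships rather than comparing sets directly.

```python
from collections import defaultdict

def context_windows(tokens, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word].add(tokens[j])
    return contexts

# Hypothetical toy corpus: "cup" and "mug" show up in nearly identical contexts
corpus = "i drink coffee from a cup . i drink coffee from a mug .".split()
contexts = context_windows(corpus, window=2)

# Context words shared by "cup" and "mug"
shared = contexts["cup"] & contexts["mug"]
print(sorted(shared))
```

Because "cup" and "mug" share most of their neighbours, a model trained on such contexts would be pushed to give them similar vectors.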

The Gensim library provides a Word2Vec implementation, which we'll use to explore pretrained embeddings.

In [None]:
pip install gensim

In [None]:
import gensim.downloader as api

w2v = api.load('word2vec-google-news-300')

In [None]:
print(f"Vocabulary size: {len(w2v.index_to_key)} words")
# Show the first 11 entries (indices 0 through 10) of the vocabulary
for index, word in enumerate(w2v.index_to_key[:11]):
    print(f"#{index}: {word}")

In [None]:
vec_computer = w2v['computer']
print(vec_computer)

In [None]:
# Looking up a word that is not in the model's vocabulary raises a KeyError
w2v['revature']

In [None]:
pairs = [
    # Going from more similar to less similar 
    ('cup', 'mug'),
    ('cup', 'bowl'),  
    ('cup', 'beverage'),
    ('cup', 'cat'),  
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, w2v.similarity(w1, w2)))

In [None]:
print(w2v.most_similar(positive=['cup', 'mug'], topn=5))

In [None]:
print(w2v.doesnt_match(['cup', 'cat', 'mug', 'jar']))

## Text Similarity: Cosine Similarity, Jaccard Similarity

- Techniques like Word2Vec and GloVe give us numerical representations of text, but how do we measure how similar or dissimilar two of those representations are?
- Different techniques:
    - Euclidean Distance (how far apart are these vectors?)
    - Cosine Similarity (what is the angle between the vectors?)
    - Jaccard Similarity (how many words do these texts share?)
    - and more
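
As a rough illustration of how the first two measures differ, the sketch below compares them on three made-up count vectors. The vectors and the helper names `euclidean_distance` and `cos_sim` are assumptions for this example only.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cos_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors: long_doc is short_doc repeated three times
short_doc = [1, 2, 0]
long_doc = [3, 6, 0]
other_doc = [0, 1, 2]

print(euclidean_distance(short_doc, long_doc))  # far apart by distance
print(cos_sim(short_doc, long_doc))             # same direction
print(cos_sim(short_doc, other_doc))            # different direction
```

Euclidean distance grows with document length, while cosine similarity only looks at direction, which is one reason cosine similarity is popular for comparing texts of different sizes.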

### Cosine Similarity
One of the most common techniques. We calculate the cosine of the angle between the two vectors: the smaller the angle, the closer the value is to 1.
![cos_sim](https://memgraph.com/images/blog/cosine-similarity-python-scikit-learn/cosine-similarity.png)
![cos_sim_formula](https://assets-global.website-files.com/5ef788f07804fb7d78a4127a/60dee7e4dec6611dc63cb158_dNiiYIrknDdfDwnqRpJ4n23givOOrrkWvlsBED9hE7qahtn_itdM1ziLQm0YYmqlV2j5q1Kur_icFc_K1jyYKIAcz_PBZ32OjpaFVQGAf41K3O0PhVRnnROFNnb_04jQ36VcX8pF.png)

In [None]:
pip install scikit-learn
pip install numpy

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# A Python implementation of cosine similarity, using NumPy
# (expects NumPy arrays, since it relies on element-wise `**`)
def cosine_similarity(vec1, vec2):
    if len(vec1) != len(vec2):
        return None

    # Compute the dot product between the 2 vectors
    dot_prod = np.dot(vec1, vec2)

    # Compute the norms (lengths) of the 2 vectors
    norm_vec1 = np.sqrt(np.sum(vec1**2))
    norm_vec2 = np.sqrt(np.sum(vec2**2))

    # Compute the cosine similarity:
    # divide the dot product by the product of the two norms
    similarity = dot_prod / (norm_vec1 * norm_vec2)

    return similarity

In [None]:
# Sample texts
text1 = "Natural language processing is fascinating."
text2 = "I'm intrigued by the wonders of natural language processing."

# Tokenize the texts and turn them into word-count vectors
vectors = CountVectorizer().fit_transform([text1, text2]).toarray()
print(vectors)

# Calculate cosine similarity
cosine_sim = cosine_similarity(vectors[0, :], vectors[1, :])

# Cosine similarity ranges from -1 to 1; a value closer to 1 means the texts are more similar.
# Word counts are never negative, so here the result falls between 0 and 1.
print("Cosine Similarity:")
print(cosine_sim)

### Jaccard Index
The Jaccard index, or Jaccard similarity coefficient, is used to gauge the similarity between two sets: the size of their intersection divided by the size of their union.

![jc_idx](https://wikimedia.org/api/rest_v1/media/math/render/svg/eaef5aa86949f49e7dc6b9c8c3dd8b233332c9e7)

In [None]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

In [None]:
from nltk import word_tokenize

text1 = "Natural language processing is fascinating."
text2 = "I'm intrigued by the wonders of natural language processing."

# Tokenize the sentences and turn them into sets
# (tokens are case-sensitive, so "Natural" and "natural" count as different words)
set1 = set(word_tokenize(text1))
set2 = set(word_tokenize(text2))

# Calculate Jaccard similarity
jaccard_sim = jaccard_similarity(set1, set2)

# Print the Jaccard similarity
print(f"Jaccard Similarity: {jaccard_sim}")
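
Jaccard similarity is sensitive to how the text is normalized before it becomes a set. The sketch below compares case-sensitive and lowercased tokens on the same two sentences. It uses a simple regular expression instead of `word_tokenize` (an assumption made to keep the example dependency-free), so the numbers will not match the cell above.

```python
import re

def jaccard(set1, set2):
    """Size of the intersection divided by the size of the union."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

def tokens(text, lowercase):
    """Split text into a set of words; punctuation other than apostrophes is dropped."""
    if lowercase:
        text = text.lower()
    return set(re.findall(r"[A-Za-z']+", text))

text1 = "Natural language processing is fascinating."
text2 = "I'm intrigued by the wonders of natural language processing."

raw = jaccard(tokens(text1, lowercase=False), tokens(text2, lowercase=False))
normalized = jaccard(tokens(text1, lowercase=True), tokens(text2, lowercase=True))

print(f"case-sensitive: {raw:.3f}")
print(f"lowercased:     {normalized:.3f}")
```

Lowercasing lets "Natural" and "natural" match, so the shared-word count goes from 2 to 3 and the score rises.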

## HuggingFace
- A company that focuses on fostering an open-source AI/ML community.
- Originally a chatbot company aimed at teenagers (perhaps hence the name and emoji)
- Famous for its massive open-source collection of AI models and its Transformers library
- The platform we'll be using for our LLMs and text embedding models.

### Transformers Library
- A library built for easy, low-barrier interaction with various AI models
- A Transformer is a type of context-aware neural network architecture (BERT and GPT are examples). Despite the name, Hugging Face's Transformers library also covers models for tasks beyond text generation.
- Benefits: Ease of Use, Flexibility, Simplicity
- Tutorial: https://huggingface.co/learn/nlp-course

In [None]:
pip install transformers

In [None]:
# On a Windows machine and hit an ERROR because long paths are not enabled? Run this in PowerShell (as Administrator)
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" `
-Name "LongPathsEnabled" -Value 1 -PropertyType DWORD -Force

In [None]:
pip install tensorflow

In [None]:
pip install torch

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I pulled 3 all nighters for this project. I'm so exhausted.")

Remember tokenizing from the last 2 days?
We can't feed a model raw text, so we have to tokenize the text exactly the way the model's training data was tokenized, because that is the input format the model expects.
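
As a rough picture of what a tokenizer hands to a model, here is a toy sketch with a made-up vocabulary. The `vocab`, `encode`, and `encode_batch` names are hypothetical; real tokenizers split text into subwords and ship vocabularies with tens of thousands of entries, but they do return `input_ids` and an `attention_mask` in the same spirit.

```python
# Hypothetical toy vocabulary, for illustration only
vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "hate": 3, "this": 4, "so": 5, "much": 6, "weekend": 7}

def encode(text):
    """Lowercase, split on whitespace, and map each word to its ID ([UNK] if missing)."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

def encode_batch(texts):
    """Encode several texts and pad them to the same length."""
    ids = [encode(t) for t in texts]
    max_len = max(len(seq) for seq in ids)
    input_ids = [seq + [vocab["[PAD]"]] * (max_len - len(seq)) for seq in ids]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = encode_batch(["I hate this so much", "this weekend"])
print(batch["input_ids"])
print(batch["attention_mask"])
print(encode("I love this"))  # "love" is not in the toy vocabulary
```

The IDs only mean something relative to the vocabulary they came from, which is why the tokenizer has to match the model checkpoint.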

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for this weekend this whole month.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

In [None]:
from transformers import pipeline

text_generator = pipeline("text-generation")
generated_text = text_generator("This weekend is three day weekend! I am going to")[0]['generated_text']
print(generated_text)

In [None]:
from transformers import pipeline
long_text = """
In recent years, artificial intelligence (AI) has witnessed remarkable advancements, particularly in the field of natural language processing (NLP). Researchers and developers have made significant strides in creating sophisticated models that can understand and generate human-like text. One notable breakthrough is the development of transformer-based architectures, such as BERT and GPT-3, which have set new benchmarks in various NLP tasks.

These models excel in tasks like sentiment analysis, text summarization, and language translation, showcasing their versatility. The widespread adoption of transformer models has led to the emergence of powerful tools and libraries, like the Transformers library by Hugging Face, making it easier for practitioners to leverage state-of-the-art models for their applications.

However, the rapid progress in AI and NLP also raises ethical concerns and challenges. Issues like bias in language models, transparency in decision-making processes, and potential misuse of powerful AI technologies need careful consideration. As the field continues to evolve, striking a balance between innovation and ethical responsibility becomes crucial.

In conclusion, the recent advancements in NLP, driven by transformer models and accessible libraries, have transformed the landscape of AI. While these technologies offer unprecedented capabilities, the ethical implications and responsible use of AI demand ongoing attention from the global community.
"""
summarizer = pipeline("summarization")
summary = summarizer(long_text)[0]['summary_text']
print(summary)

### Other ways to use HuggingFace

#### Hugging Face Inference API
The Hugging Face Inference API and Inference Endpoints give us access to models hosted on Hugging Face via simple HTTP calls.

A subtle but annoying difference: the Inference API is the free version, while Inference Endpoints are paid and run on your own dedicated Hugging Face infrastructure.
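
A minimal sketch of what such an HTTP call might look like, using only the standard library. The URL pattern, the `{"inputs": ...}` payload shape, and the `hf_xxx` token are assumptions for illustration and may not match the current service, so check the official documentation before relying on them. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: URL pattern of the hosted Inference API; verify against the current docs
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceH4/zephyr-7b-beta"

def build_request(prompt, token):
    """Build (but do not send) a POST request carrying the prompt as JSON."""
    payload = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# "hf_xxx" is a placeholder, not a real access token
request = build_request("Explain word embeddings in one sentence.", token="hf_xxx")
print(request.get_method(), request.full_url)

# To actually send it (needs a valid token and network access):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```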

https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat

https://huggingface.co/HuggingFaceH4/zephyr-7b-beta