<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Word Vectors

---

### Learning Objectives
- Describe word vectors and understand the shortcomings of bag-of-words methods.
- Describe word embeddings.
- Apply Word2Vec, GloVe, and BERT embedding techniques.

**We will start by importing what we need for Word2Vec, GloVe, and the transformer models.** (Downloading the pre-trained Word2Vec embeddings can take a while! We are using the [gensim.downloader](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html) module for this.)

In [1]:
# Install/upgrade Gensim & transformers
# !pip install gensim --upgrade
# !pip install transformers --upgrade

In [2]:
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec
from transformers import pipeline


In [29]:
corpus = api.load('text8')

In [30]:
%%time
model = Word2Vec(corpus)

CPU times: user 1min 56s, sys: 1.11 s, total: 1min 57s
Wall time: 38.6 s


## Word Embeddings

### What is a vector?

There are lots of ways to think about a vector.

<img src="./images/vector.png" alt="drawing" width="450"/>

In **physics**, vectors are arrows.

<img src="./images/vector.jpg" alt="drawing" width="400"/>

In **computer science** and **statistics**, vectors are columns of values, like one numeric Series in a DataFrame.

#### It turns out that these are equivalent.

<img src="./images/vector_on_graph.png" alt="drawing" width="450"/>

[This video](https://www.youtube.com/watch?v=fNk_zzaMoSs) does an exceptional job explaining vectors.

### So... what is a word vector?

A word vector, simply, is a way for us to represent words with vectors.

<details><summary>How have we technically already done this?</summary>
    
- `CountVectorizer` and `TfidfVectorizer`. By representing each word as a new column in our DataFrame, we have represented words with vectors.


</details>
---
<img src='../images/countvectorizer.jpg'></img>

To be more precise, we can think of each word as its own dimension or axis. In the example below, we have represented the horizontal axis with a vector for `cat` and the vertical axis with a vector for `hat`.

<img src="./images/cat_hat.png" alt="drawing" width="400"/>

This is exactly what CountVectorization and TFIDFVectorization have done; we are now just representing it geometrically/visually! Each column in our DataFrame corresponds to a new axis.

This type of vectorization of words (turning each word into its own column) is known as "1-of-N encoding."

<img src="./images/one-hot-new.png" alt="drawing" width="400"/>

For example:
- the vector for the word `dog` would be [1, 0, 0, 0, 0].
- the vector for the word `cat` would be [0, 1, 0, 0, 0].
- the vector for the word `puppy` would be [0, 0, 1, 0, 0].
- the vector for the word `kitten` would be [0, 0, 0, 1, 0].
- the vector for the word `pug` would be [0, 0, 0, 0, 1].
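
As a minimal sketch of 1-of-N encoding, the vectors listed above can be built in plain Python. The `one_hot` and `dot` helpers and the ordering of the five-word vocabulary are assumptions made for illustration; they are not part of any library. The dot product between two different words is always 0, which is a numeric way of saying that the encoding treats them as unrelated.

```python
vocabulary = ["dog", "cat", "puppy", "kitten", "pug"]

def one_hot(word, vocab):
    """Return a list with a 1 in the word's position and 0 everywhere else."""
    return [1 if token == word else 0 for token in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

dog = one_hot("dog", vocabulary)
puppy = one_hot("puppy", vocabulary)

print(dog)              # [1, 0, 0, 0, 0]
print(puppy)            # [0, 0, 1, 0, 0]
print(dot(dog, puppy))  # 0: no overlap, even though the words are related
print(dot(dog, dog))    # 1
```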

All of the above vectors are independent of one another. Thinking purely about language and the way we use it, **should** dog and puppy be independent of one another? **Should** dog and pug be independent of one another?

<details><summary>What do you think?</summary>
    
- Probably not!
- Dog and puppy have similar meanings. (Really, only the age is different.)
- Dog and cat have similar meanings. (For example, "dog" and "cat" are more similar than "dog" and "book" or "cat" and "car.")
- Our current data science strategy for NLP (CountVectorization, TFIDFVectorization) is good in that it lets computers work with natural language at all... but it treats every word as unrelated to every other word, so it has its limitations!
</details>

Rather than creating a whole new dimension each time we encounter a new word and treating it as independent of all other words, can we instead come up with "new axes" that allow us to better understand meanings and relationships among words?
- YES.

**Word embedding** is a term used to describe representing words in mathematical space.
- One word embedding technique is CountVectorization.
- A more advanced word embedding technique is `Word2Vec`.

## Non-contextual Word Embeddings

### Word2Vec
- Word2Vec is an approach that takes in a corpus of text (sentences, tweets, books) and uses a neural network to map each word into some other vector space.

Going back to our previous example, you can "think" of a five-dimensional space. 
- The horizontal axis corresponds to `dog`.
- The vertical axis corresponds to `cat`.
- The axis extending out toward you corresponds to `puppy`.
- Given that we live in 3D space, we can't really visualize higher dimensions.

Instead of giving each word its own axis, the `Word2Vec` algorithm will take all of our words and map them to another set of axes that accounts for these relationships.

<img src="./images/word-vectors-new.png" alt="drawing" width="350"/>

### Why do we care?
The structure of language has a lot of valuable information in it! The way we organize our text/speech tells us a lot about what things mean.

By using machine learning to "learn" about the structure and content of language, our models can now organize concepts and learn the relationships among them.
- Above, we did not explicitly tell the computer what "dog" or "puppy" or "cat" or "kitten" actually mean. But by learning from the data, our model can quantify the relationship among these entities!

### How does Word2Vec work?

#### Basic Answer:
The idea is that we can use the position of words in sentences (i.e. see which words were commonly used together) to understand their relationships.
- If "dog" and "puppy" are used near one another a lot, then it suggests that there may be some sort of relationship between them.
- If "cat" and "dog" are used near similar words a lot (e.g. "pet"), then it suggests that there may be some sort of relationship between them.

#### More Advanced Answer:
There are two algorithms that use neural networks to learn these relationships: Continuous Bag-of-Words (CBOW) and Continuous Skip-grams.

![](./images/cbow.png)

**CBOW (BONUS)**

A Continuous Bag-of-Words model is a two-layer neural network that:
- takes the surrounding "context words" as an input.
- generates the "focus word" as the output.

<img src="./images/word2vec-cbow.png" alt="drawing" width="400"/>

**Skip-Gram (BONUS)**

A Continuous Skip-gram model is a two-layer neural network that:
- takes the "focus word" as an input.
- generates the surrounding "context words" as the output.

<img src="./images/skipgram.png" alt="drawing" width="400"/>

([image source](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)).
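
To make the two setups concrete, here is a rough sketch of how training examples could be generated from one tokenized sentence. The helper name `training_pairs`, the example sentence, and the window size of 2 are assumptions for illustration. Gensim's actual implementation is more involved (it adds optimizations such as negative sampling), so treat this only as a picture of what goes in and what comes out.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, focus_word) for every position in the sentence."""
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, focus

sentence = ["the", "puppy", "chased", "the", "cat"]

for context, focus in training_pairs(sentence):
    # CBOW: predict the focus word from its context words.
    print("CBOW     ", context, "->", focus)
    # Skip-gram: predict each context word from the focus word.
    print("Skip-gram", focus, "->", context)
```

The same pairs feed both models; only the direction of the prediction changes.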

### Neat application 1: Which of these is not like the other?

In [31]:
model.wv.doesnt_match(['dog', 'fish', 'cat', 'tiger', 'hamster', 'tortoise'])

'tortoise'

In [32]:
model.wv.doesnt_match(['taco', 'burrito', 'quesadilla', 'hamburger'])

'taco'

In [33]:
model.wv.doesnt_match(['london', 'cannes', 'madrid', 'vienna', 'singapore'])

'singapore'

In [34]:
model.wv.doesnt_match(['math', 'engineering', 'optimization', 'linear', 'convex'])

'engineering'

Try your own and share the most mind-blowing one in a thread.

**Real-world application of this**: Suppose you're attempting to automatically detect spam emails or detect plagiarism based on words that don't belong.
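
One way to think about `doesnt_match` is: average the word vectors, then return the word whose vector is least similar to that average. The sketch below applies that idea to made-up 2-D vectors. The numbers and the `unit` and `odd_one_out` helpers are assumptions for illustration; they are not values from the trained model and not gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector has the smallest dot product with the mean."""
    units = {word: unit(v) for word, v in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical 2-D "embeddings": three pets point one way, one word points another.
toy_vectors = {
    "dog": [0.9, 0.1],
    "cat": [0.8, 0.2],
    "hamster": [0.85, 0.15],
    "carburetor": [0.1, 0.9],
}
print(odd_one_out(toy_vectors))  # carburetor
```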

### Neat application 2: What is most alike?

In [35]:
model.wv.most_similar('math')

[('computational', 0.8049209117889404),
 ('tutorial', 0.7648815512657166),
 ('springer', 0.7450544834136963),
 ('handbook', 0.7404348254203796),
 ('automata', 0.7219940423965454),
 ('dsp', 0.7161401510238647),
 ('combinatorial', 0.7137963175773621),
 ('analytical', 0.7096222043037415),
 ('astrophysics', 0.7089225053787231),
 ('algorithms', 0.7074402570724487)]

In [36]:
model.wv.most_similar('engineering')

[('biomedical', 0.7420523762702942),
 ('aeronautical', 0.7150801420211792),
 ('aerospace', 0.702772319316864),
 ('electronics', 0.6966991424560547),
 ('geomatics', 0.6876029968261719),
 ('interdisciplinary', 0.6855049729347229),
 ('aeronautics', 0.6853718757629395),
 ('computational', 0.6804250478744507),
 ('engineers', 0.6739534139633179),
 ('bioinformatics', 0.671549379825592)]

In [37]:
model.wv.most_similar('engineer')

[('inventor', 0.7979385852813721),
 ('pioneer', 0.7404082417488098),
 ('industrialist', 0.7226882576942444),
 ('architect', 0.7118277549743652),
 ('assistant', 0.6879084706306458),
 ('scientist', 0.6829404830932617),
 ('physicist', 0.6754765510559082),
 ('peddle', 0.6746423244476318),
 ('entrepreneur', 0.6732878088951111),
 ('aerospace', 0.6724109053611755)]

In [38]:
model.wv.most_similar('student')

[('graduate', 0.7838155031204224),
 ('undergraduate', 0.7286223769187927),
 ('teacher', 0.7228072285652161),
 ('lecturer', 0.6996104717254639),
 ('students', 0.6969389319419861),
 ('faculty', 0.6811642050743103),
 ('bachelor', 0.6587086915969849),
 ('academic', 0.6488825082778931),
 ('diploma', 0.6481816172599792),
 ('school', 0.6418749690055847)]

**Real-world application of this**: Suppose you're building out a process to detect when people are tweeting about an emergency. They may not just use the word "emergency." Rather than manually creating a list of words people could use, you may want to learn from a much larger corpus of data than just your personal experience!
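
The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that computation, using made-up 3-D vectors rather than vectors from the trained model (all numbers below are assumptions for illustration).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: "student" and "teacher" point in nearly the same direction.
student = [0.9, 0.8, 0.1]
teacher = [0.8, 0.9, 0.2]
banana = [0.1, 0.0, 0.9]

print(round(cosine_similarity(student, teacher), 3))  # close to 1
print(round(cosine_similarity(student, banana), 3))   # much closer to 0
```

With the trained model, `model.wv.similarity('student', 'teacher')` should give the corresponding score for a single pair of words.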

**Note:** Word2Vec learns from the words that surround each word, so the order of the words in the text you pass to the algorithm matters. Keep the tokens of each sentence in their original order.

## Create Word2Vec word vectors from your own corpus! (BONUS)

### NOTE: This will usually take a *long* time!

In [26]:
# Import Word2Vec
from gensim.models.word2vec import Word2Vec

# If you want to use gensim's data, import their downloader
# and load it.
import gensim.downloader as api
corpus = api.load('text8')

# If you have your own iterable corpus of cleaned data, you can 
# read it in as corpus and pass that in.

# Train a model! 
model_new = Word2Vec(corpus,        # Corpus of data (an iterable of token lists).
                     # vector_size=100,  # Dimensions of each word vector (named `size` before gensim 4.0).
                     window=5,      # Maximum distance between the focus word and a context word.
                     min_count=1,   # Ignores words that appear fewer than this many times.
                     sg=0,          # sg=1 uses skip-gram, sg=0 uses CBOW (default).
                     workers=6)     # Number of "worker threads" to use (parallelizes training).

# Do what you'd like to do with your model!
# model_new.wv.most_similar("car")

In [28]:
model_new.wv.most_similar('math')

[('computational', 0.8301625847816467),
 ('optimization', 0.7799440622329712),
 ('computation', 0.7665309309959412),
 ('diagrams', 0.7438765168190002),
 ('tutorial', 0.7416595220565796),
 ('mathematical', 0.7407210469245911),
 ('algorithms', 0.7406589388847351),
 ('combinatorial', 0.7406522035598755),
 ('bioinformatics', 0.7389428019523621),
 ('proc', 0.7342660427093506)]

Check out the documentation for Gensim's implementation of [Word2Vec here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).

### GloVe

GloVe stands for Global Vectors for Word Representation. It is an unsupervised technique that maps words to vector representations where the distance between the vectors represents semantic similarities. This is done using a co-occurrence matrix which shows us how often pairs of words occur together.
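
A co-occurrence matrix can be sketched in a few lines. This toy version counts how often two words appear within a fixed window of each other. The `cooccurrence_counts` helper, the window size, and the two example sentences are assumptions for illustration; the real GloVe procedure goes on to fit word vectors to the logarithms of counts like these.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that appear within `window` positions of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for neighbor in tokens[i + 1:i + 1 + window]:
                counts[tuple(sorted((word, neighbor)))] += 1
    return counts

sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "puppy", "chased", "the", "kitten"],
]
counts = cooccurrence_counts(sentences)
print(counts[("chased", "the")])  # 4
print(counts[("dog", "puppy")])   # 0: never within two words of each other
```

Notice that "dog" and "puppy" never co-occur directly, yet both co-occur with "chased" and "the". Shared neighbors like these are what allow related words to end up with similar vectors.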

In [40]:
model_glove = api.load("glove-wiki-gigaword-50")



### Neat application: What is most alike?

In [41]:
model_glove.most_similar('apple')

[('blackberry', 0.7543067932128906),
 ('chips', 0.7438643574714661),
 ('iphone', 0.7429665327072144),
 ('microsoft', 0.7334205508232117),
 ('ipad', 0.7331036329269409),
 ('pc', 0.7217226624488831),
 ('ipod', 0.7199784517288208),
 ('intel', 0.7192243337631226),
 ('ibm', 0.7146540880203247),
 ('software', 0.7093585729598999)]

In [42]:
model_glove.most_similar('fine')

[('well', 0.7305028438568115),
 ('making', 0.7292699813842773),
 ('much', 0.7203167080879211),
 ('for', 0.7183343172073364),
 ('made', 0.7133499979972839),
 ('good', 0.7062110304832458),
 ('full', 0.7060422301292419),
 ('instead', 0.7030083537101746),
 ('than', 0.7028447389602661),
 ('worth', 0.7013351917266846)]

**Note:** Word2Vec and GloVe do not look at the sentence a word appears in; each word gets a single vector. This is a problem for words with more than one meaning: the nearest neighbors of `apple` above are mostly technology terms, not fruit.

---
## Contextualized/Dynamic Word Embeddings

What are some shortcomings of `Word2Vec`? It takes into consideration the meaning of words based on context in the corpus, but what about words with different meanings?

How many meanings can you think of for the word "set"? This word [holds the record](https://www.guinnessworldrecords.com/world-records/english-word-with-the-most-meanings/) for the most meanings in the English language. Even a word like "apple" can take on vastly different meanings in today's age. Yet `Word2Vec` assigns exactly one vector to each word, no matter how it is used.


**Dynamic Word Embeddings** overcome this shortcoming by assigning an embedding to each word only after looking at the sentence in which it appears. This means that the same word (e.g. "apple" in a sentence about fruit and "Apple" in a sentence about computers) can be represented by different vectors depending on its context. One of the first popular models that did this was called **ELMo**. Another popular one is named **BERT**.

<img src="./images/bert.png" alt="drawing" width="200"/>

[BERT](https://github.com/google-research/bert) (Bidirectional Encoder Representations from Transformers) was released by Google in late 2018 and, at the time, outperformed earlier language representation models on many NLP benchmarks. It builds on ideas from ELMo and earlier Transformer-based models, and it is deeply bidirectional: each word's vector depends on the words both before and after it, so the same word can have different vectors in different contexts.

BERT is an example of a Transformer model. The following is from [Wikipedia](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)):

> Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not require that the sequential data be processed in order. For example, if the input data is a natural language sentence, the Transformer does not need to process the beginning of it before the end. Due to this feature, the Transformer allows for much more parallelization than RNNs and therefore reduced training times.  
Since their introduction, Transformers have become the model of choice for tackling many problems in NLP, replacing older recurrent neural network models such as the long short-term memory (LSTM).

We will use Hugging Face's [transformers](https://github.com/huggingface/transformers) for this section.

### Neat application 1: Fill in the blank
We will use the BERT model here!

In [None]:
# unmasker = pipeline('fill-mask', model='bert-base-uncased')

### Neat application 2: Sentiment Analysis
By default, this pipeline uses a model that was fine-tuned on [SST-2](https://www.tensorflow.org/datasets/catalog/glue).

In [None]:
sent = pipeline('sentiment-analysis')

### Neat application 3: Question Answering
By default, this pipeline uses a model that was fine-tuned on [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/).

In [None]:
question = pipeline('question-answering')

In [None]:
# https://generalassemb.ly/faq
ga = 'General Assembly is a pioneer in education and career transformation, specializing in today\'s most in-demand coding, business, data, and design skills. With 30+ campuses around the world, we provide award-winning, dynamic training to a global community of professionals pursuing careers they love. Named one of Fast Company’s most innovative education companies, GA offers full- and part-time courses for career climbers both on campus and online. Through our corporate training programs, we also help companies compete for the future by sourcing, assessing, and growing their talent. All of these offerings are developed and led by industry experts.'
print(ga)
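
A question-answering pipeline is called with a question and a context passage, and it returns a dictionary that includes the extracted `answer` and a confidence `score`. The small wrapper below is a sketch of that call. The `ask` helper and the example question are assumptions for illustration, and the real call is left commented out because it downloads a model the first time it runs.

```python
def ask(qa_pipeline, question_text, context):
    """Call a question-answering pipeline and return (answer, score)."""
    result = qa_pipeline(question=question_text, context=context)
    return result["answer"], result["score"]

# With the objects defined in this notebook (downloads a model on first use):
# answer, score = ask(question, "How many campuses does General Assembly have?", ga)
# print(answer, round(score, 3))
```

Because the model extracts its answer as a span of the context, it can only answer questions whose answer appears in the passage you give it.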

### Neat application 4: Summarization
By default, this pipeline uses a [BART](https://medium.com/analytics-vidhya/assesing-barts-syntactic-abilities-and-bert-s-part-1-cbf0983f6ea4) model that was trained on CNN/Daily Mail data.

In [None]:
summarizer = pipeline('summarization')

In [None]:
# https://www.upi.com/Odd_News/2020/10/22/Bear-opens-car-door-climbs-inside-in-Tennessee/9821603398162/
news = """
An Indiana family visiting Tennessee captured video of a black bear wandering up to their unoccupied car, opening a door and climbing inside.
The Franczak family said they traveled from Crown Point, Ind., to Sevierville, Tenn., to celebrate a grandmother's birthday. "One of our bucket list things was to see a bear," father Brian Franczak told WBBM-TV.
The family said they were shocked, however, when a bear came walking up the driveway of their vacation home and headed for their SUV.
"I just screamed, 'Oh my God! The bear is here! The bear is in the driveway,'" mom Carly Franczak said.
The family captured video as the bear opened a back door of the vehicle and climbed inside.
"I was at go-carts racing and my grandpa got a call about that there's a bear in their car," daughter Olivia Franczak said, "and we couldn't believe it at first. We thought my uncle got dressed up as a bear and went into the car."
The Tennessee Wildlife Resources Agency recommends residents and visitors keep vehicle doors locked at all times and make sure food and trash are secured where the animals can't reach.
"""

### Neat application 5: Text Generation
Using [GPT-2](https://en.wikipedia.org/wiki/OpenAI#GPT-2)!

In [None]:
text_generator = pipeline("text-generation")

## (BONUS) Applying this to your data

Want to use a pre-trained model on your own text data? Due to hardware and time limitations, we will not do this in class, but below are several tutorials that can walk you through this. Warning: these models take a lot of time/memory - you may need a GPU for this! ([Google Colab offers free use of a GPU!](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm))

- [Example of BERT in Keras](https://colab.research.google.com/drive/1934Mm2cwSSfT5bvi78-AExAl-hSfxCbq#scrollTo=gsscu_BluPLE)
- [BERT tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/)
- [Predicting movie review sentiment with BERT](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=dCpvgG0vwXAZ)
- [Classification with BERT in PyTorch](https://colab.research.google.com/drive/1ywsvwO6thOVOrfagjjfuxEf6xVRxbUNO)
- [Classification with GloVe embeddings](https://medium.com/analytics-vidhya/text-classification-using-word-embeddings-and-deep-learning-in-python-classifying-tweets-from-6fe644fcfc81)
- [Using pre-trained word embeddings in Keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)