<a href="https://colab.research.google.com/github/BenBritons/DS_notebooks/blob/main/NLP/WordEmbeddingsPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fun with Word Embeddings

Today we're gonna play with word embeddings: train our own little embedding, load one from the gensim model zoo, and use it to visualize text corpora.

This whole thing is gonna happen on top of a dataset of Quora questions.




In [None]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt


In [None]:
import numpy as np

data = list(open("./quora.txt", encoding="utf-8"))
data[50]

"What TV shows or books help you read people's body language?\n"

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with all the punctuation and smileys attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__ - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.

In [None]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']
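
To see why a plain `str.split` falls short, here's a small side-by-side. The regex below is a rough stand-in for what `WordPunctTokenizer` does (runs of word characters, or runs of punctuation); `word_punct_tokenize` is a hypothetical helper name used only for this sketch.

```python
import re

def word_punct_tokenize(text):
    # alternate between runs of word characters and runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

line = "What TV shows or books help you read people's body language?\n"

# whitespace splitting keeps punctuation glued to the words
naive_tokens = line.split()
# the regex separates the apostrophe and the question mark into their own tokens
regex_tokens = word_punct_tokenize(line)

print(naive_tokens[-3:])
print(regex_tokens[-6:])
```

With `str.split`, `language?` and `language` would end up as two different vocabulary entries, which is exactly what we want to avoid.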


In [None]:
# TASK: lowercase everything and extract tokens with tokenizer.
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [tokenizer.tokenize(x.lower()) for x in data]

In [None]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

In [None]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which uses character-level models to train word embeddings.

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models, including word2vec.

In [None]:
from gensim.models import Word2Vec
model = Word2Vec(data_tok,
                 vector_size=32,  # embedding vector size
                 min_count=5,     # consider words that occurred at least 5 times
                 window=5).wv     # define context as a 5-word window around the target word
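
To make the `window` parameter a bit more concrete, here's a sketch of which words count as "context" for each target word. It only enumerates (target, context) pairs; it is not gensim's training loop, and `context_pairs` is a hypothetical helper written for this illustration.

```python
def context_pairs(tokens, window=5):
    """Yield (target, context) pairs within `window` positions of each target."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

tokens = "what are some ways to overcome a fast food addiction ?".split()
pairs = list(context_pairs(tokens, window=2))
print(pairs[:5])
```

A bigger window pulls in more distant words, which tends to make the embeddings more topical and less syntactic.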

In [None]:
# now you can get word vectors !
model.get_vector('anything')

array([-2.3225882 ,  2.046439  ,  3.0144172 ,  1.9234059 ,  2.546946  ,
        2.5201812 ,  1.4975481 , -1.5725596 ,  2.3959274 ,  1.8334048 ,
       -1.9365776 ,  1.7515744 ,  4.0948887 ,  2.369648  ,  2.9460363 ,
       -1.1996568 ,  0.29396698, -2.7112577 ,  0.3944208 , -0.8551634 ,
       -2.0157151 , -1.4032544 , -0.65810925, -0.91929126,  0.50119907,
       -2.7142465 , -0.78384596, -1.5199592 ,  1.7064478 ,  1.933576  ,
       -2.3501778 ,  0.7935677 ], dtype=float32)

In [None]:
# or query similar words directly. Go play with it!
model.most_similar('bread')

[('rice', 0.95610511302948),
 ('grass', 0.9261704087257385),
 ('pasta', 0.9244386553764343),
 ('cheese', 0.9223670959472656),
 ('chocolate', 0.9162390828132629),
 ('beans', 0.914391040802002),
 ('sauce', 0.9116111397743225),
 ('potato', 0.9092755317687988),
 ('corn', 0.907393217086792),
 ('garlic', 0.9067586064338684)]
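
The scores above are cosine similarities between word vectors. Here's a minimal numpy sketch of that ranking, assuming three made-up 3-d vectors (the numbers are invented for illustration, not taken from the trained model); `toy_most_similar` is a hypothetical stand-in, not gensim's implementation.

```python
import numpy as np

# toy 3-d "embeddings"; the values are invented for illustration only
toy_vectors = {
    "bread":  np.array([1.0, 0.9, 0.1]),
    "rice":   np.array([0.9, 1.0, 0.2]),
    "laptop": np.array([0.1, 0.2, 1.0]),
}

def cosine(u, v):
    # dot product divided by the product of the vector lengths
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_most_similar(word, topn=2):
    # rank every other word by cosine similarity to `word`, best first
    scores = [(other, cosine(toy_vectors[word], vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_most_similar("bread"))
```

Cosine only looks at the angle between vectors, so two words can be "close" even if their vectors have very different lengths.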

### Using pre-trained model

Took a while, huh? Now imagine training life-sized (100-300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no sms required, promise).

In [None]:
import gensim.downloader as api
model = api.load('glove-twitter-100')



In [None]:
model.most_similar(positive=["coder", "money"], negative=["brain"])

[('broker', 0.5820155739784241),
 ('bonuses', 0.5424473285675049),
 ('banker', 0.5385112762451172),
 ('designer', 0.5197198390960693),
 ('merchandising', 0.4964233338832855),
 ('treet', 0.4922019839286804),
 ('shopper', 0.4920562207698822),
 ('part-time', 0.4912828207015991),
 ('freelance', 0.4843311905860901),
 ('aupair', 0.4796452522277832)]
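
Queries with `positive` and `negative` words boil down to vector arithmetic: roughly, add up the (normalized) positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The sketch below shows the idea on invented 2-d vectors; `toy_analogy` is a hypothetical helper, and the real gensim code handles more details.

```python
import numpy as np

# toy 2-d vectors, invented for illustration: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def unit(v):
    # scale a vector to length 1
    return v / np.linalg.norm(v)

def toy_analogy(positive, negative, topn=1):
    # add unit vectors of positive words, subtract those of negative words
    query = sum(unit(toy[w]) for w in positive) - sum(unit(toy[w]) for w in negative)
    query = unit(query)
    # the query words themselves are excluded from the answer
    candidates = [w for w in toy if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: float(unit(toy[w]) @ query), reverse=True)
    return ranked[:topn]

print(toy_analogy(positive=["king", "woman"], negative=["man"]))
```

Real embeddings are much noisier than this toy, which is why the results for "coder + money - brain" above are a mixed bag.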

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. Thing is, those vectors are in 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use one to plot the 1000 most frequent words.

In [None]:
words = model.index_to_key[:1000]

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [None]:
# for each word, compute its vector with model
word_vectors = np.array([model.get_vector(x) for x in words])

In [None]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 100)
assert np.isfinite(word_vectors).all()

#### Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
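
One way to get a feel for the objective above: a minimizer can be read off the singular value decomposition of the centered $X$, taking $W$ as the top $d$ right singular vectors and $\hat W = W^T$. The numpy sketch below checks this on random toy data (the data and the choice $d = 2$ are assumptions for illustration); it is not how `sklearn` is necessarily implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
X = X - X.mean(axis=0)                                   # PCA assumes centered X

d = 2
# rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T          # direct transformation, shape (m, d)
W_hat = W.T           # reverse transformation, shape (d, m)

X_low = X @ W                     # n x d projection
X_rec = X_low @ W_hat             # n x m reconstruction
error = np.sum((X_rec - X) ** 2)  # squared reconstruction error

# the error equals the sum of squares of the discarded singular values
print(error, np.sum(S[d:] ** 2))
```

In other words, PCA throws away the directions with the smallest singular values, and that is exactly what the reconstruction error measures.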



In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

word_vectors_pca = pca.fit_transform(word_vectors)



In [None]:
from sklearn.preprocessing import StandardScaler

word_vectors_pca = StandardScaler().fit_transform(word_vectors_pca)

In [None]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

#### Let's draw it!

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [None]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE
PCA is nice but it's strictly linear and thus only able to capture coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points near, we could use TSNE, which is itself an embedding method. Here you can read __[more on TSNE](https://distill.pub/2016/misread-tsne/)__.
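
For some intuition about "keeping neighboring points near": t-SNE first turns distances in the original space into neighbor probabilities with a Gaussian kernel, then looks for a 2D layout whose own neighbor probabilities match them. The sketch below covers only that first step and assumes a single fixed bandwidth `sigma`; real t-SNE tunes a bandwidth per point to hit a target perplexity. `neighbor_probabilities` is a hypothetical helper.

```python
import numpy as np

def neighbor_probabilities(X, sigma=1.0):
    """Row i holds p(j | i): how likely point i would pick point j as a neighbor."""
    # pairwise squared Euclidean distances, shape (n, n)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)   # a point is never its own neighbor
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# two tight toy clusters, far apart from each other
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
P = neighbor_probabilities(X)
print(P.round(3))
```

Almost all of each point's probability mass lands on its cluster mate, so far-away points barely influence the layout. That's the "local" focus PCA lacks.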

In [None]:
from sklearn.manifold import TSNE

# map word vectors onto 2d plane with TSNE.
# normalize them just like with PCA

word_tsne = TSNE(n_components=2).fit_transform(word_vectors)
word_tsne = StandardScaler().fit_transform(word_tsne)

In [None]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of vectors for all tokens in the phrase with some weights.

This trick is useful to identify what data you are working with: find if there are any outliers, clusters or other artefacts.

Let's try this new hammer on our data!


In [None]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. aggregate word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros

    tokens = tokenizer.tokenize(phrase.lower())
    # key_to_index is a dict, so the membership test is O(1) per token
    vectors = [model.get_vector(token) for token in tokens if token in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size, dtype=np.float32)

    # note: this sums the vectors rather than dividing by their count;
    # the direction (and hence cosine similarity) is the same as for the mean
    return np.sum(vectors, axis=0)
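
The text above mentions averaging "with some weights". Here's a minimal sketch of what a weighted average could look like, assuming a toy vocabulary and invented weights (in practice the weights might come from something like inverse word frequency). `weighted_phrase_embedding`, `toy_vectors` and `toy_weights` are hypothetical names for this illustration.

```python
import numpy as np

# toy 2-d word vectors and weights, both invented for illustration
toy_vectors = {"deep": np.array([1.0, 0.0]),
               "learning": np.array([0.0, 1.0]),
               "the": np.array([0.5, 0.5])}
toy_weights = {"deep": 2.0, "learning": 2.0, "the": 0.1}  # e.g. down-weight stop words

def weighted_phrase_embedding(tokens, dim=2):
    known = [t for t in tokens if t in toy_vectors]   # skip out-of-vocabulary tokens
    if not known:
        return np.zeros(dim)                          # nothing to average
    vecs = np.array([toy_vectors[t] for t in known])
    w = np.array([toy_weights[t] for t in known])
    return np.average(vecs, axis=0, weights=w)

print(weighted_phrase_embedding(["the", "deep", "unknown"]))
print(weighted_phrase_embedding(["unknown"]))
```

The low weight on "the" means the phrase vector ends up pointing almost entirely in the direction of "deep".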



In [None]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

assert np.allclose(vector[::10],
                    np.array([  3.8168845 ,  -0.3069805 ,   1.1199516 ,  -1.2026184 ,
                    -12.334427  ,  -1.994626  ,   0.61000896,   2.1587763 ,
                    16.44223   ,   1.038716  ], dtype=np.float32))

In [None]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
phrase_vectors = [get_phrase_embedding(x) for x in chosen_phrases]

In [None]:
chosen_phrases

["Can I get back with my ex even though she is pregnant with another guy's baby?\n",
 'What is the best way to become an arms dealer in the U.S?\n',
 "Why doesn't Japan contribute to peace and prosperity to ASEAN and Asia, but incite wars in Asia on behalf the U.S. and serve the U.S.?\n",
 'Which is the best Panasonic air conditioner service center in Hyderabad?\n',
 'What did ancient Egyptians believe about the afterlife?\n',
 'How does one prepare for the RBI grade B officers examination?\n',
 "Which institutions in Kolkata give coaching for entrance exams for a master's in political science?\n",
 'Will omar bravo be in fifa 17?\n',
 'What are the pros and cons of arranged marriages?\n',
 'What will happen if all the vegetable dies?\n',
 'Why do people hate hypocrites?\n',
 'What do you consider poor, middle class, wealthy and rich in your country?\n',
 'How do I concentrate on my studies?\n',
 'What is new age psychobabble?\n',
 'How can you tell if your man is cheating even if he h

In [None]:
phrase_vectors = np.array(phrase_vectors)

In [None]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model.vector_size)

In [None]:
# map vectors into 2d space with tsne
# don't forget to normalize

phrase_vectors_2d = TSNE().fit_transform(phrase_vectors)

phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

In [None]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

Finally, let's build a simple "similar question" engine with phrase embeddings we've built.

In [None]:
# compute vector embedding for all lines in data
data_vectors = np.array([get_phrase_embedding(l) for l in data])

KeyboardInterrupt: ignored

In [None]:
def find_nearest(query, k=10):
    """
    given text line (query), return the k most similar lines from chosen_phrases
    similarity is measured as cosine between query and line embedding vectors
    note: results come in ascending order of similarity, so the best match is last
    note: uses the global variables chosen_phrases and phrase_vectors, not the full data
    """
    query_vec = get_phrase_embedding(query)
    # cosine similarity between the query and every precomputed phrase vector
    norms = np.linalg.norm(phrase_vectors, axis=1) * np.linalg.norm(query_vec)
    cosines = phrase_vectors @ query_vec / norms

    # indices of the k largest cosines, least similar first
    m = np.argsort(cosines)[-k:]

    return [chosen_phrases[x] for x in m]
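
For larger collections, a full `np.argsort` does more work than needed: `np.argpartition` can pull out the k best scores first, and only those k need sorting. The sketch below returns indices from most to least similar; `top_k_desc` is a hypothetical helper and the scores are made up.

```python
import numpy as np

def top_k_desc(scores, k):
    """Indices of the k largest scores, ordered from largest to smallest."""
    scores = np.asarray(scores)
    k = min(k, len(scores))
    # argpartition puts the k largest entries in the last k slots, in arbitrary order
    candidates = np.argpartition(scores, len(scores) - k)[-k:]
    # only those k candidates need a full sort
    return candidates[np.argsort(scores[candidates])[::-1]]

cosines = np.array([0.10, 0.95, 0.40, 0.80, 0.30])
print(top_k_desc(cosines, k=3))
```

The partition step is linear in the number of scores, so this pays off once you search over all lines rather than a small sample.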

In [None]:
k = 10
query_vec = get_phrase_embedding("How do i enter the matrix?")
cosines = []
for vec in chosen_phrases:
    vec = get_phrase_embedding(vec)
    cosines.append(query_vec @ vec / np.linalg.norm(query_vec) / np.linalg.norm(vec))

m = np.argsort(cosines)[-k:]
print(m)

[128 481 165 870 617 750 255  79 650 132]


In [None]:
print([chosen_phrases[x] for x in m])

['What is the best way to read a fictional book? Do you take notes when you are reading? Do you read again these notes later?\n', 'How do you choose your first bank?\n', 'R2I - How did you plan R2I from US if you own the house, i mean job search, timeline etc ?\n', 'How do I find out if I have Siri on my phone?\n', 'How do I learn to enter journal entries online in 2 weeks or so?\n', 'How do I run a shell script from Java code?\n', 'My WhatsApp chat backup got deleted from Google, I need to switch from one Android to another, the chat is there only on the phone. What should I do?\n', 'How do I listen a song from you?\n', 'If I wanted to learn about the Roman Empire,what would be the best books to read?\n', 'How do I learn Calculus on my own?\n']


In [None]:
chosen_phrases[0]

"Can I get back with my ex even though she is pregnant with another guy's baby?\n"

In [None]:
results = find_nearest(query="How do i enter the matrix?", k=10)


In [None]:

print(''.join(results))

What is the best way to read a fictional book? Do you take notes when you are reading? Do you read again these notes later?
How do you choose your first bank?
R2I - How did you plan R2I from US if you own the house, i mean job search, timeline etc ?
How do I find out if I have Siri on my phone?
How do I learn to enter journal entries online in 2 weeks or so?
How do I run a shell script from Java code?
My WhatsApp chat backup got deleted from Google, I need to switch from one Android to another, the chat is there only on the phone. What should I do?
How do I listen a song from you?
If I wanted to learn about the Roman Empire,what would be the best books to read?
How do I learn Calculus on my own?



In [None]:
assert len(results) == 10 and isinstance(results[0], str)
assert results[1] == 'How do you choose your first bank?\n'
assert results[3] == 'How do I find out if I have Siri on my phone?\n'

In [None]:
find_nearest(query="How does Trump?", k=10)

['What is BusyBox used for?\n',
 'How do you feel when your question is unanswered on Quora?\n',
 'How can we say that climate change does not bring about health emergency?\n',
 'The education system is outdated. What would you do to change it?\n',
 'Why does India so scared of CPEC?\n',
 'What makes you sad about India?\n',
 'What were some things India did not do but takes credit for?\n',
 'What does it feel like to be an IITian?\n',
 'What might happen now that President-elect Donald Trump has won the election? What will be the impact?\n',
 'Does Donald Trump actually think he can become President?\n']

In [None]:
find_nearest(query="Why don't i ask a question myself?", k=10)

['Why should I ask my first question?\n',
 "Why do some people 'hate' drugs or people who ever use them? Isn't that a bit extreme?\n",
 "What should I do if someone doesn't reply to my email?\n",
 "Why do I feel like I'm not living my life?\n",
 "I need to gain weight, but I don't have abs. It's frustrating as heck. (150ibs 17 year old male) what should I do?\n",
 "I am 23 and don't know what I want. My life is very boring, I am depressed and frustrated, I don't have any good friends to share my feelings with. I don't even have a girlfriend. Sometimes I want to quit. What should I do?\n",
 "My ex bf says he doesn't have feelings for me right now. Why won't he just say I don't have feelings for you anymore?\n",
 "I'm really pretty but I don't want to be I hate the attention and dudes hitting on me what should I do?\n",
 "How do I get a girl's phone number in a library?\n",
 "What's a funny thing?\n"]

__Now what?__
* Try running TSNE on all data, not just 1000 phrases
* See what other embeddings are there in the model zoo: `gensim.downloader.info()`
* Take a look at [FastText](https://github.com/facebookresearch/fastText) embeddings
* Optimize find_nearest with a faster nearest-neighbor search: locality-sensitive hashing via [nearpy](https://github.com/pixelogik/NearPy), or the index structures in `sklearn.neighbors`.
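
If you're curious what locality-sensitive hashing looks like, here's a bare-bones random-hyperplane sketch in numpy. It assumes random vectors as stand-ins for phrase embeddings and a single hash table with 8 bits; the names `signature`, `buckets` and `candidates` are made up for this illustration, and libraries like nearpy do considerably more (multiple tables, tuning, and so on).

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits = 100, 8
hyperplanes = rng.normal(size=(n_bits, dim))   # random directions, one per hash bit

def signature(vec):
    # each bit records which side of a hyperplane the vector falls on
    return tuple((hyperplanes @ vec > 0).astype(int))

# index toy vectors (random stand-ins for phrase embeddings) by their signature
vectors = rng.normal(size=(1000, dim))
buckets = defaultdict(list)
for i, vec in enumerate(vectors):
    buckets[signature(vec)].append(i)

def candidates(query):
    # only vectors in the query's bucket get an exact cosine comparison afterwards
    return buckets.get(signature(query), [])

print(len(candidates(vectors[0])), "candidates instead of", len(vectors))
```

Vectors pointing in similar directions tend to share a signature, so the expensive cosine computation only runs on a small bucket. The price is that some true neighbors can land in a different bucket and get missed.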