# Advanced Text Representation
We want to examine a pre-trained word embedding model. In particular, you should do the following:
- Find a pre-trained word embedding model, such as [Google's word2vec model](https://code.google.com/archive/p/word2vec/).
- Load the pre-trained model using a library, such as [gensim](https://radimrehurek.com/gensim/models/keyedvectors.html).
- Examine the word vectors.
- Visualize a sample of words using a dimensionality reduction technique, such as t-SNE.

## Importing Modules

In [85]:
import pandas as pd
import gensim.models
import sklearn.cluster
import sklearn.manifold
import plotly.express as px

## Loading a Pre-Trained Word2Vec Model

In [71]:
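# Only `gensim.models` is imported above, so KeyedVectors must be referenced through it.
model = gensim.models.KeyedVectors.load_word2vec_format("/tmp/GoogleNews-vectors-negative300.bin.gz", binary=True)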
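
The file loaded above is the binary variant of the word2vec format. The same format also has a plain-text variant (read with `binary=False`), which is easier to inspect: a header line with the vocabulary size and vector size, followed by one word and its values per line. The following is a minimal, dependency-free sketch of that text layout; `read_word2vec_text` is a hypothetical helper written for illustration, not a gensim function, and the numbers are made up.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec *text* format: a 'count dim' header, then one word per line."""
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {word!r}")
        vectors[word] = values
    return vectors


# Toy file contents (made-up numbers, not taken from the GoogleNews model).
toy_file = io.StringIO("2 3\nBerlin 0.1 0.2 0.3\nParis 0.4 0.5 0.6\n")
toy_vectors = read_word2vec_text(toy_file)
print(toy_vectors["Berlin"])
```

In practice gensim handles both variants, so a parser like this is only useful for understanding what the file contains.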

## Examining Word Vectors

In [73]:
vector = model["Berlin"]
vector

array([ 0.14160156,  0.25195312,  0.02624512,  0.00069809, -0.09814453,
       -0.15429688, -0.20605469, -0.1484375 , -0.1640625 ,  0.0612793 ,
       -0.24121094, -0.078125  ,  0.03979492, -0.11816406,  0.01745605,
        0.06591797,  0.13378906,  0.25390625, -0.125     , -0.14648438,
        0.12402344,  0.19628906,  0.04663086,  0.07910156,  0.10400391,
        0.04003906, -0.22363281,  0.19921875, -0.06591797,  0.06689453,
        0.13964844, -0.171875  , -0.30859375, -0.02502441, -0.11523438,
       -0.15136719,  0.10839844,  0.15820312,  0.34375   ,  0.07373047,
        0.03125   , -0.04956055, -0.12597656,  0.14550781, -0.06396484,
        0.16503906, -0.10400391, -0.25195312,  0.04931641,  0.25976562,
        0.01367188,  0.06445312, -0.05615234, -0.13476562, -0.24316406,
        0.14941406, -0.33984375,  0.02600098, -0.3515625 , -0.1171875 ,
        0.07714844, -0.27148438, -0.07373047, -0.05541992,  0.04956055,
       -0.25195312,  0.13574219,  0.06396484, -0.23144531,  0.06

In [74]:
print(vector.shape)

(300,)


In [75]:
model.most_similar("Berlin", topn=10)

[('Munich', 0.6743212938308716),
 ('BBC_Tristana_Moore', 0.6629698276519775),
 ('Hamburg', 0.6400265097618103),
 ('Frankfurt', 0.6383745074272156),
 ('Germany', 0.631354808807373),
 ('Dusseldorf', 0.621732234954834),
 ('historic_Tempelhof_Airport', 0.6146382689476013),
 ('Munich_Germany', 0.6117033362388611),
 ('Kempinski_Hotel_Bristol', 0.6082198619842529),
 ('German', 0.608098030090332)]

In [97]:
model.most_similar(["Germany", "City"])

[('Berlin', 0.5473558306694031),
 ('Hamburg', 0.5346046686172485),
 ('Netherlands', 0.520778238773346),
 ('Austria', 0.5198242664337158),
 ('CIty', 0.5105443596839905),
 ('Cologne', 0.5068037509918213),
 ('Hungary', 0.5012103319168091),
 ('Braustolz_GmbH', 0.4935721457004547),
 ('Stuttgart', 0.4911365211009979),
 ('German', 0.48699653148651123)]

In [77]:
model.doesnt_match(["Berlin", "London", "Twitter", "Paris"])

'Twitter'
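
To build intuition for why `'Twitter'` is singled out: `doesnt_match` roughly picks the word whose vector is least similar to the mean of all the given word vectors. The sketch below reproduces that idea on made-up 2D vectors. `odd_one_out` and the toy values are assumptions for illustration and are not gensim's actual implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the unit-length vectors."""
    unit = {w: normalize(vectors[w]) for w in words}
    dim = len(next(iter(unit.values())))
    mean = normalize([sum(unit[w][i] for w in words) / len(words) for i in range(dim)])
    # For unit vectors, the dot product equals the cosine similarity.
    score = {w: sum(a * b for a, b in zip(unit[w], mean)) for w in words}
    return min(words, key=score.get)


# Made-up 2D vectors: three "cities" point one way, "Twitter" another.
toy = {
    "Berlin": [0.9, 0.1],
    "London": [0.8, 0.2],
    "Paris": [0.85, 0.15],
    "Twitter": [0.1, 0.9],
}
result = odd_one_out(["Berlin", "London", "Twitter", "Paris"], toy)
print(result)
```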

In [78]:
model.similarity("Berlin", "Twitter")

0.0066413707

In [79]:
model.similarity("Berlin", "Paris")

0.54122275
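
The scores returned by `similarity` are cosine similarities: the dot product of the two vectors divided by the product of their lengths. A value near 0 (as for "Berlin" and "Twitter") means the vectors are nearly orthogonal, while values closer to 1 mean they point in similar directions. A minimal sketch on toy vectors, using a hypothetical `cosine_similarity` helper rather than the gensim call:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
c = [2.0, 0.0]  # same direction as a, different length

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Because the measure ignores vector length, `a` and `c` are treated as identical in direction even though their magnitudes differ.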

In [80]:
# woman - man = x - king
# woman - man + king = x
# x = ?
model.most_similar(positive=["woman", "king"], negative=["man"])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]

In [81]:
# Paris - France = x - Germany
# Paris - France + Germany = x
# x = ?
model.most_similar(positive=["Paris", "Germany"], negative=["France"])

[('Berlin', 0.7644002437591553),
 ('Frankfurt', 0.7329736351966858),
 ('Dusseldorf', 0.7009456753730774),
 ('Munich', 0.6773864030838013),
 ('Cologne', 0.6470192670822144),
 ('Düsseldorf', 0.6399551033973694),
 ('Stuttgart', 0.6361044049263),
 ('Munich_Germany', 0.6238142251968384),
 ('Budapest', 0.6192865371704102),
 ('Hamburg', 0.6168562769889832)]
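
The two analogy queries above follow the same recipe: add the `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result, leaving out the query words themselves. The sketch below illustrates that recipe on made-up 3D vectors. The `analogy` function is a simplified stand-in written for this example, and gensim's own implementation differs in details.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are never returned
    scores = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up vectors: axis 1 loosely encodes "gender", axis 2 "royalty".
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.3, 0.0],
}
best = analogy(["woman", "king"], ["man"], toy)
print(best[0][0])
```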

## Visualizing Word Vectors

### Choosing Some Words

In [90]:
vocabulary = ["university", "student", "teacher", "school",
              "football", "sport", "gym", "basketball",
              "germany", "france", "uk", "spain",
              "berlin", "paris", "london", "madrid",
              "film", "interstellar", "inception",
              "twitter", "facebook", "google", "yahoo",
              "cat", "dog", "mouse", "bird",
              "bmw", "audi", "wv", "benz",
              "obama", "churchill", "putin", "trump"]
# Indexing with a list of words returns a 2D NumPy array (one row per word),
# which scikit-learn accepts directly.
vectors = model[vocabulary]
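
Lookups in the model are case-sensitive ("Berlin" and "berlin" are separate entries), and indexing a word that is not in the vocabulary raises a `KeyError`. It can therefore help to check membership with `in` before building the vector list. A minimal sketch, where `split_by_vocabulary` is a hypothetical helper and a plain dict stands in for the loaded model:

```python
def split_by_vocabulary(words, model):
    """Separate words the model knows from words it does not."""
    known = [w for w in words if w in model]
    missing = [w for w in words if w not in model]
    return known, missing


# Stand-in for the loaded model: any object supporting `in` works the same way.
toy_model = {"Berlin": [0.1, 0.2], "berlin": [0.3, 0.4], "cat": [0.5, 0.6]}

known, missing = split_by_vocabulary(["berlin", "cat", "unicorn"], toy_model)
print(known, missing)
```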

### Clustering Word Vectors

In [91]:
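# Group the word vectors into 9 clusters, matching the 9 rows of the vocabulary list.
kmeans = sklearn.cluster.KMeans(n_clusters=9)
kmeans.fit(vectors)
# labels_ holds one cluster index (0-8) per word, in vocabulary order.
cluster_labels = kmeans.labels_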
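
Before plotting, it can be useful to check which words ended up in the same cluster. The sketch below groups words by their label; `group_by_cluster` is a hypothetical helper, and the toy labels are made up rather than taken from the k-means run above.

```python
from collections import defaultdict


def group_by_cluster(words, labels):
    """Map each cluster label to the list of words assigned to it."""
    clusters = defaultdict(list)
    for word, label in zip(words, labels):
        clusters[int(label)].append(word)
    return dict(sorted(clusters.items()))


# Made-up labels for illustration; with real data, pass vocabulary and cluster_labels.
toy_words = ["cat", "dog", "berlin", "paris", "bird"]
toy_labels = [0, 0, 1, 1, 0]

groups = group_by_cluster(toy_words, toy_labels)
print(groups)
```

Note that k-means starts from random centroids, so cluster numbering (and sometimes membership) can change between runs unless a `random_state` is fixed.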

### Reducing Feature Size to 2D

In [92]:
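# Project the 300-dimensional vectors to 2D. The perplexity is kept small because
# there are only 35 points, and it must be lower than the number of samples.
# Note: scikit-learn 1.5 and later rename `n_iter` to `max_iter`.
tsne = sklearn.manifold.TSNE(n_components=2, perplexity=3, n_iter=10000)
reduced_vectors = tsne.fit_transform(vectors)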

### Visualizing 2D Word Vectors

In [93]:
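# Cast cluster labels to strings so that Plotly uses discrete colors
# instead of a continuous color scale.
df = pd.DataFrame({"Word": vocabulary, "Cluster ID": cluster_labels.astype(str),
                   "x": reduced_vectors[:, 0], "y": reduced_vectors[:, 1]})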
fig = px.scatter(df, x="x", y="y", color="Cluster ID", text="Word")
fig.update_traces(textposition="bottom right")
fig.show()