## Intro:
#### When I started preparing for the GRE and went through many resources for the vocab section, I found that some words appear more frequently in the exam. Barron’s, Manhattan's, and Magoosh's word lists are among the most renowned resources for these high-frequency GRE words, so for this project I picked Barron’s 333, Manhattan's 500, and Magoosh's 1000 word lists. The next challenge was learning these words, so I came up with a plan: if I could somehow group similar words, the learning process would be much easier. But how to do that? Manually grouping the words would be more work than simply learning them as they are. After pondering for some time, it occurred to me: why not let the machine do all the hard work? With a capability of over a trillion floating-point operations per second, it is much better suited to this kind of task than I am. So let’s get started and see how to build a model that can cluster similar words together.

#### I've used Python for the project, and the topics covered are Exploratory Data Analysis (EDA), Natural Language Processing (NLP), word embedding generation using Global Vectors (GloVe), Hierarchical Clustering, and t-distributed Stochastic Neighbor Embedding (t-SNE).

![pic](https://images.unsplash.com/photo-1558021212-51b6ecfa0db9?ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=1522&q=80)

## 📊 EDA of High Frequency GRE vocabulary words.

In [None]:
# Let's begin by importing all the necessary libraries
from scipy.cluster import hierarchy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
sns.set()

In [None]:
# Let's import the 3 word lists and merge them to create a single list.
# These lists contain multiple columns; for this task, I've only considered the 'word' and 'definition' columns.
# The merged dataframe also has a column named 'word_list' that records which of the 3 word lists the word comes from
# (for a word that appears in several lists, drop_duplicates keeps the first list in the concat order below).
folder = '../input/gre-high-frequency-vocabulary-word-lists/'
magoosh = pd.read_csv(folder+'magoosh_1000.csv')[['word', 'definition']]
magoosh['word_list'] = ['magoosh_1000']*magoosh.shape[0]
manhattan = pd.read_csv(folder+'manhattan_500.csv')[['word', 'definition']]
manhattan['word_list'] = ['manhattan_500']*manhattan.shape[0]
barron = pd.read_csv(folder+'barron_333.csv')[['word', 'definition']]
barron['word_list'] = ['barrons_333']*barron.shape[0]

df = pd.concat([manhattan, magoosh, barron]).dropna().drop_duplicates(subset=['word'])

In [None]:
# This function is for calculating the Alphabetical frequency of words in the lists.
def alphabetical_frequency(df, wordlist='all'):
  counts = {}
  if wordlist=='manhattan':
    df = df[df['word_list']=='manhattan_500']
  elif wordlist=='barrons':
    df = df[df['word_list']=='barrons_333']
  elif wordlist=='magoosh':
    df = df[df['word_list']=='magoosh_1000']    
  for i in list('abcdefghijklmnopqrstuvwxyz'):
    k=0  # start at zero so letters with no words are not reported as 1
    for j in df.word:
      if j[0]==i:
        k+=1
    counts[i] = k
  dd = pd.DataFrame()
  dd['alphabet'] = counts.keys()
  dd['frequency'] = counts.values()
  fig = px.bar(dd, x='alphabet', y='frequency')
  fig.show()
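
The per-letter count at the heart of `alphabetical_frequency` can also be written with `collections.Counter`. The snippet below is only a sketch on a hypothetical sample list; the helper name `first_letter_counts` is made up for illustration, and it lower-cases the first letter, which is an assumption about how capitalised entries should be treated.

```python
from collections import Counter
import string

def first_letter_counts(words):
    """Count words by lower-cased first letter, including letters with zero words."""
    counts = Counter(w[0].lower() for w in words if w)
    return {letter: counts.get(letter, 0) for letter in string.ascii_lowercase}

sample_words = ["abate", "aberrant", "Zeal", "banal"]  # hypothetical sample
counts = first_letter_counts(sample_words)
print(counts["a"], counts["b"], counts["z"], counts["q"])
```

The resulting dictionary has one entry per letter, so it can be turned into a two-column dataframe and passed to `px.bar` in the same way as above.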

### Alphabetical frequency of words in all the word lists.

In [None]:
alphabetical_frequency(df)

### Alphabetical frequency of words in Manhattan's 500 word list.

In [None]:
alphabetical_frequency(df, wordlist='manhattan')

### Alphabetical frequency of words in Magoosh's 1000 word list.

In [None]:
alphabetical_frequency(df, wordlist='magoosh')

### Alphabetical frequency of words in Barron's 333 word list.

In [None]:
alphabetical_frequency(df, wordlist='barrons')

### While scraping Barron's 333 word list, I found a meta feature 'frequency' that represents the frequency of occurrence of each word in the list.

In [None]:
barron = pd.read_csv(folder+'barron_333.csv')
barron.head()

### Let's plot a box plot of the frequency distribution.

In [None]:
# box plot of the word frequencies in barron's 333 word list
fig = px.box(barron, y="frequency")
fig.show()

### Let's look at the most and least frequent GRE words

In [None]:
# Word cloud of the n most frequent words (n > 0) or the |n| least frequent words (n < 0)
def top_frequency(barron, n=30):
  ranked = barron[['word', 'frequency']].sort_values(by='frequency', ascending=False)
  # a negative n must slice from the end; [:n] would instead drop the last |n| rows
  rows = ranked.values[:n] if n > 0 else ranked.values[n:]
  word_list = {}
  for w,f in rows:
    word_list[w] = int(f)
  wordcloud = WordCloud(background_color="black").generate_from_frequencies(word_list)
  plt.figure(figsize=(15, 6))
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.show()

### The 30 most frequent GRE words as per Barron's list

In [None]:
top_frequency(barron, n=30)

### The 30 least frequent GRE words as per Barron's list

In [None]:
top_frequency(barron, n=-30)

In [None]:
# words common in Magoosh and Barron lists
common_words_1 = set.intersection(set(magoosh['word']), set(barron['word']))
len(common_words_1)

In [None]:
# words common in Magoosh and Manhattan lists
common_words_2 = set.intersection(set(magoosh['word']), set(manhattan['word']))
len(common_words_2)

In [None]:
# words common in Manhattan and Barron lists
common_words_3 = set.intersection(set(manhattan['word']), set(barron['word']))
len(common_words_3)

In [None]:
# words common in Manhattan, Barron and Magoosh lists
common_words_all = set.intersection(set(manhattan['word']), set(barron['word']), set(magoosh['word']))
len(common_words_all)

In [None]:
from matplotlib_venn import venn3, venn3_circles

plt.figure(figsize=(16,9))
vd3=venn3([set(manhattan['word']), set(barron['word']), set(magoosh['word'])],
           set_labels = ('Manhattan\'s list', 'Barron\'s list', 'Magoosh\'s list'),
           set_colors=('#c4e6ff', '#F4ACB7', '#9D8189'), 
           alpha = 0.8)

venn3_circles([set(manhattan['word']), set(barron['word']), set(magoosh['word'])],
              linestyle='-.', linewidth=2, color='grey')

for text in vd3.set_labels:
    text.set_fontsize(16)
for text in vd3.subset_labels:
    text.set_fontsize(16)
    
plt.title('Venn Diagram for the 3 lists', fontname='Times New Roman', fontweight='bold', fontsize=20,
           pad=30, backgroundcolor='#cbe7e3', color='black', style='italic')

plt.show()

## Clustering similar words
#### In this section, we'll look at how to cluster similar words based on their vector representations.
#### The word-vector representation I've used for each word is GloVe (Global Vectors). It is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. More information on GloVe can be found at [this link.](https://nlp.stanford.edu/projects/glove/)
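
To make "similar words have similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure behind the cosine metric used for clustering later in this notebook. The three-dimensional vectors are invented for illustration and are not real GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" (the GloVe vectors used here have 300 dimensions)
toy_vectors = {
    "happy":  [0.9, 0.1, 0.0],
    "joyful": [0.8, 0.2, 0.1],
    "table":  [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["happy"], toy_vectors["joyful"])
sim_far = cosine_similarity(toy_vectors["happy"], toy_vectors["table"])
print(f"happy vs joyful: {sim_close:.3f}")
print(f"happy vs table:  {sim_far:.3f}")
```

Cosine distance is simply one minus this similarity, so vectors pointing in nearly the same direction end up close together regardless of their length.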

#### Let's begin by downloading and extracting the word embeddings. The embedding file I'll be using contains vectors for about 2.2 million case-sensitive words, each of dimension 300.

In [None]:
# !wget http://nlp.stanford.edu/data/glove.840B.300d.zip
# !unzip glove.840B.300d.zip

### Creating a dictionary with our GRE words as keys and their corresponding embedding vectors from GloVe as values.

In [None]:
# file = open('glove.840B.300d.txt')
# dic = {}
# for line in tqdm(file):
#   w = line.split()[0]
#   if w in df.word.values:
#     m = line.split()[1:]
#     dic[w] = m
# file.close()
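
A slightly more defensive way to read the file is sketched below. It rests on two assumptions of mine: a few tokens in the large GloVe file may themselves contain spaces (so the line is split from the right), and membership tests are faster against a `set` than against an array of words. The helper name `load_selected_vectors` and the tiny in-memory "file" are hypothetical.

```python
import io

def load_selected_vectors(lines, wanted, dim):
    """Collect vectors for the wanted words from GloVe-style text lines.

    Each line is '<token> <v1> ... <v_dim>'. Splitting from the right keeps
    tokens that themselves contain spaces intact.
    """
    wanted = set(wanted)  # constant-time membership checks
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        token = parts[0]
        if token in wanted:
            vectors[token] = [float(x) for x in parts[1:]]
    return vectors

# Hypothetical 3-dimensional contents standing in for the real GloVe file
fake_file = io.StringIO(
    "abate 0.1 0.2 0.3\n"
    "the 0.0 0.5 0.1\n"
    ". . . 0.4 0.4 0.4\n"
    "zeal 0.7 0.8 0.9\n"
)
vectors = load_selected_vectors(fake_file, wanted=["abate", "zeal", "missing"], dim=3)
print(vectors)
```

With the real file, `lines` would be the open file handle and `dim` would be 300; words that never appear in the file are simply absent from the returned dictionary.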

In [None]:
# w2v = []
# for w in df.word.values:
#   if w in dic.keys():
#     w2v.append(np.asarray(dic[w]).astype(float))  # np.float was removed in NumPy 1.24
#   else:
#     w2v.append(np.nan)

### Adding the collected word embeddings to the word meaning data frame as a new feature 'embeddings'
I've also saved the dataframe as a csv file so that it can be used directly.

In [None]:
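# df['embeddings'] = w2v
# df = df.dropna()  # drop words that have no GloVe vector
# emb = np.vstack(df['embeddings'].values)  # shape: (number of words, 300)
# for i,e in enumerate(emb.T):
#   c = f'embedding_{i+1}'
#   df[c] = e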

# # df.to_csv('words_meaning_embeddings.csv', index=False)
# df.head()

This function collects the embeddings from the dataframe. Depending on the value of the parameter 'wordlist', it returns:
1. manhattan_500: the embeddings of the words in Manhattan's 500 word list.
2. magoosh_1000: the embeddings of the words in Magoosh's 1000 word list.
3. barrons_333: the embeddings of the words in Barron's 333 word list.
4. all: the embeddings of the words from all three word lists (set-wise union).

In [None]:
# def get_w2v(df, wordlist='all'):
#     # np.vstack stacks the per-word vectors into a (number of words, 300) array
#     if wordlist=='all':
#         w2v = np.vstack(df.embeddings.values)
#     else:
#         w2v = np.vstack(df[df['word_list']==wordlist].embeddings.values)
#     return w2v

In [None]:
# # since wordlist is 'all', the function returns the embeddings of all the words (in all the 3 word-lists)
# w2v = get_w2v(df, wordlist='all')
# w2v.shape

### I've already prepared and saved a dataframe containing the data. We can simply use that and skip the above steps.

In [None]:
df = pd.read_csv(folder+'words_meaning_embeddings.csv')
df.head()

In [None]:
w2v = df[df.columns[3:]]
w2v.shape

### Now it's time to cluster the word embeddings using Hierarchical Clustering.
#### Hierarchical Clustering is an unsupervised learning technique that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from every other cluster, and the objects within each cluster are broadly similar to each other. For this data, I have used Agglomerative Hierarchical Clustering, also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, called a dendrogram.
#### For implementing the algorithm, I have used the SciPy library, which comes with built-in functionality for computing the linkage of the data points from a chosen linkage method (how the distance between two clusters is measured) and a chosen distance metric (how the distance between two points is measured). These choices are passed as arguments when computing the linkage. Finally, the resulting linkage matrix is cut at a distance threshold to obtain the clusters.
#### You can read more about agglomerative hierarchical clustering on [this blog](https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/).
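
The merge loop described above can be sketched in a few lines of plain Python. This is an illustration only, not SciPy's implementation: it uses average linkage with cosine distance on invented 2-D points, and it stops merging once the closest pair of clusters is farther apart than a threshold, which is how I understand a distance-based cut of the tree. All names here are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points, threshold):
    """Merge the closest pair of clusters until none is within `threshold`."""
    clusters = [[i] for i in range(len(points))]  # start from singleton clusters
    while len(clusters) > 1:
        (a, b), dist = min(
            (((a, b), average_linkage(points, clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break  # cut the tree here
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Hypothetical 2-D vectors: two pointing roughly along x, two roughly along y
toy_points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
flat_clusters = agglomerate(toy_points, threshold=0.3)
print(flat_clusters)
```

This naive version recomputes every pairwise linkage on each pass, which is fine for a handful of points but far too slow for the full word list; that is one reason to rely on SciPy for the real data.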

![clustering](https://dashee87.github.io/images/hierarch.gif)

In [None]:
threshold = 0.7
Z = hierarchy.linkage(w2v, "average", metric="cosine")
C = hierarchy.fcluster(Z, threshold, criterion="distance")
n = len(np.unique(C))
print(f'Number of clusters created: {n}')

In [None]:
# attaching the cluster labels to the dataframe for extracting word clusters
data = df[['word', 'definition']].copy()
data['labels'] = C

### Let's check the number of words in each cluster

In [None]:
dd = pd.DataFrame(np.asarray(np.unique(C, return_counts=True)).T)
dd.columns = ['group_id', 'number of words']
fig = px.bar(dd, x='group_id', y='number of words')
fig.show()

### Let's see some of the word groups generated by the clustering algorithm

In [None]:
df_grp = data.groupby('labels')
df_grp.get_group(81)

In [None]:
df_grp.get_group(90)

In [None]:
df_grp.get_group(79)

In [None]:
df_grp.get_group(83)

### Finally, let's visualize the word embeddings in the form of a scatter plot using t-SNE. But first, let's quickly understand t-SNE.
#### t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps high-dimensional data to a low-dimensional space (typically two or three dimensions) such that each low-dimensional point represents one high-dimensional data point. These points are placed so that points that are neighbors in the original space stay close together, which means t-SNE preserves the local structure of the data.
#### For each point in the n-dimensional space, t-SNE converts the distances to all the other points into a probability distribution over neighbors, using a Gaussian kernel centered on that point. The points are then scattered in the lower-dimensional space, where neighbor similarities are measured with a Student’s t-distribution instead. Each point is then moved step by step so that the low-dimensional similarities match the high-dimensional ones as closely as possible, which is done by minimizing the Kullback-Leibler divergence between the two distributions.
#### There are 2 main hyperparameters in t-SNE:
#### Perplexity: Roughly the effective number of nearest neighbors each point takes into account. A small value focuses on very local structure, while a large value takes more of the global structure into account.
#### Iterations: The number of iterations for which we want t-SNE to update the points in the lower-dimensional space.
#### Because t-SNE is stochastic and sensitive to its hyperparameters, the result can differ between runs and between perplexity values, so it is good practice to run t-SNE with several perplexity values and numbers of iterations. To know more about t-SNE, check out [this awesome blog](https://distill.pub/2016/misread-tsne/); it explains t-SNE very well with interactive visualizations.
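
To make the perplexity hyperparameter more concrete: in the standard t-SNE formulation, perplexity is 2 raised to the Shannon entropy of a point's neighbor distribution, and the width of each point's Gaussian kernel is tuned until that value matches the requested perplexity. The sketch below only shows how the kernel width changes the perplexity; the squared distances and sigma values are invented, and the bandwidth search itself is left out.

```python
import math

def conditional_probabilities(sq_distances, sigma):
    """Gaussian neighbor probabilities from squared distances to the other points."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probabilities):
    """2 ** (Shannon entropy in bits) of a probability distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy

# Hypothetical squared distances from one point to five other points
sq_distances = [0.1, 0.2, 0.3, 4.0, 9.0]

narrow = perplexity(conditional_probabilities(sq_distances, sigma=0.3))
wide = perplexity(conditional_probabilities(sq_distances, sigma=3.0))
print(f"sigma=0.3 -> perplexity {narrow:.2f}")
print(f"sigma=3.0 -> perplexity {wide:.2f}")
```

A narrow kernel puts almost all the probability on the three closest points, so the perplexity stays below three, while a wide kernel spreads it over all five, pushing the perplexity toward five. This is why perplexity is often read as an effective neighbor count.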

In [None]:
from sklearn.manifold import TSNE
# The number of iterations is left at its default of 1000;
# newer scikit-learn releases rename the n_iter argument to max_iter.
trans = TSNE(n_components=2, perplexity=10, metric='cosine')
embeddings_2d = trans.fit_transform(w2v)

dff = data[['word', 'definition', 'labels']].copy()
dff['x'] = embeddings_2d[:,0]
dff['y'] = embeddings_2d[:,1]

fig = px.scatter(dff, x="x", y="y", color="labels",
                 hover_data=['word', 'definition'])
fig.show()

### Below is the dendrogram (a diagram that shows the hierarchical relationship between objects) of word clusters.
You can view it by right-clicking, selecting 'open image in new tab', and zooming in.

In [None]:
plt.figure(figsize=(128, 72))
dn = hierarchy.dendrogram(Z)
plt.show()

### That's it for this project. If you found the work useful, please upvote⬆ the notebook📓 and leave your feedback🗣 in the comment section below👇🏼.
### Thanks for reading!