- Python: version 3.7.5
- Pandas: version 1.1.1
- Numpy: version 1.17.3
- NLTK: version 3.5
- Gensim: version 3.8.0
- Scikit-learn: version 0.23.2
- Plotly: version 4.14.1
- Networkx: version 2.5
- Yellowbrick: version 1.3
This repository provides an example of word similarity visualization. We use a word2vec model to vectorize each word, cosine similarity to measure the distance between word vectors, and networkx to plot the word embedding as a graph structure.
The example uses a news dataset; the model is trained on headlines from three categories: Politics, Entertainment and Travel.
The model generates a vector to represent each word, so we can apply many methods in the vector space, such as cosine similarity, to obtain the most similar words. This can be used to form clusters of words.
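The similarity computation can be sketched with plain numpy. The three-dimensional vectors below are hypothetical placeholders for real word2vec output (which would be higher-dimensional), chosen only to make the ranking visible:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors (1.0 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical low-dimensional vectors standing in for real embeddings.
vectors = {
    "politics": np.array([0.9, 0.1, 0.0]),
    "senate":   np.array([0.8, 0.2, 0.1]),
    "beach":    np.array([0.1, 0.9, 0.2]),
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(w, cosine_similarity(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]
```

With a trained gensim model the same query is `model.wv.most_similar("politics")`.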
First, we feed the word vectors to a K-means method to see how well the groups (news categories) are separated. To determine the optimal number of clusters, the elbow method was applied. We can see in the image below that the optimal number of clusters matches the number of categories.
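The pinned Yellowbrick dependency suggests the repository uses its `KElbowVisualizer` for this plot; a dependency-light sketch of the same idea using only scikit-learn is shown below. The synthetic blobs stand in for the word-vector matrix; with three underlying groups, the inertia (within-cluster sum of squares) drops sharply up to k = 3 and flattens afterwards, which is the "elbow":

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the word-vector matrix: three well-separated
# blobs, mimicking the three news categories.
X, _ = make_blobs(n_samples=150, centers=3, n_features=10, random_state=42)

# Elbow method: fit K-means for each candidate k and record the inertia.
inertias = {k: KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
            for k in range(1, 7)}

# The elbow is where the marginal drop in inertia shrinks abruptly;
# here that happens at k = 3.
```

Yellowbrick's `KElbowVisualizer(KMeans(), k=(1, 7))` automates exactly this loop and draws the curve.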
To check whether the groups were formed correctly, we take words from each category and verify that they are assigned to the same cluster.
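This check can be sketched as follows. The word vectors here are hypothetical, hand-built so that word pairs from the same category sit close together; in the repository they would come from the trained word2vec model:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical word vectors: two words per category, clustered tightly.
words = ["senate", "president", "actor", "film", "beach", "flight"]
X = np.array([
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1],   # politics
    [0.1, 1.0, 0.1], [0.2, 0.9, 0.0],   # entertainment
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9],   # travel
])

# Cluster with k = 3 (the elbow value) and map each word to its label.
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
cluster_of = dict(zip(words, labels))
```

If the embedding captured the categories, words from the same category share a cluster label while words from different categories do not.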
Finally, we plot the clusters using a graph structure. As can be seen in the image below, a graph is created where each vertex represents a word in the vocabulary and each edge represents the distance (similarity) between two words.
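A sketch of building such a graph with networkx is shown below. The vectors and the similarity threshold are illustrative assumptions, not the repository's actual values; the idea is simply to connect word pairs whose cosine similarity exceeds a cutoff:

```python
import itertools
import numpy as np
import networkx as nx

# Hypothetical word vectors; in the repository these come from the model.
vectors = {
    "senate":    np.array([1.0, 0.1, 0.0]),
    "president": np.array([0.9, 0.2, 0.1]),
    "actor":     np.array([0.1, 1.0, 0.1]),
    "film":      np.array([0.2, 0.9, 0.0]),
}

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

G = nx.Graph()
G.add_nodes_from(vectors)
for a, b in itertools.combinations(vectors, 2):
    sim = cos(vectors[a], vectors[b])
    if sim > 0.8:  # keep only strongly similar pairs (threshold is illustrative)
        G.add_edge(a, b, weight=sim)

# nx.draw(G, with_labels=True) would render the graph (requires matplotlib).
```

With this layout, tightly related words form connected components, which is what makes the clusters visible in the plotted graph.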