# Task

For a set of documents extracted from Wikipedia for three different topics of your choice, perform a
semantic analysis as follows:
1. Use text preprocessing techniques (stemming/lemmatization, stop-word removal) and create the
bag-of-words and TF-IDF vectorizations
2. Use Latent Semantic Analysis with SVD for a) the bag-of-words encoding and b) the TF-IDF
encoding
3. Use Non-negative Matrix Factorization
4. Use LDA

Use the sklearn library in Python. See the tutorial here: [https://nlpforhackers.io/topic-modeling/](https://nlpforhackers.io/topic-modeling/)

# Imports

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import brown

from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import spacy
import requests
import nltk
import re

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Getting Wikipedia documents

In [None]:
def extract_document(page_link, file_name):
    # download the page and fail early on HTTP errors
    page = requests.get(page_link)
    page.raise_for_status()

    # parse the HTML
    soup = BeautifulSoup(page.content, 'html.parser')

    # write the text of every <p> element to <file_name>.txt
    with open(f"{file_name}.txt", "w", encoding="utf-8") as f:
        for item in soup.find_all('p'):
            f.write(item.get_text())

In [None]:
sport = ['https://en.wikipedia.org/wiki/Chess', 'https://en.wikipedia.org/wiki/Olympic_Games', 'https://en.wikipedia.org/wiki/Gymnastics',
         'https://en.wikipedia.org/wiki/Diving_(sport)', 'https://en.wikipedia.org/wiki/Premier_League']
for i, link in enumerate(sport, start=1):
  extract_document(link, f's{i}')


nlp = ['https://en.wikipedia.org/wiki/Linguistics', 'https://en.wikipedia.org/wiki/Artificial_intelligence',
       'https://en.wikipedia.org/wiki/Speech_recognition', 'https://en.wikipedia.org/wiki/Stemming',
       'https://en.wikipedia.org/wiki/Optical_character_recognition']
for i, link in enumerate(nlp, start=1):
  extract_document(link, f'nlp{i}')


film = ['https://en.wikipedia.org/wiki/Cinematography', 'https://en.wikipedia.org/wiki/Film_industry', 'https://en.wikipedia.org/wiki/Animation',
        'https://en.wikipedia.org/wiki/Sound_recording_and_reproduction', 'https://en.wikipedia.org/wiki/Sound_effect']
for i, link in enumerate(film, start=1):
  extract_document(link, f'f{i}')

# Text preprocessing

In [None]:
topics_abrev = ['s', 'nlp', 'f']
data = []

for topic in topics_abrev:
  for i in range(1, 6):
    # read each saved document; the with-block closes the file handle
    with open(f"{topic}{i}.txt", "r", encoding="utf-8") as f:
      data.append(f.read())

NO_DOCUMENTS = len(data)
NUM_TOPICS = 3

print(NO_DOCUMENTS)

15


## 1. Lower casing

In [None]:
for i in range(NO_DOCUMENTS):
  data[i] = data[i].lower()

print(len(data[14]), data[14])

13296 a sound effect (or audio effect) is an artificially created or enhanced sound, or sound process used to emphasize artistic or other content of films, television shows, live performance, animation, video games, music, or other media. traditionally, in the twentieth century, they were created with foley. in motion picture and television production, a sound effect is a sound recorded and presented to make a specific storytelling or creative point without the use of dialogue or music.  the term often refers to a process applied to a recording, without necessarily referring to the recording itself. in professional motion picture and television production, dialogue, music, and sound effects recordings are treated as separate elements. dialogue and music recordings are never referred to as sound effects, even though the processes applied to such as reverberation or flanging effects, often are called "sound effects".
this area and sound design have been slowly merged since the late-twent

## 2. Tokenization

In [None]:
for i in range(NO_DOCUMENTS):
  data[i] = word_tokenize(data[i])

print(len(data[14]), data[14])

2497 ['a', 'sound', 'effect', '(', 'or', 'audio', 'effect', ')', 'is', 'an', 'artificially', 'created', 'or', 'enhanced', 'sound', ',', 'or', 'sound', 'process', 'used', 'to', 'emphasize', 'artistic', 'or', 'other', 'content', 'of', 'films', ',', 'television', 'shows', ',', 'live', 'performance', ',', 'animation', ',', 'video', 'games', ',', 'music', ',', 'or', 'other', 'media', '.', 'traditionally', ',', 'in', 'the', 'twentieth', 'century', ',', 'they', 'were', 'created', 'with', 'foley', '.', 'in', 'motion', 'picture', 'and', 'television', 'production', ',', 'a', 'sound', 'effect', 'is', 'a', 'sound', 'recorded', 'and', 'presented', 'to', 'make', 'a', 'specific', 'storytelling', 'or', 'creative', 'point', 'without', 'the', 'use', 'of', 'dialogue', 'or', 'music', '.', 'the', 'term', 'often', 'refers', 'to', 'a', 'process', 'applied', 'to', 'a', 'recording', ',', 'without', 'necessarily', 'referring', 'to', 'the', 'recording', 'itself', '.', 'in', 'professional', 'motion', 'picture', '

## 3. Stop words removal

In [None]:
stop_words = set(stopwords.words('english')) 
# print(stop_words)

filtered_data = []
for doc in data:
  filtered_doc = []
  for word in doc:
    # raw string avoids the invalid-escape warning for '\-'
    if word not in stop_words and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', word):
      filtered_doc.append(word)
  filtered_data.append(filtered_doc)

print(len(filtered_data[14]), filtered_data[14])

1238 ['sound', 'effect', 'audio', 'effect', 'artificially', 'created', 'enhanced', 'sound', 'sound', 'process', 'used', 'emphasize', 'artistic', 'content', 'films', 'television', 'shows', 'live', 'performance', 'animation', 'video', 'games', 'music', 'media', 'traditionally', 'twentieth', 'century', 'created', 'foley', 'motion', 'picture', 'television', 'production', 'sound', 'effect', 'sound', 'recorded', 'presented', 'make', 'specific', 'storytelling', 'creative', 'point', 'without', 'use', 'dialogue', 'music', 'term', 'often', 'refers', 'process', 'applied', 'recording', 'without', 'necessarily', 'referring', 'recording', 'professional', 'motion', 'picture', 'television', 'production', 'dialogue', 'music', 'sound', 'effects', 'recordings', 'treated', 'separate', 'elements', 'dialogue', 'music', 'recordings', 'never', 'referred', 'sound', 'effects', 'even', 'though', 'processes', 'applied', 'reverberation', 'flanging', 'effects', 'often', 'called', 'sound', 'effects', 'area', 'sound'



## 4. Lemmatization

In [None]:
load_model = spacy.load("en_core_web_sm")

# join each token list back into a single string for spaCy
data = [" ".join(doc) for doc in filtered_data]


lemmatized_data = []
for doc_text in data:
  doc = load_model(doc_text)
  lemmatized_data.append(" ".join([token.lemma_ for token in doc]))

data = lemmatized_data
print(data[14])

sound effect audio effect artificially create enhanced sound sound process use emphasize artistic content film television show live performance animation video game music medium traditionally twentieth century create foley motion picture television production sound effect sound record present make specific storytelle creative point without use dialogue music term often refer process apply recording without necessarily refer record professional motion picture television production dialogue music sound effect recording treat separate element dialogue music recording never refer sound effect even though process apply reverberation flange effect often call sound effect area sound design slowly merge since late - twentieth century term sound effect range back early day radio year book bbc publish major article use sound effect consider sound effect deeply link broadcasting state would great mistake think anologous punctuation mark accent print never insert programme already exist author bro

# Creating the bag-of-words and TF-IDF vectorizations

* The bag-of-words model converts each text into a fixed-length vector by counting how many times each vocabulary word appears in it.
* TF-IDF weights a word in proportion to how often it appears in a document, offset by the number of documents that contain it. \
Hence, words that occur in nearly every document receive a low weight, while a word that occurs many times in only a few documents receives a higher weight, since it is likely indicative of the topic of those documents.
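
To make the two encodings concrete, here is a minimal hand-computed sketch on a hypothetical three-document toy corpus (the `docs` list and helper names are illustrative and not part of the notebook). It assumes the weighting that scikit-learn documents for `smooth_idf=False`, namely idf = ln(n / df) + 1, followed by L2 normalization of each document row.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lower-cased and tokenizable by whitespace.
docs = ["film sound film", "sound effect", "chess game"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag-of-words: one row per document, one column per vocabulary word.
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency and inverse document frequency (no smoothing).
n_docs = len(docs)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log(n_docs / df[w]) + 1 for w in vocab}

def tfidf_row(counts):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weighted = [c * idf[w] for c, w in zip(counts, vocab)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

tfidf = [tfidf_row(row) for row in bow]

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document, "film" appears twice and in no other document, so it ends up with a much larger weight than "sound", which is shared with the second document.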

In [None]:
bow_vectorizer = CountVectorizer(min_df=1, max_df=1.0,
                                 stop_words='english', lowercase = True,
                                 token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # raw string avoids the '\-' escape warning
bow_data_vectorized = bow_vectorizer.fit_transform(data)
print(bow_vectorizer.get_feature_names_out()[:50])
print(bow_data_vectorized.toarray(), len(bow_data_vectorized.toarray()))


tf_idf_vectorizer = TfidfVectorizer(use_idf=True,
                        smooth_idf=False,
                        ngram_range=(1,1),stop_words='english')
tf_idf_data_vectorized = tf_idf_vectorizer.fit_transform(data)
print(tf_idf_data_vectorized.toarray())

['-base' 'aaai' 'aan' 'aardman' 'aau' 'abandon' 'abandonment' 'abbas'
 'abbey' 'abbreviate' 'abbreviation' 'abbyy' 'abdominal' 'abdul'
 'abenteuer' 'aberration' 'abide' 'ability' 'able' 'abolish' 'abort'
 'abroad' 'absence' 'absent' 'absolute' 'abstract' 'abstraction' 'abul'
 'abuse' 'abyde' 'academia' 'academic' 'academy' 'accelerate'
 'acceleration' 'accelerator' 'accent' 'accept' 'acceptable' 'acceptance'
 'access' 'accessible' 'accident' 'accidental' 'acclaim' 'acclaimed'
 'accommodate' 'accommodation' 'accompaniment' 'accompany']
[[0 0 1 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]] 15
[[0.         0.00439034 0.         ... 0.00387866 0.00387866 0.        ]
 [0.         0.00421011 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.00957246]
 [0.         0.         0.00728851 ... 



# Latent Semantic Indexing/Analysis with SVD

LSA analyzes the relationships between a set of documents and the terms they contain by producing a set of latent concepts related to both. \
LSA assumes that words that are close in meaning occur in similar pieces of text (the distributional hypothesis). A matrix of word counts per document is built from the corpus (classically with one row per unique word and one column per document; scikit-learn uses the transpose, with one row per document), and singular value decomposition (SVD) is used to reduce the dimensionality of the word space while preserving the similarity structure among documents. \
Documents are then compared by taking the cosine of the angle between their vectors in the reduced space. Values close to 1 indicate very similar documents, while values close to 0 indicate very dissimilar ones.
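
The cosine comparison described above can be sketched in a few lines. The three vectors below are hypothetical stand-ins for documents in a 3-dimensional reduced space; they are not values produced by this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document vectors in the reduced concept space.
doc_a = [2.0, 1.0, 0.0]
doc_b = [4.0, 2.0, 0.0]  # same direction as doc_a, only longer
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Because the cosine ignores vector length, a long and a short document about the same concepts still score close to 1. In this notebook, the rows of `lsi_Z_bow` or `lsi_Z_tfidf` would play the role of these vectors.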

In [None]:
# Build a Latent Semantic Indexing model on the bag-of-words matrix
lsi_model_bow = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z_bow = lsi_model_bow.fit_transform(bow_data_vectorized)
print(lsi_Z_bow.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

# Build a Latent Semantic Indexing model on the TF-IDF matrix
lsi_model_tfidf = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z_tfidf = lsi_model_tfidf.fit_transform(tf_idf_data_vectorized)
print(lsi_Z_tfidf.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

(15, 3)
(15, 3)


# Non-negative Matrix Factorization

Non-negative Matrix Factorization (NMF) factorizes a non-negative matrix X into the product of two lower-rank non-negative matrices, A and B, such that AB approximates X. It is an unsupervised learning algorithm used to reduce the dimensionality of data. \
The algorithm iteratively updates A and B so that their product approaches X. This approximately preserves the structure of the original data while ensuring that both the basis and the weights stay non-negative.
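
The iterative idea can be sketched with the classic multiplicative update rules for the Frobenius-norm objective. This is an illustration only: scikit-learn's `NMF` uses a coordinate-descent solver by default, and the function name, the toy matrix, and the parameters below are assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(X, rank, iters=200, eps=1e-9, seed=0):
    """Approximate X (n x m, non-negative) as A @ B with A (n x rank) and B (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    A = rng.random((n, rank)) + eps
    B = rng.random((rank, m)) + eps
    errors = []
    for _ in range(iters):
        # Each update multiplies by a non-negative ratio, so A and B never go negative.
        B *= (A.T @ X) / (A.T @ A @ B + eps)
        A *= (X @ B.T) / (A @ B @ B.T + eps)
        errors.append(np.linalg.norm(X - A @ B))
    return A, B, errors

# Hypothetical document-term counts: two "topics" with little overlap.
X = np.array([[3.0, 0.0, 1.0],
              [2.0, 0.0, 1.0],
              [0.0, 4.0, 0.0],
              [0.0, 5.0, 1.0]])

A, B, errors = nmf_multiplicative(X, rank=2)
print("first error:", errors[0], "last error:", errors[-1])
print(np.round(A @ B, 2))
```

Here A plays the role of the document-topic weights (like `nmf_Z_bow`) and B the role of the topic-term matrix (like `nmf_model_bow.components_`).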

In [None]:
# Build a Non-Negative Matrix Factorization Model with BOW
nmf_model_bow = NMF(n_components=NUM_TOPICS)
nmf_Z_bow = nmf_model_bow.fit_transform(bow_data_vectorized)
print(nmf_Z_bow.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

# Build a Non-Negative Matrix Factorization Model with TF-IDF
nmf_model_tfidf = NMF(n_components=NUM_TOPICS)
nmf_Z_tfidf = nmf_model_tfidf.fit_transform(tf_idf_data_vectorized)
print(nmf_Z_tfidf.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

(15, 3)
(15, 3)




# Latent Dirichlet Allocation

LDA is fitted iteratively. Conceptually, there are two main steps:

* In the initialization stage, each word is assigned to a random topic;
* Iteratively, the algorithm goes through each word and reassigns it to a topic, taking into consideration:
  * How probable is the word under each topic?
  * How prevalent is each topic in the document that contains the word?
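
The two steps above correspond closely to collapsed Gibbs sampling, which can be sketched as follows. Note that this is a swapped-in technique for illustration: scikit-learn's `LatentDirichletAllocation` is fitted with variational Bayes, not Gibbs sampling. The function name, the toy corpus of integer word ids, and the `alpha`/`beta` values are assumptions made for this example.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each document is a list of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # words assigned to topic k overall
    assignments = []

    # Initialization: every word occurrence gets a random topic.
    for d, doc in enumerate(docs):
        z = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current assignment out of the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # weight(t) is proportional to P(word | topic t) * P(topic t | document d).
                weights = [
                    (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word

# Hypothetical corpus: ids 0-2 stand for "film" words, ids 3-5 for "sport" words.
docs = [[0, 1, 2, 0, 1], [0, 2, 2, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=6)
for d, counts in enumerate(doc_topic):
    print(f"doc {d}: topic counts {counts}")
```

Normalizing each row of `doc_topic` gives a per-document topic distribution comparable in spirit to the rows of `lda_Z_bow`.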

In [None]:
# Build a Latent Dirichlet Allocation model on the bag-of-words counts
lda_model_bow = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z_bow = lda_model_bow.fit_transform(bow_data_vectorized)
print(lda_Z_bow.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

# Build a Latent Dirichlet Allocation model on the TF-IDF matrix
# (LDA is a model of word counts, so fitting it on TF-IDF weights is a heuristic comparison)
lda_model_tfidf = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z_tfidf = lda_model_tfidf.fit_transform(tf_idf_data_vectorized)
print(lda_Z_tfidf.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

(15, 3)
(15, 3)


In [None]:
print(lda_Z_bow[0])
print(lda_Z_tfidf[0])
print(nmf_Z_bow[0])
print(nmf_Z_tfidf[0])
print(lsi_Z_bow[0])
print(lsi_Z_tfidf[0])

[6.57195286e-05 7.38871743e-05 9.99860393e-01]
[0.03041211 0.03036823 0.93921966]
[9.44962006 0.1682193  0.64529439]
[0.01676657 0.22310162 0.0476976 ]
[216.57609489 -75.74119281 -46.51965098]
[ 0.20623814 -0.11263948 -0.10492602]


In [None]:
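def print_topics(model, vectorizer, top_n=10):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx + 1))
        # indices of the top_n largest weights, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])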
 
print("LSI Model:", "\n")
print_topics(lsi_model_bow, bow_vectorizer)
print("\n", "=" * 50)
 
print("\n NMF Model:", "\n")
print_topics(nmf_model_bow, bow_vectorizer)
print("\n", "=" * 50)

print("\n LDA Model:", "\n")
print_topics(lda_model_bow, bow_vectorizer)

LSI Model: 

Topic 1:
[('film', 0.33836305144615425), ('game', 0.32766675562420805), ('olympic', 0.2648268911420336), ('chess', 0.23506266065303888), ('use', 0.19340937167459293), ('league', 0.1895090819156951), ('olympics', 0.12505958804386513), ('sport', 0.12165426973094288), ('world', 0.11863131212051532), ('ioc', 0.11591269701836866)]
Topic 2:
[('film', 0.6969009485135462), ('cinema', 0.17244049363598266), ('industry', 0.16208436968049897), ('studio', 0.08760375231081284), ('produce', 0.08052676489016174), ('production', 0.07970395737220276), ('company', 0.07745161224386711), ('movie', 0.06896346075004842), ('large', 0.06874037150127821), ('make', 0.06852368134671791)]
Topic 3:
[('league', 0.6456118993973649), ('premier', 0.3860466686862525), ('season', 0.2616956938525213), ('club', 0.24190200899201586), ('football', 0.12076861413229528), ('match', 0.09152616060324802), ('team', 0.08460328503480513), ('million', 0.07184754834741716), ('right', 0.06853396226497326), ('player', 0.062



[('film', 15.264493580658147), ('cinema', 3.742337672961421), ('industry', 3.5882722284586874), ('use', 3.012616756150872), ('make', 2.2155297612813114), ('language', 2.044185131619129), ('studio', 1.937596816013374), ('company', 1.917123226122671), ('produce', 1.8996995374978398), ('large', 1.8629315596312808)]
Topic 3:
[('league', 13.136759270302923), ('premier', 7.827287087996098), ('season', 5.315923809264945), ('club', 4.977319058040293), ('football', 2.4754544435496197), ('team', 2.3049898202194146), ('match', 2.09321830928158), ('player', 1.862070669609707), ('right', 1.8370100982021094), ('million', 1.7475255052506407)]


 LDA Model: 

Topic 1:
[('stem', 38.100056393911515), ('stemmer', 18.75400367951113), ('suffix', 17.429022021910797), ('algorithm', 15.732097917430906), ('rule', 15.683364973680535), ('strip', 13.693132946785532), ('word', 11.377786425203041), ('root', 9.581411917653815), ('olympic', 9.072716775524048), ('game', 9.014954038810671)]
Topic 2:
[('film', 414.77781

# Plots

## Visualisation for LSI with SVD model

In [None]:
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

### Plots for documents

In [None]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(bow_data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

In [None]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(tf_idf_data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

### Plots for words

In [None]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(bow_data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], bow_vectorizer.get_feature_names_out()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)



In [None]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(tf_idf_data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], tf_idf_vectorizer.get_feature_names_out()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

## Visualisation for NMF model

### Plots for documents

In [None]:
nmf = NMF(n_components=2)
documents_2d = nmf.fit_transform(bow_data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)



In [None]:
nmf = NMF(n_components=2)
documents_2d = nmf.fit_transform(tf_idf_data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)



### Plots for words

In [None]:
nmf = NMF(n_components=2)
words_2d = nmf.fit_transform(bow_data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], bow_vectorizer.get_feature_names_out()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)



In [None]:
nmf = NMF(n_components=2)
words_2d = nmf.fit_transform(tf_idf_data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], tf_idf_vectorizer.get_feature_names_out()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)



In [None]:
!pip install pyldavis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(nmf_model_bow, bow_data_vectorized, bow_vectorizer, mds='tsne')
panel

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyldavis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Building wheels for collected packages: pyldavis, sklearn
  Building wheel for pyldavis (PEP 517) ... done
  Created wheel for pyldavis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136897 sha256=4e5f59184806fcd78b595f2d5ee2d6b95e1de85971a75afc4621eb60fed18977
  Stored in directory: /root/.cache/pip/wheels/c9/21/f6/17bcf2667e8a68532ba2fbf6d5c72fdf4c7f7d9abfa4852d2f
  Building wheel for sklearn (



- Topics are shown on the left, while words are shown on the right.
- Larger topics are more frequent in the corpus.
- Topics that are closer together are more similar; topics that are further apart are less similar.
- When a topic is selected, its most representative words are listed on the right. The ranking combines how frequent a word is within the topic and how discriminative it is for that topic.
- The slider adjusts the weight given to each of these two properties.
- Hovering over a word resizes the topics according to how representative the word is for each of them.
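
The trade-off controlled by the slider can be sketched as a relevance score. This is a minimal illustration that assumes the usual LDAvis definition, relevance = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)); the probabilities below are hypothetical and not taken from the fitted models.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam = 1 ranks by in-topic frequency; lam = 0 ranks by lift (how topic-specific the word is)."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Hypothetical probabilities for a single topic.
p_w_t = {"film": 0.5, "use": 0.3, "foley": 0.2}   # p(word | topic)
p_w = {"film": 0.2, "use": 0.25, "foley": 0.01}   # p(word) in the whole corpus

def rank(lam):
    """Words of the topic sorted from most to least relevant for a given slider value."""
    return sorted(p_w_t, key=lambda w: relevance(p_w_t[w], p_w[w], lam), reverse=True)

rank_frequency = rank(1.0)
rank_lift = rank(0.0)
print(rank_frequency)
print(rank_lift)
```

With the slider at 1 the common word "film" leads; at 0 the rare but topic-specific "foley" moves to the top, while the generic "use" drops to the bottom.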

In [None]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(nmf_model_tfidf, tf_idf_data_vectorized, tf_idf_vectorizer, mds='tsne')
panel



## Visualisation for LDA model

In [None]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model_bow, bow_data_vectorized, bow_vectorizer, mds='tsne')
panel



In [None]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model_tfidf, tf_idf_data_vectorized, tf_idf_vectorizer, mds='tsne')
panel

