## NLP for News Website Articles Topic Modelling

The scraper in this folder crawls the web by reading a CSV file of URLs. It produces a new CSV file containing the URL, website title and description for each URL in the original file.
For this project, we obtained a list of URLs of news articles. This notebook performs NLP on the scraper's output dataset and consists of:

1) Topic Modelling with LDA:

- Aimed at constructing a good dictionary by extracting all possible words from the dataset and removing those that are not relevant or informative (e.g., infrequent words).

- For URLs: given "www.lachainemeteo.com/meteo-france/ville-6560/previsions-meteo-bordeaux-aujourdhui", the following words can be extracted: lachainemeteo (or www.lachainemeteo.com), meteo-france, ville, prevision, meteo, bordeaux.

- Built a first baseline model using LDA (Latent Dirichlet Allocation, from the Gensim library) to find the topics from the constructed dictionary. An important aspect is choosing the number of topics for LDA properly.

- Applied LDA to the titles and the descriptions of the websites separately. Finally, we combined the description, title and URL columns to construct the dictionary and applied LDA again. We then computed topic coherence metrics for a range of topic counts and picked the count with the best coherence.

2) Improving the baseline model:

To find the best number of topics, we used two methods:

- Calculated topic coherence for several numbers of topics to obtain the optimal number of topics
- Clustered the embeddings with k-means and used the elbow method to find the optimal number of clusters

- For titles, used LSA to obtain an SVD matrix, performed k-means clustering on the vector representations, and applied LDA to each cluster to define one topic per cluster

- For descriptions, trained word2vec, performed k-means clustering on the vector representations, applied TF-IDF and other preprocessing techniques to keep only useful words (especially for descriptions), and applied LDA to each cluster to define one topic per cluster
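
The URL example in the list above can be reproduced with a few lines of standard-library Python. This is only a sketch of the idea, not the cleaning code used in this notebook: the function name `url_tokens` is hypothetical, it assumes ASCII URLs, and it keeps plural forms as they appear in the URL.

```python
import re
from urllib.parse import urlsplit

def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (digits and punctuation dropped)."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep the site name, drop the top-level domain (e.g. ".com", ".fr").
    site = host.rsplit(".", 1)[0]
    # Any run of ASCII letters in the path counts as a token.
    path_tokens = re.findall(r"[a-z]+", parts.path.lower())
    return [site] + path_tokens

tokens = url_tokens("www.lachainemeteo.com/meteo-france/ville-6560/previsions-meteo-bordeaux-aujourdhui")
print(tokens)
```

Site names such as `lachainemeteo` say little about the article topic, which is why the cells below filter them out with `unallowed_words`.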


In [None]:
#Preprocessing URL data and creating data2.csv
import pandas as pd

df = pd.read_csv("scrapped_url.csv", delimiter=",")
df = df.dropna(subset=['title', 'description']).drop(columns="Unnamed: 3")

def clean_url(url):
    """Turn a raw URL into a list of lowercase word tokens."""
    # Drop the scheme: "https://" is 8 characters long, "http://" is 7.
    text = url[8:] if url[4] == "s" else url[7:]
    # Rules are applied in order: ".comment" is split off first so that
    # removing ".com" does not clip words such as "commentcamarche".
    replacements = [("www.", ""), (".comment", ". comment"), (".madame", ". madame"),
                    (".com", ""), (".php", ""), (".fr", ""), (".html", ""),
                    (".", " "), ("-", " ")]
    for old, new in replacements:
        text = text.replace(old, new)
    # Remove all digits.
    text = text.translate(str.maketrans("", "", "0123456789"))
    # Path separators become spaces; collapse runs of spaces and lowercase.
    text = " ".join(text.strip().split("/")).strip()
    text = text.replace("     ", " ").replace("  ", " ").lower().replace(" artfig", "")
    return text.split(" ")

# Clean the first column (the URL) row by row.
# Note: later cells read this column as 'url_clean', so the first column of
# scrapped_url.csv is assumed to carry that name.
url_col = df.columns[0]
df[url_col] = df[url_col].map(clean_url)

df.to_csv("data2.csv")

## Baseline Model: Analysing URLs with LDA

In [None]:
import gensim
import pandas as pd
from gensim import corpora
import pickle

df = pd.read_csv("data2.csv")
unallowed_words = ["lefigaro","figaro","linternaute","commentcamarche","journaldunet","lachainemeteo","linternaute"]

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3 and token not in unallowed_words:
            result.append(token)
    return result

processed_url = df['url_clean'].map(preprocess)
processed_url

dictionary = corpora.Dictionary(processed_url)
corpus = [dictionary.doc2bow(text) for text in processed_url]

pickle.dump(corpus, open('corpus.pkl', 'wb'))
#dictionary.save('dictionary.gensim')

NUM_TOPICS = 9
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
#ldamodel.save(f"model{NUM_TOPICS}.gensim")

In [None]:
for idx, topic in ldamodel.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
topics = ldamodel.print_topics(num_words=1)
for topic in topics:
    print(topic)
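
To make the `dictionary` and `corpus` objects less opaque, here is a minimal pure-Python sketch of a bag-of-words encoding: every distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The toy documents and the names `token2id` and `to_bow` are illustrative assumptions; the ids gensim assigns may differ.

```python
from collections import Counter

docs = [["meteo", "france", "bordeaux", "meteo"],
        ["sport", "football", "france"]]

# Assign an integer id to every distinct token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus)
```

LDA only ever sees these counts, so word order inside a URL, title or description plays no role in the topics it finds.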

## Baseline Model: Analysing Titles with LDA

In [None]:
import gensim
import pandas as pd
from gensim import corpora
import pickle

df = pd.read_csv("data2.csv")
unallowed_words = ["lefigaro","figaro","linternaute","commentcamarche","journaldunet","lachainemeteo","linternaute","internaute","madame"]

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 4 and token not in unallowed_words:
            result.append(token)
    return result

processed_titles = df['title'].map(preprocess)
processed_titles

dictionary = corpora.Dictionary(processed_titles)
corpus = [dictionary.doc2bow(text) for text in processed_titles]

pickle.dump(corpus, open('corpus.pkl', 'wb'))
#dictionary.save('dictionary.gensim')

NUM_TOPICS = 9
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
#ldamodel.save(f"model{NUM_TOPICS}.gensim")

In [None]:
for idx, topic in ldamodel.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
topics = ldamodel.print_topics(num_words=1)
for topic in topics:
    print(topic)
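
One caveat: `gensim.parsing.preprocessing.STOPWORDS` is an English list, while the sites filtered in `unallowed_words` suggest that the titles are French. The length filter removes most short French function words, but longer ones can slip through. The sketch below shows one way to add a French stop list. The list itself is a short, hand-written, hypothetical example and `preprocess_fr` is not part of this notebook.

```python
import re

# Hypothetical, deliberately short French stop-word list for illustration only.
FRENCH_STOPWORDS = {"cette", "comme", "entre", "notre", "votre",
                    "leurs", "aussi", "alors", "avoir", "toutes"}

def preprocess_fr(text, min_len=5, extra_stopwords=frozenset()):
    """Lowercase, tokenise on letters, drop short tokens and stop words."""
    tokens = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    return [t for t in tokens
            if len(t) >= min_len
            and t not in FRENCH_STOPWORDS
            and t not in extra_stopwords]

print(preprocess_fr("Comme chaque année, cette région attire les touristes"))
```

`min_len=5` mirrors the `len(token) > 4` test used in the cells of this notebook, and `extra_stopwords` can carry the site names from `unallowed_words`.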

## Baseline Model: Analysing Descriptions with LDA

In [None]:
import gensim
import pandas as pd
from gensim import corpora
import pickle

df = pd.read_csv("data2.csv")

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 4:
            result.append(token)
    return result

processed_descriptions = df['description'].map(preprocess)
processed_descriptions

dictionary = corpora.Dictionary(processed_descriptions)
corpus = [dictionary.doc2bow(text) for text in processed_descriptions]

pickle.dump(corpus, open('corpus.pkl', 'wb'))
#dictionary.save('dictionary.gensim')

NUM_TOPICS = 9
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
#ldamodel.save(f"model{NUM_TOPICS}.gensim")

In [None]:
for idx, topic in ldamodel.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
topics = ldamodel.print_topics(num_words=1)
for topic in topics:
    print(topic)

## Improving Model: LDA on combined URL, TITLES and DESCRIPTIONS tokens

In [None]:
import gensim
import pandas as pd
from gensim import corpora
import pickle

df = pd.read_csv("data2.csv")
unallowed_words = ["lefigaro","figaro","linternaute","commentcamarche","journaldunet","lachainemeteo","linternaute","internaute","madame"]

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 4 and token not in unallowed_words:
            result.append(token)
    return result

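# Join the three text columns into one string per row before tokenising
# (df['title','url_clean','description'] would raise a KeyError).
combined_text = df[['title', 'url_clean', 'description']].astype(str).apply(lambda row: ' '.join(row), axis=1)
processed_url = combined_text.map(preprocess)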
processed_url

dictionary = corpora.Dictionary(processed_url)
corpus = [dictionary.doc2bow(text) for text in processed_url]

pickle.dump(corpus, open('corpus.pkl', 'wb'))
#dictionary.save('dictionary.gensim')

NUM_TOPICS = 9
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
#ldamodel.save(f"model{NUM_TOPICS}.gensim")

In [None]:
for idx, topic in ldamodel.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
topics = ldamodel.print_topics(num_words=1)
for topic in topics:
    print(topic)

## Improving Model: Coherence values analysis for best number of topics

In [None]:
import numpy as np
import pandas as pd
import logging
import pyLDAvis.gensim
import json
import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary
from numpy import array

# Import dataset
p_df = pd.read_csv('data2.csv')
p_df['all'] = p_df[['description', 'title', 'url_clean']].apply(lambda x: ' '.join(x).replace('[','').replace(']','').replace("'","").replace(".","").replace(",","").replace(";",""), axis=1)

docs =array(p_df['all'])
unallowed_words = ["lefigaro","figaro","linternaute","commentcamarche","journaldunet","lachainemeteo","linternaute","internaute","madame"]

# Define function for tokenize and lemmatizing
def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove unallowed_words and tokens with 4 characters or fewer
    docs = [[token for token in doc if len(token) > 4 and token not in unallowed_words] for doc in docs]
    
    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs
# Apply the function to our documents
docs = docs_preprocessor(docs)

# Create bigram and trigram models
from gensim.models import Phrases
# Add bigrams and trigrams to docs; min_count=2 keeps only those that appear 2 times or more.
bigram = Phrases(docs, min_count=2)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram or trigram, add to document.
            docs[idx].append(token)
#Remove rare & common tokens 
# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=10, no_above=0.2)
#Create dictionary and corpus required for Topic Modeling
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
print(corpus[:1])
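
The effect of `filter_extremes(no_below=10, no_above=0.2)` can be illustrated without gensim: a token survives only if it occurs in at least `no_below` documents and in no more than a fraction `no_above` of all documents. The following is a simplified sketch on toy data with thresholds chosen for the example. The function name is hypothetical and gensim's own rounding of the upper bound may differ slightly.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens present in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Count each token once per document (document frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    keep = {t for t, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]

toy_docs = [["meteo", "paris", "article"],
            ["meteo", "bordeaux", "article"],
            ["sport", "paris", "article"],
            ["sport", "lyon", "article"]]
filtered = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.5)
print(filtered)
```

Here "article" is dropped because it appears in every document, and "bordeaux" and "lyon" are dropped because they appear only once.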

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=2):
    """
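    Compute c_v and u_mass coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between two tested numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : c_v coherence values corresponding to the LDA model with respective number of topics
    coherence_values2 : u_mass coherence values corresponding to the LDA model with respective number of topics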
    """
    coherence_values = []
    coherence_values2 = []
    model_list = []
    for num_topics in range(start, limit, step):
        model=LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherencemodel2 = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='u_mass')
        coherence_values.append(coherencemodel.get_coherence())
        coherence_values2.append(coherencemodel2.get_coherence())

    return model_list, coherence_values, coherence_values2
model_list, coherence_values, coherence_values2 = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=docs, start=2, limit=40, step=2)
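
For intuition about what `u_mass` measures, the sketch below scores one topic from document co-occurrence counts: each pair of top words contributes the log of how often the two words share a document, relative to how often the higher-ranked word occurs. It is a simplified version of the UMass measure with add-one smoothing. gensim's `CoherenceModel` differs in details such as smoothing and averaging, so the numbers are not comparable, and the function name and toy data are assumptions.

```python
import math

def umass_coherence(top_words, docs):
    """Simplified UMass coherence for one topic.

    top_words must be ordered from most to least probable, and every word
    is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score

toy_docs = [["meteo", "bordeaux"], ["meteo", "paris"], ["sport", "paris"]]
coherent = umass_coherence(["meteo", "bordeaux"], toy_docs)
incoherent = umass_coherence(["meteo", "sport"], toy_docs)
print(coherent, incoherent)
```

Words that share documents score close to zero, while words that never co-occur push the score further below zero, so higher values indicate a more coherent topic.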

In [None]:
# Show graph
import matplotlib.pyplot as plt
limit = 40; start = 2; step = 2  # must match the call to compute_coherence_values
x = range(start, limit, step)

plt.subplot(2, 1, 1)
plt.plot(x, coherence_values, label="c_v")
plt.title('Topic Coherence Values')
plt.ylabel('c_v')
plt.legend(loc='best')  # legend uses the label passed to plt.plot

plt.subplot(2, 1, 2)
plt.plot(x, coherence_values2, label="u_mass")
plt.xlabel('Num Topics')
plt.ylabel('u_mass')
plt.legend(loc='best')

plt.show()
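
Besides reading the plot, the number of topics can be picked programmatically by taking the topic count with the highest coherence. This is a sketch only: the helper `best_num_topics` is hypothetical and the scores below are made-up values, not results from this dataset.

```python
def best_num_topics(topic_counts, scores):
    """Return the topic count with the highest coherence score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return topic_counts[best_index]

# Hypothetical c_v values, for illustration only.
topic_counts = list(range(2, 12, 2))       # [2, 4, 6, 8, 10]
example_cv = [0.31, 0.38, 0.42, 0.47, 0.44]
chosen = best_num_topics(topic_counts, example_cv)
print(chosen)
```

In the notebook this would be called with `list(x)` and `coherence_values`. Coherence curves tend to be noisy, so it is worth checking that the chosen value is not an isolated spike before relying on it.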

In [None]:
# Set parameters.
num_topics = 8
chunksize = 500 
passes = 20 
iterations = 400
eval_every = 1  

# Make a index to word dictionary.
temp = dictionary[0]  # only to "load" the dictionary.
id2word = dictionary.id2token

lda_model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

# Print the Keyword in the topics
print(lda_model.print_topics())

"""
# Compute Coherence Score using c_v and UMass
coherence_model_lda = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence='c_v')
coherence_model_lda_umass = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence="u_mass")
coherence_lda_cv = coherence_model_lda.get_coherence()
coherence_lda_umass = coherence_model_lda_umass.get_coherence()
print('\nCoherence Score_c_v: ', coherence_lda_cv)
print('\nCoherence Score_umass: ', coherence_lda_umass)
"""

## Improving Model: Analysing titles with LSA and Clustering

In [None]:
## LSA: https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
import pandas as pd

df = pd.read_csv("data2.csv")
documents = df["title"].values
unallowed_words = ["lefigaro","figaro","linternaute","commentcamarche","journaldunet","lachainemeteo","linternaute","internaute","madame"]

# some preprocessing
for r in range(0,len(documents)):
    documents[r] = ''.join(i for i in documents[r] if not i.isdigit())
    
    # Removing words with XX characters or fewer
    XX = 4
    documents[r] = ' '.join([w for w in documents[r].split() if len(w)>XX and w not in unallowed_words])

# raw documents to tf-idf matrix: 
vectorizer = TfidfVectorizer(use_idf=True, 
                             smooth_idf=True)
# SVD to reduce dimensionality: 
svd_model = TruncatedSVD(n_components=1000,
                         algorithm='randomized',
                         n_iter=100)
# pipeline of tf-idf + SVD, fit to and applied to documents:
svd_transformer = Pipeline([('tfidf', vectorizer), 
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(documents)
# svd_matrix can later be used to compare documents, compare words, or compare queries with documents
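
For readers unfamiliar with the weighting applied before the SVD step, the sketch below computes TF-IDF by hand using the smoothed formula that scikit-learn documents for `smooth_idf=True`, namely `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. It works on already tokenised toy documents, and the function name is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows, for tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows([["meteo", "bordeaux"], ["meteo", "paris"], ["meteo", "sport"]])
print(rows[0])
```

A word that appears in every document ("meteo") ends up with a lower weight than a word specific to one document ("bordeaux"), which is the property LSA relies on.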

In [None]:
# Finding optimum number of clusters
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

NBR_CLUSTER_TEST = 20
x = []
for j in range(0,len(svd_matrix)):
    x.append(svd_matrix[j])
wcss = [] #Within Cluster Sum of Squares

for i in range(1, NBR_CLUSTER_TEST):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
    
plt.plot(range(1, NBR_CLUSTER_TEST), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') 
plt.show()
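
The quantity plotted by the elbow method, `kmeans.inertia_`, is the within-cluster sum of squares: the sum of squared distances from every point to the centre of its cluster. The sketch below computes it by hand for given cluster labels, using the cluster mean as the centre. The toy points and the function name are illustrative assumptions.

```python
def wcss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = wcss(points, [0, 0, 0, 0])
two_clusters = wcss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)  # 104.0 4.0
```

WCSS always decreases as clusters are added, so the elbow is the point after which the decrease becomes small. In this toy case the drop from one to two clusters is large and any further split gains little.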

In [None]:
#Applying kmeans to the dataset / Creating the kmeans classifier
BEST_CLUSTER_NBR = 7
kmeans = KMeans(n_clusters = BEST_CLUSTER_NBR, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)
df["kmeans_class"] = y_kmeans
df.groupby('kmeans_class').count() 

In [None]:
#Using LDA to find 1 topic per cluster
# (imports of linear_model, metrics, TfidfVectorizer, NearestNeighbors and seaborn removed: unused in this cell)
import gensim
from gensim import corpora
import pickle

def preprocess(text):
    result = []
    unallowed_words = ["lefigaro","figaro","linternaute","commentcamarche","journaldunet","lachainemeteo","linternaute","internaute","madame"]
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 4 and token not in unallowed_words:
            result.append(token)
    return result

CLUSTER_NUM = 7
for i in range(0,CLUSTER_NUM):
    # Latent DIrichlet Allocation Model
    processed_desc = df.loc[df['kmeans_class']==i]["title"].map(preprocess)
    dictionary = corpora.Dictionary(processed_desc)
    corpus = [dictionary.doc2bow(text) for text in processed_desc]

    pickle.dump(corpus, open('corpus.pkl', 'wb'))
    #dictionary.save('dictionary.gensim')

    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 1, id2word=dictionary, passes=15)
    #ldamodel.save(f"model{NUM_TOPICS}.gensim")
    for idx, topic in ldamodel.print_topics(-1):
        print('Cluster: {} \n Topic: {} \nWords: {}'.format(i, ldamodel.print_topics(num_words=1), topic))
        print()
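
With `num_topics = 1`, every token of a cluster is assigned to the same topic, so the resulting topic is expected to largely track token frequencies within that cluster. A plain frequency count is therefore a cheap cross-check of the labels printed by the loop. The sketch below uses toy data, and `top_tokens_per_cluster` is a hypothetical helper.

```python
from collections import Counter

def top_tokens_per_cluster(token_lists, labels, n=3):
    """Map each cluster label to its n most frequent tokens."""
    counts = {}
    for tokens, label in zip(token_lists, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {label: [tok for tok, _ in c.most_common(n)] for label, c in counts.items()}

titles = [["meteo", "bordeaux"], ["meteo", "paris"],
          ["football", "ligue"], ["football", "paris"]]
labels = [0, 0, 1, 1]
cluster_labels = top_tokens_per_cluster(titles, labels, n=1)
print(cluster_labels)
```

In the notebook the inputs would be `df["title"].map(preprocess)` and `df["kmeans_class"]`.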

## Improving Model: Analysing Descriptions with Word2vec and Clustering

In [None]:
from gensim.models.word2vec import Word2Vec
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

model_w2v_path = './model_w2v_urls.bin'
df2 = pd.read_csv("data2.csv")
X_train = df2["description"]
emb_size = 128

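# some preprocessing: drop digits and words with XX characters or fewer
XX = 4
def clean_description(text):
    text = ''.join(ch for ch in text if not ch.isdigit())
    return ' '.join(w for w in text.split() if len(w) > XX)

df2["description"] = df2["description"].map(clean_description)
# Word2Vec expects each document as a list of tokens; a raw string
# would be read as a sequence of single characters.
X_train = df2["description"].str.split().tolist()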

In [None]:
# Building/training the model
# Initialize model_w2v and build vocabularies
# should use the whole dataset, not just training set
model_w2v = Word2Vec(size=emb_size, min_count=5)  # gensim < 4.0; on gensim >= 4.0 the argument is vector_size=emb_size
model_w2v.build_vocab(X_train)
model_w2v.train(X_train, total_examples=model_w2v.corpus_count, epochs=2000)

# save the model_w2v
model_w2v.save(model_w2v_path)
print("training w2v: done")

In [None]:
# Reloading the trained model_w2v, building vectors and cleaning dataframe
new_model = Word2Vec.load(model_w2v_path)
X = new_model.wv[new_model.wv.vocab]  # vectors of all vocabulary words (gensim < 4.0; on gensim >= 4.0 use new_model.wv.key_to_index)

def build_word2vec_from_text(model_w2v, sentence, emb_size):
    emb_vec = np.zeros(emb_size).reshape((1, emb_size))
    count = 0.
    for word in sentence:
        try:
            emb_vec += model_w2v.wv[word].reshape((1, emb_size))  # word vectors live on the .wv attribute
            count += 1.
        except KeyError:
            continue
    if count != 0:
        emb_vec /= count
    return emb_vec

def scalar_2vec(data_frame, column_names):
    new_frame = data_frame.loc[:, column_names]
    return new_frame

X_train = np.concatenate([build_word2vec_from_text(new_model, d, emb_size) for d in X_train]) #model_w2v
newdf = pd.concat([df2, pd.DataFrame(X_train)],  axis=1)
newdf["vect"] = newdf[list(range(emb_size))].values.tolist()
selected_columns = ["url_clean","title","description","vect"]
newdf = scalar_2vec(newdf, selected_columns)
newdf.head()
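
The document vectors built in this cell are plain averages of word vectors. The following dependency-free sketch shows the same idea on a toy vocabulary: tokens without a vector are skipped, and a document with no known token gets a zero vector. The 2-d embeddings and the function name are hypothetical.

```python
def average_vector(tokens, vectors, size):
    """Mean of the vectors of known tokens; a zero vector if none is known."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary token
        total = [a + b for a, b in zip(total, vec)]
        known += 1
    return [a / known for a in total] if known else total

toy_vectors = {"meteo": [1.0, 0.0], "bordeaux": [0.0, 1.0]}  # hypothetical 2-d embeddings
doc_vec = average_vector(["meteo", "bordeaux", "inconnu"], toy_vectors, size=2)
print(doc_vec)  # [0.5, 0.5]
```

Averaging discards word order and gives every word the same weight, which may be one reason why the clusters are refined with TF-IDF later in the notebook.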

In [None]:
# Standardizing vectors
from sklearn import preprocessing

xvect = []
for i in range(0,len(newdf["vect"])):
    xvect.append(np.array(newdf["vect"].values[i]))  # use row i, not row 0

"""
## tried to standardize, normalize and scale vectors, but no improvements
standardized_vect = preprocessing.scale(xvect)
normalized_vect = preprocessing.normalize(xvect, norm='l2')
scaled_vect = preprocessing.scale(xvect)
newdf["standard_vect"] = list(standardized_vect)
newdf["normal_vect"] = list(normalized_vect)
newdf["scaled_vect"] = list(scaled_vect)
"""

In [None]:
# Finding optimum number of clusters
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

NBR_CLUSTER_TEST = 14
x = []
for j in range(0,len(newdf)):
    x.append(newdf["vect"][j])
wcss = [] #Within Cluster Sum of Squares

for i in range(1, NBR_CLUSTER_TEST):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
    
plt.plot(range(1, NBR_CLUSTER_TEST), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') 
plt.show()

In [None]:
#Applying kmeans to the dataset / Creating the kmeans classifier
BEST_CLUSTER_NBR = 10
kmeans = KMeans(n_clusters = BEST_CLUSTER_NBR, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)
newdf["kmeans_class"] = y_kmeans
newdf.groupby('kmeans_class').count()

In [None]:
#TF-IDF Analysis
#We want to plot the TF-IDF values of the description words for each cluster
from sklearn import linear_model
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import seaborn as sns
import gensim
from gensim import corpora
import pickle

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(token)
    return result

#TfidfVectorizer: Converts a collection of raw documents to a matrix of TF-IDF features.
#min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.
#max_df: When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold.
#Fit the vectorizer once on the full dataset to create normalized vectors (it does not depend on the cluster)
tfidf_vectorizer = TfidfVectorizer(min_df=3, max_df = 0.95, sublinear_tf=True, use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(newdf["description"].values)

#tfidf_vectorizer.get_feature_names(): Array mapping from feature integer indices to feature name
#(scikit-learn >= 1.0 provides get_feature_names_out() instead; get_feature_names() was removed in 1.2)
features = tfidf_vectorizer.get_feature_names()

CLUSTER_NUM = BEST_CLUSTER_NBR  # loop over every cluster found by k-means in the previous cell
for i in range(0,CLUSTER_NUM):
    #Get the rows that belong to cluster i
    row = newdf.loc[newdf['kmeans_class']==i].index.tolist()
    
    """
    #Create a series from the sparse matrix
    word2_matrix = pd.Series(tfidf_matrix.getrow(row).toarray().flatten(),index = features).sort_values(ascending=False)
    """
    
    word2_matrix = pd.Series(dtype=float)  # explicit dtype avoids the empty-Series warning
    for rows in row:
        word2_matrix = pd.concat([word2_matrix, pd.Series(tfidf_matrix.getrow(rows).toarray().flatten(),index = features).sort_values(ascending=False)])
    word2_matrix = word2_matrix.sort_index(axis=0)
    
    tf_idf_plot = word2_matrix[:30].plot(kind='bar', title='Article Word TF-IDF Values',
                figsize=(10,6), alpha=1, fontsize=14, rot=80,edgecolor='black', linewidth=2 )
    tf_idf_plot.title.set_size(18)
    tf_idf_plot.set_xlabel('WORDS')
    tf_idf_plot.set_ylabel('TF-IDF')
    
    # Latent DIrichlet Allocation Model
    processed_desc = word2_matrix[0:1000].index.map(preprocess)
    dictionary = corpora.Dictionary(processed_desc)
    corpus = [dictionary.doc2bow(text) for text in processed_desc]

    pickle.dump(corpus, open('corpus.pkl', 'wb'))
    #dictionary.save('dictionary.gensim')

    try:
        ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 1, id2word=dictionary, passes=15)
        #ldamodel.save(f"model{NUM_TOPICS}.gensim")
        for idx, topic in ldamodel.print_topics(-1):
            print('dataframe: {} \n topic: {} \nWords: {}'.format(i, idx, topic))
    except Exception as err:
        print(f"LDA FAILED FOR CLUSTER: {i} ({err})")