# LDA - Latent Dirichlet Allocation (Clustering for NLP)

**LDA is a topic modeling technique that uncovers hidden themes, or "topics", in a large corpus of text. It is a generative probabilistic model that treats each document as a mixture of topics in certain proportions, and each topic as a probability distribution over words.**

**In practice, LDA helps organize and understand large collections of documents by grouping texts based on their thematic similarities. Each topic discovered by LDA can be represented by a set of words that are frequently associated with it, allowing users to grasp the main content of a text or corpus without requiring exhaustive manual reading.**

**This model is particularly useful in areas such as document classification, article recommendation, and even for improving search systems by providing insights into the latent structure of text data.**
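
The generative view described above can be made concrete with a small simulation. The sketch below is only an illustration of that story, not how scikit-learn fits the model: the two toy topics, the six-word vocabulary, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical, and it uses only the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, doc_alpha, n_words, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's topic proportions.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocabulary = ["election", "senate", "vote", "album", "guitar", "concert"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "politics"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "music"-like topic
]

rng = random.Random(42)
theta, words = generate_document(topic_word_probs, vocabulary, [0.5, 0.5], 8, rng)
print("topic proportions:", [round(p, 2) for p in theta])
print("generated words:", words)
```

Fitting LDA goes in the opposite direction: given only the words of each document, it estimates the topic proportions and the per-topic word distributions that could plausibly have produced them.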

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


data = pd.read_csv('/Users/nathan/Desktop/UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv')
data.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [2]:
cv = CountVectorizer(max_df = 0.9,            # ignore words that appear in more than 90% of the documents
                     min_df = 2,              # keep only words that appear in at least two documents
                     stop_words = 'english')  # drop common English stop words

In [3]:
dtm = cv.fit_transform(data.Article)
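
To see what the `max_df` and `min_df` thresholds do to the vocabulary, here is a minimal pure-Python sketch of document-frequency filtering. It is a simplified illustration, not scikit-learn's implementation: the helper `filter_vocabulary` and the four toy documents are hypothetical, tokenization is a plain whitespace split, and stop-word removal is left out.

```python
from collections import Counter

def filter_vocabulary(documents, min_df=2, max_df=0.9):
    """Keep tokens that appear in at least `min_df` documents
    and in at most a `max_df` fraction of all documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # set() so that a token is counted once per document
        doc_freq.update(set(text.lower().split()))
    return sorted(
        token for token, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

docs = [
    "the senate vote",
    "the senate hearing",
    "the album release",
    "the concert album",
]
kept = filter_vocabulary(docs)
print(kept)  # ['album', 'senate']
```

Here "the" is dropped because it appears in all four documents (more than 90%), and words such as "vote" or "concert" are dropped because they appear in only one.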

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components = 7,    # number of topics to extract
                                random_state = 42)

LDA.fit(dtm)

In [None]:
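top_ten_words = LDA.components_[0].argsort()[-10:]   # indices of the 10 highest-weighted words, in ascending order

feature_names = cv.get_feature_names_out()   # fetch the vocabulary once instead of on every iteration
for index in top_ten_words:
    print(feature_names[index])  # the 10 most probable words of the first topic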

In [None]:
feature_names = cv.get_feature_names_out()

for i, topic_weights in enumerate(LDA.components_):
    top_five_words = topic_weights.argsort()[-5:]   # indices of the 5 highest-weighted words
    print(f'The top 5 words for Topic {i}:')
    print([feature_names[index] for index in top_five_words])
    print()

In [None]:
topics = LDA.transform(dtm)

In [None]:
# For each article, keep the topic with the highest probability.
# No rounding before argmax: rounding can create artificial ties between topics.
data['Topics'] = topics.argmax(axis = 1)
data.head()

In [None]:
import numpy as np

lst_random = [1, 4, 5, 6, 2, 3]
np.array(lst_random).argsort()
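
To make the meaning of those indices explicit, here is a small pure-Python equivalent of `argsort` on the same list. The helper below is a hypothetical stand-in written for illustration, not NumPy's implementation; it shows why slicing the last `k` entries yields the positions of the `k` largest values.

```python
def argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])

weights = [1, 4, 5, 6, 2, 3]
order = argsort(weights)
print(order)            # [0, 4, 5, 1, 2, 3]

top_3 = order[-3:]      # indices of the 3 largest values, smallest first
print(top_3)            # [1, 2, 3]
print(top_3[::-1])      # [3, 2, 1], largest first
```

This is the same pattern used on `LDA.components_`: the last entries of the `argsort` result point at the words with the highest weights in a topic.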

In [None]:
# For each topic, collect the indices of its 5 highest-weighted words
top_5_words = []

for i in range(LDA.n_components):
    top_5_words.append(list(LDA.components_[i].argsort()[-5:]))

top_5_words

In [None]:
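feature_names = cv.get_feature_names_out()
dict_final = {}

# Map each topic number to the list of its 5 highest-weighted words
for i, indices in enumerate(top_5_words):
    dict_final[i] = list(feature_names[indices])
dict_final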

In [None]:
data['Subjects'] = data.Topics.apply(lambda x: dict_final.get(x))

In [None]:
data.head()