# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical techniques such as matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** where each **document is assumed to have a combination of topics** similar to a probabilistic latent semantic indexing model

In simple words, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
4. Repeat steps 2 and 3 until the assignments stabilize
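The reassignment rule above can be sketched in plain Python. This is a toy illustration only, not the implementation gensim uses. The function name `toy_lda` and the smoothing constants `alpha` and `beta` are assumptions added here so that no topic ever gets zero probability.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy topic reassignment loop; alpha and beta are assumed smoothing constants."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(z) for z in assignments]        # counts behind P(T | D)
    topic_word = [Counter() for _ in range(num_topics)]  # counts behind P(W | T)
    topic_total = [0] * num_topics
    for doc, z in zip(docs, assignments):
        for w, t in zip(doc, z):
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment so the word does not vote for itself.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Step 2: weight of topic t is proportional to P(T | D) * P(W | T).
                # The P(T | D) denominator is the same for every topic, so it is omitted.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]

                # Step 3: sample the new topic in proportion to those weights.
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, doc_topic, topic_word


docs = [
    ["car", "engine", "car", "speed"],
    ["file", "window", "program", "file"],
    ["car", "speed", "engine"],
    ["window", "file", "program"],
]
assignments, doc_topic, topic_word = toy_lda(docs, num_topics=2)
for t, counts in enumerate(topic_word):
    print("Topic", t, counts.most_common(3))
```

On a corpus this small the result depends on the random seed, so the printed topics are only indicative.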

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [2]:
! pip install pyLDAvis gensim spacy
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mayur\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

### Import the libraries

In [63]:
import json
import re
import gensim
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models.phrases import Phrases, Phraser
import spacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.3 and later

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

### Load the dataset

In [6]:
# Load the JSON file into a dictionary; the file is closed automatically
with open('newsgroups.json') as f:
    data = json.load(f)

In [7]:
content = data['content']
target = data['target']
target_names = data['target_names']

### Preprocess the data

### Email Removal

In [8]:
# Define a regular expression pattern to match email addresses
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

for i in range(len(content)):
    content[str(i)] = re.sub(pattern, "", content[str(i)])

### Newline Removal

In [9]:
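# Note: replacing "\n" with "" glues together the words on either side of a line break
# (e.g. 'yearsof' in the lemmatized output further down); replacing with " " would keep them apart
for i in range(len(content)):
    content[str(i)] = content[str(i)].replace("\n", "")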

### Single Quotes Removal

In [10]:
for i in range(len(content)):
    content[str(i)] = content[str(i)].replace("'", "")

### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [11]:
def sent_to_words(sentences):
    for sentence in sentences.values():
        yield gensim.utils.simple_preprocess(re.sub(r'\s+', ' ', sentence))

In [12]:
content = list(sent_to_words(content))

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [4]:
en_stop = set(stopwords.words('english'))
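The cell above loads only NLTK's default English list. A minimal sketch of adding the extra words listed above could look like the following. Here `base_stop` is a tiny stand-in for `set(stopwords.words('english'))` so the snippet runs without the NLTK download, and `en_stop_extended` is a hypothetical name chosen for this example. The outputs shown in the rest of this notebook were produced with the default list.

```python
# Stand-in for set(stopwords.words('english')); a tiny sample, not the real list.
base_stop = {"the", "a", "is", "of", "and"}

# Extra words named in the list above.
extra_stop = ["from", "subject", "re", "edu", "use"]

# Copy the base set, then add the extra words.
en_stop_extended = set(base_stop)
en_stop_extended.update(extra_stop)

tokens = ["from", "subject", "the", "car", "is", "fast"]
filtered = [token for token in tokens if token not in en_stop_extended]
print(filtered)  # ['car', 'fast']
```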

#### remove_stopwords( )

In [18]:
def remove_stopwords(texts):
    # remove stop words from tokens
    stopped_tokens = [token for token in texts if token not in en_stop]
    return stopped_tokens

In [22]:
stopped_tokens = [remove_stopwords(texts) for texts in content]

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

#### make_bigrams( )

In [27]:
def make_bigrams(texts):
    bigram = Phrases(texts, threshold=100)
    bigram_phraser = Phraser(bigram)
    bigram_tokens = [bigram_phraser[text] for text in texts]
    return bigram_tokens

In [30]:
data_words_bigrams = make_bigrams(stopped_tokens)

### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [20]:
#! python -m spacy download en_core_web_sm


In [36]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

#### lemmatization( )

In [37]:
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [38]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [39]:
print(data_lemmatized[:1])

[['s', 'thing', 'subject', 'car', 'nntp_poste', 'host', 'umd_eduorganization', 'university', 'parkline', 'enlighten', 'car', 'day', 'door', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'yearsof', 'production', 'car', 'make', 'history', 'info', 'youhave', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [42]:
dictionary = Dictionary(data_lemmatized)

### Filter low-frequency words

In [43]:
dictionary.filter_extremes(no_below=10, no_above=0.5)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]
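To make the two steps above concrete, here is a plain-Python approximation of document-frequency filtering and bag-of-words conversion. The helpers `build_vocab` and `to_bow` are hypothetical names for this sketch. gensim's `Dictionary` assigns ids in its own way and applies further limits such as `keep_n`, so treat this as an illustration of the idea rather than a replica.

```python
from collections import Counter


def build_vocab(docs, no_below=1, no_above=1.0):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (fraction) of docs."""
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


docs = [
    ["car", "engine", "car"],
    ["car", "door"],
    ["engine", "spec"],
    ["car", "engine"],
]
token2id = build_vocab(docs, no_below=2, no_above=0.75)
print(token2id)                   # {'car': 0, 'engine': 1}
print(to_bow(docs[0], token2id))  # [(0, 2), (1, 1)]
```

Rare tokens ("door", "spec") are dropped because they appear in fewer than two documents. Lowering `no_above` would drop very common tokens as well.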

### Split the Corpus into Train and Test Sets

In [57]:
train_corpus = corpus[:8500]
test_corpus = corpus[8500:]

### Create Index 2 word dictionary

In [44]:
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

### Build a News Topic Model

#### LdaModel
- **num_topics** : the number of topics, which you need to define beforehand
- **chunksize** : the number of documents to be used in each training chunk
- **alpha** : the hyperparameter that affects the sparsity of the document-topic distributions
- **passes** : the total number of passes through the corpus during training

In [52]:
ldamodel = LdaModel(train_corpus, num_topics=15, id2word = id2word, passes=20)

### Print the Keywords of the First 10 Topics

In [53]:
for idx in range(10):
    print("Topic #%s:" % idx, ldamodel.print_topic(idx, 10))

Topic #0: 0.017*"get" + 0.012*"good" + 0.011*"new" + 0.009*"buy" + 0.009*"bike" + 0.009*"article" + 0.008*"price" + 0.008*"look" + 0.008*"m" + 0.007*"car"
Topic #1: 0.087*"nntp_poste" + 0.082*"host" + 0.028*"article" + 0.025*"line" + 0.024*"organization" + 0.022*"university" + 0.017*"know" + 0.012*"thank" + 0.011*"posting_host" + 0.010*"post"
Topic #2: 0.015*"say" + 0.011*"think" + 0.009*"believe" + 0.008*"know" + 0.007*"question" + 0.007*"mean" + 0.007*"word" + 0.007*"make" + 0.006*"claim" + 0.006*"thing"
Topic #3: 0.018*"car" + 0.014*"use" + 0.009*"time" + 0.008*"article" + 0.008*"good" + 0.008*"much" + 0.008*"well" + 0.008*"get" + 0.007*"go" + 0.007*"speed"
Topic #4: 0.017*"people" + 0.010*"say" + 0.009*"article" + 0.009*"make" + 0.009*"think" + 0.008*"get" + 0.008*"right" + 0.007*"go" + 0.007*"know" + 0.007*"take"
Topic #5: 0.025*"use" + 0.017*"window" + 0.014*"file" + 0.012*"program" + 0.010*"get" + 0.009*"image" + 0.009*"run" + 0.008*"version" + 0.008*"available" + 0.008*"server"

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**

In [58]:
# Estimate the perplexity on the held-out test set:
# log_perplexity returns the per-word likelihood bound, and perplexity is 2^(-bound)
log_perplexity = ldamodel.log_perplexity(test_corpus)
perplexity = 2**(-log_perplexity)
print('Perplexity:', perplexity)

Perplexity: 179.04119871276973
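The relationship between a per-word log-likelihood and perplexity can be checked on a toy example. The helper below is a sketch that assumes the model's probability for each observed word is already known. It does not use gensim and is not how `log_perplexity` computes its bound.

```python
import math


def perplexity(word_probs):
    """Perplexity = 2 ** (-average log2 probability per word)."""
    per_word_bound = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-per_word_bound)


# A model that spreads its probability evenly over 4 words is "as confused as"
# a uniform choice among 4 options; a more confident model scores lower.
uniform = perplexity([0.25] * 8)
confident = perplexity([0.5] * 8)
print(uniform, confident)  # 4.0 2.0
```

Lower perplexity means the model assigns higher probability to the held-out words.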


### Topic Coherence
Topic Coherence measures score a single topic by measuring the **degree of semantic similarity** between **high scoring words** in the topic.

In [61]:
# Compute the coherence score using the u_mass coherence measure
coherence_model = CoherenceModel(model=ldamodel, corpus=test_corpus, dictionary=dictionary, coherence='u_mass')
coherence_score = coherence_model.get_coherence()

print('Coherence Score:', coherence_score)

Coherence Score: -2.923625446543384
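As a rough illustration of what a UMass-style score measures, the sketch below counts how often pairs of a topic's top words appear in the same document. The function `umass_coherence` is a hypothetical helper, and the smoothing constant of 1 follows the commonly cited form of the measure. gensim's implementation may smooth and aggregate differently, so these numbers are not comparable to the score above.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    # j < i, so w_j is the higher-ranked word of each pair.
    for j, i in combinations(range(len(top_words)), 2):
        w_j, w_i = top_words[j], top_words[i]
        scores.append(math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j)))
    return sum(scores) / len(scores)


docs = [
    ["car", "engine", "speed"],
    ["car", "engine"],
    ["file", "window"],
    ["car", "file"],
]
related = umass_coherence(["car", "engine"], docs)
unrelated = umass_coherence(["car", "window"], docs)
print(related, unrelated)
```

Words that tend to share documents ("car", "engine") score closer to zero than words that never co-occur ("car", "window"), which is why scores nearer zero are read as more coherent.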


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [64]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)



## LDA vs LSI

LSI is a technique that represents documents and queries as vectors in a high-dimensional space and then performs a matrix decomposition to find the most important dimensions or concepts that underlie the documents. This allows for efficient retrieval of relevant documents to a query.

LDA, on the other hand, is a probabilistic model that assumes that each document is a mixture of topics, and each topic is a distribution over words. LDA discovers these topics by iteratively assigning each word in a document to a topic and updating the distribution of topics based on the observed assignments.

The main difference between LDA and LSI is that LDA is a generative model that seeks to explain the observed data, while LSI is a purely descriptive model that captures the latent structure of the data. In other words, LDA tries to explain how the data was generated, while LSI simply describes the relationships between the data.