# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These topics can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing (pLSI) model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
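
The two distributions above can be read as a recipe for generating a document. The following is a minimal sketch of that reading, not part of the notebook's pipeline: the topic names, the probabilities, and the `generate_document` helper are all made up for illustration.

```python
import random

# Hypothetical topic -> word distributions (each inner dict sums to 1)
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"launch": 0.4, "moon": 0.4, "earth": 0.2},
}

# Hypothetical document -> topic distribution (sums to 1)
doc_topic = {"autos": 0.7, "space": 0.3}


def generate_document(doc_topic, topic_word, n_words, seed=0):
    """Draw a topic for each word position, then draw a word from that topic."""
    rng = random.Random(seed)
    topics = list(doc_topic)
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's topic distribution
        topic = rng.choices(topics, weights=[doc_topic[t] for t in topics])[0]
        # Pick a word according to that topic's word distribution
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        words.append(word)
    return words


print(generate_document(doc_topic, topic_word, n_words=8))
```

Training an LDA model runs this recipe in reverse: given only the words, it estimates the two distributions that most plausibly produced them.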

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
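
The reassignment step can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not gensim's implementation (gensim uses online variational Bayes rather than this sampling scheme): the `topic_weights` helper, the `smoothing` constant, and the toy counts are hypothetical.

```python
import random
from collections import Counter


def topic_weights(word, doc_assignments, topic_word_counts, vocab_size, smoothing=0.1):
    """Return unnormalised P(T|D) * P(W|T) for every topic T.

    doc_assignments: topic ids currently assigned to the other words in the document.
    topic_word_counts: one Counter per topic, mapping word -> count over all documents.
    smoothing: small constant (an assumption of this sketch) that keeps weights non-zero.
    """
    num_topics = len(topic_word_counts)
    doc_counts = Counter(doc_assignments)
    weights = []
    for t in range(num_topics):
        # Proportion of words in the document assigned to topic t
        p_t_given_d = (doc_counts[t] + smoothing) / (
            len(doc_assignments) + num_topics * smoothing
        )
        # Proportion of topic t's assignments that come from this word
        topic_total = sum(topic_word_counts[t].values())
        p_w_given_t = (topic_word_counts[t][word] + smoothing) / (
            topic_total + vocab_size * smoothing
        )
        weights.append(p_t_given_d * p_w_given_t)
    return weights


# Toy state: topic 0 is about cars, topic 1 about space
topic_word_counts = [
    Counter({"car": 8, "engine": 4}),
    Counter({"moon": 6, "launch": 5}),
]
doc_assignments = [0, 0, 1]  # topics of the other words in this document

weights = topic_weights("car", doc_assignments, topic_word_counts, vocab_size=4)

# Sample the new topic for "car" in proportion to the weights
rng = random.Random(0)
new_topic = rng.choices(list(range(len(weights))), weights=weights)[0]
print(weights, new_topic)
```

Because "car" is common in topic 0 and the document already leans towards topic 0, the weight for topic 0 is much larger than the weight for topic 1.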

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
# ! pip install pyLDAvis gensim spacy

### Import the libraries

In [2]:
import pandas as pd
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams
import string
import re
import matplotlib.pyplot as plt

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [3]:
import pandas as pd
import requests

# URL of the dataset
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"

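# Send a GET request to the URL and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# Save the downloaded content to a local file
with open("newsgroups.json", "wb") as f:
    f.write(response.content)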

### Load the dataset

In [4]:
# Replace 'newsgroups.json' with the actual filename if you used a different name
json_file = "newsgroups.json"

# Load the JSON file into a DataFrame
df = pd.read_json(json_file)

# Display the first few rows of the DataFrame
df.head()

| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |


## Preprocess the data

### Email Removal

In [5]:
# Define functions for preprocessing
def remove_emails(text):
    # Use regex to remove emails
    return re.sub(r'\S+@\S+', '', text)

In [6]:
df['content'] = df['content'].apply(remove_emails)
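
To see what the pattern actually removes, here is a small check on made-up strings (the sample addresses are hypothetical). Note that `\S+@\S+` drops the whole whitespace-delimited token containing `@`, including any attached brackets or punctuation, and leaves the surrounding spaces in place.

```python
import re


def remove_emails(text):
    # Remove every whitespace-delimited token that contains an "@"
    return re.sub(r'\S+@\S+', '', text)


header = remove_emails("From: someone@example.com (A User) Subject: test")
bracketed = remove_emails("Contact <user@host.org>, thanks")

print(repr(header))
print(repr(bracketed))
```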

### Newline Removal

In [7]:
def remove_newlines(text):
    # Remove newline characters
    return text.replace('\n', ' ')

In [8]:
df['content'] = df['content'].apply(remove_newlines)

### Single Quotes Removal

In [9]:
def remove_single_quotes(text):
    # Remove single quotes
    return text.replace("'", '')

In [10]:
df['content'] = df['content'].apply(remove_single_quotes)

### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [11]:
from gensim.utils import simple_preprocess

# Define the generator function
def sent_to_words(sentences):
    for sentence in sentences:
        # Use gensim.utils.simple_preprocess to tokenize the sentence
        yield simple_preprocess(sentence, deacc=True)  # deacc=True strips accent marks; tokenization itself drops punctuation

# Example usage:
text_data = df['content'].values  # Assuming 'content' is the column name in your DataFrame

# Tokenize the text using the sent_to_words generator
tokenized_data = list(sent_to_words(text_data))

# Display the tokenized data for the first few rows
print("Tokenized Data:")
print(tokenized_data[:5])

Tokenized Data:


### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [12]:
from gensim.parsing.preprocessing import STOPWORDS

# Define additional stop words
additional_stop_words = {'from', 'subject', 're', 'edu', 'use'}

# Extend the stop words corpus
stop_words = STOPWORDS.union(additional_stop_words)

#### remove_stopwords( )

In [13]:
def remove_stopwords(texts):
    # Remove stop words from each list of tokens
    return [[word for word in doc if word not in stop_words] for doc in texts]

# Example usage:
tokenized_data_without_stopwords = remove_stopwords(tokenized_data)

# Display the tokenized data without stop words for the first few rows
print("Tokenized Data without Stopwords:")
print(tokenized_data_without_stopwords[:5])


Tokenized Data without Stopwords:


### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold
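
A rough sketch of what the threshold means, assuming gensim's default scoring formula `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. The `bigram_score` helper and all counts below are hypothetical, chosen only to show the arithmetic.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate bigram; higher means the pair co-occurs more than chance."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


THRESHOLD = 100  # same value as passed to Phrases in this notebook

# Two rare words that almost always appear together
rare_pair = bigram_score(count_a=40, count_b=50, count_ab=35, vocab_size=10_000)

# Two very frequent words that often appear together by chance
common_pair = bigram_score(count_a=5_000, count_b=8_000, count_ab=900, vocab_size=10_000)

print(rare_pair, rare_pair > THRESHOLD)
print(common_pair, common_pair > THRESHOLD)
```

Only pairs scoring above the threshold are joined into a single token such as `front_bumper`, so a higher threshold yields fewer bigrams.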

In [14]:
from gensim.models import Phrases

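# Create bigrams using Phrases
# Note: this cell works on tokenized_data, so the stop-word filtering above is not
# applied; pass tokenized_data_without_stopwords instead to use it
bigram = Phrases(tokenized_data, threshold=100)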

# Apply the bigram model to the tokenized data
data_words_bigrams = list(bigram[tokenized_data])

# Display the tokenized data with bigrams for the first few rows
print("data_words_bigrams:")
print(data_words_bigrams[:5])


data_words_bigrams:


#### make_bigrams( )

In [15]:
from gensim.models import Phrases

def make_bigrams(texts):
    # Create bigrams using Phrases
    bigram = Phrases(texts, threshold=100)
    
    # Apply the bigram model to the tokenized data
    return list(bigram[texts])

# Example usage:
tokenized_data_with_bigrams = make_bigrams(tokenized_data)

# Display the tokenized data with bigrams for the first few rows
print("Tokenized Data with Bigrams:")
print(tokenized_data_with_bigrams[:5])

Tokenized Data with Bigrams:


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [24]:
# ! python -m spacy download en_core_web_sm

In [21]:
import spacy

# Load the spaCy model with specified components disabled
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

#### lemmatization( )

In [22]:
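def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags.

    POS tag reference: https://spacy.io/api/annotation
    """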
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [23]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [25]:
print(data_lemmatized[:1])

[['s', 'thing', 'subject', 'car', 'nntp_poste', 'host', 'rac_wam', 'organization', 'park', 'line', 'wonder', 'out', 'there', 'enlighten', 'car', 'see', 'other', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [26]:
from gensim.corpora import Dictionary

# Create a Dictionary
id2word = Dictionary(data_lemmatized)

### Create Corpus

In [27]:
# Create a Corpus
corpus = [id2word.doc2bow(text) for text in data_lemmatized]
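
The bag-of-words format produced by `doc2bow` is a list of `(token_id, count)` pairs per document. The following plain-Python sketch mimics that format on toy documents; `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will not match the ids gensim assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


toy_docs = [["car", "door", "car"], ["door", "engine"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```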

### Filter low-frequency words

In [28]:
# Filter low-frequency words (you can adjust the threshold as needed)
id2word.filter_extremes(no_below=5, no_above=0.9)

# Update the Corpus after filtering
corpus = [id2word.doc2bow(text) for text in data_lemmatized]
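
The two thresholds act on document frequency: `no_below=5` drops tokens that appear in fewer than 5 documents, and `no_above=0.9` drops tokens that appear in more than 90% of documents. A minimal sketch of that rule on toy data follows; `filter_by_doc_frequency` is a hypothetical helper, and it ignores the separate `keep_n` cap that `filter_extremes` also applies.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most no_above * len(docs) docs."""
    n_docs = len(docs)
    # Count each token once per document it appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {
        token
        for token, freq in doc_freq.items()
        if no_below <= freq <= no_above * n_docs
    }
    return [[token for token in doc if token in keep] for doc in docs]


toy_docs = [
    ["car", "the"],
    ["moon", "the"],
    ["car", "the", "rare"],
    ["car", "the"],
]

# "the" is in every document (too common); "moon" and "rare" are in one (too rare)
filtered = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.9)
print(filtered)
```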

### Create an Index-to-Word dictionary

In [29]:
# Create an Index-to-Word dictionary (id2word.items() yields (id, word) pairs)
id2word_dict = dict(id2word.items())

# Example usage:
# Display the first few entries in the Index-to-Word dictionary
print("Index-to-Word Dictionary:")
print({k: id2word_dict[k] for k in list(id2word_dict)[:5]})

Index-to-Word Dictionary:
{0: 'addition', 1: 'body', 2: 'bring', 3: 'call', 4: 'car'}


In [30]:
# Display the first few entries in the Corpus
print("\nCorpus:")
print(corpus[:5])



Corpus:
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)], [(3, 2), (5, 2), (13, 1), (23, 1), (38, 1), (39, 1), (44, 1), (45, 1), (46, 2), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 5), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 3), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 3), (82, 1), (83, 1), (84, 2), (85, 1), (86, 1), (87, 1), (88, 3), (89, 1)], [(5, 2), (14, 2), (15, 1), (17, 2), (19, 1), (24, 1), (26, 1), (30, 2), (32, 1), (38, 1), (39, 1), (40, 1), (42, 1), (47, 1), (62, 1),

### Build a News Topic Model

#### LdaModel
- **num_topics** : the number of topics, which you need to define beforehand
- **chunksize** : the number of documents to be used in each training chunk
- **alpha** : the prior on the per-document topic distributions, which affects how sparse each document's topic mixture is
- **passes** : the total number of training passes over the corpus

In [31]:
from gensim.models import LdaModel

# Set the parameters
num_topics = 10  # You can adjust the number of topics based on your requirements
chunksize = 100  # Number of documents to be used in each training chunk
alpha = 'auto'   # Learn an asymmetric document-topic prior from the corpus
passes = 10      # Total number of training passes

# Build the LdaModel
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=num_topics,
                     chunksize=chunksize,
                     alpha=alpha,
                     passes=passes)

### Print the keywords in the 10 topics

In [32]:
# Display the topics with their keywords
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.537*"ax" + 0.036*"max" + 0.016*"choose" + 0.013*"brain" + 0.012*"encrypt" + 0.012*"notice" + 0.009*"clipper_chip" + 0.008*"rsa" + 0.008*"keyboard" + 0.008*"announcement"')
(1, '0.026*"write" + 0.025*"subject" + 0.020*"organization" + 0.019*"article" + 0.019*"get" + 0.015*"know" + 0.013*"just" + 0.013*"go" + 0.013*"nntp_poste" + 0.011*"so"')
(2, '0.030*"drive" + 0.019*"car" + 0.015*"buy" + 0.014*"new" + 0.012*"price" + 0.012*"sale" + 0.012*"sell" + 0.009*"cost" + 0.008*"cheap" + 0.008*"bike"')
(3, '0.022*"say" + 0.015*"people" + 0.014*"evidence" + 0.013*"believe" + 0.010*"man" + 0.009*"gun" + 0.008*"faith" + 0.008*"life" + 0.008*"come" + 0.008*"claim"')
(4, '0.041*"space" + 0.023*"science" + 0.016*"research" + 0.016*"earth" + 0.015*"point" + 0.014*"sphere" + 0.012*"self" + 0.011*"launch" + 0.009*"moon" + 0.009*"theory"')
(5, '0.034*"year" + 0.033*"team" + 0.033*"game" + 0.027*"win" + 0.023*"play" + 0.016*"player" + 0.010*"first" + 0.010*"fan" + 0.010*"last" + 0.009*"run"')
(6, '0

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**. Lower perplexity indicates a better fit.

In [34]:
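# log_perplexity returns the per-word likelihood bound (a log-scale value), not the
# perplexity itself; the perplexity estimate is 2 ** (-bound)
perplexity = lda_model.log_perplexity(corpus)
print(f'Model Perplexity: {perplexity}')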

Model Perplexity: -7.894254729132347
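
The value printed above is negative because `log_perplexity` reports a per-word likelihood bound rather than the perplexity itself. A short sketch of the conversion, assuming the bound is in base 2 (which matches the `2 ** (-bound)` estimate that gensim writes to its log):

```python
# Per-word likelihood bound copied from the output above
bound = -7.894254729132347

# Convert the log-scale bound into a perplexity estimate
perplexity_estimate = 2 ** (-bound)

print(f"Perplexity estimate: {perplexity_estimate:.1f}")
```

A bound closer to zero corresponds to a lower perplexity, so when comparing models on the same corpus the higher (less negative) bound is the better one.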


### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in that topic.

In [33]:
from gensim.models import CoherenceModel

# Calculate coherence score
coherence_model = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_score = coherence_model.get_coherence()

print(f'Topic Coherence Score: {coherence_score}')

Topic Coherence Score: 0.5384697421569475
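
To build intuition for what a coherence score rewards, here is a plain-Python sketch of a simpler, UMass-style measure based on document co-occurrence. It is not the `c_v` measure used above, which is more involved; the `umass_coherence` helper and the toy documents are hypothetical.

```python
import math


def umass_coherence(top_words, docs, eps=1.0):
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over ordered pairs of top words.

    D(...) counts the documents containing all of the given words.
    Pairs whose conditioning word never occurs are skipped.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(word in doc for word in words) for doc in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            denominator = doc_count(top_words[j])
            if denominator == 0:
                continue
            joint = doc_count(top_words[i], top_words[j])
            scores.append(math.log((joint + eps) / denominator))
    return sum(scores) / len(scores) if scores else float("nan")


toy_docs = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["moon", "launch"],
    ["moon", "earth", "launch"],
]

coherent_topic = umass_coherence(["car", "engine"], toy_docs)
mixed_topic = umass_coherence(["car", "moon"], toy_docs)
print(coherent_topic, mixed_topic)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the behaviour a coherence measure is meant to capture.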


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [36]:
# ! pip install pyldavis

In [37]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Create the pyLDAvis visualization
lda_visualization = gensimvis.prepare(lda_model, corpus, id2word)

# Display the visualization in the notebook
pyLDAvis.display(lda_visualization)
