# News Modeling

Topic modeling **extracts features from document terms** and uses mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing (pLSI) model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
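
To make the two distributions concrete, here is a minimal sketch of LDA's generative view using only the standard library. The topic names, vocabularies and probabilities are invented for illustration and are not taken from the newsgroups data; `generate_document` is a hypothetical helper.

```python
import random

# Hypothetical topics: each one is a distribution over words
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.4, "shuttle": 0.4, "launch": 0.2},
}

# Hypothetical document: a distribution over topics
doc_topic = {"autos": 0.7, "space": 0.3}


def generate_document(doc_topic, topic_word, n_words, rng):
    """Draw a topic for every word position, then draw the word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        dist = topic_word[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)
document = generate_document(doc_topic, topic_word, n_words=8, rng=rng)
print(document)
```

Training an LDA model runs this story in reverse: only the words are observed, and both distributions have to be inferred.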

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**: the proportion of words in D currently assigned to topic T
    - **P(W | T)**: the proportion of assignments to topic T, across all documents, that come from the word W
3. **Reassign word W to a topic T** with probability P(T | D) × P(W | T), taking all other words and their topic assignments into account
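
The reassignment loop above can be sketched in pure Python as a collapsed Gibbs sampler. This is a minimal illustration under stated assumptions, not the procedure gensim runs (gensim's `LdaModel` uses online variational Bayes). The function name `gibbs_lda`, the smoothing constants `alpha` and `beta`, and the toy documents are all assumptions introduced here.

```python
import random
from collections import Counter, defaultdict


def gibbs_lda(docs, num_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; alpha and beta are assumed smoothing constants."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: give every word occurrence a random topic
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zs) for zs in assignments]   # topic counts per document
    topic_word = defaultdict(Counter)                 # word counts per topic
    topic_total = Counter()                           # total words per topic
    for doc, zs in zip(docs, assignments):
        for w, z in zip(doc, zs):
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Steps 2 and 3: weight each topic by P(T | D) * P(W | T), then resample
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Put the word back under its new topic
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return assignments, doc_topic, topic_word


# Toy corpus (made up for illustration)
docs = [
    ["car", "engine", "car", "door"],
    ["orbit", "shuttle", "launch", "orbit"],
    ["car", "door", "engine"],
    ["shuttle", "orbit", "launch"],
]
assignments, doc_topic, topic_word = gibbs_lda(docs, num_topics=2)
print(assignments)
```

The smoothing terms keep every topic's weight strictly positive, so a topic that currently has no words in a document can still be selected.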

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
# !pip install pyLDAvis gensim spacy

### Import the libraries

In [2]:
import random
import re
import string

import nltk
import pandas as pd
import requests
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [3]:
# Dataset URL
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"

# Download the dataset and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()
data = response.json()

### Load the dataset

In [4]:
# Load the data into a Pandas DataFrame
df = pd.DataFrame(data)

In [5]:
df.head()

|  | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |


In [6]:
# Print the first row
print(df.iloc[0])

content         From: lerxst@wam.umd.edu (where's my thing)\nS...
target                                                          7
target_names                                            rec.autos
Name: 0, dtype: object


### Preprocess the data

### Email Removal

In [7]:
# Define a function to remove emails from text
def remove_emails(text):
    # Use regular expression to remove email patterns
    return re.sub(r'\S+@\S+', '', text)

# Apply the remove_emails function to the 'content' column
df['content'] = df['content'].apply(remove_emails)
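
As a quick check of what the pattern does, the sketch below applies the same regular expression to a made-up header line (the second address is hypothetical). It removes every whitespace-delimited token that contains `@`, including any punctuation attached to it, and leaves the surrounding spaces behind.

```python
import re

EMAIL_PATTERN = re.compile(r'\S+@\S+')

# Made-up sample line; the bracketed address is purely illustrative
sample = "From: lerxst@wam.umd.edu (where's my thing) Reply to <someone@example.com>, thanks"

cleaned = EMAIL_PATTERN.sub('', sample)

# Collapse the double spaces left where the addresses used to be
normalized = " ".join(cleaned.split())
print(normalized)
```

Because the whole token is dropped, the angle brackets and trailing comma around the second address disappear along with it.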

In [8]:
df.head()

|  | content | target | target_names |
|---|---|---|---|
| 0 | From: (where's my thing)\nSubject: WHAT car i... | 7 | rec.autos |
| 1 | From: (Guy Kuo)\nSubject: SI Clock Poll - Fin... | 4 | comp.sys.mac.hardware |
| 2 | From: (Thomas E Willis)\nSubject: PB question... | 4 | comp.sys.mac.hardware |
| 3 | From: (Joe Green)\nSubject: Re: Weitek P9000 ... | 1 | comp.graphics |
| 4 | From: (Jonathan McDowell)\nSubject: Re: Shutt... | 14 | sci.space |


### Newline Removal

In [9]:
# Display the columns in your DataFrame
print(df.columns)

Index(['content', 'target', 'target_names'], dtype='object')


In [10]:
# Replace newline characters in the 'content' column with spaces
df['content'] = df['content'].str.replace('\n', ' ')

# Display the DataFrame
df.head()

|  | content | target | target_names |
|---|---|---|---|
| 0 | From: (where's my thing) Subject: WHAT car is... | 7 | rec.autos |
| 1 | From: (Guy Kuo) Subject: SI Clock Poll - Fina... | 4 | comp.sys.mac.hardware |
| 2 | From: (Thomas E Willis) Subject: PB questions... | 4 | comp.sys.mac.hardware |
| 3 | From: (Joe Green) Subject: Re: Weitek P9000 ?... | 1 | comp.graphics |
| 4 | From: (Jonathan McDowell) Subject: Re: Shuttl... | 14 | sci.space |


### Single Quotes Removal

In [11]:
# Remove single quotes from the 'content' column
df['content'] = df['content'].str.replace("'", "")

# Display the DataFrame
df.head()

|  | content | target | target_names |
|---|---|---|---|
| 0 | From: (wheres my thing) Subject: WHAT car is ... | 7 | rec.autos |
| 1 | From: (Guy Kuo) Subject: SI Clock Poll - Fina... | 4 | comp.sys.mac.hardware |
| 2 | From: (Thomas E Willis) Subject: PB questions... | 4 | comp.sys.mac.hardware |
| 3 | From: (Joe Green) Subject: Re: Weitek P9000 ?... | 1 | comp.graphics |
| 4 | From: (Jonathan McDowell) Subject: Re: Shuttl... | 14 | sci.space |


### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [12]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

In [13]:
def sent_to_words(data):
    for sentence in data:
        # Use gensim.utils.simple_preprocess to tokenize the sentence
        yield simple_preprocess(str(sentence), deacc=True)  # deacc=True removes accent marks

# Example usage:
# The 'content' column of df holds the cleaned post text
text_data = df['content']

# Tokenize the text using the sent_to_words generator
tokenized_text = list(sent_to_words(text_data))

# # Randomize the order of the tokenized sentences
# random.shuffle(tokenized_text)

# Print the first few tokenized sentences
print(tokenized_text[:5])
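
For readers without gensim at hand, the sketch below is a rough standard-library stand-in for this step. `rough_preprocess` is a hypothetical helper that only approximates `simple_preprocess` (lowercasing, alphabetic tokens, a length filter of 2 to 15 characters) and skips accent removal. The generator shows why documents are tokenized one at a time rather than all at once.

```python
import re

# Runs of word characters that are not digits (an approximation, see above)
TOKEN_PATTERN = re.compile(r'(?:(?!\d)\w)+')


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


def docs_to_tokens(docs):
    """Generator: each document is tokenized only when it is requested."""
    for doc in docs:
        yield rough_preprocess(str(doc))


# Made-up sample documents
token_stream = docs_to_tokens(["WHAT car is this!?", "A 2-door sports car from the 60s"])
first = next(token_stream)    # only the first document has been processed so far
second = next(token_stream)
print(first)
print(second)
```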



### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [14]:
# Additional stop words to be added
custom_stop_words = set(['from', 'subject', 're', 'edu', 'use'])

# Extend the default stop words with custom stop words
all_stop_words = STOPWORDS.union(custom_stop_words)

# Remove stop words from every tokenized document
filtered_text = [[word for word in sentence if word not in all_stop_words] for sentence in tokenized_text]

# # Randomize the order of the tokenized sentences
# random.shuffle(filtered_text)

# Print the first few sentences after stop words removal
print(filtered_text[:5])



#### remove_stopwords( )

In [15]:
from gensim.parsing.preprocessing import STOPWORDS

def remove_stopwords(texts, custom_stop_words=None):
    """
    Remove stop words from a list of tokenized sentences.

    Parameters:
    - texts: List of lists, where each inner list represents a tokenized sentence.
    - custom_stop_words: Set of custom stop words to be added.

    Returns:
    - List of lists, where each inner list represents a sentence after stop words removal.
    """
    if custom_stop_words is None:
        custom_stop_words = set()

    # Extend the default stop words with custom stop words
    all_stop_words = STOPWORDS.union(custom_stop_words)

    # Remove stop words from each tokenized sentence
    filtered_texts = [[word for word in sentence if word not in all_stop_words] for sentence in texts]

    return filtered_texts

In [16]:
# Remove stop words using the function, including the custom stop words defined earlier
filtered_text = remove_stopwords(tokenized_text, custom_stop_words)

# Print the first few sentences after stop words removal
print(filtered_text[:5])



### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [17]:
from gensim.models import Phrases
from gensim.utils import simple_preprocess

In [18]:
# Tokenize the content column
# (this starts again from df['content'], so the stop-word filtering above is not carried forward)
tokenized_text = df['content'].apply(simple_preprocess)

# Create bigrams using Phrases
bigram = Phrases(tokenized_text, threshold=100)

# Apply the bigram model to the tokenized text
bigram_tokenized_text = list(bigram[tokenized_text])

# Replace the 'content' column with the new bigram tokenized text
df['content'] = bigram_tokenized_text
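
The `threshold` argument is compared against a collocation score for each pair of adjacent words. The sketch below is a simplified pure-Python version of that idea, assuming the score `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. Here `vocab_size` counts distinct single words only, which is a simplification; gensim's own vocabulary bookkeeping differs in detail. `score_bigrams` and the toy sentences are hypothetical.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs; higher scores mean the pair co-occurs more than chance suggests."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single words only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab >= min_count:
            scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores


# Toy sentences (made up for illustration)
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "has", "a", "big", "park"],
    ["the", "park", "is", "new"],
]
scores = score_bigrams(sentences, min_count=1)

# Pairs scoring above the threshold would be merged into a single token such as "new_york"
threshold = 1.0
phrases = {pair for pair, score in scores.items() if score > threshold}
print(phrases)
```

A higher threshold, such as the 100 used in this notebook, keeps only pairs that almost always appear together, so fewer phrases are formed.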

In [19]:
# Display the DataFrame
print(df.head())

                                             content  target  \
0  [from, wheres, my, thing, subject, what, car, ...       7   
1  [from, guy_kuo, subject, si, clock, poll, fina...       4   
2  [from, thomas, willis, subject, pb, questions,...       4   
3  [from, joe, green, subject, re, weitek, organi...       1   
4  [from, jonathan, mcdowell, subject, re, shuttl...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  


In [20]:
# Print the first row
print(df.iloc[0])

content         [from, wheres, my, thing, subject, what, car, ...
target                                                          7
target_names                                            rec.autos
Name: 0, dtype: object


#### make_bigrams( )

In [21]:
def make_bigrams(texts):
    # Create bigrams using Phrases with a threshold of 100
    bigram = Phrases(texts, threshold=100)
    # Apply the bigram model to the tokenized text
    bigram_tokenized_text = list(bigram[texts])
    return bigram_tokenized_text

In [22]:
# Tokenize the content column
tokenized_text = df['content'].apply(lambda x: simple_preprocess(' '.join(x)))

# Apply the make_bigrams function to the tokenized text
df['content'] = make_bigrams(tokenized_text)

In [23]:
# Display the DataFrame
print(df.head())

                                             content  target  \
0  [from, wheres, my, thing, subject, what, car, ...       7   
1  [from, guy_kuo, subject, si, clock, poll, fina...       4   
2  [from, thomas, willis, subject, pb, questions,...       4   
3  [from, joe, green, subject, re, weitek, organi...       1   
4  [from, jonathan, mcdowell, subject, re, shuttl...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [24]:
# Download the small English model if it is not already installed
# !python -m spacy download en_core_web_sm

In [25]:
import spacy

In [26]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

#### lemmatization( )

In [27]:
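def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose coarse POS tag is allowed.

    See https://spacy.io/api/annotation for the tag set.
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag them in context
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out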

In [28]:
# Lemmatize the data
data_lemmatized = lemmatization(bigram_tokenized_text, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [29]:
print(data_lemmatized[:1])

[['s', 'thing', 'subject', 'car', 'nntp_poste', 'host', 'rac_wam', 'organization', 'park', 'line', 'wonder', 'out', 'there', 'enlighten', 'car', 'see', 'other', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [30]:
# Create a Dictionary representation of the texts
dictionary = corpora.Dictionary(data_lemmatized)

# Print the dictionary
print(dictionary)

Dictionary<51656 unique tokens: ['addition', 'body', 'bring', 'call', 'car']...>


### Create Corpus

In [31]:
# Create the Corpus
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

# Print the first few entries in the corpus
print(corpus[:5])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1)], [(3, 2), (5, 2), (13, 1), (18, 1), (25, 1), (40, 1), (42, 1), (47, 1), (48, 1), (49, 2), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 5), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 2), (64, 1), (65, 2), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 3), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 3), (85, 1), (86, 1), (87, 2), (88, 1), (89, 1), (90, 1), (91, 3), (92, 1)], [(5, 2), (14, 2), (15, 1), (18, 2), (19, 2), (21, 1), (26, 1), (28, 1), (32, 2), (34, 1), (40, 1), (42, 1),
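
Each document in the corpus is a list of `(token_id, count)` pairs. The sketch below reproduces that bag-of-words encoding with the standard library only. `build_token2id` and `to_bow` are hypothetical helpers, and ids are assigned in first-seen order, which is not necessarily the order gensim's `Dictionary` uses.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are ignored."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


# Toy documents (made up for illustration)
texts = [["car", "door", "car"], ["door", "engine"]]
token2id = build_token2id(texts)
bow_first = to_bow(texts[0], token2id)
bow_unseen = to_bow(["car", "wheel"], token2id)
print(token2id)
print(bow_first)
print(bow_unseen)
```

Word order is discarded in this representation; only which words occur, and how often, reaches the topic model.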

### Filter low-frequency words

In [32]:
# Filter out words that appear in fewer than 5 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Create the Corpus again after filtering
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

# Print the first few entries in the filtered corpus
print(corpus[:5])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)], [(3, 2), (5, 2), (13, 1), (23, 1), (37, 1), (42, 1), (43, 1), (44, 2), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 5), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 3), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 3), (80, 1), (81, 1), (82, 2), (83, 1), (84, 1), (85, 1), (86, 3), (87, 1)], [(5, 2), (14, 2), (15, 1), (17, 2), (19, 1), (25, 1), (29, 2), (31, 1), (37, 1), (38, 1), (40, 1), (45, 1), (60, 1), (68, 1), (69, 1), (84, 1), (88, 1), (89, 1), (90, 1),
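
The filtering rule can be sketched with document frequencies alone. This is a simplified stand-in, assuming a token is kept when its document frequency lies between `no_below` and `no_above` times the number of documents; it ignores the additional `keep_n` cap that gensim's `filter_extremes` applies. `filter_by_document_frequency` is a hypothetical helper.

```python
from collections import Counter


def filter_by_document_frequency(texts, no_below=5, no_above=0.5):
    """Return the tokens that are neither too rare nor too common across documents."""
    # Count each token once per document
    doc_freq = Counter(token for text in texts for token in set(text))
    max_docs = int(no_above * len(texts))
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy documents (made up for illustration)
texts = [
    ["car", "door"],
    ["car", "engine"],
    ["car", "door", "wheel"],
    ["car"],
]
kept = filter_by_document_frequency(texts, no_below=2, no_above=0.5)
print(kept)
```

In this toy corpus "car" is dropped for appearing in every document, while "engine" and "wheel" are dropped for appearing in only one.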

In [33]:
# for doc in corpus:
#     print(doc)

In [34]:
# # Print the first few documents in your corpus
# for doc in corpus[:5]:
#     print(doc)

### Create an index-to-word dictionary

In [35]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary

In [36]:
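# Rebuild the Dictionary from the lemmatized texts
# (note: this replaces the dictionary filtered with filter_extremes() above)
dictionary = Dictionary(data_lemmatized)

# Index-to-word mapping; gensim fills id2token lazily, so it stays empty
# until the dictionary is first looked up by id
id2word = dictionary.id2token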

# Check the length of the dictionary and corpus
print("Length of the dictionary:", len(dictionary))
print("Length of the corpus:", len(corpus))

Length of the dictionary: 51656
Length of the corpus: 11314


In [37]:
# # Check the length of the dictionary and corpus
# print("Length of the dictionary:", len(id2word))
# print("Length of the corpus:", len(corpus))

In [38]:
# Print all entries in the index-to-word dictionary
for key, value in id2word.items():
    print(f'Index: {key}, Word: {value}')

# Check the length of the dictionary
print(f'Total number of unique words in the dictionary: {len(dictionary)}')

Total number of unique words in the dictionary: 51656


### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: the prior on the document-topic distribution; it affects how sparse each document's topic mixture is
- **passes**: the total number of training passes over the corpus

In [39]:
print("Length of the corpus:", len(corpus))
print("Length of the dictionary:", len(id2word))

Length of the corpus: 11314
Length of the dictionary: 0


In [40]:
print(data_lemmatized[:5])



In [41]:
# # Add debugging prints
# for doc in corpus[:5]:
#     print(doc)

In [42]:
# Check the length of each document in the corpus
for i, doc in enumerate(corpus):
    print(f"Document {i + 1} length: {len(doc)}")

# Check if there are any documents with zero terms
empty_docs = [i for i, doc in enumerate(corpus) if len(doc) == 0]
print("Documents with zero terms:", empty_docs)

Document 1 length: 42
Document 2 length: 51
Document 3 length: 112
Document 4 length: 37
Document 5 length: 55
Document 6 length: 108
Document 7 length: 34
Document 8 length: 104
Document 9 length: 14
Document 10 length: 66
Document 11 length: 46
Document 12 length: 184
Document 13 length: 10
Document 14 length: 231
Document 15 length: 73
Document 16 length: 70
Document 17 length: 100
Document 18 length: 324
Document 19 length: 44
Document 20 length: 57
Document 21 length: 41
Document 22 length: 38
Document 23 length: 87
Document 24 length: 42
Document 25 length: 26
Document 26 length: 49
Document 27 length: 117
Document 28 length: 6
Document 29 length: 116
Document 30 length: 47
Document 31 length: 87
Document 32 length: 75
Document 33 length: 15
Document 34 length: 131
Document 35 length: 36
Document 36 length: 60
Document 37 length: 54
Document 38 length: 174
Document 39 length: 85
Document 40 length: 247
Document 41 length: 90
Document 42 length: 20
Document 43 length: 17
Document 

In [43]:
# Create the LDA model
ldamodel = LdaModel(corpus, 
                    num_topics=15, 
                    id2word=dictionary, 
                    iterations=20,  # Adjust the number of iterations
                    chunksize=100,  # Adjust the chunk size based on your corpus size and available memory
                    alpha='auto'   # 'auto' estimates alpha based on the data
                   )

### Print the keywords for 10 of the topics

In [44]:
# Print the top 10 keywords for 10 of the 15 topics
topics = ldamodel.print_topics(num_topics=10, num_words=10)
for topic in topics:
    print(topic)

(7, '0.697*"professional" + 0.047*"wed_apr" + 0.027*"supervise" + 0.014*"possible" + 0.008*"southern" + 0.006*"subjective" + 0.005*"galileo" + 0.005*"mark_zeni" + 0.005*"copy_protecte" + 0.004*"purpose"')
(9, '0.120*"location" + 0.042*"caste" + 0.033*"bug" + 0.028*"hakki" + 0.028*"cubic" + 0.028*"environment" + 0.027*"trick" + 0.024*"informed" + 0.024*"then_proceede" + 0.019*"regard"')
(1, '0.061*"unix" + 0.051*"inconsiant" + 0.040*"interested" + 0.036*"civilian" + 0.034*"prob" + 0.031*"hell" + 0.025*"incur" + 0.025*"away" + 0.021*"mention" + 0.018*"obrother"')
(10, '0.098*"berth" + 0.033*"start" + 0.028*"spotty" + 0.027*"anticipation" + 0.021*"invasion" + 0.021*"time" + 0.021*"overdose" + 0.020*"control" + 0.019*"shutout" + 0.019*"gauche"')
(0, '0.027*"seattle_wa" + 0.026*"control" + 0.018*"bro" + 0.017*"pantheism" + 0.016*"sean_garrison" + 0.016*"doubt" + 0.016*"mozumder" + 0.015*"rpwhite" + 0.014*"postseason" + 0.013*"ini_file"')
(3, '0.035*"car" + 0.016*"such" + 0.015*"individual" 

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**

In [45]:
# from gensim.models import CoherenceModel

In [46]:
# Compute the per-word likelihood bound (log perplexity) on the training corpus
print(f'Model Perplexity: {ldamodel.log_perplexity(corpus)}')

Model Perplexity: -11.030509773361375
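
The value printed above is a per-word likelihood bound rather than the perplexity itself. As a minimal sketch, assuming the convention gensim uses when it logs its perplexity estimate (2 raised to the negative bound), the conversion looks like this; `perplexity_from_bound` is a hypothetical helper.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity."""
    return 2 ** (-per_word_bound)


# Bound printed by the cell above
bound = -11.030509773361375
perplexity = perplexity_from_bound(bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

Lower perplexity, which means a bound closer to zero, indicates that the model assigns higher probability to the words it is evaluated on.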


### Topic Coherence
Topic Coherence measures score a single topic by measuring the **degree of semantic similarity** between **high scoring words** in the topic.

In [47]:
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary

# Assuming data_lemmatized is your preprocessed data
id2word = Dictionary(data_lemmatized)
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

# Assuming ldamodel is your trained LDA model
coherence_model_lda = CoherenceModel(model=ldamodel, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print(f'Topic Coherence: {coherence_lda}')

Topic Coherence: 0.4712432854937226
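
To show the intuition behind coherence scoring, the sketch below implements a simpler measure based on document co-occurrence (often called UMass coherence). It is not the `c_v` measure used in the cell above, which is more involved. `umass_coherence` is a hypothetical helper, it sums over word pairs without averaging, and it assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, texts):
    """Sum of log((D(earlier, later) + 1) / D(earlier)) over ordered pairs of top words."""
    doc_sets = [set(text) for text in texts]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


# Toy documents (made up for illustration)
texts = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["orbit", "shuttle"],
    ["orbit", "launch", "shuttle"],
]
coherent = umass_coherence(["car", "engine"], texts)
mixed = umass_coherence(["car", "orbit"], texts)
print(coherent, mixed)
```

Words that appear in the same documents score higher than words that never co-occur, which is the behavior one wants from the top words of a good topic.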


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [48]:
# Create the pyLDAvis visualization
vis_data = gensimvis.prepare(ldamodel, corpus, id2word)

In [49]:
# Display the visualization
pyLDAvis.display(vis_data)