# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical structures and frameworks such as matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** where each **document is assumed to have a combination of topics**, similar to a probabilistic latent semantic indexing model.

In simple words, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
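
The two distributions above can be read as a recipe for generating a document: pick a topic from the document's topic mix, then pick a word from that topic. The following is a minimal sketch of that idea only; the topic names, words, and probabilities are hypothetical values chosen for illustration, not output of any model in this notebook.

```python
import random

# Hypothetical toy parameters, chosen only for illustration.
topics = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},      # distribution of words
    "space": {"orbit": 0.5, "launch": 0.3, "engine": 0.2},  # distribution of words
}
doc_topic_mix = {"autos": 0.7, "space": 0.3}  # distribution of topics for one document


def generate_document(n_words, topic_mix, topic_word, rng):
    """Sample words by first drawing a topic, then a word from that topic."""
    words = []
    topic_names = list(topic_mix)
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=[topic_mix[t] for t in topic_names])[0]
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        words.append(word)
    return words


rng = random.Random(0)
doc = generate_document(10, doc_topic_mix, topics, rng)
print(doc)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mix of each document and the word distribution of each topic.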

### LDA Algorithm

1. For each document, **randomly initialize each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
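
The three steps above can be sketched in plain Python as a toy sampler. This is an illustrative sketch, not gensim's implementation (gensim trains LDA with a variational method). The function name `simple_lda`, the smoothing constants `alpha` and `beta`, and the tiny corpus are assumptions added here so that no probability is ever exactly zero.

```python
import random
from collections import defaultdict


def simple_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler for the steps above; alpha and beta are smoothing assumptions."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    doc_topic = [[0] * num_topics for _ in docs]                 # words in D assigned to T
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # assignments of W to T
    topic_total = [0] * num_topics                               # all assignments to T
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Exclude the current word so only the other assignments count.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Step 2: P(T|D) * P(W|T) for every topic. The P(T|D) denominator
                # is the same for all topics, so it is left out (unnormalized).
                weights = []
                for t in range(num_topics):
                    p_t_given_d = doc_topic[d][t] + alpha
                    p_w_given_t = (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    weights.append(p_t_given_d * p_w_given_t)

                # Step 3: reassign the word to a topic drawn with those weights.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return assignments, doc_topic


docs = [
    ["car", "engine", "door", "car"],
    ["orbit", "launch", "rocket", "orbit"],
    ["car", "door", "engine"],
    ["launch", "rocket", "orbit"],
]
assignments, doc_topic = simple_lda(docs, num_topics=2)
print(doc_topic)  # per-document counts of words assigned to each topic
```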

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [None]:
! pip install pyLDAvis gensim spacy

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


### Import the libraries

In [None]:
import pandas as pd
import numpy as np
import json
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim_models
import spacy

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load English language model in spaCy
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [None]:
!pip install wget
import wget

# Download the dataset
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"
wget.download(url)



Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9656 sha256=5172fd1478fb83966e870c30426f8d50ae53a74c9a1587854b9ce917f2007a08
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


'newsgroups.json'

### Load the dataset

In [99]:
import pandas as pd

# Load the downloaded JSON file into a DataFrame
df = pd.read_json('newsgroups.json')



In [100]:
df.head()



|   | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |


### Preprocess the data

### Email Removal

In [103]:
import re
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from nltk.corpus import stopwords
import nltk
import string

# Download required NLTK data
nltk.download('stopwords')

# 1. Email Removal
def remove_emails(text):
    # Updated regular expression to match email patterns with or without TLD
    email_pattern = r'\S+@\S+'
    # Substitute found email addresses with an empty string
    return re.sub(email_pattern, '', text)

# 2. Newline Removal
def remove_newlines(text):
    return re.sub(r'\n+', ' ', text)

# 3. Single Quotes Removal
def remove_quotes(text):
    return re.sub(r"\'", "", text)

# 4. Tokenization
def sent_to_words(texts):
    for sentence in texts:
        # simple_preprocess lowercases and tokenizes; deacc=True strips accent marks,
        # and min_len=1 keeps single-character tokens (filtered later as stop words)
        yield simple_preprocess(str(sentence), deacc=True, min_len=1)

# 5. Bigrams Creation
def make_bigrams(texts):
    bigram = Phrases(texts, min_count=1, threshold=100)  # Adjust threshold as needed
    bigram_mod = Phraser(bigram)
    return [bigram_mod[doc] for doc in texts]

# 6. Stop Words Removal
def remove_stopwords(texts):
    stop_words = set(stopwords.words('english'))
    # Words excluded from the stop-word set, so they are kept in the output
    always_keep = ['from', 'subject', 're', 'edu', 'use']
    stop_words = stop_words - set(always_keep)
    # Add single-letter words (a-z) to the stop words
    stop_words = stop_words.union(string.ascii_lowercase)
    return [[word for word in doc if word not in stop_words] for doc in texts]

# Apply preprocessing pipeline
# Start with the dataset
data = df['content'].copy()

# Apply initial cleaning
data = data.apply(remove_emails)
data = data.apply(remove_newlines)
data = data.apply(remove_quotes)

# Convert to words
data_words = list(sent_to_words(data))

# Create and apply bigrams
data_bigrams = make_bigrams(data_words)

data_bigrams_nostops = remove_stopwords(data_bigrams)

# For first document
print("First document tokens:")
print(data_bigrams_nostops[0])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


First document tokens:
['from', 'wheres', 'thing', 'subject', 'car', 'nntp_posting', 'host_rac', 'wam_umd', 'edu', 'organization', 'university', 'maryland_college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'from', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front_bumper', 'separate', 'from', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky_looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood_lerxst']
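
The three regex cleaning steps (emails, newlines, single quotes) can be checked in isolation on a short string. This is a small sketch that reuses the same patterns as the helper functions; the sample post is made up for illustration.

```python
import re

# Hypothetical sample post, made up for illustration.
sample = "From: someone@example.com (A Person)\nSubject: What's this car?\n\nThanks!"

no_emails = re.sub(r"\S+@\S+", "", sample)     # drop any whitespace-delimited token containing '@'
no_newlines = re.sub(r"\n+", " ", no_emails)   # collapse each run of newlines into one space
no_quotes = re.sub(r"'", "", no_newlines)      # delete single quotes

print(no_quotes)
```

Note that the email pattern is deliberately broad: it removes any token containing `@`, including trailing punctuation attached to it.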


### Newline Removal

### Single Quotes Removal

### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function
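
A generator produces one tokenized document at a time instead of building the whole list up front. The sketch below illustrates only that lazy behaviour; `sent_to_words_sketch` is a hypothetical name, and a simple regex stands in for `gensim.utils.simple_preprocess`, so the tokens will not match gensim's output exactly.

```python
import re


def sent_to_words_sketch(texts):
    """Yield one token list per document; a regex stands in for simple_preprocess."""
    for text in texts:
        yield re.findall(r"[a-z]+", str(text).lower())


docs = ["Where's my thing?", "Software for sale"]
token_stream = sent_to_words_sketch(docs)  # nothing has been tokenized yet
first = next(token_stream)                 # tokenizes only the first document
remaining = list(token_stream)             # consumes the rest
print(first, remaining)
```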

### Stop words Removal
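- Extend the stop words corpus with the following words (note that `remove_stopwords()` as written above does the opposite: it subtracts these words from the stop-word set, which is why they still appear in the printed tokens)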
    - from
    - subject
    - re
    - edu
    - use
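
Extending a stop-word set with the words listed above can be sketched as follows. The `base_stop_words` set is a small hypothetical stand-in for NLTK's English list, used so that the example runs without downloading any data.

```python
# Small stand-in stop-word set (assumption); the notebook uses NLTK's English list.
base_stop_words = {"the", "a", "is", "of", "from"}
extra_stop_words = ["from", "subject", "re", "edu", "use"]

# Union adds the extra words to the set; duplicates such as "from" are harmless.
stop_words = base_stop_words.union(extra_stop_words)

docs = [["from", "subject", "car", "is", "edu", "engine"]]
filtered = [[word for word in doc if word not in stop_words] for doc in docs]
print(filtered)
```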

In [104]:
print(data_bigrams_nostops[100])

['from', 'tsung', 'kun', 'chen', 'subject', 'software', 'forsale', 'lots', 'nntp_posting', 'host_magnusug', 'magnus_acs', 'ohio_state', 'edu', 'organization', 'ohio_state', 'university', 'post', 'friend', 'either', 'call', 'lee', 'drop', 'mail', 'distribution_usa', 'lines', 'software', 'publishing', 'superbase', 'windows', 'ocr', 'system_readright', 'windows', 'ocr', 'system_readright', 'dos', 'unregistered', 'zortech', 'bit', 'compiler', 'multiscope', 'windows', 'debugger', 'whitewater', 'resource', 'toolkit', 'library', 'source_code', 'glockenspiel', 'imagesoft', 'commonview', 'windows', 'applications', 'framework', 'borland', 'spontaneous', 'assembly', 'library', 'source_code', 'microsoft_macro', 'assembly', 'microsoft', 'windows', 'sdk', 'documentation', 'microsoft', 'foxpro', 'wordperfect', 'developers_toolkit', 'kedwell', 'software', 'databoss', 'code', 'generator', 'kedwell', 'installboss', 'installation', 'generator', 'liant', 'software', 'views', 'windows', 'application_framew



#### remove_stopwords( )

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold
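
To give a feel for what the threshold is compared against, the sketch below scores adjacent word pairs with a formula loosely modeled on the default scorer described in the gensim documentation: `(count(a, b) - min_count) * N / (count(a) * count(b))`. Here `N` is simply the number of distinct words, which is a simplifying assumption, and `bigram_scores` is a hypothetical helper, not a gensim function. Pairs scoring above the threshold would be merged into one token such as `source_code`.

```python
from collections import Counter


def bigram_scores(docs, min_count=1):
    """Score adjacent pairs: (count(a,b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplifying assumption for the N term
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


docs = [
    ["source", "code", "for", "windows"],
    ["source", "code", "library"],
    ["windows", "library", "code"],
]
scores = bigram_scores(docs)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Pairs that occur together more often than their individual frequencies would suggest get higher scores, so a higher threshold yields fewer merged phrases.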

#### make_bigrams( )

### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [83]:
!python -m spacy download en_core_web_sm # Download the en_core_web_sm model

import spacy

# Load the model using its full name
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])



Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


#### lemmatization( )

In [84]:
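def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each token list with spaCy, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag and lemmatize them in context
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out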



In [105]:
data_lemmatized = lemmatization(data_bigrams_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])



In [107]:
print(data_lemmatized[:1])

[['thing', 'subject', 'car', 'nntp_poste', 'wam_umd', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky_looke', 'car', 'mail', 'thank', 'bring']]




### Create a Dictionary

### Create Corpus

### Filter low-frequency words

### Create Index 2 word dictionary
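
The dictionary, filtering, bag-of-words, and index-to-word steps can be sketched without gensim to show what each one produces. This is an illustrative sketch on a made-up corpus; the `no_below` and `no_above` values are chosen to suit four tiny documents and differ from the ones used in the notebook.

```python
from collections import Counter

docs = [
    ["car", "engine", "door"],
    ["car", "engine", "price"],
    ["car", "orbit", "launch"],
    ["car", "orbit", "engine"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Keep tokens found in at least `no_below` documents and in no more than
# a `no_above` fraction of all documents.
no_below, no_above = 2, 0.75
kept = sorted(t for t, df in doc_freq.items() if df >= no_below and df / len(docs) <= no_above)

token2id = {token: idx for idx, token in enumerate(kept)}   # word -> id
id2word = {idx: token for token, idx in token2id.items()}   # id -> word

# Bag of words: (token id, count) pairs per document, skipping filtered tokens.
corpus = [sorted(Counter(token2id[t] for t in doc if t in token2id).items()) for doc in docs]
print(id2word)
print(corpus)
```

Here `car` is dropped for appearing in every document and `door`, `price`, and `launch` for appearing in only one, which mirrors how very common and very rare words are filtered before training.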

### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: the hyperparameter (prior) that affects the sparsity of the per-document topic distributions
- **passes**: the total number of training passes over the corpus

In [109]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models
import pyLDAvis

# Step 1: Create a Dictionary
dictionary = corpora.Dictionary(data_lemmatized)

# Step 2: Filter low-frequency words
dictionary.filter_extremes(no_below=5, no_above=0.5)  # Adjust thresholds as needed

# Step 3: Create a Bag of Words representation
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

# Step 4: Make an index to word dictionary
temp = dictionary[0]  # Accessing an item forces gensim to build the lazy id2token mapping
id2word = dictionary.id2token  # Mapping from token id to word

# Step 5: Train the Topic Model
lda_model = LdaModel(corpus, num_topics=10, id2word=id2word, passes=10, alpha='auto', chunksize=2000)




In [110]:
for idx, topic in lda_model.print_topics(-1):
    print(f" {idx}: {topic}")

 0: 0.022*"file" + 0.018*"use" + 0.014*"program" + 0.009*"include" + 0.009*"image" + 0.009*"information" + 0.008*"available" + 0.007*"source" + 0.006*"also" + 0.006*"system"
 1: 0.011*"article" + 0.008*"think" + 0.007*"make" + 0.007*"people" + 0.007*"use" + 0.006*"much" + 0.006*"say" + 0.006*"get" + 0.006*"year" + 0.005*"well"
 2: 0.014*"say" + 0.014*"people" + 0.008*"think" + 0.008*"know" + 0.008*"believe" + 0.007*"make" + 0.005*"see" + 0.005*"article" + 0.005*"come" + 0.005*"many"
 3: 0.022*"nntp_poste" + 0.018*"host" + 0.014*"thank" + 0.013*"mail" + 0.012*"know" + 0.012*"article" + 0.011*"get" + 0.010*"reply" + 0.009*"new" + 0.009*"look"
 4: 0.891*"ax" + 0.059*"max" + 0.001*"ei" + 0.001*"wm" + 0.001*"tq" + 0.001*"fq" + 0.001*"tm" + 0.001*"pl" + 0.001*"qq" + 0.001*"qax"
 5: 0.026*"use" + 0.019*"window" + 0.016*"problem" + 0.015*"drive" + 0.012*"get" + 0.011*"run" + 0.010*"work" + 0.009*"bit" + 0.009*"system" + 0.008*"do"
 6: 0.035*"_" + 0.015*"qs" + 0.014*"m" + 0.012*"ai" + 0.011*"sp



### Print the Keywords in the 10 topics

[(0,
  '0.165*"gun" + 0.063*"crime" + 0.054*"police" + 0.053*"bus" + 0.051*"weapon" '
  '+ 0.041*"criminal" + 0.040*"carry" + 0.039*"master" + 0.038*"shoot" + '
  '0.034*"cop"'),
 (1,
  '0.149*"team" + 0.130*"game" + 0.079*"win" + 0.070*"test" + 0.043*"run" + '
  '0.043*"score" + 0.039*"division" + 0.036*"wing" + 0.028*"cpu" + '
  '0.020*"resource"'),
 (2,
  '0.118*"list" + 0.077*"entry" + 0.056*"section" + 0.041*"author" + '
  '0.039*"special" + 0.032*"site" + 0.032*"sun" + 0.027*"send" + '
  '0.023*"laboratory" + 0.022*"student"'),
 (3,
  '0.070*"suggest" + 0.064*"church" + 0.060*"member" + 0.050*"process" + '
  '0.043*"community" + 0.040*"perform" + 0.038*"scientific" + 0.036*"ignore" + '
  '0.031*"weight" + 0.030*"significant"'),
 (4,
  '0.176*"question" + 0.097*"answer" + 0.056*"access" + 0.054*"page" + '
  '0.048*"format" + 0.036*"recommend" + 0.036*"trial" + 0.033*"ask" + '
  '0.031*"faq" + 0.029*"step"'),
 (5,
  '0.042*"people" + 0.028*"say" + 0.021*"believe" + 0.021*"reason" +



## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**. Lower perplexity indicates a better fit.

In [111]:
# Step 7: Evaluate the Topic Model
# log_perplexity() returns the per-word likelihood bound (a log-scale value), not the perplexity itself
print(f'Perplexity: {lda_model.log_perplexity(corpus)}')

# Topic Coherence
coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score: {coherence_lda}')



Perplexity: -7.310263172263326
Coherence Score: 0.5781158742763122
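
The negative number printed above is the per-word likelihood bound returned by `log_perplexity()`. The gensim documentation describes the perplexity itself as `2 ** (-bound)`, so the conversion is a one-liner. The sketch below plugs in the value from this run.

```python
# Value printed by log_perplexity() in the run above.
log_bound = -7.310263172263326

# Perplexity as described in the gensim docs: 2 ** (-bound). Lower is better.
perplexity = 2 ** (-log_bound)
print(f"Perplexity: {perplexity:.1f}")
```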


### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in the topic.
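
To make the idea concrete, the sketch below computes a simple UMass-style coherence score from document co-occurrence counts. This is not the `c_v` measure used in the notebook, which is more involved; `umass_coherence` is a hypothetical helper, and the tiny corpus is made up for illustration. It assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Sum of log((co-document count + 1) / document count) over pairs of top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


docs = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["orbit", "launch"],
    ["orbit", "launch", "rocket"],
]
coherent = umass_coherence(["car", "engine"], docs)  # words that appear together
mixed = umass_coherence(["car", "orbit"], docs)      # words that never co-occur
print(coherent, mixed)
```

Words that tend to appear in the same documents push the score up, so a topic whose top words co-occur scores higher than one that mixes unrelated words.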

### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [112]:
# Step 8: Visualize the Topics
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

