# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that co-occur** frequently in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing (pLSI) model.

Put simply, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
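
The two distributions above can be made concrete with a small generative sketch. It is only an illustration: the vocabulary, the number of topics, and the Dirichlet parameter 0.5 are hypothetical values chosen for the example, not settings used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["car", "engine", "game", "team", "key", "chip"]  # hypothetical vocabulary
num_topics = 3                                            # hypothetical K

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=num_topics)

# Each document is a probability distribution over the topics
doc_topic = rng.dirichlet(np.full(num_topics, 0.5))

# Generate a 10-word document: draw a topic per position, then a word from that topic
topics = rng.choice(num_topics, size=10, p=doc_topic)
words = [vocab[rng.choice(len(vocab), p=topic_word[t])] for t in topics]
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates `doc_topic` and `topic_word`.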

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)
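
The reassignment loop described above can be sketched as a toy sampler. This is a teaching sketch under stated assumptions, not the algorithm gensim runs (gensim's `LdaModel` is based on online variational Bayes). The function name `toy_lda`, the `smooth` constant, and the integer-encoded documents are all hypothetical.

```python
import numpy as np

def toy_lda(docs, num_topics, vocab_size, iterations=50, seed=0):
    """Toy sampler following the three steps above; docs are lists of word ids."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign every word occurrence to one of the K topics
    assignments = [rng.integers(num_topics, size=len(doc)) for doc in docs]
    doc_topic = np.zeros((len(docs), num_topics))    # words in D assigned to T
    topic_word = np.zeros((num_topics, vocab_size))  # occurrences of W assigned to T
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1

    smooth = 0.1  # hypothetical smoothing so that no probability is exactly zero
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Set the current word aside so only the *other* words count
                t_old = assignments[d][i]
                doc_topic[d, t_old] -= 1
                topic_word[t_old, w] -= 1
                # Step 2: P(T | D) and P(W | T) for every topic T
                p_t_given_d = (doc_topic[d] + smooth) / (doc_topic[d].sum() + smooth * num_topics)
                p_w_given_t = (topic_word[:, w] + smooth) / (topic_word.sum(axis=1) + smooth * vocab_size)
                # Step 3: reassign W with probability proportional to the product
                weights = p_t_given_d * p_w_given_t
                t_new = rng.choice(num_topics, p=weights / weights.sum())
                assignments[d][i] = t_new
                doc_topic[d, t_new] += 1
                topic_word[t_new, w] += 1
    return doc_topic, topic_word

# Two tiny documents over a 4-word vocabulary (word ids 0..3)
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3]]
doc_topic, topic_word = toy_lda(docs, num_topics=2, vocab_size=4)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document ended up in each topic, so the row sums always equal the document lengths.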

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [3]:
! pip install pyLDAvis gensim spacy


### Import the libraries

In [5]:
# Basic data handling and manipulation
import pandas as pd
import numpy as np

# Ensure NLTK resources are available 
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# NLP and text processing
import re  # For text preprocessing like removing non-alphabetic characters
from nltk.corpus import stopwords  # To remove stop words
from nltk.tokenize import word_tokenize  # For tokenization
from nltk import bigrams, trigrams  # For creating bigrams and trigrams
from nltk.stem import WordNetLemmatizer  # For lemmatization

# Libraries for building the topic model
from gensim import corpora  # For creating dictionary and corpus
from gensim.models import LdaModel, CoherenceModel  # LDA model and topic coherence evaluation
from gensim.models.phrases import Phrases, Phraser  # For bigram and trigram models

# Visualization
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # For interactive topic model visualization
import matplotlib.pyplot as plt  # For plotting graphs



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\deepa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\deepa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\deepa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

--2021-03-08 08:56:47--  https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23237087 (22M) [text/plain]
Saving to: ‘newsgroups.json’

2021-03-08 08:56:49 (13.8 MB/s) - ‘newsgroups.json’ saved [23237087/23237087]



### Load the dataset

In [7]:
# URL of the dataset
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"

# Load the dataset into a pandas DataFrame
newsgroups_data = pd.read_json(url)

In [9]:
# Display the first few rows of the dataset
print(newsgroups_data.head())
print("\nDataset Summary:")

                                             content  target  \
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7   
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4   
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4   
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  

Dataset Summary:
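
The summary header above is printed without any content. One possible way to fill it in is sketched here on a small stand-in frame. The three rows are hypothetical; only the column names (`content`, `target`, `target_names`) come from the dataset shown above.

```python
import pandas as pd

# Hypothetical stand-in rows; the real frame has the same three columns
sample = pd.DataFrame({
    "content": ["post about cars", "post about macs", "another mac post"],
    "target": [7, 4, 4],
    "target_names": ["rec.autos", "comp.sys.mac.hardware", "comp.sys.mac.hardware"],
})

print(sample.shape)                             # (rows, columns)
print(sample["target_names"].value_counts())    # number of posts per newsgroup
print(sample["content"].str.len().describe())   # post length statistics
```

Running the same three calls on `newsgroups_data` would report the real number of posts, the posts per newsgroup, and the spread of post lengths.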


### Preprocess the data

### Email Removal

In [11]:
def remove_emails(text):
    # Regular expression to find email addresses
    email_pattern = r'\S+@\S+'
    # Replace all occurrences of email addresses with an empty string
    return re.sub(email_pattern, '', text)

# Apply the email removal function to the 'content' column
newsgroups_data['content'] = newsgroups_data['content'].apply(remove_emails)

# Display a sample of the data to verify email addresses are removed
print(newsgroups_data['content'].head())

0    From:  (where's my thing)\nSubject: WHAT car i...
1    From:  (Guy Kuo)\nSubject: SI Clock Poll - Fin...
2    From:  (Thomas E Willis)\nSubject: PB question...
3    From:  (Joe Green)\nSubject: Re: Weitek P9000 ...
4    From:  (Jonathan McDowell)\nSubject: Re: Shutt...
Name: content, dtype: object


### Newline Removal

In [13]:
# Function to remove newline characters
def remove_newlines(text):
    # Replace newline characters with a space
    return text.replace('\n', ' ').replace('\r', ' ')

# Apply the newline removal function to the 'content' column
newsgroups_data['content'] = newsgroups_data['content'].apply(remove_newlines)

# Display a sample of the data to verify newline characters are removed
print(newsgroups_data['content'].head())

0    From:  (where's my thing) Subject: WHAT car is...
1    From:  (Guy Kuo) Subject: SI Clock Poll - Fina...
2    From:  (Thomas E Willis) Subject: PB questions...
3    From:  (Joe Green) Subject: Re: Weitek P9000 ?...
4    From:  (Jonathan McDowell) Subject: Re: Shuttl...
Name: content, dtype: object


### Single Quotes Removal

In [15]:
# Function to remove single quotes
def remove_single_quotes(text):
    # Replace single quotes with an empty string
    return text.replace("'", "")

# Apply the single quotes removal function to the 'content' column
newsgroups_data['content'] = newsgroups_data['content'].apply(remove_single_quotes)

# Display a sample of the data to verify single quotes are removed
print(newsgroups_data['content'].head())

0    From:  (wheres my thing) Subject: WHAT car is ...
1    From:  (Guy Kuo) Subject: SI Clock Poll - Fina...
2    From:  (Thomas E Willis) Subject: PB questions...
3    From:  (Joe Green) Subject: Re: Weitek P9000 ?...
4    From:  (Jonathan McDowell) Subject: Re: Shuttl...
Name: content, dtype: object
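
The three cleaning steps above can also be combined into a single pass. The sketch below is one possible arrangement, not the notebook's own code: the function name `clean_text` is hypothetical, and it collapses every run of whitespace into one space, which is slightly stricter than the newline replacement used above.

```python
import re

EMAIL_PATTERN = re.compile(r"\S+@\S+")     # same pattern as remove_emails above
WHITESPACE_PATTERN = re.compile(r"\s+")    # newlines, tabs and repeated spaces

def clean_text(text):
    """Remove e-mail addresses, collapse whitespace, and drop single quotes."""
    text = EMAIL_PATTERN.sub("", text)
    text = WHITESPACE_PATTERN.sub(" ", text)
    return text.replace("'", "")

sample_post = "From: a@b.com (Joe)\nSubject: What's up"
print(clean_text(sample_post))  # From: (Joe) Subject: Whats up
```

Compiling the patterns once avoids re-parsing them for each of the roughly 11K posts when the function is passed to `apply`.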


### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [21]:
from gensim.utils import simple_preprocess

# Generator function to tokenize text
def sent_to_words(texts):
    for text in texts:
        # Use gensim's simple_preprocess for basic tokenization
        yield simple_preprocess(text, deacc=True)  # deacc=True strips accent marks; punctuation is dropped by the tokenizer itself

# Apply the generator function to the 'content' column
tokenized_words = list(sent_to_words(newsgroups_data['content']))

# Display the first tokenized document as a sample
print(tokenized_words[0])

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']
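
The output above contains no digits and no one-letter words. A rough pure-Python approximation of `simple_preprocess` shows why. It is a sketch only: `rough_preprocess` is a hypothetical helper and its regular expression is not gensim's exact tokenizer, although the length limits mirror gensim's documented defaults (`min_len=2`, `max_len=15`).

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, keep alphabetic runs, filter by length."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("I saw a 2-door sports car, the Bricklin!"))
# ['saw', 'door', 'sports', 'car', 'the', 'bricklin']
```

Digits and punctuation split or vanish, and tokens shorter than two characters are dropped, which is presumably how a phrase like "2-door" became just `door` in the tokenized post.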


### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [25]:
# Load the default stop words from NLTK
stop_words = set(stopwords.words('english'))

# Extend the stop words list with additional words
extra_stop_words = {"from", "subject", "re", "edu", "use"}
stop_words.update(extra_stop_words)

#### remove_stopwords( )

In [28]:
# Function to remove stop words from tokenized words
def remove_stopwords(texts):
    return [[word for word in text if word not in stop_words] for text in texts]

# Apply the stop words removal function to the tokenized words
tokenized_words_no_stopwords = remove_stopwords(tokenized_words)

# Display the first document after stop words removal as a sample
print(tokenized_words_no_stopwords[0])

['wheres', 'thing', 'car', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'organization', 'university', 'maryland', 'college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
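
Stop words are removed here before lemmatization, so the order of the two steps matters. The sketch below illustrates the effect with a hypothetical lemma lookup (`toy_lemmas`) standing in for spaCy. It is a likely explanation, not a verified one, for why `use` still shows up among the topic keywords printed later even though it is in the stop word list.

```python
stop_words_demo = {"from", "subject", "re", "edu", "use"}      # the extra stop words from this section
toy_lemmas = {"used": "use", "using": "use", "cars": "car"}    # hypothetical stand-in for spaCy lemmas

tokens = ["use", "used", "using", "cars"]

# Order used in this notebook: filter stop words first, lemmatize afterwards
filtered_then_lemmatized = [toy_lemmas.get(t, t) for t in tokens if t not in stop_words_demo]

# Alternative order: lemmatize first, then filter
lemmatized_then_filtered = [
    lemma for lemma in (toy_lemmas.get(t, t) for t in tokens)
    if lemma not in stop_words_demo
]

print(filtered_then_lemmatized)   # ['use', 'use', 'car']
print(lemmatized_then_filtered)   # ['car']
```

If inflected forms should be removed as well, a second stop word pass after lemmatization is one option.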


### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [30]:
# Build the bigram model
bigram = Phrases(tokenized_words_no_stopwords, min_count=5, threshold=100)  # min_count=5 filters infrequent words
bigram_phraser = Phraser(bigram)

# Function to apply bigram model to the tokenized words
def make_bigrams(texts):
    return [bigram_phraser[text] for text in texts]

# Apply the bigram model to the tokenized words without stop words
tokenized_words_bigrams = make_bigrams(tokenized_words_no_stopwords)


In [31]:
# Display a sample document to check bigrams
print(tokenized_words_bigrams[0])

['wheres', 'thing', 'car', 'nntp_posting', 'host', 'rac_wam', 'umd', 'organization', 'university', 'maryland_college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
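
To see what `threshold=100` means, the sketch below reproduces the scoring formula of gensim's default scorer as I understand it from its documentation: `(bigram_count - min_count) / worda_count / wordb_count * len_vocab`. All counts in the example are hypothetical.

```python
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Score a candidate bigram; higher means the pair co-occurs more than chance suggests."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

threshold = 100

# Two fairly rare words that almost always appear together: score is about 166.7
rare_pair = original_score(worda_count=40, wordb_count=30, bigram_count=25,
                           len_vocab=10_000, min_count=5)

# Two very common words that appear together just as often: score is about 0.07
common_pair = original_score(worda_count=2000, wordb_count=1500, bigram_count=25,
                             len_vocab=10_000, min_count=5)

print(rare_pair > threshold, common_pair > threshold)  # True False
```

A high threshold therefore keeps only pairs such as `nntp_posting` whose members rarely occur apart, while frequent words that merely happen to sit next to each other are left unjoined.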


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [34]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [42]:
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#### lemmatization( )

In [44]:
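def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag them in context
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out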

In [48]:
data_lemmatized = lemmatization(tokenized_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [49]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host', 'rac_wam', 'university', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [56]:
from gensim.corpora import Dictionary
# Create a dictionary from the tokenized and lemmatized text
dictionary = Dictionary(data_lemmatized)

# Display the first 10 words in the dictionary
print(list(dictionary.token2id.items())[:10])



In [None]:
# Filter out words that appear in fewer than 5 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Display the dictionary size after filtering
print(len(dictionary))

### Create Corpus

In [60]:
# Create a corpus from the dictionary and tokenized documents
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

# Display the first document in the corpus (in BoW format)
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1)]
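
Each pair in the output above is `(token_id, count)`. The bag-of-words idea behind `doc2bow` can be sketched in plain Python. The helpers `build_token2id` and `doc_to_bow` are hypothetical, and the order in which ids are handed out is an assumption of this sketch that may differ from gensim's.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["car", "door", "car"], ["engine", "car"]]
token2id = build_token2id(docs)
print(doc_to_bow(docs[0], token2id))  # [(0, 2), (1, 1)]
```

Word order is discarded entirely: only which tokens occur, and how often, is passed on to the topic model.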


### Filter low-frequency words

In [74]:
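# Filter out words that appear in fewer than 5 documents. filter_extremes also applies its
# defaults: words in more than 50% of documents are dropped (no_above=0.5) and at most
# 100,000 tokens are kept (keep_n=100000)
dictionary.filter_extremes(no_below=5)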

# Create the corpus using the filtered dictionary
corpus_filtered = [dictionary.doc2bow(text) for text in data_lemmatized]
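
The filtering rule can be sketched in plain Python as a document-frequency check. This is a simplified illustration: `filter_by_document_frequency` is a hypothetical helper, the `keep_n` cap is left out, and the toy documents and thresholds are chosen only to make the effect visible.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least no_below documents and at most no_above of all documents."""
    # Count each token once per document, no matter how often it repeats inside it
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [["car", "engine"], ["car", "door"], ["car", "game"], ["game", "team"]]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
print(kept)  # {'game'}
```

Here `car` is dropped for being too common (3 of 4 documents) and `engine`, `door`, and `team` for being too rare, which is the same trade-off the real call makes on the newsgroups vocabulary.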


### Create an index-to-word dictionary

In [77]:
# Create an index-to-word dictionary (token_id avoids shadowing the built-in id)
index_to_word = {token_id: word for word, token_id in dictionary.token2id.items()}

In [79]:
# Display the first 10 entries to check the mapping
print(dict(list(index_to_word.items())[:10]))

{0: 'addition', 1: 'body', 2: 'bring', 3: 'call', 4: 'car', 5: 'day', 6: 'door', 7: 'early', 8: 'engine', 9: 'enlighten'}


### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: the prior on the per-document topic distribution; it controls how sparse each document's topic mixture is
- **passes**: the total number of training passes through the corpus

In [82]:
from gensim.models import LdaModel

# Define parameters for the LDA model
num_topics = 10  # Number of topics to generate
chunksize = 2000  # Number of documents to be used in each training chunk
alpha = 'auto'  # 'auto' lets gensim learn an asymmetric prior from the corpus
passes = 10  # Total number of passes through the corpus during training

# Create the LDA model
lda_model = LdaModel(
    corpus=corpus_filtered,         # Filtered corpus
    id2word=dictionary,             # Dictionary to map word IDs back to words
    num_topics=num_topics,          # Number of topics
    chunksize=chunksize,            # Number of documents in each chunk
    alpha=alpha,                    # Alpha hyperparameter
    passes=passes,                  # Number of passes through the data
    random_state=42                 # For reproducibility
)
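
To build intuition for how `alpha` affects sparsity, the sketch below draws document-topic mixtures from a Dirichlet prior with a small and a large symmetric `alpha`. The two values (0.1 and 10.0) are hypothetical and chosen only to exaggerate the contrast; with `alpha='auto'` gensim learns its own values instead.

```python
import numpy as np

rng = np.random.default_rng(42)
num_topics = 10

# Small alpha: each document concentrates on a few topics (sparse mixtures)
sparse_docs = rng.dirichlet(np.full(num_topics, 0.1), size=1000)
# Large alpha: each document spreads its weight across most topics (dense mixtures)
dense_docs = rng.dirichlet(np.full(num_topics, 10.0), size=1000)

# Average share of the single largest topic in a document
sparse_peak = sparse_docs.max(axis=1).mean()
dense_peak = dense_docs.max(axis=1).mean()
print(f"small alpha: {sparse_peak:.2f}, large alpha: {dense_peak:.2f}")
```

The smaller the prior, the more each simulated document is dominated by one topic, which is usually the behaviour wanted for short, single-subject newsgroup posts.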

### Print the keywords in the 10 topics

In [84]:
for i, topic in lda_model.print_topics(num_topics=num_topics, num_words=10):
    print(f"Topic #{i + 1}: {topic}")

Topic #1: 0.015*"file" + 0.013*"program" + 0.011*"window" + 0.010*"use" + 0.008*"system" + 0.007*"image" + 0.007*"include" + 0.007*"software" + 0.006*"information" + 0.006*"also"
Topic #2: 0.013*"year" + 0.011*"team" + 0.009*"game" + 0.009*"think" + 0.009*"go" + 0.008*"well" + 0.008*"right" + 0.008*"get" + 0.007*"make" + 0.007*"article"
Topic #3: 0.032*"nntp_poste" + 0.029*"host" + 0.019*"article" + 0.010*"know" + 0.010*"reply" + 0.009*"get" + 0.008*"good" + 0.008*"look" + 0.008*"sale" + 0.007*"m"
Topic #4: 0.012*"get" + 0.011*"gun" + 0.009*"article" + 0.007*"people" + 0.007*"take" + 0.006*"make" + 0.006*"go" + 0.006*"time" + 0.005*"think" + 0.005*"m"
Topic #5: 0.028*"key" + 0.010*"encryption" + 0.008*"use" + 0.008*"government" + 0.008*"system" + 0.007*"security" + 0.007*"public" + 0.007*"chip" + 0.006*"bit" + 0.005*"clipper"
Topic #6: 0.015*"space" + 0.006*"power" + 0.006*"use" + 0.005*"launch" + 0.005*"wire" + 0.005*"ground" + 0.005*"high" + 0.005*"earth" + 0.004*"system" + 0.004*"de

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity indicates a better fit.

In [87]:
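# log_perplexity returns the per-word likelihood bound (a log-scale value, hence negative);
# gensim derives the perplexity estimate from it as 2 ** (-bound)
perplexity = lda_model.log_perplexity(corpus_filtered)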

print(f'Model Perplexity: {perplexity}')

Model Perplexity: -7.454445363004655
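
The negative number above is the per-word likelihood bound returned by `log_perplexity`, not a perplexity. gensim's documentation gives the conversion as perplexity = 2^(-bound), which can be applied directly to the reported value. The variable names below are chosen for this sketch.

```python
import math

per_word_bound = -7.454445363004655            # value reported by log_perplexity above
perplexity_estimate = 2 ** (-per_word_bound)   # conversion documented by gensim

print(f"Perplexity estimate: {perplexity_estimate:.1f}")
```

A bound closer to zero gives a lower perplexity, so when comparing models on the same corpus the less negative bound is the better one.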


### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in the topic.

In [89]:
from gensim.models import CoherenceModel

# Calculate topic coherence
coherence_model_lda = CoherenceModel(
    model=lda_model,                 # Your LDA model
    texts=data_lemmatized, # Preprocessed text data
    dictionary=dictionary,            # Dictionary used by the LDA model
    coherence='c_v'                   # Coherence measure
)

# Get the coherence score
coherence_score = coherence_model_lda.get_coherence()
print(f'Topic Coherence Score: {coherence_score}')

Topic Coherence Score: 0.5648285785926751


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [91]:
# Prepare the visualization
pyLDAvis.enable_notebook()  # For Jupyter notebooks, to display inline
lda_vis_data = gensimvis.prepare(lda_model, corpus_filtered, dictionary)

In [92]:
pyLDAvis.display(lda_vis_data)

In [97]:
pyLDAvis.save_html(lda_vis_data, 'lda_topic_visualization.html')