# Task 3  
## Syed Hamza Ali

# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data

These concepts can be used to interpret the main **themes** of a corpus and also make **semantic connections among words that co-occur together** frequently in various documents

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** where each **document is assumed to have a combination of topics** similar to a probabilistic latent semantic indexing model

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
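
The two distributions can be made concrete with a small generative sketch. The topic names, words, and probabilities below are made up for illustration and are not taken from the dataset, and the function name `generate_document` is hypothetical.

```python
import random

# Each topic is a distribution over words (made-up values).
topics = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.4, "launch": 0.4, "moon": 0.2},
}

# Each document is a distribution over topics (made-up values).
doc_topics = {"autos": 0.8, "space": 0.2}


def generate_document(doc_topics, topics, length, seed=0):
    """Draw a topic for every word position, then draw a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist), weights=list(word_dist.values()))[0])
    return words


print(generate_document(doc_topics, topics, length=8))
```

LDA works in the opposite direction: given only the words, it tries to recover plausible values for both distributions.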

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute, for each topic T:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
4. Repeat steps 2 and 3 until the assignments stabilize
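
The loop described above can be sketched in plain Python. This is a toy illustration, not gensim's training procedure (gensim uses online variational Bayes). The function name `toy_lda`, the smoothing constants `alpha` and `beta`, and the example documents are assumptions added here so that no probability is ever exactly zero.

```python
import random


def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Assign a topic to every word by repeated resampling (toy version)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign each word to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the current assignments.
    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d assigned to topic t
    topic_word = [dict() for _ in range(num_topics)]  # times word w is assigned to topic t
    topic_total = [0] * num_topics                    # all assignments to topic t
    for d, doc in enumerate(docs):
        for word, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][word] = topic_word[t].get(word, 0) + 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the current word out of the counts ("all other words").
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Step 2: weight proportional to P(T | D) * P(W | T), smoothed.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(word, 0) + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]

                # Step 3: reassign the word in proportion to those weights.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] = topic_word[new].get(word, 0) + 1
                topic_total[new] += 1

    return assignments, doc_topic


docs = [
    ["car", "engine", "door", "car"],
    ["engine", "car", "bumper"],
    ["orbit", "launch", "moon"],
    ["moon", "orbit", "launch", "orbit"],
]
assignments, doc_topic = toy_lda(docs, num_topics=2)
for doc, counts in zip(docs, doc_topic):
    print(doc, counts)
```

Each printed row shows how many words of a document ended up in each topic, which is the unnormalized form of P(T | D).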

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
#! pip install pyLDAvis gensim spacy
# installing gensim
#! pip install gensim==3.6.0

In [2]:
# modifying the package
!sed -i 's/from collections import Mapping/from collections.abc import Mapping/g' /usr/local/lib/python3.10/dist-packages/gensim/corpora/dictionary.py
!sed -i 's/from collections.abc import Mapping, defaultdict/from collections.abc import Mapping\nfrom collections import defaultdict/g' /usr/local/lib/python3.10/dist-packages/gensim/corpora/dictionary.py

### Import the libraries

In [65]:
import pyLDAvis.gensim
import gensim
import spacy
import pandas as pd
import json
from gensim.summarization import summarize
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [4]:
# Load the dataset from the URL
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"
dataset = pd.read_json(url)


### Preprocess the data

### Email, Newline, and Single Quotes Removal

In [5]:
# Remove e-mail addresses (any token containing "@")
dataset['content'] = dataset['content'].str.replace(r'\S*@\S*\s?', '', regex=True)
# Collapse runs of whitespace (including newlines) into a single space
dataset['content'] = dataset['content'].str.replace(r'\s+', ' ', regex=True)
# Remove single quotes
dataset['content'] = dataset['content'].str.replace(r"\'", '', regex=True)
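
To see what these three patterns do, here is a small sketch that applies the same regular expressions with the standard-library `re` module. The helper name `clean_text` and the sample string are made up for illustration.

```python
import re


def clean_text(text):
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop e-mail addresses
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace and newlines
    text = re.sub(r"\'", '', text)          # drop single quotes
    return text


sample = "From: alice@example.com\nSubject: it's   a test"
print(clean_text(sample))  # From: Subject: its a test
```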


### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [6]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True, min_len=1))
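
A generator produces one tokenized document at a time instead of building the whole list up front. The sketch below illustrates that behaviour with a rough regex-based stand-in tokenizer. `simple_tokens` and `sent_to_words_sketch` are hypothetical names, and the stand-in does not reproduce everything `simple_preprocess` does (for example, accent removal).

```python
import re


def simple_tokens(text, min_len=1):
    """Rough stand-in for a tokenizer: lowercase alphabetic runs only."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if len(tok) >= min_len]


def sent_to_words_sketch(sentences):
    for sentence in sentences:
        yield simple_tokens(str(sentence))


stream = sent_to_words_sketch(["Where is my CAR?", "It was a 2-door sports car."])
first = next(stream)      # only the first document has been tokenized so far
remaining = list(stream)  # consume the rest of the stream
print(first)
print(remaining)
```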


In [7]:
dw = list(sent_to_words(dataset['content']))


### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [8]:
from gensim.parsing.preprocessing import STOPWORDS
extra_stop_words = {'from', 'subject', 're', 'edu', 'use'}
stopwords = STOPWORDS.union(extra_stop_words)


#### remove_stopwords( )

In [9]:
def remove_stopwords(texts):
    return [[word for word in doc if word not in stopwords] for doc in texts]
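
A quick way to check the filter is to run the same list comprehension on a couple of toy documents. The stop-word set and documents below are made up, and the function takes the stop words as a parameter so that the sketch is self-contained.

```python
def remove_stopwords_sketch(texts, stopwords):
    # Keep only the words that are not in the stop-word set.
    return [[word for word in doc if word not in stopwords] for doc in texts]


toy_stopwords = frozenset({"the", "is", "from", "subject", "re", "edu", "use"})
toy_docs = [["subject", "the", "car", "is", "fast"], ["re", "engine", "from", "edu"]]
print(remove_stopwords_sketch(toy_docs, toy_stopwords))  # [['car', 'fast'], ['engine']]
```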


In [10]:
dw_no_sw = remove_stopwords(dw)


### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [11]:
from gensim.models import Phrases


#### make_bigrams( )

In [12]:
def make_bigrams(texts):
    bigram = Phrases(texts, min_count=1, threshold=100)
    return [bigram[line] for line in texts]
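
`Phrases` joins two adjacent tokens into one (for example `nntp_poste`) when a collocation score exceeds `threshold`. The sketch below assumes the commonly cited default scoring formula, (count(a, b) - min_count) × vocabulary size / (count(a) × count(b)). It is an illustration rather than gensim's implementation, and it takes the vocabulary size to be the number of distinct unigrams, which is a simplification. The name `bigram_scores` and the toy texts are hypothetical.

```python
from collections import Counter


def bigram_scores(texts, min_count=1):
    """Score every adjacent word pair; higher means a stronger collocation."""
    unigrams = Counter(word for doc in texts for word in doc)
    bigrams = Counter(pair for doc in texts for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # assumption: distinct unigrams only
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


toy_texts = [["new", "york", "city"], ["new", "york", "times"], ["big", "city"]]
scores = bigram_scores(toy_texts)
print(scores[("new", "york")])  # 1.25
print(scores[("big", "city")])  # 0.0
```

Pairs that co-occur more often than their individual frequencies would suggest get higher scores, so a high threshold such as 100 keeps only strong collocations.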


In [13]:
data_words_bigrams = make_bigrams(dw_no_sw)


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [14]:
#! pip install -U spacy


In [15]:
#! python -m spacy download en_core_web_sm
import spacy


In [17]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


#### lemmatization( )

In [18]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out


In [19]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])


In [20]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host_rac', 'wam_umd', 'organization', 'university', 'park', 'line', 'wondering_enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 's', 'early', 'door', 'small', 'addition', 'bumper_separate', 'rest', 'body', 'know', 'tellme_model', 'engine', 'spec', 'year', 'production', 'car', 'history', 'look', 'car', 'e', 'mail', 'thank', 'bring']]



### Create a Dictionary

In [52]:
from gensim.corpora import Dictionary
dictionary =  Dictionary(data_lemmatized)
print(dictionary)


Dictionary(81822 unique tokens: ['addition', 'body', 'bring', 'bumper_separate', 'car']...)


### Create Corpus

In [53]:
corpus = [dictionary.doc2bow(doc) for doc in data_lemmatized]
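
`doc2bow` turns each tokenized document into a bag of words: a list of `(token_id, count)` pairs. The idea can be sketched with the standard library as follows. The helper names are hypothetical, and the order in which ids are assigned here (first appearance) is an assumption that may differ from gensim's.

```python
from collections import Counter


def build_token_ids(docs):
    """Give every distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Count how often each known token id occurs in one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["car", "door", "car"], ["engine", "car"]]
token2id = build_token_ids(toy_docs)
print(token2id)                                         # {'car': 0, 'door': 1, 'engine': 2}
print([doc_to_bow(doc, token2id) for doc in toy_docs])  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```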


### Filter low-frequency words

In [54]:
# Filter out tokens that appear in fewer than 10 documents or in more than 50% of all documents
dictionary.filter_extremes(no_below=10, no_above=0.5)

# Rebuild the corpus, because filtering reassigns token ids
corpus = [dictionary.doc2bow(doc) for doc in data_lemmatized]
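
The effect of the two thresholds can be illustrated with a simplified document-frequency filter. This sketch is an approximation: the function name and toy documents are made up, and it ignores other options of `filter_extremes` such as `keep_n`.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most `no_above` of all docs."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["car", "engine"],
    ["car", "door"],
    ["car", "engine", "moon"],
    ["car", "orbit"],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['engine']
```

Here `car` is dropped for being too common (it appears in every document), while `door`, `moon`, and `orbit` are dropped for being too rare.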


### Create an index-to-word dictionary

In [56]:
# Accessing any item forces gensim to build the lazily created id2token mapping
temp = dictionary[0]
id2word = dictionary.id2token


### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: the prior on the per-document topic distribution, which affects how sparse each document's topic mixture is; `'auto'` learns an asymmetric prior from the data
- **passes**: the total number of passes over the corpus during training

In [67]:
from gensim.models import LdaModel

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=20,
    chunksize=100,
    alpha='auto',
    passes=20
)

### Print the top 10 keywords in each topic

In [68]:
results = model.print_topics(num_words=10)
for result in results:
    print(result)

(0, '0.065*"public" + 0.026*"government" + 0.025*"distribution" + 0.024*"ripem" + 0.022*"service" + 0.021*"private" + 0.019*"issue" + 0.019*"section" + 0.017*"encrypt" + 0.016*"newsgroup"')
(1, '0.043*"article" + 0.035*"time" + 0.033*"good" + 0.023*"way" + 0.020*"work" + 0.020*"sure" + 0.017*"case" + 0.017*"question" + 0.016*"read" + 0.016*"long"')
(2, '0.035*"mail" + 0.035*"program" + 0.029*"include" + 0.029*"post" + 0.027*"send" + 0.026*"number" + 0.026*"information" + 0.025*"e" + 0.021*"available" + 0.021*"list"')
(3, '0.060*"nntp_poste" + 0.040*"host" + 0.035*"article" + 0.033*"problem" + 0.030*"need" + 0.030*"reply" + 0.027*"m" + 0.022*"help" + 0.020*"university" + 0.020*"look"')
(4, '0.074*"window" + 0.055*"run" + 0.050*"software" + 0.036*"machine" + 0.029*"application" + 0.027*"pc" + 0.024*"version" + 0.022*"screen" + 0.022*"slow" + 0.020*"mode"')
(5, '0.755*"ax" + 0.007*"data" + 0.006*"socket" + 0.005*"gay" + 0.005*"mount" + 0.005*"terrorism" + 0.004*"pitcher" + 0.004*"correspo


## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

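Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity indicates a better fit. Note that gensim's `log_perplexity` returns a per-word likelihood bound (a negative number) rather than the perplexity itself, which is 2 raised to the negative of that bound.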

In [69]:
perplexity = model.log_perplexity(corpus)

In [70]:
print(f"Model Perplexity: {perplexity}")

Model Perplexity: -6.992745967341638
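
The value printed above is negative because it is a per-word likelihood bound, not the perplexity itself. Assuming the relationship given in gensim's documentation, perplexity = 2^(-bound), the conversion is a one-liner. The helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity."""
    return 2 ** (-per_word_bound)


# The bound reported by the notebook run above.
print(bound_to_perplexity(-6.992745967341638))
```

A bound closer to zero therefore corresponds to a lower perplexity.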



### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in the topic. `get_coherence()` reports the average of these scores over all topics.

In [71]:
from gensim.models import CoherenceModel

# Compute Topic Coherence
coherence_model = CoherenceModel(model=model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model.get_coherence()
print(coherence_lda)


0.5312985589609038


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [72]:
pyLDAvis.enable_notebook()
vis_data = gensimvis.prepare(model, corpus, dictionary)
pyLDAvis.display(vis_data)

