<a href="https://colab.research.google.com/github/EmiljaB/NLP_Projects/blob/News_Modeling/News_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Modeling
#### Emilja Beneja

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data

These concepts can be used to interpret the main **themes** of a corpus and also make **semantic connections among words that co-occur together** frequently in various documents

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** where each **document is assumed to have a combination of topics** similar to a probabilistic latent semantic indexing model

In simple words, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
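
The two distributions above can be illustrated with a toy generative sketch. The topic names, vocabularies, and probabilities below are made up for illustration and are not drawn from the newsgroups data. The sketch also takes both distributions as given, whereas LDA has to infer them from the documents.

```python
import random

# Hypothetical topic-word distributions: each topic is a distribution over words
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "dealer": 0.2},
    "space": {"shuttle": 0.4, "launch": 0.4, "orbit": 0.2},
}

# Hypothetical document-topic distribution: the document is a mixture of topics
doc_topic = {"autos": 0.7, "space": 0.3}


def generate_document(doc_topic, topic_word, length, seed=0):
    """Generate words by drawing a topic first, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        dist = topic_word[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


sample_doc = generate_document(doc_topic, topic_word, length=8)
print(sample_doc)
```

With these weights most generated words should come from the "autos" vocabulary, with an occasional "space" word mixed in.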

### LDA Algorithm

1. For each document, **randomly initialize each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)
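
The reassignment loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the inference procedure gensim uses: the function name `toy_lda`, the `smoothing` constant (added so that no probability is exactly zero), and the three tiny documents are all hypothetical.

```python
import random


def toy_lda(docs, num_topics, iterations=20, smoothing=0.1, seed=0):
    """Toy sampler: repeatedly reassign each word using P(T|D) * P(W|T)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                weights = []
                for t in range(num_topics):
                    # P(T|D): share of the other words in document d assigned to topic t
                    in_doc = sum(
                        1 for j, a in enumerate(assignments[d]) if j != i and a == t
                    )
                    p_t_d = (in_doc + smoothing) / (len(doc) - 1 + num_topics * smoothing)

                    # P(W|T): share of topic t's assignments, over all documents,
                    # that belong to this word (the current occurrence is excluded)
                    topic_total = 0
                    topic_word = 0
                    for d2, doc2 in enumerate(docs):
                        for j, other in enumerate(doc2):
                            if (d2, j) != (d, i) and assignments[d2][j] == t:
                                topic_total += 1
                                if other == word:
                                    topic_word += 1
                    p_w_t = (topic_word + smoothing) / (topic_total + vocab_size * smoothing)

                    weights.append(p_t_d * p_w_t)

                # Step 3: draw the new topic in proportion to P(T|D) * P(W|T)
                assignments[d][i] = rng.choices(range(num_topics), weights=weights)[0]
    return assignments


docs = [
    ["car", "engine", "car", "dealer"],
    ["shuttle", "launch", "orbit", "launch"],
    ["engine", "car", "dealer"],
]
assignments = toy_lda(docs, num_topics=2)
for doc, topics in zip(docs, assignments):
    print(list(zip(doc, topics)))
```

The counts are recomputed from scratch for every word, which keeps the sketch close to the written steps but makes it far too slow for a real corpus.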

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [None]:
!pip install gensim nltk




In [None]:
!pip install pyLDAvis gensim spacy

# Install spaCy's English language model
!python -m spacy download en_core_web_sm


Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook

### Import the libraries



In [None]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import pandas as pd

# Load spaCy's English language model
nlp = spacy.load('en_core_web_sm')


### Upload and load the data

In [None]:
df = pd.read_json("newsgroups.json")
df.head()




Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


### Preprocess the data

In [None]:
# Initialize 'processed' column by copying 'content' column
df['processed'] = df['content']




### Stop words removal

In [None]:
def remove_stop_words(text):
    doc = nlp(text)
    return ' '.join([token.text for token in doc if not token.is_stop])

df['processed'] = df['content'].apply(remove_stop_words)
df['processed'].head()




Unnamed: 0,processed
0,: lerxst@wam.umd.edu ( thing ) \n Subject : ca...
1,: guykuo@carson.u.washington.edu ( Guy Kuo ) \...
2,: twillis@ec.ecn.purdue.edu ( Thomas E Willis ...
3,: jgreen@amber ( Joe Green ) \n Subject : : We...
4,: jcm@head-cfa.harvard.edu ( Jonathan McDowell...


### Email Removal

In [None]:
import re

def remove_emails(text):
    return re.sub(r'\S+@\S+', '', text)

df['processed'] = df['processed'].apply(remove_emails)
df['processed'].head()




Unnamed: 0,processed
0,: ( thing ) \n Subject : car ! ? \n Nntp - Po...
1,: ( Guy Kuo ) \n Subject : SI Clock Poll - Fi...
2,: ( Thomas E Willis ) \n Subject : PB questio...
3,: ( Joe Green ) \n Subject : : Weitek P9000 ?...
4,: ( Jonathan McDowell ) \n Subject : : Shuttl...
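
As a quick check of what the pattern `\S+@\S+` matches, the sketch below applies the same function to a made-up string (the addresses are hypothetical). Because `\S+` matches any run of non-whitespace characters, punctuation attached to an address is removed along with it.

```python
import re


def remove_emails(text):
    # Remove every whitespace-delimited chunk that contains an "@"
    return re.sub(r'\S+@\S+', '', text)


sample = "From: jane@example.com (Jane)\nContact <joe@example.org>, please."
cleaned_sample = remove_emails(sample)
print(repr(cleaned_sample))
```

The angle brackets and trailing comma around the second address disappear too, and the surrounding spaces are left behind as double spaces.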


### Non-Alphabetic Words Removal

In [None]:
import re

def clean_text(text):
    # Check if the input is a string; if not, return an empty string
    if not isinstance(text, str):
        return ''

    # Remove non-alphabetic characters (keeps spaces between words)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove specific unwanted terms that appear frequently
    unwanted_terms = ["subject", "organization", "nntp", "posting"]
    text = ' '.join([word for word in text.split() if word.lower() not in unwanted_terms])

    return text

df['processed'] = df['processed'].apply(clean_text)
df['processed'].head()




Unnamed: 0,processed
0,thing car Host racwamumdedu University Marylan...
1,Guy Kuo SI Clock Poll Final Summary Final SI c...
2,Thomas E Willis PB questions Purdue University...
3,Joe Green Weitek P Harris Computer Systems Div...
4,Jonathan McDowell Shuttle Launch Question Smit...
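
To see the effect of this step on a single string, the sketch below restates `clean_text` in a condensed form and applies it to the visible start of row 3 from the previous output. Digits are dropped along with punctuation, which is why `P9000` survives only as `P`.

```python
import re

UNWANTED_TERMS = {"subject", "organization", "nntp", "posting"}


def clean_text(text):
    """Drop non-alphabetic characters and a few frequent header terms."""
    if not isinstance(text, str):
        return ''
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return ' '.join(w for w in text.split() if w.lower() not in UNWANTED_TERMS)


row_start = ": ( Joe Green ) \n Subject : : Weitek P9000 ?"
cleaned_row = clean_text(row_start)
print(cleaned_row)
```

If numeric tokens such as model numbers matter for the topics, the character class would need to keep digits as well.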


### Tokenize

In [None]:
def tokenize_text(text):
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Filter out tokens that are whitespace, punctuation, or unwanted terms
        if not token.is_space and not token.is_punct and token.text.lower() not in ["subject", "re", "fw"]:
            tokens.append(token.text.lower())
    return tokens

df['processed'] = df['processed'].apply(tokenize_text)
df['processed'].head()




Unnamed: 0,processed
0,"[thing, car, host, racwamumdedu, university, m..."
1,"[guy, kuo, si, clock, poll, final, summary, fi..."
2,"[thomas, e, willis, pb, questions, purdue, uni..."
3,"[joe, green, weitek, p, harris, computer, syst..."
4,"[jonathan, mcdowell, shuttle, launch, question..."


### Lowercase

In [None]:
df['processed'] = df['processed'].apply(lambda x: [token.lower() for token in x])
df['processed'].head()




Unnamed: 0,processed
0,"[thing, car, host, racwamumdedu, university, m..."
1,"[guy, kuo, si, clock, poll, final, summary, fi..."
2,"[thomas, e, willis, pb, questions, purdue, uni..."
3,"[joe, green, weitek, p, harris, computer, syst..."
4,"[jonathan, mcdowell, shuttle, launch, question..."


### BiGrams & TriGrams

In [None]:
from gensim.models import Phrases

# Create bigram and trigram models
bigram = Phrases(df['processed'], min_count=5, threshold=100)
trigram = Phrases(bigram[df['processed']], threshold=100)

# Apply the bigram and trigram models
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

df['processed'] = df['processed'].apply(lambda x: trigram_mod[bigram_mod[x]])
df['processed'].head()




Unnamed: 0,processed
0,"[thing, car, host_racwamumdedu, university, ma..."
1,"[guy_kuo, si, clock, poll, final, summary, fin..."
2,"[thomas, e, willis, pb, questions, purdue_univ..."
3,"[joe, green, weitek_p, harris, computer, syste..."
4,"[jonathan, mcdowell, shuttle_launch, question,..."
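
The `min_count` and `threshold` arguments control which word pairs are merged. A minimal sketch of the scoring rule follows, assuming the default scorer described in the gensim documentation, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. The function name `phrase_score` and all counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram; higher means the pair co-occurs more than expected."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


THRESHOLD = 100  # same value as passed to Phrases above

# Two rare words that almost always appear together
rare_pair = phrase_score(count_a=20, count_b=25, count_ab=18, vocab_size=10000, min_count=5)

# Two common words that appear together only occasionally
common_pair = phrase_score(count_a=2000, count_b=1500, count_ab=30, vocab_size=10000, min_count=5)

print(rare_pair, rare_pair > THRESHOLD)
print(common_pair, common_pair > THRESHOLD)
```

Under these made-up counts only the rare pair clears the threshold, which matches the kind of merges seen above, such as `shuttle_launch` and `purdue_univ...`.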


### Lemmatization

In [None]:
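def lemmatize_text(text):
    doc = nlp(' '.join(text))
    # '-PRON-' was the pronoun lemma in spaCy 2.x; spaCy 3.x models no longer emit it,
    # so with en_core_web_sm 3.7.1 this filter leaves the tokens unchanged
    return [token.lemma_ for token in doc if token.lemma_ != '-PRON-']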

df['processed'] = df['processed'].apply(lemmatize_text)
df['processed'].head()




Unnamed: 0,processed
0,"[thing, car, host_racwamumdedu, university, ma..."
1,"[guy_kuo, si, clock, poll, final, summary, fin..."
2,"[thomas, e, willis, pb, question, purdue_unive..."
3,"[joe, green, weitek_p, harris, computer, syste..."
4,"[jonathan, mcdowell, shuttle_launch, question,..."


### Create a Dictionary for the Document

In [None]:
from gensim import corpora

# Create a dictionary for the 'processed' text data
dictionary = corpora.Dictionary(df['processed'])




### Filter Low-Frequency Words

In [None]:
# Filter words that are too rare or too common
dictionary.filter_extremes(no_below=5, no_above=0.5)
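
The effect of `no_below` and `no_above` can be sketched with document frequencies in plain Python. This is an illustration of the idea only: the helper name `filter_by_document_frequency` and the four tiny documents are hypothetical, and gensim's additional `keep_n` limit is ignored.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Count each token once per document it appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {token for token, n in doc_freq.items() if no_below <= n <= max_docs}


toy_docs = [
    ["car", "engine", "the"],
    ["car", "the", "dealer"],
    ["shuttle", "the", "launch"],
    ["shuttle", "orbit"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here "the" is dropped for being too common (three of four documents) and the single-document words are dropped for being too rare.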




### Create an Index-to-Word Dictionary

In [None]:
# Create a mapping from index to word (avoid shadowing the built-in `id`)
index_to_word = {idx: word for word, idx in dictionary.token2id.items()}




### Train the Topic Model

In [None]:
from gensim.models import LdaModel

# Create a bag-of-words representation for each document
corpus = [dictionary.doc2bow(text) for text in df['processed']]

# Train the LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, random_state=42, passes=10)
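
The bag-of-words representation built by `doc2bow` is a list of `(token_id, count)` pairs per document. A minimal sketch of the same idea in plain Python, where the helper `to_bow` and the small `token2id` mapping are hypothetical stand-ins for the gensim dictionary:

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Count known tokens and return (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"car": 0, "engine": 1, "shuttle": 2}
bow = to_bow(["car", "engine", "car", "moon"], token2id)
print(bow)
```

Tokens missing from the dictionary, such as those removed by `filter_extremes`, are simply skipped, and word order is discarded.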




### Predict on the Dataset

In [None]:
# Predict topics for each document
df['topics'] = df['processed'].apply(lambda x: lda_model[dictionary.doc2bow(x)])
df[['content', 'topics']].head()




Unnamed: 0,content,topics
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,"[(0, 0.52360725), (5, 0.06840608), (7, 0.39204..."
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,"[(2, 0.7212475), (3, 0.047728073), (6, 0.16559..."
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,"[(0, 0.5383933), (2, 0.29267898), (3, 0.046087..."
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,"[(0, 0.3160432), (2, 0.5339909), (5, 0.13594544)]"
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,"[(0, 0.7124134), (2, 0.061207075), (3, 0.14005..."
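
Each entry in `topics` is a list of `(topic_id, probability)` pairs, so a single label per document can be obtained by taking the pair with the highest probability. A minimal sketch using the values shown for row 3 above; the helper name `dominant_topic` is hypothetical.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


row_3 = [(0, 0.3160432), (2, 0.5339909), (5, 0.13594544)]
best_topic, best_prob = dominant_topic(row_3)
print(best_topic, best_prob)
```

On the DataFrame this could be applied with `df['topics'].apply(dominant_topic)`. The lists above do not always cover all 10 topics, which is consistent with gensim omitting topics whose probability falls below its minimum threshold.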


### Evaluate the Topic Model

#### a. Model Perplexity

In [None]:
# log_perplexity returns the per-word likelihood bound (a log-scale value), not the perplexity itself
print('\nPerplexity: ', lda_model.log_perplexity(corpus))





Perplexity:  -8.231719920928807
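
The printed value is negative because it is a per-word likelihood bound rather than a perplexity. A minimal sketch of the conversion follows, assuming the convention gensim uses when it logs its perplexity estimate, `2^(-bound)`. The helper name `perplexity_from_bound` is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)


bound = -8.231719920928807  # value printed by lda_model.log_perplexity(corpus) above
perplexity = perplexity_from_bound(bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

For the bound above this gives a perplexity of about 300. Because the model was evaluated on its own training corpus, the figure says little about how well it generalizes to unseen documents.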


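Lower perplexity implies better generalization in topic models. Because `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, values closer to zero are better here.

#### b. Topic Coherence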

In [None]:
from gensim.models import CoherenceModel

# Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=df['processed'], dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)





Coherence Score:  0.565726383475668


Overall, a `c_v` coherence score of about 0.57 suggests that the topics are moderately coherent; higher values indicate topics whose top words co-occur more consistently.

### Visualize the Topics with pyLDAvis

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, dictionary)
vis


