# Natural Language Processing (NLP) and Topic Modeling

- LDA (Latent Dirichlet Allocation)
- NMF (Non-Negative Matrix Factorization) for topic modeling


- Learning Outcomes
    - Generate bite-size blocks of code so that you can see how different steps fit together
    - Deepen your understanding of NLP and LDA/topic modeling
    - Practice using AI for repetitive tasks


- Use ChatGPT to answer the following questions before (or while) you perform the next tasks.
    - What is the TfidfVectorizer?
    - How is TF-IDF calculated?
    - How does the CountVectorizer differ from the TfidfVectorizer?
    - How can I instantiate a TfidfVectorizer with the following parameters:
        - max_df = 0.95
        - min_df = 2
        - max_features = no_features
        - stop_words = 'english'
    - What are the methods of the TfidfVectorizer object?
    - What is NMF?
    - What is LDA?

In [None]:
# import TfidfVectorizer and CountVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# import fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

# import NMF and LatentDirichletAllocation from sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

Create a variable called no_features and set its value to 100.

Then, create a variable no_topics and set its value to 100.

NMF
Instruction
Instantiate a TfidfVectorizer with the following parameters:

- max_df = 0.95
- min_df = 2
- max_features = no_features
- stop_words = 'english'

Instruction
Then, use the fit_transform method of TfidfVectorizer to transform the documents.

Instruction
Next, get the feature names from the TfidfVectorizer.

Instruction
Finally, instantiate NMF and call its fit_transform method on the transformed data.

LDA with Sklearn
Instruction
Instantiate a CountVectorizer with the following parameters:

- max_df = 0.95
- min_df = 2
- max_features = no_features
- stop_words = 'english'

Instruction
Then, use the fit_transform method of CountVectorizer to transform documents.

Instruction
Next, get the feature names from the CountVectorizer.

Instruction
Next, instantiate LatentDirichletAllocation and fit it to the transformed data.

Instruction
Finally, create a function display_topics that displays the top words in each topic and works for both models.

The expected outputs include:

- The top 10 words from each topic of the NMF model
- The top 10 words from each topic of the LDA model
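
A minimal sketch of such a function follows. It relies only on the fact that fitted NMF and LatentDirichletAllocation models both expose a `components_` array with one row per topic and one column per vocabulary term. The helper `top_words` and the `fake_model` object are hypothetical names introduced here so the example runs without fitting anything; in the notebook you would pass your fitted models and their feature names.

```python
from types import SimpleNamespace

import numpy as np


def top_words(model, feature_names, no_top_words):
    """Return, for each topic, its highest-weighted words in descending order."""
    topics = []
    for topic in model.components_:
        # argsort is ascending, so reverse it and keep the first no_top_words indices
        top_idx = np.argsort(topic)[::-1][:no_top_words]
        topics.append([feature_names[i] for i in top_idx])
    return topics


def display_topics(model, feature_names, no_top_words):
    """Print one line per topic listing its top words."""
    for topic_idx, words in enumerate(top_words(model, feature_names, no_top_words)):
        print(f"Topic {topic_idx}: {' '.join(words)}")


# Hypothetical stand-in for a fitted model: anything with a components_ attribute works
fake_model = SimpleNamespace(
    components_=np.array([
        [0.1, 0.9, 0.5, 0.0],
        [0.8, 0.0, 0.1, 0.6],
    ])
)
fake_feature_names = ["game", "orbit", "space", "team"]

display_topics(fake_model, fake_feature_names, 2)
```

With fitted models, calls along the lines of `display_topics(nmf, tfidf_feature_names, 10)` and `display_topics(lda, tf_feature_names, 10)` would produce the two expected outputs, assuming those are the variable names you chose.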


Complete the guide at https://www.linkedin.com/pulse/nlp-a-complete-guide-topic-modeling-latent-dirichlet-sahil-m

Install NLP Libraries
Note
You do not need to download all the libraries and packages mentioned in this exercise. Only selected libraries and packages will be relevant to you depending upon the project you choose to do for this course (choosing the project is the next task you do after this exercise). Feel free to discuss what packages are most relevant for your project with mentors or instructors.

There are several libraries available for NLP tasks. For working with LLMs specifically, you might be interested in transformers by Hugging Face, which provides interfaces to pre-trained models like GPT, BERT, and others. To install it, run:


$ conda install transformers

Install Additional Libraries (Optional)
Depending on your NLP tasks, you may need additional libraries. Some common ones include:

NLTK: A leading platform for building Python programs to work with human language data.

spaCy: An industrial-strength NLP library for Python.

Gensim: A robust semantic modeling library, useful for topic modeling and document similarity analysis.

You can install these libraries using conda:



$ conda install nltk spacy gensim


Download Pre-Trained Models (Optional)
Some tasks may require downloading pre-trained models or data. For instance, with transformers, you can download models directly in your code:



from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


For NLTK, you might need to download certain data packages:



import nltk
nltk.download('popular')

For spaCy, downloading a language model is often required:


$ python -m spacy download en_core_web_sm


Write Your NLP Code
Now you are ready to write Python code for your NLP task. Here's a very simple example using the transformers library to generate text:



from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode some input text
input_ids = tokenizer.encode('As an example of NLP, ', return_tensors='pt')

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate text until the output length (which includes the given input) reaches 50 tokens
output = model.generate(input_ids, max_length=50, pad_token_id=tokenizer.eos_token_id)

# Decode the output token ids
decoded_output = tokenizer.decode(output[0])

print(decoded_output)


Remember that working with LLMs, especially when fine-tuning or training them, may require significant computational resources. For such cases, you might consider using cloud-based services with dedicated hardware like GPUs.

Install PyTorch
Instruction
Install PyTorch by following the instructions on the PyTorch Get Started page.