<a href="https://colab.research.google.com/github/LondheShubham153/natural_language_processing/blob/main/topic_modelling_using_lda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Packages and Imports

Suppress Warnings

In [None]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
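
The cell above silences warnings for the whole session by replacing `warnings.warn`. As a scoped alternative, here is a minimal sketch using the standard library's `warnings.catch_warnings`; the helper names `run_quietly` and `noisy` are hypothetical and only serve the illustration.

```python
import warnings

def run_quietly(func, *args, **kwargs):
    """Call func with warnings ignored, then restore the previous filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)

def noisy():
    # Stand-in for a library call that emits a warning
    warnings.warn("this would normally be printed", UserWarning)
    return "done"

result = run_quietly(noisy)
print(result)
```

Warnings raised outside `run_quietly` are still shown, which can help when debugging later cells.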

In [None]:
! pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (PEP 517) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136898 sha256=b66a4eb2b593582b01d7a786b827b4f79458666d8cc0d21cf1a2166ca87da06e
  Stored in directory: /root/.cache/pip/wheels/c9/21/f6/17bcf2667e8a68532ba2fbf6d5c72fdf4c7f7d9abfa4852d2f
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.17 pyLDAvis-3.3.1


In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer,TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn
from nltk.corpus import stopwords
import string
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Loading Dataset

In [None]:
newsgroups = fetch_20newsgroups(remove=('headers','footers','quotes'))
docs_raw = newsgroups.data

In [None]:
print(f"The total number of Documents from the News Group Dataset is: {len(docs_raw)}")

The total number of Documents from the News Group Dataset is: 11314


## Case 1: CountVectorizer with Custom Function, TFIDF and BOW

Creating Custom Preprocessor

In [None]:
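class PreProcessor:
  """Custom analyzer: strips punctuation, lowercases, and removes English stopwords."""

  def __init__(self):
    # Load the stopword list once as a set; this avoids rebuilding it
    # for every token and makes membership checks O(1)
    self.stop_words = set(stopwords.words("english"))

  def remove_punctuation(self, record):
    cleaned_str = [char for char in record if char not in string.punctuation]
    return ''.join(cleaned_str)

  def normalize_sentences(self, sentences):
    # split() with no argument splits on any whitespace (spaces, tabs,
    # newlines) and does not produce empty tokens
    words = sentences.split()
    return [word.lower() for word in words]

  def remove_stopwords(self, words):
    return [word for word in words if word not in self.stop_words]

  def process(self, record):
    # Remove punctuation
    sentences = self.remove_punctuation(record)

    # Normalize: tokenize on whitespace and lowercase
    norm_words = self.normalize_sentences(sentences)

    # Remove stopwords
    final_words = self.remove_stopwords(norm_words)

    return final_words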

In [None]:
processor = PreProcessor()
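
To make the three preprocessing steps concrete, here is a small self-contained sketch of the same pipeline applied to one sentence. It assumes a tiny hypothetical stopword set (`STOP_WORDS`) instead of NLTK's English list so that it runs without any downloads; the function name `preprocess` is also illustrative.

```python
import string

# Hypothetical mini stopword list for illustration only;
# the notebook itself uses NLTK's English stopwords
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(record, stop_words=STOP_WORDS):
    # 1. Remove punctuation characters
    no_punct = record.translate(str.maketrans("", "", string.punctuation))
    # 2. Split on whitespace and lowercase each token
    tokens = [word.lower() for word in no_punct.split()]
    # 3. Drop stopwords
    return [word for word in tokens if word not in stop_words]

tokens = preprocess("The GPU, in short, is a kind of parallel processor!")
print(tokens)  # ['gpu', 'short', 'kind', 'parallel', 'processor']
```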

Using the preprocessor to create the BOW and TFIDF objects

In [None]:
word_vector = CountVectorizer(analyzer=processor.process)
final_word_vocab = word_vector.fit(docs_raw)

In [None]:
bag_of_words = final_word_vocab.transform(docs_raw)

In [None]:
tfIdf_obj = TfidfTransformer().fit(bag_of_words)
final_feature = tfIdf_obj.transform(bag_of_words)
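
The BOW-then-TFIDF transformation can be worked through by hand on a toy corpus. The sketch below assumes scikit-learn's default `TfidfTransformer` settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization); the documents and the helper `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative only)
docs = [
    ["space", "launch", "orbit"],
    ["space", "shuttle"],
    ["hockey", "game", "space"],
]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    counts = Counter(doc)                       # bag-of-words counts
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))   # L2 norm of the row
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in docs]
for term in vocab:
    print(f"{term:8s} idf={idf[term]:.3f}")
```

A term that occurs in every document, such as `space` here, gets the minimum IDF of 1, while rarer terms are weighted more heavily.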

Creating a TFIDF object for LDA

In [None]:
tfidfVectorizerNew = TfidfVectorizer(**word_vector.get_params())
tfidfObjectNew = tfidfVectorizerNew.fit_transform(docs_raw)
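
Unpacking `get_params()` works here because every constructor parameter of `CountVectorizer` is also accepted by `TfidfVectorizer`. The following sketch checks this on a small, arbitrarily configured vectorizer; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_vec = CountVectorizer(lowercase=True, min_df=1)

# Dictionary of constructor parameters, e.g. {'analyzer': 'word', ...}
shared = count_vec.get_params()

# Reuse the same configuration for the TFIDF vectorizer
tfidf_vec = TfidfVectorizer(**shared)

# Parameters that exist only on the TFIDF side
tfidf_only = sorted(set(tfidf_vec.get_params()) - set(shared))
print(tfidf_only)
```

The extra parameters control the IDF weighting and normalization, so they keep their defaults when the configuration is copied this way.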

In [None]:
lda_tf_custom = LatentDirichletAllocation(n_components=20 , random_state=0)
lda_tf_custom.fit(tfidfObjectNew)

LatentDirichletAllocation(n_components=20, random_state=0)
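
Before opening the interactive view, it can help to inspect a fitted model directly. This is a minimal sketch on a made-up four-document corpus, not the 20 Newsgroups data; `top_words` is a hypothetical helper that reads the topic-term weights from `components_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (illustrative only)
toy_docs = [
    "space shuttle launch orbit nasa",
    "nasa orbit satellite launch space",
    "hockey game team player season",
    "team season player hockey score",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document

# Terms ordered by column index, built from the fitted vocabulary
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

def top_words(model, terms, n=3):
    """Return the n highest-weighted terms for each topic."""
    return [[terms[i] for i in row.argsort()[::-1][:n]]
            for row in model.components_]

for k, words in enumerate(top_words(lda, terms)):
    print(f"Topic {k}: {', '.join(words)}")
```

Each row of `doc_topics` sums to 1, and `components_` has one row per topic and one column per vocabulary term.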

Initializing Notebook

In [None]:
pyLDAvis.enable_notebook()


In [None]:
pyLDAvis.sklearn.prepare(lda_tf_custom,tfidfObjectNew,word_vector)

## Case 2: CountVectorizer without Custom Function

In [None]:
tf_vectorizer = CountVectorizer(stop_words='english',
                                lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}',
                                max_df=0.5,
                                min_df=10,
                                strip_accents='unicode'
                                )

tfObject = tf_vectorizer.fit_transform(docs_raw)
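
The effect of `token_pattern`, `min_df`, and `max_df` can be sketched with the standard library alone. The example below assumes the same regular expression as above and mimics document-frequency filtering on a made-up corpus; `filter_by_doc_freq` is a hypothetical helper, and the thresholds are lowered so that the toy data shows both cut-offs.

```python
import re
from collections import Counter

# Same pattern as in the vectorizer: runs of at least three ASCII letters
TOKEN_PATTERN = re.compile(r'\b[a-zA-Z]{3,}')

sample = "GPU v2 costs $300 in 1993, ok? Re: x86 benchmarks"
sample_tokens = TOKEN_PATTERN.findall(sample.lower())
print(sample_tokens)  # ['gpu', 'costs', 'benchmarks']

def filter_by_doc_freq(tokenized_docs, min_df=2, max_df=0.5):
    """Keep terms found in at least min_df documents (absolute count)
    and in at most a max_df fraction of all documents."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, count in df.items()
                  if min_df <= count <= max_df * n_docs)

toy_docs = [
    "New GPU benchmarks posted",
    "GPU prices and benchmarks",
    "Hockey scores posted today",
    "GPU news: hockey GPU giveaway",
]
tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in toy_docs]
kept_terms = filter_by_doc_freq(tokenized)
print(kept_terms)  # ['benchmarks', 'hockey', 'posted']
```

Short or numeric tokens never enter the vocabulary, terms seen in only one document are dropped as too rare, and `gpu` is dropped because it appears in more than half of the documents.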

Creating TFIDF object

In [None]:
tfidfVectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
tfidfObject = tfidfVectorizer.fit_transform(docs_raw)

Applying LDA

20 Components

In [None]:
lda_tf = LatentDirichletAllocation(n_components=20 , random_state=0)
lda_tf.fit(tfObject)

LatentDirichletAllocation(n_components=20, random_state=0)

10 Components

In [None]:
lda_tfIDF = LatentDirichletAllocation(n_components=10 , random_state=0)
lda_tfIDF.fit(tfidfObject)

LatentDirichletAllocation(random_state=0)

PyLDAVis Visualizations

Visualizing 20 Topics

In [None]:
pyLDAvis.sklearn.prepare(lda_tf,tfObject,tf_vectorizer)

Visualizing 10 Topics

In [None]:
# lda_tfIDF was fit on the TFIDF matrix, so pass that matrix and its vectorizer
pyLDAvis.sklearn.prepare(lda_tfIDF,tfidfObject,tfidfVectorizer)

In [None]:
pyLDAvis.sklearn.prepare(lda_tf_custom,tfidfObjectNew,word_vector)

## Conclusion

- The model performed well without the custom preprocessing function.
- Creating the TFIDF object took a long time.
- In the pyLDAvis view for the custom function with TFIDF and BOW, the generated topics overlapped.
- Using CountVectorizer without any custom preprocessing function gave better results.
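
Topic overlap was judged visually above. One rough way to put a number on it is the mean Jaccard similarity between the top-word sets of every pair of topics. This is only a sketch: the word lists below are invented for illustration and are not results from this notebook, and `jaccard` and `mean_topic_overlap` are hypothetical helpers.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two word collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def mean_topic_overlap(topics):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Invented top-word lists (not taken from the fitted models)
overlapping = [
    ["people", "think", "know", "time"],
    ["people", "know", "like", "time"],
    ["think", "like", "people", "good"],
]
distinct = [
    ["space", "nasa", "orbit", "launch"],
    ["hockey", "team", "game", "season"],
    ["windows", "file", "program", "drive"],
]

print(f"overlapping: {mean_topic_overlap(overlapping):.3f}")
print(f"distinct:    {mean_topic_overlap(distinct):.3f}")
```

A value near 0 means the topics share almost no top words, while higher values indicate topics that are hard to tell apart.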