
In [1]:
import numpy as np
from gensim import corpora, models, similarities
from sklearn.datasets import fetch_20newsgroups as getdata
from sklearn.model_selection import train_test_split
import re

In [2]:
corpus = getdata(subset='train', remove=('headers', 'footers', 'quotes'))

X = corpus.data
y = corpus.target
y_names = corpus.target_names

In [3]:
len(X)

11314

In [4]:
print(X[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [5]:
y_names[y[0]]

'rec.autos'

In [6]:
y_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True)

In [8]:
# tokenization

stoplist = ['a', 'the', 'of', 'and', 'for', 'to', 'in']
texts = [[word for word in re.split(r'\W+', doc.lower()) if word not in stoplist] for doc in X_train]   # raw string avoids the invalid '\W' escape warning

In [9]:
# print(texts[0])

In [10]:
frequency = {}
for text in texts:
  for token in text:
    frequency[token] = frequency.get(token, 0) + 1

threshold = 10
processed_corpus = [[token for token in text if frequency[token] >= threshold] for text in texts]
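
The frequency filter above can also be written with `collections.Counter`. The following is a minimal sketch on toy data, not the notebook's corpus; the names `toy_texts`, `toy_frequency`, `toy_threshold`, and `toy_processed` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Toy tokenized documents standing in for `texts` (hypothetical data).
toy_texts = [
    ["car", "engine", "car"],
    ["engine", "door"],
    ["car", "bumper"],
]

# Count every token across the whole corpus in one pass.
toy_frequency = Counter(token for text in toy_texts for token in text)

# Keep only tokens seen at least `toy_threshold` times corpus-wide.
toy_threshold = 2
toy_processed = [
    [token for token in text if toy_frequency[token] >= toy_threshold]
    for text in toy_texts
]

print(toy_processed)
```

Rare tokens ("door", "bumper") are dropped, while repeated occurrences of a kept token stay in the document.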

In [11]:
token_dict = corpora.Dictionary(processed_corpus)
print(token_dict)

Dictionary<13201 unique tokens: ['', '1', '12', '253', '33']...>


In [12]:
# print(token_dict.token2id)

In [13]:
# Bag of Words 
bow_corpus = [token_dict.doc2bow(text) for text in processed_corpus]

In [14]:
print(bow_corpus[10])   #(id, count)

[(0, 2), (18, 1), (19, 1), (41, 1), (55, 1), (116, 1), (129, 1), (162, 1), (211, 1), (331, 2), (333, 1), (344, 1), (406, 1), (443, 1), (568, 1), (664, 1), (665, 1), (666, 1), (667, 2), (668, 1), (669, 1), (670, 1), (671, 1), (672, 1), (673, 1), (674, 1), (675, 1), (676, 1), (677, 1)]
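
To make the `(id, count)` pairs above concrete, here is a pure-Python sketch of a bag-of-words encoding. It is only an illustration: `build_token2id` and `to_bow` are hypothetical helpers, and the ids they assign need not match the ones `corpora.Dictionary` would produce.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (id, count) pairs, skipping tokens that have no id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical toy corpus used to build the vocabulary.
toy_docs = [["car", "door", "car"], ["engine", "car"]]
toy_token2id = build_token2id(toy_docs)

# "unknown" is not in the vocabulary, so it is ignored.
toy_bow = to_bow(["car", "car", "engine", "unknown"], toy_token2id)
print(toy_bow)
```

As in the notebook, tokens outside the vocabulary contribute nothing to the vector, which is why test documents can be encoded with a dictionary built from the training split only.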


In [15]:
# Topic Modeling

model = models.TfidfModel(bow_corpus)   # Term Frequency-Inverse Document Frequency
# model = models.LsiModel(bow_corpus)   # Latent Semantic Indexing
# model = models.LdaModel(bow_corpus)   # Latent Dirichlet Allocation

In [16]:
index = similarities.SparseMatrixSimilarity(model[bow_corpus], num_features=len(token_dict))
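
The sketch below illustrates, in plain Python, roughly what the TF-IDF model and the similarity index compute together. It assumes the weighting is the raw count times log2(N / df) followed by L2 normalization, which is my understanding of gensim's defaults rather than something stated in this notebook. `tfidf_vector` and `cosine` are hypothetical helpers and the data is made up.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (id, count) pairs by count * log2(N / df), then L2-normalize."""
    weights = {
        term_id: count * math.log2(num_docs / doc_freq[term_id])
        for term_id, count in bow
        if doc_freq.get(term_id, 0) > 0
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {term_id: w / norm for term_id, w in weights.items()}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors stored as dicts."""
    return sum(w * v[term_id] for term_id, w in u.items() if term_id in v)

# Hypothetical toy corpus: three documents as (id, count) lists.
toy_bows = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(2, 3)]]
# Number of documents containing each term id.
toy_doc_freq = {0: 1, 1: 2, 2: 2}

toy_vectors = [tfidf_vector(b, toy_doc_freq, len(toy_bows)) for b in toy_bows]
toy_query = tfidf_vector([(0, 1)], toy_doc_freq, len(toy_bows))
toy_sims = [cosine(toy_query, v) for v in toy_vectors]
print(toy_sims)
```

Only the first toy document contains term 0, so it is the only one with a nonzero similarity to the query. Because the vectors are unit length, the dot product equals the cosine similarity.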

In [17]:
# testing 
rndm = np.random.randint(len(X_test))
query_document = re.split(r'\W+', X_test[rndm].lower())
query_bow = token_dict.doc2bow(query_document)   # tokens missing from token_dict are ignored
sims = index[model[query_bow]]   # named `sims` so the imported `similarities` module is not shadowed
doc_number = int(np.argmax(sims))   # index of the most similar training document

print('Predicted: ', y_names[y_train[doc_number]])
print('Ground truth: ', y_names[y_test[rndm]])

Predicted:  misc.forsale
Ground truth:  misc.forsale
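
A single random query says little about overall quality. As a hedged sketch, the same nearest-neighbour rule can be scored over many queries at once. `nearest_neighbor_accuracy` is a hypothetical helper, and the similarity rows and labels below are made-up toy values, not results from this notebook.

```python
def nearest_neighbor_accuracy(similarity_rows, train_labels, test_labels):
    """Fraction of queries whose most similar training document shares their label."""
    correct = 0
    for sims, truth in zip(similarity_rows, test_labels):
        # Index of the training document with the highest similarity.
        best = max(range(len(sims)), key=sims.__getitem__)
        if train_labels[best] == truth:
            correct += 1
    return correct / len(test_labels)

# One row of similarities per query, one column per training document (toy values).
toy_rows = [[0.1, 0.9, 0.0], [0.7, 0.2, 0.1], [0.0, 0.3, 0.4]]
toy_train_labels = [7, 6, 14]
toy_test_labels = [6, 7, 6]

toy_accuracy = nearest_neighbor_accuracy(toy_rows, toy_train_labels, toy_test_labels)
print(toy_accuracy)
```

In the notebook's setting, each row would be the result of `index[model[query_bow]]` for one document in `X_test`, with `y_train` and `y_test` as the label lists.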
