<a href="https://colab.research.google.com/github/AlirezaAhadipour/Topic-Modeling_NLP/blob/main/Gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
from gensim import corpora, models, similarities
from sklearn.datasets import fetch_20newsgroups as getdata
from sklearn.model_selection import train_test_split
from collections import defaultdict
import re

In [2]:
corpus = getdata(subset='train', remove=('headers', 'footers', 'quotes'))

X = corpus.data
y = corpus.target
y_names = corpus.target_names

In [3]:
len(X)

11314

In [4]:
print(X[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [5]:
y_names[y[0]]

'rec.autos'

In [6]:
y_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True)

In [8]:
# tokenization

stoplist = ['a', 'the', 'of', 'and', 'for', 'to', 'in']
# lowercase each document, split on runs of non-word characters, and drop stop words
texts = [[word for word in re.split(r'\W+', doc.lower()) if word not in stoplist] for doc in X_train]

In [9]:
print(texts[0])

['', 'rather', 'people', 'kill', 'people', 'with', 'guns', 'sad', 'truth', 'is', 'sometimes', 'that', 'is', 'good', 'or', 'at', 'least', 'better', 'than', 'alternative', 'ok', 'there', 'are', 'about', '1400', 'fatal', 'firearm', 'accidents', 'per', 'year', '1', 'number', 'has', 'been', 'decline', 'since', 'early', 'this', 'century', '2', 'most', 'these', 'accidents', 'involve', 'rifles', 'or', 'shot', 'guns', 'not', 'handguns', 'fact', 'there', 'are', 'both', 'guns', 'bullets', 'designed', 'specifically', 'that', 'idea', 'that', 'my', 'ruger', 'mark', 'ii', 'bull', 'barrel', 'semi', 'auto', '0', '22', 'caliber', 'handgun', 'was', 'designed', 'kill', 'or', 'hurt', 'people', 'even', 'self', 'defense', 'would', 'i', 'm', 'sure', 'come', 'as', 'surprise', 'its', 'designer', 'it', 'certainly', 'isn', 't', 'why', 'i', 'have', 'it', 'it', 'certainly', 'would', 'hurt', 'someone', 'if', 'you', 'shot', 'them', 'with', 'it', 'might', 'even', 'kill', 'them', 'but', 'it', 'is', 'simply', 'wrong', '
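
The token list above begins with an empty string because `re.split` emits an empty element whenever the text starts or ends with a separator. A minimal sketch of an alternative tokenizer follows; the function name `tokenize` is hypothetical and not part of the notebook, and it matches runs of word characters with `re.findall`, so no empty tokens are produced.

```python
import re

# Same stop words as the notebook's stoplist, stored as a set for fast lookup.
STOPLIST = {'a', 'the', 'of', 'and', 'for', 'to', 'in'}

def tokenize(doc, stoplist=STOPLIST):
    """Lowercase a document, extract runs of word characters, drop stop words."""
    # re.findall returns only the matches, so leading or trailing
    # punctuation never yields an empty string.
    return [word for word in re.findall(r'\w+', doc.lower()) if word not in stoplist]

sample = '\nRather, people kill.'
print(re.split(r'\W+', sample))   # contains '' at both ends
print(tokenize(sample))           # no empty tokens
```

Swapping this in would change the dictionary slightly, since the `''` token would no longer be counted.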

In [10]:
frequency = {}
for text in texts:
  for token in text:
    frequency[token] = frequency.get(token, 0) + 1

threshold = 10
processed_corpus = [[token for token in text if frequency[token] >= threshold] for text in texts]
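
The frequency filter above can also be written with `collections.Counter`. The sketch below is an equivalent formulation on toy data, not the notebook's own code; `filter_rare_tokens` is a hypothetical helper name.

```python
from collections import Counter

def filter_rare_tokens(texts, threshold=10):
    """Keep only tokens whose total count across all texts is >= threshold."""
    # Count every token occurrence over the whole corpus in one pass.
    frequency = Counter(token for text in texts for token in text)
    return [[token for token in text if frequency[token] >= threshold]
            for text in texts]

toy_texts = [['car', 'engine', 'car'], ['car', 'door'], ['engine', 'bumper']]
# 'car' occurs 3 times and 'engine' twice; 'door' and 'bumper' occur once.
print(filter_rare_tokens(toy_texts, threshold=2))
```

Note that this counts total occurrences, as the notebook does, rather than the number of documents a token appears in.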

In [11]:
token_dict = corpora.Dictionary(processed_corpus)
print(token_dict)

Dictionary<13244 unique tokens: ['', '0', '00', '000', '001']...>


In [12]:
print(token_dict.token2id)



In [13]:
# Bag of Words 
bow_corpus = [token_dict.doc2bow(text) for text in processed_corpus]

In [14]:
print(bow_corpus[10])   #(id, count)

[(0, 1), (5, 1), (8, 1), (31, 2), (154, 1), (190, 1), (204, 3), (217, 2), (223, 1), (276, 1), (284, 1), (285, 2), (311, 1), (312, 2), (317, 3), (325, 1), (335, 5), (345, 1), (350, 6), (351, 2), (352, 3), (355, 1), (356, 2), (361, 2), (367, 1), (370, 9), (372, 1), (375, 1), (376, 1), (377, 2), (379, 2), (381, 1), (383, 3), (395, 1), (415, 1), (417, 2), (420, 7), (464, 1), (467, 1), (481, 2), (487, 10), (515, 1), (543, 5), (548, 2), (549, 2), (557, 1), (575, 1), (615, 1), (616, 1), (620, 1), (663, 1), (681, 1), (683, 2), (696, 1), (722, 1), (736, 1), (740, 2), (752, 1), (777, 2), (786, 1), (791, 1), (818, 1), (835, 1), (845, 2), (862, 1), (867, 4), (876, 1), (877, 1), (878, 1), (879, 1), (880, 1), (881, 3), (882, 2), (883, 2), (884, 3), (885, 1), (886, 3), (887, 1), (888, 1), (889, 2), (890, 1), (891, 1), (892, 1), (893, 1), (894, 2), (895, 1), (896, 1), (897, 1), (898, 2), (899, 1), (900, 1), (901, 1), (902, 1), (903, 1), (904, 1), (905, 1), (906, 1), (907, 1), (908, 1), (909, 2), (910,
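
To make the `(id, count)` pairs above concrete, here is a small pure-Python sketch of the bag-of-words idea. It is an illustration only: `build_token2id` and `to_bow` are hypothetical helpers, and the IDs they assign follow first-seen order, which may differ from the IDs gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_token2id(texts):
    """Assign each distinct token an integer ID in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_texts = [['car', 'door', 'car'], ['engine', 'car']]
token2id = build_token2id(toy_texts)
# 'bumper' is not in the vocabulary, so it is ignored.
print(to_bow(['car', 'engine', 'car', 'bumper'], token2id))
```

Skipping out-of-vocabulary tokens mirrors what happens later when a test document is converted with the dictionary built from the training set.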

In [15]:
# Topic Modeling

model = models.TfidfModel(bow_corpus)   # Term Frequency-Inverse Document Frequency
# model = models.LsiModel(bow_corpus)   # Latent Semantic Indexing
# model = models.LdaModel(bow_corpus)   # Latent Dirichlet Allocation
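
As a rough illustration of what the TF-IDF transform does to a bag-of-words vector, the sketch below weights each count by `log2(N / df)` and scales the result to unit length. This is my understanding of gensim's default settings, stated as an assumption rather than a reference implementation; `tfidf_vectors` is a hypothetical helper.

```python
import math

def tfidf_vectors(bow_docs):
    """Weight (term_id, count) pairs by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = {}
    for doc in bow_docs:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    idf = {t: math.log2(n_docs / df) for t, df in doc_freq.items()}

    weighted_docs = []
    for doc in bow_docs:
        weights = [(t, count * idf[t]) for t, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Terms present in every document get weight 0 and are dropped.
        weighted_docs.append(
            [(t, w / norm) for t, w in weights if w > 0] if norm > 0 else []
        )
    return weighted_docs

toy_bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
# Term 0 appears in both documents, so its weight is zero.
print(tfidf_vectors(toy_bow))
```

The effect is that very common tokens, such as the stop words not covered by the short stoplist, contribute little to the similarity scores.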

In [16]:
index = similarities.SparseMatrixSimilarity(model[bow_corpus], num_features=len(token_dict))
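
The index built above scores a query against every training document by cosine similarity. A minimal pure-Python sketch of that computation on sparse `(id, weight)` vectors is shown here; `cosine` and `most_similar` are hypothetical helpers, and the real index does the same work with a sparse matrix rather than a Python loop.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    du, dv = dict(u), dict(v)
    dot = sum(w * dv.get(t, 0.0) for t, w in du.items())
    norm_u = math.sqrt(sum(w * w for w in du.values()))
    norm_v = math.sqrt(sum(w * w for w in dv.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(query, docs):
    """Return (index, score) of the document closest to the query."""
    scores = [cosine(query, doc) for doc in docs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

toy_docs = [[(0, 1.0)], [(1, 1.0)], [(0, 0.6), (1, 0.8)]]
toy_query = [(1, 1.0)]
print(most_similar(toy_query, toy_docs))
```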

In [19]:
# testing: classify a random held-out document by its nearest training document
rndm = np.random.randint(len(X_test))
query_document = re.split(r'\W+', X_test[rndm].lower())
query_bow = token_dict.doc2bow(query_document)
sims = index[model[query_bow]]   # named sims so the imported gensim similarities module is not shadowed
doc_number = int(np.argmax(sims))   # index of the most similar training document

print('Predicted: ', y_names[y_train[doc_number]])
print('Ground truth: ', y_names[y_test[rndm]])

Predicted:  rec.autos
Ground truth:  rec.autos
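
A single random document says little about overall quality. The sketch below shows one way to score nearest-neighbour predictions over many queries; `top1_accuracy` is a hypothetical helper, and the toy similarity rows stand in for the per-document output of the index.

```python
def top1_accuracy(similarity_rows, train_labels, test_labels):
    """Fraction of queries whose most similar training document has the true label."""
    correct = 0
    total = 0
    for row, truth in zip(similarity_rows, test_labels):
        # Position of the highest similarity score in this row.
        best = max(range(len(row)), key=row.__getitem__)
        correct += int(train_labels[best] == truth)
        total += 1
    return correct / total if total else 0.0

toy_rows = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.4]]   # one row per test document
toy_train_labels = [7, 3]
toy_test_labels = [3, 7, 7]
print(top1_accuracy(toy_rows, toy_train_labels, toy_test_labels))
```

In the notebook, each row could presumably be obtained by tokenizing a document from `X_test`, converting it with `token_dict.doc2bow`, and passing the result through `model` and `index`, with `y_train` and `y_test` as the label sequences.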
