<a href="https://colab.research.google.com/github/AdityaVarmaUddaraju/Topic_Modelling/blob/main/Topic_modelling_with_svd_and_nmf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Singular Value Decomposition

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg

In [2]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [5]:
first_3_text = newsgroups_train.data[:3]
first_3_label = newsgroups_train.target[:3]
for text, label in zip(first_3_text, first_3_label):
    print(text)
    print(f'topic: {label}')

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych
topic: 1


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.
topic: 3

 >In article <1993Apr19.020359.26996@sq.sq

In [6]:
newsgroups_train.target_names

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
vectorizer = CountVectorizer(stop_words='english')

In [9]:
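# toarray() returns a plain ndarray; recent scikit-learn releases reject the np.matrix that todense() produces
vectors = vectorizer.fit_transform(newsgroups_train.data).toarray()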
vectors.shape

(2034, 26576)

In [10]:
len(newsgroups_train.data)

2034

In [11]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = np.array(vectorizer.get_feature_names_out())

In [12]:
vocab.shape

(26576,)

In [13]:
# Use SVD to decompose the document-term matrix (rows: documents, columns: terms)
U, s, Vh = linalg.svd(vectors, full_matrices=False)

In [14]:
U.shape, s.shape, Vh.shape

((2034, 2034), (2034,), (2034, 26576))
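
As a sanity check on the shapes above, here is a minimal sketch on a small random matrix. The matrix `A` and its size are made up for illustration, and the sketch uses `np.linalg.svd` rather than `scipy.linalg.svd` to stay NumPy-only. With `full_matrices=False` the three factors multiply back to the original matrix, the singular values come out in descending order, and the rows of `Vh` are orthonormal.

```python
import numpy as np

# Hypothetical toy matrix standing in for the document-term matrix
rng = np.random.default_rng(0)
A = rng.random((6, 10))

# Thin SVD: U_toy is (6, 6), s_toy is (6,), Vh_toy is (6, 10)
U_toy, s_toy, Vh_toy = np.linalg.svd(A, full_matrices=False)

# U_toy * s_toy scales column j of U_toy by s_toy[j], i.e. U @ diag(s)
A_rebuilt = (U_toy * s_toy) @ Vh_toy
reconstructs = np.allclose(A, A_rebuilt)

# Singular values are sorted from largest to smallest
is_sorted = bool(np.all(np.diff(s_toy) <= 0))

# Rows of Vh_toy (the "topics") are orthonormal
rows_orthonormal = np.allclose(Vh_toy @ Vh_toy.T, np.eye(len(s_toy)))

print(reconstructs, is_sorted, rows_orthonormal)
```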

In [15]:
num_top_words=8

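def show_topics(a):
    """Return the top `num_top_words` vocabulary words for each row (topic) of `a`."""
    # argsort is ascending, so the reversed slice picks the largest weights first
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]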
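
The slice `[:-num_top_words-1:-1]` in `show_topics` is easy to misread, so here is a small sketch of what it selects. The vocabulary `toy_vocab`, the weights `toy_topic`, and `k` are invented for illustration only.

```python
import numpy as np

# Hypothetical five-word vocabulary and one made-up topic row
toy_vocab = np.array(['apple', 'banana', 'cherry', 'date', 'elder'])
toy_topic = np.array([0.1, 0.9, -0.3, 0.5, 0.2])
k = 2

# argsort is ascending; [:-k-1:-1] walks backwards from the end and keeps k indices
top_idx = np.argsort(toy_topic)[:-k-1:-1]
top = [str(w) for w in toy_vocab[top_idx]]

print(top)  # ['banana', 'date']
```

The two largest weights are 0.9 and 0.5, so the words come back in descending order of weight.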

In [26]:
# Top words of topic 383, the index that np.argmax(U[0]) returns for document 0
show_topics(Vh[383:384])

['read file koran linux mode given list interesting']

In [24]:
# Index of the topic (column of U) with the largest weight for document 0
np.argmax(U[0])

383

In [34]:
newsgroups_train.data[7]

"\nAcorn Replay running on a 25MHz ARM 3 processor (the ARM 3 is about 20% slower\nthan the ARM 6) does this in software (off a standard CD-ROM). 16 bit colour at\nabout the same resolution (so what if the computer only has 8 bit colour\nsupport, real-time dithering too...). The 3D0/O is supposed to have a couple of\nDSPs - the ARM being used for housekeeping.\n\n\nA 25MHz ARM 6xx should clock around 20 ARM MIPS, say 18 flat out. Depends\nreally on the surrounding system and whether you are talking ARM6x or ARM6xx\n(the latter has a cache, and so is essential to run at this kind of speed with\nslower memory).\n\nI'll stop saying things there 'cos I'll hopefully be working for ARM after\ngraduation...\n\nMike\n\nPS Don't pay heed to what reps from Philips say; if the 3D0/O doesn't beat the\n   pants off 3DI then I'll eat this postscript."

In [35]:
# Index of the topic (column of U) with the largest weight for document 7
np.argmax(U[7])

431

In [36]:
show_topics(Vh[431:432])

['arm hard try funding ideas big hi method']

# Non-negative Matrix Factorization

In [37]:
clf = decomposition.NMF(n_components=5, random_state=1)

In [38]:
W1 = clf.fit_transform(vectors)
H1 = clf.components_

In [39]:
show_topics(H1)

['jpeg image gif file color images format quality',
 'edu graphics pub mail 128 ray ftp send',
 'space launch satellite nasa commercial satellites year market',
 'jesus god people matthew atheists does atheism said',
 'image data available software processing ftp edu analysis']

In [41]:
W1[0]

array([0.08858936, 0.02984714, 0.        , 0.04220515, 0.        ])
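
The row `W1[0]` holds non-negative weights for document 0 over the five topics in `H1`. To make the factorization itself concrete, here is a minimal NumPy sketch of NMF using multiplicative updates. This is an assumption made for illustration: scikit-learn's `NMF` defaults to a coordinate-descent solver, so this is not what `clf.fit_transform` runs. The function name `nmf_multiplicative`, the toy matrix `V_toy`, and all parameter values are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V into W @ H with W >= 0 and H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    errors = []
    for _ in range(n_iter):
        # Each factor is multiplied by a non-negative ratio, so no entry turns negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Track the Frobenius reconstruction error after each sweep
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Hypothetical toy counts: 6 "documents" by 8 "terms"
V_toy = np.random.default_rng(1).integers(0, 5, size=(6, 8)).astype(float)
W_toy, H_toy, errors = nmf_multiplicative(V_toy, n_components=2)

# Dominant topic per document, read off the rows of W_toy
dominant_topic = W_toy.argmax(axis=1)
```

Rows of `H_toy` play the role of `H1` (topics over terms) and rows of `W_toy` play the role of `W1` (documents over topics).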

# Truncated SVD

In [43]:
!pip install fbpca
import fbpca

Collecting fbpca
  Downloading fbpca-1.0.tar.gz (11 kB)
Building wheels for collected packages: fbpca
  Building wheel for fbpca (setup.py) ... done
  Created wheel for fbpca: filename=fbpca-1.0-py3-none-any.whl size=11376 sha256=0ba0e3732d8f48850f61a19b21358c0c90a5d80b92b86ed5f03f4d6009798e58
  Stored in directory: /root/.cache/pip/wheels/93/08/0c/1b9866c35c8d3f136d100dfe88036a32e0795437daca089f70
Successfully built fbpca
Installing collected packages: fbpca
Successfully installed fbpca-1.0


In [44]:
%time u, s, v = np.linalg.svd(vectors, full_matrices=False)

CPU times: user 1min 21s, sys: 4.4 s, total: 1min 26s
Wall time: 44.4 s


In [45]:
%time u, s, v = decomposition.randomized_svd(vectors, 10)

CPU times: user 13.1 s, sys: 1.82 s, total: 15 s
Wall time: 10.2 s


In [46]:
%time u, s, v = fbpca.pca(vectors, 10)

CPU times: user 2.95 s, sys: 734 ms, total: 3.68 s
Wall time: 1.99 s
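
The timings above show the randomized routines running well ahead of the full decomposition when only 10 components are needed. The basic idea can be sketched as follows, assuming a plain random-projection range finder followed by an exact SVD of a much smaller matrix. The library routines add refinements such as power iterations, so this is an illustration of the technique rather than their implementation. The function name `randomized_svd_sketch`, its parameters, and the test matrix are hypothetical.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, seed=0):
    """Approximate the top-k SVD of A using a random projection."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Random test matrix with a few extra columns for accuracy
    omega = rng.standard_normal((n_cols, k + n_oversamples))
    # Orthonormal basis Q for the sampled column space of A
    Q, _ = np.linalg.qr(A @ omega)
    # Exact SVD of the small (k + n_oversamples, n_cols) matrix
    B = Q.T @ A
    U_small, s, Vh = np.linalg.svd(B, full_matrices=False)
    # Map the left singular vectors back to the original row space
    U = Q @ U_small
    return U[:, :k], s[:k], Vh[:k]

# Hypothetical rank-5 matrix, so a rank-5 approximation should be near exact
rng = np.random.default_rng(1)
A_toy = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))

U_r, s_r, Vh_r = randomized_svd_sketch(A_toy, k=5)
s_exact = np.linalg.svd(A_toy, compute_uv=False)[:5]
```

The expensive SVD runs on a matrix with only `k + n_oversamples` rows, which is where the savings come from when `k` is much smaller than the number of documents.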


In [49]:
show_topics(v)

['kent cheers bobby islamic muslim ico manhattan prize',
 'jpeg gif file color quality image jfif bit',
 'graphics edu pub mail 128 3d ray send',
 'jesus god matthew people atheists atheism does religious',
 'image data processing analysis software available tools display',
 'god atheists atheism religious believe argument religion true',
 'nasa space lunar mars probe moon missions available',
 'image probe mars surface lunar probes moon atheists',
 'argument fallacy conclusion example true ad argumentum premises',
 'space image nasa sci processing news edu include']