Task: run an LDA model with Gibbs sampling using 20 tags (topics). Print the top 10 words for each tag. Match the resulting tags against the tags in the dataset and draw conclusions.

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

In [2]:
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

In [3]:
# ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text; the separate stop_words module is gone in newer scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

vectorizer = CountVectorizer(lowercase=True, stop_words=ENGLISH_STOP_WORDS, analyzer='word', binary=True, min_df=0.2, max_df=1.0)
popular = vectorizer.fit_transform(newsgroups_train.data)
print(popular.shape)



(11314, 4)


In [4]:
print(vectorizer.vocabulary_)

{'know': 2, 'don': 0, 'like': 3, 'just': 1}


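I printed the most "popular" words, those that occur in at least 20% of the documents. Nothing surprising here, except for "don": it is what remains of "don't", because the default CountVectorizer tokenizer keeps only runs of two or more alphanumeric characters, so the apostrophe splits the word and the single-letter "t" is dropped.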
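To see where a token like "don" comes from, here is a minimal sketch that applies the default `token_pattern` of `CountVectorizer` (`(?u)\b\w\w+\b`) with plain `re`. The helper name `tokenize` and the sample sentence are made up for illustration.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text and extract tokens with the default pattern."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize("I don't know, it doesn't work")
print(tokens)  # ['don', 'know', 'it', 'doesn', 'work']
```

The apostrophe is not a word character, so "don't" and "doesn't" are cut in two, and the one-letter pieces "t" and "i" are too short to match.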

Let's run two experiments: one on frequently used words and one on rare words:

In [61]:
vectorizer = CountVectorizer(lowercase=True, stop_words=ENGLISH_STOP_WORDS, analyzer='word', binary=True, min_df=0.02, max_df=1.0)
train1 = vectorizer.fit_transform(newsgroups_train.data)
print(train1.shape)

(11314, 451)


In [6]:
vectorizer = CountVectorizer(lowercase=True, stop_words=ENGLISH_STOP_WORDS, analyzer='word', binary=True, min_df=0.001, max_df=0.002)
train2 = vectorizer.fit_transform(newsgroups_train.data)
print(train2.shape)

(11314, 3694)
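Both vocabularies above are selected purely by document frequency: a float `min_df` or `max_df` is read as a fraction of documents. The sketch below reproduces that idea on a toy corpus in plain Python. The function name `filter_by_document_frequency` and the toy documents are assumptions for illustration, and whitespace splitting stands in for the real tokenizer.

```python
def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document-frequency fraction lies in [min_df, max_df]."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        # set(): a term counts once per document, as with binary=True
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)

toy_docs = [
    "the car engine",
    "the car price",
    "the bible verse",
    "the hockey team",
]

# Frequent words: present in at least half of the documents
frequent = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=1.0)
# Rare words: present in at most a quarter of the documents
rare = filter_by_document_frequency(toy_docs, min_df=0.0, max_df=0.25)

print(frequent)  # ['car', 'the']
print(rare)      # ['bible', 'engine', 'hockey', 'price', 'team', 'verse']
```

Even on four documents the pattern of the two experiments shows up: the frequent set is dominated by generic words, while the rare set holds the topic-specific ones.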


In [7]:
from tqdm import tqdm

In [62]:
doc, word = train1.nonzero()
z = np.random.choice(20, len(doc))
alpha = 2 * np.ones(20)
beta = 2 * np.ones(train1.shape[1])

In [63]:
mass1 = np.zeros(20)                     # number of tokens assigned to each topic
mass2 = np.zeros((train1.shape[0], 20))  # tokens per (document, topic)
mass3 = np.zeros((20, train1.shape[1]))  # tokens per (topic, word)

In [64]:
for i, j, k in zip(doc, word, z):
    mass1[k] += 1
    mass2[i, k] += 1
    mass3[k, j] += 1
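The counting loop above can also be written without a Python-level loop. This is a sketch on toy data using `np.bincount` and `np.add.at`; the array names (`topic_total`, `doc_topic`, `topic_word`) and the toy sizes are assumptions, not the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 5, 7, 3       # toy sizes, not the notebook's
doc = rng.integers(0, n_docs, size=40)    # document index of each token
word = rng.integers(0, n_words, size=40)  # word index of each token
z = rng.integers(0, n_topics, size=40)    # current topic of each token

# Loop version, mirroring the cell above
topic_total = np.zeros(n_topics)
doc_topic = np.zeros((n_docs, n_topics))
topic_word = np.zeros((n_topics, n_words))
for d, w, k in zip(doc, word, z):
    topic_total[k] += 1
    doc_topic[d, k] += 1
    topic_word[k, w] += 1

# Vectorized version: np.add.at accumulates correctly over repeated index pairs
topic_total_v = np.bincount(z, minlength=n_topics).astype(float)
doc_topic_v = np.zeros((n_docs, n_topics))
np.add.at(doc_topic_v, (doc, z), 1)
topic_word_v = np.zeros((n_topics, n_words))
np.add.at(topic_word_v, (z, word), 1)

print(np.array_equal(topic_total, topic_total_v))
print(np.array_equal(doc_topic, doc_topic_v))
print(np.array_equal(topic_word, topic_word_v))
```

Plain fancy-index assignment such as `doc_topic_v[doc, z] += 1` would not work here, because repeated index pairs are only incremented once; `np.add.at` is the unbuffered variant that handles duplicates.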

In [11]:
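def LDA(mass1, mass2, mass3, z, doc, word, alpha, beta):
    """Collapsed Gibbs sampling for LDA: 50 sweeps over all tokens, 20 topics.

    mass1[k] counts tokens in topic k, mass2[d, k] tokens of topic k in document d,
    mass3[k, w] occurrences of word w in topic k. All arrays are updated in place.
    """
    for i in tqdm(range(50)):
        for j in range(len(doc)):
            topic1 = z[j]
            doc1 = doc[j]
            word1 = word[j]

            # Remove the current token from all counts
            mass1[topic1] -= 1
            mass2[doc1, topic1] -= 1
            mass3[topic1, word1] -= 1

            # Unnormalized conditional probability of each topic for this token
            p = (mass2[doc1, :] + alpha) * (mass3[:, word1] + beta[word1]) / (mass1 + beta.sum())
            z[j] = np.random.choice(np.arange(20), p=p / p.sum())

            # Add the token back under its newly sampled topic
            mass1[z[j]] += 1
            mass2[doc1, z[j]] += 1
            mass3[z[j], word1] += 1

    return mass1, mass2, mass3, z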
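
For reference, the line that computes `p` appears to be the standard collapsed Gibbs conditional for LDA. The notation below is my own, not the notebook's: $n_{d,k}$ corresponds to `mass2[doc1, k]`, $n_{k,w}$ to `mass3[k, word1]`, $n_k$ to `mass1[k]`, and the superscript $-i$ means the current token has been removed from the counts, which is what the three decrements do.

```latex
p(z_i = k \mid z_{-i}, w) \;\propto\;
\left(n_{d,k}^{-i} + \alpha_k\right)
\frac{n_{k,w}^{-i} + \beta_w}{n_{k}^{-i} + \sum_{v=1}^{V} \beta_v}
```

Here $V$ is the vocabulary size (`train1.shape[1]` in the first experiment), and dividing `p` by `p.sum()` turns the right-hand side into a proper distribution over the 20 topics.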

In [65]:
mass1, mass2, mass3, z = LDA(mass1, mass2, mass3, z, doc, word, alpha, beta)

100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [13:38<00:00, 16.38s/it]


In [66]:
answer = np.argsort(mass3, axis=1)[:, -10:]
for i in range(20):
    matrix = np.zeros((1, train1.shape[1]))
    for j in answer[i]:
        matrix[0, j] = 1
    print('\t '.join(vectorizer.inverse_transform(matrix)[0]))

believe	 christian	 did	 don	 god	 just	 like	 people	 say	 think
don	 good	 just	 know	 like	 lot	 really	 think	 ve	 want
card	 dos	 file	 files	 mail	 pc	 program	 software	 thanks	 windows
10	 11	 12	 14	 15	 16	 1993	 20	 25	 30
best	 going	 good	 ll	 make	 real	 think	 ve	 want	 way
data	 general	 help	 know	 new	 number	 possible	 time	 use	 used
don	 got	 just	 know	 need	 new	 non	 old	 time	 used
day	 don	 government	 just	 know	 little	 ll	 new	 people	 saw
case	 come	 did	 people	 really	 said	 say	 think	 time	 world
bit	 does	 don	 good	 just	 like	 people	 right	 think	 time
course	 didn	 don	 going	 like	 really	 say	 sure	 think	 ve
does	 don	 just	 point	 problem	 read	 time	 want	 way	 work
did	 does	 don	 know	 need	 people	 point	 probably	 right	 thing
drive	 got	 hard	 just	 know	 like	 make	 problem	 use	 used
case	 didn	 don	 just	 like	 tell	 think	 trying	 want	 years
actually	 does	 make	 thing	 think	 time	 use	 ve	 way	 work
does	 doesn	 don	 like	 line	 m
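
Note that `inverse_transform` returns words in vocabulary (alphabetical) order, so the rows above show the ten most frequent words per topic but not their ranking. A sketch of a ranked alternative on toy data follows; `top_words`, `toy_vocab`, and `toy_counts` are hypothetical names introduced for illustration.

```python
import numpy as np

def top_words(topic_word_counts, vocabulary, n_top=10):
    """Return, for each topic, the n_top words with the highest counts, best first."""
    # argsort is ascending, so reverse each row before taking the first n_top columns
    order = np.argsort(topic_word_counts, axis=1)[:, ::-1][:, :n_top]
    return [[vocabulary[j] for j in row] for row in order]

toy_vocab = ["car", "disk", "god", "jesus", "windows"]
toy_counts = np.array([
    [0, 9, 1, 0, 12],   # a "computer" topic
    [2, 0, 15, 8, 1],   # a "religion" topic
])

ranked = top_words(toy_counts, toy_vocab, n_top=3)
for k, words in enumerate(ranked):
    print("Tag", k + 1, "\t", "\t".join(words))
```

With the notebook's objects the call would presumably be `top_words(mass3, vectorizer.get_feature_names_out())`, assuming scikit-learn 1.0 or later and that `vectorizer` is the one fitted for the same experiment.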

In [14]:
doc, word = train2.nonzero()
z = np.random.choice(20, len(doc))
alpha = 2 * np.ones(20)
beta = 2 * np.ones(train2.shape[1])

In [15]:
mass1 = np.zeros(20)
mass2 = np.zeros((train2.shape[0], 20))
mass3 = np.zeros((20, train2.shape[1]))

In [16]:
for i, j, k in zip(doc, word, z):
    mass1[k] += 1
    mass2[i, k] += 1
    mass3[k, j] += 1

In [17]:
mass1, mass2, mass3, z = LDA(mass1, mass2, mass3, z, doc, word, alpha, beta)

100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [04:08<00:00,  4.97s/it]


In [18]:
answer = np.argsort(mass3, axis=1)[:, -10:]
for i in range(20):
    matrix = np.zeros((1, train2.shape[1]))
    for j in answer[i]:
        matrix[0, j] = 1
    print('\t '.join(vectorizer.inverse_transform(matrix)[0]))

165	 booted	 buddy	 bure	 fundamentalist	 loops	 overtime	 proposing	 sand	 subset
331	 acknowledged	 cabin	 consumers	 dartmouth	 dev	 franchise	 manhattan	 photographs	 que
263	 apostles	 danny	 dozens	 forwarded	 illusion	 netnews	 ott	 provisions	 rice
222	 brightness	 eaten	 eyewitness	 moto	 pcx	 royals	 slmr	 specifics	 wash
billions	 contention	 infinity	 kirk	 roster	 rotate	 seals	 smell	 ulf	 unlimited
accompanied	 davidian	 dies	 fallacy	 markets	 parked	 stewart	 thereof	 titled	 vesselin
attractive	 boring	 italian	 kicking	 mis	 outstanding	 paranoia	 pet	 tolerance	 tue
625	 accessible	 bury	 demanding	 espionage	 logically	 precise	 regulars	 struggling	 understandable
116	 abused	 documentary	 lacks	 liberals	 salary	 slaves	 sounded	 thompson	 underground
beware	 contend	 corpses	 denies	 existent	 irrational	 jsc	 mice	 originated	 spokesman
acute	 conscious	 copyrighted	 drum	 lamb	 magnum	 moments	 nearest	 objections	 republicans
chelios	 col	 elementary	 freewar

In [19]:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Two fairly obvious conclusions follow from these two tables:

The first table is built from the most popular words, so every row contains roughly the same "general" words, which makes it hard to tell which tag each row corresponds to.

The second table is built from rather rare words. As a result, numbers occasionally slip in, and they carry no meaningful information about the tags; nevertheless, thanks to the more specialized, "non-general" words, matching the tags is much easier.
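
One possible way to keep numbers such as "165" or "1993" out of the vocabulary is a letters-only token pattern. The sketch below shows the idea with plain `re`. The pattern `LETTERS_ONLY` is my own choice, not something used in this notebook, and it assumes ASCII text that has already been lowercased.

```python
import re

# Two or more ASCII letters, bounded on both sides; digits never match
LETTERS_ONLY = r"\b[a-z][a-z]+\b"

def tokenize_letters_only(text):
    """Lowercase the text and keep only purely alphabetic tokens."""
    return re.findall(LETTERS_ONLY, text.lower())

tokens_no_digits = tokenize_letters_only("Scored 11 goals in 1993, final 625")
print(tokens_no_digits)  # ['scored', 'goals', 'in', 'final']
```

`CountVectorizer` accepts such a pattern through its `token_pattern` parameter, so passing `token_pattern=LETTERS_ONLY` should have the same effect there; I have not rerun the experiments with it.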

Now let's avoid the extremes and try to pick parameters such that the tag can be recovered from the words:

In [53]:
vectorizer = CountVectorizer(lowercase=True, stop_words=ENGLISH_STOP_WORDS, analyzer='word', binary=True, min_df=0.005, max_df=0.05)
train3 = vectorizer.fit_transform(newsgroups_train.data)
print(train3.shape)

(11314, 2330)


In [54]:
doc, word = train3.nonzero()
z = np.random.choice(20, len(doc))
alpha = 2 * np.ones(20)
beta = 2 * np.ones(train3.shape[1])

In [55]:
mass1 = np.zeros(20)
mass2 = np.zeros((train3.shape[0], 20))
mass3 = np.zeros((20, train3.shape[1]))

In [56]:
for i, j, k in zip(doc, word, z):
    mass1[k] += 1
    mass2[i, k] += 1
    mass3[k, j] += 1

In [57]:
mass1, mass2, mass3, z = LDA(mass1, mass2, mass3, z, doc, word, alpha, beta)

100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [22:03<00:00, 26.47s/it]


In [58]:
answer = np.argsort(mass3, axis=1)[:, -10:]
for i in range(20):
    matrix = np.zeros((1, train3.shape[1]))
    for j in answer[i]:
        matrix[0, j] = 1
    print('Tag', i + 1,'\t', '\t'.join(vectorizer.inverse_transform(matrix)[0]))

Tag 1 	 aren	deal	idea	important	isn	likely	reason	seen	times	wrong
Tag 2 	 card	computer	disk	dos	drive	mac	memory	pc	software	video
Tag 3 	 ask	group	isn	kind	mean	place	quite	seen	sort	wrong
Tag 4 	 guess	hand	idea	kind	makes	means	person	pretty	thought	wrong
Tag 5 	 11	12	14	15	16	20	24	25	30	team
Tag 6 	 bible	christ	christian	christians	claim	faith	jesus	life	religion	word
Tag 7 	 application	code	email	file	files	list	program	version	window	works
Tag 8 	 bad	buy	gets	goes	known	large	line	looks	times	yes
Tag 9 	 ago	big	change	mean	months	near	note	small	times	won
Tag 10 	 big	couldn	group	low	order	pretty	support	talk	tried	working
Tag 11 	 car	control	free	info	instead	line	order	pretty	seen	won
Tag 12 	 article	common	fast	getting	given	heard	line	matter	state	support
Tag 13 	 ago	banks	chastity	dsl	gordon	intellect	pitt	skepticism	soon	surrender
Tag 14 	 advance	anybody	doing	free	interested	kind	local	send	stuff	yes
Tag 15 	 bad	cost	couple	posting	problems	quite	seen	start

From these words some tags can already be identified with confidence:

2) - comp.sys.ibm.pc.hardware

5), 9) - rec.sport.baseball or rec.sport.hockey

6) - soc.religion.christian

7) - comp.os.ms-windows.misc

11) - rec.motorcycles or rec.autos

16) - sci.crypt

19) - talk.politics.mideast

Something can also be said about tags 8, 14, 18 and 20: all four contain the word "yes", which is used only in conversational style, so each of these four tags corresponds to one of four items in the list below.

In [59]:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

The remaining tags are fairly hard to identify, but one can proceed by elimination and distribute the leftover labels among them, for example: 13) - alt.atheism, 1) - misc.forsale, and so on.
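
The matching above is done by eye. A more systematic check, sketched here on toy data, is to cross-tabulate each document's dominant topic against its true label. The function `topic_label_table` and the toy arrays are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def topic_label_table(doc_topic_counts, labels, n_labels):
    """Cross-tabulate each document's dominant topic against its true label."""
    dominant = np.argmax(doc_topic_counts, axis=1)       # most frequent topic per document
    table = np.zeros((doc_topic_counts.shape[1], n_labels), dtype=int)
    np.add.at(table, (dominant, labels), 1)              # table[topic, label] += 1
    return table

# Toy document-topic counts (5 documents, 3 topics) and their true labels
toy_doc_topic = np.array([
    [5, 1, 0],
    [4, 0, 2],
    [0, 6, 1],
    [1, 7, 0],
    [0, 1, 9],
])
toy_labels = np.array([2, 2, 0, 0, 1])

table = topic_label_table(toy_doc_topic, toy_labels, n_labels=3)
best_label = table.argmax(axis=1)   # majority label for each topic
print(table)
print(best_label)  # [2 0 1]
```

With the notebook's arrays this would presumably be `topic_label_table(mass2, newsgroups_train.target, 20)`. One caveat: documents left with no tokens after vocabulary filtering have an all-zero row in `mass2`, and `argmax` then assigns them to topic 0, so such rows should probably be excluded first.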