### LDA aka Latent Dirichlet Allocation (not to be confused with LDA = Linear Discriminant Analysis)

In [1]:
# run this cell before the class starts

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\elena\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\elena\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import re  # used by TokenizeText below
import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer  # explicit import instead of `import *`
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, LatentDirichletAllocation
import matplotlib.pyplot as plt

### Topic modelling

Topic modelling assigns a topic to each document. Each topic is represented by a set of characteristic words.

Consider an example:

We have two topics: topic 1 and topic 2. Topic 1 is represented by the words "apple, banana, mango",
and topic 2 by the words "tennis, cricket, hockey". We can guess that topic 1 is about fruit and topic 2 is about sports. Each new document is then assigned one of these topics (topic 1 or topic 2).

Another example: suppose we have 6 documents

apple banana
apple orange
banana orange
tiger cat
tiger dog
cat dog

What happens if we ask a topic model to extract two topics from these documents?
We get two distributions: a topic-word distribution and a document-topic (doc-topic) distribution.

The ideal topic-word distribution for this example looks like this:

![How](df1.png)

The ideal document-topic distribution looks like this:

![How](df2.png)

Suppose we have a new document, "cat dog apple". One of its three words (apple) belongs to the fruit topic and two (cat, dog) belong to the animal topic, so its topic representation should be:

Topic1: 0.33

Topic2: 0.67
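The arithmetic behind these proportions can be checked with a few lines of plain Python. This sketch assumes the ideal case in which every word belongs to exactly one topic (fruit words to Topic1, animal words to Topic2); `ideal_topics` and `mixture` are illustrative names, and this is simple counting, not LDA inference.

```python
# Ideal, hard word-to-topic assignment (an assumption for this illustration)
ideal_topics = {
    "Topic1": {"apple", "banana", "orange"},
    "Topic2": {"tiger", "cat", "dog"},
}

new_doc = "cat dog apple".split()

# Share of the document's words that fall into each topic
mixture = {
    name: sum(word in words for word in new_doc) / len(new_doc)
    for name, words in ideal_topics.items()
}

for name, share in mixture.items():
    print(f"{name}: {share:.2f}")
# Topic1: 0.33
# Topic2: 0.67
```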

LDA is widely used for tasks like this. Its use for topic modelling is demonstrated below.

We pass LDA the number of topics that we want to extract from the corpus.

First, however, the texts have to be vectorized. We will use the bag-of-words approach, which means that word order and the relationships between words in a text are lost.

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\elena\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\elena\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
lemmatizer = WordNetLemmatizer() #For words Lemmatization
stemmer = PorterStemmer()  #For stemming words
stop_words = set(stopwords.words('english'))

In [5]:
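def TokenizeText(text):
    '''
    Tokenize text, drop English stop words, and lemmatize the remaining tokens.
    '''
    # Keep only letters, digits, and whitespace (raw string avoids an invalid-escape warning)
    text = re.sub(r'[^A-Za-z0-9\s]+', '', text)
    word_list = word_tokenize(text)
    word_list_final = []

    for word in word_list:
        # stop_words are lowercase; CountVectorizer lowercases text before calling the tokenizer
        if word not in stop_words:
            word_list_final.append(lemmatizer.lemmatize(word))
    return word_list_final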

In [6]:
def gettopicwords(topics, cv, n_words=10):
    '''
        Print top n_words for each topic.
        cv=CountVectorizer (already fitted)
    '''
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = np.asarray(cv.get_feature_names_out())
    for i, topic in enumerate(topics):
        # Indices of the n_words largest weights, in descending order
        top_words_array = feature_names[np.argsort(topic)[::-1][:n_words]]
        print("For  topic {} it's top {} words are ".format(str(i),str(n_words)))
        print(" ".join(top_words_array) + " ")

In [7]:
df = pd.read_csv('million-headlines.zip',usecols=[1])
df = df.iloc[:10000]

In [8]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\elena\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Data link:

https://www.kaggle.com/therohk/million-headlines

In [9]:
print(len(df))
df.head()

10000


| | headline_text |
|---|---|
| 0 | aba decides against community broadcasting lic... |
| 1 | act fire witnesses must be aware of defamation |
| 2 | a g calls for infrastructure protection summit |
| 3 | air nz staff in aust strike for pay rise |
| 4 | air nz strike to affect australian travellers |


In [10]:
%%time 

num_features = 100000
# cv=CountVectorizer(min_df=0.01,max_df=0.97,tokenizer=TokenizeText,max_features=num_features)
cv = CountVectorizer(tokenizer=TokenizeText, max_features=num_features)
transformed_data = cv.fit_transform(df['headline_text'])

Wall time: 3.59 s


In [11]:
# transformed_data

In [12]:
%%time
no_topics = 10  # number of topics; a hyperparameter we can tune
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0, n_jobs=-1).fit(transformed_data)

Wall time: 27.2 s


`lda.components_` is the topic-word table: it shows which words represent each topic.

In [13]:
gettopicwords(lda.components_,cv)

For  topic 0 it's top 10 words are 
crash sars woman new lead final car dead clash strike 
For  topic 1 it's top 10 words are 
iraqi force plan three missing case missile open air denies 
For  topic 2 it's top 10 words are 
u iraq war say troop anti get fire australia wa 
For  topic 3 it's top 10 words are 
back killed set nsw coalition home minister year election korea 
For  topic 4 it's top 10 words are 
man police face court vic take two charge charged coast 
For  topic 5 it's top 10 words are 
world saddam pm dy cup melbourne former stay power blue 
For  topic 6 it's top 10 words are 
win say protest marine found union make battle probe accident 
For  topic 7 it's top 10 words are 
baghdad may hospital hit group support concern seek ban inquiry 
For  topic 8 it's top 10 words are 
council govt death claim police plan qld new water hope 
For  topic 9 it's top 10 words are 
call report attack begin bridge near work appeal rail put 


Assigning a topic to a document

Note that each document contains a mixture of topics. Let's look at the topics of the first ten documents.

In [14]:
docs = df['headline_text'][:10]

In [15]:
data = []
for doc in docs:
    data.append(lda.transform(cv.transform([doc])))

In [16]:
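# Note: columns are numbered from 1, so 'topicK' here is 'topic K-1' in the gettopicwords output above
cols = ['topic'+str(i) for i in range(1,11)]
doc_topic_df = pd.DataFrame(columns=cols, data=np.array(data).reshape((10,10)))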

In [17]:
doc_topic_df['major_topic'] = doc_topic_df.idxmax(axis=1)
doc_topic_df['raw_doc'] = docs

In [18]:
doc_topic_df.head(3)

| | topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 | major_topic | raw_doc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.016667 | 0.016667 | 0.016667 | 0.85 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | topic4 | aba decides against community broadcasting lic... |
| 1 | 0.014286 | 0.014286 | 0.871428 | 0.014286 | 0.014286 | 0.014286 | 0.014286 | 0.014286 | 0.014286 | 0.014286 | topic3 | act fire witnesses must be aware of defamation |
| 2 | 0.683333 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | 0.016667 | 0.183334 | topic1 | a g calls for infrastructure protection summit |


We have seen how LDA can be used for topic modelling. The same approach can also be used to cluster documents by grouping them according to their topics.
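One simple way to do such clustering, sketched here under the assumption that each document is assigned to its highest-weight topic, is to take the `argmax` of every row of the document-topic matrix. The corpus and the names `cluster_docs`, `cluster_cv`, `cluster_lda`, and `cluster_labels` are illustrative and not part of the notebook above.

```python
from collections import defaultdict

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative corpus
cluster_docs = [
    "apple banana", "apple orange", "banana orange",
    "tiger cat", "tiger dog", "cat dog",
]

cluster_cv = CountVectorizer()
cluster_counts = cluster_cv.fit_transform(cluster_docs)
cluster_lda = LatentDirichletAllocation(n_components=2, random_state=0)
cluster_doc_topic = cluster_lda.fit_transform(cluster_counts)

# Hard cluster label = index of the dominant topic for each document
cluster_labels = cluster_doc_topic.argmax(axis=1)

# Group the documents by their label
clusters = defaultdict(list)
for label, doc in zip(cluster_labels, cluster_docs):
    clusters[int(label)].append(doc)

for label in sorted(clusters):
    print(label, clusters[label])
```

This is equivalent in spirit to the `major_topic` column computed with `idxmax` earlier. Documents whose topic weights are nearly equal are assigned somewhat arbitrarily, so the full mixture is often worth keeping alongside the hard label.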

References

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

https://sebastianraschka.com/faq/docs/lda-vs-pca.html