## Lab 4: topic modeling

In this lab we will try to train an LDA topic model on two fundamentally different corpora.

In the first part you will get to know new features of the gensim library and HTML parsing in Python. In the second part you will train an LDA model on your own and evaluate how well it works.

### Part 1: /b/-level topic modeling

The choice of datasets is a cornerstone of machine learning in general and of NLP in particular. So far we have used only standard, well-tested datasets, but today we will try to collect our own. Practice with raw, unprocessed data is very useful! Along the way we will explore what parsers in Python can do.

Let's write a parser that collects posts from the Russian-language anonymous forum (imageboard) "Двач" ("Сосач", "Хиккач", if you prefer). Like any imageboard, Двач is divided into sections (boards) devoted to various subjects: anime, video games, literature, religion... Each board consists of threads (topics), which are created by users who stay anonymous if they wish. Each thread is devoted to discussing one specific question.

Some boards have an archive section, located at https://2ch.hk/(board name)/arch/, for example https://2ch.hk/mu/arch/ for the music board. If you have minimal HTML skills and have studied the documentation of the built-in HTMLParser class, you will have no trouble writing two parsers.

The first parser (ArchiveParser) parses the HTML page of a board archive, extracts the links to archived threads, and feeds them to the second parser.

The second parser (ThreadParser) parses the HTML page of an archived thread, extracts the messages from it, and collects them into a list.
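Before reading the full parsers, it may help to see the HTMLParser callback mechanism on a tiny example. The sketch below is only an illustration: the class name `LinkCollector` and the HTML fragment are made up, and the only detail taken from the lab code is the `box-data` class of the enclosing `<div>`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags found inside <div class="box-data">."""

    def __init__(self):
        super().__init__()
        self.inside_box = False   # True while we are inside the target <div>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)       # attrs arrives as a list of (name, value) pairs
        if tag == "div" and attrs.get("class") == "box-data":
            self.inside_box = True
        elif tag == "a" and self.inside_box and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_box = False


# Made-up HTML fragment, not a real archive page.
sample = (
    '<div class="nav"><a href="/ignored.html">x</a></div>'
    '<div class="box-data">'
    '<a href="/demo/arch/res/1.html">one</a>'
    '<a href="/demo/arch/res/2.html">two</a>'
    '</div>'
)

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The parser never builds a document tree: it calls `handle_starttag`, `handle_endtag` and `handle_data` as it walks the text, so any context (such as "we are inside the right div") has to be tracked in instance attributes.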

In [1]:
import time
import urllib.request
from html.parser import HTMLParser
from gensim.utils import simple_preprocess

def get_value_by_key(attrs, key):
    """Return the value of attribute `key` from a list of (name, value) pairs."""
    for k, v in attrs:
        if k == key:
            return v
    return None

class ArchiveParser(HTMLParser):
    """Walks a board archive page and downloads the threads it links to."""

    def __init__(self):
        super().__init__()
        self.flag = False    # True while inside <div class="box-data">
        self.threads = []    # parsed threads collected so far
        self.limit = 200     # how many more threads to collect

    def handle_starttag(self, tag, attrs):
        if self.limit <= 0:
            return
        if tag == 'div' and get_value_by_key(attrs, 'class') == 'box-data':
            self.flag = True
        if self.flag and tag == 'a':
            href = get_value_by_key(attrs, 'href')
            # Skip <a> tags without an href and links of 20 characters or fewer.
            if href is not None and len(href) > 20:
                print(href)
                print(self.limit)
                thread = parse_thread('https://2ch.hk' + href)
                # Keep only threads with more than 10 non-empty messages.
                if len(thread) > 10:
                    self.threads.append(thread)
                    self.limit -= 1

    def handle_endtag(self, tag):
        if tag == 'div':
            self.flag = False

    def get_threads(self):
        return self.threads

    def clean(self):
        self.threads = []
        
parser = ArchiveParser()

def parse_archive(board='/b/', page_number=0):
    """Download one archive page of `board` and return its parsed threads."""
    link = 'https://2ch.hk' + board + 'arch/' + str(page_number) + '.html'
    print(link)
    parser.limit = 100    # collect at most 100 threads per call
    with urllib.request.urlopen(link) as response:
        for line in response.readlines():
            parser.feed(line.decode('utf-8'))
    res = parser.get_threads()
    parser.clean()
    return res

In [2]:
class ThreadParser(HTMLParser):
    """Collects the tokenized text of every <blockquote> in a thread page."""

    def __init__(self):
        super().__init__()
        self.flag = False     # True while inside a <blockquote>
        self.message = []     # tokens of the message being read
        self.messages = []    # all messages of the current thread

    def handle_starttag(self, tag, attrs):
        if tag == 'blockquote':
            self.flag = True
            self.message = []

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.flag = False
            if self.message:
                self.messages.append(self.message)
            self.message = []

    def handle_data(self, data):
        if self.flag:
            self.message.extend(simple_preprocess(data))

    def get_messages(self):
        return self.messages

    def clear_messages(self):
        self.flag = False
        self.message = []
        self.messages = []

t_parser = ThreadParser()

def parse_thread(link):
    """Download an archived thread and return its messages as token lists."""
    with urllib.request.urlopen(link) as response:
        for line in response.readlines():
            t_parser.feed(line.decode('utf-8', errors='ignore'))
    res = t_parser.get_messages()
    t_parser.clear_messages()
    return res

Quite a lot of code, right? If you did not get lost, you may have noticed the function parse_archive, which parses an archive page given a board and a page number.


$\textbf{Task.}$
Let's apply it to some boards. Choose two boards of Двач that have an archive and download their archives with parse_archive.

In [4]:
boards = ['/fiz/', '/re/'] 
# TODO: list the board names in the form '/board/', for example '/mu/' for Music
threads_by_topic = [parse_archive(board=board) for board in boards]

https://2ch.hk/fiz/arch/0.html
/fiz/arch/2016-04-29/res/828107.html
100
/fiz/arch/2016-05-02/res/827737.html
100
/fiz/arch/2016-09-04/res/827560.html
99
/fiz/arch/2016-05-09/res/827545.html
98
/fiz/arch/2016-04-30/res/826831.html
97
/fiz/arch/2016-05-11/res/826655.html
96
/fiz/arch/2016-04-28/res/826506.html
95
/fiz/arch/2016-05-01/res/826467.html
95
/fiz/arch/2016-04-28/res/826451.html
94
/fiz/arch/2016-04-29/res/826378.html
93
/fiz/arch/2016-05-07/res/826266.html
92
/fiz/arch/2016-05-29/res/826087.html
91
/fiz/arch/2016-05-01/res/825996.html
90
/fiz/arch/2016-05-08/res/825948.html
89
/fiz/arch/2016-04-29/res/825832.html
88
/fiz/arch/2016-05-16/res/825684.html
87
/fiz/arch/2016-05-15/res/825479.html
87
/fiz/arch/2016-04-30/res/825287.html
86
/fiz/arch/2016-05-09/res/825230.html
86
/fiz/arch/2016-04-28/res/825199.html
85
/fiz/arch/2016-04-28/res/825193.html
85
/fiz/arch/2016-04-28/res/825183.html
84
/fiz/arch/2016-08-13/res/825133.html
83
/fiz/arch/2016-04-29/res/824943.html
82
/fiz/ar

/re/arch/2016-06-08/res/349342.html
49
/re/arch/2016-06-13/res/349286.html
49
/re/arch/2016-07-21/res/349266.html
48
/re/arch/2016-06-13/res/349225.html
47
/re/arch/2016-06-09/res/349129.html
46
/re/arch/2016-06-04/res/349111.html
45
/re/arch/2016-06-03/res/348969.html
45
/re/arch/2016-06-04/res/348958.html
45
/re/arch/2016-06-04/res/348924.html
44
/re/arch/2016-06-03/res/348864.html
44
/re/arch/2016-06-15/res/348731.html
44
/re/arch/2016-06-28/res/348705.html
44
/re/arch/2016-09-04/res/348643.html
43
/re/arch/2016-06-20/res/348602.html
42
/re/arch/2016-06-01/res/348492.html
41
/re/arch/2016-06-01/res/348484.html
41
/re/arch/2016-06-14/res/348466.html
41
/re/arch/2016-07-20/res/348387.html
40
/re/arch/2016-06-20/res/348385.html
39
/re/arch/2016-05-31/res/348350.html
38
/re/arch/2016-06-07/res/348234.html
38
/re/arch/2016-05-29/res/348197.html
37
/re/arch/2016-06-05/res/348191.html
37
/re/arch/2016-06-10/res/347964.html
36
/re/arch/2016-06-06/res/347936.html
35
/re/arch/2016-06-14/res/3

Let's split our data into training and test sets. Every tenth thread goes into the test set.

In [5]:
data = []
test = []

it = 0
for topic in threads_by_topic:
    for thread in topic:
        full = []
        for post in thread:
            full.extend(post)
        it = it + 1
        if(it % 10 == 0):
            test.append(full)
        else:
            data.append(full)

data[0]

['сап',
 'физач',
 'есть',
 'один',
 'больной',
 'скиннифэт',
 'кароче',
 'меня',
 'была',
 'операция',
 'по',
 'поводу',
 'иссечение',
 'липомы',
 'на',
 'уровне',
 'иссечение',
 'сакральной',
 'кисты',
 'на',
 'уровне',
 'на',
 'последнем',
 'мрт',
 'все',
 'норм',
 'но',
 'врач',
 'сказал',
 'что',
 'можно',
 'заниматься',
 'только',
 'без',
 'осевой',
 'нагрузки',
 'на',
 'позвоночник',
 'вообще',
 'избегать',
 'нагрузок',
 'на',
 'поясницу',
 'впринципе',
 'позвоночник',
 'особых',
 'поясните',
 'боги',
 'физача',
 'апполоны',
 'во',
 'плоти',
 'как',
 'мне',
 'можно',
 'заниматься',
 'какими',
 'упражнениями',
 'чего',
 'начинать',
 'бамп',
 'помогите',
 'op',
 'зал',
 'без',
 'осевой',
 'нагрузки',
 'на',
 'позвоночник',
 'это',
 'невозможно',
 'врач',
 'пиздит',
 'заниматься',
 'можно',
 'начинать',
 'надо',
 'веса',
 'которым',
 'ничего',
 'не',
 'болит',
 'шти',
 'ты',
 'из',
 'ниво',
 'инвалида',
 'сделаишь',
 'эмммм',
 'ну',
 'жим',
 'лёжа',
 'это',
 'же',
 'тип',
 'не',
 '

$\textbf{Task.}$
Russian has many words (particles, interjections, and so on) that carry no meaning of their own and only play an auxiliary role. To make your model work better, add stop words to the list RUSSIAN_STOP_WORDS or to the string st_str. These words are filtered out of the dataset before the model starts training on it.

In [14]:
from gensim.utils import simple_preprocess
from gensim import corpora

RUSSIAN_STOP_WORDS = ['не', 'это', 'лишь', 'поэтому', 'что','чем','это','как','https','нет','op','он','же','так','но','да','нет','или','и', 'на', "то", "бы", "все", "ты", "если", "по", "за", "там", "ну", "уже", "от", "есть","был", "даже", "было", "www", "com", "youtube", "из", "будет", "mp", "они", "только", "его", "она", "вот", 'просто', 'watch', 'кто', 'для', 'когда', 'тут', 'мне', 'где', 'мы', 'какой', 'может', 'меня', 'до', 'про', 'http', 'раз', 'почему', 'тебя', 'ещё', 'их', 'сейчас', 'тоже', 'во', 'чтобы', 'этого','без', 'него','вы','такой', 'можно', 'надо', 'нахуй', 'ли', 'потом', 'тред', 'больше', 'лучше', 'хуй', 'сам', 'после', 'со', 'лол', 'быть', 'нужно', 'этом', 'блять', 'бля', 'того', 'ничего', 'потому', 'нибудь', 'этот', 'под', 'через', 'ни', 'себе', 'ему', 'при', 'какие', 'пиздец', 'теперь', 'хоть', 'говно', 'тогда', 'блядь', 'кстати', 'че', 'себя', 'конечно', 'типа', 'много', 'том', 'нихуя', 'куда', 'всегда', 'нас', 'тот', 'ведь', 'эти', 'них', 'сука', 'пока', 'более', 'чего', 'html', 'были', 'всех', 'была', 'например', 'тем', 'ru', 'зачем', 'либо', 'вроде', 'всего', 'вопрос', 'php', 'против', 'здесь', 'ее', 'значит', 'совсем', 'сколько', 'им', 'org', 'именно', 'эту',]
st_str = "которых которые твой которой которого сих ком свой твоя этими слишком нами всему будь саму чаще ваше сами наш затем еще самих наши ту каждое мочь весь этим наша своих оба который зато те этих вся ваш такая теми ею которая нередко каждая также чему собой самими нем вами ими откуда такие тому та очень сама нему алло оно этому кому тобой таки твоё каждые твои мой нею самим ваши ваша кем мои однако сразу свое ними всё неё тех хотя всем тобою тебе одной другие этао само эта буду самой моё своей такое всею будут своего кого свои мог нам особенно её самому наше кроме вообще вон мною никто это"
RUSSIAN_STOP_WORDS.extend(st_str.split(' '))

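# Membership tests on a set are much faster than scanning the list.
russian_stop_set = set(RUSSIAN_STOP_WORDS)
data = [[word for word in piece if word not in russian_stop_set] for piece in data]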
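One possible way to look for further stop-word candidates is to count in how many documents each token occurs: words present in most threads regardless of the board are unlikely to separate topics. The sketch below runs on a made-up toy corpus; the function name `stop_word_candidates` and the `min_share` threshold are illustrative assumptions, not part of the lab code.

```python
from collections import Counter


def stop_word_candidates(docs, min_share=0.5):
    """Return tokens that occur in at least `min_share` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))      # count each token once per document
    threshold = min_share * len(docs)
    return sorted(tok for tok, n in doc_freq.items() if n >= threshold)


# Toy corpus, invented for the illustration.
toy_docs = [
    ["это", "жим", "лёжа", "зал"],
    ["это", "бог", "вера", "как"],
    ["как", "это", "зал", "вес"],
    ["это", "как", "церковь"],
]
print(stop_word_candidates(toy_docs, min_share=0.75))
```

The resulting list is only a set of candidates: a frequent word can still be meaningful for a topic, so it is worth reviewing the list by eye before adding anything to RUSSIAN_STOP_WORDS.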

Let's build a dictionary and use it to convert words to their ids.

In [15]:
id2word = corpora.Dictionary(data)

# Create Corpus
texts = data

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
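To make the bag-of-words format less abstract, here is a pure-Python sketch of the same idea: each document becomes a sorted list of (token id, count) pairs. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for this illustration; gensim's Dictionary may assign different ids to the same tokens.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token (first-seen order)."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Sorted (token_id, count) pairs; tokens missing from vocab are dropped."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


docs = [["space", "launch", "space"], ["hockey", "team"]]
vocab = build_vocab(docs)
print(vocab)
print(to_bow(["space", "team", "space", "unknown"], vocab))
```

Word order is lost in this representation, which is exactly what LDA expects: it only sees how often each word occurs in each document.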

Let's train an LDA model with the gensim library. We set the number of topics equal to the number of downloaded boards.

In [16]:
from gensim.models import LdaModel

model = LdaModel(corpus, id2word=id2word, num_topics=len(threads_by_topic))

Now let's get the top 10 words of each topic.

$\textbf{Task.}$
Assess how well the model separated the topics.

In [17]:
for i in range(len(threads_by_topic)):
    print([id2word[term_id] for term_id, _ in model.get_topic_terms(topicid=i, topn=10)])

['день', 'кг', 'бог', 'один', 'делать', 'лет', 'жизни', 'время', 'человек', 'люди']
['кг', 'день', 'человек', 'бог', 'можешь', 'время', 'делать', 'бога', 'жизни', 'людей']
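One rough way to put a number on how well the topics are separated is to measure how much their top-word lists overlap. The sketch below computes the Jaccard index of the two lists printed above; the helper `jaccard` is written for this illustration, and the choice of this particular measure is an assumption, not something the lab prescribes.

```python
def jaccard(words_a, words_b):
    """Share of words common to both lists among all distinct words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


# Top-10 lists copied from the output above.
topic_0 = ['день', 'кг', 'бог', 'один', 'делать', 'лет', 'жизни', 'время', 'человек', 'люди']
topic_1 = ['кг', 'день', 'человек', 'бог', 'можешь', 'время', 'делать', 'бога', 'жизни', 'людей']

shared = sorted(set(topic_0) & set(topic_1))
print(shared)
print(round(jaccard(topic_0, topic_1), 2))
```

Seven of the ten top words coincide, and words from both boards ('кг' and 'бог') appear in each topic, which suggests the two topics are close to mixtures of both boards rather than one topic per board.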


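Now let's run the test threads through the model. With the settings above (100 threads per board, every tenth thread held out), the test set consists of n consecutive blocks of 10 threads each, and the i-th block corresponds to the i-th board.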

In [20]:
other_corpus = [id2word.doc2bow(text) for text in [list(filter(lambda word: not word in RUSSIAN_STOP_WORDS, piece)) for piece in test]]

vector = [model[unseen_doc] for unseen_doc in other_corpus]
print(vector[0])  # topic probabilities for test thread 0

[(0, 0.21614233), (1, 0.7838577)]


In [21]:
for i, res in enumerate(vector):
    if res:
        # res is a list of (topic_id, probability) pairs; report the topic id
        # itself rather than its position in the list.
        topic_id, prob = max(res, key=lambda pair: pair[1])
        print("Text #" + str(i) + ", topic #" + str(topic_id) + ", prob = " + str(prob))

Text #0, topic #1, prob = 0.7838577
Text #1, topic #1, prob = 0.7610595
Text #2, topic #0, prob = 0.56745833
Text #3, topic #1, prob = 0.5545186
Text #4, topic #0, prob = 0.6075075
Text #5, topic #0, prob = 0.55626833
Text #6, topic #1, prob = 0.53630984
Text #7, topic #1, prob = 0.55875
Text #8, topic #0, prob = 0.86626035
Text #9, topic #1, prob = 0.5035742
Text #10, topic #0, prob = 0.58734775
Text #11, topic #1, prob = 0.7281541
Text #12, topic #1, prob = 0.7851186
Text #13, topic #0, prob = 0.5630218
Text #14, topic #0, prob = 0.60416245
Text #15, topic #0, prob = 0.55250865
Text #16, topic #0, prob = 0.68980014
Text #17, topic #0, prob = 0.7368434
Text #18, topic #1, prob = 0.75275594
Text #19, topic #0, prob = 0.6599304


$\textbf{Task.}$

Evaluate the model's results on the test set. If the model separated the data poorly, explain why.
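Because LDA numbers its topics arbitrarily, topic #0 does not have to mean board #0. One way to score the printed predictions is therefore to try every topic-to-board relabeling and keep the best accuracy. The sketch below does this for the output above. It assumes that the first ten test threads come from the first board and the last ten from the second, which follows from the split only if each board contributed exactly 100 threads.

```python
from itertools import permutations


def best_permutation_accuracy(predicted_topics, true_boards, n_topics):
    """Accuracy under the topic -> board relabeling that fits best."""
    best = 0.0
    for perm in permutations(range(n_topics)):
        hits = sum(perm[t] == b for t, b in zip(predicted_topics, true_boards))
        best = max(best, hits / len(true_boards))
    return best


# Topic ids copied from the printed output above.
predicted = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
             0, 1, 1, 0, 0, 0, 0, 0, 1, 0]
# Assumption: ten test threads per board, in board order.
boards = [0] * 10 + [1] * 10

print(best_permutation_accuracy(predicted, boards, n_topics=2))
```

For two balanced classes a random guess already gives about 0.5, so the result should be read against that baseline. Trying all permutations is only practical for a handful of topics, since their number grows factorially.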

### Part 2: and now a proper dataset

Now let's use a more standard dataset available through sklearn: 20newsgroups, a collection of newsgroup posts on various subjects. We pick seven of them: atheism, Apple (Mac) hardware, cars, hockey, space, Christianity, and the Middle East.

In [22]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism',
 'comp.sys.mac.hardware',
 'rec.autos',
 'rec.sport.hockey',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories = categories)

$\textbf{Task.}$

Find a library list or write your own list ENGLISH_STOP_WORDS that removes English words carrying no meaning.

In [23]:
from gensim.utils import simple_preprocess
from gensim import corpora

ENGLISH_STOP_WORDS = ["a", "about", "above", "after", "again", "against", "ain", "all", "am", "an", "and", "any", "are", "aren", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can", "couldn", "couldn't", "d", "did", "didn", "didn't", "do", "does", "doesn", "doesn't", "doing", "don", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn", "hadn't", "has", "hasn", "hasn't", "have", "haven", "haven't", "having", "he", "her", "here", "hers", "herself", "him", "himself", "his", "how", "i", "if", "in", "into", "is", "isn", "isn't", "it", "it's", "its", "itself", "just", "ll", "m", "ma", "me", "mightn", "mightn't", "more", "most", "mustn", "mustn't", "my", "myself", "needn", "needn't", "no", "nor", "not", "now", "o", "of", "off", "on", "once", "only", "or", "other", "our", "ours", "ourselves", "out", "over", "own", "re", "s", "same", "shan", "shan't", "she", "she's", "should", "should've", "shouldn", "shouldn't", "so", "some", "such", "t", "than", "that", "that'll", "the", "their", "theirs", "them", "themselves", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "until", "up", "ve", "very", "was", "wasn", "wasn't", "we", "were", "weren", "weren't", "what", "when", "where", "which", "while", "who", "whom", "why", "will", "with", "won", "won't", "wouldn", "wouldn't", "y", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves", "could", "he'd", "he'll", "he's", "here's", "how's", "i'd", "i'll", "i'm", "i've", "let's", "ought", "she'd", "she'll", "that's", "there's", "they'd", "they'll", "they're", "they've", "we'd", "we'll", "we're", "we've", "what's", "when's", "where's", "who's", "why's", "would", "able", "abst", "accordance", "according", "accordingly", "across", "act", "actually", "added", "adj", "affected", "affecting", "affects", "afterwards", "ah", "almost", "alone", "along", "already", "also", "although", "always", 
"among", "amongst", "announce", "another", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "apparently", "approximately", "arent", "arise", "around", "aside", "ask", "asking", "auth", "available", "away", "awfully", "b", "back", "became", "become", "becomes", "becoming", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "believe", "beside", "besides", "beyond", "biol", "brief", "briefly", "c", "ca", "came", "cannot", "can't", "cause", "causes", "certain", "certainly", "co", "com", "come", "comes", "contain", "containing", "contains", "couldnt", "date", "different", "done", "downwards", "due", "e", "ed", "edu", "effect", "eg", "eight", "eighty", "either", "else", "elsewhere", "end", "ending", "enough", "especially", "et", "etc", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "except", "f", "far", "ff", "fifth", "first", "five", "fix", "followed", "following", "follows", "former", "formerly", "forth", "found", "four", "furthermore", "g", "gave", "get", "gets", "getting", "give", "given", "gives", "giving", "go", "goes", "gone", "got", "gotten", "h", "happens", "hardly", "hed", "hence", "hereafter", "hereby", "herein", "heres", "hereupon", "hes", "hi", "hid", "hither", "home", "howbeit", "however", "hundred", "id", "ie", "im", "immediate", "immediately", "importance", "important", "inc", "indeed", "index", "information", "instead", "invention", "inward", "itd", "it'll", "j", "k", "keep", "keeps", "kept", "kg", "km", "know", "known", "knows", "l", "largely", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "lets", "like", "liked", "likely", "line", "little", "'ll", "look", "looking", "looks", "ltd", "made", "mainly", "make", "makes", "many", "may", "maybe", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "million", "miss", "ml", "moreover", "mostly", "mr", "mrs", "much", "mug", "must", "n", "na", "name", "namely", "nay", 
"nd", "near", "nearly", "necessarily", "necessary", "need", "needs", "neither", "never", "nevertheless", "new", "next", "nine", "ninety", "nobody", "non", "none", "nonetheless", "noone", "normally", "nos", "noted", "nothing", "nowhere", "obtain", "obtained", "obviously", "often", "oh", "ok", "okay", "old", "omitted", "one", "ones", "onto", "ord", "others", "otherwise", "outside", "overall", "owing", "p", "page", "pages", "part", "particular", "particularly", "past", "per", "perhaps", "placed", "please", "plus", "poorly", "possible", "possibly", "potentially", "pp", "predominantly", "present", "previously", "primarily", "probably", "promptly", "proud", "provides", "put", "q", "que", "quickly", "quite", "qv", "r", "ran", "rather", "rd", "readily", "really", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "respectively", "resulted", "resulting", "results", "right", "run", "said", "saw", "say", "saying", "says", "sec", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sent", "seven", "several", "shall", "shed", "shes", "show", "showed", "shown", "showns", "shows", "significant", "significantly", "similar", "similarly", "since", "six", "slightly", "somebody", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "specifically", "specified", "specify", "specifying", "still", "stop", "strongly", "sub", "substantially", "successfully", "sufficiently", "suggest", "sup", "sure", "take", "taken", "taking", "tell", "tends", "th", "thank", "thanks", "thanx", "thats", "that've", "thence", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof", "therere", "theres", "thereto", "thereupon", "there've", "theyd", "theyre", "think", "thou", "though", "thoughh", "thousand", "throug", "throughout", "thru", "thus", "til", "tip", "together", "took", "toward", "towards", "tried", "tries", "truly", 
"try", "trying", "ts", "twice", "two", "u", "un", "unfortunately", "unless", "unlike", "unlikely", "unto", "upon", "ups", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "v", "value", "various", "'ve", "via", "viz", "vol", "vols", "vs", "w", "want", "wants", "wasnt", "way", "wed", "welcome", "went", "werent", "whatever", "what'll", "whats", "whence", "whenever", "whereafter", "whereas", "whereby", "wherein", "wheres", "whereupon", "wherever", "whether", "whim", "whither", "whod", "whoever", "whole", "who'll", "whomever", "whos", "whose", "widely", "willing", "wish", "within", "without", "wont", "words", "world", "wouldnt", "www", "x", "yes", "yet", "youd", "youre", "z", "zero", "a's", "ain't", "allow", "allows", "apart", "appear", "appreciate", "appropriate", "associated", "best", "better", "c'mon", "c's", "cant", "changes", "clearly", "concerning", "consequently", "consider", "considering", "corresponding", "course", "currently", "definitely", "described", "despite", "entirely", "exactly", "example", "going", "greetings", "hello", "help", "hopefully", "ignored", "inasmuch", "indicate", "indicated", "indicates", "inner", "insofar", "it'd", "keep", "keeps", "novel", "presumably", "reasonably", "second", "secondly", "sensible", "serious", "seriously", "sure", "t's", "third", "thorough", "thoroughly", "three", "well", "wonder", "a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", 
"describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", 
"whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "co", "op", "research-articl", "pagecount", "cit", "ibid", "les", "le", "au", "que", "est", "pas", "vol", "el", "los", "pp", "u201d", "well-b", "http", "volumtype", "par", "0o", "0s", "3a", "3b", "3d", "6b", "6o", "a1", "a2", "a3", "a4", "ab", "ac", "ad", "ae", "af", "ag", "aj", "al", "an", "ao", "ap", "ar", "av", "aw", "ax", "ay", "az", "b1", "b2", "b3", "ba", "bc", "bd", "be", "bi", "bj", "bk", "bl", "bn", "bp", "br", "bs", "bt", "bu", "bx", "c1", "c2", "c3", "cc", "cd", "ce", "cf", "cg", "ch", "ci", "cj", "cl", "cm", "cn", "cp", "cq", "cr", "cs", "ct", "cu", "cv", "cx", "cy", "cz", "d2", "da", "dc", "dd", "de", "df", "di", "dj", "dk", "dl", "do", "dp", "dr", "ds", "dt", "du", "dx", "dy", "e2", "e3", "ea", "ec", "ed", "ee", "ef", "ei", "ej", "el", "em", "en", "eo", "ep", "eq", "er", "es", "et", "eu", "ev", "ex", "ey", "f2", "fa", "fc", "ff", "fi", "fj", "fl", "fn", "fo", "fr", "fs", "ft", "fu", "fy", "ga", "ge", "gi", "gj", "gl", "go", "gr", "gs", "gy", "h2", "h3", "hh", "hi", "hj", "ho", "hr", "hs", "hu", "hy", "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ic", "ie", "ig", "ih", "ii", "ij", "il", "in", "io", "ip", "iq", "ir", "iv", "ix", "iy", "iz", "jj", "jr", "js", "jt", "ju", "ke", "kg", "kj", "km", "ko", "l2", "la", "lb", "lc", "lf", "lj", "ln", "lo", "lr", "ls", "lt", "m2", "ml", "mn", "mo", "ms", "mt", "mu", "n2", "nc", "nd", "ne", "ng", "ni", "nj", "nl", "nn", "nr", "ns", "nt", "ny", "oa", "ob", "oc", "od", "of", "og", "oi", 
"oj", "ol", "om", "on", "oo", "oq", "or", "os", "ot", "ou", "ow", "ox", "oz", "p1", "p2", "p3", "pc", "pd", "pe", "pf", "ph", "pi", "pj", "pk", "pl", "pm", "pn", "po", "pq", "pr", "ps", "pt", "pu", "py", "qj", "qu", "r2", "ra", "rc", "rd", "rf", "rh", "ri", "rj", "rl", "rm", "rn", "ro", "rq", "rr", "rs", "rt", "ru", "rv", "ry", "s2", "sa", "sc", "sd", "se", "sf", "si", "sj", "sl", "sm", "sn", "sp", "sq", "sr", "ss", "st", "sy", "sz", "t1", "t2", "t3", "tb", "tc", "td", "te", "tf", "th", "ti", "tj", "tl", "tm", "tn", "tp", "tq", "tr", "ts", "tt", "tv", "tx", "ue", "ui", "uj", "uk", "um", "un", "uo", "ur", "ut", "va", "wa", "vd", "wi", "vj", "vo", "wo", "vq", "vt", "vu", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y2", "yj", "yl", "yr", "ys", "yt", "zi", "zz"]
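# Membership tests on a set are much faster than scanning the long list above.
english_stop_set = set(ENGLISH_STOP_WORDS)
data = [[word for word in simple_preprocess(piece) if word not in english_stop_set] for piece in newsgroups_train.data]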
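As an alternative to a hand-made list, scikit-learn ships a built-in English stop-word set. The sketch below imports it under the alias `SKLEARN_STOP_WORDS` (an alias chosen here to avoid clashing with the list defined above) and filters a made-up token list. gensim offers a similar set as `gensim.parsing.preprocessing.STOPWORDS`.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as SKLEARN_STOP_WORDS

# Made-up sentence, already tokenized and lowercased.
tokens = ["the", "launch", "of", "the", "shuttle", "is", "delayed"]
kept = [tok for tok in tokens if tok not in SKLEARN_STOP_WORDS]
print(kept)
```

The built-in set is a frozenset, so membership tests are fast, and it can be combined with corpus-specific additions, for example `SKLEARN_STOP_WORDS | {"edu", "com"}`.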

$\textbf{Big task 1.}$

For the list data, create the dictionary id2word. Build the term-document-frequency list corpus and train an LDA model on it.

In [30]:
from gensim.models import LdaModel, LsiModel

id2word = corpora.Dictionary(data)

# Create Corpus
texts = data

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

model = LdaModel(corpus, id2word=id2word, num_topics=len(categories))

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 4), (10, 2), (11, 6), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]]


In [25]:
# Print the resulting list of topics:
for i in range(len(categories)):
    print([id2word[term_id] for term_id, _ in model.get_topic_terms(topicid=i, topn=10)])

['turkey', 'kurds', 'armenian', 'tony', 'power', 'russian', 'volunteers', 'bristol', 'caucasus', 'conform']
['rate', 'failure', 'finnish', 'abstinence', 'chart', 'school', 'catholic', 'people', 'tony', 'sex']
['fingers', 'mediocrity', 'lindros', 'surprise', 'ottawa', 'thankfully', 'space', 'sign', 'patrick', 'people']
['gt', 'govern', 'manta', 'pretty', 'surprise', 'team', 'early', 'european', 'sold', 'mid']
['blues', 'hawks', 'offended', 'cares', 'predicted', 'day', 'left', 'place', 'wrote', 'wake']
['psuvm', 'space', 'nasa', 'data', 'people', 'longer', 'launch', 'mac', 'time', 'earth']
['people', 'bible', 'religious', 'reasonable', 'point', 'christian', 'list', 'arguments', 'argument', 'rational']


$\textbf{Big task 2.}$

Process the test data the same way as the training data.

Write a function that uses the model to return the most probable topic id. Use the F-measure to assess how well the model works.
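Keep in mind that LDA is unsupervised: its topic ids are arbitrary and need not match the category ids in `newsgroups_test.target`. One common workaround, sketched below on made-up toy numbers, is to map every topic to the category that is most frequent among the training documents assigned to it, and only then compute the F-measure. The helper `map_topics_to_labels` is an assumption made for this illustration, not part of gensim or sklearn.

```python
from collections import Counter, defaultdict

from sklearn.metrics import f1_score


def map_topics_to_labels(topics, labels):
    """Map each topic id to the label that occurs most often with it."""
    votes = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        votes[topic][label] += 1
    return {topic: counts.most_common(1)[0][0] for topic, counts in votes.items()}


# Toy data: dominant topic and true category of seven training documents.
train_topics = [2, 2, 0, 0, 1, 1, 2]
train_labels = [0, 0, 1, 1, 2, 2, 1]
mapping = map_topics_to_labels(train_topics, train_labels)

# Toy test documents; topics never seen in training map to -1.
test_topics = [2, 0, 1, 1]
test_labels = [0, 1, 2, 0]
predicted = [mapping.get(topic, -1) for topic in test_topics]

score = f1_score(test_labels, predicted, average='micro')
print(mapping, predicted, score)
```

Several topics may map to the same category, and some categories may get no topic at all; both cases are themselves a useful signal that the model did not recover the category structure.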

In [31]:
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories = categories)

other_corpus = [id2word.doc2bow(text) for text in [list(filter(lambda word: not word in ENGLISH_STOP_WORDS, simple_preprocess(piece))) for piece in newsgroups_test.data]]
vector = [model[unseen_doc] for unseen_doc in other_corpus]

# TODO: YOUR CODE

[(2, 0.2848444), (4, 0.28398177), (5, 0.07771322), (6, 0.35068744)]
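The output above lists only four of the seven topics: gensim omits topics whose probability falls below its minimum threshold. The position of a pair in the list is therefore not the topic id, and the id has to be read from the pair itself. A small sketch of such a helper follows; the name `most_likely_topic` is chosen for this illustration.

```python
def most_likely_topic(doc_topics):
    """Return the topic id with the highest probability, or None if empty."""
    if not doc_topics:
        return None
    topic_id, _prob = max(doc_topics, key=lambda pair: pair[1])
    return topic_id


# The document-topic list printed above.
example = [(2, 0.2848444), (4, 0.28398177), (5, 0.07771322), (6, 0.35068744)]
print(most_likely_topic(example))
```

Here the most probable pair sits at position 3 of the list, while the topic it refers to is topic 6.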


In [27]:
newsgroups_test.data

['Andrew - continuing the discussion on the Deuterocanonicals...\n\n\nArguably, it is both. Since authority is a matter of both\ncommunicator and recepiant we can say that, for example "Jesus\nis Lord" whether the world at large accepts the authority or\nnot. Thus the Bible can be considered for its authoritative\ncontent whether or not it is accepted (This issue is at the\nheart of Pilate\'s pragmatic question "What is truth?" to Jesus\nwhen our Lord was brought before Him. Jesus\' reply was to appeal\nto the authority of his Father)\nYou also might like to consider the claimed authority\nrepresented by the statements "thus says the Lord" in the Bible,\nwhich claim to put across the exact words of God.\n\nYou fall into the danger of relativism with your rejection of\ninherant authority and claim that it lies only in the "community\nof faith" - does something become truth because it is accepted?\nThe main thrust of my argument is that there is a Godward\ndirection as well as a manward 

In [28]:
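def probability_of_text(text):
    """Print the most probable topic of test document number `text`."""
    res = vector[text]
    if res:
        # res holds (topic_id, probability) pairs; topics below gensim's
        # minimum probability are omitted, so list position != topic id.
        topic_id, prob = max(res, key=lambda pair: pair[1])
        print("Text #" + str(text) + ", topic #" + str(topic_id) + ", prob = " + str(prob))

probability_of_text(3)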

Text #3, topic #1, prob = 0.7933356


In [29]:
from sklearn.metrics import f1_score

vector_pred = []
for res in vector:
    if res:
        # Take the topic id of the most probable (topic_id, probability) pair.
        vector_pred.append(max(res, key=lambda pair: pair[1])[0])
    else:
        # Keep y_pred aligned with y_true for documents without any topic.
        vector_pred.append(-1)

y_true = newsgroups_test.target
y_pred = vector_pred
# Note: LDA topic ids are arbitrary and need not coincide with category ids.
f1_score(y_true, y_pred, average='micro')

In [19]:
print(newsgroups_test.target[:10])
print(vector_pred[:10])
