# Topic modeling

Please work on this exercise in pairs, but on two computers. On both computers, first follow the instructions in the **Preparation** section. Then, on one computer, immediately uncomment and run the code in the **Wikipedia** and **Visualization** sections. On the other computer, solve the tasks in the **Tweets** and **Visualization** sections.

After completing this exercise you should:
+ be able to perform basic topic modeling,
+ know how to create a dictionary that maps identifiers to words,
+ be able to build a matrix of TF-IDF vectors,
+ know how to use LDA to determine the topic proportions of a new text document,
+ be able to visualize the results of the LDA algorithm.

We will use the [gensim](https://radimrehurek.com/gensim/) library, which offers a range of text analysis methods. When you have a spare moment, it is worth checking what else, besides the LDA algorithm, is implemented in this package!

## Preparation

For the visualization to work, we first need to update the `scipy` library and install the `pyldavis` library. This is a good opportunity to see how libraries are managed in Anaconda. If you are curious about what the individual commands mean, see the [Anaconda documentation](http://conda.pydata.org/docs/using/pkgs.html).

1. Stop the kernel (the Jupyter Notebook server)
2. Open a terminal
3. Type `conda update scipy` (and press Enter when asked `Proceed`)
4. Watch the progress bars
5. Type `activate root`
6. Type `pip install pyldavis`
7. Type `deactivate`

Done. You can now restart the notebook and move on to the next steps.
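
After restarting, it can be handy to confirm that the packages are visible to the notebook's Python. The following is a minimal sketch that uses only the standard library. The helper name `installed_version` is made up for illustration, and the distribution names are assumed to match what the steps above installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the version of each library used in this exercise (None means "not installed")
for name in ("scipy", "pyLDAvis", "gensim"):
    print(name, installed_version(name))
```

If any of the lines prints `None`, the corresponding install step most likely ran in a different environment than the one the notebook uses.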

## Wikipedia

This code fragment is an example of running topic modeling on a larger dataset. Since computing the model takes from a few to a dozen or so minutes, each pair should run this example on one computer only.

To run the example, the files `wiki_wordids.txt.bz2` and `wiki_tfidf.mm`, downloaded together with the notebook, must be in the same folder as the notebook. The example is based on a subset of Wikipedia pages available at https://dumps.wikimedia.org/enwiki/latest/. The pages were converted to a vector representation with the script:
`python -m gensim.scripts.make_wiki`.

**Read the comments before you run the code.**

In [27]:
import logging
import gensim
import warnings
warnings.filterwarnings("ignore")

# Enable logging to follow the algorithm's progress (depending on the setup, the messages
# may not show up in Jupyter Notebook, but it is worth mentioning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the id->word mapping (dictionary) from a file
id2word = gensim.corpora.Dictionary.load_from_text('wiki_wordids.txt.bz2')

# Load the vector representation of the corpus (a matrix of TF-IDF vectors) from a file
mm = gensim.corpora.MmCorpus('wiki_tfidf.mm')
print(mm)

# Build an LDA model with 20 topics, making 20 passes over the whole corpus
lda = gensim.models.LdaMulticore(corpus=mm, id2word=id2word, num_topics=20, passes=20, workers=4)
# Alternatively, in case of problems with parallel execution:
# lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=20, update_every=0, passes=20)
lda.print_topics(20)

2021-01-15 20:08:20,691 : INFO : initializing cython corpus reader from wiki_tfidf.mm
2021-01-15 20:08:20,692 : INFO : accepted corpus with 4672 documents, 19960 features, 1762996 non-zero entries
2021-01-15 20:08:20,694 : INFO : using symmetric alpha at 0.05
2021-01-15 20:08:20,695 : INFO : using symmetric eta at 0.05
2021-01-15 20:08:20,698 : INFO : using serial LDA version on this node
2021-01-15 20:08:20,753 : INFO : running online LDA training, 20 topics, 20 passes over the supplied corpus of 4672 documents, updating every 8000 documents, evaluating every ~4672 documents, iterating 50x with a convergence threshold of 0.001000
2021-01-15 20:08:20,754 : INFO : training LDA model using 4 processes


MmCorpus(4672 documents, 19960 features, 1762996 non-zero entries)


2021-01-15 20:08:21,301 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2021-01-15 20:08:22,074 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2021-01-15 20:08:22,504 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #4672/4672, outstanding queue size 3
2021-01-15 20:08:23,886 : INFO : topic #19 (0.050): 0.001*"album" + 0.001*"eritrea" + 0.001*"catherine" + 0.001*"jpg" + 0.001*"doctor" + 0.001*"binary" + 0.001*"alfonso" + 0.001*"disambiguation" + 0.001*"ball" + 0.001*"psychology"
2021-01-15 20:08:23,887 : INFO : topic #13 (0.050): 0.001*"px" + 0.001*"dune" + 0.001*"jpg" + 0.001*"communist" + 0.001*"bohemia" + 0.001*"czech" + 0.001*"plants" + 0.001*"alfonso" + 0.001*"vector" + 0.001*"brain"
2021-01-15 20:08:23,888 : INFO : topic #7 (0.050): 0.001*"px" + 0.001*"darwin" + 0.001*"apollo" + 0.001*"antarctic" + 0.001*"khmer" + 0.001*"lynch" + 0.001*"cornish" + 0.001*"sa

2021-01-15 20:08:39,623 : INFO : topic diff=0.951022, rho=0.369207
2021-01-15 20:08:40,851 : INFO : -14.381 per-word bound, 21337.5 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2021-01-15 20:08:41,351 : INFO : PROGRESS: pass 5, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2021-01-15 20:08:42,101 : INFO : PROGRESS: pass 5, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2021-01-15 20:08:42,529 : INFO : PROGRESS: pass 5, dispatched chunk #2 = documents up to #4672/4672, outstanding queue size 3
2021-01-15 20:08:43,343 : INFO : topic #19 (0.050): 0.001*"bantu" + 0.001*"cadillac" + 0.001*"alfonso" + 0.001*"anarchism" + 0.001*"amplifier" + 0.001*"affinity" + 0.001*"crete" + 0.001*"anarchist" + 0.001*"guthrie" + 0.001*"catherine"
2021-01-15 20:08:43,345 : INFO : topic #17 (0.050): 0.004*"actor" + 0.004*"singer" + 0.004*"politician" + 0.004*"actress" + 0.003*"footballer" + 0.003*"songwriter" + 0.002*"p

2021-01-15 20:08:57,935 : INFO : topic #2 (0.050): 0.005*"baseball" + 0.005*"px" + 0.003*"batman" + 0.003*"ball" + 0.003*"pitcher" + 0.003*"sox" + 0.003*"batter" + 0.002*"batting" + 0.002*"yankees" + 0.002*"antigua"
2021-01-15 20:08:57,938 : INFO : topic diff=0.295095, rho=0.284717
2021-01-15 20:08:59,084 : INFO : -14.036 per-word bound, 16800.0 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2021-01-15 20:08:59,573 : INFO : PROGRESS: pass 10, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2021-01-15 20:09:00,319 : INFO : PROGRESS: pass 10, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2021-01-15 20:09:00,747 : INFO : PROGRESS: pass 10, dispatched chunk #2 = documents up to #4672/4672, outstanding queue size 3
2021-01-15 20:09:01,472 : INFO : topic #1 (0.050): 0.004*"burundi" + 0.004*"burkina" + 0.003*"faso" + 0.002*"dice" + 0.001*"cocktail" + 0.001*"jutland" + 0.001*"cache" + 0.001*"fried" + 0.001

2021-01-15 20:09:15,847 : INFO : topic #12 (0.050): 0.003*"catalan" + 0.002*"samoa" + 0.002*"rum" + 0.002*"aberration" + 0.002*"abba" + 0.002*"bicarbonate" + 0.002*"catalonia" + 0.001*"emmy" + 0.001*"casino" + 0.001*"asteroids"
2021-01-15 20:09:15,848 : INFO : topic #11 (0.050): 0.003*"binomial" + 0.002*"cyril" + 0.002*"bronx" + 0.001*"elf" + 0.001*"acropolis" + 0.001*"bengal" + 0.001*"balfour" + 0.001*"archimedes" + 0.001*"spinoza" + 0.001*"badminton"
2021-01-15 20:09:15,849 : INFO : topic diff=0.092454, rho=0.240174
2021-01-15 20:09:16,992 : INFO : -13.997 per-word bound, 16346.7 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2021-01-15 20:09:17,479 : INFO : PROGRESS: pass 15, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2021-01-15 20:09:18,238 : INFO : PROGRESS: pass 15, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2021-01-15 20:09:18,791 : INFO : PROGRESS: pass 15, dispatched chunk #2 = doc

2021-01-15 20:09:33,216 : INFO : topic #17 (0.050): 0.005*"actor" + 0.005*"singer" + 0.004*"politician" + 0.004*"actress" + 0.003*"footballer" + 0.003*"songwriter" + 0.003*"producer" + 0.002*"composer" + 0.002*"screenwriter" + 0.002*"journalist"
2021-01-15 20:09:33,217 : INFO : topic #18 (0.050): 0.002*"cell" + 0.001*"acid" + 0.001*"cells" + 0.001*"christ" + 0.001*"carbon" + 0.001*"mathematics" + 0.001*"jesus" + 0.001*"dna" + 0.001*"theorem" + 0.001*"disease"
2021-01-15 20:09:33,218 : INFO : topic #19 (0.050): 0.002*"affinity" + 0.002*"cadillac" + 0.002*"bantu" + 0.002*"anarchism" + 0.001*"amplifier" + 0.001*"chromatography" + 0.001*"cloning" + 0.001*"garner" + 0.001*"dolphins" + 0.001*"garrett"
2021-01-15 20:09:33,219 : INFO : topic diff=0.033220, rho=0.211591
2021-01-15 20:09:34,356 : INFO : -13.978 per-word bound, 16136.4 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2021-01-15 20:09:34,392 : INFO : topic #0 (0.050): 0.004*"ajax" + 0.002*"subclass" 

[(0,
  '0.004*"ajax" + 0.002*"subclass" + 0.001*"alexandra" + 0.001*"mbit" + 0.001*"antimony" + 0.001*"pointer" + 0.001*"plants" + 0.001*"cumberland" + 0.001*"cocaine" + 0.001*"flowering"'),
 (1,
  '0.005*"burundi" + 0.005*"burkina" + 0.004*"faso" + 0.003*"dice" + 0.001*"cocktail" + 0.001*"jutland" + 0.001*"cache" + 0.001*"shrimp" + 0.001*"fried" + 0.001*"desserts"'),
 (2,
  '0.006*"baseball" + 0.004*"batman" + 0.004*"px" + 0.004*"ball" + 0.004*"pitcher" + 0.004*"sox" + 0.003*"batter" + 0.003*"batting" + 0.003*"yankees" + 0.003*"antigua"'),
 (3,
  '0.002*"apache" + 0.002*"franc" + 0.001*"cfa" + 0.001*"cy" + 0.001*"aston" + 0.001*"davenport" + 0.001*"conditioning" + 0.001*"thornton" + 0.001*"bind" + 0.001*"cracking"'),
 (4,
  '0.002*"album" + 0.002*"jpg" + 0.001*"aircraft" + 0.001*"disambiguation" + 0.001*"players" + 0.001*"px" + 0.001*"km" + 0.001*"airport" + 0.001*"films" + 0.001*"railway"'),
 (5,
  '0.006*"est" + 0.005*"tfr" + 0.005*"cbr" + 0.003*"births" + 0.003*"demographic" + 0.00
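
The log above reports a per-word bound together with a perplexity estimate, and a symmetric `alpha` prior. Judging by the numbers, the perplexity is 2 raised to the negated bound, and the symmetric prior is 1 divided by the number of topics. The sketch below checks both against values copied from the log. The function names are illustrative and are not part of the gensim API.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity value."""
    return 2.0 ** (-per_word_bound)


def symmetric_prior(num_topics):
    """Symmetric Dirichlet prior: every topic gets the same weight."""
    return 1.0 / num_topics


# (bound, perplexity) pairs copied from the training log
logged = [(-14.381, 21337.5), (-14.036, 16800.0), (-13.978, 16136.4)]
for bound, perplexity in logged:
    # The recomputed value differs slightly because the logged bound is rounded
    print(bound, round(perplexity_from_bound(bound), 1), perplexity)

print(symmetric_prior(20))  # compare with "using symmetric alpha at 0.05"
```

The perplexity estimates in the log fall from pass to pass (21337.5, 16800.0, 16346.7, 16136.4), and the step between consecutive values shrinks, which suggests the model is converging.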

## Tweets

In pairs, build an LDA model based on the attached tweets. To do this, convert the text files to a vector representation, first saving the id->word mapping as a dictionary. A description of how to create these data structures can be found at https://radimrehurek.com/gensim/tut1.html.

**Task 1: Load the tweets from the file tweets.tsv into the variable `tweets`.**

In [4]:
import gensim
import logging
import nltk
import re
import pandas as pd

stopwords = ["a", "about", "after", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been",
            "before", "being", "between", "both", "by", "could", "did", "do", "does", "doing", "during", "each",
            "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here",
            "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've",
            "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "of",
            "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "own", "shan't", "she", "she'd",
            "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs",
            "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're",
            "they've", "this", "those", "through", "to", "until", "up", "very", "was", "wasn't", "we", "we'd",
            "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's",
            "which", "while", "who", "who's", "whom", "with", "would", "you", "you'd", "you'll", "you're", "you've",
            "your", "yours", "yourself", "yourselves", "above", "again", "against", "aren't", "below", "but", "can't",
            "cannot", "couldn't", "didn't", "doesn't", "don't", "down", "few", "hadn't", "hasn't", "haven't", "if",
            "isn't", "mustn't", "no", "nor", "not", "off", "out", "over", "shouldn't", "same", "too", "under", "why",
            "why's", "won't", "wouldn't"]

tweets = pd.read_csv("tweets.tsv",sep="\t",header=None)[2]
tweets.head()

0    dear @Microsoft the newOoffice for Mac is grea...
1    @Microsoft how about you make a system that do...
2                                        Not Available
3                                        Not Available
4    If I make a game as a #windows10 Universal App...
Name: 2, dtype: object
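
The preview shows rows whose text is just `Not Available`. If they stay in the corpus, such placeholders can end up dominating a topic (in the results further down, one topic is led by the stem `avail`). Below is a minimal sketch of reading the third column of a tab-separated file and dropping those rows, using only the standard library. The sample rows, the three-column layout and the helper `load_tweets` are assumptions made for illustration, and whether to drop the placeholders at all is a choice the exercise leaves open.

```python
import csv
import io

PLACEHOLDER = "Not Available"


def load_tweets(handle, text_column=2):
    """Read tab-separated rows and return the tweet texts, skipping placeholders."""
    # QUOTE_NONE: treat quote characters inside tweets as ordinary text
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    texts = []
    for row in reader:
        if len(row) <= text_column:
            continue  # malformed line without a text column
        text = row[text_column].strip()
        if text and text != PLACEHOLDER:
            texts.append(text)
    return texts


# Hypothetical sample; the real file is only assumed to have this layout
sample = io.StringIO(
    "1\tpositive\tdear @Microsoft the new Office is great\n"
    "2\tneutral\tNot Available\n"
    "3\tnegative\tIf I make a game as a #windows10 app\n"
)
print(load_tweets(sample))
```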


**Task 2: Tokenize the words, remove those on the stop list (`stopwords`) and those that occur only once. Assign the result to the variable `texts`.**

In [39]:

import re
import nltk
from collections import Counter

# Raw strings keep the backslashes literal and avoid invalid-escape warnings
RE_SPACES = re.compile(r"\s+")
RE_HASHTAG = re.compile(r"[@#][_a-z0-9]+")
RE_EMOTICONS = re.compile(r"(:-?\))|(:p)|(:d+)|(:-?\()|(:/)|(;-?\))|(<3)|(=\))|(\)-?:)|(:'\()|(8\))")
RE_HTTP = re.compile(r"http(s)?://[/\.a-z0-9]+")

class Tokenizer():
    @staticmethod
    def tokenize(text):
        pass
    
class BeforeTokenizationNormalizer():
    @staticmethod
    def normalize(text):
        text = text.strip().lower()
        text = text.replace('&nbsp;', ' ')
        text = text.replace('&lt;', '<')
        text = text.replace('&gt;', '>')
        text = text.replace('&amp;', '&')
        text = text.replace('&pound;', u'£')
        text = text.replace('&euro;', u'€')
        text = text.replace('&copy;', u'©')
        text = text.replace('&reg;', u'®')
        return text

class NltkTokenizer(Tokenizer):
    @staticmethod
    def tokenize(text):
        # Write a tokenizer that uses the word_tokenize() function from the NLTK library.
        # For tweets, were all the words split correctly?
        return nltk.word_tokenize(text)


class SimpleTokenizer(Tokenizer):
    # Assumption: SimpleTokenizer is used below but not defined in this notebook;
    # a plain whitespace split is supplied here so that TweetTokenizer can run.
    @staticmethod
    def tokenize(text):
        return RE_SPACES.split(text.strip())


class TweetTokenizer(Tokenizer):
    @staticmethod
    def tokenize(text):
        tokens = SimpleTokenizer.tokenize(text)
        result_t = []
        for token in tokens:
            # check whether the token starts with an emoticon, a hashtag/mention or a link
            match_HASHTAG = re.match(RE_HASHTAG, token)
            match_EMOTICONS = re.match(RE_EMOTICONS, token)
            match_HTTP = re.match(RE_HTTP, token)

            if match_HTTP is not None:
                # links are dropped entirely
                continue
            elif match_HASHTAG is not None or match_EMOTICONS is not None:
                # hashtags, mentions and emoticons are kept as single tokens
                result_t.append(token)
            else:
                result_t += NltkTokenizer.tokenize(token)

        # create a stemmer and stem all the tokens
        porter = nltk.PorterStemmer()
        result_t = [porter.stem(token) for token in result_t]
        return result_t

all_tweets_list = []
words = Counter()

# tokenize, stopwords and make counter
for i in tweets.index:
    tweet = BeforeTokenizationNormalizer.normalize(tweets.iat[i])
    words_tweet = TweetTokenizer.tokenize(tweet)
    words_tweet_ns = [w for w in words_tweet if w not in stopwords]
    all_tweets_list.append(words_tweet_ns)
    words.update(words_tweet_ns)
    
# a set makes the membership test in the loop below fast
words_to_delete = {word for word, count in words.items() if count == 1}

# delete the words that appeared only once
texts = []

for tweet in all_tweets_list:
    tweets_list = []
    for word in tweet:
        if word not in words_to_delete:
            tweets_list.append(word)
    if len(tweets_list) > 0:
        texts.append(tweets_list)
print(len(texts))


5887
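
The filtering step of Task 2 can be tried on a toy input without NLTK. The sketch below is not the tokenizer above: it splits on whitespace only, and the tiny stop list and the sample sentences are made up. It shows just the two-pass idea of counting tokens first and then dropping those seen only once.

```python
from collections import Counter

TOY_STOPWORDS = {"the", "is", "a", "to"}  # illustrative subset, not the full list


def filter_rare(tokenized_docs, min_count=2):
    """Drop tokens that occur fewer than min_count times in the whole corpus,
    then drop documents that became empty."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    kept = [[t for t in doc if counts[t] >= min_count] for doc in tokenized_docs]
    return [doc for doc in kept if doc]


raw = [
    "the new office is great",
    "office update to windows",
    "windows is a system",
    "hello",
]
# First pass: whitespace tokenization and stop word removal
toy_docs = [[w for w in text.lower().split() if w not in TOY_STOPWORDS] for text in raw]
# Second pass: remove tokens seen only once and documents left empty
toy_texts = filter_rare(toy_docs)
print(toy_texts)
```

Only `office` and `windows` occur twice, so every other token disappears and the last document is dropped because nothing is left in it.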


**Task 3: Create the id->word dictionary and assign it to the variable `id2word`.**

In [40]:
from gensim.corpora import Dictionary

id2word = Dictionary(texts)
print(id2word)


2021-01-15 20:19:54,484 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-15 20:19:54,601 : INFO : built Dictionary(4240 unique tokens: [',', '.', '?', '@microsoft', "c'mon"]...) from 5887 documents (total 75689 corpus positions)


Dictionary(4240 unique tokens: [',', '.', '?', '@microsoft', "c'mon"]...)
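
To make the id->word mapping concrete, here is a pure-Python sketch of what such a dictionary and a bag-of-words conversion do conceptually. It is not gensim's implementation: the function names are invented, and the order in which ids are assigned is an arbitrary choice made here.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["office", "mac", "office"], ["mac", "windows"]]
token2id = build_vocabulary(toy_docs)
id2token = {i: t for t, i in token2id.items()}  # the id->word direction

print(token2id)
print(to_bow(["office", "office", "linux", "windows"], token2id))
```

The second call shows why the dictionary has to be built before any new document is converted: `linux` was never seen, so it has no id and contributes nothing to the vector.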


**Task 4: Create the vector representation of the corpus (a matrix of TF-IDF vectors) and assign the result to the variable `mm`.**

In [41]:
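from gensim.models import TfidfModel

# Bag-of-words vectors: one list of (token_id, count) pairs per tweet
corpus = [id2word.doc2bow(tweet) for tweet in texts]
# Fit the IDF weights on the corpus, then wrap the corpus in the transformation
model = TfidfModel(corpus)
mm = model[corpus]
print(mm)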

2021-01-15 20:19:54,678 : INFO : collecting document frequencies
2021-01-15 20:19:54,679 : INFO : PROGRESS: processing document #0
2021-01-15 20:19:54,699 : INFO : calculating IDF weights for 5887 documents and 4240 features (69429 matrix non-zeros)


<gensim.interfaces.TransformedCorpus object at 0x7f7b406a9280>
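
What a TF-IDF transformation does to bag-of-words vectors can be sketched in a few lines. The weighting used below (raw count times log2 of N/df, followed by scaling each vector to unit length) is one common variant. It is believed to be close to gensim's defaults, but treat that as an assumption and check the `TfidfModel` documentation. All names are illustrative.

```python
import math


def fit_idf(bow_corpus):
    """Compute idf = log2(N / df) for every token id in the corpus."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _count in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return {tid: math.log2(num_docs / df) for tid, df in doc_freq.items()}


def tfidf(bow, idf):
    """Weight one bag-of-words vector and scale it to unit Euclidean length."""
    # Tokens with zero (or unknown) idf carry no information and are dropped
    weighted = [(tid, count * idf[tid]) for tid, count in bow if idf.get(tid, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _tid, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []


toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(1, 3)]]
idf = fit_idf(toy_corpus)
for bow in toy_corpus:
    print(tfidf(bow, idf))
```

Token 1 appears in every toy document, so its IDF is zero and it drops out of all vectors, which is the point of the IDF factor: words that occur everywhere do not help to tell documents apart.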


**Task 5: Run the code below and discover 10 topics with the LDA algorithm. If you have time, increase the values of the `num_topics` and `passes` parameters when creating the LDA model and check how this affects the result.**

In [42]:
# Enable logging to follow the algorithm's progress (depending on the setup, the messages
# may not show up in Jupyter Notebook, but it is worth mentioning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10, update_every=0, passes=20)
# alternatively: lda = gensim.models.LdaMulticore(corpus=mm, id2word=id2word, num_topics=10, passes=20)
lda.print_topics(10)

2021-01-15 20:19:54,732 : INFO : using symmetric alpha at 0.1
2021-01-15 20:19:54,733 : INFO : using symmetric eta at 0.1
2021-01-15 20:19:54,735 : INFO : using serial LDA version on this node
2021-01-15 20:19:54,741 : INFO : running batch LDA training, 10 topics, 20 passes over the supplied corpus of 5887 documents, updating model once every 5887 documents, evaluating perplexity every 5887 documents, iterating 50x with a convergence threshold of 0.001000
2021-01-15 20:19:54,912 : INFO : PROGRESS: pass 0, at document #2000/5887
2021-01-15 20:19:55,564 : INFO : PROGRESS: pass 0, at document #4000/5887
2021-01-15 20:19:56,858 : INFO : -12.483 per-word bound, 5724.7 perplexity estimate based on a held-out corpus of 1887 documents with 5837 words
2021-01-15 20:19:56,859 : INFO : PROGRESS: pass 0, at document #5887/5887
2021-01-15 20:19:57,351 : INFO : topic #7 (0.100): 0.009*"." + 0.007*"," + 0.006*"'s" + 0.005*"?" + 0.005*"may" + 0.005*"!" + 0.004*"tomorrow" + 0.004*"wa" + 0.004*"go" + 0.

2021-01-15 20:20:08,331 : INFO : topic #1 (0.100): 0.008*"," + 0.008*"." + 0.006*"'s" + 0.006*"appl" + 0.005*")" + 0.005*"(" + 0.005*"tomorrow" + 0.005*"!" + 0.005*"..." + 0.004*"-"
2021-01-15 20:20:08,332 : INFO : topic diff=0.087328, rho=0.334385
2021-01-15 20:20:08,476 : INFO : PROGRESS: pass 6, at document #2000/5887
2021-01-15 20:20:08,986 : INFO : PROGRESS: pass 6, at document #4000/5887
2021-01-15 20:20:10,031 : INFO : -9.389 per-word bound, 670.4 perplexity estimate based on a held-out corpus of 1887 documents with 5837 words
2021-01-15 20:20:10,031 : INFO : PROGRESS: pass 6, at document #5887/5887
2021-01-15 20:20:10,398 : INFO : topic #7 (0.100): 0.009*"." + 0.008*"eric" + 0.008*"church" + 0.008*"," + 0.007*"?" + 0.006*"'s" + 0.005*"!" + 0.005*"tomorrow" + 0.005*"go" + 0.005*"see"
2021-01-15 20:20:10,399 : INFO : topic #0 (0.100): 0.010*"!" + 0.009*"," + 0.008*"." + 0.008*"magic" + 0.008*"mike" + 0.008*"xxl" + 0.006*"'s" + 0.006*"..." + 0.005*"may" + 0.005*"?"
2021-01-15 20:2

2021-01-15 20:20:21,200 : INFO : topic #2 (0.100): 0.010*"." + 0.008*"tomorrow" + 0.008*"," + 0.008*"go" + 0.007*"'s" + 0.006*"!" + 0.006*"'m" + 0.006*"ihop" + 0.005*":" + 0.005*"may"
2021-01-15 20:20:21,201 : INFO : topic diff=0.044632, rho=0.258687
2021-01-15 20:20:21,343 : INFO : PROGRESS: pass 12, at document #2000/5887
2021-01-15 20:20:21,864 : INFO : PROGRESS: pass 12, at document #4000/5887
2021-01-15 20:20:22,943 : INFO : -9.321 per-word bound, 639.7 perplexity estimate based on a held-out corpus of 1887 documents with 5837 words
2021-01-15 20:20:22,944 : INFO : PROGRESS: pass 12, at document #5887/5887
2021-01-15 20:20:23,320 : INFO : topic #7 (0.100): 0.010*"church" + 0.010*"eric" + 0.009*"." + 0.008*"," + 0.007*"?" + 0.006*"'s" + 0.005*"tomorrow" + 0.005*"go" + 0.005*"!" + 0.005*"feder"
2021-01-15 20:20:23,321 : INFO : topic #3 (0.100): 0.008*"." + 0.008*"," + 0.007*"biden" + 0.006*"joe" + 0.006*"day" + 0.006*"amazon" + 0.006*"black" + 0.005*":" + 0.005*"prime" + 0.005*"may"

2021-01-15 20:20:33,976 : INFO : topic #3 (0.100): 0.009*"biden" + 0.008*"." + 0.008*"," + 0.008*"joe" + 0.006*"amazon" + 0.006*"day" + 0.006*"black" + 0.006*":" + 0.006*"prime" + 0.005*"may"
2021-01-15 20:20:33,977 : INFO : topic diff=0.027989, rho=0.218512
2021-01-15 20:20:34,133 : INFO : PROGRESS: pass 18, at document #2000/5887
2021-01-15 20:20:34,665 : INFO : PROGRESS: pass 18, at document #4000/5887
2021-01-15 20:20:35,764 : INFO : -9.285 per-word bound, 623.8 perplexity estimate based on a held-out corpus of 1887 documents with 5837 words
2021-01-15 20:20:35,765 : INFO : PROGRESS: pass 18, at document #5887/5887
2021-01-15 20:20:36,133 : INFO : topic #6 (0.100): 0.013*"!" + 0.009*"." + 0.008*"," + 0.008*"jurass" + 0.008*"tomorrow" + 0.006*"world" + 0.006*"watch" + 0.005*"?" + 0.005*"'s" + 0.005*"may"
2021-01-15 20:20:36,134 : INFO : topic #3 (0.100): 0.009*"biden" + 0.008*"." + 0.008*"," + 0.008*"joe" + 0.007*"amazon" + 0.006*"day" + 0.006*"black" + 0.006*"prime" + 0.006*":" + 0

[(0,
  '0.010*"magic" + 0.010*"!" + 0.010*"mike" + 0.010*"xxl" + 0.009*"," + 0.009*"." + 0.006*"\'s" + 0.006*"may" + 0.005*"..." + 0.005*"?"'),
 (1,
  '0.009*"," + 0.008*"." + 0.007*"appl" + 0.006*"\'s" + 0.006*")" + 0.006*"cobain" + 0.006*"kurt" + 0.006*"(" + 0.005*"!" + 0.005*"mariah"'),
 (2,
  '0.010*"." + 0.009*"tomorrow" + 0.009*"go" + 0.009*"," + 0.007*"\'s" + 0.007*"ihop" + 0.007*"!" + 0.006*"\'m" + 0.005*"?" + 0.005*"day"'),
 (3,
  '0.009*"biden" + 0.008*"." + 0.008*"joe" + 0.008*"," + 0.007*"amazon" + 0.006*"day" + 0.006*"black" + 0.006*":" + 0.006*"prime" + 0.006*"may"'),
 (4,
  '0.354*"avail" + 0.004*"," + 0.003*"arsen" + 0.003*"conor" + 0.003*"mcgregor" + 0.003*"." + 0.002*"!" + 0.002*"kurt" + 0.002*"perfect" + 0.002*"may"'),
 (5,
  '0.009*"." + 0.008*"!" + 0.008*"?" + 0.008*"mac" + 0.008*"fleetwood" + 0.007*"," + 0.006*"\'s" + 0.006*"day" + 0.005*"david" + 0.005*"may"'),
 (6,
  '0.013*"!" + 0.009*"." + 0.008*"," + 0.008*"jurass" + 0.008*"tomorrow" + 0.006*"world" + 0.006*"

**Task 6*: Using the model you built, determine the topic proportions in the following tweet:
`Zlatan is looking mighty attractive at the moment,if LVG doesn't get a striker by Tuesday, I really don't fancy us scoring goals this season`. If there is little time left, go straight to the visualization.**

In [43]:
new_tweet = "Zlatan is looking mighty attractive at the moment,if LVG doesn't get a striker by Tuesday, I really don't fancy us scoring goals this season"
new_tweet_text = BeforeTokenizationNormalizer.normalize(new_tweet)
new_tweet_text = TweetTokenizer.tokenize(new_tweet_text)
new_tweet_text = [word for word in new_tweet_text if word not in stopwords]
new_tweet_mm = id2word.doc2bow(new_tweet_text)
new_tweet_lda = lda[new_tweet_mm]
print(new_tweet_lda)

[(1, 0.11487074), (4, 0.11274952), (6, 0.73550785)]
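
The result is a list of `(topic_id, proportion)` pairs. Here is a small sketch for reading it: pick the dominant topic and check how much probability mass the listed topics cover. The numbers are copied from the output above and the helper names are made up. That the listed proportions do not sum to 1 presumably means that topics with very small proportions were left out of the list.

```python
def dominant_topic(topic_proportions):
    """Return the (topic_id, proportion) pair with the largest proportion."""
    return max(topic_proportions, key=lambda pair: pair[1])


def covered_mass(topic_proportions):
    """Total probability mass of the topics that were reported."""
    return sum(p for _topic_id, p in topic_proportions)


# Output of the cell above
new_tweet_topics = [(1, 0.11487074), (4, 0.11274952), (6, 0.73550785)]
print(dominant_topic(new_tweet_topics))
print(round(covered_mass(new_tweet_topics), 4))
```

Topic 6 dominates here. In the training log above, that topic is described by stems such as `jurass`, `world` and `watch`, so the assignment should be read with caution: the model can only express a new tweet in terms of the topics it has already learned.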


## Visualization

We will now try to visualize the topics we obtained. Follow the comments and it should work.

P.S. You can also run this visualization for the topics discovered from Wikipedia...

**Task 7: Visualize the obtained topics.**

In [44]:
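# The code below uses the pyLDAvis module from GitHub:
# https://github.com/bmabey/pyLDAvis

import pyLDAvis
try:
    # Newer pyLDAvis releases expose the gensim helpers as pyLDAvis.gensim_models
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    # Older releases, as used when this notebook was written
    import pyLDAvis.gensim as gensimvis
pyLDAvis.enable_notebook()

gensimvis.prepare(lda, mm, id2word)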