# Topic modeling

Please do this exercise in pairs, but on two computers. On one computer, immediately uncomment and run the code in the Wikipedia section. On the other computer, solve the tasks in the Tweets and Visualization sections.

After completing this exercise you should:
+ be able to perform basic topic modeling,
+ be able to create a dictionary mapping identifiers to words,
+ be able to create a matrix of TF-IDF vectors,
+ know how to use LDA to determine the topic proportions in a new text document,
+ be able to visualize the results of the LDA algorithm.

For this exercise we will use the [`gensim`](https://radimrehurek.com/gensim/) library, which offers a range of methods for text analysis. It is worth checking what else, besides the LDA algorithm, is implemented in this module!

## Wikipedia

This code fragment is an example of running topic modeling on a larger dataset. Because fitting the model takes from a few to over a dozen minutes, each pair should run this example on only one computer.

To run this example, the folder containing lda_wiki.py must also contain the data files downloaded with the example; the code below reads wiki_wordids.txt.bz2 and wiki_tfidf.mm. The example is based on a subset of Wikipedia pages available at https://dumps.wikimedia.org/enwiki/latest/, converted to a vector representation with the script:
`python -m gensim.scripts.make_wiki`.

**Read the comments before running the code.**

In [1]:
import logging
import gensim

# Enable logging to follow the algorithm's progress (this may not work in a Jupyter Notebook,
# but it is worth mentioning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the id->word mapping/dictionary from a file
id2word = gensim.corpora.Dictionary.load_from_text('wiki_wordids.txt.bz2')

# Load the vector representation of the corpus (a matrix of TF-IDF vectors) from a file
mm = gensim.corpora.MmCorpus('wiki_tfidf.mm')
print(mm)

# Build an LDA model with 20 topics, making 20 passes over the whole corpus
lda = gensim.models.LdaMulticore(corpus=mm, id2word=id2word, num_topics=20, passes=20, workers=4)
# Alternatively, in case of problems with multiprocessing:
# lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=20, update_every=0, passes=20)
lda.print_topics(20)

2020-12-21 10:51:48,178 : INFO : initializing cython corpus reader from wiki_tfidf.mm
2020-12-21 10:51:48,186 : INFO : accepted corpus with 4672 documents, 19960 features, 1762996 non-zero entries
2020-12-21 10:51:48,187 : INFO : using symmetric alpha at 0.05
2020-12-21 10:51:48,188 : INFO : using symmetric eta at 0.05
2020-12-21 10:51:48,191 : INFO : using serial LDA version on this node
2020-12-21 10:51:48,267 : INFO : running online LDA training, 20 topics, 20 passes over the supplied corpus of 4672 documents, updating every 8000 documents, evaluating every ~4672 documents, iterating 50x with a convergence threshold of 0.001000
2020-12-21 10:51:48,278 : INFO : training LDA model using 4 processes


MmCorpus(4672 documents, 19960 features, 1762996 non-zero entries)


2020-12-21 10:51:49,845 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2020-12-21 10:51:51,643 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2020-12-21 10:51:52,797 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #4672/4672, outstanding queue size 3
2020-12-21 10:51:55,981 : INFO : topic #9 (0.050): 0.001*"aircraft" + 0.001*"album" + 0.001*"aa" + 0.001*"columbus" + 0.001*"cauchy" + 0.001*"birds" + 0.001*"danish" + 0.001*"goddess" + 0.001*"afghanistan" + 0.001*"disambiguation"
2020-12-21 10:51:55,983 : INFO : topic #5 (0.050): 0.001*"cell" + 0.001*"acid" + 0.001*"cells" + 0.001*"conspiracy" + 0.001*"salvador" + 0.001*"athens" + 0.001*"compression" + 0.001*"signal" + 0.001*"album" + 0.001*"telegraph"
2020-12-21 10:51:55,985 : INFO : topic #16 (0.050): 0.001*"jpg" + 0.001*"plants" + 0.001*"acid" + 0.001*"carbon" + 0.001*"constellation" + 0.001*"subclass" + 0.001*

2020-12-21 10:52:29,899 : INFO : topic diff=0.937332, rho=0.369207
2020-12-21 10:52:31,657 : INFO : -14.551 per-word bound, 24008.0 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2020-12-21 10:52:32,720 : INFO : PROGRESS: pass 5, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2020-12-21 10:52:34,712 : INFO : PROGRESS: pass 5, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2020-12-21 10:52:35,811 : INFO : PROGRESS: pass 5, dispatched chunk #2 = documents up to #4672/4672, outstanding queue size 2
2020-12-21 10:52:37,635 : INFO : topic #8 (0.050): 0.002*"node" + 0.001*"ada" + 0.001*"barbados" + 0.001*"acceleration" + 0.001*"khmer" + 0.001*"hook" + 0.001*"boxing" + 0.001*"nodes" + 0.001*"arkansas" + 0.001*"alaska"
2020-12-21 10:52:37,636 : INFO : topic #6 (0.050): 0.002*"comoros" + 0.002*"algorithm" + 0.002*"server" + 0.002*"colorado" + 0.002*"turing" + 0.001*"complexity" + 0.001*"binary" + 0.001*"mod

2020-12-21 10:53:08,967 : INFO : topic #8 (0.050): 0.002*"ada" + 0.002*"acceleration" + 0.002*"barbados" + 0.002*"khmer" + 0.002*"node" + 0.001*"hook" + 0.001*"arkansas" + 0.001*"boxing" + 0.001*"oscillator" + 0.001*"autism"
2020-12-21 10:53:08,971 : INFO : topic diff=0.302442, rho=0.284717
2020-12-21 10:53:10,979 : INFO : -14.131 per-word bound, 17941.7 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2020-12-21 10:53:11,994 : INFO : PROGRESS: pass 10, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2020-12-21 10:53:14,037 : INFO : PROGRESS: pass 10, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2020-12-21 10:53:15,106 : INFO : PROGRESS: pass 10, dispatched chunk #2 = documents up to #4672/4672, outstanding queue size 2
2020-12-21 10:53:16,745 : INFO : topic #1 (0.050): 0.006*"croatia" + 0.004*"bosnia" + 0.004*"bulgaria" + 0.004*"belarus" + 0.003*"croatian" + 0.003*"herzegovina" + 0.002*"serbia" + 0

2020-12-21 10:53:47,499 : INFO : topic #7 (0.050): 0.003*"strips" + 0.002*"ccc" + 0.002*"elliptic" + 0.002*"calvin" + 0.001*"hobbes" + 0.001*"alkali" + 0.001*"gauss" + 0.001*"nsa" + 0.001*"engined" + 0.001*"cobol"
2020-12-21 10:53:47,501 : INFO : topic #4 (0.050): 0.016*"actor" + 0.016*"singer" + 0.015*"politician" + 0.013*"actress" + 0.012*"footballer" + 0.011*"songwriter" + 0.008*"producer" + 0.007*"screenwriter" + 0.006*"journalist" + 0.006*"coach"
2020-12-21 10:53:47,505 : INFO : topic diff=0.101673, rho=0.240174
2020-12-21 10:53:49,343 : INFO : -14.061 per-word bound, 17086.9 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2020-12-21 10:53:50,394 : INFO : PROGRESS: pass 15, dispatched chunk #0 = documents up to #2000/4672, outstanding queue size 1
2020-12-21 10:53:52,395 : INFO : PROGRESS: pass 15, dispatched chunk #1 = documents up to #4000/4672, outstanding queue size 2
2020-12-21 10:53:53,396 : INFO : PROGRESS: pass 15, dispatched chunk #2 = docu

2020-12-21 10:54:24,195 : INFO : topic #9 (0.050): 0.003*"breed" + 0.003*"ambrose" + 0.002*"elijah" + 0.002*"aa" + 0.002*"crowley" + 0.002*"spears" + 0.001*"hound" + 0.001*"nero" + 0.001*"banjo" + 0.001*"barrel"
2020-12-21 10:54:24,197 : INFO : topic #13 (0.050): 0.001*"brabant" + 0.001*"pentagon" + 0.001*"voltaire" + 0.001*"folio" + 0.001*"chaplin" + 0.001*"cumberland" + 0.001*"asimov" + 0.001*"clutch" + 0.001*"booker" + 0.001*"citadel"
2020-12-21 10:54:24,199 : INFO : topic #15 (0.050): 0.004*"dune" + 0.004*"azerbaijan" + 0.003*"bt" + 0.003*"baku" + 0.002*"circumcision" + 0.001*"antony" + 0.001*"hop" + 0.001*"hip" + 0.001*"abduction" + 0.001*"superman"
2020-12-21 10:54:24,203 : INFO : topic diff=0.038916, rho=0.211591
2020-12-21 10:54:26,671 : INFO : -14.039 per-word bound, 16836.8 perplexity estimate based on a held-out corpus of 672 documents with 6132 words
2020-12-21 10:54:26,797 : INFO : topic #0 (0.050): 0.002*"comedy" + 0.002*"alfonso" + 0.002*"comic" + 0.002*"heat" + 0.002*"f

[(0,
  '0.002*"comedy" + 0.002*"alfonso" + 0.002*"comic" + 0.002*"heat" + 0.002*"films" + 0.002*"albums" + 0.002*"batman" + 0.002*"theatre" + 0.001*"broadway" + 0.001*"awards"'),
 (1,
  '0.006*"croatia" + 0.005*"bosnia" + 0.004*"belarus" + 0.004*"herzegovina" + 0.003*"croatian" + 0.003*"bulgaria" + 0.002*"serbia" + 0.002*"zagreb" + 0.002*"kraków" + 0.002*"yugoslavia"'),
 (2,
  '0.007*"theorem" + 0.006*"algebra" + 0.005*"algebraic" + 0.004*"vowel" + 0.004*"equation" + 0.004*"alphabet" + 0.004*"integers" + 0.004*"vector" + 0.004*"polynomial" + 0.004*"equations"'),
 (3,
  '0.003*"christ" + 0.003*"jesus" + 0.002*"bible" + 0.002*"pope" + 0.002*"churches" + 0.002*"jewish" + 0.002*"apollo" + 0.002*"jews" + 0.002*"jerusalem" + 0.002*"bishop"'),
 (4,
  '0.017*"actor" + 0.016*"singer" + 0.016*"politician" + 0.014*"actress" + 0.013*"footballer" + 0.012*"songwriter" + 0.008*"producer" + 0.007*"screenwriter" + 0.006*"journalist" + 0.006*"coach"'),
 (5,
  '0.002*"chromosomes" + 0.002*"delaware" + 0.
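
The training log above reports a per-word bound together with a perplexity estimate after each evaluated pass. In gensim the two are related by perplexity = 2^(-bound), so a falling perplexity across passes suggests the model fits the evaluated chunk better. A minimal sketch that recomputes the logged values (the numbers are copied from the log; the helper name `perplexity_from_bound` is only illustrative, and small differences come from the rounding of the printed bound):

```python
# (per-word bound, perplexity) pairs copied from the training log above
logged = [
    (-14.551, 24008.0),
    (-14.131, 17941.7),
    (-14.061, 17086.9),
    (-14.039, 16836.8),
]


def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (in bits) into a perplexity."""
    return 2.0 ** (-per_word_bound)


for bound, reported in logged:
    estimate = perplexity_from_bound(bound)
    print(f"bound={bound:.3f}  recomputed={estimate:.1f}  logged={reported:.1f}")
```

The same relation can be used to compare runs with different `num_topics` or `passes` settings, as long as the evaluation data are the same.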

## Tweets

In pairs, build an LDA model based on the attached tweets. To do this, convert the text files to a vector representation, first saving the id->word mapping as a dictionary. A description of how to create these structures can be found at: https://radimrehurek.com/gensim/tut1.html.

**Task 1: Load the tweets from the file tweets.tsv into the variable `tweets`.**

In [15]:
import gensim
import logging
import nltk
import re
import pandas as pd

stopwords = ["a", "about", "after", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been",
            "before", "being", "between", "both", "by", "could", "did", "do", "does", "doing", "during", "each",
            "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here",
            "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've",
            "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "of",
            "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "own", "shan't", "she", "she'd",
            "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs",
            "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're",
            "they've", "this", "those", "through", "to", "until", "up", "very", "was", "wasn't", "we", "we'd",
            "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's",
            "which", "while", "who", "who's", "whom", "with", "would", "you", "you'd", "you'll", "you're", "you've",
            "your", "yours", "yourself", "yourselves", "above", "again", "against", "aren't", "below", "but", "can't",
            "cannot", "couldn't", "didn't", "doesn't", "don't", "down", "few", "hadn't", "hasn't", "haven't", "if",
            "isn't", "mustn't", "no", "nor", "not", "off", "out", "over", "shouldn't", "same", "too", "under", "why",
            "why's", "won't", "wouldn't"]

# Raw strings keep the regex backslashes intact and avoid invalid-escape warnings
RE_SPACES = re.compile(r"\s+")
RE_HASHTAG = re.compile(r"[@#][_a-z0-9]+")
RE_EMOTICONS = re.compile(r"(:-?\))|(:p)|(:d+)|(:-?\()|(:/)|(;-?\))|(<3)|(=\))|(\)-?:)|(:'\()|(8\))")
RE_HTTP = re.compile(r"http(s)?://[/\.a-z0-9]+")

class Tokenizer():
    @staticmethod
    def tokenize(text):
        pass


class SimpleTokenizer(Tokenizer):
    @staticmethod
    def tokenize(text):
        return RE_SPACES.split(text.strip())


class NltkTokenizer(Tokenizer):
    @staticmethod
    def tokenize(text):
        return nltk.word_tokenize(text)


class TweetTokenizer(Tokenizer):
    @staticmethod
    def tokenize(text):
        tokens = SimpleTokenizer.tokenize(text)
        i = 0
        while i < len(tokens):
            token = tokens[i]
            match = None
            for regexpr in [RE_HTTP, RE_HASHTAG, RE_EMOTICONS]:
                match = regexpr.search(token)
                if match is not None:
                    break
            if match is not None:
                idx_start, idx_end = match.start(), match.end()
                if idx_start != 0 or idx_end != len(token):
                    if idx_start != 0:
                        tokens[i] = token[:idx_start]
                        tokens.insert(i + 1, token[idx_start:])
                    else:
                        tokens[i] = token[:idx_end]
                        tokens.insert(i + 1, token[idx_end:])
                    i -= 1
            else:
                del tokens[i]
                tokens[i:i] = NltkTokenizer.tokenize(token)
            i += 1

        # Reduce every token to its Porter stem
        porter = nltk.PorterStemmer()
        tokens = [porter.stem(token) for token in tokens]

        # Drop tokens that are pure punctuation
        punctuation = [".", ",", "-", ";", "!", "?", ":", "...", "'", '"', '(', ')']
        return [token for token in tokens if token not in punctuation]


class BeforeTokenizationNormalizer():
    @staticmethod
    def normalize(text):
        text = text.strip().lower()
        text = text.replace('&nbsp;', ' ')
        text = text.replace('&lt;', '<')
        text = text.replace('&gt;', '>')
        text = text.replace('&amp;', '&')
        text = text.replace('&pound;', u'£')
        text = text.replace('&euro;', u'€')
        text = text.replace('&copy;', u'©')
        text = text.replace('&reg;', u'®')
        return text

tweets = pd.read_csv("tweets.tsv", sep="\t", header=None).iloc[:, 2]
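
To see where `TweetTokenizer` splits a whitespace-delimited token, it helps to run the three patterns on a few tokens by hand. The sketch below redefines the patterns as raw strings so that it runs on its own; the helper `first_match` and the sample tokens are made up for illustration. The patterns only cover lowercase letters because the text is lowercased by `BeforeTokenizationNormalizer` first.

```python
import re

RE_HASHTAG = re.compile(r"[@#][_a-z0-9]+")
RE_EMOTICONS = re.compile(r"(:-?\))|(:p)|(:d+)|(:-?\()|(:/)|(;-?\))|(<3)|(=\))|(\)-?:)|(:'\()|(8\))")
RE_HTTP = re.compile(r"http(s)?://[/\.a-z0-9]+")

# Same order as in TweetTokenizer: URLs first, then hashtags/mentions, then emoticons
PATTERNS = [("http", RE_HTTP), ("hashtag", RE_HASHTAG), ("emoticon", RE_EMOTICONS)]


def first_match(token):
    """Return (pattern name, start, end) for the first pattern that matches, else None."""
    for name, pattern in PATTERNS:
        match = pattern.search(token)
        if match is not None:
            return name, match.start(), match.end()
    return None


for token in ["#nlp", "great:)", "https://t.co/abc", "hello"]:
    print(token, "->", first_match(token))
```

Because the URL pattern is tried first, the `:/` inside `https://` is not mistaken for an emoticon. A match that does not cover the whole token (as in `great:)`) is what makes the tokenizer split the token in two.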

**Task 2: Tokenize the tweets, remove the words on the stop list (`stopwords`) and the words that occur only once. Assign the result to the variable `texts`.**

In [None]:
texts = [[word for word in TweetTokenizer.tokenize(BeforeTokenizationNormalizer.normalize(tweet)) if word not in stopwords]
             for tweet in tweets]

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

**Task 3: Create the id->word dictionary and assign it to the variable `id2word`.**

In [5]:
id2word = gensim.corpora.Dictionary(texts)
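
Conceptually, the dictionary assigns an integer id to every distinct token, and a document can then be stored as a list of `(id, count)` pairs. A pure-Python sketch of that idea follows; it is not gensim's implementation, the ids here follow the order of first appearance (which may differ from the ids gensim assigns), and `toy_texts`, `token2id`, `id2token` and `to_bow` are made-up names.

```python
from collections import Counter

toy_texts = [["appl", "iphon", "new"], ["appl", "watch", "appl"]]

# token -> id, assigned in order of first appearance
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

# id -> token, the direction needed to print readable topics
id2token = {idx: token for token, idx in token2id.items()}


def to_bow(text):
    """Return sorted (token id, count) pairs, skipping tokens not in the dictionary."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


print(token2id)
print(to_bow(["appl", "watch", "appl", "unseen"]))
```

Tokens that are missing from the dictionary are simply dropped, which is also why a new tweet can be converted later even if it contains words the model has never seen.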

**Task 4: Create the vector representation of the corpus (a matrix of TF-IDF vectors) and assign the result to the variable `mm`.**

In [6]:
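# Note: doc2bow produces bag-of-words vectors of raw term counts; TF-IDF weights
# would additionally require a model such as gensim.models.TfidfModel.
mm = [id2word.doc2bow(text) for text in texts]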
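
`doc2bow` returns raw term counts. When TF-IDF weights are wanted, each count is rescaled by how rare the term is across the corpus; gensim provides `gensim.models.TfidfModel` for this. The sketch below only shows the arithmetic in plain Python, assuming the weighting tf * log2(N / df) followed by L2 normalization of each document. The function `tfidf` and the data in `toy_bow` are made up for illustration.

```python
import math

# Two toy documents as (token id, count) pairs
toy_bow = [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]


def tfidf(corpus):
    """Weight counts by log2(N / df) and L2-normalize each document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each token id occurs
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    weighted = []
    for doc in corpus:
        vec = [(tid, count * math.log2(n_docs / df[tid])) for tid, count in doc]
        # Terms present in every document get weight 0 and are dropped
        vec = [(tid, weight) for tid, weight in vec if weight > 0.0]
        norm = math.sqrt(sum(weight * weight for _, weight in vec))
        weighted.append([(tid, weight / norm) for tid, weight in vec] if norm else [])
    return weighted


for doc in tfidf(toy_bow):
    print(doc)
```

LDA is formulated over word counts, so training on the raw bag-of-words, as the cell above does, is a common choice.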

**Task 5: Run the code below and discover 10 topics using the LDA algorithm. If you have time, increase the values of the num_topics and passes parameters when creating the LDA model and check how this affects the result.**

In [7]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10, update_every=0, passes=20)
# alternatively: lda = gensim.models.LdaMulticore(corpus=mm, id2word=id2word, num_topics=10, passes=20)
lda.print_topics(10)


[(0,
  '0.023*"may" + 0.019*"\'s" + 0.013*"just" + 0.011*"will" + 0.011*"feder" + 0.011*"1st" + 0.011*"bush" + 0.011*"ha" + 0.010*"jeb" + 0.008*"trump"'),
 (1,
  '0.016*"may" + 0.015*"wa" + 0.013*"just" + 0.012*"bob" + 0.012*"marley" + 0.012*"mcgregor" + 0.012*"conor" + 0.011*"na" + 0.010*"n\'t" + 0.010*"game"'),
 (2,
  '0.024*"n\'t" + 0.020*"biden" + 0.018*"joe" + 0.016*"mac" + 0.015*"fleetwood" + 0.012*"run" + 0.012*"carey" + 0.012*"ca" + 0.012*"mariah" + 0.011*"tomorrow"'),
 (3,
  '0.093*"avail" + 0.046*"amazon" + 0.045*"day" + 0.034*"prime" + 0.023*"friday" + 0.016*"magic" + 0.016*"labor" + 0.016*"mike" + 0.016*"\'s" + 0.015*"xxl"'),
 (4,
  '0.032*"appl" + 0.023*"iphon" + 0.018*"\'s" + 0.017*"new" + 0.016*"will" + 0.015*"watch" + 0.012*"event" + 0.010*"wa" + 0.010*"septemb" + 0.010*"today"'),
 (5,
  '0.023*"jurass" + 0.020*"go" + 0.019*"world" + 0.017*"tomorrow" + 0.014*"park" + 0.013*"time" + 0.013*"may" + 0.012*"thi" + 0.012*"see" + 0.011*"david"'),
 (6,
  '0.024*"may" + 0.023*"\
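
Each topic is printed as a weighted list of stemmed words (forms such as `appl` or `iphon` come from the Porter stemmer). To work with these weights programmatically, the printed string can be split back into pairs. The sketch below parses one topic string copied from the output above; `parse_topic` is a made-up helper, and gensim can also return such pairs directly via `show_topics(formatted=False)`.

```python
import re

# One topic string copied from the output above
topic = ('0.093*"avail" + 0.046*"amazon" + 0.045*"day" + 0.034*"prime" + 0.023*"friday"'
         ' + 0.016*"magic" + 0.016*"labor" + 0.016*"mike" + 0.016*"\'s" + 0.015*"xxl"')


def parse_topic(text):
    """Split 'weight*"word" + ...' into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]*)"', text)]


pairs = parse_topic(topic)
print(pairs[:3])
```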

**Task 6*: Based on the model you built, determine the topic proportions in the following tweet:
`Zlatan is looking mighty attractive at the moment,if LVG doesn't get a striker by Tuesday, I really don't fancy us scoring goals this season`. If little time is left, go straight to the visualization.**

In [11]:
new_tweet = "Zlatan is looking mighty attractive at the moment,if LVG doesn't get a striker by Tuesday, I really don't fancy us scoring goals this season"
new_tweet_text = [word for word in TweetTokenizer.tokenize(BeforeTokenizationNormalizer.normalize(new_tweet)) if word not in stopwords]
new_tweet_mm = id2word.doc2bow(new_tweet_text)
new_tweet_lda = lda[new_tweet_mm]
print(new_tweet_lda)

[(0, 0.50570196137581358), (2, 0.15915990644841352), (9, 0.29395047003629038)]
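
The result is a list of `(topic id, proportion)` pairs. Topics whose share falls below the model's `minimum_probability` threshold (0.01 by default in gensim's `LdaModel`) are omitted, which is why the listed proportions do not sum to exactly 1. A short sketch of how to read the output, using the values copied from above (the variable names are only illustrative):

```python
# Topic proportions copied from the output above
new_tweet_lda = [(0, 0.50570196137581358), (2, 0.15915990644841352), (9, 0.29395047003629038)]

# Most probable topic for the tweet
dominant_topic, dominant_share = max(new_tweet_lda, key=lambda pair: pair[1])

# Probability mass spread over the topics that were not listed
unlisted_mass = 1.0 - sum(share for _, share in new_tweet_lda)

print(f"dominant topic: {dominant_topic} ({dominant_share:.2f})")
print(f"mass in unlisted topics: {unlisted_mass:.3f}")
```

The dominant topic id can then be looked up in the list printed by `lda.print_topics(10)` to see which words characterize it.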


## Visualization

We will now try to visualize the topics we obtained. Follow the comments and it should work.

P.S. You can also run this visualization for the topics discovered from Wikipedia...

**Task 7: Visualize the obtained topics.**

In [2]:
# To run the code below, install the pyLDAvis module from GitHub:
# https://github.com/bmabey/pyLDAvis
# in case of problems, ask the instructor
# Note: in newer pyLDAvis releases the gensim helper module is named pyLDAvis.gensim_models

import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(lda, mm, id2word)

