# Exercise 2.2: Topic modelling

In this exercise we extract 3 topics from a collection of 30 documents taken from Wikipedia using Sketch Engine.

Specifically, these are the topics we will try to extract:
![image.png](attachment:image.png)

For each topic, the following Wikipedia pages were used:
<table><tr>
<td> <img src="attachment:image-2.png" alt="Drawing" /> </td>
<td> <img src="attachment:image-3.png" alt="Drawing" /> </td>
<td> <img src="attachment:image-4.png" alt="Drawing" /> </td>
</tr></table>

As the first image shows, the number of words differs considerably between topics, so we expect the topic about cars to be harder for the model to identify.

In [1]:
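import os
import re
from pprint import pprint

from nltk.corpus import stopwords

# Load the English stop words (the NLTK 'stopwords' data must be downloaded)
stop_words = set(stopwords.words('english'))

def load_dataset(file):
    # Build the path portably instead of hard-coding a Windows separator
    path = os.path.join(os.getcwd(), f'{file}.txt')
    with open(path, 'r', encoding='utf8') as f:
        raw = f.read()
    list_document = []
    # Extract the body of every <doc ...>...</doc> element
    docs = re.findall(r'<doc[^>]*>(.*?)</doc>', raw, re.DOTALL)
    # Join the <p> paragraphs of each document into one string,
    # separated by a space so that words do not run together
    for doc in docs:
        list_document.append(' '.join(re.findall(r'<p>(.*?)</p>', doc, re.DOTALL)))
    return list_document

# List of documents; each document is a single string
corpus = load_dataset('corpus_2_2_big')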
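To see what the two regular expressions in `load_dataset` pick out, here is a minimal sketch on a made-up sample. The `<doc>`/`<p>` markup is an assumption about the Sketch Engine export, inferred from the patterns used in the cell above; the sample text and the names `sample`, `doc_bodies` and `documents` are purely illustrative.

```python
import re

# Hypothetical two-document sample in the assumed <doc>/<p> export format
sample = (
    '<doc id="1" title="A">\n<p>First paragraph.</p>\n<p>Second one.</p>\n</doc>\n'
    '<doc id="2" title="B">\n<p>Only paragraph.</p>\n</doc>\n'
)

# First pattern: the body of each <doc ...>...</doc> element
doc_bodies = re.findall(r'<doc[^>]*>(.*?)</doc>', sample, re.DOTALL)

# Second pattern: the paragraphs of each body, joined here with a space
documents = [
    ' '.join(re.findall(r'<p>(.*?)</p>', body, re.DOTALL))
    for body in doc_bodies
]

print(documents)
```

Each element of `documents` is one string per document, which is the shape the rest of the notebook expects for `corpus`.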

In [2]:
from gensim.models import CoherenceModel
import gensim

# Optional bigram and trigram models (left disabled in this notebook).
# Uncomment them, once corpus_sent has been tokenized, before calling
# make_bigrams or make_trigrams.
#bigram = gensim.models.Phrases(corpus_sent, min_count=5, threshold=100)
#trigram = gensim.models.Phrases(bigram[corpus_sent], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
#bigram_mod = gensim.models.phrases.Phraser(bigram)
#trigram_mod = gensim.models.phrases.Phraser(trigram)

def make_bigrams(texts):
    # Requires bigram_mod from the commented-out block above
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    # Requires bigram_mod and trigram_mod from the commented-out block above
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def make_bigrams2(texts):
    # Learn the bigrams from the texts passed in, not from a global variable
    bigram = gensim.models.Phrases(texts, min_count=5)
    for idx in range(len(texts)):
        for token in bigram[texts[idx]]:
            if '_' in token:
                # Token is a bigram: append it to the document, keeping the unigrams
                texts[idx].append(token)
    return texts
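gensim's `Phrases` scores adjacent word pairs and joins the ones that co-occur often enough into a single token with `_`. The block below is a simplified plain-Python stand-in, not gensim's scoring: it appends `a_b` for every adjacent pair seen at least `min_count` times across the corpus. The function name `append_frequent_bigrams` and the toy data are hypothetical.

```python
from collections import Counter

def append_frequent_bigrams(texts, min_count=2):
    """Append 'a_b' tokens for adjacent pairs occurring at least min_count times."""
    # Count every adjacent pair over the whole corpus
    pair_counts = Counter(
        (doc[i], doc[i + 1]) for doc in texts for i in range(len(doc) - 1)
    )
    result = []
    for doc in texts:
        extra = [
            f'{a}_{b}'
            for a, b in zip(doc, doc[1:])
            if pair_counts[(a, b)] >= min_count
        ]
        # Keep the original unigrams and add the bigram tokens at the end
        result.append(doc + extra)
    return result

toy_texts = [
    ['world', 'cup', 'final'],
    ['world', 'cup', 'winner'],
    ['song', 'contest'],
]
toy_with_bigrams = append_frequent_bigrams(toy_texts)
print(toy_with_bigrams)
```

As in `make_bigrams2`, the unigrams stay in the document and the bigram token is added on top, so a frequent pair contributes both its parts and the joined form to the bag of words.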

## Preprocessing

Preprocessing starts with a regular expression that collapses newlines and other runs of whitespace into single spaces.

Next, we filter out stop words (imported from NLTK) and punctuation.

Finally, we use the spaCy library to drop non-content words and, at the same time, keep only the lemmas.

In [3]:
import spacy
from gensim.utils import simple_preprocess

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    # Keep only the lemmas of content words (nouns, adjectives, verbs, adverbs)
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

def remove_stopwords_and_punct(texts):
    # simple_preprocess lowercases and tokenizes, dropping punctuation; deacc=True strips accents
    return [[word for word in simple_preprocess(str(doc), deacc=True) if word not in stop_words] for doc in texts]

# Load the spaCy model without the parser and NER components, which are not needed here
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

# Collapse newlines and repeated whitespace into single spaces
corpus_sent = [re.sub(r'\s+', ' ', sent) for sent in corpus]

# Remove punctuation and stop words
corpus_sent = remove_stopwords_and_punct(corpus_sent)

# Append the detected bigrams to each document
corpus_sent = make_bigrams2(corpus_sent)
#corpus_sent = make_trigrams(corpus_sent)

# Lemmatization
corpus_sent = lemmatization(corpus_sent)
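The first two preprocessing steps can be tried on a single sentence without any external library. This is only a rough sketch: the regular expression stands in for `simple_preprocess`, the stop list `toy_stop_words` is a tiny hand-written assumption rather than NLTK's English list, and the lemmatization step is left out because it needs the spaCy model.

```python
import re

# Assumption: a tiny stop list for illustration; the notebook uses NLTK's English list
toy_stop_words = {'the', 'a', 'in', 'of', 'and', 'was'}

def toy_preprocess(raw_text):
    # Step 1: collapse newlines and repeated whitespace into single spaces
    flat = re.sub(r'\s+', ' ', raw_text)
    # Step 2: lowercase and keep alphabetic tokens only, which drops punctuation and digits
    tokens = re.findall(r'[a-z]+', flat.lower())
    # Step 3: drop stop words
    return [tok for tok in tokens if tok not in toy_stop_words]

toy_tokens = toy_preprocess('The final\nwas played   in Rome, and the\tclub won.')
print(toy_tokens)
```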

## LDA

At this point we can use the `gensim` library to extract the topics.

We start by building the Dictionary from the preprocessed text, then convert each document into its bag-of-words term frequencies, and finally feed both to the LDA algorithm with the call `LdaModel(corpus, id2word=dictionary, num_topics=3)`.
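The dictionary and the term-frequency list can be pictured with a plain-Python sketch. This is not gensim's implementation; it only mimics the shape of the data: a token-to-id mapping and, for each document, a sorted list of `(token_id, count)` pairs like the one `doc2bow` returns. The names `toy_docs`, `token2id` and `to_bow` are illustrative, and the id order (first appearance) is a choice made for this example.

```python
from collections import Counter

toy_docs = [
    ['club', 'player', 'goal', 'goal'],
    ['music', 'contest', 'music'],
    ['car', 'engine', 'club'],
]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count tokens and return (token_id, count) pairs sorted by id
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_bow = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_bow)
```

Word order inside a document is lost in this representation; LDA only sees how often each token id occurs in each document.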

In [11]:
from pprint import pprint

import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel

# Build the dictionary (token -> integer id) from the preprocessed documents
common_dictionary = corpora.Dictionary(corpus_sent)
#common_dictionary.filter_extremes(no_below=10, no_above=0.5)

# Bag-of-words representation: for each document, a list of (token id, count) pairs
common_corpus = [common_dictionary.doc2bow(text) for text in corpus_sent]

# Train the model on the corpus.
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=3)

pprint(lda.print_topics())

[(0,
  '0.009*"first" + 0.009*"country" + 0.008*"year" + 0.007*"goal" + '
  '0.007*"music" + 0.006*"car" + 0.006*"contest" + 0.006*"team" + 0.006*"time" '
  '+ 0.005*"season"'),
 (1,
  '0.011*"music" + 0.010*"goal" + 0.009*"first" + 0.008*"player" + '
  '0.008*"club" + 0.007*"football" + 0.007*"season" + 0.007*"play" + '
  '0.006*"year" + 0.006*"score"'),
 (2,
  '0.013*"club" + 0.010*"player" + 0.009*"football" + 0.007*"season" + '
  '0.007*"goal" + 0.006*"music" + 0.006*"play" + 0.006*"car" + 0.006*"time" + '
  '0.006*"team"')]
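
The three topics printed above share many of their top words, which suggests they are not cleanly separated. One rough way to quantify this is the Jaccard overlap between the top-10 word lists. The sketch below copies the words from the output above; it is only a quick diagnostic, not a substitute for a proper coherence measure.

```python
from itertools import combinations

# Top-10 words per topic, copied from the output above
top_words = {
    0: {'first', 'country', 'year', 'goal', 'music', 'car', 'contest', 'team', 'time', 'season'},
    1: {'music', 'goal', 'first', 'player', 'club', 'football', 'season', 'play', 'year', 'score'},
    2: {'club', 'player', 'football', 'season', 'goal', 'music', 'play', 'car', 'time', 'team'},
}

overlap = {}
for a, b in combinations(sorted(top_words), 2):
    shared = top_words[a] & top_words[b]
    # Jaccard index: size of the intersection divided by size of the union
    jaccard = len(shared) / len(top_words[a] | top_words[b])
    overlap[(a, b)] = (len(shared), round(jaccard, 2))
    print(f'topics {a} and {b}: {len(shared)} shared words, Jaccard = {jaccard:.2f}')
```

Words such as `goal`, `music` and `season` appear in all three lists, so no single topic can be read as purely football, music or cars in this run.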


In [12]:
#!{sys.executable} -m pip install pyLDAvis
import matplotlib.pyplot as plt
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda, common_corpus, common_dictionary)
vis

In [None]:
from LDAExplore_master.processdata.lda import LDAVisualModel
from LDAExplore_master.processdata import fileops

word_corpus = fileops.read_file('20_newsgroups/alt.atheism/53350')
lda = LDAVisualModel([word_corpus])
lda.create_word_corpus([word_corpus])
lda.train_lda(3)
topics = lda.get_lda_corpus()

print(topics)

In [None]:
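import sys
# Replace the installed pyqtwebengine with version 5.12: uninstall first, then install
!{sys.executable} -m pip uninstall -y pyqtwebengine
!{sys.executable} -m pip install pyqtwebengine==5.12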