# Lab 5

The task is to implement a topic modeling exercise using open-source libraries such as GenSim (https://radimrehurek.com/gensim/). The corpus must contain at least 1,000 documents. Test an algorithm (for example LDA) with several values of k (the number of topics) and evaluate the coherence of the results, tuning both the model parameters and the preprocessing. Update: since topics usually need to consist of content words to be interpretable, you may keep only nouns during preprocessing (that is, POS=noun).

In [2]:
import pandas as pd
import numpy as np
import os
import re
import nltk
from nltk import MWETokenizer, WordNetLemmatizer
from nltk.corpus import wordnet as wn, stopwords
import gensim
from gensim import corpora

## Sentence preprocessing

In [12]:
stop_words = set(stopwords.words('english')) 
mwes = [x for x in wn.all_lemma_names() if '_' in x]
mwes = [tuple(x.split('_')) for x in mwes]
#tokenizer = MWETokenizer(mwes, separator=' ')
lemmatizer = WordNetLemmatizer()

def preprocessing(text):
    text = re.sub(r'[^\w\s]',' ',text) # replace punctuation with spaces
    text = text.lower()
    text = nltk.pos_tag(text.split()) # tokenize on whitespace, then POS-tag each token
    text = [x for x in text if x[1] in ['NN','NNS','NNP','NNPS']] # keep only nouns
    text = [x[0] for x in text] # drop the POS tags, keeping the tokens
    text = [lemmatizer.lemmatize(x) for x in text]
    text = [x for x in text if x not in stop_words]
    return text

text = "The plain green Norway spruce is displayed in the gallery's foyer. Its light bulb adornments are dimmed, ordinary domestic ones joined together with string. The plates decorating the branches will be auctioned off for the children's charity ArtWorks. Wentworth worked as an assistant to sculptor Henry Moore in the late 1960s. His reputation as a sculptor grew in the 1980s, while he has been one of the most influential teachers during the last two decades. Wentworth is also known for his photography of mundane, everyday subjects such as a cigarette packet jammed under the wonky leg of a table."
text = preprocessing(text)
print(text)

['plain', 'spruce', 'gallery', 'bulb', 'adornment', 'one', 'plate', 'branch', 'child', 'charity', 'artwork', 'assistant', 'henry', 'moore', '1960s', 'reputation', 'sculptor', 'teacher', 'decade', 'wentworth', 'photography', 'mundane', 'subject', 'cigarette', 'packet', 'leg', 'table']
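
Topic 4 in the training output further down contains the one-letter token "v", presumably from fixtures written as "X v Y". One possible guard is a minimum-length filter applied after tokenization. The sketch below is only an illustration: the helper name `drop_short_tokens` and the threshold of 3 characters are assumptions, not part of the original notebook.

```python
import re

def drop_short_tokens(tokens, min_len=3):
    """Keep only tokens with at least `min_len` characters (hypothetical helper)."""
    return [t for t in tokens if len(t) >= min_len]

raw = "The gallery's foyer. England v Wales"
# Same punctuation handling as in preprocessing(): non-word characters become spaces.
tokens = re.sub(r'[^\w\s]', ' ', raw).lower().split()
print(tokens)
print(drop_short_tokens(tokens))
```

A threshold this aggressive also removes legitimate two-letter tokens, so it is worth checking the dictionary before and after applying it.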


## Loading the documents

In [13]:
paths = [
    "documents\\bbc\\entertainment", 
    "documents\\bbcsport\\athletics", 
    "documents\\bbcsport\\cricket", 
    "documents\\bbcsport\\football", 
    "documents\\bbcsport\\rugby",
    "documents\\bbcsport\\tennis"
]

documents = []

for path in paths:
    for file_name in os.listdir(path):
        file_path = os.path.join(path, file_name)
        if os.path.isfile(file_path):
            # the context manager closes each file after reading
            with open(file_path, "r", encoding="utf-8") as file:
                document = preprocessing(file.read())
            documents.append(document)

### Splitting into training and test sets

In [14]:
# build test_documents from a random 10% of the documents and remove them from the training set
test_documents = []
training_documents = documents.copy()

for i in range(0, int(len(documents) * 0.1)):
    # randint's upper bound is exclusive, so this can pick any remaining index
    random_index = np.random.randint(0, len(training_documents))
    test_documents.append(training_documents[random_index])
    training_documents.pop(random_index)

print("Number of training documents: " + str(len(training_documents)))
print("Number of test documents: " + str(len(test_documents)))
print("Total number of documents: " + str(len(documents)))

Number of training documents: 779
Number of test documents: 86
Total number of documents: 865
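
The split above draws fresh random indices on every run, so the training and test sets change each time the cell is executed. A minimal sketch of a repeatable alternative follows, using a seeded shuffle; the function name `split_documents`, the seed value, and the placeholder corpus are assumptions made for illustration.

```python
import random

def split_documents(documents, test_fraction=0.1, seed=42):
    """Return (training, test) lists; a seeded shuffle makes the split repeatable."""
    rng = random.Random(seed)
    indices = list(range(len(documents)))
    rng.shuffle(indices)
    n_test = int(len(documents) * test_fraction)
    test = [documents[i] for i in indices[:n_test]]
    training = [documents[i] for i in indices[n_test:]]
    return training, test

# Placeholder corpus with the same number of documents as the notebook's corpus.
docs = [[f"token{i}"] for i in range(865)]
training, test = split_documents(docs)
print(len(training), len(test))
```

Fixing the seed makes later comparisons between models trained with different k easier to interpret, since every model sees the same training documents.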


## LDA

### Training

In [15]:
# Build the dictionary
dictionary = corpora.Dictionary(training_documents)
dictionary.filter_extremes(no_below=5, no_above=0.3, keep_n=None)  # drop tokens found in fewer than 5 documents or in more than 30% of them

# Build the bag-of-words representation of the corpus
corpus = [dictionary.doc2bow(doc) for doc in training_documents]

# Define the LDA model
k = 10  # number of topics to identify
lda_model = gensim.models.LdaModel(corpus, num_topics=k, id2word=dictionary)

# Show the identified topics
for topic_id, topic in lda_model.show_topics(formatted=True, num_topics=k, num_words=10):
    print(f"Topic {topic_id}: {topic}")

Topic 0: 0.013*"minute" + 0.009*"england" + 0.008*"victory" + 0.008*"nation" + 0.008*"goal" + 0.007*"club" + 0.007*"ireland" + 0.006*"way" + 0.006*"wale" + 0.006*"cup"
Topic 1: 0.013*"england" + 0.009*"week" + 0.009*"injury" + 0.008*"series" + 0.007*"cricket" + 0.007*"season" + 0.007*"coach" + 0.007*"number" + 0.006*"test" + 0.006*"tour"
Topic 2: 0.010*"film" + 0.009*"cup" + 0.006*"england" + 0.006*"sport" + 0.005*"jones" + 0.005*"decision" + 0.005*"chance" + 0.005*"football" + 0.005*"ball" + 0.005*"chelsea"
Topic 3: 0.012*"number" + 0.012*"film" + 0.011*"race" + 0.010*"award" + 0.008*"week" + 0.006*"cup" + 0.005*"goal" + 0.005*"place" + 0.005*"half" + 0.005*"home"
Topic 4: 0.014*"club" + 0.012*"cup" + 0.010*"chelsea" + 0.008*"football" + 0.008*"goal" + 0.007*"ball" + 0.006*"v" + 0.006*"manager" + 0.005*"season" + 0.005*"test"
Topic 5: 0.010*"champion" + 0.009*"england" + 0.009*"series" + 0.007*"test" + 0.007*"season" + 0.006*"wale" + 0.006*"club" + 0.006*"jones" + 0.005*"bbc" + 0.005*

### Decoding

In [24]:
# Infer the topic distribution for the test set, merged here into a single bag of words
flattened_test_documents = [token for sublist in test_documents for token in sublist]
new_bow = dictionary.doc2bow(flattened_test_documents)
topic_distribution = lda_model.get_document_topics(new_bow)

print("Topic distribution for new document:")
for topic_id, topic_prob in topic_distribution:
    print(f"Topic {topic_id}: {topic_prob}")


Topic distribution for new document:
Topic 0: 0.3191267251968384
Topic 1: 0.2813403308391571
Topic 2: 0.05434315279126167
Topic 3: 0.05442822724580765
Topic 4: 0.05994976684451103
Topic 6: 0.09403547644615173
Topic 7: 0.010968337766826153
Topic 8: 0.08119918406009674
Topic 9: 0.040543898940086365
