# Programming Methodologies for Data Analysis

## Authors
- Lorenzo Dell'Oro
- Giovanni Toto
- Gian Luca Vriz

## 1. Introduction

<font color='red'>KEY IDEA AND OBJECTIVE OF THE PROJECT</font>

<font color='blue'>LIBRARIES</font>

In [1]:
%load_ext autoreload
%autoreload 2

import json
import pandas as pd
import re
import string

## 2. Guardian dataset

<font color='red'>INTRODUCTION TO DATA</font>

We need an API key in order to download *Guardian* articles using the [official API](https://open-platform.theguardian.com/): in the following block, you should replace `'test'` with your API key; it can be easily obtained [here](https://open-platform.theguardian.com/access/).

In [2]:
api_key = 'test'
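To avoid hard-coding a personal key in the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `GUARDIAN_API_KEY`, which is a hypothetical name chosen for illustration, and falls back to `'test'` when the variable is not set.

```python
import os

# GUARDIAN_API_KEY is a hypothetical variable name, not something defined by this project.
# If it is not set, we fall back to the placeholder value used in the cell above.
api_key = os.environ.get('GUARDIAN_API_KEY', 'test')
```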

### 2.i. Download using API

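The query below selects the articles of the *environment* section that contain the exact phrase "climate change", were published between 2013-01-01 and 2022-12-31 and are written in English; the results are ordered from oldest to newest, returned 200 per page, and include all the available fields. The download code is wrapped in a string so that it is not executed again: the corpus is read from the CSV file saved by a previous download.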

In [3]:
"""
query = {'type': 'article',
         'q': '"climate change"',
         'section': 'environment',
         'from-date': "2013-01-01",
         'to-date': "2022-12-31",
         'lang': 'en',
         'order-by': 'oldest',
         'page-size': 200,
         'show-fields': 'all',
         'api-key': api_key}

from src.file_io import download_guardian
corpus = download_guardian(query, 'data/raw/guardian_environment.csv')
"""
corpus = pd.read_csv('data/raw/guardian_environment.csv')
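The implementation of `download_guardian` is not shown here, so the following is only a sketch of how a paginated download could be organised. It assumes that each response is a JSON object with a `response` entry holding `pages` and `results`, and it takes the function that performs the HTTP request as an argument (`fetch_page`, a hypothetical name) so that the logic can be tried offline. Flattening with `sep='_'` would explain column names such as `fields_headline`.

```python
import pandas as pd

def collect_pages(fetch_page, query):
    """Request every page of a query and flatten the results into a DataFrame.

    `fetch_page(params)` must return the parsed JSON of one page. With the real API
    it could be, for instance (assumption):
    lambda p: requests.get('https://content.guardianapis.com/search', params=p).json()
    """
    results, page, pages = [], 1, 1
    while page <= pages:
        response = fetch_page({**query, 'page': page})['response']
        results.extend(response['results'])
        pages = response['pages']  # total number of pages reported by the server
        page += 1
    # nested dictionaries become columns such as 'fields_headline'
    return pd.json_normalize(results, sep='_')

def fake_fetch(params):
    """Offline stand-in for the API: three documents served two per page."""
    all_results = [{'id': f'doc{i}', 'fields': {'headline': f'Title {i}'}} for i in range(3)]
    start = (params['page'] - 1) * params['page-size']
    return {'response': {'pages': 2, 'results': all_results[start:start + params['page-size']]}}

toy_corpus = collect_pages(fake_fetch, {'page-size': 2})
print(toy_corpus)
```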

### 2.ii. Selection of the variables of interest

`corpus` is a `pandas` DataFrame containing several pieces of information for each document, most of which are not relevant to our analysis; hence, we keep only the following fields:
- document-related:
    - `id`: identifier
    - `year`: year of publication on the website (`webPublicationDate`)
    - `where`: publication in which the article originally appeared (`fields_publication`)
    - `author`: author (`fields_byline`)
- text-related:
    - `headline`: title (`fields_headline`)
    - `standfirst`: summary (`fields_standfirst`)
    - `body`: text with tags (`fields_body`)
    - `bodyText`: text without tags (`fields_bodyText`)
- count-related:
    - `wordcount`: number of words in the body (`fields_wordcount`)
    - `charcount`: number of characters in the body without tags (`fields_charCount`)

In parentheses we report the original name of each field or, in the case of `year`, the name of the field from which it is derived.

In [4]:
col_of_interest = ['id', 'webPublicationDate', 'fields_publication', 'fields_byline',
                   'fields_headline', 'fields_standfirst', 'fields_body', 'fields_bodyText',
                   'fields_wordcount', 'fields_charCount']
# select columns of interest
corpus = corpus.loc[:, col_of_interest]
# convert 'webPublicationDate' field to year
corpus['webPublicationDate'] = pd.to_datetime(corpus['webPublicationDate']).dt.year
# rename columns
new_colnames = {'webPublicationDate': 'year', 'fields_publication': 'where', 'fields_byline': 'author',
                'fields_headline': 'headline', 'fields_standfirst': 'standfirst', 'fields_body': 'body', 'fields_bodyText':'bodyText',
                'fields_wordcount': 'wordcount', 'fields_charCount': 'charcount'}
corpus.rename(columns=new_colnames, inplace=True)
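The same three steps (column selection, conversion of the publication date to a year, renaming) can be checked on a toy table. The values below are invented for illustration only and do not come from the Guardian corpus.

```python
import pandas as pd

# invented rows that mimic the structure of the downloaded table
toy = pd.DataFrame({
    'id': ['a', 'b'],
    'webPublicationDate': ['2013-05-02T09:30:00Z', '2021-11-15T18:00:00Z'],
    'fields_headline': ['First', 'Second'],
    'unused': [0, 1],
})

# keep only the columns of interest (a copy, so that later assignments are safe)
toy = toy.loc[:, ['id', 'webPublicationDate', 'fields_headline']].copy()
# the ISO timestamp is parsed and reduced to the year
toy['webPublicationDate'] = pd.to_datetime(toy['webPublicationDate']).dt.year
# shorter names
toy = toy.rename(columns={'webPublicationDate': 'year', 'fields_headline': 'headline'})
print(toy)
```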

Then, we remove the articles that do not have a body, i.e. those whose `bodyText` is `nan`.

In [5]:
# keep only the rows with a body; .copy() avoids working on a view of the original table
corpus = corpus[corpus['bodyText'].notna()].copy()

Finally, if an article lacks a headline and/or a standfirst, we replace the `nan`s with empty strings.

In [6]:
# fillna with a dictionary replaces missing values column by column, without chained assignment
corpus = corpus.fillna({'headline': '', 'standfirst': ''})
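A toy example may help to see the effect of the two steps: rows without a body are dropped, while missing headlines and standfirsts become empty strings. The rows below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'headline': ['Title', np.nan, 'Other'],
    'standfirst': [np.nan, 'Summary', 'Intro'],
    'bodyText': ['text', 'more text', np.nan],
})

# the third row has no body and is removed
toy = toy[toy['bodyText'].notna()]
# missing headline/standfirst values are replaced by empty strings
toy = toy.fillna({'headline': '', 'standfirst': ''})
print(toy)
```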

### 2.iii. Pre-processing of the texts

<font color='blue'>PRE-PROCESSING OF TEXTS</font>

In this section we take the texts of the articles downloaded in the previous section and process them to make them compatible with *ETM* and *DETM*. In particular, we want to obtain two lists of strings:
- `timestamps` containing the timestamps of the articles, i.e. the years in which they were published;
- `docs` containing the processed texts of the articles, i.e. the headline, standfirst and body.


First, we obtain the `timestamps` variable from the `year` field:

In [7]:
timestamps = [str(y) for y in corpus['year'].tolist()]

Then, we process the texts in order to obtain an input compatible with text analysis approaches:

In [8]:
# regex for urls; source: https://stackoverflow.com/a/50790119
url_regex = r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b"

# concatenate headline, standfirst and body of each article
docs = (corpus['headline'] + " " + corpus['standfirst'] + " " + corpus['bodyText']).tolist()
# lowercase and remove urls
docs = [re.sub(url_regex, '', doc.lower()) for doc in docs]
# remove punctuation and digits from each word
table = str.maketrans('', '', string.punctuation + string.digits)
docs = [[w.translate(table) for w in doc.split()] for doc in docs]
# drop empty strings and one-character words
docs = [[w for w in doc if len(w) > 1] for doc in docs]
docs = [" ".join(doc) for doc in docs]
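The effect of the cleaning steps is easier to see on a single sentence. The sketch below applies the same sequence (lowercasing, URL removal, stripping of punctuation and digits, removal of one-character words), but with a much simpler URL pattern than the one used above; `SIMPLE_URL` and `clean_text` are names introduced here for illustration.

```python
import re
import string

# simplified stand-in for the url regex used above: it only matches http(s) links
SIMPLE_URL = re.compile(r"https?://\S+")
# translation table that deletes ASCII punctuation and digits
TABLE = str.maketrans('', '', string.punctuation + string.digits)

def clean_text(text):
    """Lowercase, remove urls, strip punctuation/digits, drop words shorter than 2 characters."""
    text = SIMPLE_URL.sub('', text.lower())
    words = [w.translate(TABLE) for w in text.split()]
    return " ".join(w for w in words if len(w) > 1)

example = clean_text("In 2022, CO2 levels rose: see https://example.org/report for a summary!")
print(example)  # in co levels rose see for summary
```

Note that tokens mixing letters and digits are shortened rather than removed: `co2` becomes `co`.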

Finally, we use the `preprocessing` function contained in the `src/preprocessing.py` module, which creates files compatible with *ETM* and *DETM*. Before calling the function, we need to import the stopwords:

In [9]:
# Read stopwords
with open("./data/stops.txt", "r") as f:
    stopwords = f.read().split('\n')
# Pre-processing
from src.preprocessing import preprocessing
preprocessing(data_path="data/guardian_environment", docs=docs, timestamps=timestamps, stopwords=stopwords,
              min_df=10, max_df=0.7, data_split=[0.7, 0.2, 0.1], seed=28)

***************
Preparing data:


ValueError: "path_save" not valid: the folder already exists.

The last function also splits the corpus into train, test and validation sets; the exploratory analyses in the next section consider the train set only.
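The code of `preprocessing` is not reported in this notebook, so the following is only a sketch of what the `min_df`, `max_df`, `data_split` and `seed` arguments might do. It assumes that `min_df` is a minimum number of documents, that `max_df` is a maximum proportion of documents, and that `data_split` lists the proportions in the order train, test, validation; `filter_by_df` and `split_indices` are hypothetical helpers, not functions of the project.

```python
import random
from collections import Counter

def filter_by_df(docs, min_df, max_df):
    """Keep the words appearing in at least `min_df` documents and at most a share `max_df` of them."""
    tokenised = [doc.split() for doc in docs]
    # document frequency: each word is counted once per document
    df = Counter(w for doc in tokenised for w in set(doc))
    n_docs = len(tokenised)
    vocab = {w for w, c in df.items() if c >= min_df and c / n_docs <= max_df}
    return [[w for w in doc if w in vocab] for doc in tokenised], sorted(vocab)

def split_indices(n_docs, proportions, seed):
    """Shuffle the document indices and split them into train, test and validation parts."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    n_tr = round(proportions[0] * n_docs)
    n_ts = round(proportions[1] * n_docs)
    return indices[:n_tr], indices[n_tr:n_tr + n_ts], indices[n_tr + n_ts:]

toy_docs = ["climate change policy", "climate policy", "climate energy"]
filtered, toy_vocab = filter_by_df(toy_docs, min_df=2, max_df=0.7)
train_idx, test_idx, valid_idx = split_indices(10, [0.7, 0.2, 0.1], seed=28)
```

In the toy corpus, `climate` is discarded because it appears in every document and `change` and `energy` because they appear in only one, so only `policy` survives.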

## 3. Exploratory analysis

<font color='blue'>
EXPLORATORY ANALYSIS OF THE <b>TRAIN</b> CORPUS:

- table with corpus information (number of documents, number of timestamps, documents per timestamp, ...)
- distribution of document length (number of words)
- most frequent words (word cloud)
</font>

Before moving on, we remove all variables that are no longer used:

In [10]:
del api_key, col_of_interest, docs, new_colnames, timestamps, url_regex

As mentioned above, here we consider the train set only: the test and validation sets are no longer needed, so we replace the whole corpus, stored in the `corpus` variable, with the train set. We also store the vocabulary of the train set in the `vocab` variable.

In [11]:
with open('data/guardian_environment/info.json', 'r') as f:
    info = json.load(f)
    corpus = corpus.iloc[sorted(info['indices_tr']), :]
    docs_bow = info['docs_tr']
    vocab = info['vocab_tr']
    del info

**EXPLORATORY ANALYSIS ON `corpus` variable** 

[Code wordcloud](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#from-strings-to-vectors)
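As a starting point for the planned analyses, the quantities listed above (number of documents, number of timestamps, documents per timestamp, document length, most frequent words) could be computed as in the sketch below. It assumes that each document is available as a list of tokens, which may or may not be the format of `docs_bow`; `corpus_summary` is a hypothetical helper and the toy data are invented.

```python
from collections import Counter
import pandas as pd

def corpus_summary(years, docs_tokens, top_n=10):
    """Return basic descriptive statistics of a tokenised corpus."""
    lengths = pd.Series([len(doc) for doc in docs_tokens])
    per_year = pd.Series(years).value_counts().sort_index()
    word_counts = Counter(w for doc in docs_tokens for w in doc)
    return {
        'n_docs': len(docs_tokens),
        'n_timestamps': int(per_year.size),
        'docs_per_year': {int(y): int(c) for y, c in per_year.items()},
        'mean_length': float(lengths.mean()),
        'top_words': word_counts.most_common(top_n),
    }

toy_years = [2013, 2013, 2014]
toy_tokens = [['climate', 'policy'], ['climate'], ['climate', 'energy', 'policy']]
summary = corpus_summary(toy_years, toy_tokens, top_n=1)
print(summary)
```

The word counts returned in `top_words` are also the kind of input a word cloud would need.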

## 4. Word embeddings

<font color='red'>BRIEF DESCRIPTION OF THE DIFFERENT APPROACHES</font><br>
<font color='blue'>EMBEDDING FITTING</font><br>

Static word embeddings:
- *GloVe* <br> (Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.)
- *Word2vec* (*CBOW* & *Skip Gram*) <br> (Mikolov, Tomas & Chen, Kai & Corrado, G.s & Dean, Jeffrey. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR. 2013.) <br> (Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13). Curran Associates Inc., Red Hook, NY, USA, 3111–3119.)
- *Sent2Vec* <br> (Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In NAACL-HLT.)
- *fastText* <br>

Static word embeddings obtained from dynamic ones:
- Prakhar Gupta and Martin Jaggi. 2021. Obtaining Better Static Word Embeddings Using Contextual Embedding Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5241–5253, Online. Association for Computational Linguistics. ([GitHub](https://github.com/epfml/X2Static))

We want to use the same embedding space for both *ETM* and *DETM*, so we first fit the word embeddings and then provide them as input to the topic models. We import here the required functions:

In [None]:
from gensim.models import FastText, KeyedVectors, Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec
from src.file_io import save_embeddings

embedding_path, embedding_file = 'embeddings/', 'guardian_environment'
embedding_path = embedding_path + embedding_file
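The file format written by `save_embeddings` is not shown in this notebook. A plausible choice, assumed here, is a plain text file with one line per word: the word followed by the components of its vector. The sketch below writes such a file for the words of a vocabulary that have a vector; `write_embeddings` is a hypothetical helper and the vectors are invented.

```python
import io

def write_embeddings(vectors, vocab, handle):
    """Write one 'word v1 v2 ...' line for each vocabulary word that has a vector."""
    written = 0
    for word in vocab:
        if word in vectors:  # words unknown to the embedding model are skipped
            values = " ".join(f"{v:.6f}" for v in vectors[word])
            handle.write(f"{word} {values}\n")
            written += 1
    return written

toy_vectors = {'climate': [0.1, -0.2], 'energy': [0.3, 0.4]}
buffer = io.StringIO()  # in-memory file, so nothing is written to disk
n_written = write_embeddings(toy_vectors, ['climate', 'policy', 'energy'], buffer)
print(buffer.getvalue())
```

Skipping missing words matters when the embedding model has its own frequency threshold, since some words of the vocabulary may then have no vector.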

### 4.i. Word2vec on train corpus

We fit CBOW and skip-gram on the train corpus:

In [None]:
# parameter names follow gensim >= 4.0 (`vector_size`, `epochs`); gensim 3.x uses `size` and `iter`
# sg=0: CBOW
cbow = Word2Vec(sentences=docs_bow, min_count=100, sg=0, vector_size=100, epochs=5, workers=5, negative=10, window=4)
save_embeddings(emb_model=cbow, emb_file=embedding_path+'_cbow.txt', vocab=vocab)
# sg=1: skip-gram
skipgram = Word2Vec(sentences=docs_bow, min_count=100, sg=1, vector_size=100, epochs=5, workers=5, negative=10, window=4)
save_embeddings(emb_model=skipgram, emb_file=embedding_path+'_skipgram.txt', vocab=vocab)

### 4.ii. FastText on train corpus

In [None]:
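# parameter names follow gensim >= 4.0; `docs_bow` is assumed to be a list of tokenised documents,
# so it is passed as `corpus_iterable` (`corpus_file` expects the path of a text file)
fasttext = FastText(min_count=100, vector_size=300, epochs=5, negative=10, window=4)
fasttext.build_vocab(corpus_iterable=docs_bow)
fasttext.train(corpus_iterable=docs_bow, total_examples=fasttext.corpus_count, epochs=fasttext.epochs)
save_embeddings(emb_model=fasttext, emb_file=embedding_path+'_fasttext.txt', vocab=vocab)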

### 4.iii. Google’s Word2vec

First, we download the pre-trained word embeddings from [here](https://code.google.com/archive/p/word2vec/).

In [None]:
google_word2vec = KeyedVectors.load_word2vec_format('embeddings/GoogleNews-vectors-negative300.bin', binary=True)
save_embeddings(emb_model=google_word2vec, emb_file=embedding_path+'_google_word2vec.txt', vocab=vocab)

### 4.iv. Stanford’s GloVe

First, we download the pre-trained word embeddings from [here](https://nlp.stanford.edu/projects/glove/).

In [None]:
# convert the GloVe file to the word2vec format
# (despite the `.bin` extension, the output is a text file, hence binary=False below)
glove_input_file = 'embeddings/glove.6B.300d.txt'
word2vec_output_file = 'embeddings/glove.6B.300d.bin'
glove2word2vec(glove_input_file, word2vec_output_file)
# load embeddings
glove = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
save_embeddings(emb_model=glove, emb_file=embedding_path+'_glove.txt', vocab=vocab)

### 4.v. Obtaining Better Static Word Embeddings Using Contextual Embedding Models

## 5. Topic models

<font color='red'>INTRODUCTION TO MODEL ESTIMATION</font>

In [None]:
topics_list = [10, 20, 30, 40, 50]
embedding_list = ['cbow', 'skipgram', 'fasttext', 'google_word2vec', 'glove']
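The two lists define a grid of 25 configurations, one for each pair of number of topics and embedding. A minimal sketch of how the grid could be enumerated is given below; the embedding file names follow the pattern used in the previous section, while the `model_file` naming scheme is a hypothetical choice made for illustration.

```python
from itertools import product

topics_list = [10, 20, 30, 40, 50]
embedding_list = ['cbow', 'skipgram', 'fasttext', 'google_word2vec', 'glove']
embedding_path = 'embeddings/guardian_environment'

# one dictionary per (number of topics, embedding) pair
runs = [
    {'num_topics': k,
     'emb_file': f'{embedding_path}_{emb}.txt',
     'model_file': f'ETM_K{k}_{emb}'}  # hypothetical naming scheme
    for k, emb in product(topics_list, embedding_list)
]
print(len(runs), runs[0])
```

Each dictionary could then be unpacked into the keyword arguments of the training function.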

### 5.i. Embedded Topic Model (ETM)

<font color='red'>BRIEF DESCRIPTION OF THE MODEL</font><br>
<font color='blue'>MODEL ESTIMATION</font><br>

The following block trains *ETM*:

In [None]:
from src.main_ETM import main_ETM
main_ETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='results',
         emb_file='data/un-general-debates_embeddings.txt', model_file='ETM_K50_un-general-debates', batch_size=1000,
         mode='train', num_topics=50, train_embeddings=0, epochs=100, visualize_every=1000, tc=False, td=False)

The evaluation of *ETM* consists of:
- computing *topic coherence* on the top 10 words of each topic;
- computing *topic diversity* on the top 25 words of each topic;
- computing the ranking of the most used topics in the train corpus;
- computing the top `num_words` words per topic.
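Topic diversity, as defined in Dieng et al. (2020), is the proportion of unique words among the top 25 words of all topics: values close to 1 indicate varied topics, values close to 0 redundant ones. The sketch below computes it from a topic-word matrix with one row per topic; it is a minimal illustration with invented numbers, not the code used in `src/main_ETM.py`.

```python
import numpy as np

def topic_diversity(beta, top_k=25):
    """Share of unique words among the top `top_k` words of every topic."""
    # indices of the `top_k` most probable words of each topic (one row per topic)
    top = np.argsort(-beta, axis=1)[:, :top_k]
    return len(np.unique(top)) / (beta.shape[0] * top_k)

# two topics over four words: their top-2 words do not overlap
beta_distinct = np.array([[0.5, 0.3, 0.2, 0.0],
                          [0.1, 0.2, 0.3, 0.4]])
# two identical topics: the top-2 words coincide
beta_redundant = np.array([[0.5, 0.3, 0.2, 0.0],
                           [0.5, 0.3, 0.2, 0.0]])

print(topic_diversity(beta_distinct, top_k=2))   # 1.0
print(topic_diversity(beta_redundant, top_k=2))  # 0.5
```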

### 5.ii. Dynamic Embedded Topic Model (DETM)

<font color='red'>BRIEF DESCRIPTION OF THE MODEL</font><br>
<font color='blue'>MODEL ESTIMATION</font><br>

**Memory problems:** DETM requires too much memory!

In [None]:
from src.main_DETM import main_DETM
main_DETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='data',
          emb_path='data/un-general-debates_embeddings.txt', mode='train', batch_size=1000,
          num_topics=50, train_embeddings=0, epochs=50, visualize_every=1000, tc=True)

## 6. Model comparison

<font color='red'>INTRODUCTION ON HOW WE WANT TO COMPARE MODELS</font>

**The idea is introduced very well in "Topic modeling in embedding spaces": let's copy from there!**

The following block evaluates the model trained in the previous section; in particular, it saves a file called `<model_name>_parameters.pt`, which contains:
- `tc`: the topic coherence;
- `td`: the topic diversity;
- `rho`: the word embeddings (row=embedding);
- `model.alphas.weight`: the topic embeddings (row=embedding);
- `beta`: the topic-word distributions (row=distribution);
- `theta`: the document-topic distributions (row=distribution).

In [None]:
from src.main_ETM import main_ETM
main_ETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='results',
         emb_file='data/un-general-debates_embeddings.txt', model_file='ETM_K50_un-general-debates', mode='eval',
         load_from='results/ETM_K50_un-general-debates',
         num_topics=50, train_embeddings=0, epochs=100, visualize_every=1000, num_words=10, tc=True, td=True)

Now we import the file:

In [None]:
from torch import load as torch_load
loaded = torch_load("results/ETM_K50_un-general-debates_parameters.pt")
print(loaded.keys())

In [None]:
print("TC  :", loaded['tc'])
print("TD  :", loaded['td'])
print("VxE :", loaded['rho'].shape)
print("TxE :", loaded['alpha'].shape)
print("TxV :", loaded['beta'].shape)
print("DxT :", loaded['theta'].shape)
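The shapes printed above are linked by the definition of *ETM* (Dieng et al., 2020): the distribution of a topic over the vocabulary is the softmax of the inner products between the word embeddings and the topic embedding. The sketch below reproduces this relation with NumPy on random matrices; `topic_word_distributions` is a name introduced here and the dimensions are arbitrary.

```python
import numpy as np

def topic_word_distributions(rho, alpha):
    """Row-wise softmax of alpha @ rho.T: one distribution over the vocabulary per topic."""
    logits = alpha @ rho.T                       # shape (T, V)
    logits -= logits.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
rho = rng.normal(size=(6, 4))    # V x E: 6 words, embedding size 4
alpha = rng.normal(size=(3, 4))  # T x E: 3 topics
beta = topic_word_distributions(rho, alpha)
print(beta.shape)        # (3, 6), i.e. T x V
print(beta.sum(axis=1))  # each row sums to 1
```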

In [None]:
import numpy as np
x = np.array([[1, 2, 3], [2, 3, 1]])
display(x)
#print(np.argsort(x, axis=1))
print(np.argsort(-1 * x, axis=1))
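Sorting the negated matrix along the rows, as in the cell above, gives the indices of the entries of each row in decreasing order. Applied to the topic-word distributions, this yields the most probable words of each topic. The following sketch uses an invented matrix and vocabulary; `top_words` is a hypothetical helper.

```python
import numpy as np

def top_words(beta, vocab, num_words=10):
    """Return the `num_words` most probable words of each topic."""
    # argsort on -beta sorts each row in decreasing order of probability
    order = np.argsort(-beta, axis=1)[:, :num_words]
    return [[vocab[i] for i in row] for row in order]

toy_beta = np.array([[0.1, 0.6, 0.3],
                     [0.5, 0.2, 0.3]])
toy_vocab = ['climate', 'energy', 'policy']
toy_top = top_words(toy_beta, toy_vocab, num_words=2)
print(toy_top)  # [['energy', 'policy'], ['climate', 'policy']]
```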

### 6.i. Quantitative analysis

<font color='blue'>COMPUTATION OF VARIOUS METRICS AND CONSTRUCTION OF GRAPHS</font>

### 6.ii. Qualitative analysis

<font color='blue'>INTERPRETATION OF TOPICS AND DOCUMENT REPRESENTATION</font>

## 7. Conclusion

<font color='red'>FINAL REMARKS</font>

## References

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2019). The dynamic embedded topic model. arXiv preprint arXiv:1907.05545. [arXiv link](https://arxiv.org/abs/1907.05545)

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439-453. [ACL Anthology](https://aclanthology.org/2020.tacl-1.29/), [arXiv link](https://arxiv.org/abs/1907.04907)