# Programming Methodologies for Data Analysis

## Authors
- Lorenzo Dell'Oro
- Giovanni Toto
- Gian Luca Vriz

## 1. Introduction

<font color='red'>KEY IDEA AND OBJECTIVE OF THE PROJECT</font>

<font color='blue'>LIBRARIES</font>

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import csv                                   # 2
import string                                # 2
from src.preprocessing import preprocessing  # 2

<font color='red'>INTRODUCTION TO DATA</font>

## 2. Pre-processing of the corpus

<font color='blue'>PRE-PROCESSING OF TEXTS</font>

First, we import the dataset from a file (txt, csv, ...). This example uses the [UN General Debates corpus](https://www.kaggle.com/datasets/unitednations/un-general-debates):

In [None]:
N_DOCS = 3000
min_df = 10

# Data type
flag_split_by_paragraph = False  # whether to split documents by paragraph
    
# Read raw data (https://www.kaggle.com/datasets/unitednations/un-general-debates)
print('reading raw data...')
with open('./data/raw/un-general-debates.csv', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
    line_count = 0
    all_timestamps_ini = []
    all_docs_ini = []
    for row in csv_reader:
        # skip the header row
        if line_count > 0:
            all_timestamps_ini.append(row[1])
            # drop non-ASCII characters from the text
            all_docs_ini.append(row[3].encode("ascii", "ignore").decode())
        line_count += 1
        # stop once N_DOCS documents (plus the header row) have been read
        if line_count == N_DOCS + 1:
            break

if flag_split_by_paragraph:
    print('splitting by paragraphs...')
    docs = []
    timestamps = []
    for dd, doc in enumerate(all_docs_ini):
        splitted_doc = doc.split('.\n')
        for ii in splitted_doc:
            docs.append(ii)
            timestamps.append(all_timestamps_ini[dd])
else:
    docs = all_docs_ini
    timestamps = all_timestamps_ini

del all_docs_ini
del all_timestamps_ini

print('  number of documents: {}'.format(len(docs)))

Introduction to pre-processing with gensim: [link](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#from-strings-to-vectors)

The actual pre-processing starts here: in short, we lowercase each document, strip punctuation and digits, and drop tokens of a single character.

In [None]:
# Remove punctuation
print('removing punctuation...')
# translation table that deletes punctuation and digits (built once, reused for every word)
table = str.maketrans('', '', string.punctuation + "0123456789")
docs = [[w.lower().replace("’", " ").replace("'", " ").translate(table) for w in doc.split()] for doc in docs]
# drop empty and single-character tokens
docs = [[w for w in doc if len(w) > 1] for doc in docs]
docs = [" ".join(doc) for doc in docs]
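
To make the effect of this cleaning step concrete, here is a minimal sketch that applies the same transformations to a single made-up sentence. The helper name `clean_text` and the sample string are illustrative assumptions, not part of the project code.

```python
import string

def clean_text(text):
    """Apply the cleaning steps of the cell above to one string (hypothetical helper)."""
    # translation table that deletes punctuation and digits
    table = str.maketrans('', '', string.punctuation + "0123456789")
    # lowercase, turn apostrophes into spaces, then delete punctuation and digits
    words = [w.lower().replace("’", " ").replace("'", " ").translate(table)
             for w in text.split()]
    # drop empty and single-character tokens
    words = [w for w in words if len(w) > 1]
    return " ".join(words)

sample = "In 1945, the U.N. wasn't founded by 2 states!"  # made-up example sentence
cleaned = clean_text(sample)
print(cleaned)  # in the un wasn t founded by states
```

Note that apostrophes become spaces, so a contraction such as "wasn't" ends up as the two fragments "wasn" and "t" once the string is split again.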

Finally, we use the `preprocessing` function in the `src/preprocessing.py` module, which creates files compatible with *ETM* and *DETM*. We will also use these files for the exploratory analysis. Before calling the function, we need to load the stopwords:

In [None]:
# Read stopwords
with open("./data/stops.txt", "r") as f:
    stopwords = f.read().split('\n')
# Pre-processing
preprocessing(data_path="data/un-general-debates", docs=docs, timestamps=timestamps, stopwords=stopwords,
              min_df=min_df, max_df=0.7, data_split=[0.85, 0.1, 0.05], seed=28)

**remark1:** `docs` and `timestamps` are two lists of strings containing the same number of elements; in particular, `docs` contains the documents of the corpus, `timestamps` their timestamps.

**remark2:** The function also splits the corpus into train, test and validation sets; the exploratory analyses below use the train set only.
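
The internals of `preprocessing` are not shown in this notebook. As a rough sketch of what arguments such as `min_df`, `max_df`, `data_split` and `seed` conventionally mean, the hypothetical function below filters the vocabulary by document frequency and then splits the document indices at random. The function name, the interpretation of `max_df` as a fraction of documents, and the train/test/validation order of `data_split` are all assumptions.

```python
import random
from collections import Counter

def filter_and_split(docs, stopwords, min_df, max_df, data_split, seed):
    """Hypothetical sketch: document-frequency filtering followed by a random split."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each word at least once
    df = Counter(w for tokens in tokenized for w in set(tokens))
    # keep words that are neither too rare, nor too common, nor stopwords
    vocab = sorted(w for w, c in df.items()
                   if c >= min_df and c / n_docs <= max_df and w not in stopwords)
    # shuffle document indices reproducibly, then cut them into three parts
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    n_tr = int(data_split[0] * n_docs)
    n_ts = int(data_split[1] * n_docs)
    return vocab, idx[:n_tr], idx[n_tr:n_tr + n_ts], idx[n_tr + n_ts:]

toy_docs = ["peace security council", "peace development", "peace security", "the peace"]
vocab, train_idx, test_idx, valid_idx = filter_and_split(
    toy_docs, stopwords={"the"}, min_df=2, max_df=0.8,
    data_split=[0.5, 0.25, 0.25], seed=28)
print(vocab)  # ['security']
```

In the toy corpus, "peace" appears in every document and is removed by `max_df`, while "council", "development" and "the" appear in a single document and are removed by `min_df`.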

## 3. Exploratory analysis of the processed corpus

<font color='blue'>EXPLORATORY ANALYSIS OF THE **TRAIN** CORPUS</font>

Import vocabulary of the train set:

In [None]:
from src.file_io import load_vocab
word2id, id2word = load_vocab("data/un-general-debates/vocab.txt")

Here we consider the train corpus only, i.e. the following files generated by the `preprocessing` function:
- `bow_tr_tokens`: indices of the distinct words observed in each document of the train set;
- `bow_tr_counts`: number of occurrences of each of those words in each document of the train set;
- `bow_tr_timestamps`: timestamps of the documents of the train set;
- `timestamps.txt`: distinct observed timestamps;
- `vocab.txt`: vocabulary of the train set.

**remark:** the first three files also exist for the test and validation sets, but they are not relevant to the exploratory analysis.

In [None]:
import os
from scipy.io import loadmat
path = os.path.join('data', 'un-general-debates', 'min_df_'+str(min_df))
bow_tr_tokens = loadmat(os.path.join(path, 'bow_tr_tokens'))['tokens'].squeeze()
bow_tr_counts = loadmat(os.path.join(path, 'bow_tr_counts'))['counts'].squeeze()
bow_tr_timestamps = loadmat(os.path.join(path, 'bow_tr_timestamps'))['timestamps'].squeeze()

In [None]:
print(len(bow_tr_timestamps))
print(len(bow_tr_counts))
print(len(bow_tr_tokens[0]))
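
The token and count arrays are parallel: for a given document, the i-th count is the number of occurrences of the i-th token index. A minimal sketch of how one document could be decoded back into word counts is given below; `decode_document` and the toy arrays are illustrative assumptions, chosen to mimic the `(1, n)` shape that entries loaded from `.mat` files often have.

```python
import numpy as np

def decode_document(tokens, counts, id2word):
    """Map one document's token indices and counts to a {word: count} dict (hypothetical helper)."""
    # flatten in case the arrays have shape (1, n)
    tokens = np.asarray(tokens).ravel()
    counts = np.asarray(counts).ravel()
    return {id2word[int(t)]: int(c) for t, c in zip(tokens, counts)}

toy_id2word = {0: "council", 1: "peace", 2: "security"}  # toy vocabulary
toy_tokens = np.array([[1, 2]])   # word indices observed in the document
toy_counts = np.array([[3, 1]])   # occurrences of each of those words
decoded = decode_document(toy_tokens, toy_counts, toy_id2word)
print(decoded)  # {'peace': 3, 'security': 1}
```

With the real data, a call such as `decode_document(bow_tr_tokens[0], bow_tr_counts[0], id2word)` should work in the same way, provided `id2word` can be indexed by integer word id.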

In [None]:
"""print(bow_tr_timestamps)
for i in range(8):
    print(i, "\t", bow_tr_tokens[i].shape, "\t", bow_tr_counts[i].shape)"""

## 4. Embeddings and topic models

<font color='red'>INTRODUCTION TO MODEL ESTIMATION</font>

In [None]:
topics = 5

### 4.i. Fitting embeddings

<font color='red'>BRIEF DESCRIPTION OF THE DIFFERENT APPROACHES</font><br>
<font color='blue'>EMBEDDING FITTING</font><br>

We want to use the same embedding space for both *ETM* and *DETM*, so we first fit the word embeddings and then provide them as input to both models. In particular, we fit a simple *skipgram* model; this implementation is an adaptation of this [code](https://github.com/adjidieng/ETM/blob/master/skipgram.py):

In [None]:
docs_bow = [doc.split() for doc in docs]  # list of lists of strings (tokenized documents)

# fit embeddings
# note: `size` and `iter` are the gensim 3.x argument names;
# gensim >= 4.0 renames them to `vector_size` and `epochs`
from gensim.models import Word2Vec
skipgram = Word2Vec(sentences=docs_bow, min_count=100, sg=1, size=100, iter=5, workers=5, negative=10, window=4)

from src.file_io import save_embeddings
save_embeddings(emb_model=skipgram, emb_file='data/un-general-debates_embeddings.txt', vocab=list(word2id.keys()))
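
The layout of the file written by `save_embeddings` is not shown in this notebook. The sketch below assumes a plain-text layout with one word per line followed by its vector components separated by spaces, which is the layout produced by the skipgram script linked above. The helpers `write_embeddings` and `read_embeddings` and the toy vectors are hypothetical.

```python
import os
import tempfile
import numpy as np

def write_embeddings(path, vectors):
    """Write {word: vector} as lines of the form 'word x1 x2 ...' (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(str(x) for x in vec) + "\n")

def read_embeddings(path):
    """Read the same layout back into a {word: numpy array} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:  # skip empty lines
                vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

toy_vectors = {"peace": [0.5, -1.0], "security": [0.25, 2.0]}  # toy 2-dimensional embeddings
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_embeddings.txt")
    write_embeddings(toy_path, toy_vectors)
    loaded = read_embeddings(toy_path)
print(loaded["peace"])
```

Storing the embeddings in a plain-text file keeps the fitting step independent of the topic models, so the same file can be passed to both *ETM* and *DETM* through `emb_path`.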

### 4.ii. Embedded Topic Model (ETM)

<font color='red'>BRIEF DESCRIPTION OF THE MODEL</font><br>
<font color='blue'>MODEL ESTIMATION</font><br>

In the fitted model:
- `rho` contains the word embeddings (one embedding per row);
- `model.alphas.weight` contains the topic embeddings (one embedding per row);
- `beta` contains the topic-word distributions (one distribution per row);
- `theta` contains the document-topic distributions (one distribution per row).
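
As a sketch of how these quantities relate, recall that in the ETM of Dieng et al. (2020) each topic-word distribution is the softmax, over the vocabulary, of the inner products between the word embeddings and the topic embedding. The NumPy example below uses small random matrices; the dimensions and the toy `theta` are assumptions for illustration only, since in the real model `theta` is produced by the inference network.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
V, L, K, D = 6, 4, 3, 2  # toy sizes: vocabulary, embedding dim, topics, documents

rho = rng.normal(size=(V, L))     # word embeddings, one row per word
alphas = rng.normal(size=(K, L))  # topic embeddings, one row per topic

# topic-word distributions: softmax over the vocabulary of <rho_v, alpha_k>
beta = softmax(alphas @ rho.T, axis=1)            # shape (K, V), rows sum to 1

# toy document-topic proportions (stand-in for the inference network output)
theta = softmax(rng.normal(size=(D, K)), axis=1)  # shape (D, K), rows sum to 1

# per-document distribution over words implied by the model
word_probs = theta @ beta                         # shape (D, V), rows sum to 1
print(beta.shape, theta.shape, word_probs.shape)  # (3, 6) (2, 3) (2, 6)
```

Because both `theta` and `beta` have rows that sum to one, each row of `theta @ beta` is itself a probability distribution over the vocabulary.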

**TO DO:** modify `main_ETM` function to make it return these quantities (when `mode='eval'`).

The following block trains *ETM*:

In [None]:
from src.main_ETM import main_ETM
main_ETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='data',
         emb_path='data/un-general-debates_embeddings.txt', mode='train',
         num_topics=50, train_embeddings=0, epochs=100, visualize_every=1000, tc=False, td=False)

The following block evaluates *ETM*, i.e., it will:
- compute *topic coherence* on the top 10 words of each topic;
- compute *topic diversity* on the top 25 words of each topic;
- compute the ranking of the most used topics in the train corpus;
- compute the top `num_words` words per topic.

In [1]:
from src.main_ETM import main_ETM
main_ETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='data',
         emb_path='data/un-general-debates_embeddings.txt', mode='eval',
         load_from='data/etm_un-general-debates_K_50_Htheta_800_Optim_adam_Clip_0.0_ThetaAct_relu_Lr_0.005_Bsz_1000_RhoSize_300_trainEmbeddings_0',
         num_topics=50, train_embeddings=0, epochs=100, visualize_every=1000, num_words=10, tc=True, td=True)



=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Training an Embedded Topic Model on UN-GENERAL-DEBATES
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
ckpt: data/etm_un-general-debates_K_50_Htheta_800_Optim_adam_Clip_0.0_ThetaAct_relu_Lr_0.005_Bsz_1000_RhoSize_300_trainEmbeddings_0
13265  The Vocabulary size is here 
model: ETM(
  (t_drop): Dropout(p=0.0, inplace=False)
  (theta_act): ReLU()
  (alphas): Linear(in_features=300, out_features=50, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=13265, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_f
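
For reference, two of the evaluation quantities listed above can be sketched directly from a topic-word matrix. Topic diversity, as defined in Dieng et al. (2020), is the fraction of unique words among the top 25 words of all topics. The functions `top_words` and `topic_diversity` and the toy `beta` below are illustrative assumptions and are not taken from `main_ETM`.

```python
import numpy as np

def top_words(beta, id2word, num_words):
    """Return the num_words most probable words of each topic (rows of beta)."""
    top_ids = np.argsort(-beta, axis=1)[:, :num_words]
    return [[id2word[int(i)] for i in row] for row in top_ids]

def topic_diversity(beta, topk=25):
    """Fraction of unique words among the top-k words of all topics."""
    top_ids = np.argsort(-beta, axis=1)[:, :topk]
    return len(np.unique(top_ids)) / top_ids.size

toy_id2word = ["peace", "security", "council", "development"]  # toy vocabulary
toy_beta = np.array([[0.5, 0.3, 0.1, 0.1],   # topic 0
                     [0.1, 0.5, 0.3, 0.1]])  # topic 1

toy_top = top_words(toy_beta, toy_id2word, num_words=2)
toy_td = topic_diversity(toy_beta, topk=2)
print(toy_top)  # [['peace', 'security'], ['security', 'council']]
print(toy_td)   # 0.75
```

A diversity close to 1 indicates topics with little overlap in their top words, while a value close to 0 indicates redundant topics.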

### 4.iii. Dynamic Embedded Topic Model (DETM)

<font color='red'>BRIEF DESCRIPTION OF THE MODEL</font><br>
<font color='blue'>MODEL ESTIMATION</font><br>

In [None]:
from src.main_DETM import main_DETM
main_DETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='data',
          emb_path='data/un-general-debates_embeddings.txt',
          num_topics=10, train_embeddings=0, epochs=50, visualize_every=1000, tc=True)

## 5. Model comparison

<font color='red'>INTRODUCTION ON HOW WE WANT TO COMPARE MODELS</font>

**The idea is introduced very well in "Topic modeling in embedding spaces": let's copy from there!**

### 5.i. Quantitative analysis

<font color='blue'>COMPUTATION OF VARIOUS METRICS AND CONSTRUCTION OF GRAPHS</font>

### 5.ii. Qualitative analysis

<font color='blue'>INTERPRETATION OF TOPICS AND DOCUMENT REPRESENTATION</font>

## 6. Conclusion

<font color='red'>FINAL REMARKS</font>

## References

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2019). The dynamic embedded topic model. arXiv preprint arXiv:1907.05545. [arXiv link](https://arxiv.org/abs/1907.05545)

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439-453. [ACL Anthology](https://aclanthology.org/2020.tacl-1.29/), [arXiv link](https://arxiv.org/abs/1907.04907)