# Programming Methodologies for Data Analysis

## Authors
- Lorenzo Dell'Oro
- Giovanni Toto
- Gian Luca Vriz

## 1. Introduction

<font color='red'>KEY IDEA AND OBJECTIVE OF THE PROJECT</font>

<font color='blue'>LIBRARIES</font>

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import csv                                   # 2
import string                                # 2
from src.preprocessing import preprocessing  # 2
from src.file_io import load_vocab           # 3
import os                                    # 3
from scipy.io import loadmat                 # 3
from gensim.models import Word2Vec           # 4
from src.file_io import save_embeddings      # 4
from gensim.models import LdaModel           # 4.i
from collections import Counter              # 4.ii
from gensim.models import ldaseqmodel        # 4.ii

<font color='red'>INTRODUCTION TO DATA</font>

## 2. Pre-processing of the corpus

<font color='blue'>PRE-PROCESSING OF TEXTS</font>

First, we import the dataset from a file (txt, csv, ...); this example uses the [UN General Debates corpus](https://www.kaggle.com/datasets/unitednations/un-general-debates):

In [None]:
N_DOCS = 9999999
min_df = 100

# Data type
flag_split_by_paragraph = False  # whether to split documents by paragraph
    
# Read raw data (https://www.kaggle.com/datasets/unitednations/un-general-debates)
print('reading raw data...')
with open('./data/raw/un-general-debates.csv', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
    line_count = 0
    all_timestamps_ini = []
    all_docs_ini = []
    for row in csv_reader:
        # skip header
        if line_count > 0:
            all_timestamps_ini.append(row[1])
            all_docs_ini.append(row[3].encode("ascii", "ignore").decode())
        line_count += 1
        # stop once N_DOCS documents have been read (the header is not a document)
        if len(all_docs_ini) >= N_DOCS:
            break

if flag_split_by_paragraph:
    print('splitting by paragraphs...')
    docs = []
    timestamps = []
    for dd, doc in enumerate(all_docs_ini):
        splitted_doc = doc.split('.\n')
        for ii in splitted_doc:
            docs.append(ii)
            timestamps.append(all_timestamps_ini[dd])
else:
    docs = all_docs_ini
    timestamps = all_timestamps_ini

del all_docs_ini
del all_timestamps_ini

print('  number of documents: {}'.format(len(docs)))

For an introduction to pre-processing with gensim, see this [link](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#from-strings-to-vectors).

The actual pre-processing starts here: we strip the byte-order mark, lowercase the text, remove apostrophes, punctuation and digits, and drop single-character tokens.

In [None]:
# Remove punctuation
print('removing punctuation...')
# translation table that deletes ASCII punctuation and digits
table = str.maketrans('', '', string.punctuation + "0123456789")
docs = [[w.replace('\ufeff', '').lower().replace("’", " ").replace("'", " ").translate(table) for w in doc.split()] for doc in docs]
# drop empty and single-character tokens
docs = [[w for w in doc if len(w) > 1] for doc in docs]
docs = [" ".join(doc) for doc in docs]
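
To see what this cleaning step does to a single string, here is a minimal sketch on a made-up sentence. The helper `clean_doc` and the example text are hypothetical and not part of `src`; the helper only mirrors the logic of the cell above.

```python
import string

# Hypothetical helper mirroring the cleaning cell above; not part of src/.
_TABLE = str.maketrans('', '', string.punctuation + "0123456789")

def clean_doc(doc):
    """Lowercase, strip BOM/apostrophes/punctuation/digits, drop tokens shorter than 2 characters."""
    words = [
        w.replace('\ufeff', '').lower().replace("’", " ").replace("'", " ").translate(_TABLE)
        for w in doc.split()
    ]
    return " ".join(w for w in words if len(w) > 1)

# Made-up sentence, used only for illustration
example = "\ufeffThe 34th session, in 1979, didn't end; a U.N. vote followed!"
cleaned = clean_doc(example)
print(cleaned)
```

Note that an apostrophe is replaced by a space, so a contraction such as "didn't" leaves a stray single letter behind once the string is split again.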

Finally, we use the `preprocessing` function from the `src/preprocessing.py` module, which creates files compatible with *ETM* and *DETM*. We also use these files for the exploratory analysis and for estimating *LDA* and *DTM*. Before calling the function, we need to load the stopwords:

In [None]:
# Read stopwords
with open("./data/stops.txt", "r") as f:
    stopwords = f.read().split('\n')
# Pre-processing
preprocessing(data_path="data/un-general-debates", docs=docs, timestamps=timestamps, stopwords=stopwords,
              min_df=min_df, max_df=0.7, data_split=[0.85, 0.1, 0.05], seed=28)

The function also splits the corpus into training, test and validation sets; below we use only the training set for the exploratory analysis.
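
The `min_df` and `max_df` arguments passed to `preprocessing` control which words enter the vocabulary. The sketch below illustrates one common convention, assuming that `min_df` is an absolute number of documents and `max_df` a fraction of the corpus; the actual behaviour of `src/preprocessing.py` may differ. The function `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep words found in at least min_df documents and in at most max_df * len(docs) documents.

    Assumption: min_df is an absolute count and max_df a fraction of the corpus.
    """
    df = Counter()
    for doc in docs:
        # count each word once per document
        df.update(set(doc.split()))
    max_count = max_df * len(docs)
    return sorted(w for w, c in df.items() if min_df <= c <= max_count)

# Made-up corpus, used only for illustration
toy_docs = [
    "peace and security",
    "peace and development",
    "security council and peace",
    "climate and development",
]
toy_vocab = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.8)
print(toy_vocab)
```

Here "and" is dropped because it occurs in every document, while "council" and "climate" are dropped because they occur in only one.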

## 3. Exploratory analysis of the processed corpus

<font color='blue'>EXPLORATORY ANALYSIS OF THE **TRAIN** CORPUS</font>

Import the vocabulary of the training set:

In [None]:
word2id, id2word = load_vocab("data/un-general-debates/min_df_" + str(min_df) +"/vocab.txt")

Here we consider the training corpus only, i.e. we are interested in the following files generated by the `preprocessing` function:
- `bow_tr_tokens`: indices of the distinct words observed in each document of the training set;
- `bow_tr_counts`: occurrences of the distinct words observed in each document of the training set;
- `bow_tr_timestamps`: timestamps of the documents of the training set;
- `timestamps.txt`: the distinct observed timestamps;
- `vocab.txt`: vocabulary of the training set.

**Remark:** the first three files also exist for the test and validation sets, but they are not relevant to the exploratory analysis.
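
The relationship between the token and count files can be illustrated with a small sketch. It assumes that, for each document, the token and count arrays are aligned position by position; the values below are made up and stand in for one entry of `bow_tr_tokens`, one entry of `bow_tr_counts` and the `id2word` mapping. The helper `doc_to_word_counts` is hypothetical.

```python
# Made-up stand-ins for id2word, bow_tr_tokens[i] and bow_tr_counts[i]
toy_id2word = {0: "peace", 1: "security", 2: "development"}
toy_tokens = [0, 2]   # indices of the distinct words in the document
toy_counts = [3, 1]   # how often each of those words occurs

def doc_to_word_counts(tokens, counts, id2word):
    """Map aligned (token index, count) pairs back to a word -> count dictionary."""
    return {id2word[int(t)]: int(c) for t, c in zip(tokens, counts)}

word_counts = doc_to_word_counts(toy_tokens, toy_counts, toy_id2word)
print(word_counts)
```

With the real data, the arrays returned by `loadmat` may carry an extra dimension and would need to be flattened first.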

In [None]:
path = os.path.join('data', 'un-general-debates', 'min_df_'+str(min_df))
bow_tr_tokens = loadmat(os.path.join(path, 'bow_tr_tokens'))['tokens'].squeeze()
bow_tr_counts = loadmat(os.path.join(path, 'bow_tr_counts'))['counts'].squeeze()
bow_tr_timestamps = loadmat(os.path.join(path, 'bow_tr_timestamps'))['timestamps'].squeeze()

In [None]:
print(len(bow_tr_timestamps))
print(len(bow_tr_counts))
print(len(bow_tr_tokens[0]))

In [None]:
"""print(bow_tr_timestamps)
for i in range(8):
    print(i, "\t", bow_tr_tokens[i].shape, "\t", bow_tr_counts[i].shape)"""

## 4. Estimation of the topic models

<font color='red'>INTRODUCTION TO MODEL ESTIMATION</font>

In [None]:
topics = 5

We want to use the same embedding space for both *ETM* and *DETM*, so we first fit the word embeddings and then provide them as input to both models. In particular, we fit a simple *skipgram*; this implementation is an adaptation of this [code](https://github.com/adjidieng/ETM/blob/master/skipgram.py):

In [None]:
docs_bow = [doc.split() for doc in docs]
# fit embeddings (gensim 3.x argument names; gensim >= 4.0 renamed `size` to `vector_size` and `iter` to `epochs`)
skipgram = Word2Vec(sentences=docs_bow, min_count=100, sg=1, size=100, iter=5, workers=5, negative=10, window=4)
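
As a rough illustration of what the `window` argument means for a skip-gram model, the sketch below lists the (center, context) pairs extracted from one tokenized sentence with a fixed window. This is a simplified view and not gensim's implementation, which adds negative sampling and other refinements; `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window):
    """Return all (center, context) pairs where the context lies within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Made-up sentence, used only for illustration
pairs = skipgram_pairs(["united", "nations", "general", "assembly"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

The model is trained to predict each context word from its center word, so words that share contexts end up with similar embeddings.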

In [None]:
save_embeddings(emb_model=skipgram, emb_file='data/un-general-debates_embeddings.txt', vocab=[])

### 4.i. Latent Dirichlet Allocation (LDA)

<font color='red'>BRIEF DESCRIPTION OF THE MODEL</font><br>
<font color='blue'>MODEL ESTIMATION</font><br>
*DO THE SAME FOR ALL TOPIC MODELS*

We report here a simple example: **instead of using `common_texts`, in the project we should use `docs` and `id2word`; the latter is calculated in the `preprocessing` function**.

In [None]:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
id2word = dict(common_dictionary)

Estimate *LDA* with 5 topics. **\[check function arguments [here](https://radimrehurek.com/gensim/models/ldamodel.html)\]**

In [None]:
#lda = LdaModel(common_corpus, id2word=id2word, num_topics=topics, alpha='auto', eta='auto')

### 4.ii. Dynamic Topic Model (DTM)

Derive the `time_slice` argument of `ldaseqmodel.LdaSeqModel` from the `timestamps` variable: it is the list of document counts per time slice, in chronological order.

In [None]:
"""
sorted_times = sorted(set(timestamps))
time_slice = Counter(timestamps)
time_slice = [time_slice[t] for t in sorted_times]
"""

# common_corpus example
time_slice = [4, 5]
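
The computation in the commented-out snippet above can be checked on a toy example. The timestamps below are made up; the variable names are prefixed with `toy_` so that they do not overwrite the notebook's own `timestamps` and `time_slice`.

```python
from collections import Counter

# Made-up timestamps, one per document
toy_timestamps = ["1971", "1970", "1971", "1972", "1970", "1971"]

toy_sorted_times = sorted(set(toy_timestamps))   # distinct time points, in order
toy_counter = Counter(toy_timestamps)            # documents per time point
toy_time_slice = [toy_counter[t] for t in toy_sorted_times]
print(toy_sorted_times, toy_time_slice)
```

The entries of the resulting list sum to the number of documents. As far as we understand the gensim interface, the corpus itself should be ordered by time so that each slice covers a contiguous block of documents; this is worth verifying before fitting on the real data.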

Estimate the *DTM* with 5 topics. **\[check function arguments [here](https://radimrehurek.com/gensim/models/ldaseqmodel.html)\]**

In [None]:
#ldaseq = ldaseqmodel.LdaSeqModel(corpus=common_corpus, id2word=id2word, time_slice=time_slice, num_topics=topics)

### 4.iii. Embedded Topic Model (ETM)

In [3]:
from src.main_ETM import main_ETM
main_ETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='data',
         emb_path='data/un-general-debates_embeddings.txt', min_df=100,
         num_topics=5, train_embeddings=0, epochs=5, visualize_every=10, tc=True, td=True)



=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Training an Embedded Topic Model on UN-GENERAL-DEBATES
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
3007  The Vocabulary size is here 
model: ETM(
  (t_drop): Dropout(p=0.0, inplace=False)
  (theta_act): ReLU()
  (alphas): Linear(in_features=300, out_features=5, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=3007, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=5, bias=True)
  (logsigma_q_theta): Linear(in_features=800, out_features=5, bias=True)
)


Visualizing model quality before tra

### 4.iv. Dynamic Embedded Topic Model (DETM)

In [5]:
from src.main_DETM import main_DETM
main_DETM(dataset='un-general-debates', data_path='data/un-general-debates', save_path='data',
          emb_path='data/un-general-debates_embeddings.txt', min_df=100,
          num_topics=5, train_embeddings=0, epochs=5, visualize_every=5, tc=False)

Getting vocabulary ...
Getting training data ...
idx: 0/2
Getting validation data ...
idx: 0/1
Getting testing data ...
idx: 0/1
idx: 0/1
idx: 0/1
Getting embeddings ...


=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Training a Dynamic Embedded Topic Model on UN-GENERAL-DEBATES
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

DETM architecture: DETM(
  (t_drop): Dropout(p=0.0, inplace=False)
  (theta_act): ReLU()
  (q_theta): Sequential(
    (0): Linear(in_features=3012, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=5, bias=True)
  (logsigma_q_theta):

FileNotFoundError: [Errno 2] No such file or directory: 'data\\detm_un-general-debates_K_5_Htheta_800_Optim_adam_Clip_0.0_ThetaAct_relu_Lr_0.005_Bsz_1000_RhoSize_300_L_3_minDF_100_trainEmbeddings_0'

## 5. Model comparison

<font color='red'>INTRODUCTION ON HOW WE WANT TO COMPARE MODELS</font>

**The idea is introduced very well in "Topic modeling in embedding spaces": let's copy from there!**

### 5.i. Quantitative analysis

<font color='blue'>COMPUTATION OF VARIOUS METRICS AND CONSTRUCTION OF GRAPHS</font>

### 5.ii. Qualitative analysis

<font color='blue'>INTERPRETATION OF TOPICS AND DOCUMENT REPRESENTATION (LDA vs ETM, DTM vs DETM)</font>

## 6. Conclusion

<font color='red'>FINAL REMARKS</font>

## References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022. [ACM Digital Library](https://dl.acm.org/doi/10.5555/944919.944937)

Blei, D. M., & Lafferty, J. D. (2006, June). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning (pp. 113-120). [ACM Digital Library](https://doi.org/10.1145/1143844.1143859)

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2019). The dynamic embedded topic model. arXiv preprint arXiv:1907.05545. [arXiv link](https://arxiv.org/abs/1907.05545)

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439-453. [ACL Anthology](https://aclanthology.org/2020.tacl-1.29/), [arXiv link](https://arxiv.org/abs/1907.04907)