# Programming Methodologies for Data Analysis

Lorenzo Dell'Oro<br>Giovanni Toto<br>Gian Luca Vriz

## 1. Introduction

<font color='red'>KEY IDEA AND OBJECTIVE OF THE PROJECT</font>

<font color='blue'>LIBRARIES</font>

In [None]:
import csv
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pickle
import random
from scipy import sparse
import itertools
from scipy.io import savemat, loadmat
import string
import os
# ------------------------------------------ # 1
from src.preprocessing import preprocessing  # 2
from gensim.models import LdaModel           # 4.i
from collections import Counter              # 4.ii
from gensim.models import ldaseqmodel        # 4.ii

<font color='red'>INTRODUCTION TO DATA</font>

## 2. Pre-processing of the corpus

<font color='blue'>PRE-PROCESSING OF TEXTS</font>

First, we import the dataset from a file (txt, csv, ...); the example below uses the [UN General Debates corpus](https://www.kaggle.com/datasets/unitednations/un-general-debates):

In [None]:
# Read stopwords (one per line)
with open("./data/stops.txt", "r") as f:
    stopwords = f.read().split('\n')

# Data type
flag_split_by_paragraph = False  # whether to split documents by paragraph

# Read raw data (https://www.kaggle.com/datasets/unitednations/un-general-debates)
print('reading raw data...')
with open('./data/raw/un-general-debates.csv', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
    line_count = 0
    all_timestamps_ini = []
    all_docs_ini = []
    for row in csv_reader:
        # skip header
        if line_count > 0:
            all_timestamps_ini.append(row[1])  # column 1: timestamp
            all_docs_ini.append(row[3])        # column 3: document text
        line_count += 1
        # NOTE: temporary limit, only the first 10 documents are read
        if line_count == 11:
            break

if flag_split_by_paragraph:
    print('splitting by paragraphs...')
    docs = []
    timestamps = []
    for dd, doc in enumerate(all_docs_ini):
        splitted_doc = doc.split('.\n')
        for ii in splitted_doc:
            docs.append(ii)
            timestamps.append(all_timestamps_ini[dd])
else:
    docs = all_docs_ini
    timestamps = all_timestamps_ini

del all_docs_ini
del all_timestamps_ini

# NOTE: manual override of the first two timestamps (presumably for testing)
timestamps[0] = '1900'
timestamps[1] = '1900'

print('  number of documents: D={}'.format(len(docs)))

Introduction to pre-processing with gensim: [link](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#from-strings-to-vectors)

The actual pre-processing starts here: we lowercase the text, strip the byte-order mark, apostrophes, punctuation and digits, and then drop tokens of a single character.
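
To see the effect of the cleaning applied in the next code cell, here is a minimal sketch that wraps the same string operations in a function and runs them on one made-up sentence. The names `clean_document`, `_STRIP_TABLE` and `sample` are hypothetical and are not part of the project code.

```python
import string

# Translation table that deletes ASCII punctuation and digits.
_STRIP_TABLE = str.maketrans("", "", string.punctuation + "0123456789")

def clean_document(text):
    """Clean one document with the same steps used in the notebook cell."""
    tokens = []
    for w in text.split():
        w = w.replace("\ufeff", "").lower()          # drop BOM, lowercase
        w = w.replace("’", " ").replace("'", " ")    # apostrophes become spaces
        w = w.translate(_STRIP_TABLE)                # drop punctuation and digits
        if len(w) > 1:                               # drop empty / one-char tokens
            tokens.append(w)
    return " ".join(tokens)

# Made-up sentence, for illustration only.
sample = "\ufeffThe 44th Session, held in 1989, was the Assembly's finest!"
cleaned = clean_document(sample)
print(cleaned)  # the th session held in was the assembly s finest
```

Note that the length filter acts on whitespace-separated tokens before they are joined, so fragments such as the `s` left over from `Assembly's` (or the `th` from `44th`) remain in the output string; whether they are removed later depends on the stop-word list and on `preprocessing`.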

In [None]:
# Remove punctuation
print('removing punctuation...')
# Translation table that deletes ASCII punctuation and digits
table = str.maketrans('', '', string.punctuation + "0123456789")
docs = [[w.replace('\ufeff', '').lower().replace("’", " ").replace("'", " ").translate(table) for w in doc.split()] for doc in docs]
docs = [[w for w in doc if len(w) > 1] for doc in docs]
docs = [" ".join(doc) for doc in docs]

Finally, we call the `preprocessing` function from the `src/preprocessing.py` module, which creates files compatible with *ETM* and *DETM*. We also use these files for the exploratory analysis and for estimating *LDA* and *DTM*.

In [None]:
preprocessing(data_path="data/un-general-debates", docs=docs, timestamps=timestamps, stopwords=stopwords,
              min_df=1, max_df=0.7, data_split=[0.85, 0.1, 0.05], seed=28)

The function also splits the corpus into training, test and validation sets; the exploratory analyses below use the training set.
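
The internals of `preprocessing` are not shown in this notebook. As a rough illustration of what arguments such as `min_df`, `max_df`, `data_split` and `seed` conventionally mean, the sketch below filters a toy vocabulary by document frequency and cuts shuffled document indices into three parts. It assumes the `CountVectorizer` convention (an integer `min_df` is an absolute count, a float `max_df` is a proportion of documents); the functions `filter_by_document_frequency` and `split_indices` are hypothetical, and the actual function may behave differently, including in the order in which the three proportions are assigned to the sets.

```python
import random
from collections import Counter

def filter_by_document_frequency(docs, min_df=1, max_df=0.7):
    """Keep words found in at least `min_df` documents (absolute count)
    and in at most a fraction `max_df` of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    vocab = {w for w, df in doc_freq.items()
             if df >= min_df and df / n_docs <= max_df}
    filtered = [[w for w in doc.split() if w in vocab] for doc in docs]
    return filtered, sorted(vocab)

def split_indices(n_docs, proportions, seed):
    """Shuffle document indices reproducibly and cut them into consecutive parts."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    parts, start = [], 0
    for p in proportions[:-1]:
        size = int(round(p * n_docs))
        parts.append(idx[start:start + size])
        start += size
    parts.append(idx[start:])  # the last part takes whatever remains
    return parts

# Toy documents, for illustration only.
toy_docs = ["peace and security", "peace and development",
            "peace and trade", "climate and peace"]
filtered, vocab = filter_by_document_frequency(toy_docs, min_df=1, max_df=0.7)
parts = split_indices(20, [0.85, 0.1, 0.05], seed=28)
print(vocab)                    # ['climate', 'development', 'security', 'trade']
print([len(p) for p in parts])  # [17, 2, 1]
```

In the toy corpus, `peace` and `and` occur in every document, so they exceed `max_df=0.7` and are dropped, much like corpus-specific stop words.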

## 3. Exploratory analysis of the processed corpus

<font color='blue'>EXPLORATORY ANALYSIS OF THE CORPUS</font>

## 4. Estimation of the topic models

<font color='red'>INTRODUCTION TO MODEL ESTIMATION</font>

In [None]:
topics = 5

### 4.i. Latent Dirichlet Allocation (LDA)

<font color='red'>BRIEF DESCRIPTION OF THE MODEL</font><br>
<font color='blue'>MODEL ESTIMATION</font><br>
*DO THE SAME FOR ALL TOPIC MODELS*

Here is a simple example. **In the project we should use `docs` and `id2word` instead of `common_texts`; `id2word` is computed by the `preprocessing` function.**

In [None]:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel  # already imported above

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
id2word = dict(common_dictionary)

Estimate *LDA* with 5 topics. **\[check function arguments [here](https://radimrehurek.com/gensim/models/ldamodel.html)\]**

In [None]:
lda = LdaModel(common_corpus, id2word=id2word, num_topics=topics, alpha='auto', eta='auto')

### 4.ii. Dynamic Topic Model (DTM)

Obtain the `time_slice` argument of `ldaseqmodel.LdaSeqModel` from the `timestamps` variable.
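
`time_slice` lists the number of documents in each time slice, so the corpus passed to the model is expected to be ordered chronologically to match those counts. A minimal sketch of this bookkeeping on toy data follows; `build_time_slice` and the toy variables are hypothetical names, and the timestamps are assumed to be strings that sort chronologically (such as four-digit years).

```python
from collections import Counter

def build_time_slice(docs, timestamps):
    """Sort documents by timestamp and count the documents per time slice."""
    # Stable sort of indices, so documents with equal timestamps keep their order.
    order = sorted(range(len(docs)), key=lambda i: timestamps[i])
    sorted_docs = [docs[i] for i in order]
    counts = Counter(timestamps)
    # One count per distinct timestamp, in chronological order.
    time_slice = [counts[t] for t in sorted(counts)]
    return sorted_docs, time_slice

# Toy data, for illustration only.
toy_docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
toy_timestamps = ["1971", "1970", "1971", "1970", "1972"]
sorted_docs, time_slice = build_time_slice(toy_docs, toy_timestamps)
print(sorted_docs)  # ['doc b', 'doc d', 'doc a', 'doc c', 'doc e']
print(time_slice)   # [2, 2, 1]
```

The counts always sum to the number of documents, which gives a quick sanity check before fitting the model.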

In [None]:
"""
sorted_times = sorted(set(timestamps))
time_slice = Counter(timestamps)
time_slice = [time_slice[t] for t in sorted_times]
"""

# common_corpus example
time_slice = [4, 5]

Estimate the *DTM* with 5 topics. **\[check function arguments [here](https://radimrehurek.com/gensim/models/ldaseqmodel.html)\]**

In [None]:
ldaseq = ldaseqmodel.LdaSeqModel(corpus=common_corpus, id2word=id2word, time_slice=time_slice, num_topics=topics)

### 4.iii. Embedded Topic Model (ETM)

### 4.iv. Dynamic Embedded Topic Model (DETM)

## 5. Model comparison

<font color='red'>INTRODUCTION ON HOW WE WANT TO COMPARE MODELS</font>

**The idea is introduced very well in "Topic modeling in embedding spaces": let's copy from there!**

### 5.i. Quantitative analysis

<font color='blue'>COMPUTATION OF VARIOUS METRICS AND CONSTRUCTION OF GRAPHS</font>

### 5.ii. Qualitative analysis

<font color='blue'>INTERPRETATION OF TOPICS AND DOCUMENT REPRESENTATION (LDA vs ETM, DTM vs DETM)</font>

## 6. Conclusion

<font color='red'>FINAL REMARKS</font>

## References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022. [ACM Digital Library](https://dl.acm.org/doi/10.5555/944919.944937)

Blei, D. M., & Lafferty, J. D. (2006, June). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning (pp. 113-120). [ACM Digital Library](https://doi.org/10.1145/1143844.1143859)

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2019). The dynamic embedded topic model. arXiv preprint arXiv:1907.05545. [Arxiv link](https://arxiv.org/abs/1907.05545)

Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439-453. [ACL Anthology](https://aclanthology.org/2020.tacl-1.29/), [Arxiv link](https://arxiv.org/abs/1907.04907)