# Lab : Generative models at the lexical level

## Objectives:

Explore the two generative models seen in class (Naïve Bayes for classification, Latent Dirichlet Allocation for topic modeling) by applying them to a relatively small classification dataset, **20 Newsgroups**. We will look at how they perform on the classification task and at how to interpret the topic models.
1. Pre-process the data: clean it, understand the various possibilities for pre-processing steps.
2. Obtain representations: first, symbolic document representations: **BoW**, then **TF-IDF**
    - We will first implement our functions for doing so, then use ```sklearn```. 
3. Perform classification:
    - We will first implement our function for Naïve Bayes, then use ```sklearn```.
    - We will search for the best hyper-parameters using ```Pipeline```.
4. Perform topic modeling:
    - We will quickly compare LSA and LDA and try to interpret them. 
    - We will implement simple metrics and look for the best hyper-parameters maximizing them.

## Necessary dependencies

We will need the following packages:
- The Natural Language Toolkit : http://www.nltk.org/install.html
- The Machine Learning API Scikit-learn : http://scikit-learn.org/stable/install.html
- NumPy, Matplotlib, ```unidecode``` and ```pyLDAvis```, which are also imported in this notebook

In [1]:
import os.path as op
import re 
import numpy as np
from pprint import pprint

## Loading data

We retrieve the textual data in *ng_train.data*.

The labels are retrieved in *ng_train.target*, which contains *len(ng_train.data)* of them: each label is an integer that indexes *ng_train.target_names*, the list of the 20 newsgroup names.

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
ng_train = fetch_20newsgroups(subset='train',
                              remove=('headers', 'footers', 'quotes')
                              )

In [None]:
pprint(dir(ng_train))

In [None]:
pprint(ng_train.target_names)

Example of one document:

In [None]:
pprint(ng_train.data[0])
print("Target: ", ng_train.target_names[ng_train.target[0]])

## 1 - Document Preprocessing

You should use a pre-processing function you can apply to the raw text before any other processing (*e.g.*, tokenization and obtaining representations). Some pre-processing can also be tied to the tokenization (*e.g.*, removing stop words).

In [None]:
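import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import unidecode
import string

# Resources used in this lab: stop words, tokenizer, POS tagger and WordNet lemmatizer.
# Depending on the NLTK version, the tokenizer and tagger resources may instead be
# named 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
for resource in ['stopwords', 'punkt', 'averaged_perceptron_tagger', 'wordnet']:
    nltk.download(resource)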

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def clean_text(text: str,
               rm_numbers=True,
               rm_punct=True,
               rm_stop_words=True,
               rm_short_words=True):
    # make lowercase
    text = text.lower()

    # remove URLs
    URL_PATTERN = re.compile(r'\b(?:https?://|www\.)\S+\b', flags=re.IGNORECASE)
    text = URL_PATTERN.sub('', text)

    # remove email addresses (before domain names, which would otherwise break them apart)
    EMAIL_PATTERN = re.compile(r'\b[a-z0-9._%+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,}\b', flags=re.IGNORECASE)
    text = EMAIL_PATTERN.sub('', text)

    # remove domain names
    DOMAIN_PATTERN = re.compile(r'\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b', flags=re.IGNORECASE)
    text = DOMAIN_PATTERN.sub('', text)
                               
    # remove punctuation
    if rm_punct:
        text = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))

    # remove numbers
    if rm_numbers:
        text = re.sub(r'\d+', '', text)

    # replace linebreaks and strip
    text = ...
    text = ...

    # remove stopwords
    if rm_stop_words:
        stop_words = set(stopwords.words('english'))
        # Apply tokenization and filter stop words
        ...
        text_list = ...
        text = ' '.join(text_list)
        
    # remove short words
    if rm_short_words:
        # Apply tokenization and filter short words
        ...
        text_list = ...
        # Put text back together
        text = ' '.join(text_list)
    
    return text

In [None]:
pprint(clean_text(ng_train.data[0]))
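As an illustration of the steps left to complete in ```clean_text```, here is a minimal sketch of whitespace normalization and token filtering. It is not the expected solution: the helper names, the toy stop-word list and the whitespace tokenization are assumptions made to keep the example self-contained (the lab itself relies on ```stopwords.words('english')``` and ```word_tokenize```).

```python
import re

# Hypothetical, tiny stop-word list used only to keep the sketch self-contained;
# the lab uses stopwords.words('english') from NLTK instead.
TOY_STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def normalize_whitespace(text):
    """Replace line breaks and runs of whitespace by a single space, then strip."""
    return re.sub(r"\s+", " ", text).strip()

def filter_tokens(text, stop_words=TOY_STOP_WORDS, min_len=3):
    """Drop stop words and tokens shorter than min_len (whitespace tokenization)."""
    tokens = text.split()
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_len]
    return " ".join(kept)

example = "the cat\n\n  is on   a mat in paris "
print(filter_tokens(normalize_whitespace(example)))
```

Filtering stop words and short words in a single pass is a design choice of this sketch; keeping them as two separate steps, as in the function skeleton, makes each option easier to switch off.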

The dataset contains 20 classes. However, **some of them are pretty close together. We aggregate them into 6 semantically coherent classes** which should now be easier to distinguish:

In [None]:
def aggregate_labels(label):
    # comp
    if label in [1,2,3,4,5]:
        new_label = 0
    # rec
    elif label in [7,8,9,10]:
        new_label = 1
    # sci
    elif label in [11,12,13,14]:
        new_label = 2
    # misc 
    elif label in [6]:
        new_label = 3
    # pol
    elif label in [16,17,18]:
        new_label = 4
    # rel
    elif label in [0,15,19]:
        new_label = 5
    else:
        raise ValueError(f"Unknown label: {label}")
    return new_label

We clean the documents and aggregate their labels, making sure that **we don't have any empty document** left after cleaning:

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
ng_train_text = ...
ng_train_labels = ...

In [None]:
ng_test = fetch_20newsgroups(subset='test',
                             remove=('headers', 'footers', 'quotes')
                            )

ng_test_text = ...
ng_test_labels = ...
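One possible way to discard empty documents while keeping texts and labels aligned is sketched below. The helper ```drop_empty``` is hypothetical (it is not part of the lab skeleton) and assumes the texts have already been cleaned.

```python
def drop_empty(texts, labels):
    """Keep only (text, label) pairs whose text is non-empty after stripping."""
    kept = [(t, l) for t, l in zip(texts, labels) if t.strip()]
    if not kept:
        return [], []
    new_texts, new_labels = zip(*kept)
    return list(new_texts), list(new_labels)

toy_texts, toy_labels = drop_empty(["some text", "", "   ", "more text"], [0, 1, 2, 3])
print(toy_texts, toy_labels)
```

Filtering pairs rather than texts alone avoids shifting the labels when a document is removed.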

We may apply a **lemmatizer**. We can get one from ```NLTK```.
For it to work properly, we need the **part-of-speech** information of each word: 
- *Meeting* will not have the same lemma depending on whether it is a verb or a noun!
    
For that, we can use the following tools:
- ```word_tokenize``` to cut the document into tokens,
- ```pos_tag``` to obtain part-of-speech tags,
- ```get_wordnet_pos```, a mapping function defined below, which converts the Penn Treebank tags returned by ```pos_tag``` into the WordNet POS constants expected by the lemmatizer.
    

In [None]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

In [None]:
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default solution

def preprocess_and_lemmatize(text):   
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in tagged_tokens]
    return " ".join(lemmatized)

In [None]:
lemmatized_doc = preprocess_and_lemmatize(clean_text(ng_train.data[0]))
pprint(lemmatized_doc)

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
ng_train_text_lemma = ...

In [None]:
from sklearn.model_selection import train_test_split

ng_train_text_splt, ng_val_text, ng_train_labels_splt, ng_val_labels = train_test_split(ng_train_text_lemma, ng_train_labels, test_size=.2)

## 2 - Document representations 

Our statistical model, like most models applied to textual data, uses counts of word occurrences in a document. Thus, a very convenient way to represent a document is to use a Bag-of-Words (BoW) vector, containing the counts of each word (regardless of their order of occurrence) in the document. 

If we consider the set of all the words appearing in our $T$ training documents, which we note $V$ (Vocabulary), we can create **an index**, which is a bijection associating each word $w$ with an integer, which will be its position in $V$. 

Thus, for a document extracted from a set of documents containing $|V|$ different words, a BoW representation will be a vector of size $|V|$, whose value at the index of a word $w$ will be its number of occurrences in the document. 

We can use the **CountVectorizer** class from scikit-learn to better understand:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import cross_val_score
from sklearn.base import BaseEstimator, ClassifierMixin

In [None]:
corpus = ['I walked down down the boulevard',
          'I walked down the avenue',
          'I ran down the boulevard',
          'I walk down the city',
          'I walk down the the avenue']
vectorizer = CountVectorizer()

Bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
Bow.toarray()

We display the list containing the words ordered according to their index (note that, with the default settings of ``CountVectorizer``, the text is lowercased and tokens of a single character, such as *I*, are ignored).

The next function takes as input a list of documents (each in the form of a string) and returns, as in the example using ``CountVectorizer``:
- A vocabulary that associates, to each word encountered, an index
- A matrix, with rows representing documents and columns representing words indexed by the vocabulary. In position $(i,j)$, one should have the number of occurrences of the word $j$ in the document $i$.

The vocabulary, which was in the form of a *list* in the previous example, can be returned in the form of a *dictionary* whose keys are the words and values are the indices. Since the vocabulary lists the words in the corpus without worrying about their number of occurrences, it can be built up using a set (in python).
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def count_words(texts):
    """Vectorize text : return count of each word in the text snippets

    Parameters
    ----------
    texts : list of str
        The texts
    Returns
    -------
    vocabulary : dict
        A dictionary that points to an index in counts for each word.
    counts : ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
    """
    # Obtain the set of all words present in the data
    ...
    # Use it to create the vocabulary
    vocabulary = dict(...) 
    # Create the term document matrix
    counts = np.zeros((..., ...))
    # Fill it 
    ...
    
    return vocabulary, counts

In [None]:
Voc, X = count_words(corpus)
print(Voc)
print(X)

Now, if we want to represent text that was not available when building the vocabulary, we will not be able to represent **new words**! Let's take a look at how ```CountVectorizer``` handles it:

In [None]:
val_corpus = ['I walked up the street']
Bow = vectorizer.transform(val_corpus)
Bow.toarray()

Modify the ```count_words``` function so that it can deal with new documents when given a previously obtained vocabulary!
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def count_words(texts, voc=None):
    """Vectorize text : return count of each word in the text snippets

    Parameters
    ----------
    texts : list of str
        The texts
    voc : dict, optional
        A previously obtained vocabulary. If None, it is built from texts.
    Returns
    -------
    vocabulary : dict
        A dictionary that points to an index in counts for each word.
    counts : ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
    """
    if voc is None:
        ...
    else:
        vocabulary = voc
    ...
    
    return vocabulary, counts

In [None]:
voc, train_bow = count_words(ng_train_text_splt)
print(train_bow.shape)
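For comparison, here is a minimal sketch of a counting function that accepts an existing vocabulary. It is one possible approach, not the reference solution: the name ```count_words_sketch```, the lowercasing and the whitespace tokenization are assumptions, so its vocabulary will differ from the one built by ```CountVectorizer```.

```python
import numpy as np

def count_words_sketch(texts, voc=None):
    """Return (vocabulary, counts) using lowercase whitespace tokenization."""
    tokenized = [text.lower().split() for text in texts]
    if voc is None:
        # Build the vocabulary from the set of all words, sorted for reproducibility
        words = sorted({w for tokens in tokenized for w in tokens})
        voc = {w: i for i, w in enumerate(words)}
    counts = np.zeros((len(texts), len(voc)), dtype=int)
    for d, tokens in enumerate(tokenized):
        for w in tokens:
            j = voc.get(w)
            if j is not None:  # words absent from the vocabulary are ignored
                counts[d, j] += 1
    return voc, counts

toy_voc, toy_counts = count_words_sketch(["I walked down down the boulevard",
                                          "I ran down the boulevard"])
_, new_counts = count_words_sketch(["I walked up the street"], voc=toy_voc)
print(toy_voc)
print(toy_counts)
print(new_counts)
```

Ignoring unknown words keeps the number of columns fixed, so that representations of new documents remain compatible with a model trained on the original vocabulary.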

Compare with the ```sklearn``` version:

In [None]:
vectorizer = CountVectorizer()
Bow = vectorizer.fit_transform(ng_train_text_splt)
train_bow_sk = Bow.toarray()
print(train_bow_sk.shape)

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Careful: check the size that the representations are going to have (given the way they are built). What does this imply for memory use? Which ```CountVectorizer``` arguments help avoid the issue?

In [None]:
vectorizer = CountVectorizer(min_df=2, max_df=0.85)
Bow = vectorizer.fit_transform(ng_train_text_splt)
train_bow_sk = Bow.toarray()
print(train_bow_sk.shape)

In what comes next, we will mainly use the ```min_df``` and ```max_df``` arguments to affect pre-processing.

In [None]:
val_bow = vectorizer.transform(ng_val_text).toarray()
print(val_bow.shape)

Let's first look at the most frequent words. This will require some simple array manipulation:
- Retrieving the sum of all word occurrences across documents,
- Sorting words according to their frequency,
- Plotting a bar chart for the top words, using the count as value and the word as label.

How can that influence our pre-processing?

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
frequency = ... # Total count of each word in the data
top_words = ... # Indexes sorted by frequency

In [None]:
# Get the vocabulary from the vectorizer using get_feature_names_out()
voc = dict(zip(vectorizer.get_feature_names_out(),range(len(vectorizer.get_feature_names_out()))))

In [None]:
import matplotlib.pyplot as plt

In [None]:
rev_voc = {i: w for w, i in voc.items()} # Reverse vocabulary
fig, ax = plt.subplots(figsize=(16,8))
ax.bar(range(15), frequency[top_words[:15]])
ax.set_xticks(range(15))
ax.set_xticklabels([rev_voc[i] for i in top_words[:15]], rotation='vertical')
plt.show()
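The array manipulation behind this plot can be sketched on a toy matrix as follows. The ```toy_``` names and values are made up for illustration; on the real data the same operations would apply to the document-term matrix obtained from the vectorizer.

```python
import numpy as np

# Toy document-term matrix: 3 documents (rows), 3 words (columns)
toy_bow = np.array([[2, 0, 1],
                    [1, 1, 1],
                    [3, 0, 0]])
toy_words = np.array(["down", "avenue", "walk"])

toy_frequency = toy_bow.sum(axis=0)          # total count of each word over all documents
toy_top = np.argsort(toy_frequency)[::-1]    # word indices sorted by decreasing frequency
print(toy_words[toy_top], toy_frequency[toy_top])
```

Note that ```np.argsort``` sorts in increasing order, hence the reversal with ```[::-1]```.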

**Improving those representations with TF-IDF**: This method is usually used to measure the importance of a term $w$ in a document $d$ relative to the rest of the corpus, from a matrix of occurrences $\text{words} \times \text{documents}$. Thus, for a matrix $T$ of $|V|$ terms and $D$ documents:
$$\text{TF}(T, w, d) = \frac{T_{w,d}}{\sum_{w'=1}^{|V|} T_{w',d}} $$

$$\text{IDF}(T, w) = \log\left(\frac{D}{|\{d : T_{w,d} > 0\}|}\right)$$

$$\text{TF-IDF}(T, w, d) = \text{TF}(T, w, d) \cdot \text{IDF}(T, w)$$

TF-IDF is generally better suited to sparse (low-density) matrices, since it penalizes terms that appear in a large part of the documents. 
Implement a function transforming the BoW representations we obtained as output of ```count_words``` into TF-IDF representations. Note that these BoW arrays have shape (documents, words), i.e. they are the transpose of $T$:
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
from sklearn.preprocessing import normalize

def tfidf_transform(bow):
    """
    Inverse document frequencies applied to our bag-of-words representations
    """
    # IDF
    ...
    idfs = ...
    # TF
    ...
    tfs = ...
    
    tf_idf = tfs * np.expand_dims(idfs,axis=0)
    return tf_idf

In [None]:
tfidf = tfidf_transform(train_bow_sk)
print(tfidf.shape)
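To make the TF and IDF definitions concrete, here is a small sketch that applies them directly to a (documents, words) count matrix. The function name ```tfidf_sketch``` and the guards against division by zero are assumptions of this example, not part of the lab skeleton.

```python
import numpy as np

def tfidf_sketch(bow):
    """Plain TF-IDF on a (n_documents, n_words) count matrix."""
    bow = np.asarray(bow, dtype=float)
    n_docs = bow.shape[0]
    doc_freq = (bow > 0).sum(axis=0)                  # number of documents containing each word
    idfs = np.log(n_docs / np.maximum(doc_freq, 1))   # guard: words absent from every document
    doc_len = bow.sum(axis=1, keepdims=True)          # total number of tokens per document
    tfs = bow / np.maximum(doc_len, 1)                # guard: empty documents
    return tfs * idfs[None, :]

toy = np.array([[2, 1, 0],
                [1, 0, 1]])
toy_tfidf = tfidf_sketch(toy)
print(toy_tfidf)
```

In this toy example the first word appears in both documents, so its IDF is $\log(2/2) = 0$ and its column vanishes. The values will not match those of ```TfidfVectorizer```, which by default uses a smoothed IDF and L2-normalizes each row.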

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

In [None]:
# Create and fit the vectorizer to the training data
tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.85)
Tfidf = tfidf_vectorizer.fit_transform(ng_train_text_splt)
tfidf_sk = Tfidf.toarray()
print(tfidf_sk.shape)

## 3 - Classification with Naive Bayes

We will implement a class ```NB``` that should correspond to a **scikit-learn model**. It will contain the following methods:

```python
def fit(self, X, y)
``` 
**Training**: will learn a statistical model based on the representations $X$ corresponding to the labels $y$.
Here, $X$ contains representations obtained as the output of ```count_words```. You can complete the function using the Naïve Bayes procedure seen in class. 

Note: the smoothing is not necessarily done with a $1$ - it can be done with a positive value $\alpha$, which we can implement as an argument of the class ```NB```.

```python
def predict(self, X)
```
**Testing**: will return the labels predicted by the model for other representations $X$.
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class NB(BaseEstimator, ClassifierMixin):
    # Inheriting from BaseEstimator and ClassifierMixin allows the class to be used with sklearn utilities
    def __init__(self, alpha=1.0):
        # alpha is a smoothing parameter
        self.alpha = alpha

    def fit(self, X, y):
        # Compute the prior probabilities of classes
        ...
        # And the conditional probabilities of words given classes
        ...
        # Save them as model attributes
        self.log_prior_ = ...
        self.log_cond_prob_ = ...
        return self

    def predict(self, X):
        # Do prediction: compute the score of each document
        ...
        scores = ...
        # And return the classes maximizing those scores
        ...
        return ...
        
    def score(self, X, y):
        # Return accuracy
        return ...

In [None]:
clf_nb = NB()
clf_nb.fit(train_bow_sk, ng_train_labels_splt)
val_pred = clf_nb.predict(val_bow)
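As a point of comparison, a multinomial Naïve Bayes with additive smoothing can be sketched in a few lines of NumPy. This is only an illustrative version under our own assumptions: the class name ```ToyNB``` is hypothetical, it does not inherit from the scikit-learn base classes, and it expects dense count matrices.

```python
import numpy as np

class ToyNB:
    """Minimal multinomial Naive Bayes sketch with additive smoothing alpha."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(c): relative frequency of each class in the training labels
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # Smoothed word counts per class, normalized into P(w | c)
        counts = np.array([X[y == c].sum(axis=0) for c in self.classes_]) + self.alpha
        self.log_cond_prob_ = np.log(counts / counts.sum(axis=1, keepdims=True))
        return self

    def predict(self, X):
        # Score of class c for a document: log P(c) + sum_w count(w) * log P(w | c)
        scores = np.asarray(X, dtype=float) @ self.log_cond_prob_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]

    def score(self, X, y):
        return float(np.mean(self.predict(X) == np.asarray(y)))

X_toy = np.array([[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 0]])
y_toy = np.array([0, 0, 1, 1])
toy_nb = ToyNB(alpha=1.0).fit(X_toy, y_toy)
print(toy_nb.predict(np.array([[1, 0, 0], [0, 1, 0]])))
```

Working in log space turns the product of word probabilities into a matrix product and avoids numerical underflow on long documents.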

Besides accuracy, we can look at **F1**-measures, and display the *confusion matrix*.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

In [None]:
print(classification_report(ng_val_labels, val_pred))
cm = confusion_matrix(ng_val_labels , val_pred, normalize='true')
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=range(6))
disp.plot()
plt.show()

We can also use the scikit-learn ```MultinomialNB```. Experiment on this model too and compare the results.
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
from sklearn.naive_bayes import MultinomialNB
# Fit the model on the training data
...

In [None]:
# Do the prediction and evaluation
...

We want to **find the best hyper-parameters** for our model: in this case, they will mainly affect the pre-processing.
In what follows, use ```Pipeline``` to perform a series of quick experiments, and use the validation data to check which set of representations works best (depending on ```min_df```, ```max_df``` and on whether ```tf-idf``` is used):
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline

In [None]:
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB())
])

pipeline_tfidf = Pipeline([
    ("vect", CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ("clf", MultinomialNB())
])

In [None]:
min_dfs = [1, 2, 3, 5, 10]
max_dfs = [0.5, 0.6, 0.7, 0.85, 1.0]

# Test the model for those pre-processing hyper-parameters
...
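One way to organize such a search is sketched below. It is a possible approach under our own assumptions: the helper ```search_df_params``` and the toy corpus are hypothetical, and validation accuracy is used as the selection criterion. Parameters of a pipeline step are set through the ```step__parameter``` naming convention of ```set_params```.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def search_df_params(train_texts, train_labels, val_texts, val_labels,
                     min_dfs, max_dfs, use_tfidf=False):
    """Return (best_accuracy, best_min_df, best_max_df) on the validation set."""
    steps = [("vect", CountVectorizer())]
    if use_tfidf:
        steps.append(("tfidf", TfidfTransformer()))
    steps.append(("clf", MultinomialNB()))
    pipe = Pipeline(steps)

    best = (-1.0, None, None)
    for min_df in min_dfs:
        for max_df in max_dfs:
            pipe.set_params(vect__min_df=min_df, vect__max_df=max_df)
            try:
                pipe.fit(train_texts, train_labels)
            except ValueError:
                # Raised when the (min_df, max_df) pair leaves no usable term
                continue
            score = pipe.score(val_texts, val_labels)
            if score > best[0]:  # ties keep the first configuration found
                best = (score, min_df, max_df)
    return best

# Toy corpus, made up for illustration only
toy_train = ["cheap car engine", "fast car race", "space rocket launch", "rocket orbit space"]
toy_y = [0, 0, 1, 1]
toy_val = ["car engine race", "space orbit"]
toy_val_y = [0, 1]
toy_best = search_df_params(toy_train, toy_y, toy_val, toy_val_y, [1, 2], [0.75, 1.0])
print(toy_best)
```

Calling the function twice, with ```use_tfidf=False``` and ```use_tfidf=True```, allows comparing the two kinds of representations on the same grid.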

## 4 - Topic modeling with Latent Dirichlet Allocation

We will now investigate the use of Latent Semantic Analysis and Latent Dirichlet Allocation for topic modeling.
Let's begin with a simple application of both methods with a reduced number of topics (*e.g.*, ```n_topics = 20```) and try to interpret them. 
- We will use ```TruncatedSVD``` for LSA and ```LatentDirichletAllocation``` for LDA
- We will look at the most important words for each topic
- We will visualize the topics with ```pyLDAvis```
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# Let's take the best configuration obtained for classification
vectorizer = CountVectorizer(min_df=..., max_df=...)
Bow = vectorizer.fit_transform(ng_train_text_lemma)
train_bow_tm = Bow.toarray()
print(train_bow_tm.shape)

In [None]:
# Remove empty documents (with that pre-processing)
mask = (train_bow_tm.sum(axis=1) > 0)
train_bow_tm = train_bow_tm[mask]
print(train_bow_tm.shape)

In [None]:
lsa = TruncatedSVD(n_components = 20)
lsa_train_topics = lsa.fit_transform(train_bow_tm)

In [None]:
# Correspondences between documents and topics
print(lsa_train_topics.shape)
# Correspondences between topics and words
print(lsa.components_.shape)

In [None]:
voc = dict(zip(vectorizer.get_feature_names_out(),range(len(vectorizer.get_feature_names_out()))))
rev_voc = {i: w for w, i in voc.items()}

def most_important_words(n, reverse_vocabulary, topic_model):
    out = []
    for i, topic in enumerate(topic_model.components_):
        out.append([reverse_vocabulary[j] for j in topic.argsort()[:-n-1:-1]])
    return out

In [None]:
words = most_important_words(8, rev_voc, lsa)
for i, topic in enumerate(words[:]):
    print("Topic ", i+1, " : ", topic)

To use ```pyLDAvis```, we need **probability distributions**. This is straightforward with LDA, but the result of LSA needs to be adapted first, since its scores can be negative.

How can we perform this adaptation? What do you think is the best way to transform real-valued scores into a probability distribution in this context?
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
import pyLDAvis

In [None]:
# Distribution for topic / word correspondence
train_topic_term_abs = ...
train_topic_term_prob = train_topic_term_abs / train_topic_term_abs.sum(axis=1)[:, None]

# Distribution for document / topic correspondence 
train_doc_topic_abs = ...
train_doc_topic_prob = train_doc_topic_abs / train_doc_topic_abs.sum(axis=1)[:, None]
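Two candidate transformations from real-valued scores to distributions are sketched below, so that their effects can be compared. Both helper names are hypothetical, and neither is claimed to be the intended answer: taking absolute values treats strongly negative scores as important, whereas a softmax preserves the ordering of the scores.

```python
import numpy as np

def to_distribution_abs(scores):
    """Row-normalize absolute values so that each row sums to 1."""
    abs_scores = np.abs(np.asarray(scores, dtype=float))
    totals = abs_scores.sum(axis=1, keepdims=True)
    return abs_scores / np.where(totals == 0, 1.0, totals)  # all-zero rows stay at zero

def to_distribution_softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    scores = np.asarray(scores, dtype=float)
    shifted = np.exp(scores - scores.max(axis=1, keepdims=True))
    return shifted / shifted.sum(axis=1, keepdims=True)

toy_scores = np.array([[0.5, -1.5, 2.0],
                       [-0.2, 0.2, 0.0]])
toy_abs = to_distribution_abs(toy_scores)
toy_soft = to_distribution_softmax(toy_scores)
print(toy_abs)
print(toy_soft)
```

In the first row, the score $-1.5$ receives the second largest probability with absolute values but the smallest one with the softmax, which illustrates how much the choice affects the interpretation of a topic.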

In [None]:
prepared_data = pyLDAvis.prepare(
    topic_term_dists=train_topic_term_prob,
    doc_topic_dists=train_doc_topic_prob,
    doc_lengths=train_bow_tm.sum(axis=1),
    vocab=vectorizer.get_feature_names_out(),
    term_frequency=train_bow_tm.sum(axis = 0)
)

pyLDAvis.display(prepared_data)

We do the same for ```LDA```:

In [None]:
lda = LatentDirichletAllocation(n_components = 20)
lda_train_topics = lda.fit_transform(train_bow_tm)

In [None]:
words = most_important_words(8, rev_voc, lda)
for i, topic in enumerate(words[:]):
    print("Topic ", i+1, " : ", topic)

In [None]:
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

In [None]:
# lda was fitted on the documents kept by mask, so we pass the same rows of Bow
pyLDAvis.lda_model.prepare(lda, Bow[mask], vectorizer)
# Look at https://nbviewer.org/github/bmabey/pyLDAvis/blob/master/notebooks/LDA%20model.ipynb for an example of 
# application to a sklearn LDA model. Look at the different multidimensional scaling options

We can now implement two (imperfect) metrics to try to check how our topic models are behaving:
- **Topic diversity**, looking at how redundant top-words in our topics are,
    - Let's define it as the *proportion of unique words* in top words of topics 
- **Topic coherence**, looking at how top-words in our topics actually co-occur in the data.
    - We will look at the proportion of documents which contain pairs of top-words 

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def topic_diversity(components, top_n=10):
    top_words = []
    for topic in components:
        # Index of top_n words in that topic
        top_indices = ...
        top_words.extend(top_indices)
    # Compute the proportion of unique words
    return ...

What is the range of values taken by this measure? How should it be interpreted?
<div class='alert alert-block alert-warning'>
            Question:</div>

In [None]:
print(topic_diversity(lsa.components_))
print(topic_diversity(lda.components_))
# Value between 0 and 1, more diverse when close to 1.
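A possible implementation of this measure, checked on a toy topic-word matrix, is sketched below. The name ```topic_diversity_sketch``` and the toy values are assumptions made for illustration.

```python
import numpy as np

def topic_diversity_sketch(components, top_n=10):
    """Proportion of unique words among the top_n words of every topic."""
    top_words = []
    for topic in np.asarray(components):
        top_indices = np.argsort(topic)[-top_n:]  # indices of the top_n largest weights
        top_words.extend(top_indices.tolist())
    return len(set(top_words)) / len(top_words)

# 3 toy topics over 4 words: the first two topics share their strongest word
toy_components = np.array([[0.9, 0.8, 0.1, 0.0],
                           [0.9, 0.1, 0.8, 0.0],
                           [0.0, 0.1, 0.2, 0.9]])
toy_diversity = topic_diversity_sketch(toy_components, top_n=2)
print(toy_diversity)
```

With this definition, the value is 1 when no top word is shared between topics, and it drops to 1 / n_topics when all topics have exactly the same top words.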

Topic coherence is applied to the **binary** term-document matrix:  

In [None]:
train_bow_binary = (train_bow_tm > 0).astype(int)

In [None]:
def umass_coherence(components, bow_binary, top_n=10):
    scores = []
    
    for topic in components:
        top_words = topic.argsort()[-top_n:][::-1]  # top_n words, most important first
        score = 0
        for i in range(1, len(top_words)):
            for j in range(i):
                D_wi_wj = np.sum(bow_binary[:, top_words[i]] * bow_binary[:, top_words[j]])
                D_wj = np.sum(bow_binary[:, top_words[j]])
                score += np.log((D_wi_wj + 1) / D_wj)
        scores.append(score)
    return np.mean(scores)
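For reference, the score accumulated for one topic by ```umass_coherence``` can be written as follows. The notation is ours, introduced for illustration: $w_1, \dots, w_M$ are the top words of the topic in the order used by the code, $D(w)$ is the number of documents containing $w$, and $D(w_i, w_j)$ is the number of documents containing both words.

$$C(w_1, \dots, w_M) = \sum_{i=2}^{M} \sum_{j=1}^{i-1} \log\left(\frac{D(w_i, w_j) + 1}{D(w_j)}\right)$$

Since $D(w_i, w_j) \le D(w_j)$, each term is at most $\log\left(\frac{D(w_j) + 1}{D(w_j)}\right)$, which is only slightly above $0$, and it becomes strongly negative when two top words rarely appear together. The function then returns the average of these per-topic scores.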

What is the range of values taken by this measure? How should it be interpreted?
<div class='alert alert-block alert-warning'>
            Question:</div>

In [None]:
print("UMass coherence:", umass_coherence(lsa.components_, train_bow_binary))
print("UMass coherence:", umass_coherence(lda.components_, train_bow_binary))
# Negative value, should get close to 0 when there is perfect co-occurrence of the top words in each topic.
# Very negative -> topics are "more separated".

Vary the number of topics ```n_topics``` for the LDA model, and find out which seems to be giving the best results:
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
...

Investigate using the document representations in **topic space** for the classification task. Search for the best number of topics, performance-wise. Is it the same as before?
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
...

In [None]:
# Expect those results to change with pre-processing