# Document vectors
As usual, we begin by importing the libraries and modules we're going to use today. We're introducing a new library called ```datasets```, which is part of the ```huggingface``` ecosystem.

```datasets``` provides easy access to a wide range of example datasets that are widely known in the NLP world. It's worth spending some time looking around to see what you can find. For example, here is a collection of [multi-class classification datasets](https://huggingface.co/datasets?task_ids=task_ids:multi-class-classification&sort=downloads).

We'll be working with the ```huggingface``` ecosystem more and more as we progress this semester.

In [2]:
# data processing
import pandas as pd
import numpy as np

# huggingface datasets
from datasets import load_dataset

# scikit learn tools
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# plotting tools
import matplotlib.pyplot as plt



## Load data
We're going to be working with actual text data, specifically a subset of the well-known [GLUE benchmark](https://gluebenchmark.com/). This benchmark is regularly used to test how well models perform across a range of different language tasks. Today we'll work specifically with the Stanford Sentiment Treebank 2 (SST2); you can learn more [here](https://huggingface.co/datasets/glue) and [here](https://nlp.stanford.edu/sentiment/index.html).

The dataset we get back is a complex, hierarchical object with lots of different features. I recommend that you dig around a little and see what it contains. For today, we're going to work with only the training split, and we're going to separate it into sentences and labels.

In [3]:
# load the sst2 dataset
dataset = load_dataset("glue", "sst2")
# select the train split
train_data = dataset["train"]
X = train_data["sentence"]
y = train_data["label"]

Downloading builder script: 100%|██████████| 28.8k/28.8k [00:00<00:00, 171kB/s]
Downloading metadata: 100%|██████████| 28.7k/28.7k [00:00<00:00, 4.16MB/s]
Downloading readme: 100%|██████████| 27.9k/27.9k [00:00<00:00, 3.45MB/s]
Downloading data: 100%|██████████| 7.44M/7.44M [00:00<00:00, 7.50MB/s]
Generating train split: 100%|██████████| 67349/67349 [00:00<00:00, 117611.43 examples/s]
Generating validation split: 100%|██████████| 872/872 [00:00<00:00, 106745.85 examples/s]
Generating test split: 100%|██████████| 1821/1821 [00:00<00:00, 116240.70 examples/s]


Let's split the data into a training set and a test set. Later we will train a simple classifier to start looking at what one can do with vector representations of text, which is why we need a set of documents that is set aside. For now, let's focus on the training set to estimate our document-term model.

In [4]:
import random
train_idx = random.sample(range(len(X)), k=int(len(X)*.7)) # we are sampling 70% as training set
train_X, test_X, train_y, test_y = [], [], [], []
for i in train_idx:
    train_X.append(X[i])
    train_y.append(y[i])
for i in set(range(len(X))) - set(train_idx):
    test_X.append(X[i])
    test_y.append(y[i])

In [5]:
list(zip(train_X[:10], train_y[:10]))

[('is less a documentary and more propaganda by way of a valentine sealed with a kiss ',
  0),
 ('a cellophane-pop remake of the punk classic ladies and gentlemen , the fabulous stains ',
  0),
 ('exploitation flick ', 0),
 ('ultimately succumbs to cliches and pat storytelling ', 0),
 ('half-baked setups ', 0),
 ('are both superb , while huppert ... is magnificent . ', 1),
 ('powerful sequel ', 1),
 ('flagrantly fake thunderstorms ', 0),
 ('a witty , low-key romantic comedy ', 1),
 ('chance to shine ', 1)]

In [6]:
print('Number of training examples: ', len(train_X))
print('Number of test examples: ', len(test_X))

Number of training examples:  47144
Number of test examples:  20205
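
The split above uses the global `random` state, so the sentences that end up in each set change every time the notebook is re-run. If you want the same split each time, one option is to shuffle with a seeded generator. The following is a minimal sketch on toy data; the helper name `split_indices`, the seed value, and the toy sentences are assumptions made for illustration, not part of the notebook.

```python
import random

def split_indices(n_items, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx): two disjoint index lists covering range(n_items)."""
    rng = random.Random(seed)      # local generator, leaves the global random state alone
    indices = list(range(n_items))
    rng.shuffle(indices)           # shuffle once, then cut at the requested fraction
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]

# hypothetical stand-ins for X and y
toy_X = ["powerful sequel", "exploitation flick", "chance to shine",
         "half-baked setups", "a witty comedy", "flagrantly fake",
         "is magnificent", "pat storytelling", "superb acting", "labored gags"]
toy_y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

train_idx, test_idx = split_indices(len(toy_X))
toy_train_X = [toy_X[i] for i in train_idx]
toy_train_y = [toy_y[i] for i in train_idx]
toy_test_X = [toy_X[i] for i in test_idx]
toy_test_y = [toy_y[i] for i in test_idx]

print(len(toy_train_X), len(toy_test_X))
```

Calling `split_indices` again with the same seed returns the same indices, which makes later accuracy numbers comparable across runs.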



## Create document representations
We're going to work with a bag-of-words model (like the ones we talked about in class), which we can create quite simply using the ```CountVectorizer()``` class available via ```scikit-learn```. You can read more about the default parameters of the vectorizer [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

After we initialize the vectorizer, we first _fit_ this vectorizer to our data (the model learns parameters such as which words to include in the vocabulary, based on the statistics of the text and the parameters passed to  `CountVectorizer`) and then _transform_ the original data into the bag-of-words representation.

Let's start by fitting a model where default constraints are placed on vocabulary size.

In [7]:
simple_vectorizer = CountVectorizer()
X_vect = simple_vectorizer.fit_transform(train_X)

This is the number of words the vectorizer uses as features, i.e., words that are *not* excluded for being too frequent or too infrequent. With the default parameters (`min_df=1`, `max_df=1.0`), no word is excluded on frequency grounds.

In [8]:
len(simple_vectorizer.vocabulary_)

13517

In [9]:
print(X_vect.shape)
print(X_vect.toarray())


(47144, 13517)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


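As you can see, the resulting matrix has dimensions `[n_documents, n_words]`, i.e., it is a document-term matrix.
Note that there is a simple way to get a term-term co-occurrence matrix: multiply the transpose of the document-term matrix by the matrix itself. Because the entries are counts, each cell sums the products of the two words' counts over all documents. It equals the number of documents in which the two words co-occur only when the counts are binary (e.g., with `CountVectorizer(binary=True)`).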

In [10]:
# sparse matrix product; each diagonal entry is the sum of squared counts of a term,
# which is close to its overall frequency when terms rarely repeat within a sentence
(X_vect.T @ X_vect).toarray()

array([[ 5,  5,  0, ...,  0,  0,  0],
       [ 5, 71,  0, ...,  0,  0,  0],
       [ 0,  0, 14, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ...,  2,  0,  0],
       [ 0,  0,  0, ...,  0,  3,  0],
       [ 0,  0,  0, ...,  0,  0,  3]])
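
To see what the term-term matrix contains, it helps to work through a tiny example by hand. The sketch below uses a made-up 3x3 document-term matrix (an assumption for illustration) and compares the product on raw counts with the product on binarized counts.

```python
import numpy as np

# hypothetical document-term counts: 3 documents (rows) x 3 terms (columns)
toy_dt = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [0, 1, 1]])

# product on raw counts: sums of products of counts
cooc_counts = toy_dt.T @ toy_dt

# product on binarized counts: number of documents containing both terms
toy_binary = (toy_dt > 0).astype(int)
cooc_docs = toy_binary.T @ toy_binary

print(cooc_counts)
print(cooc_docs)
print(toy_dt.sum(axis=0))        # total occurrences of each term
print(np.diag(cooc_counts))      # sum of squared counts of each term
print(np.diag(cooc_docs))        # document frequency of each term
```

The first term occurs 3 times in total, but its diagonal entry in the raw-count product is 5 (2² + 1²), while the binarized product gives its document frequency, 2.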

What happens to the dimensionality if we change the input parameters, e.g., `min_df`? Try playing with the `CountVectorizer` parameters to get familiar with the class.
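
As a starting point for this kind of experiment, here is a small sketch on a toy corpus (the sentences are made up for illustration). With `min_df=2`, only words that appear in at least two documents are kept, so the vocabulary shrinks.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus (hypothetical)
toy_docs = ["a witty romantic comedy", "a powerful sequel", "witty and powerful"]

default_vectorizer = CountVectorizer()             # min_df=1: keep every word
strict_vectorizer = CountVectorizer(min_df=2)      # keep words found in >= 2 documents

default_shape = default_vectorizer.fit_transform(toy_docs).shape
strict_shape = strict_vectorizer.fit_transform(toy_docs).shape
kept_words = sorted(strict_vectorizer.vocabulary_)

print(default_shape, strict_shape)
print(kept_words)
```

You can run the same comparison on `train_X` and check how the second dimension of the matrix changes.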

### Dimensionality reduction
Our current matrix is very sparse. Could we apply what we learned in the lecture to convert it into a dense, more compact matrix? Let's apply the singular value decomposition (SVD) we discussed in class, using `TruncatedSVD`.

In [11]:
svd = TruncatedSVD(n_components=500)
svd.fit(X_vect)
X_svd = svd.transform(X_vect)

What does our vector space look like?

In [12]:
X_svd

array([[ 1.07860777e+00,  8.78617121e-01, -7.34886545e-01, ...,
        -1.14414114e-02, -1.37544519e-02, -3.08068188e-02],
       [ 2.17339571e+00, -3.33566426e-01, -6.67490519e-01, ...,
         4.64134498e-02,  9.37524735e-03, -5.89933998e-03],
       [ 3.58771233e-03,  7.22997741e-04, -5.05448149e-04, ...,
         4.12277811e-02, -1.71937786e-02,  1.49904653e-02],
       ...,
       [ 6.14849939e-01,  4.43647488e-01,  9.01547697e-01, ...,
        -4.87577840e-02, -3.26260350e-02, -8.41784501e-02],
       [ 1.20978738e-02,  1.45212143e-02,  5.86002694e-03, ...,
         3.68550742e-02,  2.51769235e-02, -3.81754341e-02],
       [ 2.63093095e-04, -3.61713956e-04,  7.01637868e-04, ...,
         2.07748064e-02,  1.47585939e-02,  9.13916201e-03]])
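
Each row of this array is now a dense vector for one document. To connect this with the SVD from the lecture, the sketch below uses plain `numpy` on a small random count matrix (toy data, an assumption for illustration) and shows that keeping the top `k` singular vectors amounts to projecting the documents onto the top `k` right singular vectors. `TruncatedSVD` uses a randomized solver by default, so its numbers can differ slightly, or in sign, from an exact SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.poisson(0.5, size=(8, 6)).astype(float)   # 8 documents x 6 terms

k = 2
U, S, Vt = np.linalg.svd(toy_matrix, full_matrices=False)

reduced = U[:, :k] * S[:k]            # document coordinates in the k-dimensional space
projected = toy_matrix @ Vt[:k].T     # same coordinates, obtained by projection
approximation = reduced @ Vt[:k]      # rank-k approximation of the original matrix

print(reduced.shape)
print(np.allclose(reduced, projected))
print(np.linalg.norm(toy_matrix - approximation))   # reconstruction error
```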

### Classifying sentiment

Congratulations! You have created your first document representation. 

We will dive deeper into classification in the coming weeks, but to demonstrate what we can do with these representations, let's go through an example.

As we saw earlier, our documents have labels indicating the sentiment of each document. Can we predict sentiment on the basis of the bag-of-words representations of our documents?
Let's use a simple `scikit-learn` classifier to learn to predict sentiment from text. We will learn more about this later on. For now, all you need to know is that the classifier estimates a relation between input and output, such that it is able to predict the output (in this case, the sentiment of the sentence, which is `0` for negative sentences and `1` for positive ones) from the input.

We will use a `LogisticRegression` classifier (not necessarily the best, but one of the fastest), but you can experiment with other classifiers (e.g., [support vector machines](https://scikit-learn.org/stable/modules/svm.html)).

In [13]:
classifier = LogisticRegression(max_iter=2000).fit(X_vect, train_y)


Let's transform the test data, which we need for evaluation.

In [14]:
X_vect_test = simple_vectorizer.transform(test_X)

And finally, let's compute how often the model predictions match the true labels.

In [15]:
print('Model accuracy: ', np.mean(classifier.predict(X_vect_test) == test_y))

Model accuracy:  0.8946300420687948
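
Accuracy is a single number. The `classification_report` function we imported at the top gives a per-class breakdown (precision, recall, F1). Here is a minimal sketch on made-up labels; the toy lists stand in for `test_y` and `classifier.predict(X_vect_test)`, and the class names are an assumption based on the label coding (`0` negative, `1` positive).

```python
from sklearn.metrics import classification_report

# hypothetical true labels and predictions
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0]

# human-readable table
print(classification_report(toy_true, toy_pred, target_names=["negative", "positive"]))

# same numbers as a dictionary, handy for further processing
report = classification_report(
    toy_true, toy_pred, target_names=["negative", "positive"], output_dict=True
)
print(report["accuracy"])
```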


That's pretty good. Let's take a look at a few examples.

In [16]:
list(zip(test_X, classifier.predict(X_vect_test)))

[('contains no wit , only labored gags ', 0),
 ('on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ',
  0),
 ("a depressed fifteen-year-old 's suicidal poetry ", 0),
 ("the part where nothing 's happening , ", 0),
 ('excruciatingly unfunny and pitifully unromantic ', 0),
 ('enriched by an imaginatively mixed cast of antic spirits ', 1),
 ("which half of dragonfly is worse : the part where nothing 's happening , or the part where something 's happening ",
  0),
 ('the plot is nothing but boilerplate clichés from start to finish , ', 0),
 ('will find little of interest in this film , which is often preachy and poorly acted ',
  0),
 ('sometimes dry ', 0),
 ('as a fringe feminist conspiracy theorist ', 0),
 ('a muddle splashed with bloody beauty as vivid as any scorsese has ever given us . ',
  1),
 ("poor ben bratt could n't find stardom if mapquest emailed him point-to-point driving directions . ",
  0),
 ('far less sophisticated and ', 0),
 ('rich veins of funny 
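
Looking at predictions is most informative when we focus on the mistakes. The sketch below filters for sentences where the predicted label differs from the true one. The helper `misclassified` and the toy lists are hypothetical; in the notebook you would pass `test_X`, `test_y`, and `classifier.predict(X_vect_test)` instead.

```python
def misclassified(texts, true_labels, predicted_labels):
    """Return (text, true, predicted) triples for which the prediction is wrong."""
    return [
        (text, true, pred)
        for text, true, pred in zip(texts, true_labels, predicted_labels)
        if true != pred
    ]

# hypothetical stand-ins for the test sentences, true labels, and predictions
toy_texts = ["sometimes dry", "powerful sequel", "chance to shine"]
toy_true = [0, 1, 1]
toy_pred = [0, 1, 0]

errors = misclassified(toy_texts, toy_true, toy_pred)
print(errors)
```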

### Some optional tasks
- Does performance change if we use a `TfidfVectorizer`?
- Can you write your own version of `CountVectorizer()`? In other words, a function that takes a corpus of documents and creates a bag-of-words representation for every document?
- What about `TfidfVectorizer()`? Look over the formulae in the slides from Tuesday.
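
As a hint for the first task: `TfidfVectorizer` has the same `fit`/`transform` interface as `CountVectorizer`, so it can be swapped in directly. The sketch below, on a made-up toy corpus, only checks the shape and the row normalization; whether it changes classification performance on SST2 is left for you to find out. Note that `scikit-learn`'s default weighting may differ in detail from the formulae in the slides.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus (hypothetical)
toy_docs = ["a witty romantic comedy", "a powerful sequel", "witty and powerful"]

count_matrix = CountVectorizer().fit_transform(toy_docs)
tfidf_matrix = TfidfVectorizer().fit_transform(toy_docs)

# same documents, same vocabulary, so the shapes match
print(count_matrix.shape, tfidf_matrix.shape)

# by default each tf-idf row is scaled to unit Euclidean length
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(row_norms)
```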