## Loading and vectorizing texts with sklearn

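Scikit-learn provides vectorizer classes that transform a collection of documents into a matrix of "**bag of words**" representations, with one row per document and one column per word of the vocabulary.

These matrices are stored in a `scipy.sparse` format, which is appropriate for **sparse matrices** because most of their entries are zero.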

These vectorizers have 3 main methods:
- fit : builds the vocabulary and the correspondence between word forms and word ids
- transform : transforms the documents into a matrix of counts, using the vocabulary built by fit
- fit_transform : performs both actions on the same documents
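
The relation between the three methods can be checked on a small example. The sketch below uses a made-up two-document English corpus (not part of the lab data) and compares the two-step `fit` + `transform` with the one-step `fit_transform`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, invented for this illustration
docs = ["the cat sat", "the cat saw the dog"]

# two-step version: learn the vocabulary, then count
two_step = CountVectorizer()
two_step.fit(docs)
X_two_step = two_step.transform(docs)

# one-step version
one_step = CountVectorizer()
X_one_step = one_step.fit_transform(docs)

# both routes should give the same vocabulary and the same counts
same_vocab = two_step.vocabulary_ == one_step.vocabulary_
same_counts = np.array_equal(X_two_step.toarray(), X_one_step.toarray())
print(same_vocab, same_counts)
```

Keeping `fit` and `transform` separate matters as soon as the documents to transform are not the ones used to build the vocabulary.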

In [3]:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer
# a French corpus (to see what is going on with diacritics)
train_corpus = [
     'Ceci est un document.',
     'Ce document est encore un document à moi.',
     'Et voilà le troisième.',
     'Le premier document est-il le plus intéressant?',
 ]
vectorizer = CountVectorizer()
# the vectorizer is not fitted yet: uncommenting these lines raises an error
#print(vectorizer.vocabulary_)
#print(vectorizer.get_feature_names_out())

# we can fill it using the training set
# and transform the training set into a matrix
X_train = vectorizer.fit_transform(train_corpus)

# the matrix is sparse
print("type of X_train", type(X_train))
print("shape of X_train", X_train.shape)
print(X_train)
#print(type(X_train))

# here it is as a standard matrix
print(X_train.toarray()) 


type of X_train <class 'scipy.sparse._csr.csr_matrix'>
shape of X_train (4, 15)
  (0, 1)	1
  (0, 4)	1
  (0, 13)	1
  (0, 2)	1
  (1, 4)	1
  (1, 13)	1
  (1, 2)	2
  (1, 0)	1
  (1, 3)	1
  (1, 9)	1
  (2, 5)	1
  (2, 14)	1
  (2, 8)	1
  (2, 12)	1
  (3, 4)	1
  (3, 2)	1
  (3, 8)	2
  (3, 11)	1
  (3, 6)	1
  (3, 10)	1
  (3, 7)	1
[[0 1 1 0 1 0 0 0 0 0 0 0 0 1 0]
 [1 0 2 1 1 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 1 0 0 0 1 0 1]
 [0 0 1 0 1 0 1 1 2 0 1 1 0 0 0]]


In [9]:
# here is the mapping between word forms and ids (our "w2i" in previous lab session)
print(vectorizer.vocabulary_)
# the list of word forms (our i2w)
print(vectorizer.get_feature_names_out())

# QUESTIONS:
# What is the size of the vocabulary?
print(f"\n Size of vocab: {len(vectorizer.vocabulary_)}")
# What does the 3rd column of X_train.toarray() represent?
# What is printed when printing the sparse matrix?


{'ah': 0, 'un': 7, 'nouveau': 6, 'document': 2, 'et': 5, 'ceci': 1, 'est': 4, 'encore': 3}
['ah' 'ceci' 'document' 'encore' 'est' 'et' 'nouveau' 'un']

 Size of vocab: 8


1. The vocabulary fitted on `train_corpus` contains 15 words, which matches the 15 columns of `X_train`. Each word has a unique id from 0 to n - 1; these ids are the values in the `vectorizer.vocabulary_` dictionary. The output above shows only 8 words because this cell (`In [9]`) was executed after the vectorizer had been refitted on `test_corpus` (cell `In [5]`).
2. The 3rd column represents the number of occurrences of the term `document` in each document of the `train_corpus`. The word `document` occurs twice in the second document, therefore the value of the cell in the second row and third column of `X_train` is 2.
3. Printing the sparse matrix shows only the non-zero components of the matrix. Each tuple identifies a cell (row, i.e. document; column, i.e. word/feature) and the integer next to it is the value of that cell.

In [5]:
test_corpus = [ 'Ah un nouveau document.',
              'Et ceci est encore un document.']
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.fit_transform(test_corpus)
print("shape of X_test", X_test.shape)

# What happened to the words in test_corpus that are not present in train_corpus?
# Compare to vectorizer.fit_transform

print(f"X_train:\n{X_train.toarray()}\n")
print(f"X_test:\n{X_test.toarray()}")
print(f"\ntest vocabulary using fit_transform: {vectorizer.vocabulary_}")

shape of X_test (2, 8)
X_train:
[[0 1 1 0 1 0 0 0 0 0 0 0 0 1 0]
 [1 0 2 1 1 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 1 0 0 0 1 0 1]
 [0 0 1 0 1 0 1 1 2 0 1 1 0 0 0]]

X_test:
[[1 0 1 0 0 0 1 1]
 [0 1 1 1 1 1 0 1]]

test vocabulary using fit_transform: {'ah': 0, 'un': 7, 'nouveau': 6, 'document': 2, 'et': 5, 'ceci': 1, 'est': 4, 'encore': 3}


1. With `transform`, the words of `test_corpus` that are not present in `train_corpus` (here `ah` and `nouveau`) are ignored: `X_test` keeps the 15 columns of the training vocabulary, so the row vectors of both matrices have the same size.
2. With `fit_transform`, as in the cell above, the vocabulary is rebuilt from the test corpus alone. The ids are reassigned and `X_test` has only 8 columns, which no longer correspond to the columns of `X_train`.
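
For comparison, the following sketch applies `transform` (instead of `fit_transform`) to the test corpus, reusing the corpora of this section. The variable name `unseen` is introduced here only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_corpus = [
    'Ceci est un document.',
    'Ce document est encore un document à moi.',
    'Et voilà le troisième.',
    'Le premier document est-il le plus intéressant?',
]
test_corpus = ['Ah un nouveau document.',
               'Et ceci est encore un document.']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
# transform only: the vocabulary learned on the training set is reused as is
X_test = vectorizer.transform(test_corpus)

print("shape of X_test", X_test.shape)

# test words that are absent from the training vocabulary get no column
unseen = [w for w in ("ah", "nouveau") if w not in vectorizer.vocabulary_]
print("ignored test words:", unseen)
print(X_test.toarray())
```

Because the columns of `X_test` are aligned with those of `X_train`, a model trained on `X_train` can be applied to `X_test` directly.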

In [10]:
train_corpus = [
     'Ceci est un document .',
     'Ce document est encore un document à moi .',
     'Et voilà le troisième .',
     'Le premier document est -il le plus intéressant ?',
 ]

# QUESTIONS:

# How can you change the tokenization that the CountVectorizer will use (see its constructor)?
# in particular, how to split on spaces only
#  (which corresponds to supposing texts were already tokenized)
# Indications: study 
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# to see all the members of the instance, 
# and deduce which member to modify:
print("\nMEMBERS:\n", "\n".join([ str(x) for x in vectorizer.__dict__.items()]))


# Which parameters can you modify to switch to character bigram and trigram features?

# Search what is a TF.IDF weighting (very famous)

# Study the TfidfVectorizer class
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# and deduce how to obtain TF.IDF weighted vector representations of the documents



MEMBERS:
 ('input', 'content')
('encoding', 'utf-8')
('decode_error', 'strict')
('strip_accents', None)
('preprocessor', None)
('tokenizer', None)
('analyzer', 'word')
('lowercase', True)
('token_pattern', '(?u)\\b\\w\\w+\\b')
('stop_words', None)
('max_df', 1.0)
('min_df', 1)
('max_features', None)
('ngram_range', (1, 1))
('vocabulary', None)
('binary', False)
('dtype', <class 'numpy.int64'>)
('fixed_vocabulary_', False)
('_stop_words_id', 9488912)
('stop_words_', set())
('vocabulary_', {'ah': 0, 'un': 7, 'nouveau': 6, 'document': 2, 'et': 5, 'ceci': 1, 'est': 4, 'encore': 3})


1. We can change `CountVectorizer`'s tokenization by setting its `tokenizer` parameter (default `None`) to a function that maps a string to a list of tokens. The tokenizer defined below, using the `re` module, splits on whitespace only.

In [21]:
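import re

def my_tokenizer(text):
    # split on runs of whitespace; strip first so that leading or
    # trailing whitespace does not produce empty tokens
    return re.split(r"\s+", text.strip())

# token_pattern=None avoids the warning that token_pattern is ignored
# when a custom tokenizer is given
my_vectorizer = CountVectorizer(tokenizer=my_tokenizer, token_pattern=None)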
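
As a rough check of what whitespace-only tokenization changes, the sketch below compares the default tokenization with a whitespace tokenizer on two of the pre-tokenized sentences. The helper name `whitespace_tokenizer` is hypothetical and stands for any function that splits on spaces only.

```python
from sklearn.feature_extraction.text import CountVectorizer

tokenized_corpus = ['Ceci est un document .', 'Et voilà le troisième .']

def whitespace_tokenizer(text):
    # hypothetical helper: str.split() without argument splits on runs of whitespace
    return text.split()

ws_vectorizer = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)
ws_vectorizer.fit(tokenized_corpus)

default_vectorizer = CountVectorizer()
default_vectorizer.fit(tokenized_corpus)

print("whitespace:", sorted(ws_vectorizer.vocabulary_))
print("default:   ", sorted(default_vectorizer.vocabulary_))
```

With the whitespace tokenizer, the punctuation token `.` becomes a feature, whereas the default `token_pattern` only keeps tokens of at least two word characters.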

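2. To represent features using character bigrams or trigrams, we can set the `analyzer` parameter to `'char'` (the default, `'word'`, produces n-grams of words) and modify the `ngram_range` parameter. An `ngram_range` of `(2, 2)` considers only bigrams and `(3, 3)` only trigrams, while `(1, 3)` considers all n-grams with n <= 3. Setting `ngram_range=(2, 3)` gives both bigrams and trigrams.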

In [32]:
# getting tf-idf weighted vector representations of the documents
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_corpus)
print(vectorizer.get_feature_names_out())

# printing new matrix tf-idf vectors
print(X.toarray())

# printing the sparse matrix
print(X)

['ce' 'ceci' 'document' 'encore' 'est' 'et' 'il' 'intéressant' 'le' 'moi'
 'plus' 'premier' 'troisième' 'un' 'voilà']
[[0.         0.64065543 0.40892206 0.         0.40892206 0.
  0.         0.         0.         0.         0.         0.
  0.         0.5051001  0.        ]
 [0.4203817  0.         0.53664838 0.4203817  0.26832419 0.
  0.         0.         0.         0.4203817  0.         0.
  0.         0.33143376 0.        ]
 [0.         0.         0.         0.         0.         0.52547275
  0.         0.         0.41428875 0.         0.         0.
  0.52547275 0.         0.52547275]
 [0.         0.         0.23622136 0.         0.23622136 0.
  0.37008641 0.37008641 0.58356075 0.         0.37008641 0.37008641
  0.         0.         0.        ]]
  (0, 2)	0.4089220628888078
  (0, 13)	0.5051001005334584
  (0, 4)	0.4089220628888078
  (0, 1)	0.6406554311067799
  (1, 9)	0.42038169507735806
  (1, 3)	0.42038169507735806
  (1, 0)	0.42038169507735806
  (1, 2)	0.536648381033003
  (1, 13)	0.33