### 0. Questions

 - What is an embedding?
 - 

### 1. Import packages

First, let's import the modules we need, set a random seed (to be used wherever randomness is involved), and create some auxiliary functions.

In [235]:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

SEED=42

In [13]:
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

### 2. Data preparation

I'll be using the dataset from the [Spooky Author Identification](https://www.kaggle.com/c/spooky-author-identification/overview) competition.

#### 2.1 Loading the data

In [14]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

#### 2.2 Data Fields

* id - a unique identifier for each sentence
* text - some text written by one of the authors
* author - the author of the sentence (EAP: Edgar Allan Poe; HPL: HP Lovecraft; MWS: Mary Wollstonecraft Shelley)

Let's look at the data

In [15]:
train_df.head()

| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


For now, I'm going to use only the `text` column to explore the ways text can be represented.

#### 2.3 Data Splitting

Nevertheless, let's split the data into training and validation sets.  
Since `train_df` has almost $20\,000$ rows, the validation set is limited to $10\%$ of them.

In [16]:
train, val = train_test_split(train_df, test_size=0.1, random_state=SEED)

### 3. Text embeddings

#### 3.1 Bag of words

##### 3.1.1 One-hot vectors 

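The simplest way to represent a word is a **one-hot vector**: for the i-th word in the vocabulary, the vector has 1 in the i-th dimension and 0 everywhere else. A sentence is then represented by a binary vector of the same length that has 1 for every vocabulary word it contains.   
Let's do this using sklearn's `CountVectorizer` with `binary=True`.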

In [18]:
count_vect = CountVectorizer(binary=True)
X_train_oh = count_vect.fit_transform(train['text'])
print(f"The size of the train dataset is {X_train_oh.shape}")

The size of the train dataset is (17621, 24069)
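
To make the definition concrete, here is a minimal sketch on a toy vocabulary. The vocabulary and the sentence are made up for illustration, and tokenization is a plain `split()`, not `CountVectorizer`'s tokenizer. It builds the one-hot vector of a word and shows that the binary sentence vector is the element-wise maximum of the one-hot vectors of its words.

```python
import numpy as np

# Hypothetical toy vocabulary (not taken from the dataset)
vocab = ["and", "bird", "cat", "dog", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a single vocabulary word."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

sentence = "the cat and the dog".split()
# A word that occurs twice ('the') still contributes only a single 1
sentence_vector = np.max([one_hot(w) for w in sentence], axis=0)

print(one_hot("cat"))    # [0 0 1 0 0]
print(sentence_vector)   # [1 0 1 1 1]
```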


By default, the vocabulary is not limited, so every sentence is represented by a vector of length $24\,069$, the number of words in our vocabulary.
The vocabulary can be limited by setting the `max_features` parameter to, for example, $10\,000$. The vocabulary is then built from only the top `max_features` terms ordered by term frequency across the corpus.

In [19]:
X_train_oh

<17621x24069 sparse matrix of type '<class 'numpy.int64'>'
	with 386417 stored elements in Compressed Sparse Row format>

It is also worth mentioning that, because the representation is sparse (most values in the vectors are zeros), we can save a lot of memory by storing only the non-zero parts of the feature vectors. `scipy.sparse` matrices are data structures that do exactly this, and `sklearn` uses them.

In [20]:
17621*24069

424119849

In our case only $386\,417$ elements out of $424\,119\,849$ (about $0.09\%$) are non-zero.
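
A back-of-the-envelope sketch of the savings, using the shape and the number of stored elements reported above. The byte sizes are assumptions: 8-byte `int64` values and 4-byte `int32` indices for the CSR arrays (the index dtype actually chosen by `scipy` can differ).

```python
rows, cols, nnz = 17621, 24069, 386417

# Share of cells that hold a non-zero value
density = nnz / (rows * cols)

# Dense storage: every cell holds an 8-byte integer
dense_bytes = rows * cols * 8

# CSR storage: one value and one column index per non-zero element,
# plus a row-pointer array with rows + 1 entries
csr_bytes = nnz * 8 + nnz * 4 + (rows + 1) * 4

print(f"density: {density:.4%}")
print(f"dense:   {dense_bytes / 1e6:.0f} MB")
print(f"CSR:     {csr_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense matrix would need several gigabytes, while the CSR arrays fit in a few megabytes.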

In [21]:
count_vect_lim_vocab = CountVectorizer(binary=True, max_features=10_000)
X_train_oh = count_vect_lim_vocab.fit_transform(train['text'])
X_train_oh.shape

(17621, 10000)

Now the length of the vector is $10\,000$.

Let's look at the vector for one text from our validation set (the row with index label `6148`), built with the full-vocabulary `count_vect`.

In [23]:
first_sentence = val['text'][6148]

In [24]:
one_hot_vector = count_vect.transform([first_sentence])
one_hot_vector

<1x24069 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [25]:
one_hot_vector = one_hot_vector.toarray()
one_hot_vector

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

It is a sparse vector with $24\,069$ elements, only $16$ of which are non-zero.  
Let's find out which ones they are.

In [26]:
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
indices

array([  614,   813,  1548,  2128,  5189,  9205, 10643, 12834, 14023,
       14557, 20543, 21197, 21300, 21556, 22713, 23613], dtype=int64)

These are the indices of the words that are present in our sentence.  
Now let's create an index-to-word dictionary to check the result of `CountVectorizer`.

In [27]:
index_to_word = {index : word for word, index in count_vect.vocabulary_.items()}
dict(take(10, index_to_word.items()) )

{20160: 'still',
 14787: 'others',
 10869: 'including',
 11721: 'joe',
 10157: 'himself',
 9894: 'have',
 21222: 'theories',
 21548: 'too',
 23652: 'wild',
 813: 'and'}

In [28]:
[(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices]

[(614, 'all', 1),
 (813, 'and', 1),
 (1548, 'available', 1),
 (2128, 'bewildered', 1),
 (5189, 'dazzled', 1),
 (9205, 'gigantic', 1),
 (10643, 'immediately', 1),
 (12834, 'magnitude', 1),
 (14023, 'nature', 1),
 (14557, 'of', 1),
 (20543, 'sum', 1),
 (21197, 'the', 1),
 (21300, 'thought', 1),
 (21556, 'topic', 1),
 (22713, 'upon', 1),
 (23613, 'who', 1)]

Indeed, these are the indices and the corresponding words from our sentence.

Because we created the `CountVectorizer` with `binary=True`, the elements are only ones and zeros, even for words that occur more than once.  

##### 3.1.2 One-hot vectors with counts

A small improvement is to use the vectorizer with `binary=False`, so that the counts of the words are taken into account.

In [29]:
count_vect = CountVectorizer(binary=False)
X_train_cv = count_vect.fit_transform(train['text'])

one_hot_vector = count_vect.transform([first_sentence]).toarray()
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
index_to_word = {index : word for word, index in count_vect.vocabulary_.items()}

[(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices]


[(614, 'all', 1),
 (813, 'and', 2),
 (1548, 'available', 1),
 (2128, 'bewildered', 1),
 (5189, 'dazzled', 1),
 (9205, 'gigantic', 1),
 (10643, 'immediately', 1),
 (12834, 'magnitude', 1),
 (14023, 'nature', 1),
 (14557, 'of', 1),
 (20543, 'sum', 1),
 (21197, 'the', 4),
 (21300, 'thought', 1),
 (21556, 'topic', 1),
 (22713, 'upon', 1),
 (23613, 'who', 1)]

We can see that the value for 'the' is now 4 and for 'and' it is 2. The vector doesn't only show that these words are present in the sentence, but also indicates how many times they occur.

In [32]:
tokenizer = count_vect.build_tokenizer()
tokenized_sentence = tokenizer(first_sentence.lower())
tokenized_sentence.count('the')

4
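
The difference between the two settings can also be sketched without scikit-learn. The token list below is the tokenized sentence printed in section 3.2.1; `Counter` only stands in for `CountVectorizer` here, so this illustrates the idea rather than the library internals.

```python
from collections import Counter

tokens = ['the', 'gigantic', 'magnitude', 'and', 'the', 'immediately',
          'available', 'nature', 'of', 'the', 'sum', 'dazzled', 'and',
          'bewildered', 'all', 'who', 'thought', 'upon', 'the', 'topic']

counts = Counter(tokens)                  # analogous to binary=False
binary = {word: 1 for word in counts}     # analogous to binary=True

print(counts['the'], counts['and'])   # 4 2
print(binary['the'], binary['and'])   # 1 1
print(len(counts))                    # 16 distinct words
```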

##### 3.1.3 N-grams

We can take into account not only single words but also collocations by using the `ngram_range` parameter. This preserves some local word order, which unigrams don't capture at all.  

In [33]:
bigram_count_vect = CountVectorizer(binary=False, ngram_range=(1,2))
X_train_bigram = bigram_count_vect.fit_transform(train['text'])
print(f"The size of the train dataset is {X_train_bigram.shape}")
print(X_train_bigram.count_nonzero())
one_hot_vector = bigram_count_vect.transform([first_sentence]).toarray()
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
index_to_word = {index : word for word, index in bigram_count_vect.vocabulary_.items()}

print([(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices])

The size of the train dataset is (17621, 230440)
813907
[(4324, 'all', 1), (4855, 'all who', 1), (7667, 'and', 2), (8116, 'and bewildered', 1), (12202, 'and the', 1), (19515, 'available', 1), (25467, 'bewildered', 1), (45745, 'dazzled', 1), (75382, 'gigantic', 1), (93018, 'immediately', 1), (112540, 'magnitude', 1), (112541, 'magnitude and', 1), (124876, 'nature', 1), (124942, 'nature of', 1), (131308, 'of', 1), (134836, 'of the', 1), (181040, 'sum', 1), (186670, 'the', 4), (189413, 'the gigantic', 1), (192991, 'the sum', 1), (193291, 'the topic', 1), (199054, 'thought', 1), (199173, 'thought upon', 1), (203935, 'topic', 1), (209892, 'upon', 1), (210124, 'upon the', 1), (222194, 'who', 1), (222471, 'who thought', 1)]


Including bigrams increased the size of the vector from $24\,069$ to $230\,440$, but we can still limit it with the `max_features` parameter, so it is not a big deal.
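
As a rough sketch of what `ngram_range=(1,2)` produces, the hypothetical helper `ngrams` below lists all unigrams and bigrams of a token list. It assumes plain `split()` tokenization and is not how `CountVectorizer` is implemented.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max as strings."""
    result = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

# The tail of the example sentence used above
tokens = "all who thought upon the topic".split()
features = ngrams(tokens, 1, 2)

print(features)
print(len(features))   # 6 unigrams + 5 bigrams = 11
```

Bigrams such as 'upon the' and 'the topic' keep the order of neighbouring words, which the unigram features alone cannot express.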

##### 3.1.4 Stopwords

Removing or not removing stopwords is a controversial topic...

##### 3.1.5 Advantages and disadvantages

$"+"$:
1. Fast to train (actually, there is no training done - just counting)

$"-"$:
1. Vector dimensionality is equal to vocabulary size (`max_features`) 
2. Longer documents will have higher average count values than shorter documents, even though they might talk about the same topics
3. **One-hot vectors don't capture meaning**
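
A small sketch of the second and third drawbacks on a made-up three-word vocabulary: every pair of different one-hot vectors is orthogonal, so 'dog' is exactly as dissimilar to 'puppy' as it is to 'car', and repeating a document three times triples its average count.

```python
import numpy as np

vocab = ["car", "dog", "puppy"]            # hypothetical toy vocabulary
one_hot = np.eye(len(vocab), dtype=int)    # row i is the one-hot vector of vocab[i]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

car, dog, puppy = one_hot[0], one_hot[1], one_hot[2]
print(cosine(dog, puppy))   # 0.0
print(cosine(dog, car))     # 0.0

# Count vector of a document and of the same document repeated three times
doc = np.array([2, 1, 0])
print(doc.mean(), (3 * doc).mean())   # 1.0 3.0
```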

#### 3.2 TF-IDF

To address the second issue of the bag-of-words approach, and to deal with common words (stopwords) without deleting them, the following approach can be used.  

Tf means term frequency, while tf–idf means term frequency times inverse document frequency.   

$$ \large \text{tf-idf}(t,d,c) = tf(t,d) \cdot idf(t,c)$$  
Term frequency is $$ \large tf(t,d) = \frac{wordCount(t,d)}{length(d)}$$ 
where 
* `wordCount(t,d)` is the number of occurrences of term *t* in the document *d*
* `length(d)` is the number of words in the document *d*  

Inverse document frequency is $$ \large idf(t, c) = \frac{size(c)}{docCount(t, c)}$$ 
where 
* `size(c)` is the number of documents in the corpus *c*
* `docCount(t, c)` is the number of documents in the corpus *c* containing the term *t*  
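
A worked example of these basic definitions on a made-up corpus of three tokenized documents. The function names `basic_tf`, `basic_idf` and `basic_tf_idf` are illustrative only.

```python
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

def basic_tf(term, doc):
    """wordCount(t, d) / length(d)"""
    return doc.count(term) / len(doc)

def basic_idf(term, corpus):
    """size(c) / docCount(t, c); the term must occur in at least one document."""
    doc_count = sum(1 for doc in corpus if term in doc)
    return len(corpus) / doc_count

def basic_tf_idf(term, doc, corpus):
    return basic_tf(term, doc) * basic_idf(term, corpus)

# 'cat' occurs once in a 3-word document and in 2 of 3 documents: (1/3) * (3/2)
print(round(basic_tf_idf("cat", corpus[0], corpus), 3))   # 0.5
# 'the' occurs in every document, so its idf is 1: (1/3) * 1
print(round(basic_tf_idf("the", corpus[0], corpus), 3))   # 0.333
```

The rarer word 'cat' gets a higher weight than the ubiquitous 'the', even though both occur once in the document.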

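In practice, smoothed and log-scaled variants are used. One log-scaled variant of term frequency is
$$ \large tf(t,d) = \log{ \left(1 + \frac{wordCount(t,d)}{length(d)}\right)}$$ 
and the smoothed inverse document frequency, which is what sklearn's `TfidfVectorizer` computes by default (`smooth_idf=True`), is
$$ \large idf(t, c) = \log{\left(\frac{1 + size(c)}{1 + docCount(t, c)}\right)} + 1$$ 
By default `TfidfVectorizer` uses raw counts as tf (a log-scaled $1 + \log(tf)$ is enabled with `sublinear_tf=True`) and L2-normalizes every document vector.

The reason to use `log` here is that it dampens the weights: a term that is $10$ times rarer in the corpus, or $10$ times more frequent in a document, should not get a $10$ times larger weight.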
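
A quick numeric sketch of how strongly the logarithm compresses the range. It compares the plain ratio `size / docCount` with the smoothed logarithmic form used in the `idf` function of section 3.2.1, for a corpus of $17\,621$ documents (the size of the training set above). The document counts are arbitrary example values.

```python
import math

n_docs = 17621  # number of documents in the training set above

def raw_ratio(doc_count):
    """Plain ratio size(c) / docCount(t, c)."""
    return n_docs / doc_count

def smoothed_idf(doc_count):
    """log((1 + size) / (1 + docCount)) + 1"""
    return math.log((1 + n_docs) / (1 + doc_count)) + 1

for doc_count in (1, 10, 100, 1000, n_docs):
    print(f"{doc_count:>6}  raw={raw_ratio(doc_count):>10.2f}  "
          f"smoothed={smoothed_idf(doc_count):.2f}")
```

The raw ratio spans several orders of magnitude, while the smoothed value stays within roughly one order of magnitude and equals exactly 1 for a term that occurs in every document.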

In [225]:
tfidf_vect = TfidfVectorizer()
X_train_tf_idf = tfidf_vect.fit_transform(train['text'])
print(f"The size of the train dataset is {X_train_tf_idf.shape}")

The size of the train dataset is (17621, 24069)


In [226]:
X_train_tf_idf

<17621x24069 sparse matrix of type '<class 'numpy.float64'>'
	with 386417 stored elements in Compressed Sparse Row format>

In [227]:
tfidf_vect.transform([first_sentence])

<1x24069 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [228]:
index_to_word = {index : word for word, index in tfidf_vect.vocabulary_.items()}
dict(take(10, index_to_word.items()) )

{20160: 'still',
 14787: 'others',
 10869: 'including',
 11721: 'joe',
 10157: 'himself',
 9894: 'have',
 21222: 'theories',
 21548: 'too',
 23652: 'wild',
 813: 'and'}

In [229]:
tfidf_vector = tfidf_vect.transform([first_sentence])
tfidf_vector

<1x24069 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [230]:
tfidf_vector = tfidf_vector.toarray()
tfidf_vector

array([[0., 0., 0., ..., 0., 0., 0.]])

In [231]:
indices = np.where(np.any(tfidf_vector!=0, axis=0))[0]
indices

array([  614,   813,  1548,  2128,  5189,  9205, 10643, 12834, 14023,
       14557, 20543, 21197, 21300, 21556, 22713, 23613], dtype=int64)

In [232]:
[(ind, index_to_word[ind], tfidf_vector[0, ind]) for ind in indices]

[(614, 'all', 0.13031453897504672),
 (813, 'and', 0.12214845066990851),
 (1548, 'available', 0.3924242175008267),
 (2128, 'bewildered', 0.3114996887796031),
 (5189, 'dazzled', 0.36544937459375215),
 (9205, 'gigantic', 0.2695648987205094),
 (10643, 'immediately', 0.23647294743920957),
 (12834, 'magnitude', 0.32608143594739347),
 (14023, 'nature', 0.2049931029681656),
 (14557, 'of', 0.059811003957834306),
 (20543, 'sum', 0.3009169879304377),
 (21197, 'the', 0.20393197647953104),
 (21300, 'thought', 0.18711801923161203),
 (21556, 'topic', 0.2973766872628445),
 (22713, 'upon', 0.14506693111726507),
 (23613, 'who', 0.16239686486865726)]

##### 3.2.1 TF-IDF implementation

In [233]:
tfidf_tokenizer = tfidf_vect.build_tokenizer()
first_sentence_tokenized = tfidf_tokenizer(first_sentence.lower())
print(first_sentence_tokenized)
first_sentence_tokenized_set = list(set(first_sentence_tokenized))
print(first_sentence_tokenized_set)

['the', 'gigantic', 'magnitude', 'and', 'the', 'immediately', 'available', 'nature', 'of', 'the', 'sum', 'dazzled', 'and', 'bewildered', 'all', 'who', 'thought', 'upon', 'the', 'topic']
['of', 'dazzled', 'nature', 'thought', 'magnitude', 'gigantic', 'topic', 'bewildered', 'upon', 'all', 'the', 'who', 'and', 'available', 'sum', 'immediately']


In [239]:
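def tf(tokenized_text, term):
    """Share of the tokens in `tokenized_text` that are equal to `term`."""
    return tokenized_text.count(term) / len(tokenized_text)

def idf(texts, term, tokenizer):
    """Smoothed idf: log((1 + size) / (1 + doc_count)) + 1."""
    size = len(texts)
    # Number of texts that contain the term at least once
    doc_count = sum(term in tokenizer(text.lower()) for text in texts)

    return np.log((1 + size) / (1 + doc_count)) + 1

def tf_idf(tf_value, idf_value):
    """Combine a term frequency and an inverse document frequency."""
    return tf_value * idf_value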

In [240]:
tf_idf_first_sentence = []
for word in first_sentence_tokenized_set:
    tf_ = tf(first_sentence_tokenized, word)
    idf_ = idf(train['text'], word, tfidf_tokenizer)
    tf_idf_first_sentence.append(tf_idf(tf_, idf_))

In [241]:
# L2-normalize the vector, as TfidfVectorizer does by default (norm='l2')
tf_idf_first_sentence = np.array(tf_idf_first_sentence)
l2_norm = np.sqrt(np.sum(tf_idf_first_sentence**2))
tf_idf_first_sentence = tf_idf_first_sentence / l2_norm
tf_idf_first_sentence_dict = dict(zip(first_sentence_tokenized_set, tf_idf_first_sentence))

In [242]:
[(ind, index_to_word[ind], tf_idf_first_sentence_dict[index_to_word[ind]]) for ind in indices]

[(614, 'all', 0.13031453897504672),
 (813, 'and', 0.12214845066990852),
 (1548, 'available', 0.3924242175008267),
 (2128, 'bewildered', 0.3114996887796031),
 (5189, 'dazzled', 0.36544937459375215),
 (9205, 'gigantic', 0.2695648987205094),
 (10643, 'immediately', 0.23647294743920957),
 (12834, 'magnitude', 0.32608143594739347),
 (14023, 'nature', 0.2049931029681656),
 (14557, 'of', 0.05981100395783431),
 (20543, 'sum', 0.3009169879304377),
 (21197, 'the', 0.20393197647953104),
 (21300, 'thought', 0.18711801923161206),
 (21556, 'topic', 0.2973766872628445),
 (22713, 'upon', 0.1450669311172651),
 (23613, 'who', 0.16239686486865726)]
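
The implementation above re-tokenizes the whole corpus once per term, which is slow. One possible faster variant, sketched here on a made-up toy corpus, takes the document frequencies from a `CountVectorizer` matrix and computes all weights at once. The names `toy_vect`, `toy_idf`, `manual` and `expected` are illustrative, and the dense `toarray()` call is only reasonable for a corpus this small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

toy_vect = CountVectorizer()
counts = toy_vect.fit_transform(corpus).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
toy_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw counts are used as tf: dividing by the document length would
# cancel out in the L2 normalization below
weighted = counts * toy_idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

expected = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, expected))
```

With the default `TfidfVectorizer` settings the two matrices should agree up to floating-point error.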

### TODO: 
* Finish item 3.1.4 about using stopwords 
* Add the reason to use logarithms for tf-idf (sublinear tf, etc.)
* Write down questions
    * bag of words
    * tf-idf
* refactor tf-idf implementation (make class for it, move it to its own file)

In [None]:
np.log2(1_000_000)

In [None]:
np.log2(2_000_000)