### 0. Questions

 - What is embedding?
 - 

### 1. Import packages

First, let's import the modules we need and create an auxiliary function. (We'll set a random seed later if it turns out to be needed.)

In [1]:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
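
A quick usage sketch for the `take` helper above. The inputs here are made up for illustration; the helper itself is repeated so the snippet runs on its own.

```python
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

# Works on any iterable, including generators and dict views.
squares = (i * i for i in range(10))
first_three = take(3, squares)
print(first_three)  # [0, 1, 4]

# Asking for more items than exist simply returns what is available.
short = take(5, "ab")
print(short)  # ['a', 'b']
```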

### 2. Data preparation

I'll be using the dataset from the [Spooky Author Identification](https://www.kaggle.com/c/spooky-author-identification/overview) competition.

#### 2.1 Loading the data

In [3]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

#### 2.2 Data Fields

* id - a unique identifier for each sentence
* text - some text written by one of the authors
* author - the author of the sentence (EAP: Edgar Allan Poe; HPL: H. P. Lovecraft; MWS: Mary Wollstonecraft Shelley)

Let's look at the data

In [4]:
train_df.head()

|   | id | text | author |
|---|----|------|--------|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


For now, I'm going to use only the `text` column, to explore the ways text can be represented.

#### 2.3 Data Splitting

Nevertheless, let's split the data into training and validation sets.  
Since `train_df` has almost $20\,000$ rows, we hold out only $10\%$ of them for validation.

In [5]:
train, val = train_test_split(train_df, test_size=0.1)
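
The split above is not seeded, so it selects different rows on every run. A minimal sketch of a reproducible split, assuming a toy DataFrame and an arbitrary seed value of `42` (both are illustrative, not taken from the notebook). Adding `random_state` to the notebook's own call would change which rows land in each split, so the outputs shown later would differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_df (hypothetical data, 20 rows).
toy_df = pd.DataFrame({"text": [f"sentence {i}" for i in range(20)]})

SEED = 42  # assumed value; any fixed integer works

train_a, val_a = train_test_split(toy_df, test_size=0.1, random_state=SEED)
train_b, val_b = train_test_split(toy_df, test_size=0.1, random_state=SEED)

print(len(train_a), len(val_a))         # 18 2
print(val_a.index.equals(val_b.index))  # True: the same rows both times
```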

### 3. Text embeddings

#### 3.1 Bag of words

##### 3.1.1 One-hot vectors 

The simplest way of word representation is **one-hot vectors**: for the i-th word in the vocabulary, the vector has 1 in the i-th dimension and 0 everywhere else. A sentence is then represented by a single binary vector that has 1 for every vocabulary word it contains.  
Let's do this using sklearn's `CountVectorizer` with `binary=True`.

In [20]:
count_vect = CountVectorizer(binary=True)
X_train_oh = count_vect.fit_transform(train['text'])
print(f"The size of the train dataset is {X_train_oh.shape}")

The size of the train dataset is (17621, 24036)
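
To make the definition concrete, here is a toy sketch in plain NumPy. The four-word vocabulary and the helper names `one_hot` and `binary_bag_of_words` are hypothetical; in the notebook the vocabulary is learned by `CountVectorizer`.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted alphabetically.
vocab = ["cat", "dog", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector with 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

def binary_bag_of_words(tokens):
    """Element-wise maximum of the tokens' one-hot vectors."""
    vec = np.zeros(len(vocab), dtype=int)
    for token in tokens:
        vec = np.maximum(vec, one_hot(token))
    return vec

print(one_hot("dog"))  # [0 1 0 0]

# "the" occurs twice but still contributes a single 1.
sentence_vec = binary_bag_of_words(["the", "cat", "sat", "the"])
print(sentence_vec)  # [1 0 1 1]
```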


By default, the vocabulary is not limited, so every sentence vector has length $24036$, the number of words in our vocabulary.
It can be limited by setting the `max_features` parameter to, for example, $10000$; the vocabulary is then built from only the top `max_features` terms ordered by term frequency across the corpus.

In [16]:
X_train_oh

<17621x24036 sparse matrix of type '<class 'numpy.int64'>'
	with 386331 stored elements in Compressed Sparse Row format>

It is also worth mentioning that, because the representation is sparse (most values in the vectors are zeros), we can save a lot of memory by storing only the non-zero parts of the feature vectors. `scipy.sparse` matrices are data structures that do exactly this, and `sklearn` uses them.

In [17]:
17621*24036

423538356

In our case only $386\,331$ of the $423\,538\,356$ elements (about $0.09\%$) are non-zero.
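
The density calculation can be sketched on a toy matrix. The 4 x 6 array below is made up and stands in for `X_train_oh`; the last lines redo the arithmetic with the two numbers reported above.

```python
import numpy as np
from scipy import sparse

# Toy 4 x 6 matrix standing in for X_train_oh (hypothetical values).
dense = np.array([
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
])
X = sparse.csr_matrix(dense)

n_cells = X.shape[0] * X.shape[1]
density = X.nnz / n_cells
print(f"{X.nnz} of {n_cells} cells are non-zero ({density:.1%})")

# Bytes held by the three CSR arrays versus the dense array.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(sparse_bytes, dense.nbytes)

# The same ratio for the numbers reported in the notebook.
notebook_density = 386_331 / 423_538_356
print(f"{notebook_density:.4%}")
```

On a matrix this small the saving is modest; it grows with the share of zeros.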

In [7]:
count_vect_lim_vocab = CountVectorizer(binary=True, max_features=10_000)
X_train_oh = count_vect_lim_vocab.fit_transform(train['text'])
X_train_oh.shape

(17621, 10000)

Now, the length of the vector is $10000$.

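Let's look at the vector for one text in our train corpus. Note that `train['text'][0]` selects the row with index label `0` (the first row of the original `train_df`), which is not necessarily the first row of the shuffled `train` split.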

In [8]:
first_sentence = train['text'][0]
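
A small sketch of the difference between label-based and positional selection, using a made-up Series with shuffled index labels. With an integer index, `series[0]` behaves like `.loc[0]` and raises `KeyError` when no row carries that label, which would happen above if row `0` had landed in the validation split.

```python
import pandas as pd

# Toy Series whose index labels are shuffled, as after train_test_split.
texts = pd.Series(["c-text", "a-text", "b-text"], index=[2, 0, 1])

by_label = texts.loc[0]      # the row whose index label is 0
by_position = texts.iloc[0]  # the row stored first

print(by_label)     # a-text
print(by_position)  # c-text
```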

In [9]:
count_vect.transform([first_sentence])

<1x24036 sparse matrix of type '<class 'numpy.int64'>'
	with 34 stored elements in Compressed Sparse Row format>

In [10]:
one_hot_vector = count_vect.transform([first_sentence]).toarray()
one_hot_vector

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

It is a sparse vector with $24036$ elements, of which only $34$ are non-zero.  
Let's find out which ones.

In [11]:
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
indices

array([  436,   809,  1247,  1260,  1596,  1979,  3579,  5902,  6629,
        7902, 10360, 11618, 12856, 13128, 13145, 13371, 13907, 14215,
       14529, 14779, 15414, 15950, 16439, 17826, 18751, 18890, 19538,
       21188, 21276, 21488, 22367, 23257, 23499, 23722], dtype=int64)

These are the indices of the words that are present in our sentence.  
Now let's create an index-to-word dictionary to check the output of `CountVectorizer`.

In [12]:
index_to_word = {index : word for word, index in count_vect.vocabulary_.items()}
dict(take(10, index_to_word.items()) )

{14301: 'nothing',
 4678: 'could',
 7595: 'exceed',
 21188: 'the',
 12650: 'love',
 809: 'and',
 17728: 'respect',
 23517: 'which',
 23964: 'younger',
 4668: 'cottagers'}

In [13]:
[(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices]

[(436, 'afforded', 1),
 (809, 'and', 1),
 (1247, 'as', 1),
 (1260, 'ascertaining', 1),
 (1596, 'aware', 1),
 (1979, 'being', 1),
 (3579, 'circuit', 1),
 (5902, 'dimensions', 1),
 (6629, 'dungeon', 1),
 (7902, 'fact', 1),
 (10360, 'however', 1),
 (11618, 'its', 1),
 (12856, 'make', 1),
 (13128, 'me', 1),
 (13145, 'means', 1),
 (13371, 'might', 1),
 (13907, 'my', 1),
 (14215, 'no', 1),
 (14529, 'of', 1),
 (14779, 'out', 1),
 (15414, 'perfectly', 1),
 (15950, 'point', 1),
 (16439, 'process', 1),
 (17826, 'return', 1),
 (18751, 'seemed', 1),
 (18890, 'set', 1),
 (19538, 'so', 1),
 (21188, 'the', 1),
 (21276, 'this', 1),
 (21488, 'to', 1),
 (22367, 'uniform', 1),
 (23257, 'wall', 1),
 (23499, 'whence', 1),
 (23722, 'without', 1)]

Indeed, these are the indices and corresponding words from our sentence.

Because we created the `CountVectorizer` with `binary=True`, the elements are all ones and zeros.  

##### 3.1.2 One-hot vectors with counts

A small improvement is to use the vectorizer with `binary=False`, so that word counts are taken into account.

In [14]:
count_vect = CountVectorizer(binary=False)
X_train_cv = count_vect.fit_transform(train['text'])

one_hot_vector = count_vect.transform([first_sentence]).toarray()
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
index_to_word = {index : word for word, index in count_vect.vocabulary_.items()}

[(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices]


[(436, 'afforded', 1),
 (809, 'and', 1),
 (1247, 'as', 1),
 (1260, 'ascertaining', 1),
 (1596, 'aware', 1),
 (1979, 'being', 1),
 (3579, 'circuit', 1),
 (5902, 'dimensions', 1),
 (6629, 'dungeon', 1),
 (7902, 'fact', 1),
 (10360, 'however', 1),
 (11618, 'its', 1),
 (12856, 'make', 1),
 (13128, 'me', 1),
 (13145, 'means', 1),
 (13371, 'might', 1),
 (13907, 'my', 1),
 (14215, 'no', 1),
 (14529, 'of', 3),
 (14779, 'out', 1),
 (15414, 'perfectly', 1),
 (15950, 'point', 1),
 (16439, 'process', 1),
 (17826, 'return', 1),
 (18751, 'seemed', 1),
 (18890, 'set', 1),
 (19538, 'so', 1),
 (21188, 'the', 4),
 (21276, 'this', 1),
 (21488, 'to', 1),
 (22367, 'uniform', 1),
 (23257, 'wall', 1),
 (23499, 'whence', 1),
 (23722, 'without', 1)]

We can see that the value for 'the' is now 4. It shows not only that this article is present in the sentence, but also how many times it occurs.

In [15]:
tokenizer = count_vect.build_tokenizer()
tokenized_sentence = tokenizer(first_sentence.lower())
tokenized_sentence.count('the')

4

##### 3.1.3 N-grams

We can take into account not only single words but also collocations by using the `ngram_range` parameter. This preserves some local word order, which unigrams alone discard completely.  

In [26]:
bigram_count_vect = CountVectorizer(binary=False, ngram_range=(1,2))
X_train_bigram = bigram_count_vect.fit_transform(train['text'])
print(f"The size of the train dataset is {X_train_bigram.shape}")
print(X_train_bigram.count_nonzero())  # number of non-zero entries
one_hot_vector = bigram_count_vect.transform([first_sentence]).toarray()
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
# Build the lookup from the bigram vectorizer's vocabulary, not from count_vect
index_to_word = {index : word for word, index in bigram_count_vect.vocabulary_.items()}

print([(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices])


The size of the train dataset is (17621, 230401)
813688
[(2705, 'afforded', 1), (2713, 'afforded me', 1), (7701, 'and', 1), (11504, 'and return', 1), (16642, 'as', 1), (17111, 'as might', 1), (17638, 'ascertaining', 1), (17640, 'ascertaining the', 1), (19859, 'aware', 1), (19863, 'aware of', 1), (24230, 'being', 1), (24255, 'being aware', 1), (36131, 'circuit', 1), (36132, 'circuit and', 1), (50328, 'dimensions', 1), (50336, 'dimensions of', 1), (54676, 'dungeon', 1), (54677, 'dungeon as', 1), (63613, 'fact', 1), (63651, 'fact so', 1), (91067, 'however', 1), (91069, 'however afforded', 1), (101503, 'its', 1), (101621, 'its circuit', 1), (112791, 'make', 1), (112843, 'make its', 1), (115046, 'me', 1), (115371, 'me no', 1), (115739, 'means', 1), (115779, 'means of', 1), (117354, 'might', 1), (117461, 'might make', 1), (122535, 'my

Including bigrams increased the size of the vector from $24\,036$ to $230\,401$, but we can still limit it with the `max_features` parameter, so it is not a big deal.

##### 3.1.4 Stopwords

Removing or not removing stopwords is a controversial topic...
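
Whichever side of that debate one takes, the mechanics are simple. A minimal sketch on a made-up sentence, using `CountVectorizer`'s `stop_words` parameter with scikit-learn's built-in English list:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]  # hypothetical sentence

with_stopwords = CountVectorizer().fit(docs)
without_stopwords = CountVectorizer(stop_words="english").fit(docs)

vocab_all = sorted(with_stopwords.vocabulary_)
vocab_filtered = sorted(without_stopwords.vocabulary_)

print(vocab_all)       # ['cat', 'mat', 'on', 'sat', 'the']
print(vocab_filtered)  # ['cat', 'mat', 'sat']
```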

##### 3.1.5 Advantages and disadvantages

$"+"$:
1. Fast to train (actually, there is no training done - just counting)

$"-"$:
1. Vector dimensionality is equal to vocabulary size (`max_features`) 
2. Longer documents will have higher average count values than shorter documents, even though they might talk about the same topics
3. **One-hot vectors don't capture meaning**
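
The last point can be illustrated with a toy sketch. The three-word vocabulary is hypothetical: any two distinct one-hot vectors are orthogonal, so "good" is exactly as far from "great" as it is from "terrible".

```python
import numpy as np

vocab = ["good", "great", "terrible"]    # hypothetical vocabulary
one_hot = np.eye(len(vocab), dtype=int)  # row i is the one-hot vector of word i

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_good_great = cosine(one_hot[0], one_hot[1])
sim_good_terrible = cosine(one_hot[0], one_hot[2])
print(sim_good_great, sim_good_terrible)  # 0.0 0.0
```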

#### 3.2 TF-IDF

To address the second issue of the bag-of-words approach, and to deal with common words (stopwords) without deleting them, we can use the following approach.  

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency   

$$ \large \text{tf-idf}(t, d, c) = tf(t,d) \cdot idf(t, c)$$  
Term frequency is $$ \large tf(t,d) = \frac{wordCount(t,d)}{length(d)}$$ 
where 
* `wordCount(t,d)` is the number of occurrences of term *t* in the document *d*
* `length(d)` is the number of words in the document *d*  

Inverse document-frequency is $$ \large idf(t, c) = \frac{size(c)}{docCount(t, c)}$$ 
where 
* `size(c)` is the number of documents in the corpus *c*
* `docCount(t, c)` is the number of documents in the corpus *c* containing the term *t*  
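
The plain definitions above can be sketched directly in Python. The tokenized toy corpus and the function names are made up for illustration. Note that `idf` as defined divides by zero for a term that occurs in no document.

```python
# Toy corpus of tokenized documents (hypothetical data).
corpus = [
    ["the", "raven", "the", "night"],
    ["the", "monster"],
    ["the", "night", "sea"],
]

def tf(term, doc):
    """wordCount(t, d) / length(d)"""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """size(c) / docCount(t, c)"""
    doc_count = sum(term in doc for doc in corpus)
    return len(corpus) / doc_count

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

first_doc = corpus[0]
score_the = tf_idf("the", first_doc, corpus)      # 0.5  * 3/3
score_raven = tf_idf("raven", first_doc, corpus)  # 0.25 * 3/1
print(score_the, score_raven)  # 0.5 0.75
```

The frequent word "the" occurs twice in the first document but ends up with a lower score than the rare word "raven", which occurs once.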

In practice, tf-idf is usually computed with smoothed, log-scaled variants such as:
$$ \large tf(t,d) = \log{ \left(1 + \frac{wordCount(t,d)}{length(d)}\right)}$$ 
$$ \large idf(t, c) = \log{\left(1 + \frac{1 + size(c)}{1 + docCount(t, c)}\right)}$$ 
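
A sketch of the two log-scaled formulas above, assuming the natural logarithm (the base is not stated) and hypothetical function names. Libraries use their own variants; scikit-learn's `TfidfTransformer` with `smooth_idf=True`, for instance, computes $\ln\frac{1 + n}{1 + df(t)} + 1$, so its numbers will not match this sketch exactly.

```python
import math

def tf_log(word_count, length):
    """log(1 + wordCount(t, d) / length(d)), natural log assumed."""
    return math.log(1 + word_count / length)

def idf_log(corpus_size, doc_count):
    """log(1 + (1 + size(c)) / (1 + docCount(t, c))), natural log assumed."""
    return math.log(1 + (1 + corpus_size) / (1 + doc_count))

corpus_size = 1_000
for doc_count in (0, 1, 10, 1_000):
    # The "+1" smoothing keeps idf finite even when doc_count is 0.
    print(doc_count, round(idf_log(corpus_size, doc_count), 3))
```

A term that appears in every document gets the smallest weight, $\log 2$, and the weight grows slowly as the term becomes rarer.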

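The reason to use `log` here is to dampen the raw ratios: a term that is twice as rare should not count twice as much. On a log scale, doubling the argument only adds a constant; for example, $\log_2$ of one million is about $19.93$, while $\log_2$ of two million is about $20.93$.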

### TODO: 
* Finish item 3.1.4 about using stopwords 
* Add the reason to use logarithms for tf-idf
* Write down questions
    * bag of words
    * tf-idf

In [4]:
np.log2(1_000_000)

19.931568569324174

In [7]:
np.log2(2_000_000)

20.931568569324174