### 0. Questions

### 1. Import packages

First, let's import the modules we need and define an auxiliary function.

In [1]:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
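As a quick illustration of how `take` behaves, here is a minimal sketch on a made-up dictionary. The name `toy_vocabulary` and its contents are hypothetical and only serve the example.

```python
from itertools import islice

def take(n, iterable):
    """Return the first n items of the iterable as a list."""
    return list(islice(iterable, n))

# Hypothetical toy vocabulary (word -> column index), for illustration only.
toy_vocabulary = {"sun": 3, "blue": 0, "sky": 2, "he": 1}

# Dicts preserve insertion order, so this yields the first two entries.
first_two = dict(take(2, toy_vocabulary.items()))
print(first_two)  # {'sun': 3, 'blue': 0}

# Asking for more items than exist simply returns everything available.
print(take(10, "abc"))  # ['a', 'b', 'c']
```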

### 2. Data preparation

I'll be using the dataset from the [Spooky Author Identification](https://www.kaggle.com/c/spooky-author-identification/overview) competition.

#### 2.1 Loading the data

In [3]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

#### 2.2 Data Fields

* id - a unique identifier for each sentence
* text - some text written by one of the authors
* author - the author of the sentence (EAP: Edgar Allan Poe; HPL: HP Lovecraft; MWS: Mary Wollstonecraft Shelley)

Let's look at the data

In [4]:
train_df.head()

|   | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


For now, I'm going to use only the `text` column to explore the ways text can be represented.

#### 2.3 Data Splitting

Nevertheless, let's split the data into training and validation sets.  
Since we have almost $20\,000$ rows in `train_df`, the validation set is limited to $10\%$ of the data.

In [5]:
train, val = train_test_split(train_df, test_size=0.1)
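The split above is random, so the exact rows in `train` (and the vocabulary size reported later) can differ between runs. A minimal sketch of a repeatable split follows, assuming a hypothetical seed value `SEED` and a toy DataFrame `toy_df` standing in for `train_df`; neither comes from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # hypothetical seed value, not taken from the notebook

# Toy stand-in for train_df so the sketch runs without the Kaggle files.
toy_df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(20)],
    "author": ["EAP", "HPL"] * 10,
})

# Fixing random_state makes the shuffle, and thus the split, repeatable.
train_a, val_a = train_test_split(toy_df, test_size=0.1, random_state=SEED)
train_b, val_b = train_test_split(toy_df, test_size=0.1, random_state=SEED)

print(len(train_a), len(val_a))              # 18 2
print(train_a.index.equals(train_b.index))   # True
```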

### 3. Text embeddings

#### 3.1 One-hot vectors 

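The simplest word representation is a **one-hot vector**: for the i-th word in the vocabulary, the vector has 1 in the i-th dimension and 0 everywhere else.   
A sentence can then be represented by a binary vector that has 1 in the dimension of every vocabulary word it contains. Let's build such vectors using sklearn's `CountVectorizer` with `binary=True`.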

In [6]:
count_vect = CountVectorizer(binary=True)
X_train_cv = count_vect.fit_transform(train['text'])
X_train_cv.shape

(17621, 24113)
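To make the link between per-word one-hot vectors and a sentence vector concrete, here is a small NumPy sketch. The four-word `vocab` and the helper `one_hot` are hypothetical; real indices would come from `count_vect.vocabulary_`.

```python
import numpy as np

# Hypothetical toy vocabulary; real indices come from CountVectorizer.vocabulary_.
vocab = {"blue": 0, "he": 1, "sky": 2, "sun": 3}

def one_hot(word, vocab):
    """Vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab[word]] = 1
    return vec

sentence = ["he", "sun", "sun"]

# Element-wise maximum of the word vectors: presence/absence (like binary=True).
binary_vector = np.max([one_hot(w, vocab) for w in sentence], axis=0)
# Element-wise sum of the word vectors: occurrence counts (like binary=False).
count_vector = np.sum([one_hot(w, vocab) for w in sentence], axis=0)

print(one_hot("sky", vocab))  # [0 0 1 0]
print(binary_vector)          # [0 1 0 1]
print(count_vector)           # [0 1 0 2]
```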

By default, the vocabulary of the model is not limited, so the vector for every sentence has length $24113$, the number of words in our vocabulary.
It can be limited by setting the parameter `max_features` to, for example, $10000$. The vocabulary is then built from only the top `max_features` terms, ordered by term frequency across the corpus.

In [7]:
count_vect_lim_vocab = CountVectorizer(binary=True, max_features=10_000)
X_train_cv = count_vect_lim_vocab.fit_transform(train['text'])
X_train_cv.shape

(17621, 10000)

Now, the length of the vector is $10000$.
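The effect of `max_features` is easier to see on a tiny corpus. The sketch below uses a made-up `toy_corpus` in which "the" and "cat" are the two most frequent terms, so they should be the only ones kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "the" occurs 3 times, "cat" twice, every other word once.
toy_corpus = ["the cat sat", "the cat ran", "the dog"]

full = CountVectorizer().fit(toy_corpus)
limited = CountVectorizer(max_features=2).fit(toy_corpus)

print(sorted(full.vocabulary_))     # ['cat', 'dog', 'ran', 'sat', 'the']
print(sorted(limited.vocabulary_))  # ['cat', 'the']
```

Words outside the limited vocabulary are simply ignored when transforming a text.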

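Let's look at the vector for the first text in our data (the row with index label $0$).

In [8]:
# Label-based lookup on train_df: row 0 may be missing from the shuffled train split.
first_sentence = train_df['text'][0]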

In [9]:
count_vect.transform([first_sentence])

<1x24113 sparse matrix of type '<class 'numpy.int64'>'
	with 34 stored elements in Compressed Sparse Row format>

In [10]:
one_hot_vector = count_vect.transform([first_sentence]).toarray()
one_hot_vector

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

The vector has $24113$ elements, of which only $34$ are non-zero.  
Let's find out which ones they are.

In [11]:
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
indices

array([  441,   808,  1242,  1255,  1597,  1989,  3581,  5892,  6613,
        7893, 10358, 11622, 12864, 13138, 13155, 13379, 13925, 14240,
       14557, 14809, 15448, 15993, 16469, 17857, 18782, 18923, 19573,
       21228, 21314, 21529, 22410, 23320, 23565, 23793], dtype=int64)
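The same indices can also be read directly from the sparse matrix, without converting it to a dense array first. This is a sketch on a made-up two-sentence corpus, using SciPy's `nonzero()` method on the matrix returned by `transform`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus; its sorted vocabulary is: and, by, cried, he, sky, sun, the.
toy_corpus = ["the sun and the sky", "he cried by the sun"]
vect = CountVectorizer(binary=True).fit(toy_corpus)

# 1 x vocab_size sparse matrix; "blue" is not in the vocabulary and is ignored.
row = vect.transform(["the blue sky and the sun"])

# Dense route, as in the cell above.
dense_indices = np.where(np.any(row.toarray() != 0, axis=0))[0]
# Sparse route: nonzero() returns (row_indices, column_indices).
sparse_indices = np.sort(row.nonzero()[1])

print(sparse_indices)                                 # [0 4 5 6]
print(np.array_equal(dense_indices, sparse_indices))  # True
```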

These are the indices of the words that are present in our sentence.  
Now let's create an index-to-word dictionary to check the result of `CountVectorizer`.

In [12]:
index_to_word = {index : word for word, index in count_vect.vocabulary_.items()}
dict(take(10, index_to_word.items()) )

{20786: 'swear',
 9912: 'he',
 4843: 'cried',
 2845: 'by',
 21228: 'the',
 20584: 'sun',
 808: 'and',
 2323: 'blue',
 19389: 'sky',
 14557: 'of'}

In [13]:
[(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices]

[(441, 'afforded', 1),
 (808, 'and', 1),
 (1242, 'as', 1),
 (1255, 'ascertaining', 1),
 (1597, 'aware', 1),
 (1989, 'being', 1),
 (3581, 'circuit', 1),
 (5892, 'dimensions', 1),
 (6613, 'dungeon', 1),
 (7893, 'fact', 1),
 (10358, 'however', 1),
 (11622, 'its', 1),
 (12864, 'make', 1),
 (13138, 'me', 1),
 (13155, 'means', 1),
 (13379, 'might', 1),
 (13925, 'my', 1),
 (14240, 'no', 1),
 (14557, 'of', 1),
 (14809, 'out', 1),
 (15448, 'perfectly', 1),
 (15993, 'point', 1),
 (16469, 'process', 1),
 (17857, 'return', 1),
 (18782, 'seemed', 1),
 (18923, 'set', 1),
 (19573, 'so', 1),
 (21228, 'the', 1),
 (21314, 'this', 1),
 (21529, 'to', 1),
 (22410, 'uniform', 1),
 (23320, 'wall', 1),
 (23565, 'whence', 1),
 (23793, 'without', 1)]

Indeed, these are the indices and the corresponding words from our sentence.

Because we created the `CountVectorizer` with `binary=True`, the elements are only ones and zeros.  
A small improvement is to use a vectorizer with `binary=False`, which takes word counts into account.

In [14]:
count_vect = CountVectorizer(binary=False)
X_train_cv = count_vect.fit_transform(train['text'])

one_hot_vector = count_vect.transform([first_sentence]).toarray()
indices = np.where(np.any(one_hot_vector!=0, axis=0))[0]
index_to_word = {index : word for word, index in count_vect.vocabulary_.items()}

[(ind, index_to_word[ind], one_hot_vector[0, ind]) for ind in indices]


[(441, 'afforded', 1),
 (808, 'and', 1),
 (1242, 'as', 1),
 (1255, 'ascertaining', 1),
 (1597, 'aware', 1),
 (1989, 'being', 1),
 (3581, 'circuit', 1),
 (5892, 'dimensions', 1),
 (6613, 'dungeon', 1),
 (7893, 'fact', 1),
 (10358, 'however', 1),
 (11622, 'its', 1),
 (12864, 'make', 1),
 (13138, 'me', 1),
 (13155, 'means', 1),
 (13379, 'might', 1),
 (13925, 'my', 1),
 (14240, 'no', 1),
 (14557, 'of', 3),
 (14809, 'out', 1),
 (15448, 'perfectly', 1),
 (15993, 'point', 1),
 (16469, 'process', 1),
 (17857, 'return', 1),
 (18782, 'seemed', 1),
 (18923, 'set', 1),
 (19573, 'so', 1),
 (21228, 'the', 4),
 (21314, 'this', 1),
 (21529, 'to', 1),
 (22410, 'uniform', 1),
 (23320, 'wall', 1),
 (23565, 'whence', 1),
 (23793, 'without', 1)]

We can see that the value for 'the' is now 4. It not only shows that this article is present in the sentence, but also indicates how many times it occurs.

In [15]:
tokenizer = count_vect.build_tokenizer()
tokenized_sentence = tokenizer(first_sentence.lower())
tokenized_sentence.count('the')

4
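The counting can be reproduced with the standard library alone. The sketch below assumes the default `token_pattern` of `CountVectorizer`, which keeps only tokens of two or more word characters, and uses a made-up sentence.

```python
import re
from collections import Counter

# Default token_pattern of CountVectorizer: tokens of 2+ word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = "The sun and the sky: a sight I loved."  # made-up example
tokens = re.findall(TOKEN_PATTERN, sentence.lower())
counts = Counter(tokens)

print(tokens)                        # ['the', 'sun', 'and', 'the', 'sky', 'sight', 'loved']
print(counts["the"])                 # 2
print("a" in counts, "i" in counts)  # False False
```

Single-character words such as "a" and "I" are dropped by this pattern, so they never appear in the vocabulary.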

##### 3.1.1 Advantages

##### 3.1.2 Disadvantages
