### Bag Of Words Model
The bag of words model is very simple. We represent a tweet by a feature vector whose length equals the number of words in an English dictionary. Specifically, if a tweet contains the $i^\textrm{th}$ word of the dictionary, we set $x_i=1$; otherwise we set $x_i=0$. For instance, the vector

$$x = \begin{pmatrix}1\\0\\0\\ \vdots \\1 \\ \vdots \\0
\end{pmatrix} \quad \begin{pmatrix}\textrm{a}\\ \textrm{aardvark} \\ \textrm{aardwolf}\\ \vdots \\ \textrm{buy} \\ \vdots \\ \textrm{zymurgy}
\end{pmatrix} $$

is used to represent a tweet that contains the words “a” and “buy,” but not “aardvark,” “aardwolf” or “zymurgy.”

As a note, rather than using an English dictionary, it is typical to look through the training set and use only the words that occur there. This has two benefits. First, not every word in a tweet will be in a dictionary, slang words in particular. Second, it reduces the dimensionality of the dataset, which lowers the computational and memory requirements.

The set of words used in the feature vector is known as the vocabulary, and the dimension of the dataset is equal to the size of the vocabulary.
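As a rough illustration of the idea, here is a minimal sketch that builds a vocabulary from a training set and turns a tweet into a 0/1 feature vector. The function names `build_vocabulary` and `bag_of_words` and the three training tweets are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
def build_vocabulary(tweets):
    """Collect every distinct word in the training tweets, in sorted order."""
    words = set()
    for tweet in tweets:
        words.update(tweet.lower().split())
    return sorted(words)


def bag_of_words(tweet, vocabulary):
    """Return a 0/1 vector: x[i] = 1 if vocabulary[i] occurs in the tweet."""
    present = set(tweet.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical training tweets, invented for illustration.
training_tweets = ["buy a phone", "a great day", "buy buy buy"]
vocabulary = build_vocabulary(training_tweets)

# "car" never occurs in the training set, so it has no position in the vector.
x = bag_of_words("buy a car", vocabulary)

print(vocabulary)  # ['a', 'buy', 'day', 'great', 'phone']
print(x)           # [1, 1, 0, 0, 0]
```

Note that the length of `x` always equals the size of the vocabulary, no matter how long the tweet is.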

### Beyond Bag of Words - Word2Vec
One issue with the bag of words model is that it has no meaningful notion of distance between words. To see this, consider a bag of words model with three words, `King, Queen, Man`. The model encodes them as
$$\textrm{King}=(1, 0, 0) \quad \textrm{Queen}=(0, 1, 0) \quad \textrm{Man}=(0, 0, 1)$$

One problem with such a simple system is that there is no meaning encoded in these vectors. In particular, we know that

$$\textrm{King} - \textrm{Man} = \textrm{Queen}$$

Meaning that if you remove the man from the `King`, you should get a `Queen`. But in the bag of words model you have

$$\textrm{King} - \textrm{Man} = (1, 0, 0) - (0, 0, 1) = (1, 0, -1) \neq \textrm{Queen}$$
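To check the arithmetic, here is a small sketch using the encoding defined above, where `Man` is the third word and so has vector $(0, 0, 1)$. The helper `subtract` is a name made up for this example.

```python
# One-hot encoding from the text: one position per vocabulary word.
king = (1, 0, 0)
queen = (0, 1, 0)
man = (0, 0, 1)


def subtract(u, v):
    """Element-wise difference of two equal-length vectors."""
    return tuple(a - b for a, b in zip(u, v))


difference = subtract(king, man)
print(difference)           # (1, 0, -1)
print(difference == queen)  # False
```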

More advanced models, like Word2Vec, begin to capture such relationships between words. Another example of a relationship between words might be
$$\textrm{Concrete} + \textrm{Bricks} = \textrm{House}$$

Other relationships may be contextual or definitional. For example, we expect the word `red` to be closer to `green` than it is to `King`. That is, we want

$$|\textrm{Red} - \textrm{Green}| < |\textrm{Red} - \textrm{King}|$$
Meaning, the distance from red to green is smaller than the distance from red to king. This is the kind of structure that Word2Vec provides.

With bag of words, however, every pair of distinct words is exactly the same distance apart, so we have that
$$|\textrm{Red} - \textrm{Green}| = |\textrm{Red} - \textrm{King}|$$
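A quick way to see this is to compute the distances directly. The sketch below assumes a three-word vocabulary `red, green, king` and takes $|\cdot|$ to mean Euclidean distance; both are assumptions made for illustration.

```python
import math
from itertools import combinations

words = ["red", "green", "king"]

# One-hot vector for each word: a 1 in the word's own position, 0 elsewhere.
one_hot = {
    word: [1 if i == j else 0 for j in range(len(words))]
    for i, word in enumerate(words)
}


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance between every pair of distinct words.
distances = {
    (a, b): distance(one_hot[a], one_hot[b])
    for a, b in combinations(words, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 4))
```

Two different one-hot vectors always differ in exactly two positions, so every distance comes out as $\sqrt{2}$, whichever words you pick.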



I'll note, however, that Word2Vec is quite advanced. It is a well-known technique for converting words into vectors (or numbers). Modern language models such as GPT-3 build on the same idea of representing text as learned vectors, although they learn their own embeddings rather than using Word2Vec itself.

<hr style="height:2px;border-width:0;color:black;background-color:black">

### What is Token(isation)

Tokenisation is the process of breaking up a string, in this case a tweet, into individual words. So if we had the sentence

`There is an old Polish proverb that says that when your socks are not in your shoes don't look for them in heaven`

Tokenisation will then split the sentence into individual words, remove any punctuation (e.g. `!, ?, ., ;`) and convert capitals to lowercase:

`['there', 'is', 'an', 'old', 'polish', ..., 'for', 'them', 'in', 'heaven']`
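This first step can be sketched with the Python standard library alone. The function name `tokenise` is our own, and treating every character in `string.punctuation` as removable is an assumption; this is one simple way to do it, not the only one.

```python
import string


def tokenise(text):
    """Lowercase the text, delete punctuation, and split on whitespace."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return no_punctuation.lower().split()


sentence = ("There is an old Polish proverb that says that when your socks "
            "are not in your shoes don't look for them in heaven")
tokens = tokenise(sentence)

print(tokens[:5])   # ['there', 'is', 'an', 'old', 'polish']
print(tokens[-4:])  # ['for', 'them', 'in', 'heaven']
```

Because the apostrophe counts as punctuation, `don't` becomes `dont`.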

Tokenisation, as we use the term here, goes further than this. In particular, some words carry little information for analysis; these are known as stop words. Words like `is`, `and`, etc. get removed in the tokenisation process. So really you get

`['there', 'old', 'polish', 'proverb', 'says', 'your', 'socks', 'your', 'shoes', 'dont', 'look', 'them', 'heaven']`
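Stop-word removal is just a filter over the token list. In the sketch below, `STOP_WORDS` is a hypothetical list chosen to cover this one sentence; real stop-word lists are much longer. The sketch keeps repeated words, so `your` appears twice in its output.

```python
# Hypothetical stop-word list, chosen to cover this example sentence only.
STOP_WORDS = {"is", "an", "and", "that", "when", "are", "not", "in", "for"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ['there', 'is', 'an', 'old', 'polish', 'proverb', 'that', 'says',
          'that', 'when', 'your', 'socks', 'are', 'not', 'in', 'your',
          'shoes', 'dont', 'look', 'for', 'them', 'in', 'heaven']
filtered = remove_stop_words(tokens)

print(filtered)
```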

Actually, tokenisation goes further still. If we consider the words `socks` and `sock`, then in the bag of words model these would be treated as two different words:
$$ 
\begin{pmatrix}\vdots \\ \textrm{sock} \\ \textrm{socks} \\ \vdots \end{pmatrix}
$$
But for our purposes there is no meaningful difference between the words `socks` and `sock`. So what we do now is map similar words (like `socks` and `sock`, or `shoes` and `shoe`) to the same word, a step commonly called stemming. That is

$$ 
\begin{pmatrix}\vdots \\ \textrm{sock} \\ \textrm{socks}\\ \vdots \\ \textrm{camp} \\ \textrm{camps} \\ \textrm{camping} \\ \textrm{camped} \\ \vdots  \end{pmatrix} \implies \begin{pmatrix}\vdots \\ \textrm{sock} \\ \vdots \\ \textrm{camp} \\ \vdots \end{pmatrix}
$$
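To make the collapsing step concrete, here is a deliberately crude sketch that strips a few common English suffixes. The function `crude_stem` and its three-suffix rule set are assumptions made for this example; a real stemmer, such as the Porter stemmer, handles many more cases.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy rule set, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["sock", "socks", "camp", "camps", "camping", "camped"]:
    print(word, "->", crude_stem(word))
```

All six words end up as either `sock` or `camp`, so they each take up a single position in the vocabulary.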

In total, tokenisation will then convert the sentence

`there is an old polish proverb that says that when your socks are not in your shoes don't look for them in heaven`

to

`['there', 'old', 'polish', 'proverb', 'say', 'your', 'sock', 'your', 'shoe', 'dont', 'look', 'them', 'heaven']`
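Putting the three steps together gives a small end-to-end sketch. As before, the stop-word list and the suffix-stripping rules are assumptions chosen to fit this one sentence, and the names `tokenise`, `crude_stem` and `preprocess` are our own. Repeated words are kept, so `your` shows up twice.

```python
import string

# Hypothetical stop-word list covering this example only.
STOP_WORDS = {"is", "an", "and", "that", "when", "are", "not", "in", "for"}


def tokenise(text):
    """Lowercase, delete punctuation, and split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()


def crude_stem(word):
    """Toy suffix stripping; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenise, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenise(text) if t not in STOP_WORDS]


sentence = ("there is an old polish proverb that says that when your socks "
            "are not in your shoes don't look for them in heaven")
result = preprocess(sentence)
print(result)
```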