# Intro to NLP

## What is NLP?

> Natural language processing, or NLP, is the use of human languages by a computer.

With language being the primary mode of human communication, there is an abundance of use cases for NLP.

Examples include:
- Sentiment analysis
- Question answering
- Translation
- Generating text

_Note that speech data is not itself text: it is a numerical audio waveform that encodes language. Audio can be turned into text by transcription, or processed directly by an end-to-end model, but that is out of scope here._

## What makes text processing different and hard?

- Language has lots of nuance
    - Words can have different meanings in different contexts
- Language data is not naturally numerical, and so can't immediately be processed mathematically

## Essential Tools and Concepts within NLP

### What is a corpus?

A corpus is a body of text that represents your data. One classic example would be the [Gutenberg corpus](https://zenodo.org/record/2422561#.Y8NpV-zP06E), which contains the text of over 50000 books.

### What is a token?

A token is an atomic unit of text. In most cases, you can think of tokens as individual words. In some cases, though, a token may be a common part of a word, such as a suffix, and in others it may be an individual character.

Example tokens:
- the word "probable"
- the character "h"
    - You won't typically see this, as most NLP is done on the word level
- the sequence of characters "ing", which is a common word suffix
    - You will commonly see this represented as a token "##ing", where the "##" indicates that this token is preceded by other characters
    - Similarly, you can have tokens which appear at the start of words, like "pre##", or in the middle of words, like "##ab##"
    - The tokens you end up with depend on how you turn your raw text into tokens, through the process known as _tokenisation_

Going forward, you can think about tokens as individual words, which is what they are in most cases. Note that you will probably see the words "token" and "word" used interchangeably.

### What is a tokeniser?

A tokeniser is a function that takes in raw text and turns it into a sequence of tokens.
This process is called _tokenisation_.

![](./images/Group4093.png)
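
As a rough illustration, a very simple tokeniser can be sketched with a regular expression. The function name `tokenise` and the choice to lowercase the text and split punctuation into separate tokens are assumptions made for this example; real tokenisers (including the sub-word ones that produce tokens like "##ing") are considerably more involved.

```python
import re


def tokenise(text: str) -> list[str]:
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores (a "word");
    # [^\w\s] matches a single character that is neither a word character
    # nor whitespace (i.e. a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = tokenise("The cat sat on the mat.")
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
```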

### What is a vocab?

A vocab is an assignment of an integer index to each token. 
If you imagine a list of tokens, the index of each token is the position of that token in the list.

![](./images/Vocab.png)
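
A minimal sketch of building a vocab from a list of tokens might look like the following. The helper name `build_vocab` and the choice to index tokens in order of first appearance are assumptions for illustration; libraries often order a vocab by frequency and reserve indexes for special tokens.

```python
def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each unique token an integer index, in order of first appearance."""
    vocab: dict[str, int] = {}
    for token in tokens:
        if token not in vocab:
            # The next free index is simply the current size of the vocab
            vocab[token] = len(vocab)
    return vocab


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}

# Once we have a vocab, a sequence of tokens becomes a sequence of integers
indexes = [vocab[token] for token in tokens]
print(indexes)  # [0, 1, 2, 3, 0, 4]
```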

## How do we represent words?

### Word indexes

As indicated by the vocab, we can represent each token mathematically by assigning it an integer index.

An alternative way to represent that index is using a vector that is as long as the number of tokens in your vocab, which contains zeros everywhere except for a one in the position of the index corresponding to the word.

![](./images/One-hot%20Vector.png)

We call this vector a _1-hot vector_, or a _one-hot encoding_ of the token.
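
As a small sketch, a one-hot encoding can be built from a token's index like this. The function name `one_hot` and the five-token vocab are made up for the example.

```python
def one_hot(index: int, vocab_size: int) -> list[int]:
    """Return a vector of zeros with a single one at position `index`."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


# A hypothetical five-token vocab
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

print(one_hot(vocab["sat"], len(vocab)))  # [0, 0, 1, 0, 0]
```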

### Why does the 1-hot representation make more sense mathematically than the index?

The one-hot vector makes more sense mathematically than the index because the index implies that tokens sit on a number line, with an order and a magnitude, which they do not. The token with index 2 is not necessarily between tokens 1 and 3, and the token with index 100 is not bigger than the token with index 1.

### The problem with 1-hot encodings

#### Similar words do not have similar representations

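- A 1-hot vector says nothing about which words are similar to each other
- In fact, every pair of distinct 1-hot vectors is orthogonal (their dot product is zero), so all words look equally unrelated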
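
The orthogonality can be checked directly with a dot product. This is only a sketch: the helper names and the indexes assigned to "cat" and "dog" are hypothetical.

```python
def one_hot(index: int, vocab_size: int) -> list[int]:
    """Return a vector of zeros with a single one at position `index`."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


def dot(a: list[int], b: list[int]) -> int:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical indexes for two related words in a five-token vocab
cat = one_hot(1, 5)
dog = one_hot(2, 5)

print(dot(cat, dog))  # 0: different tokens never overlap, however related they are
print(dot(cat, cat))  # 1: a token only matches itself
```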

#### The length of your 1-hot encodings increases with every new token
- 1-hot encodings contain an element for every possible word, so for larger corpora, with more tokens, they are longer

![](./images/One-hot%20Word%20Embeddings.png)

> Overall, we want to avoid using 1-hot encodings to represent our words and try something else... word embeddings

### Word embeddings

Word embeddings are vector representations of tokens that capture something of what each token means, so that similar words have similar vectors.

![](./images/Dense%20Word%20Embeddings.png)

Where 1-hot encodings are "sparse", containing mostly zeros, word embeddings are "dense".

![](./images/One-hot%20vs%20Dense%20Word%20Embeddings.png)
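
To see why dense vectors help, here is a sketch comparing words by cosine similarity. The three-dimensional vectors below are hand-picked purely for illustration and were not learnt from any data; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked illustrative vectors (an assumption, not learnt embeddings)
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(cat_dog)  # close to 1
print(cat_car)  # much lower
```

Unlike 1-hot vectors, which are all equally far apart, dense vectors can place related words close together.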

## Word embeddings can be learnt in a number of ways

### Directly maximise the vector similarity between words that appear in similar context - The Word2Vec algorithm

Word2Vec was one of the first widely adopted algorithms for creating meaningful vector representations of words.

It is based on the assumption that similar words appear in similar contexts. 

The famous 1957 quote from the linguist J. R. Firth that captures this assumption is: "you shall know a word by the company it keeps".

At a high level, this is how it works:
- Initialise random embeddings
- Take pairs of words that appear close together (within a threshold distance)
- Calculate their cosine similarity
- Maximise this similarity using gradient-based optimisation (in practice, by minimising a loss with gradient descent)
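
The steps above can be sketched as a toy training loop. Note that this is not the reference Word2Vec implementation, and it deviates from the simplified description in two ways that should be read as assumptions: it scores pairs with a dot product passed through a sigmoid rather than cosine similarity, and it also pushes apart randomly sampled "negative" pairs so that the vectors do not all collapse together. That is closer in spirit to the skip-gram with negative sampling variant. All function names, the corpus and the hyperparameters are made up for the example.

```python
import math
import random


def context_pairs(token_ids, window=2):
    """Return (centre, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, centre in enumerate(token_ids):
        lo = max(0, i - window)
        hi = min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, token_ids[j]))
    return pairs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(token_ids, vocab_size, dim=8, window=2, epochs=50, lr=0.05,
          negatives=2, seed=0):
    rng = random.Random(seed)
    # Initialise random embeddings: one "input" and one "output" vector per token
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    pairs = context_pairs(token_ids, window)
    losses = []
    for _ in range(epochs):
        total = 0.0
        for centre, context in pairs:
            # One true neighbour (label 1) plus a few random tokens (label 0)
            targets = [(context, 1.0)]
            targets += [(rng.randrange(vocab_size), 0.0) for _ in range(negatives)]
            for target, label in targets:
                v, u = w_in[centre], w_out[target]
                score = sigmoid(sum(a * b for a, b in zip(v, u)))
                total += -math.log(score if label else 1.0 - score)
                grad = score - label  # derivative of the loss w.r.t. the dot product
                for k in range(dim):
                    v_k = v[k]  # keep the old value so both updates use it
                    v[k] -= lr * grad * u[k]
                    u[k] -= lr * grad * v_k
        losses.append(total / len(pairs))
    return w_in, losses


corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = {token: i for i, token in enumerate(dict.fromkeys(corpus))}
token_ids = [vocab[token] for token in corpus]

embeddings, losses = train(token_ids, vocab_size=len(vocab))
print(losses[0], losses[-1])  # average loss per pair in the first and last epoch
```

On a corpus this small the learnt vectors are not meaningful, and the random negatives make the loss noisy from epoch to epoch. The point is only to show the shape of the procedure: random initialisation, nearby pairs, a similarity score, and gradient updates.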

### They can be learnt for a specific problem

Alternatively, word representations can be learnt from scratch by solving a specific downstream task, such as sentiment analysis.

In this setup, the embeddings are simply a part of the model parameters, like the other weights and biases. The input to the model is the integer indexes of the tokens, the first layer of the model is an embedding layer which indexes out the row to use as a word embedding, and the output is whatever is required for the task, such as a classification for sentiment analysis.
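
The "indexing out a row" step can be sketched as follows. The embedding matrix values here are arbitrary placeholders (in a real model they are learnt parameters), and the helper names are hypothetical. The sketch also shows that looking up a row gives the same result as multiplying the token's 1-hot vector by the matrix, which is why the embedding layer can take integer indexes directly.

```python
# A hypothetical embedding matrix: 4 tokens in the vocab, 3 dimensions each.
# The values are placeholders; in a model they would be learnt.
embedding_matrix = [
    [0.1, 0.2, 0.3],  # embedding for token index 0
    [0.4, 0.5, 0.6],  # embedding for token index 1
    [0.7, 0.8, 0.9],  # embedding for token index 2
    [1.0, 1.1, 1.2],  # embedding for token index 3
]


def embed(token_indexes, matrix):
    """Embedding lookup: pick out one row of the matrix per token index."""
    return [matrix[i] for i in token_indexes]


def one_hot_matmul(index, matrix):
    """Multiply a 1-hot row vector by the matrix (the slow, equivalent route)."""
    one_hot = [1 if row == index else 0 for row in range(len(matrix))]
    n_cols = len(matrix[0])
    return [
        sum(one_hot[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


print(embed([2, 0], embedding_matrix))  # [[0.7, 0.8, 0.9], [0.1, 0.2, 0.3]]
print(one_hot_matmul(2, embedding_matrix) == embedding_matrix[2])  # True
```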

Learning word representations in this way can produce problem-specific representations. For example, words might have different representations in a translation task compared to those in a sentiment analysis task.

Note that however you learn embeddings, what they represent will be determined by the data they are learnt from.

Embeddings can also be learnt from a general-purpose task rather than a specific one. The best-known example is BERT, which learns representations of words from a domain-agnostic language modelling problem.

## Pre-trained word embeddings

Learning meaningful word representations can take a lot of time and compute. 
Thankfully, we can take the embeddings learnt by others straight off the shelf.

One of the most influential machine learning models that provides such embeddings is BERT.

It's not important to understand BERT at this point, but for now:
- BERT stands for Bidirectional Encoder Representations from Transformers
- It is trained to fill in the missing word in text
- It contains the word embeddings within its first layer's parameters. These BERT embeddings are widely used as a good starting point for word embeddings.

We will talk about BERT more later, but we can already start using it.

Open the practical [notebook](https://colab.research.google.com/github/AI-Core/Practicals-Public-dev/blob/main/Content/units/Deep%20Learning%20for%20NLP/0.%20Intro%20to%20AI%20for%20Text%20Data/0.%20Intro%20to%20NLP/Practicals/0.%20Visualising%20BERT%20Embeddings/Practical.ipynb) associated with this lesson to explore embeddings in BERT.