# Natural Language Processing (NLP)

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and written.

Language is considered **unstructured** data. Unstructured data is information that is not arranged according to a preset data model or schema, and therefore does not fit neatly into the rows and columns of a traditional relational database or a spreadsheet (think Excel files).
Most of the data generated and collected is unstructured.

<img src="./img/lab_10_nlp_history.png">


A main challenge in NLP is how to represent text as data that the computer can consume and understand.

-----


There are two main phases to natural language processing: **data representation** and **algorithm development**.

## Data representation
In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors. There are multiple ways to do this, including:

* **Tokenization**: This is when text is broken down into smaller units to work with.

<img src="./img/lab_10_tokenization.png">

After you have decided on your tokenization strategy, you have to preprocess the tokens. Here are a few common preprocessing approaches.

* **Lowercasing**: All of the text is converted to lowercase.
* **Stop Word Removal**: This is when common words are removed from text so unique words that offer the most information about the text remain.
* **Lemmatization & Stemming**: This is when words are reduced to their root forms for processing.
* **Part-of-Speech (POS) Tagging**: This is when words are marked based on their part of speech, such as nouns, verbs and adjectives.
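
As a rough illustration of the first preprocessing steps, here is a minimal sketch of tokenization, lowercasing and stop word removal using only the Python standard library. The `STOP_WORDS` set and the helper names `tokenize` and `remove_stop_words` are hypothetical and chosen for this example; real projects usually rely on a library tokenizer and a curated stop word list.

```python
import re

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


tokens = tokenize("The dog is chasing a person!")
filtered = remove_stop_words(tokens)
print(tokens)    # ['the', 'dog', 'is', 'chasing', 'a', 'person']
print(filtered)  # ['dog', 'chasing', 'person']
```

Lemmatization and POS tagging need linguistic resources, so they are left out of this sketch.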


### Still, how can we turn these tokens into numbers that retain their meaning?

#### **One-hot encoding**:
One-hot encode each word in the sentence. The steps are as follows:
* First, create a vector whose length is the size of our vocabulary.
* Assign 1 to the position of the word and 0 everywhere else, producing one vector per word in the sentence.

|        | chase | dog | person | word | ... |
|--------|-------|-----|--------|------|-----|
| dog    | 0     | 1   | 0      | 0    | 0   |
| chase  | 1     | 0   | 0      | 0    | 0   |
| person | 0     | 0   | 1      | 0    | 0   |

We converted "Dog chase person" to a matrix!

What are the issues with this approach?
- This representation does not convey any relationships between words
- The generated matrix is high-dimensional and sparse
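
The one-hot steps described above can be sketched in a few lines of plain Python. The function name `one_hot_encode` and the four-word vocabulary are assumptions made for this illustration; a real vocabulary would contain thousands of words, which is exactly why the matrix becomes high-dimensional and sparse.

```python
def one_hot_encode(tokens, vocabulary):
    """Return one row per token, with a single 1 at the token's vocabulary index."""
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for token in tokens:
        row = [0] * len(vocabulary)
        row[index[token]] = 1  # raises KeyError for out-of-vocabulary tokens
        rows.append(row)
    return rows


vocabulary = ["chase", "dog", "person", "word"]
matrix = one_hot_encode(["dog", "chase", "person"], vocabulary)
for row in matrix:
    print(row)
# [0, 1, 0, 0]
# [1, 0, 0, 0]
# [0, 0, 1, 0]
```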


#### **Bag-of-Words**:
BoW is a simple document embedding technique based on word frequency.
* Create a vector whose length is equal to the size of the vocabulary
* At each position, store how many times the corresponding word appears in the given document

Let's look at a new example `My dog is chasing his dog`. You can create a BoW representation like this:

| chase | cat | dog | his | person | my | word | ... |
|-------|-----|-----|-----|--------|----|------|-----|
| 1     | 0   | 2   | 1   | 0      | 1  | 0    | 0   |

The output vector is `[1, 0, 2, 1, 0, 1, 0, 0, ...]`

* This approach captures `shallow` semantics, i.e. if two sentences have similar vocabulary, the vectors that represent them are close in the vector space and the sentences might have similar meanings.
* The generated representation is less sparse than one-hot encoding, because each document becomes a single vector instead of one vector per word.

Still, it is sparse and does not fully capture the semantics, because word order is lost: `My dog is chasing his dog` and `His dog is chasing my dog` produce the same vector.
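
Here is a minimal Bag-of-Words sketch that reproduces the example vector and shows the word-order limitation. The `LEMMAS` dictionary is a hypothetical stand-in for a real lemmatizer, and words outside the fixed vocabulary (such as "is") are simply ignored; both are assumptions made to keep the example short.

```python
from collections import Counter

VOCABULARY = ["chase", "cat", "dog", "his", "person", "my", "word"]
# Hypothetical stand-in for a real lemmatizer (assumption for this sketch).
LEMMAS = {"chasing": "chase"}


def bag_of_words(sentence, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = [LEMMAS.get(t, t) for t in sentence.lower().split()]
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


a = bag_of_words("My dog is chasing his dog")
b = bag_of_words("His dog is chasing my dog")
print(a)       # [1, 0, 2, 1, 0, 1, 0]
print(a == b)  # True: the two sentences are indistinguishable
```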

#### **Word Embeddings**: 

A technique to represent words as low-dimensional dense vectors while capturing the relationship between the words in the vector space.

There are many approaches to generating word embeddings, such as `word2vec` and `GloVe`.
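
To make "capturing the relationship between words in the vector space" more concrete, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are made up for illustration and do not come from `word2vec`, `GloVe` or any trained model; real embeddings typically have a few hundred dimensions.

```python
import math

# Made-up vectors for illustration only (not from a trained model).
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_dog_puppy = cosine_similarity(embeddings["dog"], embeddings["puppy"])
sim_dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
print(sim_dog_puppy > sim_dog_car)  # True
```

With trained embeddings, related words tend to score higher than unrelated ones in the same way.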




## Algorithms
Natural language processing applies algorithms to understand the meaning and structure of sentences. These algorithms include:

* **Word sense disambiguation**. This derives the meaning of a word based on context.
* **Named entity recognition**. This identifies words or phrases that can be categorized into groups, such as people, organizations and locations.
* **Natural language generation**. This is used to determine semantics behind words and generate new text.
* **Text classification**. This involves assigning tags to texts to put them in categories. This can be useful for sentiment analysis, which helps the natural language processing algorithm determine the sentiment, or emotion, behind a text.
* **Text extraction**. This involves automatically summarizing text and finding important pieces of data.
* **Machine translation**. This is the process by which a computer translates text from one language, such as English, to another language, such as French, without human intervention.
-----

## What is Sentiment Analysis

# What is BERT?

<img src="./img/lab_12_bert.jpg">

**BERT** stands for Bidirectional Encoder Representations from Transformers. Jacob Devlin and his colleagues developed BERT at Google in 2018. They trained BERT on English Wikipedia (2,500M words) and BooksCorpus (800M words) and achieved state-of-the-art results on several NLP tasks in 2018.


There are two general pre-trained BERT variations: the base model is a 12-layer, 768-hidden, 12-head, 110M-parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-head, 340M-parameter neural network architecture.

# What is Hugging Face 🤗

Hugging Face is an open-source provider of natural language processing (NLP) technologies. It has a large open-source community, in particular around the Transformers library.

🤗/Transformers is a Python-based library that exposes an API to use many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 or DistilBERT, that obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. Those architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package:

`pip install transformers`


More here: https://blog.tensorflow.org/2019/11/hugging-face-state-of-art-natural.html


The advantage of using `Transformers` lies in the straightforward, model-agnostic API. Loading a pre-trained model, along with its tokenizer, can be done in a few lines of code. Here is an example of loading the BERT TensorFlow models as well as their tokenizers.
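
A minimal sketch of that loading step is shown below. It assumes a `transformers` release that still ships the TensorFlow classes, an installed `tensorflow`, and network access to download the weights on first use. The wrapper function `load_bert` is a hypothetical helper introduced here, and the checkpoint name `bert-base-cased` is the one used in the fine-tuning section of this notebook.

```python
def load_bert(checkpoint="bert-base-cased"):
    """Download (or read from the local cache) a pretrained tokenizer and TF model."""
    # Imported lazily so that defining this helper does not require the libraries.
    from transformers import AutoTokenizer, TFAutoModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = TFAutoModel.from_pretrained(checkpoint)
    return tokenizer, model
```

Calling `tokenizer, model = load_bert()` performs the download. Because the `Auto` classes pick the architecture from the checkpoint name, switching to another model should only require changing that string.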

# Fine-tuning a pretrained model

We are going to use the IMDB dataset: the task is to classify whether movie reviews are positive or negative. For more information, you can check the Datasets [documentation](https://huggingface.co/docs/datasets/).

In [None]:
!pip install datasets transformers tensorflow -q

In [None]:
from datasets import load_dataset

# download and cache the dataset:
raw_datasets = load_dataset("imdb")

raw_datasets

In [None]:
from pprint import pprint

To preprocess our data, we will need a tokenizer. If you plan on using a pretrained model, it’s important to use the associated pretrained tokenizer: it will split the text you give it into tokens the same way as was done for the pretraining corpus, and it will use the same token-to-index mapping (usually called a vocab) as during pretraining.

In [None]:
from transformers import AutoTokenizer

# automatically download the vocab used during pretraining or fine-tuning of the given model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
# example
encoded_input = tokenizer("I love machine learning!")
print(encoded_input)

This returns a dictionary mapping strings to lists of ints. The `input_ids` are the indices corresponding to each token in our sentence. The `attention_mask` tells the model which positions hold real tokens (1) and which are padding (0), and the `token_type_ids` mark which segment each token belongs to when two sentences are passed together.

In [None]:
tokenizer.decode(encoded_input["input_ids"])

As you can see, the tokenizer automatically added some special tokens that the model expects. Now let's tokenize our data:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# This will make all the samples have the maximum length the model can accept (here 512),
# either by padding or truncating them. Note that we are applying the preprocessing step to
# all splits of the raw dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
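
To show what `padding="max_length"` and `truncation=True` do to a single sample, here is a simplified sketch in plain Python. The helper `pad_or_truncate` and the token ids are made up for this illustration. Unlike this sketch, the real tokenizer also keeps the special tokens in place when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]            # truncation: cut off the extra tokens
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Made-up token ids for illustration.
short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_length=6)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9], max_length=3)
print(short_ids, short_mask)  # [11, 12, 13, 14, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [5, 6, 7] [1, 1, 1]
```

The attention mask lets the model ignore the padded positions, so padding changes the shape of the input without changing its content.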

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
# full_train_dataset = tokenized_datasets["train"]
# full_eval_dataset = tokenized_datasets["test"]

In [None]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

In [None]:
model.summary()

Since we are going to train our model natively in TensorFlow, we need to convert our datasets to standard `tf.data.Dataset` objects.

In [None]:
tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")

In [None]:
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
eval_tf_dataset = eval_tf_dataset.batch(8)

In [None]:
train_features

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)
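
The loss used above, sparse categorical cross-entropy computed from logits, can be written out for a single example as follows. This is a plain-Python sketch of the formula, not the Keras implementation, and the function name and logit values are made up for illustration. Keras additionally averages the per-example losses over the batch.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    prob = exps[label] / sum(exps)
    return -math.log(prob)


# Made-up logits for a two-class problem (negative, positive).
loss_if_positive = sparse_categorical_crossentropy([0.2, 3.0], label=1)
loss_if_negative = sparse_categorical_crossentropy([0.2, 3.0], label=0)
print(loss_if_positive < loss_if_negative)  # True
```

The loss is small when the largest logit belongs to the true class and large otherwise, which is what drives fine-tuning towards correct predictions.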

In [None]:
# save the fine-tuned model for future use
model.save_pretrained("./my_imdb_model")

There are many more examples for different tasks such as text classification, question answering, etc. [here](https://github.com/huggingface/transformers/tree/master/examples/tensorflow).

## Optional - Hugging Face approach

The `compute_metrics` function takes predictions and labels, then computes and returns a dictionary with string keys (the metric names) and float values (the metric values).

The 🤗 Datasets library provides an easy way to get the common metrics used in NLP with the `load_metric` function. Here we simply use accuracy. Then we define the `compute_metrics` function, which just converts logits to predictions (remember that all 🤗 Transformers models return the logits) and feeds them to the `compute` method of this metric.

In [None]:
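import numpy as np
# Note: newer releases of `datasets` removed load_metric; there, the separate
# `evaluate` package offers evaluate.load("accuracy") as the replacement.
from datasets import load_metric

metric = load_metric("accuracy")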

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
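
To see what such a metric function returns without downloading anything, here is a dependency-free sketch of the same idea: take the argmax of each row of logits and compare it with the label. The function name `accuracy_from_logits` and the toy values are assumptions made for this illustration.

```python
def accuracy_from_logits(logits, labels):
    """Convert logits to class predictions and return the accuracy as a dict."""
    predictions = [row.index(max(row)) for row in logits]  # argmax per row
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews and their true labels.
toy_logits = [[2.0, -1.0], [0.1, 0.4], [-0.3, 1.2], [1.5, 0.5]]
toy_labels = [0, 1, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```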

To define our Trainer, we will need to instantiate a `TrainingArguments` object. This class contains all the hyperparameters we can tune for the Trainer and the flags to activate the different training options it supports. Let’s begin by using all the default arguments.

In [None]:
!pip install "transformers[torch]"

In [None]:
from transformers import TrainingArguments

# Note: newer transformers releases rename `evaluation_strategy` to `eval_strategy`.
training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")

In [None]:
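from transformers import AutoModelForSequenceClassification, Trainer

# Trainer works with PyTorch models, so load a PyTorch version of the model
# instead of reusing the TensorFlow `model` trained above.
pt_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# instantiate a Trainer
trainer = Trainer(
    model=pt_model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)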

To fine-tune our model, we just need to call `train()` on the Trainer:



In [None]:
trainer.train()