# NLP Exercise
## Maxwell Ernst
#### 18/04/2023

Description
You should now have a basic theoretical understanding of the elements in an NLP pipeline, along with some practice from several tutorials covering these topics and methods.

Now the time is ripe to play with one of the state-of-the-art pre-trained NLP models, such as BERT (and derivatives such as ALBERT and RoBERTa) or GPT-2.

1. Fine-tune a Transformer model (use very few examples, ~10-100): https://huggingface.co/docs/transformers/main/en/training

2. Apply the proper pipeline for inference and experiment with prompts: https://huggingface.co/docs/transformers/main/en/pipeline_tutorial

Deliverables
A link to, or an HTML or PDF export of, your (Colab) notebook. Make sure to document the code and explain your own steps. For extra kudos, you can change the model and task of the given tutorials. But beware: playing around with these large models is challenging, and fine-tuning them requires a lot of time and GPU power.

## Data

In [1]:
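from datasets import load_dataset

# Download the Yelp review dataset (review text with a star-rating label).
dataset = load_dataset("yelp_review_full")
# Inspect a single training example.
dataset["train"][100]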

Downloading and preparing dataset yelp_review_full/yelp_review_full to C:/Users/maxwe/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf...


ConnectionError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Read timed out.
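
The download above failed with a read timeout, so every later cell depends on a dataset that was never loaded. One possible workaround is to retry the call with a growing delay. The sketch below is only an illustration: `with_retries` and its parameters are hypothetical names, not part of the datasets library, and the exception types to retry on are an assumption that may need adjusting to whatever the library actually raises.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn() and retry on network-style errors with exponential backoff.

    Hypothetical helper: the name, defaults, and retry_on tuple are assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Wait 1x, 2x, 4x, ... the base delay before the next try.
            sleep(base_delay * 2 ** (attempt - 1))


# Intended usage in this notebook (needs the datasets package and network access):
# dataset = with_retries(lambda: load_dataset("yelp_review_full"))
```

If the timeout persists, retrying will not help, and a more stable connection or a hosted environment such as Colab is probably the simpler fix.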

## Tokenizer

Use the map method of 🤗 Datasets to apply a preprocessing function over the entire dataset.

In [None]:
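from transformers import AutoTokenizer

# Load the tokenizer that matches the bert-base-cased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    # Pad every review to the model's maximum length and truncate longer ones.
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# batched=True passes many examples to tokenize_function per call.
tokenized_datasets = dataset.map(tokenize_function, batched=True)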
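
To make the effect of padding="max_length" and truncation=True concrete, here is a dependency-free toy version. It is a sketch under loud assumptions: `toy_tokenize` is a hypothetical stand-in that splits on whitespace and builds its vocabulary on the fly. It does not reproduce BERT's WordPiece tokenization or special tokens, and only mimics the shape of the output for a batch.

```python
def toy_tokenize(examples, max_length=4, pad_id=0):
    """Toy stand-in for tokenizer(..., padding="max_length", truncation=True).

    Hypothetical helper: whitespace splitting and the ad hoc vocabulary are
    assumptions for illustration, not how the BERT tokenizer works.
    """
    vocab = {}
    input_ids, attention_mask = [], []
    for text in examples["text"]:
        words = text.split()[:max_length]               # truncation
        ids = [vocab.setdefault(w, len(vocab) + 1) for w in words]
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)        # padding to a fixed length
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {"text": ["Great food and great service", "Bad"]}
encoded = toy_tokenize(batch, max_length=4)
print(encoded["input_ids"])       # [[1, 2, 3, 4], [5, 0, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 0, 0, 0]]
```

Every row ends up with the same length, which is what lets the examples be stacked into one tensor; the attention mask marks which positions are real tokens rather than padding.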

## Training

In [None]:
from transformers import AutoModelForSequenceClassification

# Load BERT with a classification head sized for the five Yelp star ratings.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
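
The cell above only loads the model; the fine-tuning step itself is not shown. A minimal sketch of how it could look, loosely following the Hugging Face training tutorial linked in the description and assuming the dataset exposes "train" and "test" splits, is given below. `take_small_subset` and `fine_tune` are hypothetical helper names, the hyperparameters are placeholders rather than tuned values, and `compute_metrics` refers to the function defined in the Evaluate section.

```python
def take_small_subset(split, n=100, seed=42):
    """Shuffle a dataset split and keep at most n examples (hypothetical helper)."""
    n = min(n, len(split))
    return split.shuffle(seed=seed).select(range(n))


def fine_tune(model, tokenized_datasets, compute_metrics, n=100,
              output_dir="test_trainer"):
    """Fine-tune on a small subset and return evaluation metrics (sketch only)."""
    # Imported lazily so take_small_subset stays usable without transformers.
    from transformers import Trainer, TrainingArguments

    train_split = take_small_subset(tokenized_datasets["train"], n)
    eval_split = take_small_subset(tokenized_datasets["test"], n)

    # Placeholder hyperparameters, chosen only to keep the run short.
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_split,
        eval_dataset=eval_split,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    return trainer.evaluate()


# Intended usage, once the dataset has loaded and compute_metrics is defined:
# results = fine_tune(model, tokenized_datasets, compute_metrics, n=100)
```

With only about 100 examples, the reported accuracy will likely be noisy, so it is better read as a smoke test of the pipeline than as a meaningful score.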

## Evaluate

The 🤗 Evaluate library provides a simple accuracy function you can load with the evaluate.load function.

In [None]:
import numpy as np
import evaluate

# Accuracy: the fraction of predictions that match the reference labels.
metric = evaluate.load("accuracy")

In [None]:
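def compute_metrics(eval_pred):
    # eval_pred holds the model's raw logits and the true labels.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)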
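
As a worked illustration of what compute_metrics does, the sketch below repeats the same two steps (argmax over each row of logits, then the fraction of matches) in plain Python on made-up numbers. `argmax` and `toy_accuracy` are hypothetical names, the logits and labels are invented for the example, and the real function uses np.argmax and the loaded accuracy metric instead.

```python
def argmax(row):
    """Index of the largest value in a list (stand-in for np.argmax on one row)."""
    return max(range(len(row)), key=row.__getitem__)


def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Invented logits for four reviews over the five star-rating classes.
logits = [
    [0.1, 0.2, 3.0, 0.0, 0.1],  # predicts class 2
    [2.5, 0.1, 0.3, 0.2, 0.0],  # predicts class 0
    [0.0, 0.1, 0.2, 0.3, 4.0],  # predicts class 4
    [0.3, 1.9, 0.2, 0.1, 0.0],  # predicts class 1
]
labels = [2, 0, 3, 1]

result = toy_accuracy(logits, labels)
print(result)  # {'accuracy': 0.75}
```

Three of the four predictions match their labels, hence 0.75. Softmax is not needed here because it does not change which logit is largest.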

![Image of a graph](https://upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Bar_chart.svg/1200px-Bar_chart.svg.png)

![Feature vector visualization](https://i.imgur.com/1w9XQJf.png)