# Week 1 Practical

## Hugging Face

### What are some common natural language processing tasks?

Let's explore some of the interesting pre-trained models on the Hugging Face hub.

Make sure the `transformers` library is installed. If the next cell runs without error, you're good.
Otherwise, run `conda install transformers` or `pip install transformers`. You can do a few exercises
while you wait.

In [1]:
import transformers

Look up https://huggingface.co/docs/transformers/en/main_classes/pipelines and find the 
documentation for the `pipeline` constructor.

Make a list of tasks that look relevant to natural language processing. Randomly pick one 
and look up more information about it.

In [2]:
# Task chosen: text-classification
# Example model: MiniLMv2-toxic-jigsaw

Go to the Hugging Face hub and find a model that implements the task that you saw.

### Running some example models

Look up the following dataset:

- sst2

(You can look it up on Hugging Face, or just search in general.)

What is it about? What are the features (columns) of this data set?

In [3]:
# Features (columns): idx, sentence, label

Look up the following models:

- distilbert-base-uncased-finetuned-sst-2-english

You can probably guess what dataset it has been fine-tuned on! What sort of task is it?

You can use it with `transformers.pipeline(` *name-of-the-task* `,model="` *name-of-the-model* `)`

Try the following:

- Write a grumpy sentence and see what it says.

- Write a cheery and happy sentence.

- Write a neutral sentence, but include the name of a developed nation or a popular city.

- Write a neutral sentence, but include the name of a developing nation or a city that has a bad reputation.


In [4]:
from transformers import pipeline



In [12]:
model = transformers.pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')

In [13]:
print(model('I am grumpy'))
print(model('I am happy'))
print(model('I have been to India'))
print(model('I live in Spain but the S is silent'))

[{'label': 'NEGATIVE', 'score': 0.9994398951530457}]
[{'label': 'POSITIVE', 'score': 0.9998801946640015}]
[{'label': 'POSITIVE', 'score': 0.9952318072319031}]
[{'label': 'NEGATIVE', 'score': 0.9954913258552551}]
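
To compare sentences side by side (for example, the two place-name sentences), it can help to collapse each result into a single signed number. The sketch below is one way to do that. It uses the outputs printed above as plain data, so it runs without downloading a model; `signed_score` is a hypothetical helper, not part of `transformers`.

```python
# Outputs copied from the cell above. Each pipeline call returns a list
# holding one {'label', 'score'} dictionary per input sentence.
results = {
    'I am grumpy': [{'label': 'NEGATIVE', 'score': 0.9994398951530457}],
    'I am happy': [{'label': 'POSITIVE', 'score': 0.9998801946640015}],
    'I have been to India': [{'label': 'POSITIVE', 'score': 0.9952318072319031}],
    'I live in Spain but the S is silent': [{'label': 'NEGATIVE', 'score': 0.9954913258552551}],
}


def signed_score(result):
    """Map one pipeline result to [-1, 1]: positive sentiment > 0, negative < 0."""
    top = result[0]
    sign = 1.0 if top['label'] == 'POSITIVE' else -1.0
    return sign * top['score']


for sentence, result in results.items():
    print(f'{signed_score(result):+.4f}  {sentence}')
```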


Look up the **conll03** dataset.

We can create a pipeline with the "ner" task. The default model has been trained on **conll03**.

Copy and paste an article from today's news and run NER (named entity recognition) on it.

In [20]:
ner = transformers.pipeline('ner')


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
news = 'Police officer facing murder charges over missing Sydney couple'
print(ner(news))

[{'entity': 'I-LOC', 'score': 0.99503386, 'index': 8, 'word': 'Sydney', 'start': 50, 'end': 56}]
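
The `start` and `end` values in each entity are character offsets into the input string, so the entity text can be recovered by slicing. A small sketch, using the NER output shown above as plain data (no model needed):

```python
news = 'Police officer facing murder charges over missing Sydney couple'

# Entity copied from the NER output above.
entities = [{'entity': 'I-LOC', 'score': 0.99503386, 'index': 8,
             'word': 'Sydney', 'start': 50, 'end': 56}]

for ent in entities:
    # 'end' is exclusive, which matches Python slice semantics.
    span = news[ent['start']:ent['end']]
    print(ent['entity'], repr(span))
```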

Look up the **SQuAD** dataset.

The question-answering task is a little different. The pipeline takes two arguments:

- question (the question you want to ask)

- context (the information that it has to answer questions about)

Ask a question about your news article.

In [23]:
QA = transformers.pipeline('question-answering')

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [24]:
d = {"question" : "Who is facing murder charges ?", "context" : "Spidermans dad is a police"}
QA(d)

{'score': 0.6904413104057312,
 'start': 0,
 'end': 14,
 'answer': 'Spidermans dad'}

Summarise your news article using the default model for the "summarization" task.

In [25]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [29]:
print(summarizer('''A Black high school students months-long punishment by his Texas school district for refusing to change his hairstyle does not violate a new state law that prohibits race-based hair discrimination, a judge ruled on Thursday.
Darryl George, 18, has not been in his regular Houston-area high school classes since August 31 because the district, Barbers Hill, says the length of his hair violates its dress code.
The district filed a lawsuit arguing Georges long hair, which he wears in tied and twisted locs on top of his head, violates its policy because it would fall below his shirt collar, eyebrows or earlobes when let down.''', max_length=20))

Your min_length=56 must be inferior than your max_length=20.


[{'summary_text': ' Darryl George, 18, has not been in his regular Houston-area high'}]
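
The warning appears because the default `min_length` for this model (56, according to the message) is larger than the `max_length=20` that was requested. One way to avoid it is to pass both values explicitly. The sketch below assumes a hypothetical helper, `summarise_short`, and an arbitrary rule of setting `min_length` to half of `max_length`; `summarizer` is the pipeline object created above and `article_text` stands in for your article.

```python
def summarise_short(summarizer, text, max_length=20):
    """Call a summarization pipeline with a min_length that fits under max_length."""
    # Assumption: half of max_length is a reasonable lower bound; tune to taste.
    min_length = max(1, max_length // 2)
    result = summarizer(text, min_length=min_length, max_length=max_length)
    # The pipeline returns a list with one {'summary_text': ...} dictionary per input.
    return result[0]['summary_text']


# Usage in the notebook, with `summarizer` from the earlier cell:
# print(summarise_short(summarizer, article_text, max_length=20))
```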


Translate it into another language. The name of the task will be something like
"translation_en_to_de" ("de" is the language code for German). Pick your favourite
language.

Note that you will have to specify a model. Helsinki-NLP is a good organisation to start
searching in. You might have to install some supporting packages like `sentencepiece`
(and then you will have to restart the Jupyter notebook kernel).
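
Many of the Helsinki-NLP translation models on the hub follow the naming pattern `Helsinki-NLP/opus-mt-<source>-<target>`. The sketch below builds the task name and a candidate model name from two language codes. Treat the model name as a guess to check on the hub, since not every language pair exists; `translation_names` and `build_translator` are hypothetical helper names.

```python
def translation_names(source, target):
    """Return (task name, candidate model name) for a pair of language codes."""
    task = f'translation_{source}_to_{target}'
    # Assumption: the pair exists under this naming pattern; verify on the hub.
    model = f'Helsinki-NLP/opus-mt-{source}-{target}'
    return task, model


def build_translator(source, target):
    """Create a translation pipeline (downloads the model on first use)."""
    # Imported here so the naming helper above works without transformers installed.
    import transformers
    task, model = translation_names(source, target)
    return transformers.pipeline(task, model=model)


print(translation_names('en', 'de'))
```

Calling `build_translator('en', 'de')` returns a pipeline that you call with the text to translate, just like the other pipelines in this practical.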

OpenAI have a remarkably good speech-to-text model called "whisper".

The task is called "automatic-speech-recognition". There is a good model called "openai/whisper-small".

#### Speech generation

ElevenLabs.io just raised $80m and have a billion dollar valuation turning text into speech!

The quality of the default model in Hugging Face isn't quite that good though.

The name of the task is "text-to-audio". Make up a sentence and run it. It outputs a dictionary with two keys:

- `audio` (a 2D numpy array; because it's mono audio data, only one axis carries real data)

- `sampling_rate` (a single number: samples per second)

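If you install the `soundfile` package, you'll be able to run the following, where `audio_out`
is the dictionary returned by the pipeline:

```python
import soundfile as sf

# audio_out is the dictionary returned by the text-to-audio pipeline.
# Transpose so the array is laid out as (frames, channels), which soundfile expects.
sf.write('output.wav', audio_out['audio'].T, audio_out['sampling_rate'])
```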

Assuming your computer has a speaker, you can then find the `output.wav` file and play it.



## PyTorch

In the lectures, we finished up talking about how PyTorch doesn't have a built-in `.fit()` or `.train()`
method.

The code below has a training loop, but it has a fixed number of iterations.

Change it so that it has a validation dataset, and that it stops training when the accuracy
on the validation dataset is no longer improving.

To do this, you will make use of:

- `model.eval()`  This doesn't evaluate anything! It just tells the model that you are in
   evaluation mode.
- `model.train()` Likewise, this doesn't train; it just tells the model that you are
   switching back to training mode.
- `with torch.no_grad():` speeds up calculations by not tracking gradients (which would
   be irrelevant for the validation set).
- `loss.item()` gets the loss out as a Python float.
- `model()` this is how you do inference on a model: call it like a function.
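
The stopping rule itself can be kept separate from the PyTorch code. The following is a minimal sketch of one common rule (stop after `patience` epochs without a new best validation accuracy), not the only valid answer. `EarlyStopper`, `patience` and the accuracy values are made up for illustration.

```python
class EarlyStopper:
    """Track the best validation accuracy and signal when to stop."""

    def __init__(self, patience=3):
        self.patience = patience      # epochs to wait for an improvement
        self.best = float('-inf')     # best validation accuracy seen so far
        self.bad_epochs = 0           # consecutive epochs without improvement

    def should_stop(self, val_accuracy):
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative accuracies, one per epoch (not real results).
history = [0.60, 0.65, 0.70, 0.69, 0.70, 0.68, 0.71]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.should_stop(acc):
        stopped_at = epoch
        break

print('stopped at epoch', stopped_at)
```

In the real training loop, you would compute the validation accuracy at the end of each epoch (inside `model.eval()` and `torch.no_grad()`), pass it to `should_stop`, and `break` when it returns `True`.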

For your own sanity you will probably want to split this code up into a few cells, print out
some messages in each batch (or at least each epoch).

Optional bonus exercises:

- According to your model, what was the probability that a 3rd-class male passenger would die? What about a 1st-class female passenger?

- Visualise how the weights of the first layer of this model change over time.

- Make the model work better. It's particularly inaccurate at the moment.
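
For the first bonus exercise, note that the network returns raw scores (logits), not probabilities, because `CrossEntropyLoss` applies the softmax internally during training. To read off a probability, you apply softmax yourself. The sketch below does this in plain Python so the arithmetic is visible. The logit values are invented, and it assumes index 0 means "did not survive", matching the 0/1 coding of the `Survived` column. With tensors, `torch.softmax(logits, dim=1)` computes the same thing.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one passenger: [score for class 0, score for class 1].
logits = [1.2, -0.3]
p_died, p_survived = softmax(logits)
print(f'P(died) = {p_died:.3f}, P(survived) = {p_survived:.3f}')
```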

In [None]:
import torch
import pandas as pd
import sklearn.model_selection
import sklearn.preprocessing

df = pd.read_csv('titanic.csv')
df.dropna(inplace=True)
X = df.drop('Survived', axis=1).select_dtypes(include=['float64', 'int64']).values
y = df['Survived'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
scaler = sklearn.preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

# Create DataLoader instances
train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Neural network architecture
class TitanicNN(torch.nn.Module):
    def __init__(self, input_size):
        super(TitanicNN, self).__init__()
        self.layer1 = torch.nn.Linear(input_size, 64)
        self.layer2 = torch.nn.Linear(64, 32)
        self.layer3 = torch.nn.Linear(32, 2)
        # That last one is a binary classifier
        # Don't need softmax, CrossEntropyLoss handles it

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        return x
    
# We want to capture how the first-layer weights evolve during training
evolution = []
    
# Instantiate the model, loss function, and optimizer
model = TitanicNN(X_train.shape[1])
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 10000
for epoch in range(epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()