# Handling Local Data
To load datasets that are stored either on your laptop or on a remote server, we can still use the `load_dataset()` function. This time, we pass the name of the loading script for the file format, along with a `data_files` argument that specifies the path to one or more files.

!["load_dataset()"](data/chapter_5/load_dataset.png "load_dataset()")

### Loading a local dataset

| Data format | Loading script | Example |
|-------------|----------------|---------|
| CSV & TSV |`csv`|`load_dataset("csv", data_files="my_file.csv")`|
| Text files |`text`|`load_dataset("text", data_files="my_file.txt")`|
| JSON & JSON Lines |`json`|`load_dataset("json", data_files="my_file.json")`|
| Pickled DataFrames |`pandas`|`load_dataset("pandas", data_files="my_dataframe.pkl")`|

For this example, let's use the [SQuAD-it](https://github.com/crux82/squad-it/) dataset, which is a large-scale **JSON** dataset for question answering in Italian. It's hosted on GitHub, so let's first download the compressed files `SQuAD_it-train.json.gz` and `SQuAD_it-test.json.gz` into our `data/chapter_5` directory using `wget`, and then decompress them using `gzip`:

In [None]:
!cd data/chapter_5 && wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!cd data/chapter_5 && wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

!cd data/chapter_5 && gzip -dkv SQuAD_it-*.json.gz

Now that we have our data in JSON format, we can load it with the `load_dataset()` function. We just need to know whether we're dealing with **ordinary JSON** (*similar to a nested dictionary*) or **JSON Lines** (*line-separated JSON*). Like many question answering datasets, **SQuAD-it** uses the *nested format*, with all the text stored in a `data` field. This means we can load the dataset by specifying the `field="data"` argument:

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="data/chapter_5/SQuAD_it-train.json", field="data")

squad_it_dataset

As we can see, by default, loading local files creates a `DatasetDict` object with only a **train** split. What we really want is to include both the **train** and **test** splits in a single `DatasetDict` object, so that we can apply `Dataset.map()` functions across both splits at once. To do this, we can pass a dictionary of the following form to the `data_files` argument:
```python
data_files={"train":"path to the training data", "test":"path to the testing data"}
```
The dictionary maps each split name to a file associated with that split:

In [None]:
data_files = {
    "train":"data/chapter_5/SQuAD_it-train.json",
    "test":"data/chapter_5/SQuAD_it-test.json"
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
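
To make the distinction between nested JSON and JSON Lines concrete, the sketch below parses a tiny hand-written sample of each using only Python's standard `json` module. The sample records are invented for illustration and are far simpler than the real SQuAD-it schema. The nested file is a single JSON object whose rows live under a `data` key, which is what `field="data"` selects, whereas a JSON Lines file holds one complete object per line.

```python
import json

# Hypothetical miniature of a nested JSON file: one object, rows under "data".
nested_text = '{"version": "1.1", "data": [{"title": "Roma"}, {"title": "Milano"}]}'

# Hypothetical miniature of a JSON Lines file: one complete object per line.
jsonl_text = '{"title": "Roma"}\n{"title": "Milano"}\n'

# Nested JSON is parsed in one go; the rows sit inside the "data" field.
nested_rows = json.loads(nested_text)["data"]

# JSON Lines is parsed line by line; each non-empty line is already a row.
jsonl_rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(nested_rows == jsonl_rows)
```

Both layouts end up holding the same rows here; only the way of reaching them differs.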

The loading scripts in Datasets actually support automatic decompression of the input files, so we could have skipped the use of gzip by pointing the `data_files` argument directly to the compressed files:
```python
data_files = {
    "train": "data/chapter_5/SQuAD_it-train.json.gz", 
    "test": "data/chapter_5/SQuAD_it-test.json.gz"
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
This can be useful if you don’t want to manually decompress many `GZIP` files. The automatic decompression also applies to other common formats like `ZIP` and `TAR`, so you just need to point `data_files` to the compressed files.

> The `data_files` argument is also quite flexible and can be either *a single file path*, *a list of file paths*, or *a dictionary* that maps split names to file paths. You can also *glob files* that match a *specified pattern* according to the rules used by the `Unix shell` (e.g., you can glob all the `JSON` files in a directory as a single split by setting `data_files="*.json"`). See the [Datasets documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) for more details.
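
As a rough illustration of shell-style pattern matching, the sketch below filters an invented list of file names with the standard-library `fnmatch` function. This only demonstrates how a pattern such as `*.json` behaves; it is not a description of how Datasets resolves `data_files` internally, and the file names are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical directory listing, used only to illustrate pattern matching.
file_names = ["train.json", "test.json", "notes.txt", "train.json.gz"]

# "*" matches any run of characters and the pattern must match the whole name,
# so "*.json" keeps only the names that end in ".json".
json_files = [name for name in file_names if fnmatch(name, "*.json")]

print(json_files)
```

Note that `train.json.gz` is not matched, so compressed files would need their own pattern (for example `*.json.gz`).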

### Loading a remote dataset

Fortunately, loading *remote files* is just as simple as loading *local* ones!
<br />
Instead of providing a path to *local files*, we point the `data_files` argument to **one or more URLs** where the *remote files* are stored.

In [None]:
url =  "https://github.com/crux82/squad-it/raw/master/"

data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

# Data Manipulation

The `DatasetDict` object comes with many methods for manipulating the original dataset.
<br />
For this example, we’ll use the [Drug Review Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29) that’s hosted on the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient’s satisfaction.

In [None]:
!cd data/chapter_5/ && wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!cd data/chapter_5/ && unzip drugsCom_raw.zip

As we can see, the data is in the `TSV` format, which is a variant of `CSV` that uses tabs instead of commas as the separator. So, when loading these files using `load_dataset()`, we specify `csv` as the *loading script* and, most importantly, pass the `delimiter="\t"` argument:

In [None]:
from datasets import load_dataset

data_files = {
    "train" : "data/chapter_5/drugsComTrain_raw.tsv",
    "test" : "data/chapter_5/drugsComTest_raw.tsv"
}

drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
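
To see why the delimiter matters, here is a minimal sketch using Python's standard `csv` module on an invented two-line TSV sample (the row contents are made up, not taken from the drug review files). It parses the same text once with a tab delimiter and once with the default comma.

```python
import csv
import io

# Two invented lines in TSV form; the review text itself contains a comma.
tsv_text = "drugName\treview\nAspirin\tWorked well, no side effects\n"

# With delimiter="\t" the comma inside the review stays within a single field.
rows_tab = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

# With the default comma delimiter the same line is split in the wrong place.
rows_comma = list(csv.reader(io.StringIO(tsv_text)))

print(rows_tab[1])
print(rows_comma[1])
```

With the comma delimiter, the drug name and the first half of the review are fused into one field, which is the kind of silent corruption the `delimiter="\t"` argument avoids.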

Now that we have the `DatasetDict` object, we can take a random sample to get a quick feel for the type of data we're working with. To do so, we chain the `Dataset.shuffle()` and `Dataset.select()` methods: the first randomly shuffles the data (passing the `seed` argument makes the shuffle reproducible), and the second selects the first *n* rows:

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))

drug_sample[:3]

From the sample above, we can see that a few pre-processing steps are needed before tokenising the data or passing it to the model:
  + The `Unnamed: 0` column needs to be renamed to `patient_id`.
  + The `condition` column includes a mix of *uppercase* and *lowercase* labels.
  + The reviews in the `review` column are of varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&#039;`.

So, we can use the built-in methods: `rename_column()` to rename the column, `filter()` to drop the rows with a missing `condition`, and `map()` to lowercase the `condition` values and unescape the HTML character codes.

In [None]:
import html

# rename the column name
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0",
    new_column_name="patient_id"
)

# map condition column values to lowercase
def lowercase_condition(data):
    return {"condition": [row.lower() for row in data["condition"]]}
    # return {"condition": data["condition"].lower()} # if not using batched=True in the map() function
    

# let's first remove all the rows with null values, otherwise the above
# function will throw an error
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

# map lowercase
drug_dataset = drug_dataset.map(lowercase_condition, batched=True)


# unescape all the HTML special characters in our corpus
drug_dataset =  drug_dataset.map(
    lambda x: {"review": [html.unescape(row) for row in x["review"]]},
    batched=True
)


drug_dataset["train"][:2]
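
The difference between the batched and non-batched forms of a mapping function can be sketched without the library at all. In the hedged example below, plain dictionaries stand in for what `map()` would pass to the function: a single row in the non-batched case, and a dictionary of lists in the batched case. The rows are invented and the function names `lowercase_one` and `lowercase_batch` are hypothetical.

```python
import html

def lowercase_one(example):
    # Non-batched form: `example` is one row, so each value is a single item.
    return {"condition": example["condition"].lower()}

def lowercase_batch(batch):
    # Batched form: `batch` maps each column name to a list of values.
    return {"condition": [value.lower() for value in batch["condition"]]}

# Hypothetical inputs, invented for illustration.
row = {"condition": "Birth Control", "review": "I&#039;m happy"}
batch = {"condition": ["Birth Control", "ACNE"], "review": ["I&#039;m happy", "ok"]}

print(lowercase_one(row))
print(lowercase_batch(batch))

# html.unescape turns character codes such as &#039; back into characters.
print(html.unescape(row["review"]))
```

The batched form does the same work per value, but it is called once per batch rather than once per row, which is why the cell above uses list comprehensions together with `batched=True`.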

> In Python, `lambda` functions are small functions that you can define without explicitly naming them. They take the general form `lambda <arguments> : <expression>`, where `lambda` is one of Python's special keywords, `<arguments>` is a comma-separated list of parameter names that define the *inputs* to the function, and `<expression>` represents the operation you wish to execute. For example, we can define a simple lambda function that squares a number as follows: `lambda x : x * x`. To apply this function to an input, we need to wrap it and the input in parentheses: `(lambda x: x * x)(3)`, which evaluates to `9`.

### From Datasets to DataFrames and back

We can use the `set_format()` method of the `DatasetDict` object to change its output format to one of *Pandas*, *NumPy*, *PyTorch*, *TensorFlow*, or *JAX*. This only changes how rows are returned when we index the dataset. To switch back to the default format, we simply need to call the `reset_format()` method:

In [None]:
drug_dataset.set_format("pandas")

drug_dataset["train"][:3]

In [None]:
drug_dataset.reset_format()

drug_dataset["train"][:3]

### Creating a validation set
`Dataset` objects also provide a `Dataset.train_test_split()` method, modelled on the well-known function from `scikit-learn`, which can be used to further split the data into train, validation, and test sets.


In [None]:
# 80-20 percent train-validation split on the training dataset
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=41)

# name the 20% split data as the validation
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")

# Add the original test dataset
drug_dataset_clean["test"] = drug_dataset["test"]

drug_dataset_clean

### Saving a dataset
To save a dataset to disk:

| Data format | Function |
|-------------|----------|
|*Arrow*|`Dataset.save_to_disk()`|
|*CSV*|`Dataset.to_csv()`|
|*JSON*|`Dataset.to_json()`|

For example, let’s save our cleaned dataset in the Arrow format:

In [None]:
drug_dataset_clean.save_to_disk("data/chapter_5/drug-reviews")

!ls data/chapter_5/drug-reviews/*

Once the dataset is saved, we can load it by using the `load_from_disk()` function as follows:

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("data/chapter_5/drug-reviews")
drug_dataset_reloaded

For the **CSV** and **JSON** formats, we have to store each split as a separate file. One way to do this is by iterating over the keys and values in the `DatasetDict` object. This saves each split in JSON Lines format, where each row in the dataset is stored as a single line of JSON.

In [None]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"data/chapter_5/drug-reviews-{split}.jsonl")

And to load the data we can simply use the `load_dataset()` function:

In [None]:
data_files = {
    "train": "data/chapter_5/drug-reviews-train.jsonl",
    "validation": "data/chapter_5/drug-reviews-validation.jsonl",
    "test": "data/chapter_5/drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

drug_dataset_clean = drug_dataset_reloaded

drug_dataset_clean

## Example

Let's train a classifier that can predict the patient condition based on the drug review.
### 1. Download the data.
We are re-downloading the data because we want to clean it more thoroughly this time:

In [None]:
from datasets import load_dataset

# download the data
!cd data/chapter_5/ && curl -O "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!cd data/chapter_5/ && unzip -o drugsCom_raw.zip


# load the data
data_files = {
    "train" : "data/chapter_5/drugsComTrain_raw.tsv",
    "test" : "data/chapter_5/drugsComTest_raw.tsv"
}

drug_dataset = load_dataset(
    "csv",
    data_files=data_files,
    delimiter='\t'
)

drug_dataset

### 2. Merge the splits together

In [None]:
from datasets import concatenate_datasets

# pop the testing data out of the drug dataset
testing_data = drug_dataset.pop("test")
# merge the splits
drug_dataset["train"] = concatenate_datasets([drug_dataset["train"], testing_data])

drug_dataset

### 3. Initial data filtering phase, where we are:

+ Changing the column name from `Unnamed: 0` to `patient_id`.
+ Removing the rows that do not have anything in their `condition` column.
+ Setting all the values inside the `condition` column to *lowercase*.
+ Converting the HTML character codes in the `review` column into a readable format, i.e., `unescape`.
+ Removing the rows whose `review` is shorter than a certain number of words.

In [None]:
import html

# rename the column
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0",
    new_column_name="patient_id"
)


# remove the rows with an empty condition
drug_dataset = drug_dataset.filter(
    lambda batch: [condition is not None for condition in batch["condition"]],
    batched=True,
    desc="Removing empty Condition rows"
)


## lowercase function
def lowercase_condition(data):
    return {"condition": [row.lower() for row in data["condition"]]}

## map lowercase
drug_dataset = drug_dataset.map(
    lowercase_condition,
    batched=True,
    desc="Mapping Condition values to lowercase"
)


# unescape all the special characters
drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(review_row) for review_row in x["review"]]},
    batched=True,
    desc="Mapping HTML Unescape over Review values"
)

drug_dataset

Now, whenever we are dealing with *customer reviews*, it is a good practice to check the *number of words* in each *review*. A *review* might be just a *single word* like *“Great!”* or a *full-blown essay with thousands of words*, and depending on the use case you’ll need to handle these extremes differently.
<br />
In our case, some *reviews* contain just a single word, which, although it may be okay for **sentiment analysis**, would not be informative when predicting a *condition*. So, to compute the number of words in each review, we'll use a rough heuristic based on splitting each text on whitespace, and then use the `filter()` function to remove reviews that contain fewer than **30 words**:

In [None]:
# return a new column holding the word count of each row's review
def compute_review_length(data):
    return {"review_length": [len(row.split()) for row in data["review"]]}

# map the review_length column
drug_dataset  = drug_dataset.map(
    compute_review_length,
    batched=True,
    desc="Mapping review_length column"
)

# filter out rows whose review_length is less than 30
drug_dataset = drug_dataset.filter(
    lambda batch: [review_length >= 30 for review_length in batch["review_length"]],
    batched=True,
    desc="Removing rows with review length less than 30"
)

drug_dataset 
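
The whitespace heuristic itself is easy to try on a few strings. The sketch below uses invented reviews and a hypothetical threshold of 3 words (rather than the 30 used above) so that the effect is visible on a tiny sample; `count_words` is an illustrative helper, not part of any library.

```python
def count_words(text):
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces and line breaks do not produce empty tokens.
    return len(text.split())

# Invented reviews for illustration.
reviews = ["Great!", "It  helped\r\nafter two weeks", ""]
lengths = [count_words(review) for review in reviews]

# Keep only reviews that meet a hypothetical minimum length of 3 words.
min_words = 3
kept = [review for review, n in zip(reviews, lengths) if n >= min_words]

print(lengths)
print(kept)
```

Because the split is on whitespace only, punctuation stays attached to words, which is why this is a rough estimate rather than a true token count.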

### 4. Setting up the `labels` column.

Let's first see how many unique *conditions* there are in the dataset, using the built-in `unique()` function:


In [None]:
print(f"There are {len(drug_dataset.unique('condition')['train'])} unique conditions in the dataset")

As we can see, there are `853` unique conditions in the `condition` column. Let's have a look at their distribution, select only the 5 most frequent conditions as the *labels* for this *multi-class classification task*, and remove all of the others.
<br />
We can use the `Counter` class from the `collections` module to get the distribution over the `condition` column.

In [None]:
from collections import Counter

train_counts = Counter(drug_dataset["train"]["condition"])

print(f"Conditions distribution:\n\t{train_counts}")

We can see that `birth control`, `depression`, `acne`, `anxiety`, and `pain` are the 5 conditions that occur most often in our dataset. So let's now filter out all the rows whose *condition* is not one of these, and then rename the `condition` column to `labels` (because that is the name our model will require):


In [None]:
allowed_conditions = ['birth control', 'depression', 'pain', 'anxiety', 'acne']

drug_dataset = drug_dataset.filter(
    lambda batch: [condition in allowed_conditions for condition in batch["condition"]],
    batched=True
)

drug_dataset = drug_dataset.rename_column(
    original_column_name="condition",
    new_column_name="labels"
)

conditions_label = drug_dataset.unique("labels")["train"]
print(f"Now there are only {len(conditions_label)} conditions label, which are:\n\t{conditions_label}")

drug_dataset
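
Rather than typing the most frequent conditions by hand, they could also be derived from the counts. The following is a minimal sketch on an invented label list, assuming that `Counter.most_common()` is an acceptable way to pick the top entries; the variable names are hypothetical.

```python
from collections import Counter

# Invented label column for illustration.
conditions = ["acne", "pain", "acne", "anxiety", "pain", "acne", "cough"]

counts = Counter(conditions)

# most_common(n) returns the n most frequent (value, count) pairs, highest first.
top_two = [name for name, _ in counts.most_common(2)]

# Membership tests are cheaper on a set than on a list.
allowed = set(top_two)
filtered = [condition for condition in conditions if condition in allowed]

print(top_two)
print(filtered)
```

The same idea would apply with `most_common(5)` on the real counts, at the cost of making the chosen labels depend on whatever data is loaded.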

### 5. Encoding the labels into ClassLabels
Since this is a *multi-class classification* task, we need to convert the text values in the `labels` column (`birth control`, `depression`, `pain`, `anxiety` and `acne`) into discrete numerical values, i.e., `ClassLabels`, so that they can serve as **labels** for the model. Luckily, the `DatasetDict` object has a `class_encode_column()` method that handles this for us and returns the encoded dataset:

In [None]:
# encode the labels to the right form
drug_dataset = drug_dataset.class_encode_column("labels")

print(drug_dataset["train"].features["labels"])

label_features = drug_dataset["train"].features["labels"]
label_names = label_features.names

for label in label_names:
    print(f"{label} -> {label_features.str2int(label)}")
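
Conceptually, the encoding is a two-way lookup between label names and integer ids. The sketch below mimics that lookup with plain dictionaries. It assumes the ids follow the alphabetical order of the names, which should be checked against the output printed above rather than taken for granted. The names `sketch_names`, `str2int` and `int2str` here are plain stand-ins, not the library objects.

```python
# The five label names from the previous step, sorted alphabetically.
# Assumption: integer ids follow this sorted order.
sketch_names = sorted(["birth control", "depression", "pain", "anxiety", "acne"])

# Plain-dict stand-ins for name -> id and id -> name lookups.
str2int = {name: idx for idx, name in enumerate(sketch_names)}
int2str = {idx: name for name, idx in str2int.items()}

encoded = [str2int[name] for name in ["pain", "acne", "pain"]]
decoded = [int2str[idx] for idx in encoded]

print(str2int)
print(encoded, decoded)
```

Decoding the encoded ids gives back the original names, which is the property the model's predictions rely on when we later translate them into conditions.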

### 6. Splitting the Dataset
Now that we have `67844` *ClassLabels-encoded* rows in total, let's split the dataset into `train`, `validation`, and `test` sets with a 70-20-10 percentage ratio, respectively. However, there is one rule we need to follow when splitting:
<br />
In Machine Learning, **stratification** refers to the practice of ensuring that the distribution of labels is consistent across the `train`, `validation`, and `test` datasets. This means that if the *training* dataset contains `60%` of label `x` and `40%` of label `y` (e.g., `6` rows of `x` and `4` rows of `y` out of `10` total), then the *validation* and *test* sets should also maintain the same proportions - `60% x` and `40% y`, respectively.


To split the data we will use the built-in `train_test_split()` method, passing the `stratify_by_column="labels"` argument to stratify the splits on the `ClassLabels` in the `labels` column.

In [None]:
import datasets

# Split off 70% train, 30% temporary (for validation + test)
train_valtest = drug_dataset["train"].train_test_split(
    test_size=0.3,
    seed=41,
    stratify_by_column="labels"
)

# Split 30% temporary into 20% validation and 10% test
val_test = train_valtest["test"].train_test_split(
    test_size=1/3,  # 1/3 of 30% = 10%
    seed=41,
    stratify_by_column="labels"
)


# Recombine into a final DatasetDict
drug_dataset_final = datasets.DatasetDict(
    {
        "train" : train_valtest["train"],
        "validation" : val_test["train"],
        "test" : val_test["test"]
    }
)
drug_dataset_final
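
To illustrate what stratification achieves, here is a small pure-Python sketch that splits indices label by label so that each side keeps roughly the same label proportions. It is a simplified illustration on invented labels, not the implementation used by `train_test_split()`, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with roughly equal label proportions."""
    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Send the same fraction of every label to the test side.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Invented labels: 60% "x" and 40% "y".
labels = ["x"] * 60 + ["y"] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=41)

train_counts_sketch = Counter(labels[i] for i in train_idx)
test_counts_sketch = Counter(labels[i] for i in test_idx)
print(train_counts_sketch)
print(test_counts_sketch)
```

Both sides keep the 60/40 ratio of `x` to `y`, which a plain random split would only achieve approximately.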

We have quite a lot of data, `65k+` rows in total. To make the training process faster, let's take only a randomly shuffled `10%` sample of each split for training and evaluating the model.
> Note: if you would like to train the model on the whole dataset, simply skip the cell below.

In [None]:
# to only take 10% of the data per split 
pct = 0.1

drug_dataset_final["train"] = drug_dataset_final["train"].shuffle(seed=42).select(range(int(pct*len(drug_dataset_final["train"]))))
drug_dataset_final["validation"] = drug_dataset_final["validation"].shuffle(seed=42).select(range(int(pct*len(drug_dataset_final["validation"]))))
drug_dataset_final["test"] = drug_dataset_final["test"].shuffle(seed=42).select(range(int(pct*len(drug_dataset_final["test"]))))

drug_dataset_final

Let's check whether the ClassLabels distribution is consistent across the splits:

In [None]:
def print_distributions(split, dataset):
    print(f"{split} ClassLabels distribution:")

    # get the distribution numbers
    class_labels_counts = Counter(dataset[split]["labels"])

    # distribution dictionary
    dist = {}
    # for every label 
    for label_id, count in class_labels_counts.items():
        # get the label name using the label id
        label_name = dataset[split].features["labels"].int2str(label_id)
        # compute the percentage
        pct = round((count/len(dataset[split]))*100, 2)

        # add it to the distribution dict
        dist[label_id] = (label_name, pct)
    
    # sort the distribution dict on label id and print the data
    sorted_dist = dict(sorted(dist.items()))
    for label_id, name_pct in sorted_dist.items():
        print(f"\t id - {label_id} {name_pct[0]} : {name_pct[1]}% ")


print_distributions("train", drug_dataset_final)

print_distributions("validation", drug_dataset_final)

print_distributions("test", drug_dataset_final)

### 7. Initialise the Model and other config

Now that we have our final dataset, let's:
+ load the tokeniser and the model, 
+ tokenise the data and prepare it for training in one go. 

> Note: When initialising the model, we also have to specify the `num_labels=5` argument because we are training the model for a multi-class classification task and there are `5` labels in total:

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from pprint import pprint

checkpoint = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(checkpoint)
# initialise the model and also specify the number of labels
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)
data_collator = DataCollatorWithPadding(tokenizer=tokeniser)

def tokenisation_function(data):
    return tokeniser(data['review'], truncation=True)

tokenised_datasets = drug_dataset_final.map(
    tokenisation_function,
    batched=True
)


pprint(tokenised_datasets)


tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=['patient_id', 'drugName', 'review', 'rating', 'date', 'usefulCount', 'review_length']
)

tokenised_datasets.set_format("torch")

pprint(tokenised_datasets)

### 8. Set up the dataloader

> Note: if you skipped the 10% sampling in `step 6` and are training on all `65k+` rows, it may be worth keeping `train_batch_size` at `16` or lower if your GPU does not have a lot of memory. Similarly, we set `eval_batch_size` and `test_batch_size` to `64`, because with smaller batches the evaluation stage takes much longer.

In [None]:
train_batch_size = 16
eval_batch_size = min(64, len(tokenised_datasets["validation"]))
test_batch_size = min(64, len(tokenised_datasets["test"]))

train_dataloader = DataLoader(
    dataset=tokenised_datasets["train"],
    batch_size=train_batch_size,
    shuffle=True,
    collate_fn=data_collator
)

eval_dataloader = DataLoader(
    dataset=tokenised_datasets["validation"],
    batch_size=eval_batch_size,
    collate_fn=data_collator
)

test_dataloader = DataLoader(
    dataset=tokenised_datasets["test"],
    batch_size=test_batch_size,
    collate_fn=data_collator
)


print(f"So there are,\n\t{len(train_dataloader)} batches of size {train_batch_size} in the training dataset,\n\t{len(eval_dataloader)} batches of size {eval_batch_size} in the evaluation dataset, and\n\t {len(test_dataloader)} batches of size {test_batch_size} in the test dataset")
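
The batch counts printed above follow from simple arithmetic. As a minimal sketch, assuming the `DataLoader` default of `drop_last=False` (which the cell above does not override), the number of batches is the dataset size divided by the batch size, rounded up. The helper `num_batches` and the split sizes below are hypothetical.

```python
import math

def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches produced for a dataset of `num_examples` rows."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_examples // batch_size
    # Otherwise the final, possibly smaller, batch is kept.
    return math.ceil(num_examples / batch_size)

# Hypothetical split sizes, chosen only for illustration.
print(num_batches(1000, 16))
print(num_batches(130, 64))
```

This is also why the last batch of an epoch can be smaller than the configured batch size.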

### 9. Set up the *accelerator*, *optimiser* and *learning rate scheduler* objects:

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import get_scheduler

# optimiser
optimiser = AdamW(
    params=model.parameters(),
    lr=2e-5
)

# accelerator
accelerator = Accelerator()
# preparing accelerator objects
train_dl, eval_dl, test_dl, model, optimiser = accelerator.prepare(
    train_dataloader,
    eval_dataloader,
    test_dataloader,
    model,
    optimiser
)

num_epochs = 5
num_training_steps = num_epochs * len(train_dl)
# 10% warmup
num_warmup_steps = int(.1 * num_training_steps)

lr_schedular = get_scheduler(
    name="linear",
    optimizer=optimiser,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

print(f"Total training steps {num_training_steps}")
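
To get a feel for what a linear schedule with warmup does to the learning rate, the sketch below computes the multiplier applied to the base learning rate at a few steps. It is written from the usual definition of such a schedule (ramp up linearly during warmup, then decay linearly to zero) and is not copied from the library. The helper `linear_warmup_factor` and the step counts are hypothetical.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical run: 100 steps in total with 10% warmup.
total_steps, warmup_steps, base_lr = 100, 10, 2e-5

for step in (0, 5, 10, 55, 100):
    factor = linear_warmup_factor(step, warmup_steps, total_steps)
    print(step, base_lr * factor)
```

The learning rate therefore peaks at the base value of `2e-5` exactly when warmup ends, and the 10% warmup above simply sets where that peak falls.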

### 10. Evaluation Metric Setup

This time there is no predefined evaluation metric for the dataset, so we have to decide which metrics to use when evaluating our model.
<br />
For a *classification* task, commonly used metrics are **Accuracy**, **Precision**, **Recall** and **F1 Score**. The latter three are derived from a **Confusion Matrix**, which is an `N x N` matrix, where `N` is the *number of classes or categories* to be predicted. For a given class, each prediction counted in the *confusion matrix* falls into one of these 4 cases:
+ **True Positives (TP)** : It is the case where we predicted Yes and the real output was also Yes.
+ **True Negatives (TN)**: It is the case where we predicted No and the real output was also No.
+ **False Positives (FP)**: It is the case where we predicted Yes but it was actually No.
+ **False Negatives (FN)**: It is the case where we predicted No but it was actually Yes. 

For example, suppose we have a binary classification problem with the labels `Yes` and `No`. Here `N = 2`, therefore we get a `2 x 2` *confusion matrix*. Now let's say we tested our model on 165 samples and the resulting *confusion matrix* looks like this:

|              |Predicted No|Predicted Yes|
|--------------|------------|------------|
|**Actual No** |50|10|
|**Actual Yes**|5|100|

Therefore, out of the 165 predictions, `100` predictions were **TP** (bottom right), `50` were **TN** (top left), `10` were **FP** (top right), and `5` were **FN** (bottom left).


These values are useful because we can use them to calculate **Precision**, **Recall** and **F1 Score**:
+ **Precision**: It measures how many of the positive predictions made by the model are actually correct. It's useful when the cost of false positives is high such as in medical diagnoses where predicting a disease when it’s not present can have serious consequences. Therefore, *Precision* helps ensure that when the model predicts a positive outcome, it’s likely to be correct.
$$
\text{Precision} = \frac{TP}{TP+FP}
$$
+ **Recall**: *Recall*, or *Sensitivity*, measures how many of the actual positive cases were correctly identified by the model. It is important when missing a positive case (a *false negative*) is more costly than raising a false positive (as in disease detection).
$$
\text{Recall} = \frac{TP}{TP+FN}
$$
+ **F1 Score**: The *F1 Score* is the *harmonic mean* of *precision* and *recall*. It is useful when we need a balance between *precision* and *recall*, as it combines both into a single number. A *high F1 score* means the model performs well on both metrics, i.e., the model is performing well. Its range is `[0,1]`:
$$
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
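
As a worked check, the counts from the `2 x 2` confusion matrix above (TP = 100, TN = 50, FP = 10, FN = 5) can be plugged into these definitions. The accuracy line uses the standard definition, correct predictions divided by all predictions, which is not spelled out in the text above.

```python
# Counts taken from the 2 x 2 confusion matrix in the example above.
tp, tn, fp, fn = 100, 50, 10, 5

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision: share of predicted positives that were truly positive.
precision = tp / (tp + fp)
# Recall: share of actual positives that were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```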
When you have multiple classes, you still often want a single precision/recall/F1 number, but how you combine the per-class scores depends on whether you care more about rare classes, common classes, or every example equally. This is where an *averaging strategy* comes in. Here's what each averaging strategy does:

+ **Weighted**: Compute each class’s score, then average them but weight by how many true examples each class has - so common labels count more.

+ **Micro**: Pool all true/false positives and negatives across every example, then compute one overall score - every prediction is equal (large classes dominate).

+ **Macro**: Compute each class's score and then take the simple average, so every class counts the same, no matter how many examples it has.
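
The three strategies can be compared on a small example. The sketch below computes precision under each of them from invented per-class counts for three hypothetical classes `a`, `b` and `c`, where `a` is by far the most common. It is a hand calculation for illustration, not a call into the `evaluate` library.

```python
# Invented 3-class example: per-class true positives, false positives,
# and support (the number of true examples of each class).
tp = {"a": 90, "b": 5, "c": 1}
fp = {"a": 2, "b": 12, "c": 4}
support = {"a": 100, "b": 10, "c": 4}

# Precision of each class taken on its own.
per_class = {k: tp[k] / (tp[k] + fp[k]) for k in tp}

# Macro: plain mean of the per-class scores.
macro = sum(per_class.values()) / len(per_class)

# Weighted: mean of the per-class scores, weighted by support.
weighted = sum(per_class[k] * support[k] for k in tp) / sum(support.values())

# Micro: pool the counts first, then compute a single score.
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f}")
```

In this example the macro score is pulled down by the two rare, poorly predicted classes, while the weighted and micro scores stay close to the performance on the dominant class.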

> NOTE: On imbalanced data, a model can reach **high accuracy** while still missing a large number of instances of the rarer classes (low **recall**). That's why **accuracy** alone is not a good metric when evaluating a model, and reporting **Recall**, **Precision** and **F1 score** alongside it, where possible, is good practice.

Luckily, the `evaluate` library provides a `combine()` function that lets us bundle several metrics together, and when calling `compute()` we can pass the `average` argument to specify which averaging strategy to use:

In [None]:
import evaluate
import torch

def perform_evaluation():
    """
    Perform evaluation on the validation set
    """
    # Set model to evaluation mode
    model.eval()

    eval_epoch_loss = []

    # initialising evaluation metrics
    ## accuracy
    eval_acc_metric = evaluate.load("accuracy") 
    ## f1 score
    eval_f1_metric = evaluate.load("f1")
    ## precision & recall
    eval_specific_metric = evaluate.combine(
        evaluations=[
            "precision",
            "recall"
        ]
    )

    # for every validation batch
    for batch in eval_dl:
        # Disable gradient computation for evaluation (saves memory and computation)
        with torch.no_grad():
            # pass the input to the model
            outputs = model(**batch)
            # Store loss inside no_grad for memory efficiency
            eval_epoch_loss.append(outputs.loss.item())

            # Get predictions for metrics (logits already created without gradients)
            logits = outputs.logits
            refs = batch["labels"]
            preds = torch.argmax(logits, dim=-1)

            # Add preds and refs to evaluation metrics
            ## accuracy
            eval_acc_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
            ## f1 score
            eval_f1_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
            ## precision & recall
            eval_specific_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
    
    # compute the average loss
    eval_avg_loss = sum(eval_epoch_loss) / len(eval_epoch_loss)

    # dict to store the metrics stats
    eval_pred_stats = {}
    # compute accuracy and add it to the dict
    eval_pred_stats.update(eval_acc_metric.compute())
    # compute the f1 score, with 'weighted' as the averaging strategy
    # and update the dict with the metric
    eval_pred_stats.update(
        eval_f1_metric.compute(
            average="weighted",
            labels= list(range(len(label_names))) # ClassLabels
        )
    )
    # compute precision and recall, with 'weighted' as the averaging strategy
    # and update the dict with the metric. 
    eval_pred_stats.update(
        eval_specific_metric.compute(
            average="weighted",
            zero_division=0, # when there is a zero in the denominator, replace the result with 0
            labels= list(range(len(label_names))) # ClassLabels
        )
    )

    return eval_avg_loss, eval_pred_stats

We would also like to evaluate our model on the test data once training is complete, because testing on untouched data gives a more faithful measure of how the model will perform on new examples, free of any tuning we did against the training and validation data. 
<br />
So, let's write the evaluation function for the test data. This time we also load the *confusion_matrix* metric from `evaluate`, alongside the other metrics, to further evaluate the model on the test data.

> Note: when evaluating the model on the test data we don't need to look at the loss value

In [None]:
def test_evaluation():
    """
    Perform evaluation on the test set
    """
    # Set model to evaluation mode
    model.eval()

    # initialising evaluation metrics
    ## accuracy
    test_acc_metric = evaluate.load("accuracy")
    ## f1 score
    test_f1_metric = evaluate.load("f1")
    ## precision & recall
    test_specific_metric = evaluate.combine(
        evaluations=[
            "precision",
            "recall"
        ]
    )
    ## confusion matrix
    test_cm_metric = evaluate.load("confusion_matrix")

    # for every test batch
    for batch in test_dl:
        # Disable gradient computation for evaluation (saves memory and computation)
        with torch.no_grad():
            # pass the input to the model
            outputs = model(**batch)

            # Get predictions for metrics (logits already created without gradients)
            logits = outputs.logits
            refs = batch["labels"]
            preds = torch.argmax(logits, dim=-1)

            # Add preds and refs to evaluation metrics
            ## accuracy
            test_acc_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
            ## f1 score
            test_f1_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
            ## precision & recall
            test_specific_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
            ## confusion matrix
            test_cm_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
    
    # dict to store the metrics stats
    test_pred_stats = {}
    # compute accuracy and add it to the dict
    test_pred_stats.update(test_acc_metric.compute())
    # compute the f1 score, with 'weighted' as the averaging strategy
    # and update the dict with the metric
    test_pred_stats.update(
        test_f1_metric.compute(
            average="weighted",
            labels= list(range(len(label_names))) # ClassLabels
        )
    )
    # compute precision and recall, with 'weighted' as the averaging strategy
    # and update the dict with the metric. 
    test_pred_stats.update(
        test_specific_metric.compute(
            average="weighted",
            labels= list(range(len(label_names))), # ClassLabels
            zero_division=0  # when there is a zero in the denominator, replace the result with 0
        )
    )
    # compute the confusion matrix 
    test_pred_stats.update(
        test_cm_metric.compute(
            labels= list(range(len(label_names)))  # ClassLabels
        )
    )

    return test_pred_stats
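
As a rough illustration of what the confusion matrix contains, the sketch below counts (true class, predicted class) pairs in plain Python. It assumes the common convention that rows are true classes and columns are predicted classes. The function and the toy labels are hypothetical and are not the metric's actual implementation.

```python
def confusion_matrix(refs, preds, num_labels):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_labels for _ in range(num_labels)]
    for ref, pred in zip(refs, preds):
        matrix[ref][pred] += 1
    return matrix

refs = [0, 0, 1, 1, 2]
preds = [0, 1, 1, 1, 2]
cm = confusion_matrix(refs, preds, num_labels=3)
for row in cm:
    print(row)
```

Correct predictions accumulate on the diagonal, so off-diagonal cells show which pairs of classes the model confuses.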

Let's also write plotting functions so that we can visualise the confusion matrix and the other metrics from the test evaluation:

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def plot_confusion_matrix(confusion_matrix, label_names):
    disp = ConfusionMatrixDisplay(confusion_matrix, display_labels=label_names)
    disp.plot(cmap=plt.cm.Blues, values_format='d')  # Use '.2f' for float
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()


def plot_metrics(metrics_dict):
    names = list(metrics_dict.keys())
    values = list(metrics_dict.values())

    plt.figure(figsize=(8, 4))
    plt.barh(names, values, color='skyblue')
    plt.xlabel("Score")
    plt.title("Evaluation Metrics")
    plt.xlim(0, 1)  # If all values are between 0 and 1
    plt.grid(True, axis='x', linestyle='--', alpha=0.6)
    plt.tight_layout()
    plt.show()


### 11. Training

Let's write the training function first:

In [None]:
from livelossplot import PlotLosses
from tqdm.notebook import tqdm

# training progress bar
progress_bar = tqdm(range(num_training_steps))

def training_function():
    # initialise the plotter for the learning curve
    plotter = PlotLosses(mode='notebook')

    # for every epoch
    for epoch in range(num_epochs):
        # ensure model is in training mode
        model.train()

        # store loss per batch 
        train_epoch_loss = []

        # metrics for training data
        ## accuracy
        train_acc_metric = evaluate.load("accuracy")
        ## f1 score
        train_f1_metric = evaluate.load("f1")
        ## precision & recall
        train_specific_metric = evaluate.combine(
            evaluations=[
                "precision",
                "recall"
            ]   
        )

        # for every batch in the training set
        for batch in train_dl:
            # Forward Pass (keep gradient attached)
            outputs = model(**batch)
            ## get the loss
            loss = outputs.loss

            # Backward Pass (while gradients are still attached)
            ## compute gradients
            accelerator.backward(loss)
            ## gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            ## Nudges the weights ("knobs") in the right direction based on the gradient
            optimiser.step()
            ## Update the learning-rate scheduler
            lr_schedular.step()
            ## Reset gradients to zero so they don’t accumulate
            ## into the next batch.
            optimiser.zero_grad()


            # metric computation
            with torch.no_grad():
                # detach loss for metric computation 
                train_epoch_loss.append(loss.detach().item())

                # detach logits for metric computation
                logits = outputs.logits.detach()
                # no need to detach labels (they don't have gradients)
                refs = batch['labels']
                preds = torch.argmax(logits, dim=-1)

                # add preds and refs to the train metrics
                ## accuracy
                train_acc_metric.add_batch(
                    predictions=accelerator.gather(preds),
                    references=accelerator.gather(refs)
                )
                ## f1 score
                train_f1_metric.add_batch(
                    predictions=accelerator.gather(preds),
                    references=accelerator.gather(refs)
                )
                ## precision and recall
                train_specific_metric.add_batch(
                    predictions=accelerator.gather(preds),
                    references=accelerator.gather(refs)
                )
            
            # update the progress bar by 1 step
            progress_bar.update(1)

        # training average loss
        tain_avg_loss = sum(train_epoch_loss)/len(train_epoch_loss)

        ## dict to store the metrics stats
        train_pred_stats = {}
        # compute accuracy and add it to the dict
        train_pred_stats.update(train_acc_metric.compute())
        # compute the f1 score, with 'weighted' as the averaging strategy
        # and update the dict with the metric
        train_pred_stats.update(
            train_f1_metric.compute(
                average="weighted",
                labels= list(range(len(label_names))) # ClassLabels
            )
        )
        # compute precision and recall, with 'weighted' as the averaging strategy
        # and update the dict with the metric. 
        train_pred_stats.update(
            train_specific_metric.compute(
                average="weighted",
                labels= list(range(len(label_names))), # ClassLabels
                zero_division=0 # when there is a zero in the denominator, replace the result with 0
            )
        )


        # evaluation phase
        eval_avg_loss, eval_pred_stats = perform_evaluation()

        # update the learning curve
        plotter.update({
            'loss': tain_avg_loss,
            'val_loss': eval_avg_loss,
            'acc': train_pred_stats['accuracy'],
            'val_acc': eval_pred_stats['accuracy'],
            'precision': train_pred_stats['precision'],
            'val_precision': eval_pred_stats['precision'],
            'recall': train_pred_stats['recall'],
            'val_recall': eval_pred_stats['recall'],
            'f1': train_pred_stats['f1'],
            'val_f1': eval_pred_stats['f1'],
        })
        plotter.send() 


    print("\n\n\n################# Test dataset Evaluation:\n\n")
    # After the model is totally trained, perform evaluation on the test dataset
    test_pred_stats = test_evaluation()

    confusion_matrix = test_pred_stats.pop("confusion_matrix")
    # plot the metrics
    plot_confusion_matrix(confusion_matrix, label_names)
    plot_metrics(test_pred_stats)

Finally, let's launch the training loop with `num_processes=1`, as my machine has only one dedicated GPU:

In [None]:
from accelerate import notebook_launcher

# launch the accelerator-based training function on one GPU
notebook_launcher(training_function, num_processes=1)

### 12. Performance Analysis

As the learning curves show, the validation metrics (accuracy, F1, precision, recall) were unexpectedly higher than the training metrics during the first epoch. This behavior, though uncommon, can occur due to factors such as *dropout* being applied only during training, the use of a *pretrained model* that already performs well on validation data, or inconsistencies in metric aggregation (e.g., batch-wise vs. full set). As training progressed, the model quickly improved on the training set, with training metrics surpassing validation scores by the second epoch. However, validation performance *plateaued* and the validation *loss* began to rise slightly after epoch 2, an early sign of *overfitting*. This suggests that **early stopping** (around epoch 2 to 3) and **stronger regularization** or **data augmentation** may help maintain generalization.
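
As a sketch of the early-stopping idea, the helper below tracks the best validation loss and signals a stop once it has failed to improve for `patience` epochs. The class name, its parameters, and the loss values are hypothetical and only illustrate the logic; this is not part of the training code in this notebook.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=1, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss  # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# made-up validation losses that start rising after the second epoch
stopper = EarlyStopping(patience=1)
stopped_at = None
for epoch, val_loss in enumerate([0.40, 0.31, 0.33, 0.36], start=1):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
print(stopped_at)
```

In a loop like ours, one option would be to call `should_stop(eval_avg_loss)` right after the evaluation phase of each epoch and break out of the epoch loop when it returns `True`.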


On the held-out test set, the model demonstrated strong and consistent performance across all key metrics (accuracy, precision, recall, and F1 score), all approaching or above 0.93. The confusion matrix further supports this, showing high classification accuracy across categories like birth control, depression, and anxiety, with only minor misclassifications—most notably, some confusion between acne and birth control, and between depression and anxiety. Overall, the model generalizes well to unseen data and handles class separation effectively, confirming the effectiveness of the training approach despite the early metric inversion.


# Managing Big Data
Nowadays, it's common for datasets used to train models from scratch to range from multiple gigabytes to several terabytes. In such cases, even loading the data can be challenging - especially when hardware is limited, such as having restricted RAM or GPU memory.
<br />
Fortunately, the Hugging Face datasets library is designed to handle these challenges:

- It addresses memory management issues by treating datasets as memory-mapped files, enabling efficient access without loading the entire dataset into RAM.

- It also offers a streaming feature that allows you to access data on-the-fly. This is especially useful when you can’t store a large dataset locally - data is downloaded and processed one sample at a time, without requiring the full dataset to be downloaded first.

You don’t need to do anything special to benefit from memory-mapping - it works automatically in all the examples we’ve seen so far.
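
To give a feel for what memory-mapping means at the operating-system level, here is a small sketch using Python's built-in `mmap` module. It is not how `datasets` implements it (the library relies on Apache Arrow files); the temporary file and its contents are made up for the illustration.

```python
import mmap
import os
import tempfile

# write a small file standing in for a large dataset on disk
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"row-0\nrow-1\nrow-2\n")
    path = tmp.name

with open(path, "rb") as f:
    # map the file into virtual memory; pages are read from disk only when touched
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mapped:
        second_row = mapped[6:11]  # slice bytes without reading the whole file
        total_bytes = len(mapped)

os.remove(path)
print(second_row, total_bytes)
```

The slice behaves like indexing an in-memory bytes object, yet only the pages that are actually touched need to be loaded.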

So in this section, we’ll focus on how the streaming feature works in practice.  

### Streaming dataset

To enable dataset streaming you just need to pass the `streaming=True` argument to the `load_dataset()` function.
<br />
For example, let’s load the [*HuggingFace FineWeb*](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset in *streaming* mode. It consists of 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl, and its total size is 108 TB.

> Note: we are using this dataset only for example purposes. It is mainly used to train LLMs, and its data processing pipeline is optimized for LLM performance and was run with the [datatrove](https://github.com/huggingface/datatrove/) library (a large-scale data processing library), not with `datasets`.



In [None]:
from datasets import load_dataset

fineweb_dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    streaming=True
)

With `streaming=True`, `load_dataset()` no longer returns the usual `DatasetDict`. Instead it returns an `IterableDatasetDict`, whose splits (such as `train`) are `IterableDataset` objects. As the name suggests, to access the elements of an `IterableDataset` we need to iterate over it: `iter()` creates an iterator, and `next()` fetches the next example from it:

In [None]:
next(iter(fineweb_dataset['train']))

To process the data inside an `IterableDataset`, for example during *pre-processing* or *tokenisation*, we can use `IterableDataset.map()`. The process is exactly the same as the one we used to tokenise our `DatasetDict` dataset previously; the only difference is that outputs are returned one by one as we iterate. We can also pass `batched=True`, in which case the examples are processed batch by batch; the default batch size is 1,000 and can be changed with the `batch_size` argument.

In [None]:
from transformers import AutoTokenizer
from pprint import pprint 

checkpoint = "distilbert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(checkpoint)

tokenised_datasets = fineweb_dataset.map(
    lambda data: tokeniser(data["text"], truncation=True),
    batched=True,
)

pprint(next(iter(tokenised_datasets['train'])))
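
Conceptually, a batched map over a stream can be sketched with a plain generator, as below. This is a simplification: in `datasets` the mapped function receives a dictionary of columns rather than a list of examples. The helper name `batched_map` and the toy "documents" are hypothetical.

```python
from itertools import islice

def batched_map(stream, function, batch_size=1000):
    """Lazily apply `function` to chunks of `batch_size` examples from `stream`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # the function sees a list of examples and returns a list of results
        yield from function(batch)

texts = (f"document {i}" for i in range(5))  # stand-in for a streamed dataset
lengths = batched_map(texts, lambda batch: [len(t) for t in batch], batch_size=2)
first_three = list(islice(lengths, 3))
print(first_three)
```

Nothing is computed until the result is iterated, which is why calling `map()` on a streamed dataset returns immediately.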

We can also shuffle a streamed dataset using `IterableDataset.shuffle()`, but unlike `Dataset.shuffle()` this only shuffles the elements in a predefined `buffer_size`:

> Note: In this example, we selected a random example from the first `10,000` examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (i.e., the `10,001`st example in the case above)

In [None]:
shuffled_dataset = fineweb_dataset.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset["train"]))
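
The buffer mechanism can be sketched in a few lines of plain Python. This illustrates the general idea only and is not the library's actual code; `buffered_shuffle` is a hypothetical helper.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=None):
    """Yield items from `stream` in approximately shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered item...
        buffer[index] = item     # ...and refill its slot with the next stream item
    rng.shuffle(buffer)          # the stream is exhausted: drain what is left
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=42))
print(shuffled)
```

With a buffer of three, the first emitted item can only be one of the first three items of the stream, which is why a larger `buffer_size` gives a better shuffle at the cost of more memory.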

To select the first `n` examples we can call the `IterableDataset.take()` function:

In [None]:
fineweb_dataset_head = fineweb_dataset["train"].take(5)
list(fineweb_dataset_head)

Similarly, we can use the `IterableDataset.skip()` function to skip the first `n` examples, and combine `take()` and `skip()` to create splits:

In [None]:
# Skip the first 1,000 examples and include the rest in the training set
train_split_dataset = shuffled_dataset["train"].skip(1000)
# Take the first 1,000 examples for the validation set
validation_split_dataset =  shuffled_dataset["train"].take(1000)

pprint(train_split_dataset)

pprint(validation_split_dataset)
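
In spirit, `take()` and `skip()` behave like `itertools.islice` applied to the stream. The sketch below uses hypothetical `take` and `skip` helpers on a plain `range` to show why the two splits do not overlap, as long as both read the source in the same order (which, in the cell above, relies on the shuffle being seeded).

```python
from itertools import islice

def take(stream, n):
    """First `n` items of an iterable."""
    return islice(stream, n)

def skip(stream, n):
    """Everything after the first `n` items of an iterable."""
    return islice(stream, n, None)

# each split re-reads the same, identically ordered source
validation = list(take(range(10), 3))
train = list(skip(range(10), 3))
print(validation, train)
```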

Lastly, we can also combine multiple datasets together to create a single corpus using the `interleave_datasets()` function. It converts a list of IterableDataset objects into a single IterableDataset, where the elements of the new dataset are obtained by alternating among the source examples.
Let's combine the above dataset with the [**FineWeb2**](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) dataset:

In [None]:
fineweb2_dataset = load_dataset(
    "HuggingFaceFW/fineweb-2",
    "aai_Latn",
    streaming=True
)

Let’s now combine the datasets with the `interleave_datasets()` function:

In [None]:
from itertools import islice
from datasets import interleave_datasets

fineweb_train_combined_dataset = interleave_datasets([fineweb_dataset["train"], fineweb2_dataset["train"]])
list(islice(fineweb_train_combined_dataset, 2))

Here we’ve used the `islice()` function from Python’s `itertools` module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.
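
The alternating behaviour can be sketched with a small generator. It assumes the default strategy stops once one source runs out; the library's exact behaviour at that boundary, and its other stopping strategies, may differ. `interleave` is a hypothetical helper, not the library function.

```python
def interleave(*streams):
    """Alternate between streams, stopping as soon as one of them runs out."""
    iterators = [iter(stream) for stream in streams]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return

combined = list(interleave(["a0", "a1"], ["b0", "b1", "b2"]))
print(combined)
```

The remaining `"b2"` is never emitted because the first source is exhausted before its turn comes.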

# Semantic search with FAISS
Let's use the [*lewtun/github-issues*](https://huggingface.co/datasets/lewtun/github-issues) dataset, a corpus of GitHub issues from the [**HuggingFace Datasets**](https://github.com/huggingface/datasets/issues) repository. Each issue contains a title, a description, and a set of labels that characterize it, and we will use this information to build a search engine that can help us find answers to our most pressing questions about the library!

<img src=data/chapter_5/datasets-issues-single.png width="800"/> 


### Using embeddings for semantic search

As we saw earlier, Transformer-based language models first turn each *token* into an *integer ID*, and then look up a *fixed-size input embedding vector* for that ID. After the tokens pass through all the *self-attention layers*, we get a set of *contextual embeddings*: one vector per token that now carries information about the whole sentence.
<br />
By *pooling* these *contextual token embeddings* (for example, by taking the mean, grabbing the special `[CLS]` vector, or using another strategy) we collapse them into a single vector that represents an entire sentence, paragraph, or even a whole document.

> Example: The sentence “Bug fixed” is tokenised as `[CLS] bug fixed [SEP]`. Each token is turned into an ID (e.g., `101`, `12572`, `2196`, `102`) and looked up in BERT’s embedding table, yielding four separate 768-dimensional vectors, *one per token*. These are still “raw” (they only identify the token itself). The vectors then pass through BERT’s 12 *self-attention layers*, becoming *contextual*: each vector now encodes information about the whole sentence. When we need a single vector for the entire issue, we choose a **pooling strategy**. Mean pooling averages all contextual vectors, but an even simpler option is `[CLS]` *pooling*: we keep only the vector at the first position. During BERT’s pre-training the `[CLS]` vector feeds the next-sentence-prediction head, and fine-tuning for classification also reads from this position, so the network learns to funnel **sentence-level information** into that spot. Grabbing it is fast (no averaging) and often works well because it was trained to act as a ready-made summary.
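
The two pooling strategies can be contrasted on a toy input. The sketch uses plain Python lists and made-up 3-dimensional vectors instead of real 768-dimensional model outputs; `cls_pool` and `mean_pool` are hypothetical helpers.

```python
def cls_pool(token_vectors):
    """Keep only the vector at position 0, i.e. the [CLS] token."""
    return token_vectors[0]

def mean_pool(token_vectors):
    """Average the vectors dimension by dimension."""
    count = len(token_vectors)
    return [sum(dim) / count for dim in zip(*token_vectors)]

# made-up "contextual embeddings" for the tokens [CLS] bug fixed [SEP]
token_vectors = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 2.0, 0.0],  # bug
    [1.0, 4.0, 2.0],  # fixed
    [3.0, 2.0, 0.0],  # [SEP]
]
print(cls_pool(token_vectors))
print(mean_pool(token_vectors))
```

Both functions return one vector of the same dimensionality as the token vectors. With padded batches, mean pooling would also need to ignore padding positions, which this sketch leaves out.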

<br />
Once we have these pooled embeddings, we can compare documents by measuring how similar their vectors are, for example with a dot-product or cosine-similarity score (the two coincide when the vectors are normalised to unit length). Documents whose vectors are most alike are considered the most semantically similar.
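
As a small worked example of these similarity scores, the sketch below computes a dot product and a cosine similarity on made-up two-dimensional vectors using only the standard library; the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [1.0, 0.0]
doc_same_direction = [3.0, 0.0]  # same direction as the query, larger norm
doc_orthogonal = [0.0, 2.0]      # nothing in common with the query

print(dot(query, doc_same_direction), cosine_similarity(query, doc_same_direction))
print(dot(query, doc_orthogonal), cosine_similarity(query, doc_orthogonal))
```

The dot product grows with vector length, whereas cosine similarity depends only on direction, so the choice matters when embeddings are not normalised.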

!["semantic search"](data/chapter_5/semantic_search.png "semantic_search")

Let's build a semantic search engine that uses these embeddings. Unlike traditional keyword search, semantic search looks at meaning rather than exact word matches, so it can retrieve relevant documents even when they don’t share the same keywords as the query.

### Loading and preparing the dataset

In [None]:
from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")

issues_dataset

Normally, when we fetch issues from the GitHub API, it also returns all the *pull requests* along with the *issues*. So, let's filter out the rows that correspond to *pull requests*.
<br />
Luckily, this dataset comes with an additional column, `is_pull_request`, that specifies whether a particular row contains *pull request* data. While we’re at it, let’s also filter out rows with no comments, since these provide no answers to user queries:

In [None]:
issues_dataset = issues_dataset.filter(
    lambda x: not x["is_pull_request"] and len(x["comments"]) > 0
)

print(issues_dataset)

issues_dataset.num_columns

We can see that there are a lot of columns in our dataset (28), most of which we don’t need to build our search engine. From a search perspective, the most informative columns are `title`, `body`, and `comments`, while `html_url` provides us with a link back to the source issue. Let’s use the `remove_columns()` function to drop the rest:

In [None]:
columns =  issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
# drop every column that is not in `columns_to_keep`
columns_to_rm = [column for column in columns if column not in columns_to_keep]

issues_dataset = issues_dataset.remove_columns(column_names=columns_to_rm)

issues_dataset

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our `comments` column is currently a list of comments for each issue, we need to "**explode**" the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the `DataFrame.explode()` function, which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let’s first switch to the Pandas DataFrame format:

In [None]:
from pprint import pprint
issues_dataset.set_format("pandas")
issues_df = issues_dataset[:]

issues_df["comments"][0].tolist()

We can see from above that the issue at index 0 has two comments. Now, when we explode `issues_df` based on the `comments` column, we expect to get one row for each of these comments. Let’s check if that’s the case:

In [None]:
comment_df = issues_df.explode("comments", ignore_index=True)
comment_df.head(3)
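
For intuition, "exploding" a list-valued column can be sketched in plain Python as below. This is a simplification: unlike `DataFrame.explode()`, the sketch silently drops rows whose list is empty. The `explode` helper and the miniature table are hypothetical.

```python
def explode(rows, column):
    """Emit one row per element of the list stored in `column`, copying the other fields."""
    exploded = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # copy title, body, html_url, ...
            new_row[column] = element  # replace the list with a single element
            exploded.append(new_row)
    return exploded

# made-up miniature version of the issues table
rows = [
    {"title": "Issue A", "comments": ["first reply", "second reply"]},
    {"title": "Issue B", "comments": ["only reply"]},
]
exploded_rows = explode(rows, "comments")
print(len(exploded_rows))
```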

Great, we can see the rows have been replicated, with the `comments` column containing the individual comments! Now that we’re finished with `Pandas`, we can quickly switch back to a `Dataset` by loading the DataFrame into memory:

In [None]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comment_df)

comments_dataset

Let's now rename the `comments` column to `comment`, add a new column called `comment_length` that holds the number of words in each comment, and use it to keep only the comments longer than 15 words:

In [None]:
comments_dataset = comments_dataset.rename_column(
    original_column_name="comments",
    new_column_name="comment"
)


comments_dataset = comments_dataset.map(
    lambda batch: 
    {
        "comment_length" : [
            len(comment.split()) for comment in batch["comment"]
        ]
    },
    batched=True
)

comments_dataset = comments_dataset.filter(
    lambda batch : [
        comment_length > 15 for comment_length in batch["comment_length"]
    ],
    batched=True
)

comments_dataset

Having cleaned up our dataset a bit, let’s concatenate the issue title, description, and comments together in a new `text` column. As usual, we’ll write a simple function that we can pass to `Dataset.map()`:

In [None]:
def concatenate_text(batch):

    text = [
        f"{title} \n {body} \n {comment}"
        for title, body, comment in zip(batch["title"], batch["body"], batch["comment"])
    ]
    return {"text" : text}


comments_dataset = comments_dataset.map(
    concatenate_text,
    batched=True
)

comments_dataset

### Creating text embeddings

As we saw, to get an embedding representation of an input, all we need is the last hidden state of a model. Fortunately, there’s a library called `sentence-transformers` that provides models dedicated to creating embeddings. As described in the library’s [documentation](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search), our use case is an example of **asymmetric semantic search** because *we have a short query whose answer we’d like to find in a longer document, like an issue comment*.
<br /> 
The [model overview table](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview) in the `sentence-transformers` documentation indicates that the `all-mpnet-base-v2` checkpoint has the best performance for *semantic search*, so we’ll use that for our application. We’ll also load the tokenizer using the same checkpoint:


In [None]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = 'sentence-transformers/all-mpnet-base-v2'

tokeniser = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

To speed up the embedding process, it helps to place the model and inputs on a GPU device, so let’s do that now:
> Note: The cell below falls back to the CPU if no GPU is available. If you have a different accelerator (for example, a TPU), set `device` accordingly.

In [None]:
import torch

# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Now we need one vector for each GitHub issue. As we mentioned earlier, the easiest way is `[CLS]` *pooling*, where we simply collect the *last hidden state* of the special `[CLS]` *token* from our *model’s outputs*. The following function does the trick for us:

In [None]:
def cls_pooling(model_output):
    # returns the [CLS] vector which is at index 0
    # for every input passed to the model
    return model_output.last_hidden_state[:, 0]

Next, we’ll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

In [None]:
def get_embeddings(text_list):
    tokenised_input = tokeniser(
        text_list,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )

    tokenised_input = {
        k : v.to(device) for k, v in tokenised_input.items()
    }

    # inference only, so skip building the autograd graph
    with torch.no_grad():
        model_output = model(**tokenised_input)

    return cls_pooling(model_output)

We can test the function works by feeding it the first text entry in our corpus and inspecting the output shape:

In [None]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

Great, we’ve converted the first entry in our corpus into a 768-dimensional vector!
<br />
We can use `Dataset.map()` to apply our `get_embeddings()` function to each row in our corpus, so let’s create a new embeddings column as follows:

> Note: We detach the embedding to remove it from the autograd graph, move it to the CPU because `.numpy()` only works on CPU tensors, and finally convert it to a NumPy array so the Datasets library can hand it off to FAISS (Facebook AI Similarity Search) for indexing.

In [None]:
embeddings_dataset = comments_dataset.map(
    lambda batch : {
        "embedding" : [
            get_embeddings(text).detach().cpu().numpy()[0]  # 1-D array for each text
            for text in batch["text"]
        ]
    },
    batched = True
)

embeddings_dataset

### Using FAISS for efficient similarity search

Now that we have a dataset of *embeddings*, we need some way to search over them. To do this, we’ll use a special data structure in `Datasets` called a **FAISS index**. [FAISS](https://faiss.ai/) (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.
<br />
The basic idea behind *FAISS* is to create a special data structure called an **index** that allows one to find which *embeddings are similar to an input embedding*. Creating a *FAISS index* in `Datasets` is simple — we use the `Dataset.add_faiss_index()` function and specify which column of our dataset we’d like to index:

In [None]:
embeddings_dataset.add_faiss_index(column="embedding")

To query the index, we first need to convert the question into an embedding vector. Just as with the embedded dataset, we need a 768-dimensional vector representing the query:

In [None]:
question_query = "How can I load a dataset offline?"

question_query_embedding = get_embeddings(question_query).detach().cpu().numpy()
question_query_embedding.shape

Now that we have the question query in its embedding form, we can compare it against the whole corpus to find the most similar embeddings by doing a nearest-neighbour lookup with the `Dataset.get_nearest_examples()` function. Its main parameter is `k`, which specifies how many nearest neighbours to look up.
The `Dataset.get_nearest_examples()` function returns a tuple of *scores* that rank the match between the query and each document, and a corresponding set of *samples*. Both contain `k` entries.

In [None]:
# compare 6 nearest neighbors
k = 6

scores, samples = embeddings_dataset.get_nearest_examples(
    index_name="embedding",
    query=question_query_embedding,
    k=k
)
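
To illustrate what a nearest-neighbour lookup does, here is a brute-force sketch in plain Python. It assumes an exhaustive ("flat") search with squared L2 distance, where smaller scores mean closer matches. FAISS itself is far more optimised, and the helper name and toy vectors are hypothetical.

```python
import heapq

def nearest_examples(corpus, query, k):
    """Return the k smallest squared L2 distances and the matching corpus indices."""
    distances = [
        (sum((a - b) ** 2 for a, b in zip(vector, query)), index)
        for index, vector in enumerate(corpus)
    ]
    best = heapq.nsmallest(k, distances)  # closest first
    scores = [distance for distance, _ in best]
    indices = [index for _, index in best]
    return scores, indices

corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
scores, indices = nearest_examples(corpus, query=[1.0, 0.5], k=2)
print(scores, indices)
```

A brute-force scan like this costs time proportional to the corpus size for every query, which is the cost that specialised index structures aim to reduce.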

Let’s collect these in a `pandas.DataFrame` so we can easily sort them and visualise them using its `head()` function:

In [None]:
import pandas as pd

# convert the sample dict to pd
samples_df = pd.DataFrame.from_dict(samples)
# add the corresponding scores
samples_df["score"] = scores
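# sort based on the scores; assuming the default FAISS index here measures L2 distance,
# smaller scores mean closer matches, so ascending order puts the best match first
samples_df.sort_values("score", ascending=True, inplace=True)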

# visualise in a table form
samples_df.head(k)

# The End!