# Tutorial 4: Advanced modeling

After preparing a basic model with features from the audio data and getting familiar with the metrics for this challenge, it's time to use a more state-of-the-art approach. The model we are about to use is bigger and demands more computing power. The purpose of this tutorial is to help you understand the code that we will use in the next notebook, where we will fine-tune the model on the whole dataset. Since we only want to get familiar with the model and its preprocessing, we will use a small sample of the data in this notebook. So if the results here seem discouraging, please be patient: it will pay off in the next part of the tutorials.

For this notebook, we'll use the [🤗 (Hugging Face) library](https://huggingface.co/). With its help, we will create the train, validation, and test datasets, preprocess the data, load a pre-trained model, and fine-tune it. By the end of this notebook, you will have a basic understanding of the Hugging Face library and you will be able to train a much more powerful model than Random Forest.

For those of you who are encountering the 🤗 library for the first time, let's ask ChatGPT for a short explanation of what it is:
>"Hugging Face is an open-source software library for natural language processing (NLP) tasks, such as text classification, machine translation, and question-answering. It provides easy-to-use interfaces to pre-trained language models, including state-of-the-art models such as GPT-3 and BERT, allowing developers to quickly build and deploy NLP applications. The library also includes a range of tools for fine-tuning and adapting pre-trained models to specific NLP tasks, as well as for training new models from scratch. Hugging Face has become popular in the NLP community due to its ease of use, flexibility, and strong community support. It is widely used in industry and academia for a variety of NLP applications."

Hmm... that's a good start, but honestly, a more accurate explanation would mention that, apart from NLP-oriented models, the library provides state-of-the-art solutions for a variety of other Data Science tasks, such as Computer Vision, Reinforcement Learning aaaaand **Audio Processing**!

Nevertheless, backed by ChatGPT's promise of 🤗 simplicity, let's try to use the library to create an audio dataset, preprocess it, and finally fine-tune one of the many available models on our classification task. The list of all available audio classifiers can be found [here](https://huggingface.co/models?pipeline_tag=audio-classification&sort=downloads). For starters, we'll use one of the most popular ones: the [Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/main/en/model_doc/audio-spectrogram-transformer#audio-spectrogram-transformer).

For those of you who were not convinced by ChatGPT's assurance of 🤗 simplicity and want to have a good understanding of the library first, here is an excellent introductory [course](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt) that can help you with your first steps.

But don't be scared: this notebook will guide you through the basic concepts of the library, so that by the end you will have your first Hugging Face model ready. Let's get started!

**Note: For this notebook, you do need a GPU instance!**

## Setup

First, we need to import required libraries and functions.

In [3]:
# render the plots under the code cell that created them
%matplotlib inline
import json  # for working with json files
import sys  # Python system library needed to load custom functions
import numpy as np  # for performing calculations on numerical arrays
import pandas as pd  # home of the DataFrame construct, _the_ most important object for Data Science
import torch  # library to work with PyTorch tensors and to figure out if we have a GPU available
import os     # for changing the directory

from datasets import load_dataset, Audio  # required tools to create, load and process our audio dataset
from transformers import ASTFeatureExtractor, ASTForAudioClassification, TrainingArguments, Trainer  # required classes to perform the model training

sys.path.append('../src')  # add the source directory to the PYTHONPATH. This allows us to import local functions and modules.
from gdsc_utils import download_directory, PROJECT_DIR # function to download the needed files from the official GDSC s3 bucket and our root directory
from config import DEFAULT_BUCKET  # S3 bucket with the GDSC data
from preprocessing import calculate_stats, preprocess_audio_arrays  # functions to calculate dataset statistics and preprocess the dataset with ASTFeatureExtractor
from gdsc_eval import make_predictions, compute_metrics  # functions to create predictions and evaluate them
os.chdir(PROJECT_DIR) # changing our directory to root

## Downloading data

Next we need to download the official data for the GDSC from the S3 bucket. The S3 bucket is structured as follows:

```
S3_bucket/
    └── data/
        |── labels.json
        └── train/
            |── train_file_1.wav
            |── train_file_2.wav
            |── ...
            |── metadata.csv
        └── val/
            |── val_file_1.wav
            |── val_file_2.wav
            |── ...
            |── metadata.csv
        └── test/
            |── test_file_1.wav
            |── test_file_2.wav
            |── ...
            |── metadata.csv
    └── data_small/
        |── labels.json
        └── train/
            |── train_file_1.wav
            |── train_file_2.wav
            |── ...
            |── metadata.csv
        └── val/
            |── val_file_1.wav
            |── val_file_2.wav
            |── ...
            |── metadata.csv
```

In the official S3 bucket, you can find 2 folders:

- *data* - it contains the complete dataset for the challenge. We already downloaded it in the 2nd tutorial (EDA).
- *data_small* - this folder contains a small sample of the training and validation datasets.

For the purpose of this tutorial, we need to download the *data_small* directory. We can use the ```download_directory``` function to accomplish that. Let's store it in our *data* folder.

In [4]:
download_directory('data_small/', 'data', DEFAULT_BUCKET)

## Creating the datasets

Having imported the required libraries, it's about time to create a 🤗 dataset object that will allow us to handle our audio files during preprocessing and training. This will also be the first "proof" of the ease of use of the 🤗 library.

The 🤗 datasets module has a neat way to load the audio data type with which we are working. The only thing we need is the paths to the folders with the audio and metadata files.

In [5]:
# paths for the train and validation datasets
train_path = 'data/data_small/train'
val_path = 'data/data_small/val'

Let's look at the structure of the metadata files stored in those paths.

In [6]:
train_meta_df = pd.read_csv(f"{train_path}/metadata.csv")
val_meta_df = pd.read_csv(f"{val_path}/metadata.csv")

In [7]:
train_meta_df.head()

In [8]:
val_meta_df.head()

The files with metadata have a simple structure - they consist only of two columns: *file_name* and *label*.

Once we establish where 🤗 needs to look for the audio files and their metadata, we can use the *load_dataset* function and create an AudioFolder object, which is designed to work with audio data. We encourage you to read more about the AudioFolder builder [here](https://huggingface.co/docs/datasets/audio_load#audiofolder) and [here](https://huggingface.co/docs/datasets/audio_dataset#audiofolder).

We will also use the shuffle method on the train set to avoid inputting sorted data points to our model, which might negatively affect its convergence. We use a random seed of 42, to ensure the reproducibility of the output. This also allows you to cache the dataset, so that you can load it without the need of recomputing the shuffle.
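The effect of a fixed seed can be illustrated with Python's standard library. This is only an analogy, not the way the 🤗 datasets library implements shuffling; the helper name `shuffled_indices` is made up for this sketch.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order determined by `seed`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # a private generator, so global state is untouched
    return indices


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)

# The same seed gives the same order every time, which is what makes a run reproducible.
print(first == second)
```

Shuffling only changes the order of the examples: every index still appears exactly once.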

In [9]:
# our first interaction with Hugging Face datasets!
train_dataset = load_dataset("audiofolder", data_dir=train_path).get('train').shuffle(seed = 42)  # load the dataset and shuffle the examples
val_dataset = load_dataset("audiofolder", data_dir=val_path).get('train')                         # load the validation dataset. But why do we have "get('train')" at the end of the line? :)

Seems that the dataset was loaded. Let's inspect the train_dataset and val_dataset variables.

In [10]:
train_dataset, val_dataset

So clearly we've created some kind of dataset object. We can see that it has two features: 'audio' and 'label'. Let's unpack this vague-looking object a bit more and see what exactly the data looks like.

In [11]:
train_dataset[0], val_dataset[0]

We can see the first record of the train and validation sets. 

The 'audio' feature consists of a dictionary with the keys:
1. *path* - the path to the file, including the folders
2. *array* - the loaded audio signal, stored as an array of amplitudes
3. *sampling_rate* - the number of data points that make up one second of the recording

The label feature is a simple integer that denotes the class of the recorded species. If you wonder about what class is assigned to a given integer, please inspect the *labels.json* file in the *data* directory.
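To make the record structure concrete, here is a minimal sketch that mimics one record with a plain Python dictionary. The file name, the amplitudes, the sampling rate, and the label are made-up values for illustration and are not taken from the dataset.

```python
# A hand-written stand-in for one dataset record (all values are hypothetical).
record = {
    "audio": {
        "path": "data/data_small/train/example.wav",  # hypothetical file name
        "array": [0.0, 0.12, -0.05, 0.3] * 11025,     # 44,100 made-up amplitudes
        "sampling_rate": 44100,                       # assumed rate in Hz
    },
    "label": 7,  # hypothetical class id
}

audio = record["audio"]

# Duration in seconds = number of samples / samples per second.
duration_s = len(audio["array"]) / audio["sampling_rate"]
print(f"{audio['path']}: {duration_s:.2f} s, label {record['label']}")
```

The same arithmetic works on a real record, for example to check how long your recordings are.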

Great! We've just completed the first step of creating a dataset in no time. Now we need to slightly preprocess the data, so that it will have the form required by the model that we are about to use. We've mentioned at the beginning that we will use the [Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/main/en/model_doc/audio-spectrogram-transformer#transformers.ASTForAudioClassification.forward.example). In the next section, we'll try to understand what we need to do with our dataset to successfully fine-tune the model.

**Key insights:**
* The 🤗 datasets audio folder consists of two columns:
    * audio - compound column with path, the amplitude array and the sampling rate
    * label - an integer indicating the class of the file

**Exercise time:**
* Can you try to inspect the cell in which we load the dataset and try to figure out why we use the get method with a key equal to "train"?
* What kind of object do we get without it?
* What is this object? Inspect the 🤗 documentation
* Can we avoid using the 'train' field in the above cell? Post your findings on the Teams channel!

## Data preprocessing

As we've mentioned at the beginning, we'll use the Audio Spectrogram Transformer (or AST for short) model, which was trained on a dataset called [AudioSet](https://research.google.com/audioset/). While this is the same type of data, it is very different from what we are working with. AudioSet consists of clips from YouTube, which are nowhere close to recordings of insect sounds. This is why we need to perform some preprocessing steps in order to reliably fine-tune the model for our purpose. Let's see what the 🤗 [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/audio-spectrogram-transformer#overview) of the AST model tells us.

If we look closely at the *tips* section of the documentation, we'll learn that *"it's recommended to take care of the input normalization (to make sure the input has a mean of 0 and std of 0.5). ASTFeatureExtractor takes care of this. Note that it uses the AudioSet mean and std by default"*.

This poses a first challenge - we need to calculate the mean and standard deviation of the input that we are going to plug into our model to use instead of the AudioSet stats. The input to the AST model is a spectrogram, so we need to calculate the stats not from the "raw" amplitude arrays that we've just loaded, but from their respective spectrograms. Luckily the *ASTFeatureExtractor* does just that - it extracts the features from the audio data that are needed for the model, which are the spectrogram arrays. Setting the *do_normalize* argument to *False* will return the spectrograms without performing the normalization on them, so that we can calculate the relevant stats.
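To see why the mean and standard deviation are needed, here is a small sketch of a normalization that produces a mean of 0 and a standard deviation of 0.5. It assumes the formula `(x - mean) / (2 * std)`, which matches the target quoted from the documentation; the input values are made up and stand in for spectrogram entries.

```python
import statistics

# Made-up numbers standing in for the entries of a spectrogram.
values = [-12.0, -8.0, -4.0, 0.0, 4.0]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation

# Assumed normalization: subtract the mean, divide by twice the standard deviation.
normalized = [(v - mean) / (2 * std) for v in values]

print(statistics.fmean(normalized), statistics.pstdev(normalized))
```

If the statistics passed to the normalization do not describe the actual data (for example, AudioSet statistics applied to insect recordings), the result will generally not have a mean of 0 and a standard deviation of 0.5.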

Okay, so we can start to code it, right? Well... not quite yet. If we look at the [ASTFeatureExtractor documentation](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor), we will learn that the default sampling rate is set to 16 000 Hz. This happens to be the sampling rate of the AudioSet data on which the model was pre-trained. As we know from the EDA phase, and as we saw when inspecting the first record of the train and validation sets, our sampling rate is higher than that. In order to use the AST model, we first need to resample our dataset to 16 000 Hz and only then calculate the statistics of the resampled audio.

The next cell takes care of resampling our data to the required sampling rate.

In [12]:
MODEL_SAMPLING_RATE = 16000
train_dataset = train_dataset.cast_column("audio", Audio(sampling_rate=MODEL_SAMPLING_RATE))
val_dataset = val_dataset.cast_column("audio", Audio(sampling_rate=MODEL_SAMPLING_RATE))
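Resampling keeps the duration of a recording and changes only the number of data points per second. The arithmetic can be sketched as follows; the original sampling rate of 44 100 Hz is a hypothetical value for illustration, and the exact sample count produced by a real resampler may differ slightly.

```python
ORIGINAL_SAMPLING_RATE = 44100  # hypothetical value; check your own files
MODEL_SAMPLING_RATE = 16000     # the rate expected by the AST feature extractor


def resampled_length(n_samples, orig_sr, target_sr):
    """Approximate number of samples after resampling (the duration is preserved)."""
    return round(n_samples * target_sr / orig_sr)


n_original = 5 * ORIGINAL_SAMPLING_RATE  # a 5-second clip
n_resampled = resampled_length(n_original, ORIGINAL_SAMPLING_RATE, MODEL_SAMPLING_RATE)

print(n_original, n_resampled, n_resampled / MODEL_SAMPLING_RATE)
```

The clip is still 5 seconds long, but it is now described by fewer data points.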

Let's inspect the datasets in the cell below with the help of the info attribute of the Dataset object. We can see that the sampling rate for the *audio* column is set to 16 000, which is exactly what we wanted. Great! It seems that working with the 🤗 library is really easy!

In [13]:
train_dataset.info.features, val_dataset.info.features

Having the data resampled we can now load the feature extractor with the help of the *from_pretrained* method of the *ASTFeatureExtractor*. As we discussed above - we need to disable the normalization to get unnormalized spectrogram arrays on which we will calculate the required stats - mean and standard deviation.

In [14]:
feature_extractor_stats = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", do_normalize=False)

Now, with the help of the *map* method of the 🤗 dataset object, we'll use the *calculate_stats* function from the preprocessing module that we have in the *src* directory to pass the audio data through our newly created feature extractor.

For the *calculate_stats* function we need to pass the audio feature of our dataset, the names of the keys that will help us to extract only the array from the audio feature, and finally our feature extractor object. 

For the configuration of the *map* method, we set the *batched* argument to *True*, so that we won't process the data one by one. The default *batch_size* of *map* is 1000; we do have some computational capacity, so we'll leave it unchanged. If you run this notebook on a smaller instance, consider setting the *batch_size* argument to e.g. 2 to avoid out-of-memory issues. Together, these two arguments let you process a number of examples at once, which reduces the computation time a bit.

In [15]:
train_dataset = train_dataset.map(lambda x: calculate_stats(x, audio_field='audio', array_field='array', feature_extractor=feature_extractor_stats), batched=True)
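With *batched=True*, the mapped function receives a dictionary that maps each column name to a list of values, and it returns new columns in the same dictionary-of-lists form. The toy sketch below imitates that idea with plain Python. The names `toy_stats` and `toy_batched_map` are made up for this illustration; they are not the 🤗 implementation, and they are not the *calculate_stats* function from *src*.

```python
import statistics


def toy_stats(batch):
    """Receive a dict of lists (one entry per example) and return new columns."""
    return {
        "mean": [statistics.fmean(values) for values in batch["values"]],
        "std": [statistics.pstdev(values) for values in batch["values"]],
    }


def toy_batched_map(rows, function, batch_size):
    """Call `function` on consecutive slices of `rows`, `batch_size` rows at a time."""
    results = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {"values": [row["values"] for row in chunk]}  # dict of lists
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            extra = {name: column[i] for name, column in new_columns.items()}
            results.append({**row, **extra})
    return results


rows = [{"values": [1.0, 3.0]}, {"values": [2.0, 2.0]}, {"values": [0.0, 4.0]}]
processed = toy_batched_map(rows, toy_stats, batch_size=2)
print(processed)
```

Every row keeps its original column and gains one value per new column, which mirrors how new columns appear in the dataset after *map*.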

If we inspect the train_dataset object once again we will see that we've just created two more columns - *mean* and *std*. Those are the statistics for each file's spectrogram.

In [16]:
train_dataset[0]

Now the very last step to calculate the stats is to take the mean of the newly created columns. Those are the dataset statistics, that we will use for the model's feature extractor.

In [17]:
dataset_mean = np.mean(train_dataset['mean'])
dataset_std = np.mean(train_dataset['std'])

In [18]:
dataset_mean, dataset_std
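One thing worth knowing about this shortcut: averaging the per-file means gives the overall mean when all files contribute the same number of values, but averaging the per-file standard deviations is only an approximation of the standard deviation over all values. The sketch below shows the difference with two made-up "files"; it is an illustration of the arithmetic, not a replacement for the cells above.

```python
import math
import statistics

# Two hypothetical flattened spectrograms of equal length.
file_a = [-4.0, -2.0, 0.0, 2.0]
file_b = [6.0, 8.0, 10.0, 12.0]

per_file_means = [statistics.fmean(f) for f in (file_a, file_b)]
per_file_stds = [statistics.pstdev(f) for f in (file_a, file_b)]

# The shortcut: average the per-file statistics.
mean_of_means = statistics.fmean(per_file_means)
mean_of_stds = statistics.fmean(per_file_stds)

# The reference: statistics over all values at once.
all_values = file_a + file_b
global_mean = statistics.fmean(all_values)
global_std = statistics.pstdev(all_values)

# Exact pooled value: average within-file variance plus the variance of the file means.
pooled_std = math.sqrt(
    statistics.fmean(s ** 2 for s in per_file_stds) + statistics.pvariance(per_file_means)
)

print(mean_of_means, global_mean)
print(mean_of_stds, global_std, pooled_std)
```

The more the files differ in their average level, the more the averaged standard deviation underestimates the overall one, so this is something you may want to experiment with.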

As we won't need the mean and std columns in the next steps we can use the *remove_columns* method of the dataset object to get rid of them.

In [19]:
train_dataset = train_dataset.remove_columns(['mean', 'std'])

Phew! We resampled the data and calculated the stats of the dataset that we are about to use. Now we need to do one last step in our data preprocessing journey - we need to once again instantiate the *ASTFeatureExtractor*, but this time we will pass the dataset stats and leave the default value of the *do_normalize* argument, which is *True*.

In [20]:
feature_extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", mean=dataset_mean, std=dataset_std)

Now let's once again invoke the *map* method of the 🤗 dataset object on our train and validation sets, but this time we will use the *preprocess_audio_arrays* function from our *preprocessing* module. We still process the dataset in batches, but this time we will also remove the *audio* column, as we won't need it any longer. The result of the cell below is a dataset that is ready to be passed through the model in the training process.

In [21]:
train_dataset_encoded = train_dataset.map(lambda x: preprocess_audio_arrays(x, audio_field='audio', 
                                                                            array_field='array', 
                                                                            feature_extractor=feature_extractor), remove_columns="audio", batched=True, batch_size=2)
val_dataset_encoded = val_dataset.map(lambda x: preprocess_audio_arrays(x, audio_field='audio', 
                                                                        array_field='array', 
                                                                        feature_extractor=feature_extractor), remove_columns="audio", batched=True, batch_size=2)

Let's inspect the two newly created datasets. Now we see two columns - *label*, which we already know, and *input_values*, which stores the spectrogram arrays.

In [22]:
train_dataset_encoded, val_dataset_encoded

**Key insights:**
* The 🤗 datasets object can be processed with the help of the map method. You can define whether the processing should go in batches and, if so, how many data points you wish to process at once
* Some audio models may require a specific sampling rate of your data. Do remember to take care of that
* The AST model requires us to pass the data through its feature extractor (ASTFeatureExtractor) - other models have their own extractors, so if you plan to implement other models (and we strongly encourage you to do so) remember to change the imports accordingly
* The documentation advises us to pass the mean and standard deviation of our dataset - please remember to check the required preprocessing if you plan to use other models. Not doing so may rob you of a higher score!

**Exercise time:**
* Can you inspect the input_values feature of the dataset? 
* What type of object does it store? What is the shape of it? 
* Can you figure out why the shape of one data point is equal to those numbers? *Hint*: try to inspect the [feature extractor parameters](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor)!

We have finished preprocessing the data. Now we will focus on preparing the last bits for fine-tuning our model and finally run the training process for two epochs!

## Fine-tuning the AST model

The model that we are about to use was pre-trained on the AudioSet dataset, which is very different from the data we are working with. That's why we need to make sure that the model "knows" which classes we want it to predict and how many of them we have.

In order to do this we will use our *labels.json* file which contains the mapping of the labels to integer ids. Let's load and inspect it.

In [23]:
with open('data/labels.json', 'r') as f:
    labels = json.load(f)

In [24]:
labels

From the above output, we can see that we are dealing with a Python dictionary. The keys are the names of the species and the values are integers from 0 to 65.

Now we need to create one mapping of label to id and one of id to label. It's important to make sure that the ids are cast to a string. Let's create two variables that will contain the required mappings - label2id and id2label. We will use Python dictionaries to achieve this.

In [25]:
label2id, id2label = dict(), dict()
for k, v in labels.items():
    label2id[k] = str(v)
    id2label[str(v)] = k

In [26]:
label2id

In [27]:
id2label
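The same two mappings can also be written as dictionary comprehensions. The sketch below uses three made-up species names instead of the real *labels.json* content and checks that the mappings are inverses of each other.

```python
# Hypothetical stand-in for the content of labels.json.
labels = {"species_a": 0, "species_b": 1, "species_c": 2}

# Same logic as the loop above, with the ids cast to strings.
label2id = {name: str(idx) for name, idx in labels.items()}
id2label = {str(idx): name for name, idx in labels.items()}

# Going from a label to its id and back should return the original label.
round_trip_ok = all(id2label[label2id[name]] == name for name in labels)
print(label2id, id2label, round_trip_ok)
```

A round-trip check like this is a cheap way to catch duplicated ids before they end up in the model configuration.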

Great! This is exactly what we need. Now let's create a variable that will contain the information about the number of labels.

In [28]:
num_labels = len(label2id)
num_labels

Excellent! Now we have all the pieces in place to instantiate the AST model.

Let's use the *ASTForAudioClassification* class to instantiate the model. We will make sure to pass the number of labels and both of the mappings. Apart from that, we are adding the *ignore_mismatched_sizes* argument and setting its value to *True*. The pre-trained checkpoint ships with a classification layer sized for the AudioSet classes, which does not match our number of labels. This argument lets the loading proceed anyway: the mismatched last layer is re-initialized with the number of output neurons derived from the arguments that we passed.

In [29]:
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", 
                                                  num_labels=num_labels, 
                                                  label2id=label2id, 
                                                  id2label=id2label,
                                                  ignore_mismatched_sizes=True
                                                 )

Great! We can see that some of the weights were "newly initialized". Those are the weights of the last layer. We can also see that we *"should probably TRAIN this model"*. This is exactly what we are about to do.

If we want to train a model with the help of the 🤗 library, we need to create instances of two classes: TrainingArguments and Trainer. The first object tells 🤗 which parameters to use for the training process that we are about to start. The Trainer class takes those arguments along with the model, the metrics that we want to compute during training, the datasets we are going to use, and the feature extractor.

Feel free to inspect the documentation of both - the [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) and [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) classes.

In [33]:
NUM_TRAIN_EPOCHS = 2                        # variable defining number of training epochs

training_args = TrainingArguments(
    output_dir='models/AST',                # directory for saving model checkpoints and logs
    num_train_epochs=NUM_TRAIN_EPOCHS,      # number of epochs
    per_device_train_batch_size=2,          # number of examples in batch for training
    per_device_eval_batch_size=2,           # number of examples in batch for evaluation
    evaluation_strategy="epoch",            # makes evaluation at the end of each epoch
    learning_rate=float(5e-5),              # learning rate
    optim="adamw_torch",                    # optimizer
    logging_steps=1,                        # number of steps for logging the training process - one step is one batch
    load_best_model_at_end=True,            # whether to load or not the best model at the end of the training
    metric_for_best_model="eval_loss",      # the best model is the one with the lowest loss on the validation set
    save_strategy='epoch'                   # saving is done at the end of each epoch
)

In [34]:
# create Trainer instance
trainer = Trainer(
    model=model,                          # passing our model
    args=training_args,                   # passing the above created arguments
    compute_metrics=compute_metrics,      # passing the compute_metrics function that we imported from gdsc_eval module
    train_dataset=train_dataset_encoded,  # passing the encoded train set
    eval_dataset=val_dataset_encoded,     # passing the encoded validation set
    tokenizer=feature_extractor           # passing the feature extractor
)
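Since one step is one batch, the number of training steps follows directly from the dataset size, the batch size, and the number of epochs. The sketch below shows the arithmetic under a few assumptions: a single device, no gradient accumulation, and a last incomplete batch that is kept. The dataset size of 100 is a made-up number.

```python
import math


def total_training_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total steps) for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # the last batch may be smaller
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = total_training_steps(n_examples=100, batch_size=2, epochs=2)
print(steps_per_epoch, total_steps)
```

With *logging_steps=1* you would see one log entry per step, and with evaluation and saving set to "epoch" you would see one evaluation and one checkpoint after every `steps_per_epoch` steps.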

Amazing! We have now done everything that was required to fine-tune the model. We can finally run the cell which will give us our "version" of the AST classifier, which is capable of distinguishing different species from audio recordings. Let's do it!

In [35]:
# train model
trainer.train()

Is it possible? We are performing better than the Random Forest model with only a fraction of the data! Well, yes, that's possible, but remember that the validation set we are using here contains only 66 samples, far fewer than the full validation set. To really compare the model with the Random Forest, we need to perform inference on the test set and send a submission.

In the next section we will show you how to load the model from checkpoint and perform inference on the test set data.

**Key insights:**
* The 🤗 models hub offers you a variety of models, BUT you should always remember to adjust them to your task - create an appropriate mapping of labels to integers and specify the number of classes that you are working with
* There are a number of parameters that define a training job - be mindful of how you set them and iterate over different values - this is called hyperparameter tuning
* Fine-tuning such a big model on such a small sample is almost always a bad idea - big models require big data!

## Loading the model and doing inference on the test set

If you look back at the *TrainingArguments* class, you will see that we passed an *output_dir* argument that tells 🤗 where to put the checkpoints with the training metadata and the model. We set it to *models/AST*, so let's use this directory to load the feature extractor and the model from the best checkpoint. Note that this is not strictly necessary: we set *load_best_model_at_end* to *True* in our *TrainingArguments* object, which ensures that the *model* variable already contains the best model according to the metric of choice. We just want to show you how to load the model from other checkpoints in case you'd like to experiment. With the 🤗 library, loading a checkpoint is just a matter of two lines.

In [37]:
feature_extractor = ASTFeatureExtractor.from_pretrained("models/AST/checkpoint-176")
model = ASTForAudioClassification.from_pretrained("models/AST/checkpoint-176")
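The number in *checkpoint-176* is the training step at which the checkpoint was saved, so it depends on the run; your folder names may differ. A small helper like the one below can list what is actually available in an output directory. The function name `list_checkpoints` is made up for this sketch, and the demonstration runs on a temporary folder with made-up checkpoint names rather than on *models/AST*.

```python
import os
import re
import tempfile


def list_checkpoints(output_dir):
    """Return (step, path) pairs for 'checkpoint-<step>' folders, sorted by step."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))
    return sorted(found)


# Demonstration on a temporary folder with made-up checkpoint names.
with tempfile.TemporaryDirectory() as tmp:
    for step in (176, 88):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    steps = [step for step, _ in list_checkpoints(tmp)]

print(steps)
```

In the notebook you could call `list_checkpoints('models/AST')` and pass one of the returned paths to *from_pretrained*.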

Cool! Now let's get the test set data. We need to preprocess it in the same way as we did for the training. Let's start by simply loading the dataset and resampling the audio arrays.

In [38]:
test_path = 'data/test'
test_dataset = load_dataset("audiofolder", data_dir=test_path).get('train')
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=MODEL_SAMPLING_RATE))

In [39]:
test_dataset

In [40]:
test_dataset[0]

As we need the predictions file to have two columns - file_name and predicted_class_id, let's take care of extracting the paths for each data point and make it a feature called "file_name". 

For this purpose we'll use the metadata information from the dataset object that we just created.

So let's get the paths of the audio files.

In [41]:
test_paths = list(test_dataset.info.download_checksums.keys())

Let's inspect the variable.

In [42]:
test_paths[:3]

Great! We obtained the paths. One thing to note is that the test_paths variable also contains the metadata.csv file with file names and labels (check it on your own!). We don't need it, so we will use a one-liner lambda function to keep only the items related to the audio files.

Furthermore, we don't need the whole path - just the file names, so we will define another one-liner that gets the string after the last "/" character, which is exactly the file name.

We will use the built-in *filter* and *map* functions, which apply a function to each element of a Python iterable. With their help, we will run the lambda functions defined below.

In [43]:
remove_metadata = lambda x: x.endswith(".wav")
extract_file_name = lambda x: x.split('/')[-1]

test_paths = list(filter(remove_metadata, test_paths))
test_paths = list(map(extract_file_name, test_paths))
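As an alternative sketch, the same result can be obtained with a list comprehension and `os.path.basename` from the standard library, which returns the last component of a path. The paths below are made up for illustration.

```python
import os

# Hypothetical paths, similar in shape to the entries of test_paths.
paths = [
    "/cache/test/test_file_1.wav",
    "/cache/test/metadata.csv",
    "/cache/test/test_file_2.wav",
]

# Keep only the audio files and strip the folders in one pass.
file_names = [os.path.basename(p) for p in paths if p.endswith(".wav")]
print(file_names)
```

Both versions keep the original order of the files, which matters because the names are later attached to the dataset as a column.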

Let's see if the test_paths variable contains the file names.

In [44]:
test_paths[:3]

Yes, we indeed have just the file names. Let's create a new column with the file names.

In [45]:
test_dataset = test_dataset.add_column("file_name", test_paths)

Let's inspect the newly created "file_name" feature.

In [46]:
test_dataset

In [None]:
test_dataset[0]

Amazing! We have almost finished preprocessing the data. The last step is to pass the audio arrays through our feature extractor and set the format of the "input_values" column from numpy to torch, so that we can safely pass the spectrogram arrays through the model.

In [47]:
test_dataset_encoded = test_dataset.map(lambda x: preprocess_audio_arrays(x, 'audio', 'array', feature_extractor), remove_columns="audio", batched=True, batch_size = 2)
test_dataset_encoded.set_format(type='torch', columns=['input_values'])

Now let's tell 🤗 that we want to run the predictions on our GPU. To do this, we need to define the *device* variable with the help of the *PyTorch* library.

In [49]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Good, we are set up to perform inference on the test set. Let's use the *make_predictions* function from our *gdsc_eval* module located in the *src* directory. This time we will set the *batch_size* argument to 8, to avoid any out-of-memory issues. We are also dropping the "input_values" column, as we won't need it anymore.

In [50]:
test_dataset_encoded = test_dataset_encoded.map(lambda x: make_predictions(x['input_values'], model, device), batched=True, batch_size=8, remove_columns="input_values")
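The implementation of *make_predictions* lives in the *src* directory and is not shown here. Conceptually, a classifier produces one score (a logit) per class, and the predicted class id is usually the index of the largest score. The pure-Python sketch below illustrates that idea with made-up logits; it is an assumption about the general approach, not a copy of *make_predictions*.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_class_id(logits):
    """Return the index of the largest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [0.2, 2.5, -1.0]  # made-up scores for three classes
probabilities = softmax(logits)
predicted_class_id = predict_class_id(logits)
print(predicted_class_id, probabilities)
```

Softmax does not change which class has the largest value, so the predicted id can be taken directly from the logits.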

Let's now create a pandas dataframe from our 🤗 dataset. We should see the columns file_name and predicted_class_id.

In [52]:
test_dataset_encoded_df = test_dataset_encoded.to_pandas()
test_dataset_encoded_df.head()

Great! Now we need to save the dataframe in a csv file and we are ready to send the predictions. We will save it in the directory of our model, to have everything in one place.

In [53]:
test_dataset_encoded_df.to_csv("models/AST/predictions.csv", index=False)

And done! We have our CSV file with the predictions ready. Let's upload it via the challenge website and see our results!

The score is way better than the one from Random Forest. Remember that in this tutorial we are using a much more powerful model that was designed to work with audio data. But taking into account that the F1 metric ranges from 0 to 1, there is still some room for improvement. In the next tutorial, we will see how the model performs on the whole dataset. Then you will see what the model is really capable of! In the meantime, you can try to complete the exercises while making a coffee before the final tutorial.

***
**It is important that you name the columns exactly *file_name* and *predicted_class_id*, otherwise your score won't appear on the leaderboard!**
***
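Before uploading, it can be worth checking the header of the CSV file. The sketch below does this with the standard library; the helper name `has_required_columns` is made up, the check ignores the column order, and the demonstration writes a made-up prediction to a temporary file instead of reading *models/AST/predictions.csv*.

```python
import csv
import os
import tempfile

REQUIRED_COLUMNS = ["file_name", "predicted_class_id"]


def has_required_columns(csv_path):
    """Check that the CSV header consists of exactly the required column names."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return sorted(header) == sorted(REQUIRED_COLUMNS)


# Demonstration on a temporary file with one made-up prediction.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(REQUIRED_COLUMNS)
        writer.writerow(["test_file_1.wav", 3])
    columns_ok = has_required_columns(path)

print(columns_ok)
```

In the notebook you could call `has_required_columns("models/AST/predictions.csv")` right after saving the file.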

**Exercise time:**

The last exercise in this notebook is to 
* try to think how we could improve the model further apart from running it on the whole sample. What does your Data Science intuition tell you? Post your thoughts in the Teams channel and gain some recognition for your team! 😃
* try also to use another model from the 🤗 model hub. You will need to import other classes instead of ASTFeatureExtractor and ASTForAudioClassification. You will also need to change the string in the *from_pretrained* method and adjust the preprocessing. Sounds like a lot? Well, this is how we do Data Science! 😃

REMINDER: After finishing your work remember to shut down the instance.