<details><summary style="display:list-item; font-size:16px; color:blue;">Jupyter Help</summary>
    
Having trouble testing your work? Double-check that you have followed the steps below to write, run, save, and test your code!
    
[Click here for a walkthrough GIF of the steps below](https://static-assets.codecademy.com/Courses/ds-python/jupyter-help.gif)

Run all initial cells to import libraries and datasets. Then follow these steps for each question:
    
1. Add your solution to the cell marked `## YOUR SOLUTION HERE ##`.
2. Run the cell by selecting the `Run` button or pressing the `Shift`+`Enter` keys.
3. Save your work by selecting the `Save` button or pressing `command`+`s` (Mac) or `control`+`s` (Windows).
4. Select the `Test Work` button at the bottom left to test your work.

![Screenshot of the buttons at the top of a Jupyter Notebook. The Run and Save buttons are highlighted](https://static-assets.codecademy.com/Paths/ds-python/jupyter-buttons.png)

**Setup**

```python
import torch
import random
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from pprint import pprint

def set_seed(seed=42):
    # Seed Python's and PyTorch's random number generators so results are reproducible
    random.seed(seed)
    torch.manual_seed(seed)

set_seed()
```

To spare our Codecademy computers a little extra work, we've cut the IMDB dataset down to a smaller CSV file, which we'll import via pandas. If you'd like to see the Hugging Face dataset card for the full IMDB dataset, check it out [here](https://huggingface.co/datasets/imdb).

Execute the next four code cells to import the data and explore it.

### Imports and EDA

```python
df = pd.read_csv('imdb_data.csv')
print(df.head())
print("Number of null values:")
print(df.isnull().sum())
```

```
                                                text  label dataset
0  There is no relation at all between Fortier an...      1   train
1  This movie is a great. The plot is very true t...      1   train
2  George P. Cosmatos' "Rambo: First Blood Part I...      0   train
3  In the process of trying to establish the audi...      1   train
4  Yeh, I know -- you're quivering with excitemen...      0   train
Number of null values:
text       0
label      0
dataset    0
dtype: int64
```


As you can see, it's pretty simple: the `text` column holds the reviews themselves, while `label` is 1 or 0 (1 for positive sentiment, 0 for negative sentiment). Finally, the `dataset` column splits the data into training and test sets. There are no null values in the dataset. Thanks, Hugging Face!
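Because the train/test split lives in a column rather than in separate files, it can be worth checking how the labels are distributed within each split. `pd.crosstab` counts rows for every combination of two columns. A minimal sketch on a hypothetical toy DataFrame with the same three columns (in the notebook you would pass `df["dataset"]` and `df["label"]` instead):

```python
import pandas as pd

# Hypothetical toy rows mirroring the columns of imdb_data.csv
toy = pd.DataFrame({
    "text": ["Loved it", "Terrible", "Pretty good", "Awful acting", "A classic"],
    "label": [1, 0, 1, 0, 1],  # 1 = positive, 0 = negative
    "dataset": ["train", "train", "train", "test", "test"],
})

# Rows are splits, columns are labels, and each cell is a count of reviews
split_counts = pd.crosstab(toy["dataset"], toy["label"])
print(split_counts)
```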

```python
print("Dataframe Info:")
df.info()  # info() prints its summary directly and returns None, so it isn't wrapped in print()
print("\n")
print("Dataframe Description:")
print(df.describe())
print("\n")
print("Number of unique values in each column:")
print(df.nunique())
```

```
Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     2500 non-null   object
 1   label    2500 non-null   int64 
 2   dataset  2500 non-null   object
dtypes: int64(1), object(2)
memory usage: 58.7+ KB


Dataframe Description:
             label
count  2500.000000
mean      0.495200
std       0.500077
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       1.000000


Number of unique values in each column:
text       2499
label         2
dataset       2
dtype: int64
```
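The `nunique` output shows 2499 unique `text` values across 2500 rows, so one review appears twice. `DataFrame.duplicated` with `keep=False` flags every copy of a repeated value, which lets you inspect the copies side by side. A minimal sketch on hypothetical toy data (in the notebook, replace `toy` with `df`):

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 share the same review text
toy = pd.DataFrame({
    "text": ["Great film", "Boring plot", "Loved the cast", "Boring plot"],
    "label": [1, 0, 1, 0],
    "dataset": ["train", "train", "test", "test"],
})

# keep=False marks every occurrence of a duplicated text, not just the later ones
dupes = toy[toy.duplicated(subset="text", keep=False)]
print(dupes)
print("Unique texts:", toy["text"].nunique(), "of", len(toy), "rows")
```

Checking the `dataset` column of the flagged rows tells you whether the repeated review straddles the training and test splits.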


We've got 2500 rows, and the mean `label` of about 0.495 shows that positive and negative reviews are roughly balanced. Let's take a look at a random review to get a feel for how they sound.

```python
random_index = random.randint(0, len(df) - 1)
pprint(df.loc[random_index, 'text'])
```

```
('A great, funny, sweet movie with Morgan Freeman (who plays himself) and who '
 'meets a Spanish girl named Scarlet (Paz Vega) at a small store whilst '
 'researching a potential independent film. I was a bit dubious about the film '
 'for the first ten minutes but as soon as he was in the store I really '
 'started to enjoy the film. It shows how a positive attitude can change '
 'anything. It does not contain any complex plots and it is easy to follow but '
 'will lift the saddest of moods and make you smile all the way through '
 'without the need for petty cliché romance. It includes several scenes all '
 'the way through which make you clutch your sides with laughter. A very rare '
 'masterpiece!')
```


```python
# Curious how this tokenization code works? We'll cover it in more detail at the end of this exercise.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") # this tokenizer will work for our smaller model
tokenized_reviews = df['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
review_token_lengths = tokenized_reviews.apply(len)
print(f"Shortest review length (in tokens): {review_token_lengths.min()}")
print(f"Longest review length (in tokens): {review_token_lengths.max()}")
print(f"Average review length (in tokens): {review_token_lengths.mean()}")
```


```
Token indices sequence length is longer than the specified maximum sequence length for this model (936 > 512). Running this sequence through the model will result in indexing errors

Shortest review length (in tokens): 35
Longest review length (in tokens): 1470
Average review length (in tokens): 300.0068
```


Note the warning in the last cell's output above: some reviews in the dataset are longer than the 512 tokens our model can accept. We'll need to be careful about how we handle these longer reviews.
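Conceptually, the `truncation=True` and `padding="longest"` options used in the tokenization step below address this: each review is cut down to the model's maximum length, and every sequence in a batch is then padded to the length of the longest one, with an attention mask marking real tokens. The pure-Python sketch below illustrates that idea under stated assumptions: the inputs are already-tokenized content IDs without special tokens, the helper `truncate_and_pad` is hypothetical, and the IDs 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are taken from BERT's uncased vocabulary. It is not Hugging Face's implementation.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed BERT-uncased special-token IDs

def truncate_and_pad(batch, max_length=512):
    """Hypothetical helper: truncate each sequence, add special tokens, pad to the longest."""
    # Reserve two slots for [CLS] and [SEP]; keep the start of each review, drop the end
    sequences = [[CLS_ID] + ids[: max_length - 2] + [SEP_ID] for ids in batch]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 934 and 33 content tokens (936 and 35 once [CLS] and [SEP] are added)
batch = [list(range(1000, 1934)), list(range(2000, 2033))]
encoded = truncate_and_pad(batch)
print([len(ids) for ids in encoded["input_ids"]])
print(sum(encoded["attention_mask"][1]))
```

Like the tokenizer's default right-side truncation, this keeps the beginning of a long review and discards the rest, so any sentiment expressed only at the end of a very long review is lost.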

### Evaluating the base model

#### Checkpoint 1/3

Now that we've explored our data, let's evaluate the base model's performance with test data. This way, we'll have a basic idea of how much better our finetuned model is at classifying the sentiment of movie reviews relative to the model it was based on.

Hugging Face's `datasets` library has a helpful class method, `Dataset.from_pandas`, that converts a pandas DataFrame into a Hugging Face `Dataset`.

Call `Dataset.from_pandas` below, passing in our DataFrame (`df`) as the argument, and assign the result to `dataset`.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

```python
from datasets import Dataset

## YOUR SOLUTION HERE ##
dataset = Dataset.from_pandas(df)
```

#### Checkpoint 2/3

Next we'll tokenize the data.

Hugging Face datasets come with a useful `.map()` method. Pass it a function that takes in the data you want tokenized and returns the tokenized data, and `.map()` applies that function across the whole dataset.

We've defined `tokenize_function` for you. At the bottom of the cell, call our `dataset`'s `.map()` method to tokenize the data: pass `tokenize_function` as the first argument and `batched=True` as the second, so the text is tokenized in batches rather than one review at a time. Assign the result to `tokenized_dataset`.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

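```python
def tokenize_function(examples):
    # Pad each batch to its longest sequence and truncate reviews longer than the model's maximum length
    return tokenizer(examples['text'], padding="longest", truncation=True)

## YOUR SOLUTION HERE ##
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```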




Good. We'll now instantiate our model, a tiny version of BERT that will be great for learning the basics of writing finetuning code. Execute the cell below.

```python
model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=2)
```

```
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```


#### Checkpoint 3/3

The warning above correctly reminds us that bert-tiny's classification head (`classifier.weight` and `classifier.bias`) was newly initialized, so the model isn't ready for real predictions until it has been finetuned. We're going to evaluate it on our test data anyway.

That way, we'll have a basic idea of whether or not our finetuning run actually improves the model's ability to classify movie reviews.

If you see some unfamiliar code in the cell below, don't worry. We'll cover everything in greater detail soon. For now, pass the arguments `do_train=False` and `do_eval=True` to the `TrainingArguments()` call to signal that this run is for evaluation only, not training. (Strictly speaking, `Trainer` doesn't act on these two flags itself; calling `trainer.evaluate()` is what switches the model into evaluation mode.)

Then, on the line that defines `eval_results`, call `trainer.evaluate()`. It returns a dictionary of evaluation metrics, which the next line prints to the console.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

```python
# this is a Hugging Face method that makes it easy to filter our dataset for only 'test' data
eval_dataset = tokenized_dataset.filter(lambda x: x['dataset'] == 'test')

## YOUR SOLUTION HERE ##
training_args = TrainingArguments(
    output_dir='./temp_results',  # directory for checkpoints and other outputs
    do_train=False,
    do_eval=True,
    seed=42
)

# For evaluation, Trainer only needs the model, its arguments, and the data to evaluate on
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=eval_dataset
)

## YOUR SOLUTION HERE ##
eval_results = trainer.evaluate()  # returns a dict of metrics, including eval_loss
print(eval_results)
```

```
{'eval_loss': 0.7226571440696716, 'eval_runtime': 1.5537, 'eval_samples_per_second': 804.506, 'eval_steps_per_second': 101.046}
```


Consider writing down the `eval_loss` value (about 0.723 in this run), since we'll compare it with the finetuned models' losses later.
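For context, `BertForSequenceClassification` with `num_labels=2` and integer labels computes cross-entropy loss, which averages -log(probability assigned to the true label) over the examples. A model that assigns 50/50 probability to every review scores ln 2 ≈ 0.693 on that loss, so an `eval_loss` of about 0.72 puts the untrained classifier roughly at chance level. The stdlib sketch below uses a hypothetical helper, `mean_cross_entropy`, and made-up probabilities to show that reference point.

```python
import math

def mean_cross_entropy(p_positive, labels):
    """Hypothetical helper: average of -log(probability assigned to the true label)."""
    losses = []
    for p, y in zip(p_positive, labels):
        p_true = p if y == 1 else 1.0 - p  # p is the predicted probability of label 1
        losses.append(-math.log(p_true))
    return sum(losses) / len(losses)

labels = [1, 0, 1, 1, 0, 0]

guessing = mean_cross_entropy([0.5] * 6, labels)
leaning_wrong = mean_cross_entropy([0.45, 0.55, 0.45, 0.45, 0.55, 0.55], labels)
mostly_right = mean_cross_entropy([0.9, 0.1, 0.9, 0.9, 0.1, 0.1], labels)

print(f"50/50 guessing:         {guessing:.4f}")
print(f"Leaning slightly wrong: {leaning_wrong:.4f}")
print(f"Mostly right:           {mostly_right:.4f}")
```

Lower is better: a model that leans slightly the wrong way lands just above ln 2, much like our untrained bert-tiny, while a successful finetuning run should push `eval_loss` well below it.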