### Name: Rownita Tasneem

### Load a dataset, and tokenize and vectorize the data

In this exercise, the task is to load a text classification dataset using the `datasets` Python library, and to tokenize and vectorize the loaded data using the tokenizer created in exercise task 1. This exercise builds towards a full model training notebook.

1) Load the `imdb` movie review dataset using the `datasets` Python library. Here is a helper notebook from the Introduction to Language Technology course in case you are not familiar with `datasets` or need a reminder.

2) Tokenize and vectorize the dataset using the tokenizer created in exercise task 1. The tokenizer can be either monolingual English or multilingual (e.g. `bert-base-cased` or `bert-base-multilingual-cased`). The outcome of a tokenized and vectorized example should look something like this (some tokenizers do not produce `token_type_ids`):


{'attention_mask': [1, 1, 1, ... , 1],
'input_ids': [101, 146, 12765, ... , 102],
'token_type_ids': [0, 0, 0, ..., 0]}

Hint: To tokenize and vectorize the whole dataset, write a function which receives one example and returns its tokenized and vectorized version. Apply this function to each example in the dataset using `dataset.map()`.
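
The pattern described in the hint can be sketched in plain Python. This is only an illustration under stated assumptions: `TOY_VOCAB`, `tokenize_example` and `map_examples` are hypothetical names, the whitespace tokenizer is a stand-in for a real BERT tokenizer, and `map_examples` merely imitates the per-example mapping idea rather than reproducing the `datasets` library.

```python
# Hypothetical toy vocabulary; a real tokenizer ships its own, much larger one.
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "a": 3, "great": 4, "film": 5, "dull": 6}


def tokenize_example(example):
    """Receive one example and return the new fields to attach to it."""
    words = example["text"].lower().split()
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]
    ids.append(TOY_VOCAB["[SEP]"])
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def map_examples(examples, fn):
    """Plain-Python imitation of per-example mapping:
    the dict returned by fn is merged into each example."""
    return [{**example, **fn(example)} for example in examples]


toy_dataset = [
    {"text": "A great film", "label": 1},
    {"text": "A dull movie", "label": 0},
]
mapped = map_examples(toy_dataset, tokenize_example)
print(mapped[0])
```

With the real library, a function of the same shape would be passed to `dataset.map()` instead of the hypothetical `map_examples` helper.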

Upload your solutions as a .ipynb notebook, preferably, or a .py file with its output if need be.

### Step-1: Install Required Libraries

In [None]:
pip install datasets transformers



In [None]:
! pip install --quiet datasets ## install the datasets Python package on the system

In [None]:
import datasets

In [None]:
datasets.disable_progress_bar() ## disable the progress bar to make output from dataset loading a bit less verbose

In [None]:
DATASET_NAME = "imdb"

In [None]:
from datasets import load_dataset

In [None]:
! pip install transformers



### Step-2: Loading the IMDb dataset

In [None]:
imdb_data = load_dataset(DATASET_NAME) # Loading a dataset from the repository simply by invoking the load_dataset function with the name of the dataset

In [None]:
builder = datasets.load_dataset_builder(DATASET_NAME) ## getting some general information from the dataset

In [None]:
print(builder.info.description)




### General Dataset Information

In [None]:
print(builder.info.citation)




### Step-3: Loading the Tokenizer

In [None]:
from transformers import BertTokenizer  ##importing BertTokenizer class from the transformers library. This class is used for tokenizing text with BERT models

In [None]:
## Load the tokenizer for the multilingual model
multi_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased") ## loads the tokenizer for the multilingual BERT model (bert-base-multilingual-cased), which can handle multiple languages

### Step-4: Define a function to tokenize the input text

In [None]:
def tokenize_function(examples):
    return multi_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512) # Tokenize the text, pad to max_length, and truncate anything longer than 512 tokens

### Step-5: Apply the tokenize function to the dataset using the map function

In [None]:
tokenized_imdb_dataset = imdb_data.map(tokenize_function, batched=True) # Apply the tokenize function to each batch of the dataset using the map function

### Step-6: Show an example of the tokenized dataset

In [None]:
print(tokenized_imdb_dataset["train"][2])

{'text': "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />", 'label': 0, 'input_ids': [101, 14535, 10893, 10114, 33253, 14293, 10531, 12807, 10108, 10458, 10106, 10105, 16711, 119, 10747, 10458, 10124, 64888, 10146, 10151, 48580, 10473, 27024, 10192, 11170, 22500, 13617, 119, 133, 33989, 120, 135, 133, 33989, 120, 135, 11340, 20970, 38008, 13953, 10991, 13499, 10142, 62151, 77586, 11680, 10271, 12373, 10271, 54981, 10171, 10135, 10380, 11299, 97126, 93520, 46935, 41275, 11090, 17850, 10473, 10271, 15107, 10380, 136

The "label" represents the target value for the classification task. In supervised learning, this is the known answer (ground truth) that the model is trying to predict. Here, 0 represents negative sentiment.

Input IDs: These are the numerical form of the tokenized input text. Transformer models like BERT and GPT cannot process raw text directly, so the text is first converted into token IDs. Each word or subword in the text is mapped to a unique numerical ID from the model's vocabulary.
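
To make the word-or-subword mapping concrete, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece, where continuation pieces carry a `##` prefix. The vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `to_ids` are hypothetical; the real tokenizer also handles punctuation and normalization, which this sketch ignores.

```python
# Hypothetical toy vocabulary; IDs do not correspond to any real model.
TOY_VOCAB = {"[UNK]": 0, "point": 1, "##less": 2, "film": 3, "##s": 4}


def wordpiece(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def to_ids(text, vocab):
    """Map whitespace-separated words to a flat list of token IDs."""
    ids = []
    for word in text.split():
        ids.extend(vocab[piece] for piece in wordpiece(word, vocab))
    return ids


print(wordpiece("pointless", TOY_VOCAB))
print(to_ids("pointless films xyz", TOY_VOCAB))
```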


Token Type IDs: Token type IDs are used in tasks that require distinguishing between two sequences, such as question answering or sentence-pair classification. They tell the model which of the two parts each token belongs to.

In BERT, when two sentences are fed in together (e.g., for question answering or textual entailment), tokens from the first sentence are assigned token_type_id = 0 and tokens from the second sentence are assigned token_type_id = 1. This helps the model know which token belongs to which sentence.
For single-sentence tasks such as this one, all token type IDs are 0.
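
The segment assignment can be illustrated with a small sketch. It assumes the usual BERT layout of `[CLS] A [SEP] B [SEP]`, with the final `[SEP]` counted as part of the second segment; `build_pair` is a hypothetical helper working on already-split tokens, not a library function.

```python
def build_pair(tokens_a, tokens_b=None):
    """Return (tokens, token_type_ids) for one sequence or a sequence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)  # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        second = tokens_b + ["[SEP]"]
        tokens += second
        type_ids += [1] * len(second)  # second segment, including its [SEP]
    return tokens, type_ids


pair_tokens, pair_types = build_pair(["who", "directed", "it"], ["nolan"])
single_tokens, single_types = build_pair(["a", "great", "film"])
print(pair_tokens, pair_types)
print(single_tokens, single_types)
```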

Attention Mask: The attention mask indicates which tokens the model should attend to. This is especially useful when padding is added to make all input sequences the same length.

Example: If the original sentence has 10 tokens but the input needs to be padded to 15 tokens, the first 10 tokens have an attention mask of 1 (valid tokens) and the 5 padding tokens have an attention mask of 0 (tokens to be ignored). Similarly, the attention mask for a sentence with 5 real tokens and 5 padding tokens is [1, 1, 1, 1, 1, 0, 0, 0, 0, 0].
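
As a rough sketch of how padding and the mask go together, the hypothetical helper below pads a list of token IDs to a fixed length and builds the matching attention mask. The padding ID of 0 and the naive cut-off for over-long inputs are assumptions made for illustration; a real tokenizer handles truncation more carefully.

```python
def pad_with_mask(input_ids, max_length, pad_id=0):
    """Pad (or naively cut) input_ids to max_length and build the attention mask."""
    kept = input_ids[:max_length]  # naive truncation, for illustration only
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


ten_tokens = list(range(101, 111))  # 10 made-up token IDs
padded = pad_with_mask(ten_tokens, max_length=15)
print(padded["attention_mask"])
```

The printed mask has ten 1s followed by five 0s, matching the 10-token example above.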

### Step-7: Remove Unnecessary Columns and Set Format for PyTorch or TensorFlow

In [None]:
# Optional: drop the raw text column and return the selected columns as PyTorch tensors
tokenized_imdb_dataset = tokenized_imdb_dataset.remove_columns(["text"])
tokenized_imdb_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


In [None]:
print(tokenized_imdb_dataset["train"][2])

{'label': tensor(0), 'input_ids': tensor([   101,  14535,  10893,  10114,  33253,  14293,  10531,  12807,  10108,
         10458,  10106,  10105,  16711,    119,  10747,  10458,  10124,  64888,
         10146,  10151,  48580,  10473,  27024,  10192,  11170,  22500,  13617,
           119,    133,  33989,    120,    135,    133,  33989,    120,    135,
         11340,  20970,  38008,  13953,  10991,  13499,  10142,  62151,  77586,
         11680,  10271,  12373,  10271,  54981,  10171,  10135,  10380,  11299,
         97126,  93520,  46935,  41275,  11090,  17850,  10473,  10271,  15107,
         10380,  13663,  11178,  27224,  19753,  11203, 101101,    119,  10117,
         17904,  10165,  21405,  14942,  10169,  10192,  10751, 105624,    113,
         60015,  10464,  21405,  10741,  10169,  10464,  11371,  10464,    112,
           187,  21133,  11471,  72975,    117,  10146,  10271,  11337,  10106,
         53951,  38565,  10149,  10939,  10531,  12331,  14985,  10458,    114,
      