### Contributor: Sudesh Kumar Santhosh Kumar
- Email: sudeshkumar.santhoshkumar@lexisnexis.com
- Date: 21st June, 2023

Description: This notebook shows how to fine-tune BERT for multi-class text classification, that is, tasks where the dataset has more than two classes. We'll be using the 20 Newsgroups dataset, a text classification dataset with 20 different classes.

**Multi-Label vs. Multi-Class**

It is important to distinguish between "Multi-Class" and "Multi-Label" as they pertain to different types of tasks and are handled differently.

In "Multi-Label" classification, documents can be assigned multiple labels. This means that a document may have several appropriate tags or categories associated with it. For instance, consider a system that suggests tags for documents where multiple tags may be relevant.

On the other hand, "Multi-Class" classification applies when each document is assigned only one category. If there are only two possible categories, it is known as "Binary" classification. If there are more than two categories, it is referred to as "Multi-Class" classification. In this scenario, each document is assigned to a single category, without the possibility of multiple labels.

By understanding these distinctions, we can appropriately handle and address the requirements of each type of classification task.
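
To make the distinction concrete, here is a minimal sketch of how the targets typically differ between the two settings. The three category names are a small subset of the 20 Newsgroups classes chosen for illustration, and the helper functions `encode_multi_class` and `encode_multi_label` are hypothetical names, not part of any library.

```python
# Illustrative subset of categories (an assumption for this sketch).
CLASSES = ["sci.space", "rec.autos", "talk.politics.misc"]


def encode_multi_class(label):
    """Multi-class: exactly one label per document, stored as a class index."""
    return CLASSES.index(label)


def encode_multi_label(labels):
    """Multi-label: any number of labels per document, stored as a multi-hot vector."""
    unknown = set(labels) - set(CLASSES)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in CLASSES]


# One document, one category: a single integer target.
multi_class_target = encode_multi_class("rec.autos")

# One document, several tags: one 0/1 entry per possible tag.
multi_label_target = encode_multi_label(["sci.space", "talk.politics.misc"])

print(multi_class_target)   # 1
print(multi_label_target)   # [1, 0, 1]
```

This notebook deals with the first case only: every newsgroup post gets a single integer label.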

# Basic Pre-requisites and Installation
---------------------------------
If you're running this notebook on Google Colab, uncomment the following cell to install the transformers library. If you're running it on your local machine through VS Code or Jupyter Notebook, run `pip install -r requirements.txt` before executing the code cells below.

In [None]:
# !pip install transformers
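
As an optional sanity check after installation, the following sketch reports which packages are still missing. The package list is an assumption derived from the imports used in this notebook, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Assumed list of top-level packages, based on this notebook's imports.
REQUIRED = ["torch", "transformers", "sklearn", "pandas", "numpy",
            "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```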

# Importing all necessary libraries
---------------------------------

In [1]:
import textwrap
import random
import csv
import os
import time
import datetime

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import torch
from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from transformers import BertTokenizer
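from transformers import BertForSequenceClassification, BertConfig
# The AdamW implementation in transformers is deprecated in favor of torch.optim.AdamW.
from torch.optim import AdamW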

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics




# Part I - Dataset & Tokenization
---------------------------------

## Setting up the Notebook

### Connecting to GPU 

In [2]:
# Checking if there is a GPU Available.
if torch.cuda.is_available():

    device = torch.device("cuda")
    print(f"There are {torch.cuda.device_count()} GPU(s) available.")
    print(f"We will use the GPU: {torch.cuda.get_device_name(0)}")

else:
    print("No GPU available, using the CPU!")
    device = torch.device("cpu")

No GPU available, using the CPU!

