## Natural Language Processing
NLP is a field of machine learning concerned with understanding both the individual words in a sentence and the context in which they are used. Common NLP tasks include classifying whole sentences, tagging each word in a sentence (part of speech such as noun, verb, or adjective, or named entity), generating text to fill in masked words, extracting information from a context, and translating or summarizing text.

### Transformers 
One family of models used to solve NLP tasks is the transformer. I will be using the transformers library from Hugging Face.
The pipeline API function connects a pre-trained model with its pre-processing and post-processing steps, so raw text goes in and a prediction comes out. For sentiment analysis, that prediction is the sentiment label of the sentence together with a confidence score. For example, the zero-shot classification pipeline classifies a sentence against candidate labels that you supply, without needing annotated text. The text generation pipeline auto-completes a provided prompt by generating the remaining text. Other pipelines include named entity recognition, question answering, summarization, and translation.

### Categories of pre-trained transformers models
- GPT-like (auto-regressive)
- BERT-like (auto-encoding)
- BART/T5-like (sequence-to-sequence)

### Transformer architecture
-	Encoder: encodes the input into a numerical representation of each token.
-	Decoder: uses the encoder’s representation along with other inputs to generate a target sequence.
-	Encoder-only models: bi-directional; suited to sentence classification (e.g. sentiment analysis) and named entity recognition. Ex: BERT, RoBERTa, ALBERT
-	Decoder-only models: unidirectional; suited to text generation. Ex: CTRL, GPT, GPT-2, Transformer XL
-	Encoder-decoder or sequence-to-sequence models: suited to translation or summarization. Ex: BART, mBART, Marian, T5
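
The bi-directional versus unidirectional distinction in the list above comes down to which positions each token is allowed to attend to. A minimal sketch in plain Python; the helper names `bidirectional_mask` and `causal_mask` are hypothetical and not part of the transformers library:

```python
def bidirectional_mask(n):
    """Encoder-style visibility: every position can attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style visibility: position i can attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


enc = bidirectional_mask(4)
dec = causal_mask(4)

# Each row is one query position; a 1 means "may attend to this column".
for row in dec:
    print(row)
```

The lower-triangular shape of the decoder mask is what lets a decoder-only model generate text one token at a time without seeing the future.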

## Using Transformers
Stage 1 (tokenizer) --> Stage 2 (model) --> Stage 3 (post-processing)

Raw text --> Input IDs --> Logits (not probabilities) --> Predictions

Tokenizer:
-	Splits the input into words, subwords, or symbols, referred to as tokens
-	Maps each token to an integer
-	Loaded with the “AutoTokenizer” class and its “from_pretrained” method
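
To make the two tokenizer steps concrete, here is a toy sketch that splits text into tokens and maps each token to an integer. Everything in it is an assumption for illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical, and the IDs are made up. A real tokenizer loads its vocabulary from the checkpoint.

```python
# Toy vocabulary: a hypothetical stand-in for a real tokenizer's vocabulary file.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "company": 5, "reported": 6, "growth": 7, ".": 8,
}


def toy_tokenize(text):
    """Lowercase, split on whitespace, and peel a trailing period into its own token."""
    tokens = []
    for word in text.lower().split():
        if word.endswith(".") and len(word) > 1:
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens


def toy_encode(text):
    """Map tokens to integer IDs, wrapped in start/end special tokens."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the unknown-token ID.
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]


tokens = toy_tokenize("The company reported strong growth.")
input_ids = toy_encode("The company reported strong growth.")
print(tokens)
print(input_ids)
```

The word "strong" is not in the toy vocabulary, so it maps to the unknown-token ID. Subword tokenizers reduce how often this happens by splitting rare words into smaller known pieces.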


In [1]:
import pandas as pd
# load dataset
df = pd.read_csv(r"C:\Users\minnu\Desktop\Sentiment-analysis-using-NLP\dataset_exploration\all-data.csv", names = ['label', 'sentences'], sep=',', encoding='latin-1')
df.head()

Unnamed: 0,label,sentences
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


In [2]:
df_2 = df.copy()

In [3]:
# Preprocessing with Tokenizer
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [4]:
raw_inputs = [df_2['sentences'][2],df_2['sentences'][4]] 
inputs = tokenizer(raw_inputs, padding= True, truncation = True, return_tensors = 'pt')
print(inputs)

{'input_ids': tensor([[  101,  1996,  2248,  4816,  3068,  2194,  3449, 19800,  4160,  2038,
          4201,  2125, 15295,  1997,  5126,  2013,  2049, 21169,  4322,  1025,
         10043,  2000,  3041,  3913, 27475,  1996,  2194, 11016,  1996,  6938,
          1997,  2049,  2436,  3667,  1010,  1996,  3679,  2695, 14428,  2229,
          2988,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0],
        [  101,  2429,  2000,  1996,  2194,  1005,  1055,  7172,  5656,  2005,
          1996,  2086,  2268,  1011,  2262,  1010, 19021,  8059,  7889,  1037,
          2146,  1011,  2744,  5658,  4341,  3930,  1999,  1996,  2846,  1997,
          2322,  1003,  1011,  2871,  1003,  2007,  2019,  4082,  5618,  7785,
          1997,  2184,  1003,  1011,  2322,  1003,  1997,  5658,  4341,  1012,
           102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [5]:
# Going through the model
from transformers import AutoModel   # AutoModel instantiates the model from the checkpoint
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)


Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# output displaying the batch size, sequence length, hidden vector dimension
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 51, 768])


In [7]:
# Model heads: making sense out of numbers
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)

tensor([[ 1.7797, -1.6082],
        [-0.9567,  0.8728]], grad_fn=<AddmmBackward>)


In [8]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim =-1)
print(predictions)

tensor([[0.9673, 0.0327],
        [0.1383, 0.8617]], grad_fn=<SoftmaxBackward>)
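
The softmax step in the cell above can be reproduced by hand, which shows how logits become probabilities that sum to 1 per sentence. This is a plain-Python sketch using the logits printed earlier. The label names are an assumption: this checkpoint's `model.config.id2label` is expected to map 0 to NEGATIVE and 1 to POSITIVE, but it is worth checking before relying on it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output of the sequence classification model above.
logits = [[1.7797, -1.6082], [-0.9567, 0.8728]]
LABELS = ["NEGATIVE", "POSITIVE"]  # assumed order; verify with model.config.id2label

probs = [softmax(row) for row in logits]
for row in probs:
    best = row.index(max(row))
    print([round(p, 4) for p in row], LABELS[best])
```

The first sentence (layoffs) comes out as class 0 and the second (sales growth) as class 1, in line with the probabilities printed by torch.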


## From tokenizer to model
1. Initialize the BERT model

In [9]:
# A configuration object (BertConfig here, or AutoConfig for any checkpoint)
# holds the settings needed to build the model architecture.
# Building a model from a config alone gives randomly initialized weights:
'''
from transformers import BertConfig, BertModel 
config = BertConfig()
model = BertModel(config)   # Model is randomly initialized
'''
from transformers import BertModel 
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2. **Tokenizers**: A tokenizer translates raw text into numerical data that the model can process. Tokenization can be word-based, character-based, or subword-based.

In [10]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = df_2["sentences"][2]
tokens = tokenizer.tokenize(sequence)

print(tokens)


['The', 'international', 'electronic', 'industry', 'company', 'El', '##cote', '##q', 'has', 'laid', 'off', 'tens', 'of', 'employees', 'from', 'its', 'Tallinn', 'facility', ';', 'contrary', 'to', 'earlier', 'lay', '##offs', 'the', 'company', 'contracted', 'the', 'ranks', 'of', 'its', 'office', 'workers', ',', 'the', 'daily', 'Post', '##ime', '##es', 'reported', '.']
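
The `##` pieces in the output above come from subword tokenization: a word missing from the vocabulary is split into known pieces, and `##` marks a piece that continues the previous one. The sketch below contrasts word-based, character-based, and a greedy longest-match subword split. The function `wordpiece` and the tiny `vocab` are hypothetical and only approximate what a real BERT tokenizer does.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical mini-vocabulary, just large enough for the examples below.
vocab = {"El", "##cote", "##q", "lay", "##offs", "company"}

word_based = "earlier layoffs".split()
char_based = list("lay")
subword_a = wordpiece("Elcoteq", vocab)
subword_b = wordpiece("layoffs", vocab)
subword_c = wordpiece("xyz", vocab)

print(word_based, char_based)
print(subword_a, subword_b, subword_c)
```

Subword splitting keeps the vocabulary small while still representing rare names such as Elcoteq with meaningful pieces instead of a single unknown token.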


In [11]:
# From tokens to input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1109, 1835, 4828, 2380, 1419, 2896, 21596, 4426, 1144, 3390, 1228, 17265, 1104, 4570, 1121, 1157, 23277, 3695, 132, 11565, 1106, 2206, 3191, 18438, 1103, 1419, 11058, 1103, 6496, 1104, 1157, 1701, 3239, 117, 1103, 3828, 3799, 10453, 1279, 2103, 119]


In [12]:
# Decode
decode_string = tokenizer.decode(ids)
print(decode_string)

The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers, the daily Postimees reported.


3. **Handling multiple sequences**: 
The model expects a batch of inputs, i.e. multiple sentences sent through the model at once. Padding makes all the sentences the same length by adding a special padding token to the shorter ones. To get the same result whether a sentence is passed on its own or inside a padded batch, the attention layers must be told to ignore the padding tokens. This is done with an attention mask. Sequences that are too long can be truncated by passing truncation=True to the tokenizer, optionally with the max_length parameter.
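
A minimal sketch of padding, attention masks, and truncation in plain Python. The helper `pad_batch`, the padding ID of 0, and the token IDs are assumptions for illustration; a real tokenizer exposes its padding ID as `tokenizer.pad_token_id` and handles special tokens when truncating.

```python
PAD_ID = 0  # assumed padding ID for this sketch


def pad_batch(sequences, pad_id=PAD_ID, max_length=None):
    """Truncate to max_length (if given), then right-pad to the longest sequence."""
    if max_length is not None:
        # Naive truncation: a real tokenizer would keep the closing special token.
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the attention layers should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for a short and a longer sentence.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
truncated = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]], max_length=4)

print(batch["input_ids"])
print(batch["attention_mask"])
```

This mirrors the earlier tokenizer output, where the shorter sentence ends in a run of zeros in input_ids.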

**Now that we have seen how the tokenizer and the pretrained model work on a few sequences, it's time to fine-tune the pretrained model on the whole dataset.**

## Fine-tuning a pretrained model 


The **Trainer** class in transformers helps to fine-tune any pretrained model on a dataset. Before defining the Trainer, we start with data processing. The **financial_phrasebank** dataset is loaded as a single train split, so it needs to be divided into train/validation/test sets before we define the model and pass in the training arguments.

In [39]:
# load  dataset and split the data into train/validation/test datasets
from datasets import load_dataset, DatasetDict

raw_datasets = load_dataset("financial_phrasebank", "sentences_50agree")
# 90% train and 10% test + validation
train_test_ds = raw_datasets["train"].train_test_split(test_size=0.1)

# Split the 10% test + valid in half test, half valid
test_valid = train_test_ds['test'].train_test_split(test_size=0.5)

# Gather everything into a single DatasetDict
dataset = DatasetDict({
    'train': train_test_ds['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
dataset

Reusing dataset financial_phrasebank (C:\Users\minnu\.cache\huggingface\datasets\financial_phrasebank\sentences_50agree\1.0.0\a6d468761d4e0c8ae215c77367e1092bead39deb08fbf4bffd7c0a6991febbf0)
100%|██████████| 1/1 [00:00<00:00, 336.11it/s]
Loading cached split indices for dataset at C:\Users\minnu\.cache\huggingface\datasets\financial_phrasebank\sentences_50agree\1.0.0\a6d468761d4e0c8ae215c77367e1092bead39deb08fbf4bffd7c0a6991febbf0\cache-011b3c541a7c235d.arrow and C:\Users\minnu\.cache\huggingface\datasets\financial_phrasebank\sentences_50agree\1.0.0\a6d468761d4e0c8ae215c77367e1092bead39deb08fbf4bffd7c0a6991febbf0\cache-a0e7c2b291cdaf5d.arrow
Loading cached split indices for dataset at C:\Users\minnu\.cache\huggingface\datasets\financial_phrasebank\sentences_50agree\1.0.0\a6d468761d4e0c8ae215c77367e1092bead39deb08fbf4bffd7c0a6991febbf0\cache-70b9cd58c8c6618c.arrow and C:\Users\minnu\.cache\huggingface\datasets\financial_phrasebank\sentences_50agree\1.0.0\a6d468761d4e0c8ae215c77367e1092

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 4361
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 243
    })
    valid: Dataset({
        features: ['sentence', 'label'],
        num_rows: 242
    })
})
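
The row counts above follow from the two split ratios. The sketch below reproduces the arithmetic, assuming the test share is rounded up to a whole number of rows; that rounding rule is an assumption that happens to match the counts shown, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up (assumed rounding rule)."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


total = 4361 + 243 + 242  # total rows across the three splits shown above

# First split: 90% train, 10% held out.
n_train, n_holdout = split_sizes(total, 0.1)

# Second split: the held-out rows are halved; the "train" half becomes validation.
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(total, n_train, n_valid, n_test)
```

Because the rows are shuffled at random, the exact sentences in each split change between runs unless a seed is passed to train_test_split.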

In [37]:
dataset["train"][0]

{'sentence': 'The bank sees a potential for Getinge share to rise .',
 'label': 2}

The labels are already integers. To see which label each integer corresponds to, inspect the features of the dataset.

In [40]:
dataset["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=3, names=['negative', 'neutral', 'positive'], names_file=None, id=None)}

The label column is of type ClassLabel: 0 corresponds to 'negative', 1 to 'neutral', and 2 to 'positive'.

### Preprocess the dataset
Convert the text to numbers the model can make sense of. This can be done using a tokenizer.

In [46]:
from transformers import AutoTokenizer

checkpoint= "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(dataset["train"]["sentence"])
# inputs

The inputs object is a dictionary with the keys 'input_ids', 'token_type_ids', and 'attention_mask'. To keep the data as a dataset, we will use the Dataset.map method. The map method works by applying a function to each element of the dataset, so let’s define a function that tokenizes our inputs:
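
As a rough mental model of what map does with the returned dictionary, the sketch below applies a function to each row and merges the returned keys into that row as new columns. `toy_map` and `toy_features` are hypothetical stand-ins, not the datasets library's implementation.

```python
def toy_map(rows, fn):
    """Apply fn to each row and merge the returned keys into a copy of that row."""
    return [{**row, **fn(row)} for row in rows]


def toy_features(example):
    """Hypothetical per-example function: returns new columns derived from the text."""
    words = example["sentence"].split()
    return {"num_words": len(words), "first_word": words[0]}


rows = [
    {"sentence": "Sales rose sharply .", "label": 2},
    {"sentence": "The company laid off staff .", "label": 0},
]
mapped = toy_map(rows, toy_features)

print(sorted(mapped[0].keys()))
```

The original columns are kept and the new ones are added alongside them, which is why the tokenized dataset still contains sentence and label.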

In [47]:
def tokenize_function(example):
    # Padding is left to the data collator, which pads each batch to its longest sequence
    return tokenizer(example["sentence"], truncation=True)

In [48]:
tokenized_datasets = dataset.map(tokenize_function)


100%|██████████| 4361/4361 [00:01<00:00, 3403.98ex/s]
100%|██████████| 243/243 [00:00<00:00, 3572.34ex/s]
100%|██████████| 242/242 [00:00<00:00, 2712.32ex/s]


In [49]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4361
    })
    test: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 243
    })
    valid: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 242
    })
})

Our tokenize_function returns a dictionary with the keys input_ids, attention_mask, and token_type_ids, so those three fields are added to all splits of our dataset.

In [51]:
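from transformers import DataCollatorWithPadding

# Pads each batch to the length of its longest sequence (dynamic padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)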

## Training

In [52]:
# defining training arguments 
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")

# defining the model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [54]:
from transformers import Trainer
trainer = Trainer(
    model, 
    training_args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["valid"],
    data_collator = data_collator,
    tokenizer = tokenizer
)

In [55]:
# fine-tune the model on our dataset 
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence.
***** Running training *****
  Num examples = 4361
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1638
 31%|███       | 500/1638 [24:40<48:34,  2.56s/it]Saving model checkpoint to test-trainer\checkpoint-500
Configuration saved in test-trainer\checkpoint-500\config.json


{'loss': 0.5593, 'learning_rate': 3.473748473748474e-05, 'epoch': 0.92}


Model weights saved in test-trainer\checkpoint-500\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-500\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-500\special_tokens_map.json
 61%|██████    | 1000/1638 [48:55<25:54,  2.44s/it]Saving model checkpoint to test-trainer\checkpoint-1000
Configuration saved in test-trainer\checkpoint-1000\config.json


{'loss': 0.3197, 'learning_rate': 1.9474969474969477e-05, 'epoch': 1.83}


Model weights saved in test-trainer\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-1000\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-1000\special_tokens_map.json
 92%|█████████▏| 1500/1638 [1:11:32<05:24,  2.35s/it]Saving model checkpoint to test-trainer\checkpoint-1500
Configuration saved in test-trainer\checkpoint-1500\config.json


{'loss': 0.1462, 'learning_rate': 4.212454212454213e-06, 'epoch': 2.75}


Model weights saved in test-trainer\checkpoint-1500\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-1500\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-1500\special_tokens_map.json
100%|██████████| 1638/1638 [1:17:41<00:00,  1.98s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 1638/1638 [1:17:41<00:00,  2.85s/it]

{'train_runtime': 4661.0553, 'train_samples_per_second': 2.807, 'train_steps_per_second': 0.351, 'train_loss': 0.32459213037921686, 'epoch': 3.0}





TrainOutput(global_step=1638, training_loss=0.32459213037921686, metrics={'train_runtime': 4661.0553, 'train_samples_per_second': 2.807, 'train_steps_per_second': 0.351, 'train_loss': 0.32459213037921686, 'epoch': 3.0})

## Evaluation

In [59]:
predictions = trainer.predict(tokenized_datasets["valid"])
print(predictions.predictions.shape, predictions.label_ids.shape)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence.
***** Running Prediction *****
  Num examples = 242
  Batch size = 8
124it [01:58,  2.07it/s]

(242, 3) (242,)


In [67]:
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
preds

array([1, 0, 1, 0, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       2, 1, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 0, 2, 2, 1,
       1, 1, 1, 0, 2, 2, 1, 1, 1, 1, 1, 2, 0, 1, 2, 1, 1, 0, 1, 2, 1, 1,
       0, 1, 1, 1, 1, 0, 2, 2, 1, 1, 1, 1, 0, 1, 1, 1, 2, 1, 1, 1, 0, 2,
       1, 2, 1, 1, 0, 2, 1, 1, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 2, 1, 1,
       0, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0, 2, 2,
       0, 1, 2, 1, 1, 1, 1, 1, 2, 0, 2, 1, 1, 2, 1, 1, 1, 1, 2, 0, 2, 2,
       1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1,
       2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 0, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 0, 0, 1, 1, 0, 1, 1, 2, 0, 1, 2, 1, 2, 2, 1, 1,
       1, 1, 2, 1, 2, 1, 0, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 0],
      dtype=int64)
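
The predicted class indices only become useful once they are compared with the reference labels. A minimal sketch of that comparison in plain Python; the logits and labels are made-up stand-ins for predictions.predictions and predictions.label_ids, and `argmax` and `accuracy` are hypothetical helpers.

```python
LABEL_NAMES = ["negative", "neutral", "positive"]  # order taken from the ClassLabel feature


def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(pred_ids, label_ids):
    """Fraction of predictions that match the reference labels."""
    correct = sum(int(p == y) for p, y in zip(pred_ids, label_ids))
    return correct / len(label_ids)


# Hypothetical logits (one row per sentence) and reference labels.
toy_logits = [[2.1, 0.3, -1.0], [-0.5, 1.7, 0.2], [0.1, 0.4, 2.2], [0.9, 1.1, -0.3]]
toy_labels = [0, 1, 2, 0]

toy_preds = [argmax(row) for row in toy_logits]
toy_accuracy = accuracy(toy_preds, toy_labels)

print([LABEL_NAMES[i] for i in toy_preds])
print(toy_accuracy)
```

With the real arrays, the same comparison can be written in NumPy as `(preds == predictions.label_ids).mean()`.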

## A full training
Before actually writing our training loop, we will need to define a few objects. The first ones are the dataloaders we will use to iterate over batches. But before we can define those dataloaders, we need to apply a bit of postprocessing to our tokenized_datasets: remove the columns the model does not expect and rename the label column to labels.

In [69]:
tokenized_datasets = tokenized_datasets.remove_columns(
    ["sentence"]
)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [70]:
# Define dataloaders
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["valid"], batch_size=8, collate_fn=data_collator
)

In [71]:
#inspecting dataloaders
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 54]),
 'input_ids': torch.Size([8, 54]),
 'token_type_ids': torch.Size([8, 54]),
 'labels': torch.Size([8])}

In [72]:
# Define the model 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\minnu/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.9.2",
  "type_vocab_size": 2,
  "use_cache": tru

In [73]:
# Pass batch to the model
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(1.3863, grad_fn=<NllLossBackward>) torch.Size([8, 3])


Now we need an optimizer (AdamW) and a learning rate scheduler.

In [74]:
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in newer releases
optimizer = AdamW(model.parameters(), lr = 5e-5)

In [75]:
from transformers import get_scheduler 

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer = optimizer,
    num_warmup_steps =0, 
    num_training_steps = num_training_steps
)
print(num_training_steps)

1638
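
The step count and the shape of the linear schedule can be checked by hand. The sketch below uses the dataset size (4361) and batch size (8) from the cells above; `linear_lr` is a hypothetical helper that mirrors a linear decay with optional warmup, not the library's own implementation.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = math.ceil(4361 / 8)  # the last batch of each epoch is smaller
total_steps = 3 * steps_per_epoch

print(steps_per_epoch, total_steps)
for step in (0, 500, 1000, 1500, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```

At steps 500, 1000, and 1500 this gives roughly 3.47e-05, 1.95e-05, and 4.21e-06, which lines up with the learning_rate values logged during the earlier Trainer run.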


## The training loop

In [76]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cpu')

In [77]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)



KeyboardInterrupt: 