<a href="https://colab.research.google.com/github/Madhan-sukumar/NLP/blob/main/Huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook includes

1. Sentiment Analysis using Hugging Face
2. Text Classification of German language using Hugging Face

In [44]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%pip install transformers

In [4]:
from transformers import pipeline
import torch
import torch.nn.functional as F

# SENTIMENT ANALYSIS

## Creating Classifier Pipeline

Pipelines are an easy way to use models for inference. A pipeline is an object that hides most of the library's complex code behind a simple API dedicated to a single task, such as Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction, or Question Answering.

https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html?highlight=pipelines

In [5]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [6]:
# testing the classifier on an example sentence
res = classifier("I am soo happy to learn about Huggingface")

In [7]:
print(res)

[{'label': 'POSITIVE', 'score': 0.999500036239624}]


The result above shows that the classifier labels the given text as positive, with a score of about 0.9995.

In [10]:
# Running the Classifier on multiple statements
results = classifier(['I am soo happy to learn about Huggingface',
                      'i liked it',
                      'sometimes i feel dizzy'
                      ])

In [9]:
for result in results:
  print(result)

{'label': 'POSITIVE', 'score': 0.999500036239624}
{'label': 'POSITIVE', 'score': 0.9998264908790588}
{'label': 'NEGATIVE', 'score': 0.997977077960968}


# Running on Sentiment Model

In [13]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

This model is a fine-tuned checkpoint of DistilBERT-base-uncased, fine-tuned on the SST-2 (Stanford Sentiment Treebank) dataset. It reaches an accuracy of 91.3 on the dev set (for comparison, the bert-base-uncased version of BERT reaches an accuracy of 92.7).

- Developed by: Hugging Face
- Model Type: Text Classification

In [11]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_name)        # loads the pre-trained tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(model_name)

This line loads the pre-trained neural network for sentiment analysis. It is a sequence classification model, which means it assigns a given sequence of text to one of several possible categories. This SST-2 checkpoint has two categories, POSITIVE and NEGATIVE, as seen in the pipeline output above.

In [15]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [58]:
# applying the same statements to the classifier
x_train = ['I am soo happy to learn about Huggingface',
           'i liked it',
           'sometimes i feel dizzy'
           ]

In [59]:
# encode the texts as PyTorch tensors ('pt'); shorter texts are padded to the longest one
# and anything beyond max_length tokens is truncated
batch = tokenizer(x_train, return_tensors='pt', padding=True, truncation=True, max_length=512)

In [60]:
print(batch)

{'input_ids': tensor([[    3,   103,   235,   181, 26910,   312, 26773,  6187,   569,  7074,
           281,  8596,  9115,  8312,  8716,   950,     4],
        [    3,    46, 17491,   772, 26904, 23568,     4,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0],
        [    3,   181,  3084, 14626,    46,  6224,    77,  4616, 15106, 26951,
             4,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}
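
The zeros at the end of the shorter rows in `input_ids`, and the matching zeros in `attention_mask`, come from padding. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation. The helper `pad_batch`, the shortened id sequences, and the choice of `PAD_ID = 0` (read off the printed batch) are assumptions made for illustration.

```python
# Toy illustration of padding and attention masks (no tokenizer required).
PAD_ID = 0  # assumption: the padding id, taken from the zeros in the printed batch


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical, shortened id sequences of different lengths.
toy = pad_batch([[3, 103, 235, 181, 4], [3, 46, 4], [3, 181, 3084, 4]])
for ids, mask in zip(toy["input_ids"], toy["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.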


In [61]:
# feed the encoded batch to the model; torch.no_grad() turns off gradient tracking, which inference does not need
with torch.no_grad():
  outputs = model(**batch)
  print(outputs)


SequenceClassifierOutput(loss=None, logits=tensor([[-0.8231,  3.1737, -3.2101],
        [ 5.0591, -0.6693, -5.5662],
        [ 2.5611, -0.1251, -2.5206]]), hidden_states=None, attentions=None)


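The ** syntax unpacks the `batch` dictionary and passes its contents as keyword arguments to the model. Note that the logits printed above have three columns, whereas the SST-2 checkpoint has only two classes; the recorded outputs in this section appear to come from a session in which a three-class model (such as the German model loaded later in this notebook) was active.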
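
As a minimal sketch of dictionary unpacking, the example below uses a hypothetical stand-in function, `toy_model`, rather than the real model. It only reports which keyword arguments it received, so the two call styles can be compared directly.

```python
def toy_model(input_ids, attention_mask, token_type_ids=None):
    """Hypothetical stand-in for a model call; returns the names of the arguments it got."""
    received = {"input_ids": input_ids, "attention_mask": attention_mask}
    if token_type_ids is not None:
        received["token_type_ids"] = token_type_ids
    return sorted(received)


toy_batch = {"input_ids": [[3, 46, 4]], "attention_mask": [[1, 1, 1]]}

# Spelling out each keyword by hand ...
explicit = toy_model(input_ids=toy_batch["input_ids"],
                     attention_mask=toy_batch["attention_mask"])
# ... is equivalent to unpacking the dictionary with **.
unpacked = toy_model(**toy_batch)

print(explicit == unpacked)
```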

In [62]:
# applying softmax and argmax
scores = F.softmax(outputs.logits, dim=1)
print(scores)
labels = torch.argmax(scores, dim=1)
print(labels)

# converting class indices to the actual label names
labels = [model.config.id2label[label] for label in labels.tolist()]
print(labels)


tensor([[1.8013e-02, 9.8033e-01, 1.6555e-03],
        [9.9673e-01, 3.2414e-03, 2.4213e-05],
        [9.3080e-01, 6.3422e-02, 5.7793e-03]])
tensor([1, 0, 0])
['negative', 'positive', 'positive']


Softmax converts a vector of real values (the logits) into probabilities that sum to 1, while argmax returns the index of the largest value. They are often used together in classification problems: softmax gives a probability distribution over the classes, and argmax selects the class with the highest probability.
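
A small pure-Python sketch of both operations, applied to the first row of logits printed above. The helper names `softmax` and `argmax` are written here for illustration and are not the PyTorch functions used in the notebook.

```python
import math


def softmax(logits):
    """Exponentiate and normalise so the values sum to 1."""
    top = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [-0.8231, 3.1737, -3.2101]  # first row of the logits printed above
probs = softmax(logits)
best = argmax(probs)
print([round(p, 4) for p in probs], best)
```

Because softmax preserves the ordering of the logits, taking argmax of the raw logits selects the same class; softmax is only needed when the probabilities themselves are of interest.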






# Saving the model

In [55]:
save_directory = "/content/drive/MyDrive/Colab Notebooks/saved"   # absolute path, so the files land on the mounted Drive
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# Loading the model

In [56]:
tokenizer = AutoTokenizer.from_pretrained(save_directory)        # loads the saved tokenizer
model = AutoModelForSequenceClassification.from_pretrained(save_directory)

# TEXT CLASSIFICATION OF GERMAN LANGUAGE

This model was trained for sentiment classification of German language texts.

https://huggingface.co/oliverguhr/german-sentiment-bert

In [64]:
model_name = 'oliverguhr/german-sentiment-bert'
tokenizer = AutoTokenizer.from_pretrained(model_name)       
model_german = AutoModelForSequenceClassification.from_pretrained(model_name)

In [65]:
# equivalent English sentences
# I love spending time with my family
# The food in this restaurant tastes fantastic
# I lost my wallet and am now totally stressed
# The hotel where I stayed was dirty and noisy
# I am looking forward to the upcoming vacation

german = ['Ich liebe es, Zeit mit meiner Familie zu verbringen',
          'Das Essen in diesem Restaurant schmeckt fantastisch',
          'Ich habe meinen Geldbeutel verloren und bin jetzt total gestresst',
          'Das Hotel, in dem ich übernachtet habe, war schmutzig und laut',
          'Ich freue mich auf den bevorstehenden Urlaub']

In [72]:
encoded = tokenizer(german, return_tensors='pt', padding=True, truncation=True, max_length=512)
print(encoded)

# feed the encoded batch to the model; torch.no_grad() turns off gradient tracking, which inference does not need
with torch.no_grad():
  outputs = model_german(**encoded)
  print(outputs)


{'input_ids': tensor([[    3,  1671, 16619,   229, 26918,   417,   114, 10183,  1786,    81,
         18206,     4,     0,     0,     0,     0,     0,     0,     0],
        [    3,   295,  6346,    50,   798,  8533, 23371,  1447, 20568,  4053,
           514,    85,     4,     0,     0,     0,     0,     0,     0],
        [    3,  1671,   555,  9685, 21850,   848, 26907,  3864,    42,  4058,
          1868, 22471,  1306,  9673, 26901,     4,     0,     0,     0],
        [    3,   295,  5593, 26918,    50,   128,  1169,   204, 11808,    75,
           555, 26918,   185,  8653,   452,    80,    42,  3696,     4],
        [    3,  1671, 24669,  1790,  3277,   115,    86, 26750,  6352,     4,
             0,     0,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     

In [70]:
# applying softmax and argmax
scores = F.softmax(outputs.logits, dim=1)
print(scores)
labels = torch.argmax(scores,dim=1)
print(labels)


tensor([[9.8782e-01, 1.1395e-02, 7.8062e-04],
        [9.9849e-01, 1.4727e-03, 3.8426e-05],
        [5.7672e-03, 9.9214e-01, 2.0893e-03],
        [1.4683e-03, 9.9849e-01, 3.7488e-05],
        [9.1960e-02, 5.9161e-02, 8.4888e-01]])
tensor([0, 0, 1, 1, 2])


In [71]:
# converting class indices to the actual label names
labels = [model_german.config.id2label[label] for label in labels.tolist()]
print(labels)

['positive', 'positive', 'negative', 'negative', 'neutral']
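
The lookup through `id2label` is a plain dictionary mapping from class index to label name. As a rough sketch, the mapping below is inferred from the printed indices and labels above rather than read from the model's configuration, so treat it as an assumption.

```python
# Assumed mapping, inferred from the printed indices [0, 0, 1, 1, 2] and their label names.
id2label = {0: "positive", 1: "negative", 2: "neutral"}

predicted_ids = [0, 0, 1, 1, 2]  # the argmax output printed above
names = [id2label[i] for i in predicted_ids]
print(names)

# The reverse mapping is handy for going from a label name back to its index.
label2id = {name: idx for idx, name in id2label.items()}
print(label2id["neutral"])
```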
