## <Font color = 'pickle'>**Install/Load Libraries**

These commands install two Python libraries using pip, Python's package manager. The `!` at the beginning of each command tells the Jupyter notebook to run the line as a shell command. The `-qq` flag makes the installation quieter, i.e., it produces less output.

1. `!pip install transformers -qq`: This command installs the `transformers` library, a Python package developed by Hugging Face. The `transformers` library provides general-purpose architectures (like BERT, GPT-2, RoBERTa, XLM, DistilBert, etc.) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with thousands of pre-trained models.

2. `!pip install sentencepiece -qq`: This command installs the `sentencepiece` library, a text tokenizer/detokenizer developed by Google. It provides a way to break up text into manageable pieces, which is a crucial first step in many NLP tasks.
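
To confirm that both packages are importable after installation, a small check like the following can be run. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two libraries installed above.
for name in ("transformers", "sentencepiece"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, rerunning the `!pip install` cell (without `-qq`, to see the full output) usually shows why.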




In [2]:
# Install necessary Python libraries:
!pip install transformers -qq
!pip install sentencepiece -qq

In [3]:
from transformers import pipeline

The line of code `from transformers import pipeline` is used to import the `pipeline` function from the `transformers` library.

- The `pipeline` function is a high-level, easy-to-use API for running inference with pre-trained models. It abstracts away the underlying steps (tokenization, model execution, and post-processing) and lets you use a model with a single line of code.

- Depending on the task, different pipelines are available, such as `text-classification`, `question-answering`, `ner` (named entity recognition), etc.



# <font color = 'pickle'> **NLP Applications**

## <font color = 'pickle'> **Sequence Classification**

Sequence Classification refers to the task in Natural Language Processing (NLP) where an algorithm takes a sequence of words (or tokens) as input and outputs a category or class. Examples of sequence classification tasks include sentiment analysis (classifying a text as positive, negative, or neutral) and spam detection (classifying an email as spam or not spam).



### <font color = 'pickle'> **Sentiment Analysis**

#### What is Sentiment Analysis?
Sentiment Analysis is a sub-field of Natural Language Processing (NLP) that identifies and categorizes the sentiment expressed in a piece of text. The goal is to determine whether the writer's attitude towards a particular topic, product, or service is positive, negative, or neutral.

#### Why is Sentiment Analysis Important?
- **Customer Insight**: It helps businesses understand how customers perceive their products or services, providing valuable insights for improvement.
- **Brand Monitoring**: It allows businesses to monitor brand and product sentiment in real-time on social media, helping them respond to trends and public reactions quickly.
- **Product Development**: Analysis of customer reviews can guide product and service development by prioritizing features or areas that are important to customers.
- **Political Analysis**: Sentiment analysis can also be used to understand political sentiment, public opinion on social issues, or responses to government policies.

#### <font color = 'pickle'> **Create pipeline for Sentiment Analysis**

In [4]:
# This line of code creates a sentiment analysis pipeline using a pre-trained model from the transformers library.
# The sentiment_classifier object can be used to classify the sentiment of input text.
sentiment_classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
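
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, assuming the model name and revision hash printed in the log message; the function name `build_pinned_sentiment_classifier` is hypothetical, and calling it downloads the model if it is not already cached.

```python
# Model name and revision copied from the log message above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_pinned_sentiment_classifier():
    """Create a sentiment pipeline with an explicit model and revision."""
    # Imported inside the function so this sketch can be loaded
    # even in an environment where transformers is not installed.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )
```

Pinning both values means the notebook keeps using the same weights even if the default model for the task changes in a later release of the library.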




#### <font color = 'pickle'> **Apply pipeline on examples**

In [5]:
# Applying the sentiment analysis pipeline on an example text.
# The pipeline will predict the sentiment (positive or negative) of the input text.
sentiment_classifier(" movie is terribly exciting")

[{'label': 'POSITIVE', 'score': 0.9998461008071899}]

In [6]:
# applying pipeline to list of sentences
sentiment_classifier([" this is very intersting movie",
           "  movie is terribly exciting",
           "movie was moving very slowly"])

[{'label': 'POSITIVE', 'score': 0.9761605262756348},
 {'label': 'POSITIVE', 'score': 0.9998461008071899},
 {'label': 'NEGATIVE', 'score': 0.9984513521194458}]
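
The pipeline returns one dictionary per input sentence, in the same order as the inputs. The sketch below pairs each sentence with its prediction, using the output recorded above rather than calling the model; the helper name `pair_with_predictions` is made up for illustration.

```python
sentences = [" this is very intersting movie",
             "  movie is terribly exciting",
             "movie was moving very slowly"]

# Predictions copied from the recorded output above.
predictions = [
    {'label': 'POSITIVE', 'score': 0.9761605262756348},
    {'label': 'POSITIVE', 'score': 0.9998461008071899},
    {'label': 'NEGATIVE', 'score': 0.9984513521194458},
]


def pair_with_predictions(sentences, predictions):
    """Return (text, label, score) tuples, one per input sentence."""
    return [
        (text.strip(), pred['label'], pred['score'])
        for text, pred in zip(sentences, predictions)
    ]


rows = pair_with_predictions(sentences, predictions)
for text, label, score in rows:
    print(f"{label:<8} {score:.3f}  {text}")
```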

#### <font color = 'pickle'> **Use GPUs for faster inference**</font>

When performing inference with complex machine learning models, such as those used in Natural Language Processing (NLP), utilizing Graphics Processing Units (GPUs) can significantly speed up the process. GPUs offer several advantages for faster inference:

1. **Parallel Processing**: GPUs are specifically designed to handle thousands of computations simultaneously. This parallel processing capability is ideal for executing the matrix operations involved in deep learning models. Unlike traditional Central Processing Units (CPUs), which excel at sequential processing, GPUs shine when it comes to parallel processing, making them highly efficient for neural network computations.

2. **Matrix Operations**: Deep learning models, especially transformer-based models like BERT and GPT, heavily rely on matrix operations. GPUs are equipped with specialized hardware to accelerate these matrix computations, leading to substantial speedups during inference.

By harnessing the computational power of GPUs, we can accelerate the sentiment analysis pipeline's inference process. This allows for quicker and more efficient sentiment predictions, making it particularly beneficial when dealing with larger datasets or more complex models.


In [7]:
# Import the torch library. PyTorch is a Python library for deep learning.
import torch

# Check if a CUDA-enabled GPU is available for PyTorch.
# This can speed up neural network computations.
torch.cuda.is_available()

True

In [8]:
# Get the ID of the current CUDA device that PyTorch is using.
torch.cuda.current_device()

0

In [9]:
# The 'device=0' argument specifies that the pipeline should use the first CUDA-enabled GPU.
classifier = pipeline("sentiment-analysis", device = 0)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
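
Passing `device = 0` assumes that a GPU is present. A hedged sketch of a fallback that uses the CPU when no CUDA device is available; `pick_device` and `build_classifier` are illustrative names, and the sketch assumes that `device=-1` selects the CPU, which is the convention `pipeline` uses for integer device ids.

```python
def pick_device(cuda_available: bool) -> int:
    """Return the integer `device` argument for `pipeline`:
    0 for the first GPU, -1 for the CPU."""
    return 0 if cuda_available else -1


def build_classifier():
    """Create the sentiment pipeline on the GPU if possible, else on the CPU."""
    # Imported here so the helper above can be used without torch installed.
    import torch
    from transformers import pipeline

    device = pick_device(torch.cuda.is_available())
    return pipeline("sentiment-analysis", device=device)
```

With this pattern the same notebook runs unchanged on machines with and without a GPU; only the speed differs.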


In [10]:
classifier([" this is very intersting movie",
           "  movie is terribly exciting",
           "movie was moving very slowly"])

[{'label': 'POSITIVE', 'score': 0.9761606454849243},
 {'label': 'POSITIVE', 'score': 0.9998461008071899},
 {'label': 'NEGATIVE', 'score': 0.9984513521194458}]

#### <font color = 'pickle'> **Check the model used by pipeline**

In [11]:
classifier.model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.31.0",
  "vocab_size": 30522
}
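
The `id2label` entry in this configuration is what turns the model's raw outputs into the `POSITIVE`/`NEGATIVE` labels seen earlier. The sketch below shows roughly how that mapping works: a softmax over the two output logits, followed by a lookup of the highest-scoring index. The logit values are invented for illustration and do not come from the model.

```python
import math

# Mapping copied from the "id2label" entry of the config above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Return a dict shaped like the pipeline output for one input."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


example_logits = [-4.0, 4.5]  # hypothetical logits, not real model output
print(to_prediction(example_logits))
```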

### <font color = 'pickle'> **Emotion Classification**
Emotion Classification is a task in Natural Language Processing (NLP) where the goal is to categorize a piece of text based on the emotional tone it conveys. This goes beyond basic sentiment analysis (which typically involves determining whether a text is positive, negative, or neutral) by identifying specific emotions that the text might be expressing.

In this notebook, the emotion classification task involves assigning a piece of text to one of the following 28 categories:

- admiration
- amusement
- anger
- annoyance
- approval
- caring
- confusion
- curiosity
- desire
- disappointment
- disapproval
- disgust
- embarrassment
- excitement
- fear
- gratitude
- grief
- joy
- love
- nervousness
- optimism
- pride
- realization
- relief
- remorse
- sadness
- surprise
- neutral

Each of these labels represents a different emotion. For example, if a text is classified as "joy", it means that the text expresses a feeling of joy or happiness. The "neutral" label is typically used for texts that don't express any particular emotion.

By classifying texts based on their emotional content, we can gain a deeper understanding of the underlying sentiments, attitudes, and opinions expressed in the text. This can be especially useful in fields like social media analysis, customer feedback analysis, and psychology.

#### <font color = 'pickle'> **Create pipeline for Emotion Classification**

The code below creates an NLP pipeline for text classification using the `pipeline` function from the `transformers` library.

- Here, the task is specified as `'text-classification'`, which means that the pipeline is set up to classify input text into one or more categories.

- The `model` argument is specified as `'arpanghoshal/EmoRoBERTa'`. This means that the pipeline will use a pre-trained model called `EmoRoBERTa`, which is hosted on Hugging Face's model hub and has been uploaded by a user named `arpanghoshal`.

- In this case, `EmoRoBERTa` is presumably a model that has been fine-tuned for emotion classification.

Many more models are available on the Hugging Face model hub: https://huggingface.co/models


In [12]:
emotion_classifier = pipeline(task = 'text-classification', model = "arpanghoshal/EmoRoBERTa" )


All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at arpanghoshal/EmoRoBERTa.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.



In [13]:
emotion_classifier(["this is very intersting movie",
           "movie is terribly exciting",
           "movie was moving very slowly"])

[{'label': 'nervousness', 'score': 0.8732843399047852},
 {'label': 'excitement', 'score': 0.987686038017273},
 {'label': 'realization', 'score': 0.7034745812416077}]

### <font color = 'pickle'> **Zero-shot Classification**
- Zero-shot classification in NLP refers to a model's ability to classify text into categories it was not explicitly trained on. The model relies on its general understanding of language semantics to make these classifications.

- The task is typically framed as natural language inference: the model is given a sentence and a candidate class label, and it must determine whether the label is entailed by (logically follows from) the sentence. Repeating this for several candidate labels and comparing the entailment scores yields a classification without task-specific training.

- This approach is useful for sequence classification when no pre-trained model has labels that match your task.
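
The entailment framing described above can be sketched in plain Python. Each candidate label is turned into a hypothesis sentence, an entailment model scores every (text, hypothesis) pair, and the scores are normalized across labels. The template string is assumed here to be `"This example is {}."`, and the entailment logits are invented for illustration; no model is called.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence."""
    return [template.format(label) for label in labels]


def normalize(entailment_logits):
    """Softmax over the per-label entailment logits."""
    largest = max(entailment_logits)
    exps = [math.exp(x - largest) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]


candidate_labels = ['food', 'service', 'ambience']
hypotheses = build_hypotheses(candidate_labels)

# Hypothetical entailment logits, one per hypothesis (not real model output).
entailment_logits = [2.1, -0.3, 0.4]
scores = normalize(entailment_logits)

# Rank labels from most to least likely.
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the labels only appear inside the hypothesis text, they can be changed at inference time without retraining anything.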








In [14]:
# Create a Zero-shot Classification pipeline using the pipeline function from the transformers library.
# The framework argument is set to 'pt', specifying that the pipeline should use the PyTorch framework.
# The device argument is set to 0, indicating that the pipeline will use the first CUDA-enabled GPU if available.
zero_shot_classifier = pipeline("zero-shot-classification",
                      framework='pt',
                     device = 0)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [15]:
# List of reviews for zero-shot classification.
Reviews= ["I would've preferred a more perfect balance since you could barely taste the red peppers and caramelized onions.",
          "The hostess was non responsive when we asked for a table for two.",
          "Don't let the location or aesthetics of this place fool you",
          "I was surprised by the low reviews."]

# List of candidate labels for zero-shot classification.
candidate_labels = ['food', 'service', 'ambience']

In [16]:
# Applying the Zero-shot Classification pipeline on the reviews with the candidate labels.
# The pipeline will determine the best-matched label for each review.
output = zero_shot_classifier(Reviews, candidate_labels)
output

[{'sequence': "I would've preferred a more perfect balance since you could barely taste the red peppers and caramelized onions.",
  'labels': ['food', 'ambience', 'service'],
  'scores': [0.6135451793670654, 0.2861732244491577, 0.10028164088726044]},
 {'sequence': 'The hostess was non responsive when we asked for a table for two.',
  'labels': ['service', 'ambience', 'food'],
  'scores': [0.8258006572723389, 0.11019383370876312, 0.0640055239200592]},
 {'sequence': "Don't let the location or aesthetics of this place fool you",
  'labels': ['ambience', 'service', 'food'],
  'scores': [0.8303676843643188, 0.12712791562080383, 0.04250437393784523]},
 {'sequence': 'I was surprised by the low reviews.',
  'labels': ['ambience', 'service', 'food'],
  'scores': [0.6036509871482849, 0.27215495705604553, 0.12419410794973373]}]

In [17]:
# Get the final label for each review (label with the highest score)
import numpy as np
final_labels = [review['labels'][np.argmax(review['scores'])] for review in output]

# Print the final labels
print(final_labels)

['food', 'service', 'ambience', 'ambience']
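
In the output recorded above, each review's `labels` list is already sorted by descending score, so the first label is the top prediction. The sketch below checks that ordering on the recorded values and picks the first label directly; it uses a copy of the output (named `recorded_output` here for illustration) rather than calling the model.

```python
# Labels and scores copied from the recorded pipeline output above.
recorded_output = [
    {'labels': ['food', 'ambience', 'service'],
     'scores': [0.6135451793670654, 0.2861732244491577, 0.10028164088726044]},
    {'labels': ['service', 'ambience', 'food'],
     'scores': [0.8258006572723389, 0.11019383370876312, 0.0640055239200592]},
    {'labels': ['ambience', 'service', 'food'],
     'scores': [0.8303676843643188, 0.12712791562080383, 0.04250437393784523]},
    {'labels': ['ambience', 'service', 'food'],
     'scores': [0.6036509871482849, 0.27215495705604553, 0.12419410794973373]},
]

# Confirm that scores are in descending order for every review.
is_sorted = all(
    review['scores'] == sorted(review['scores'], reverse=True)
    for review in recorded_output
)

# With sorted scores, the top label is simply the first one.
top_labels = [review['labels'][0] for review in recorded_output]
print(is_sorted, top_labels)
```

The `np.argmax` version gives the same answer and does not depend on the ordering, so it remains the safer choice if the ordering is ever in doubt.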


## <font color = 'pickle'> **Token Classification**

Token Classification is a natural language processing task where the goal is to classify individual tokens (words or subword units) in a given text into specific categories or labels. Unlike sequence classification, which assigns a single label to the entire input sequence, token classification performs labeling at the token level.




### <font color='pickle'> **Named Entity Recognition (NER)**</font>

**What is Named Entity Recognition (NER)?**

Named Entity Recognition (NER) is a fundamental natural language processing task where the goal is to identify and classify named entities in a given text into predefined categories. Named entities are specific objects, people, locations, dates, organizations, or any other entities that have a proper name.

NER involves locating the boundaries of named entities in the text and assigning a label to each entity to indicate its type. Common entity types include PERSON (names of people), LOCATION (names of places), ORGANIZATION (names of companies or organizations), DATE (specific dates), and more.

For example, given the sentence: "Apple Inc. was founded by Steve Jobs and Steve Wozniak on April 1, 1976, in California," NER might identify and classify the named entities as follows:
- "Apple Inc." -> ORGANIZATION
- "Steve Jobs" -> PERSON
- "Steve Wozniak" -> PERSON
- "April 1, 1976" -> DATE
- "California" -> LOCATION

**Why is Named Entity Recognition (NER) Important?**

NER plays a crucial role in various natural language processing applications due to its ability to extract and classify specific information from unstructured text. Here are some reasons why NER is important:

1. **Information Extraction**: NER helps extract structured information from unstructured text, such as identifying names of people, organizations, and locations, which is valuable for knowledge extraction.

2. **Question Answering**: NER is used in question-answering systems to identify relevant entities that provide answers to specific questions.

3. **Aspect-based Sentiment Analysis**: Recognizing named entities in sentiment analysis enables understanding sentiments towards specific entities like products or brands.

4. **Language Understanding**: Identifying named entities enhances the overall understanding of the language and context in various NLP tasks.

5. **Data Analysis**: In various domains like finance and healthcare, NER aids in analyzing data by identifying important entities and their relationships.

NER helps transform unstructured text into structured data, enabling deeper analysis and interpretation of textual information, making it a fundamental component in many NLP pipelines.



In [18]:
# aggregation_strategy='simple' merges adjacent tokens that belong to the same entity into a single entity group.
ner_pipeline = pipeline('ner', device=0, aggregation_strategy = "simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [19]:
ner_pipeline('University of Texas at Dallas is a public university based in Richardson, Texas.')

[{'entity_group': 'ORG',
  'score': 0.99716234,
  'word': 'University of Texas at Dallas',
  'start': 0,
  'end': 29},
 {'entity_group': 'LOC',
  'score': 0.98807263,
  'word': 'Richardson',
  'start': 62,
  'end': 72},
 {'entity_group': 'LOC',
  'score': 0.9949463,
  'word': 'Texas',
  'start': 74,
  'end': 79}]
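
The `start` and `end` values are character offsets into the input string, so each entity can be recovered by slicing. The sketch below groups the recorded entities by type using those offsets; `group_by_type` is an illustrative helper, and the entity list is copied from the output above (scores omitted).

```python
from collections import defaultdict

sentence = 'University of Texas at Dallas is a public university based in Richardson, Texas.'

# Entities copied from the recorded pipeline output above (scores omitted).
entities = [
    {'entity_group': 'ORG', 'word': 'University of Texas at Dallas', 'start': 0, 'end': 29},
    {'entity_group': 'LOC', 'word': 'Richardson', 'start': 62, 'end': 72},
    {'entity_group': 'LOC', 'word': 'Texas', 'start': 74, 'end': 79},
]


def group_by_type(sentence, entities):
    """Map each entity type to the text spans found for it."""
    grouped = defaultdict(list)
    for entity in entities:
        span = sentence[entity['start']:entity['end']]
        grouped[entity['entity_group']].append(span)
    return dict(grouped)


grouped = group_by_type(sentence, entities)
print(grouped)
```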


In the **"simple" aggregation strategy** for Named Entity Recognition (NER), entities are grouped following the default schema, which involves using the tags "B-TAG" (beginning of an entity) and "I-TAG" (inside an entity). The entities are then combined based on these tags to form the final output.

Example of Entity Grouping using "B-TAG" and "I-TAG":
Consider the text: "University of Texas at Dallas is a public university based in Richardson, Texas."

In NER, the entities in this sentence might be tagged as follows:

- "University" -> B-ORG (Beginning of an organization entity)
- "of" -> I-ORG (Inside an organization entity)
- "Texas" -> I-ORG (Inside an organization entity)
- "at" -> I-ORG (Inside an organization entity)
- "Dallas" -> I-ORG (Inside an organization entity)

Using the "simple" aggregation strategy, a token tagged "B-ORG" and the "I-ORG" tokens that immediately follow it are merged into a single entity:

Entity Group: "University of Texas at Dallas"
Entity Type: ORGANIZATION
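
The grouping described above can be sketched in a few lines of plain Python. This is a simplified illustration that works on whole words with hand-written tags; the actual pipeline operates on subword tokens and also combines scores, which is not shown here. The function name `group_bio` is made up for this example.

```python
def group_bio(tokens, tags):
    """Merge B-/I- tagged tokens into entity groups; 'O' tokens are skipped."""
    entities = []
    words, ent_type = [], None

    def flush():
        """Store the entity collected so far, if any, and reset the buffer."""
        nonlocal words, ent_type
        if words:
            entities.append({"entity_group": ent_type, "word": " ".join(words)})
        words, ent_type = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, tag_type = tag.split("-", 1)
        # A B- tag, or a change of entity type, starts a new entity.
        if prefix == "B" or tag_type != ent_type:
            flush()
        words.append(token)
        ent_type = tag_type
    flush()
    return entities


tokens = ["University", "of", "Texas", "at", "Dallas", "is", "a", "public",
          "university", "based", "in", "Richardson", ",", "Texas", "."]
# Hand-written tags for illustration.
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "O",
        "O", "O", "O", "B-LOC", "O", "B-LOC", "O"]

grouped_entities = group_bio(tokens, tags)
for entity in grouped_entities:
    print(entity)
```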

In [20]:
text = """
Oil prices rose early on Wednesday, driven by brighter economic prospects for the United States and continued recovery in oil demand in America and elsewhere in the world.
As of 9:04 a.m. EDT on Wednesday, ahead of the weekly inventory report by the U.S. Energy Information Administration (EIA), WTI Crude was up 1.04 percent at $73.61,
and Brent Crude traded at $75.54, up by 0.99 percent on the day.Prices found support late on Tuesday after the American Petroleum Institute (API)
reported a draw in crude oil inventories of 7.199 million barrels for the week ending June 18. If the EIA confirms a draw today, it would be the fifth consecutive week of crude inventory draws in the United States, where demand for fuels continues to grow.
"""

In [21]:
ner_pipeline(text)

[{'entity_group': 'LOC',
  'score': 0.9996959,
  'word': 'United States',
  'start': 83,
  'end': 96},
 {'entity_group': 'LOC',
  'score': 0.9997377,
  'word': 'America',
  'start': 137,
  'end': 144},
 {'entity_group': 'ORG',
  'score': 0.99871325,
  'word': 'U',
  'start': 251,
  'end': 252},
 {'entity_group': 'ORG',
  'score': 0.99815315,
  'word': 'S',
  'start': 253,
  'end': 254},
 {'entity_group': 'ORG',
  'score': 0.99941,
  'word': 'Energy Information Administration',
  'start': 256,
  'end': 289},
 {'entity_group': 'ORG',
  'score': 0.99916553,
  'word': 'EIA',
  'start': 291,
  'end': 294},
 {'entity_group': 'ORG',
  'score': 0.98631895,
  'word': 'WTI Crude',
  'start': 297,
  'end': 306},
 {'entity_group': 'ORG',
  'score': 0.9900016,
  'word': 'Brent Crude',
  'start': 342,
  'end': 353},
 {'entity_group': 'ORG',
  'score': 0.9992612,
  'word': 'American Petroleum Institute',
  'start': 449,
  'end': 477},
 {'entity_group': 'ORG',
  'score': 0.9991831,
  'word': 'API',
  

### <font color = 'pickle'> **Part of Speech Tagging**</font>

Part-of-Speech (POS) Tagging is a natural language processing task where each word in a given text is assigned a specific grammatical label based on its role and function in a sentence. The purpose of POS tagging is to analyze and categorize words into their respective parts of speech, which include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and more.

Here's an example of POS tagging for the sentence: "She runs quickly."

- "She" -> PRON (pronoun)
- "runs" -> VERB (verb)
- "quickly" -> ADV (adverb)

In this example, each word is assigned a POS tag based on its role in the sentence. "She" is recognized as a pronoun, "runs" as a verb, and "quickly" as an adverb.

POS tagging is useful for various natural language processing tasks, including:

1. **Named Entity Recognition (NER)**: POS tagging can be used as a back-off method to aid in Named Entity Recognition. For example, recognizing proper nouns (identified through POS tags) can help find named entities like names of people, places, or organizations.

2. **Machine Translation**: POS tags are utilized in language translation to preserve grammatical structure during the translation process.

3. **Text-to-Speech (TTS)**: POS tags assist in generating natural-sounding speech by providing information about pronunciation and intonation. For example, “lead” is pronounced differently as a noun (the metal) than as a verb (to guide).

4. **Regular Expressions and Phrase Identification**: POS tags provide a compact representation of the syntactic categories of words in a sentence. This allows researchers and developers to design regular expressions or patterns over POS tags to identify specific syntactic structures, such as noun phrases, verb phrases, or adjective phrases. This process simplifies the task of extracting structured information from the text.

POS tagging is an essential building block in many NLP pipelines and facilitates a deeper understanding of textual data by breaking down sentences into their grammatical components.
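
The idea of writing regular expressions over POS tags can be sketched as follows. The tagged sentence is hand-written using Penn Treebank style tags, and the helper `find_tag_patterns` and both patterns are illustrative assumptions, not part of any library. Each tag is wrapped as `<TAG>` so that a pattern always matches whole tags.

```python
import re

# A hand-tagged sentence: (word, POS tag) pairs.
tagged = [("My", "PRP$"), ("name", "NN"), ("is", "VBZ"), ("Sarah", "NNP"),
          ("and", "CC"), ("I", "PRP"), ("live", "VBP"), ("in", "IN"),
          ("London", "NNP")]

# Optional possessive pronoun or determiner, any adjectives, then one or more nouns.
NOUN_PHRASE = r"(?:<PRP\$>|<DT>)?(?:<JJ[RS]?>)*(?:<NN[PS]*>)+"
# One or more consecutive proper nouns.
PROPER_NOUNS = r"(?:<NNPS?>)+"


def find_tag_patterns(tagged, pattern):
    """Return the word sequences whose tag sequence matches `pattern`.

    `pattern` must be built from whole `<TAG>` units so that every
    match starts at the beginning of a tag.
    """
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Map the character offset where each tag starts to its token index.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # +2 for the surrounding '<' and '>'

    phrases = []
    for match in re.finditer(pattern, tag_string):
        first = offset_to_index[match.start()]
        count = match.group(0).count("<")  # number of tags matched
        phrases.append(" ".join(word for word, _ in tagged[first:first + count]))
    return phrases


print(find_tag_patterns(tagged, NOUN_PHRASE))
print(find_tag_patterns(tagged, PROPER_NOUNS))
```

The second pattern illustrates the NER back-off idea from the list above: runs of proper nouns are natural candidates for named entities.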

In [22]:
pos_tagger = pipeline(model="QCRI/bert-base-multilingual-cased-pos-english", aggregation_strategy="simple")


Some weights of the model checkpoint at QCRI/bert-base-multilingual-cased-pos-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [23]:
pos_tagger("My name is Sarah and I live in London")

[{'entity_group': 'PRP$',
  'score': 0.99948126,
  'word': 'My',
  'start': 0,
  'end': 2},
 {'entity_group': 'NN',
  'score': 0.99959725,
  'word': 'name',
  'start': 3,
  'end': 7},
 {'entity_group': 'VBZ',
  'score': 0.99956065,
  'word': 'is',
  'start': 8,
  'end': 10},
 {'entity_group': 'NNP',
  'score': 0.98474914,
  'word': 'Sarah',
  'start': 11,
  'end': 16},
 {'entity_group': 'CC',
  'score': 0.9995889,
  'word': 'and',
  'start': 17,
  'end': 20},
 {'entity_group': 'PRP',
  'score': 0.9994993,
  'word': 'I',
  'start': 21,
  'end': 22},
 {'entity_group': 'VBP',
  'score': 0.9979292,
  'word': 'live',
  'start': 23,
  'end': 27},
 {'entity_group': 'IN',
  'score': 0.9996567,
  'word': 'in',
  'start': 28,
  'end': 30},
 {'entity_group': 'NNP',
  'score': 0.9960855,
  'word': 'London',
  'start': 31,
  'end': 37}]

## <font color = 'pickle'> **Sequence-to-Sequence tasks**
Sequence-to-Sequence (Seq2Seq) tasks are a class of natural language processing (NLP) tasks where the goal is to transform an input sequence into an output sequence of potentially different length. The input and output sequences can be text, speech, or any other form of sequential data. Seq2Seq tasks are characterized by their ability to handle variable-length input and output data, making them suitable for tasks that involve translation, summarization, text generation, and more.

### <font color = 'pickle'> **Text Summarization**
**Text Summarization** is the process of condensing a longer piece of text into a shorter version while retaining its key information and main points. The goal is to provide a concise and coherent summary that captures the essence of the original text. Text summarization can be performed using two main approaches: abstractive summarization and extractive summarization.

- **Abstractive Summarization** is a text summarization technique that involves generating new sentences to represent the main ideas of the original text. In this approach, the model comprehends the input text and paraphrases it in a more concise and human-like manner. Abstractive summarization requires natural language generation capabilities and can produce more fluent and coherent summaries. However, it is a challenging task, as the model needs to understand the content and generate semantically correct sentences.

- **Extractive Summarization**, on the other hand, involves selecting and rearranging sentences or phrases directly from the original text to create a summary. Instead of generating new text, extractive summarization picks the most relevant and important sentences, typically based on various ranking or scoring methods. Extractive methods are often computationally simpler and can preserve the exact wording of the original text. However, they might not produce summaries that flow as smoothly as abstractive methods.
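
To make the extractive approach concrete, here is a toy sketch that scores each sentence by the average frequency of its words and keeps the highest-scoring sentences in their original order. This is one simple scoring method chosen for illustration, not the method used by any particular library; the stopword list, function names, and sample document are all made up for this example.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "that", "the", "to", "was"}


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def content_words(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Keep the `num_sentences` sentences with the highest average word frequency."""
    sentences = split_sentences(text)
    frequency = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)


document = ("The jail houses many inmates. "
            "Many inmates in the jail are mentally ill. "
            "The weather was pleasant that day. "
            "A judge says the jail cannot treat mentally ill inmates.")

toy_summary = extractive_summary(document, num_sentences=2)
print(toy_summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization.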









#### <font color = 'pickle'> **Abstractive Summarization**

In [24]:
sample_text= '''Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and
analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally
ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN)
 -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most
 severe mental illnesses are incarcerated until they're ready to appear in court. Most often, they face drug charges or charges
 of assaulting an officer --charges that Judge Steven Leifman says are usually "avoidable felonies." He says the arrests often
 result from confrontations with police. Mentally ill people often won't do what they're told when police arrive on the scene -
 - confrontation seems to exacerbate their illness and they become more paranoid, delusional, and less likely to follow directions,
  according to Leifman. So, they end up on the ninth floor severely mentally disturbed, but not getting any real help because they're
  in jail. We toured the jail with Leifman. He is well known in Miami as an advocate for justice and the mentally ill. Even though
  we were not exactly welcomed with open arms by the guards, we were given permission to shoot videotape and tour the floor.
  Go inside the 'forgotten floor' » . At first, it's hard to determine where the people are. The prisoners are wearing sleeveless
  robes. Imagine cutting holes for arms and feet in a heavy wool sleeping bag -- that's kind of what they look like. They're
  designed to keep the mentally ill patients from injuring themselves. That's also why they have no shoes, laces or mattresses.
  Leifman says about one-third of all people in Miami-Dade county jails are mentally ill. So, he says, the sheer volume is
  overwhelming the system, and the result is what we see on the ninth floor. Of course, it is a jail, so it's .'''

In [25]:
# Note: despite the variable name, this pipeline loads the T5 model 't5-large', not BART.
summarizer_bart = pipeline(task='summarization', model='t5-large', framework='pt', device=0)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [26]:
summary = summarizer_bart(sample_text)
summary

[{'summary_text': 'mentally ill inmates are housed on the ninth floor of a florida jail . most face drug charges or charges of assaulting an officer . judge says arrests often result from confrontations with police . he says about one-third of all people in Miami-dade county jails are mentally sick .'}]

### <font color = 'pickle'> **Question Answering**
- Abstractive QA
- Extractive QA

#### <font color = 'pickle'> **Extractive Question Answering**
- The model predicts the start and end indices of the answer span within the context
- Input: a context and question pair


In [27]:
# Create a Question Answering (QA) pipeline using a pre-trained model for question answering
qa_pipeline = pipeline('question-answering', framework='pt', device=0)


No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [28]:
Context = 'My favorie all time movie is Terminator. My favorite music is jazz'
Question = "Out of all the movies, which movie i like most"

In [29]:
qa_pipeline(context = Context, question = Question)

{'score': 0.9915904402732849, 'start': 29, 'end': 39, 'answer': 'Terminator'}
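The `start` and `end` values in the result are character offsets into the context string. The short sketch below, which only reuses the context and the output recorded above, shows that slicing the context with those offsets recovers the answer.

```python
# Context and pipeline output copied verbatim from the cells above.
context = 'My favorie all time movie is Terminator. My favorite music is jazz'
result = {'score': 0.9915904402732849, 'start': 29, 'end': 39, 'answer': 'Terminator'}

# 'start' and 'end' are character positions in the context,
# so a plain string slice returns the extracted answer span.
span = context[result['start']:result['end']]
print(span)  # Terminator
```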

In [30]:
Context = '''Alexander Graham Bell (/ˈɡreɪ.əm/, born Alexander Bell; March 3, 1847 – August 2, 1922)[4] was a Scottish-born[N 1] inventor,
scientist and engineer who is credited with patenting the first practical telephone. He also co-founded the American Telephone and Telegraph
Company (AT&T) in 1885.[7] Bell's father, grandfather, and brother had all been associated with work on elocution and speech, and both his mother
and wife were deaf; profoundly influencing Bell's life's work.[8] His research on hearing and speech further led him to experiment with hearing
devices which eventually culminated in Bell being awarded the first U.S. patent for the telephone, on March 7, 1876.[N 2] Bell considered his
invention an intrusion on his real work as a scientist and refused to have a telephone in his study.[9][N 3]
Many other inventions marked Bell's later life, including groundbreaking work in optical telecommunications, hydrofoils,
and aeronautics. Bell also had a strong influence on the National Geographic Society[11] and its magazine while serving as
the second president from January 7, 1898, until 1903.
Beyond his work in engineering, Bell had a deep interest in the emerging science of heredity.[12]'''

In [31]:
Question= 'When was Grahan born?'

In [32]:
qa_pipeline(context = Context, question = Question)

{'score': 0.953169584274292, 'start': 56, 'end': 69, 'answer': 'March 3, 1847'}

In [33]:
Question= 'What was his occupation?'

In [34]:
qa_pipeline(context = Context, question = Question)

{'score': 0.4196772277355194,
 'start': 126,
 'end': 148,
 'answer': 'scientist and engineer'}

In [35]:
Question = "What was Grahan Bell's known for?"
qa_pipeline(context = Context, question = Question)

{'score': 0.3618393540382385,
 'start': 170,
 'end': 209,
 'answer': 'patenting the first practical telephone'}

In [36]:
Question = "What was Grahan Bell's main invention?"
qa_pipeline(context = Context, question = Question)

{'score': 0.7727235555648804, 'start': 200, 'end': 209, 'answer': 'telephone'}

In [37]:
Question = "What else was Grahan Bell famous for?"
qa_pipeline(context = Context, question = Question)

{'score': 0.28598737716674805,
 'start': 1191,
 'end': 1199,
 'answer': 'heredity'}

In [38]:
Question= 'Where was Grahan Bell born ?'
qa_pipeline(context = Context, question = Question)

{'score': 0.96455979347229, 'start': 97, 'end': 105, 'answer': 'Scottish'}

In [39]:
Question= 'What is the recepie for cake?'
qa_pipeline(context = Context, question = Question)

{'score': 0.0021810883190482855,
 'start': 200,
 'end': 209,
 'answer': 'telephone'}
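An extractive model always returns some span, even when the context does not contain the answer, as the very low score above suggests. One possible way to handle this is to discard low-scoring answers. The sketch below assumes a hypothetical helper `answer_or_none` and an arbitrary threshold of 0.1; neither is part of the `transformers` API, and a suitable threshold would need to be chosen per application.

```python
from typing import Optional

def answer_or_none(result: dict, min_score: float = 0.1) -> Optional[str]:
    """Return the answer only when the score reaches min_score (hypothetical helper)."""
    return result['answer'] if result['score'] >= min_score else None

# Two outputs recorded in the cells above.
confident = {'score': 0.96455979347229, 'start': 97, 'end': 105, 'answer': 'Scottish'}
off_topic = {'score': 0.0021810883190482855, 'start': 200, 'end': 209, 'answer': 'telephone'}

print(answer_or_none(confident))  # Scottish
print(answer_or_none(off_topic))  # None
```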

### <font color = 'pickle'> **Machine Translation**

In [40]:
# Create a translation pipeline to translate English to French
translator_en_fr = pipeline("translation_en_to_fr")

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [41]:
translator_en_fr("Football is my favorite sport")

[{'translation_text': 'Le football est mon sport préféré'}]

In [42]:
# Create a translation pipeline to translate English to Hindi using the specified model
translator_en_hi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")




In [43]:
translator_en_hi("Football is my favorite sport")

[{'translation_text': 'फुटबाल मेरा पसंदीदा खेल है'}]

### <font color = 'pickle'> **Text2Text Generation Examples**

In [44]:
# Create a text-to-text generation pipeline for various language generation tasks
text2text = pipeline("text2text-generation")


No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
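The default `t5-base` model loaded here is steered by a text prefix at the start of the input rather than by a separate pipeline per task. The sketch below illustrates that convention with plain string handling. `T5_PREFIXES`, `build_prompt`, and `build_qa_prompt` are hypothetical helpers introduced only for illustration, and the prefix strings are the ones used in the cells of this section.

```python
# Task prefixes as they appear in the cells of this section.
T5_PREFIXES = {
    "translate_en_fr": "translate English to french: ",
    "summarize": "summarize: ",
    "sentiment": "sst2 sentence: ",
}

def build_prompt(task: str, text: str) -> str:
    """Prepend the task prefix to the input text (hypothetical helper)."""
    return T5_PREFIXES[task] + text

def build_qa_prompt(question: str, context: str) -> str:
    """Format a question and its context the way the QA example does (hypothetical helper)."""
    return f"question: {question} context: {context}"

print(build_prompt("sentiment", "New Zealand is a beautiful country"))
print(build_qa_prompt("Which is capital city of India?", "New Delhi is India's capital"))
```

Each resulting string could then be passed to the `text2text` pipeline in place of a hand-written prompt.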


#### <font color = 'pickle'> **Question Answer**

In [45]:
text2text("question: Which is capital city of India? context: New Delhi is India's capital")


[{'generated_text': 'New Delhi'}]

#### <font color = 'pickle'> **Translation**

In [46]:
text2text("translate English to french: New Delhi is India's capital")

[{'generated_text': "New Delhi est la capitale de l'Inde."}]

#### <font color = 'pickle'> **Summarization**

In [47]:
text2text("""summarize: Natural language processing (NLP) is a subfield of linguistics, computer science,
and artificial intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural language data.""")

[{'generated_text': 'natural language processing (NLP) is a subfield of linguistics, computer science'}]

#### <font color = 'pickle'> **Sentiment Analysis**

In [48]:
# Using the text-to-text generation pipeline to perform a task related to SST2 (Stanford Sentiment Treebank) dataset.
# The task is to analyze the sentiment of the sentence "New Zealand is a beautiful country."
text2text("sst2 sentence: New Zealand is a beautiful country")


[{'generated_text': 'positive'}]

#### <font color = 'pickle'> **Sentiment Span Extraction**

In [49]:
# Using the text-to-text generation pipeline to answer a question based on the provided context.
# The question is "positive," and the context is "New Zealand is a beautiful country."
text2text("question : positive context: New Zealand is a beautiful country.")


[{'generated_text': 'a beautiful country'}]

#### <font color = 'pickle'> **Question Generation**

In [50]:
text2text = pipeline("text2text-generation", model = "valhalla/t5-base-e2e-qg")
text2text("generate questions : New Delhi is India's capital.", num_beams=4, max_length = 8)


You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


[{'generated_text': "What city is India's capital"}]

#### <font color = 'pickle'> **English Tasks**

In [51]:
text2text_grammarly = pipeline('text2text-generation', model = "grammarly/coedit-large")


In [52]:
# Fix grammar
text2text_grammarly("Fix the grammar: My friend and I goes to the park yesterday.")

[{'generated_text': 'My friend and I went to the park yesterday.'}]

In [53]:
# Make sentences coherent
text2text_grammarly("make the text coherent: I love ice cream. It's too cold to eat outside.")

[{'generated_text': "I love ice cream, but it's too cold to eat outside."}]

In [54]:
# Paraphrase
text2text_grammarly("paraphrase: Despite the tough challenges, the team managed to finish the project on time.")

[{'generated_text': 'The team, despite the challenges, managed to finish the project on time.'}]

In [55]:
# Write formally
text2text_grammarly("rewrite formally: Hey, could you maybe get those reports done by tomorrow?")

[{'generated_text': 'Could you possibly get those reports done by tomorrow?'}]

In [56]:
# Easier to understand
text2text_grammarly("""make this easier to understand: Despite the fact that precipitation was in the forecast, we did not allow this meteorological
prediction to deter our plans for an outdoor picnic""")

[{'generated_text': 'We did not let this meteorological prediction stop us from having an outdoor picnic.'}]

## <font color = 'pickle'> **Language Modeling-Text Generation**

**Language Modeling** is a natural language processing task where a model is trained to predict the next word or sequence of words in a given context. The goal is to learn the probability distribution of words in a language and generate coherent and contextually relevant text.

- Language models are often used for various text generation tasks, such as completing sentences, generating dialogue, writing poetry, and even creating stories or articles.
- They form the foundation for many advanced NLP applications and are essential in generating human-like text.
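To make "learning the probability distribution of words" concrete, here is a toy bigram model that estimates next-word probabilities from counts. It is a deliberately simplified sketch on a made-up corpus, not how GPT-style models are trained; the corpus and the helper `next_word_probs` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus, already split into tokens.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev: str) -> dict:
    """Estimate P(next word | previous word) from the bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs("the")
print(probs)  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

A neural language model plays the same role as `next_word_probs`, but conditions on the whole preceding context instead of a single word.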

In [57]:
text_gen = pipeline("text-generation", device=0, framework='pt', model = 'gpt2-large')


In [58]:
prompt = "Transformers in NLP are the"

In [59]:
text_gen(prompt, clean_up_tokenization_spaces=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Transformers in NLP are the right approach to do this.\n\nOne way to start from scratch is to work out some common themes. For example "emotional" is generally associated with the word "happy" while "unusual" is'}]

In [60]:
text_gen(prompt, clean_up_tokenization_spaces=True, num_return_sequences = 3, max_new_tokens = 8)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Transformers in NLP are the real-names-as-a-'},
 {'generated_text': 'Transformers in NLP are the worst. I know! The one that'},
 {'generated_text': 'Transformers in NLP are the largest, and the most popular. N'}]

## <font color = 'pickle'> **Masked Language Modeling** </font>

**What is Masked Language Modeling?**

Masked Language Modeling is a language modeling technique used to train models to predict missing or masked words in a given sentence. In this approach, certain words in the input text are randomly masked or replaced with special tokens. The model's task is to predict the original words based on the context of the surrounding words.

For example, in the sentence "The quick brown ___ jumps over the lazy dog," the word "fox" might be masked, and the model's objective is to correctly predict the missing word "fox" based on the context provided by the other words in the sentence.

**Why is Masked Language Modeling Important?**

Masked Language Modeling is crucial for several reasons:

1. **Pre-training Language Models**: It is widely used in pre-training language models like BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa. These models are pretrained on large corpora of text using masked language modeling, which helps them capture rich contextual information about words and their relationships.

2. **Contextual Understanding**: Masked Language Modeling allows models to understand the context and meaning of words in a sentence. By predicting masked words, the model learns to rely on the surrounding words to infer the correct missing word.

3. **Semantic Representations**: Pretrained models using masked language modeling can produce high-quality word embeddings and semantic representations that benefit various downstream NLP tasks like text classification, named entity recognition, and sentiment analysis.

Overall, Masked Language Modeling is a valuable technique for pre-training language models, improving contextual understanding, and enhancing the performance of NLP models across a wide range of tasks.
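The masking step itself can be sketched in a few lines. The version below is a simplified illustration: each token is replaced by a mask token with a fixed probability, and the original token is kept as the prediction target. The helper `mask_tokens` and its parameters are assumptions; BERT's actual procedure is more involved (it selects about 15% of tokens and sometimes keeps or randomly replaces them instead of masking).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Randomly mask tokens; return the masked sequence and the targets to predict."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model would be trained to recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```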

In [61]:
mlm = pipeline('fill-mask', framework='pt', device=0)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [62]:
mlm('the quick brown fox <mask> over the lazy dog')

[{'score': 0.12256921827793121,
  'token': 33189,
  'token_str': ' leaping',
  'sequence': 'the quick brown fox leaping over the lazy dog'},
 {'score': 0.04575977474451065,
  'token': 2693,
  'token_str': ' wins',
  'sequence': 'the quick brown fox wins over the lazy dog'},
 {'score': 0.03614804521203041,
  'token': 13855,
  'token_str': ' jumps',
  'sequence': 'the quick brown fox jumps over the lazy dog'},
 {'score': 0.03298729285597801,
  'token': 32564,
  'token_str': ' leaps',
  'sequence': 'the quick brown fox leaps over the lazy dog'},
 {'score': 0.02487633377313614,
  'token': 878,
  'token_str': ' running',
  'sequence': 'the quick brown fox running over the lazy dog'}]

## <font color = 'pickle'> **ChatBot**

In [63]:
from transformers import pipeline, Conversation

In [64]:
chatbot = pipeline(model="microsoft/DialoGPT-medium")


In [65]:
conversation = Conversation("What is NLP")
conversation = chatbot(conversation)
conversation

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Conversation id: f7648102-72f6-4f34-8c13-785e3db7976f 
user >> What is NLP 
bot >> Natural Language Processing 

## <font color = 'pickle'> **Feature extraction**
Machine learning algorithms require numerical data as input. Feature extraction transforms textual data into numerical features that algorithms can process.

In [66]:
feature_extractor = pipeline(task="feature-extraction", model="bert-base-uncased", return_tensors = True, device = 0)


In [67]:
features = feature_extractor(Reviews)

In [68]:
# Inspect the shapes for two reviews: (batch, number of tokens, hidden size)

features[2].shape, features[1].shape

(torch.Size([1, 15, 768]), torch.Size([1, 16, 768]))

In [69]:
import torch
# Average over the token dimension (dim=1) to get one 768-dimensional vector per review
review_embeddings = torch.stack([t.mean(dim=1) for t in features])

In [70]:
review_embeddings.shape

torch.Size([4, 1, 768])

In [71]:
review_embeddings = review_embeddings.squeeze(dim = 1)

In [72]:
review_embeddings.shape

torch.Size([4, 768])
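The averaging step above is often called mean pooling: the per-token vectors of a review are averaged into a single fixed-size vector, regardless of how many tokens the review has. The sketch below shows the same idea in plain Python on small made-up vectors; `mean_pool` and the toy data are assumptions for illustration, not the notebook's real features.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens.
    return [sum(dim_values) / n_tokens for dim_values in zip(*token_vectors)]

# Two toy "reviews" with different numbers of tokens and 3-dimensional vectors.
review_a = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
review_b = [[0.0, 0.0, 6.0], [3.0, 3.0, 0.0], [3.0, 0.0, 0.0]]

embeddings = [mean_pool(review) for review in (review_a, review_b)]
print(embeddings)  # [[2.0, 3.0, 4.0], [2.0, 1.0, 2.0]]
```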

# <font color = 'pickle'>**Pipeline Steps**

The pipeline groups together three steps:
1. Preprocessing
2. Passing the inputs through the model
3. Postprocessing


#### <font color = 'pickle'>**1. Preprocessing**

In [73]:
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [74]:
text = [" this is very intersting movie",
           "movie is terribly exciting",
           "movie was moving very slowly"]

In [75]:
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  2023,  2003,  2200,  6970, 16643,  3070,  3185,   102],
        [  101,  3185,  2003, 16668, 10990,   102,     0,     0,     0],
        [  101,  3185,  2001,  3048,  2200,  3254,   102,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 0, 0]])}
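In the output above, shorter sequences are padded with `0` to the length of the longest one, and `attention_mask` marks real tokens with `1` and padding with `0`. The sketch below reproduces that behaviour in plain Python from the unpadded token ids; `pad_batch` is a hypothetical helper, not the tokenizer's own implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Token ids from the output above, with the padding zeros removed.
sequences = [
    [101, 2023, 2003, 2200, 6970, 16643, 3070, 3185, 102],
    [101, 3185, 2003, 16668, 10990, 102],
    [101, 3185, 2001, 3048, 2200, 3254, 102],
]

input_ids, attention_mask = pad_batch(sequences)
print(input_ids[1])       # [101, 3185, 2003, 16668, 10990, 102, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 0, 0, 0]
```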


#### <font color = 'pickle'>**2. Passing the inputs through the model**

In [76]:
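from transformers import AutoModel
# AutoModel loads the base model without a task head, so it returns hidden states rather than class scores
model = AutoModel.from_pretrained(model_name)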

In [77]:
outputs = model(**inputs)
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.2737,  0.3258,  0.3651,  ..., -0.0682,  0.8293,  0.0176],
         [-0.2063,  0.4402,  0.2473,  ..., -0.1077,  0.8852,  0.0530],
         [-0.2451,  0.4472,  0.3704,  ..., -0.1224,  0.8805,  0.1819],
         ...,
         [-0.2283,  0.3407,  0.3836,  ..., -0.1466,  0.8337, -0.0511],
         [-0.0512,  0.4027,  0.3131,  ...,  0.0028,  0.8705, -0.1176],
         [ 0.7080,  0.6137,  0.6896,  ...,  0.1185,  0.7360, -0.5389]],

        [[ 0.5364,  0.0430,  0.1936,  ...,  0.4000,  0.9084, -0.4706],
         [ 0.8147,  0.1248,  0.2523,  ...,  0.1625,  0.7363, -0.4146],
         [ 0.7266,  0.1126,  0.0194,  ...,  0.3085,  0.8434, -0.2526],
         ...,
         [ 0.4643, -0.0082,  0.0661,  ...,  0.4233,  0.8137, -0.3507],
         [ 0.4551, -0.0173,  0.0869,  ...,  0.4047,  0.8042, -0.3597],
         [ 0.6272,  0.0367,  0.0485,  ...,  0.4574,  0.7861, -0.2509]],

        [[-0.8396,  0.3566,  0.6816,  ...,  0.0285, -0.3160, -0.2215],
         ...

In [78]:
print(outputs.last_hidden_state.shape)

torch.Size([3, 9, 768])


In [79]:
from transformers import AutoModelForSequenceClassification
# This class adds the sequence classification head, so the output contains one logit per class
model = AutoModelForSequenceClassification.from_pretrained(model_name)
outputs = model(**inputs)

In [80]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.8127,  1.8995],
        [-4.2377,  4.5418],
        [ 3.5005, -2.9683]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [81]:
outputs.logits.shape

torch.Size([3, 2])

In [82]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[2.3839e-02, 9.7616e-01],
        [1.5384e-04, 9.9985e-01],
        [9.9845e-01, 1.5486e-03]], grad_fn=<SoftmaxBackward0>)
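Softmax turns each row of logits into probabilities that sum to 1 by exponentiating and normalizing. As a cross-check, the sketch below applies a plain-Python softmax to the logits printed earlier; because those logits are rounded to four decimals, the results should agree with the tensor above only up to small rounding differences.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the SequenceClassifierOutput above.
logits = [[-1.8127, 1.8995], [-4.2377, 4.5418], [3.5005, -2.9683]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```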


#### <font color = 'pickle'>**3. PostProcessing**

In [83]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
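Postprocessing then amounts to picking the most probable class for each input and looking up its name in `id2label`. The sketch below does this in plain Python with the probabilities and the mapping shown above; the output format mimics what the pipeline returns, but the loop itself is only an illustration.

```python
# Mapping and probabilities copied from the cells above.
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
probabilities = [
    [2.3839e-02, 9.7616e-01],
    [1.5384e-04, 9.9985e-01],
    [9.9845e-01, 1.5486e-03],
]

results = []
for row in probabilities:
    # Index of the largest probability in this row (argmax).
    best = max(range(len(row)), key=row.__getitem__)
    results.append({'label': id2label[best], 'score': row[best]})

print([r['label'] for r in results])  # ['POSITIVE', 'POSITIVE', 'NEGATIVE']
```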

# <Font color = 'pickle'>**Complete Example-Sentiment Analysis - IMDB Dataset**

For this notebook, we will use the IMDB movie review dataset. <br>
Link to the complete dataset: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

For this analysis, we use a smaller version of the dataset that contains only shorter reviews (fewer than 400 words).


## <Font color = 'pickle'>**Import Libraries**

In [84]:
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

## <Font color = 'pickle'>**Mount Google Drive**

In [85]:
# mount google drive
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')

Mounted at /content/drive


## <Font color = 'pickle'>**Specify Data Folder**

In [86]:
# This is the path where we will download and save data
if 'google.colab' in str(get_ipython()):
  base_folder = Path('/content/drive/MyDrive/data')
else:
  base_folder = Path('/home/harpreet/Insync/google_drive_shaannoor/data')

In [87]:
data_folder = base_folder/'datasets/aclImdb'

In [88]:
# Location of the training file
train_file = data_folder /'train_data_smaller.csv'

## <Font color = 'pickle'>**Create DataFrame**

In [89]:
# creating Pandas Dataframe
train_data = pd.read_csv(train_file, index_col=0)

In [90]:
# print shape of the datasets
print(f'Shape of Training data set is : {train_data.shape}')

Shape of Training data set is : (2024, 3)


In [91]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2024 entries, 7 to 24985
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  2024 non-null   object
 1   Labels   2024 non-null   int64 
 2   length   2024 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 63.2+ KB


In [92]:
train_data.head()

|  | Reviews | Labels | length |
|---|---|---|---|
| 7 | An excellent family movie... gives a lot to th... | 1 | 365 |
| 17 | My favorite movie. What a great story this rea... | 1 | 132 |
| 19 | Why did this movie fail commercially? It's got... | 1 | 242 |
| 20 | I don't quite know how to explain "Darkend Roo... | 1 | 424 |
| 30 | What a good film! Made Men is a great action m... | 1 | 329 |


## <Font color = 'pickle'>**Load Pipeline**

In [93]:
classifier = pipeline('sentiment-analysis', device = 0)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


## <Font color = 'pickle'>**Create a list of reviews**

In [94]:
texts_train = train_data['Reviews'].tolist()

## <Font color = 'pickle'>**Get Predictions**

In [95]:
predictions_train = classifier(texts_train)

In [96]:
predictions_train[0:10]

[{'label': 'POSITIVE', 'score': 0.999883770942688},
 {'label': 'NEGATIVE', 'score': 0.9934924244880676},
 {'label': 'POSITIVE', 'score': 0.9991313815116882},
 {'label': 'POSITIVE', 'score': 0.9776297807693481},
 {'label': 'POSITIVE', 'score': 0.9998581409454346},
 {'label': 'POSITIVE', 'score': 0.9982699155807495},
 {'label': 'POSITIVE', 'score': 0.9998489618301392},
 {'label': 'POSITIVE', 'score': 0.9997541308403015},
 {'label': 'POSITIVE', 'score': 0.8102975487709045},
 {'label': 'POSITIVE', 'score': 0.9997641444206238}]

In [97]:
# Convert the string labels to integers: POSITIVE -> 1, NEGATIVE -> 0
preds_train = np.array([1 if d['label'].startswith('P') else 0 for d in predictions_train])


In [98]:
preds_train[0:10]

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

## <font color = 'pickle'>**Model Evaluation**

In [99]:
print("acc", np.mean(train_data['Labels']==preds_train)*100)

acc 91.40316205533597


In [100]:
cm_train = confusion_matrix(train_data['Labels'], preds_train, normalize = 'true')

In [101]:
cm_train

array([[0.90635838, 0.09364162],
       [0.08024159, 0.91975841]])
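With `normalize='true'`, each row of the confusion matrix is divided by the number of examples whose true label is that row's class, so every row sums to 1 and the diagonal holds the per-class recall. The sketch below reproduces that normalization in plain Python on a small made-up set of labels; `normalized_confusion` and the toy data are assumptions for illustration, not scikit-learn's implementation.

```python
from collections import Counter

def normalized_confusion(y_true, y_pred, labels=(0, 1)):
    """Confusion matrix with rows = true labels, normalized so each row sums to 1."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix.append([counts[(t, p)] / row_total if row_total else 0.0 for p in labels])
    return matrix

# Toy labels: 4 negative and 5 positive examples, with one mistake in each class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0]

cm = normalized_confusion(y_true, y_pred)
print(cm)  # [[0.75, 0.25], [0.2, 0.8]]
```

Read this way, the matrix above says that roughly 91% of the negative reviews and 92% of the positive reviews were classified correctly.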