<a href="https://colab.research.google.com/github/Jadamoureen/feature_generation/blob/transformers/Screening_exercise(Moureen_Caroline).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A short overview of using [Transformers](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines)

***Installing libraries*** comes first. This notebook relies on the `transformers` and `datasets` packages, so the cell below installs both with `pip`. The commented-out line installs the development version of `transformers` directly from GitHub instead.

In [1]:
# Transformers installation
! pip install transformers datasets
# ! pip install git+https://github.com/huggingface/transformers.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collectin

**Installing the Machine Learning Frameworks**

In [2]:
! pip install torch


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
! pip install tensorflow


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
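
Before moving on, it can help to confirm which packages actually ended up in the environment, since `pipeline()` needs at least one of PyTorch or TensorFlow to run a model. The sketch below uses only the standard library; the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if it is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the libraries installed in the cells above.
report = installed_versions(["transformers", "datasets", "torch", "tensorflow"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

If either `torch` or `tensorflow` shows a version number, the pipelines in the rest of this notebook should have a backend to run on.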


**Pipeline**

The [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) is the easiest and fastest way to use a pretrained model for inference. You can use the [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) out-of-the-box for many tasks across different modalities, some of which are shown in the table below:

<Tip>

For a complete list of available tasks, check out the [pipeline API reference](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines).

</Tip>

| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | pipeline(task="text-generation")              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | pipeline(task="vqa")                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | pipeline(task="document-question-answering")  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | pipeline(task="image-to-text")                |
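
Two identifiers in the table are aliases rather than canonical task names: in the Transformers releases I am aware of, `sentiment-analysis` resolves to `text-classification` and `vqa` resolves to `visual-question-answering`. The small lookup below only illustrates that relationship; `TABLE_ALIASES` and `canonical_task` are hypothetical names, not part of the library.

```python
# Aliases used in the table above, mapped to their canonical task names.
# Assumption: this mirrors the library's behaviour; it is not an exhaustive list.
TABLE_ALIASES = {
    "sentiment-analysis": "text-classification",
    "vqa": "visual-question-answering",
}


def canonical_task(task):
    """Return the canonical task name for an alias, or the task unchanged."""
    return TABLE_ALIASES.get(task, task)


print(canonical_task("sentiment-analysis"))  # text-classification
print(canonical_task("summarization"))       # summarization
```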

Start by creating an instance of [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) and specifying a task you want to use it for. In this guide, you'll use the [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) for sentiment analysis as an example:

In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
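
The warning above suggests naming the model explicitly. A minimal sketch of doing so follows, reusing the model name and revision reported in the log. Treating the short revision `af0f99b` as still available on the Hub is an assumption, and `build_classifier` is a hypothetical helper name. The import sits inside the function so that nothing is downloaded until the helper is called.

```python
# Model name and revision copied from the warning printed by pipeline().
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",  # assumption: this revision is still available
}


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily; the pipeline() call below downloads the model files.
    from transformers import pipeline

    return pipeline(**PINNED_SENTIMENT)
```

Calling `classifier = build_classifier()` should then behave like the cell above, without the "No model was supplied" warning.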



In [5]:
classifier("Today I feel so discouraged.")

[{'label': 'NEGATIVE', 'score': 0.9997559189796448}]

The code above uses a pretrained text classification model to predict a sentiment label for a piece of text.

First, `pipeline("sentiment-analysis")` downloads a **pretrained text classification model** together with its tokenizer and wraps them in the `classifier` object. Because no model was named, the library fell back to `distilbert-base-uncased-finetuned-sst-2-english`, as the log message reports. This is a DistilBERT **neural network** fine-tuned on the SST-2 sentiment dataset.

Then, the `classifier` object is called directly, like a function, with the input text "Today I feel so discouraged." as the argument. There is no separate predict method to call: the pipeline tokenizes the text, runs the model, and post-processes the result in one step.

The output is a list with one dictionary per input. Each dictionary holds a string `label` (here `'NEGATIVE'` or `'POSITIVE'`) and a `score` between 0 and 1 that expresses the model's confidence in that label. For this sentence the result is `NEGATIVE` with a score of about 0.9998.

Overall, this code shows how pretrained text classification models can automatically sort text into categories, for tasks such as sentiment analysis, topic classification, or spam detection. Applications range from analyzing customer reviews to filtering unwanted content on social media.
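
To make the return format concrete, here is a small sketch that runs a classifier over several sentences and flags low-confidence results. It assumes `classifier` behaves like the pipeline above, returning one dictionary with `label` and `score` keys per input text when given a list. The helper name `label_texts` and the 0.9 threshold are choices made for this example.

```python
def label_texts(classifier, texts, min_score=0.9):
    """Classify each text and mark whether its score reaches min_score."""
    predictions = classifier(texts)  # list of texts in, list of dicts out
    return [
        {
            "text": text,
            "label": pred["label"],
            "score": round(pred["score"], 4),
            "confident": pred["score"] >= min_score,
        }
        for text, pred in zip(texts, predictions)
    ]
```

With the pipeline from this notebook, `label_texts(classifier, ["Today I feel so discouraged."])` would return a single row labelled `NEGATIVE`.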

In [7]:
classifier("After feeling dejected before, I feel fantastic right now.")


[{'label': 'POSITIVE', 'score': 0.9998574256896973}]

**Transformers**

Transformers is Hugging Face's natural language processing library, and the Hugging Face Hub is open to all ML models, with support from libraries such as Flair, Asteroid, ESPnet, and Pyannote.

In [10]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

mask_fill = pipeline("fill-mask", model="bert-base-uncased")
mask_fill(f"My laptop is so annoying {tokenizer.mask_token} and I couldn't complete my machine learning project.", top_k=2)



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.8227320313453674,
  'token': 1010,
  'token_str': ',',
  'sequence': "my laptop is so annoying, and i couldn't complete my machine learning project."},
 {'score': 0.13677158951759338,
  'token': 1012,
  'token_str': '.',
  'sequence': "my laptop is so annoying. and i couldn't complete my machine learning project."}]

The code above uses the Hugging Face Transformers library to perform masked language modeling (MLM) with the **BERT model**.

First, the code imports the `BertTokenizer` class from the Transformers library, which converts raw text into the token IDs that BERT expects.

Then, an instance of the **BertTokenizer class** is created by calling the `from_pretrained` method with the argument `'bert-base-uncased'`. This downloads the tokenizer's vocabulary and settings from the Hugging Face model hub; the model weights themselves are not loaded at this point. Here the tokenizer is only needed for its `mask_token` attribute.

Next, the code calls `pipeline()` with the task `"fill-mask"` and `model="bert-base-uncased"`. This returns a pipeline object, `mask_fill`, that loads the BERT weights and performs MLM with them. The warning about unused weights is expected: the checkpoint also contains a next-sentence-prediction head (`cls.seq_relationship`), which `BertForMaskedLM` does not use.

Finally, `mask_fill` is called with a sentence containing a masked token (`{tokenizer.mask_token}`) and the argument `top_k=2`. The pipeline returns the 2 most likely tokens for the masked position according to the BERT model's predictions.

For example, the f-string expands to "My laptop is so annoying `[MASK]` and I couldn't complete my machine learning project." before it reaches the pipeline. The output is a list of two dictionaries. In each one, `token_str` is the predicted token, `token` is its vocabulary ID, `sequence` is the sentence with the mask filled in, and `score` is the model's probability for that token. In this run both predictions are punctuation marks (`,` and `.`) rather than words, plausibly because the sentence already reads as complete without the masked position.

This demonstrates how the Hugging Face Transformers library can be used to perform MLM with pretrained language models like BERT, and how the `pipeline()` function simplifies setting up and using those models.
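
In the run above, both top predictions were punctuation marks. One way to get word-like suggestions is to request more candidates and drop the non-alphabetic ones. The sketch below assumes `fill_mask` is a fill-mask pipeline like `mask_fill`, exposing its tokenizer as `fill_mask.tokenizer` and accepting `top_k`. The helper name `fill_blank_words` and the `___` placeholder are inventions for this example.

```python
def fill_blank_words(fill_mask, template, top_k=10, placeholder="___"):
    """Fill the placeholder in template and keep only alphabetic predictions."""
    # Reuse the pipeline's own tokenizer, so no separate BertTokenizer is needed.
    prompt = template.replace(placeholder, fill_mask.tokenizer.mask_token)
    predictions = fill_mask(prompt, top_k=top_k)
    # Drop punctuation such as ',' and '.' by keeping alphabetic tokens only.
    return [p for p in predictions if p["token_str"].strip().isalpha()]
```

For instance, `fill_blank_words(mask_fill, "My laptop is so annoying ___ and I couldn't complete my machine learning project.")` would return whichever of the ten most likely tokens are plain words. The list can be empty if every candidate is punctuation or a word fragment.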
