<a href="https://colab.research.google.com/github/Sandeepk14/GenerativeAI_Basic-NLP_projects-with-Huggingface_Transformers-/blob/main/HuggingFace_Transformers_Basic_Projects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `prompt: about hugging face`

Hugging Face is a community and platform that provides tools and resources for building, training, and deploying machine learning models, particularly in the field of Natural Language Processing (NLP).  Here's a breakdown of key aspects:

**Key Features and Components:**

* **Transformers Library:**
This is arguably their most well-known offering. It's a Python library providing pre-trained models for various NLP tasks, including text classification, question answering, translation, text generation, and summarization.  It simplifies the process of using these complex models, offering a consistent interface and pre-processing tools.

* **Datasets Library:**  Hugging Face Datasets provides a collection of datasets for various machine learning tasks, particularly in NLP. It provides a unified way to access and manage these datasets, making it easier to experiment with different models and benchmarks.

* **Model Hub:**  A repository of pre-trained models from various sources.  You can find and download models for different tasks and languages.  This fosters collaboration and allows researchers and developers to share and benefit from each other's work.

* **Spaces:**  An environment to deploy and share machine learning demos and apps.  This allows you to create interactive interfaces for your models and share them with the wider community.

* **Tokenizers:**  Crucial for NLP, tokenizers break down text into smaller units that models can process. Hugging Face provides a suite of tokenization tools that are compatible with their models.

* **Community:** A vibrant community of developers and researchers contributes to and utilizes the Hugging Face ecosystem.  This fosters collaboration, knowledge sharing, and the rapid advancement of NLP technologies.

**How it's Used:**

1. **Fine-tuning Pre-trained Models:**  One of the most common uses is taking a pre-trained model and fine-tuning it on a specific dataset for a particular task.  This leverages the existing knowledge of the pre-trained model to achieve better performance with less data and training time.

2. **Exploring and Experimenting:**  The Model Hub and Datasets Library provide a convenient way to experiment with different models and datasets, speeding up the research and development process.

3. **Deploying Models:**  Hugging Face Spaces offer a straightforward method to deploy models and make them accessible to others through an interactive interface.

4. **Accessing Datasets:**  Using the Datasets library makes it much easier to work with different datasets, standardizing the loading and preparation processes.


**In summary,** Hugging Face is a powerful ecosystem that significantly simplifies the use of machine learning models, especially in NLP, by providing convenient tools, pre-trained models, datasets, and a collaborative community.


In [1]:
! pip install transformers



In [2]:
! pip install 'transformers[tf-cpu]'

Collecting keras<2.16,>2.9 (from transformers[tf-cpu])
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Collecting tensorflow-cpu<2.16,>2.9 (from transformers[tf-cpu])
  Downloading tensorflow_cpu-2.15.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting onnxconverter-common (from transformers[tf-cpu])
  Downloading onnxconverter_common-1.14.0-py2.py3-none-any.whl.metadata (4.2 kB)
Collecting tf2onnx (from transformers[tf-cpu])
  Downloading tf2onnx-1.16.1-py3-none-any.whl.metadata (1.3 kB)
Collecting tensorflow-text<2.16 (from transformers[tf-cpu])
  Downloading tensorflow_text-2.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting keras-nlp<0.14.0,>=0.3.1 (from transformers[tf-cpu])
  Downloading keras_nlp-0.12.1-py3-none-any.whl.metadata (6.8 kB)
Collecting tensorflow-probability<0.24 (from transformers[tf-cpu])
  Downloading tensorflow_probability-0.23.0-py2.py3-none-any.whl.metadata (13 kB)
C

# Sentiment Anaysis





In [3]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


In [4]:
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [5]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


# Text Classification

In [6]:


classifier1 = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=None)

sentences = ["I am not having a great day"]

model_outputs = classifier1(sentences)
print(model_outputs[0])
# produces a list of dicts for each of the labels


config.json:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/380 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': 'disappointment', 'score': 0.46669596433639526}, {'label': 'sadness', 'score': 0.3984943926334381}, {'label': 'annoyance', 'score': 0.06806610524654388}, {'label': 'neutral', 'score': 0.05703021585941315}, {'label': 'disapproval', 'score': 0.04423946887254715}, {'label': 'nervousness', 'score': 0.014850739389657974}, {'label': 'realization', 'score': 0.014059904962778091}, {'label': 'approval', 'score': 0.011267454363405704}, {'label': 'joy', 'score': 0.0063033816404640675}, {'label': 'remorse', 'score': 0.006221487186849117}, {'label': 'caring', 'score': 0.006029392126947641}, {'label': 'embarrassment', 'score': 0.005265498533844948}, {'label': 'anger', 'score': 0.004981442354619503}, {'label': 'disgust', 'score': 0.004259037785232067}, {'label': 'grief', 'score': 0.004002132453024387}, {'label': 'confusion', 'score': 0.003382926108315587}, {'label': 'relief', 'score': 0.0031404944602400064}, {'label': 'desire', 'score': 0.0028274687938392162}, {'label': 'admiration', 'scor

In [7]:

try:
    classifier = pipeline("sentiment-analysis")
except OSError:
    print("Installing transformers...")
    !pip install transformers
    classifier = pipeline("sentiment-analysis")

try:
    classifier1 = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=None)
except OSError:
    print("Installing required model...")
    !pip install 'transformers[tf-cpu]'
    classifier1 = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=None)


sentences = ["I am not having a great day"]

model_outputs = classifier1(sentences)
model_outputs[0]


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0
Device set to use cuda:0


[{'label': 'disappointment', 'score': 0.46669596433639526},
 {'label': 'sadness', 'score': 0.3984943926334381},
 {'label': 'annoyance', 'score': 0.06806610524654388},
 {'label': 'neutral', 'score': 0.05703021585941315},
 {'label': 'disapproval', 'score': 0.04423946887254715},
 {'label': 'nervousness', 'score': 0.014850739389657974},
 {'label': 'realization', 'score': 0.014059904962778091},
 {'label': 'approval', 'score': 0.011267454363405704},
 {'label': 'joy', 'score': 0.0063033816404640675},
 {'label': 'remorse', 'score': 0.006221487186849117},
 {'label': 'caring', 'score': 0.006029392126947641},
 {'label': 'embarrassment', 'score': 0.005265498533844948},
 {'label': 'anger', 'score': 0.004981442354619503},
 {'label': 'disgust', 'score': 0.004259037785232067},
 {'label': 'grief', 'score': 0.004002132453024387},
 {'label': 'confusion', 'score': 0.003382926108315587},
 {'label': 'relief', 'score': 0.0031404944602400064},
 {'label': 'desire', 'score': 0.0028274687938392162},
 {'label': '

# questioning and answering

In [8]:
# prompt: questioning and answering



try:
    question_answerer = pipeline("question-answering")
except OSError:
    print("Installing transformers...")

    question_answerer = pipeline("question-answering")

context = """
Hugging Face is a community and platform that provides tools and resources for building, training, and deploying machine learning models,
particularly in the field of Natural Language Processing (NLP).  It offers the Transformers library, Datasets library, Model Hub, Spaces,
Tokenizers, and a vibrant community.  Common uses include fine-tuning pre-trained models, exploring and experimenting with models and datasets,
 deploying models, and accessing datasets.
"""

question = "What is Hugging Face?"
result = question_answerer(question=question, context=context)
print(f"Answer: {result['answer']}")

question = "What are some common uses of Hugging Face?"
result = question_answerer(question=question, context=context)
print(f"Answer: {result['answer']}")


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Device set to use cuda:0


Answer: a community and platform
Answer: fine-tuning pre-trained models


# Name Entity Recognition

# NER

In [9]:


# Load NER pipeline with a pre-trained model
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Sample text for NER
text = "My name is Sandeep Kumar,I'm data scientist at Jupiter AI Labs in Noida."

# Perform NER
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")


config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Device set to use cuda:0


Entity: Sand, Label: I-PER, Score: 0.9996
Entity: ##eep, Label: I-PER, Score: 0.9996
Entity: Kumar, Label: I-PER, Score: 0.9998
Entity: Jupiter, Label: I-ORG, Score: 0.9984
Entity: AI, Label: I-ORG, Score: 0.9988
Entity: Labs, Label: I-ORG, Score: 0.9860
Entity: No, Label: I-LOC, Score: 0.9960
Entity: ##ida, Label: I-LOC, Score: 0.9973


 # Summarizing a Long Text with BART

In [10]:


# Load summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Input text
text = """
Hugging Face is a company that has revolutionized the AI and NLP industry with its open-source libraries
and machine learning models. Their most popular library, Transformers, provides pre-trained deep learning models
for a variety of NLP tasks, including text generation, translation, and summarization. The company also offers
Hugging Face Hub, where developers can share and access AI models and datasets. With a strong community and
partnerships with major tech companies, Hugging Face continues to push the boundaries of AI research and development.
"""

# Generate summary
summary = summarizer(text, max_length=50, min_length=20, do_sample=False)

# Print summary
print(summary[0]['summary_text'])


config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cuda:0


Hugging Face is a company that has revolutionized the AI and NLP industry with its open-source libraries and machine learning models. Their most popular library, Transformers, provides pre-trained deep learning models for a variety of NLP


# English to Hindi Translation with Hugging Face

In [11]:
! pip install transformers sentencepiece




In [12]:
# Load translation pipeline
translator = pipeline("translation_en_to_hi", model="Helsinki-NLP/opus-mt-en-hi")

# Input text
text = "My name is Sandeep Kumar,I'm data scientist at Jupiter AI Labs in Noida."

# Translate
translation = translator(text, max_length=100)
print(translation[0]['translation_text'])

config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/306M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/306M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/44.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/812k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

Device set to use cuda:0


मेरा नाम सैनग्ली कुमार है, मैं नोआ में बॅब पर डेटा वैज्ञानिक हूँ.


In [13]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load NLLB model and tokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input text
text = "My name is Sandeep Kumar,I'm data scientist at Jupiter AI Labs in Noida."

# Prepend source language token (English - "eng_Latn")
text = f"<<eng_Latn>> {text}"

# Tokenize input text
inputs = tokenizer(text, return_tensors="pt")

# Get token ID for Hindi (Target language: "hin_Deva")
hindi_token_id = tokenizer.convert_tokens_to_ids("hin_Deva")

# Generate translation
translated_tokens = model.generate(**inputs, forced_bos_token_id=hindi_token_id)
translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

print("Hindi Translation:", translation)


tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

Hindi Translation: मेरा नाम संदीप कुमार है, मैं नोएडा में बृहस्पति एआई लैब्स में डेटा वैज्ञानिक हूँ।


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load NLLB model and tokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

while True:
    # Get user input
    text = input("\nEnter text in English (or type 'exit' to quit): ").strip()

    if text.lower() == "exit":
        print("Exiting translator. Goodbye! 👋")
        break

    # Prepend source language token (English - "eng_Latn")
    text = f"<<eng_Latn>> {text}"

    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt")

    # Get token ID for Hindi (Target language: "hin_Deva")
    hindi_token_id = tokenizer.convert_tokens_to_ids("hin_Deva")

    # Generate translation
    translated_tokens = model.generate(**inputs, forced_bos_token_id=hindi_token_id)
    translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

    # Print the Hindi translation
    print("\nHindi Translation:", translation)



Enter text in English (or type 'exit' to quit): My name is Sandeep Kumar,I'm data scientist at Jupiter AI Labs in Noida

Hindi Translation: मेरा नाम संदीप कुमार है, मैं नोएडा में बृहस्पति एआई लैब्स में डेटा वैज्ञानिक हूँ

Enter text in English (or type 'exit' to quit): exit
Exiting translator. Goodbye! 👋


# Image Generation

In [14]:
import tensorflow as tf
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion Model with TensorFlow
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, from_pt=False)

# Define function for text-to-image generation
def generate_image(prompt):
    image = pipe(prompt).images[0]  # Generate image
    image.save("tensorflow_generated_image.png")  # Save image
    return image

# Get user input and generate an image
prompt = input("Enter a text prompt: ")
generated_image = generate_image(prompt)
generated_image.show()


model_index.json:   0%|          | 0.00/541 [00:00<?, ?B/s]

Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/492M [00:00<?, ?B/s]

scheduler/scheduler_config.json:   0%|          | 0.00/308 [00:00<?, ?B/s]

(…)ature_extractor/preprocessor_config.json:   0%|          | 0.00/342 [00:00<?, ?B/s]

tokenizer/merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

tokenizer/special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

text_encoder/config.json:   0%|          | 0.00/617 [00:00<?, ?B/s]

tokenizer/tokenizer_config.json:   0%|          | 0.00/806 [00:00<?, ?B/s]

safety_checker/config.json:   0%|          | 0.00/4.72k [00:00<?, ?B/s]

tokenizer/vocab.json:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

unet/config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

diffusion_pytorch_model.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

vae/config.json:   0%|          | 0.00/547 [00:00<?, ?B/s]

diffusion_pytorch_model.safetensors:   0%|          | 0.00/335M [00:00<?, ?B/s]

Keyword arguments {'from_pt': False} are not expected by StableDiffusionPipeline and will be ignored.


Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

Enter a text prompt: A red parrot eat a apple


  0%|          | 0/50 [00:00<?, ?it/s]

In [15]:
generated_image.show()