<a href="https://colab.research.google.com/github/DigitalSocrates/DataScience/blob/master/HuggingFacesTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installation: 
To use the Hugging Face Transformers library, you need to install it first. You can use pip to install it:

In [None]:
!pip install transformers

The next cell installs the datasets library, which is also part of the Hugging Face ecosystem, a popular set of tools for working with natural language processing (NLP) and deep learning models.

In [None]:
!pip install datasets

Purpose: The datasets library is designed to make it easy to access and manipulate a wide variety of datasets commonly used in NLP and machine learning research. It provides a unified and user-friendly interface to access datasets for tasks like text classification, language modeling, question answering, and more.

Data Sources: The library offers access to a vast collection of datasets, including but not limited to those from the Common Crawl, Wikipedia, OpenWebText, academic sources, and other text corpora. It also supports various languages and domains.

Data Formats: Datasets in the datasets library are provided in a structured, easy-to-use format backed by Apache Arrow tables. Text datasets usually ship as raw text that you tokenize yourself, which makes it convenient for researchers and developers to preprocess and use the data for training and evaluation.

Integration with Transformers: The library works closely with the Hugging Face Transformers library, so a loaded dataset can be tokenized and passed to a pre-trained model for various NLP tasks with minimal code.
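One common way this integration shows up in practice is mapping a tokenizer over a dataset. The sketch below assumes the "imdb" dataset (with a "text" column) and the "gpt2" tokenizer purely for illustration; neither choice comes from this notebook, and tokenize_batch and load_and_tokenize are hypothetical helper names.

```python
def tokenize_batch(batch, tokenizer, max_length=128):
    """Tokenize one batch of rows; written for use with Dataset.map(batched=True)."""
    # batch["text"] is a list of strings when map() is called with batched=True
    return tokenizer(batch["text"], truncation=True, max_length=max_length)


def load_and_tokenize(dataset_name="imdb", split="train[:100]", model_name="gpt2"):
    """Download a small slice of a dataset and tokenize its "text" column."""
    # Imported lazily so tokenize_batch can be reused without these packages
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset(dataset_name, split=split)
    return dataset.map(lambda batch: tokenize_batch(batch, tokenizer), batched=True)
```

Calling load_and_tokenize() downloads data and tokenizer files, so it needs an internet connection the first time it runs.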

### Importing Libraries:

Once installed, import the necessary libraries:

In [None]:
# imports
from transformers import pipeline
from transformers import GPT2LMHeadModel, GPT2Tokenizer

Loading a Pre-trained Model: 

Loading a pre-trained model in Hugging Face's Transformers library is a crucial step in preparing for text generation tasks. Hugging Face provides an extensive selection of pre-trained models for various NLP tasks, including text generation. In this example, we'll demonstrate how to load the GPT-2 model, but you can choose from a range of other models to suit your particular requirements.

Keep in mind that downloading and initializing a model takes disk space and time, particularly for larger variants such as "gpt2-large" (about 774 million parameters) or "gpt2-xl" (about 1.5 billion parameters). Make sure you have sufficient disk space and a stable internet connection when working with these models.
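To make the storage requirement more concrete, the sketch below estimates the size of each GPT-2 checkpoint from its approximate parameter count, assuming 4 bytes per parameter (32-bit floats). The counts are rounded figures, and the result is a rough estimate rather than an exact file size.

```python
# Approximate, rounded parameter counts for the GPT-2 family
GPT2_PARAMS = {
    "gpt2": 124_000_000,
    "gpt2-medium": 355_000_000,
    "gpt2-large": 774_000_000,
    "gpt2-xl": 1_500_000_000,
}


def estimate_size_gb(num_params, bytes_per_param=4):
    """Rough size of the weights in gigabytes (10**9 bytes), assuming float32."""
    return num_params * bytes_per_param / 1e9


for name, params in GPT2_PARAMS.items():
    print(f"{name}: ~{estimate_size_gb(params):.1f} GB")
```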

Once you've loaded the model, you're ready to use it for text generation and text completion. Other tasks, such as text classification, need a model class with a different head (for example GPT2ForSequenceClassification) rather than GPT2LMHeadModel.

In [None]:
# Specify the model you want to use (e.g., GPT-2)
model_name = "gpt2"  # different models -> "gpt2-medium", "gpt2-large", "gpt2-xl"

# Create a tokenizer for the selected model
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Load the pre-trained model weights and architecture
model = GPT2LMHeadModel.from_pretrained(model_name)

Let's break down what each part of this code does:

model_name: This variable specifies the name of the pre-trained model you want to use. In this case, we've chosen "gpt2," which is GPT-2, but you can choose a different model that aligns with your task and resource constraints.

tokenizer: This line creates the tokenizer associated with the selected model. The tokenizer converts text into a format the model can understand by splitting it into subword tokens and mapping each token to an integer ID.

model: Here, you load the pre-trained model itself. This includes both the model's architecture and its weights, which were learned during pre-training on a large text corpus. The from_pretrained method downloads the model files if they are not already in the local cache and then initializes the model.
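The text-to-IDs round trip can be illustrated without downloading anything. The toy class below is a whitespace vocabulary lookup, not GPT-2's byte-level BPE tokenizer; ToyTokenizer and its five-entry vocabulary are hypothetical and exist only to show what encode and decode do conceptually.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, vocab):
        self.token_to_id = {token: i for i, token in enumerate(vocab)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}
        self.unk_id = self.token_to_id["<unk>"]

    def encode(self, text):
        # Words missing from the vocabulary map to the <unk> ID
        return [self.token_to_id.get(word, self.unk_id) for word in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["<unk>", "Once", "upon", "a", "time"])
ids = toy.encode("Once upon a time")
print(ids)              # [1, 2, 3, 4]
print(toy.decode(ids))  # Once upon a time
```

A subword tokenizer differs mainly in that it splits rare words into smaller known pieces instead of falling back to a single unknown token.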

In [5]:
# Input text prompt
prompt = "Once upon a time"

# Encode the prompt text into token IDs
input_ids = tokenizer.encode(prompt, return_tensors="pt")

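# Generate text (GPT-2 has no padding token, so the end-of-sequence token is passed as pad_token_id to avoid a warning)
output = model.generate(input_ids, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2, pad_token_id=tokenizer.eos_token_id)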

# Decode the generated token IDs back into text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# This will display the generated text based on your input prompt.
print(generated_text)

Once upon a time, the world was a place of great beauty and great danger. The world of the gods was the place where the great gods were born, and where they were to live.

The world that was created was not the same as the one that is now. It was an endless, endless world. And the Gods were not born of nothing. They were created of a single, single thing. That was why the universe was so beautiful. Because the cosmos was made of two


input_ids: This is where you encode your text prompt using the model's tokenizer. It converts the input text into a sequence of token IDs that the model can understand.

model.generate(): This function generates text based on the input_ids. You can specify parameters like max_length (maximum total length in tokens, including the prompt), num_return_sequences (number of sequences to return), no_repeat_ngram_size (the n-gram length that may not repeat in the output), and more.

generated_text: Finally, you decode the generated token IDs back into human-readable text using the tokenizer.
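The no_repeat_ngram_size=2 argument in the generate call prevents any 2-gram from appearing twice in the output. A minimal pure-Python sketch of that idea follows; banned_next_tokens is a hypothetical helper written for illustration and is not the library's implementation.

```python
def banned_next_tokens(generated_ids, ngram_size):
    """Return token IDs that would complete an n-gram already present in generated_ids."""
    if ngram_size < 1 or len(generated_ids) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# The sequence ends in 5, and the 2-gram (5, 7) already occurred, so 7 is banned next
print(banned_next_tokens([5, 7, 9, 5], 2))  # {7}
```

During generation, a decoder applying this rule would exclude the banned IDs before choosing the next token.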

TODO add experiments with different parameters
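One possible starting point for these experiments is a set of named presets for model.generate(). The keyword arguments below (do_sample, top_k, top_p, temperature, num_beams, early_stopping) are standard generation parameters, but the preset names, the specific values, and the run_experiments helper are assumptions chosen for illustration.

```python
# Hypothetical presets; the values are illustrative, not tuned
GENERATION_PRESETS = {
    "greedy": {"do_sample": False},
    "beam_search": {"do_sample": False, "num_beams": 5, "early_stopping": True},
    "top_k_sampling": {"do_sample": True, "top_k": 50, "temperature": 0.8},
    "nucleus_sampling": {"do_sample": True, "top_k": 0, "top_p": 0.92},
}


def run_experiments(model, tokenizer, prompt, max_new_tokens=50):
    """Generate one continuation per preset and return {preset_name: text}."""
    inputs = tokenizer(prompt, return_tensors="pt")
    results = {}
    for name, preset in GENERATION_PRESETS.items():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
            **preset,
        )
        results[name] = tokenizer.decode(output[0], skip_special_tokens=True)
    return results
```

With the model and tokenizer loaded earlier, run_experiments(model, tokenizer, "Once upon a time") returns a dictionary that can be printed preset by preset. The two sampling presets produce different text on each run unless a random seed is fixed.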


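In [None]:
# generator must be defined before it is used.
# Assumption: the task prefixes in these cells ("summarize:", "sst2 sentence:") follow the T5 convention,
# so a small T5 checkpoint in a text2text-generation pipeline is assumed here
# (this requires a transformers version that provides that pipeline task).
generator = pipeline("text2text-generation", model="t5-small")

# summarize
generator("summarize: test here")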

In [None]:
# sentiment
generator("sst2 sentence: text goes here")

In [None]:
# Questions
generator("questions: question goes here ")

In [None]:
# text generation
gpt2_generator = pipeline("text-generation", model="gpt2")

In [None]:
gpt2_generator("some phrase goes here", max_new_tokens=512)
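A text-generation pipeline returns a list of dictionaries, one per generated sequence, each with a generated_text key. The sketch below pulls out just the strings; extract_texts is a hypothetical helper, and sample_output is made-up data standing in for a real pipeline result.

```python
def extract_texts(pipeline_output):
    """Collect the generated strings from a text-generation pipeline result."""
    return [item["generated_text"] for item in pipeline_output]


# Made-up stand-in for what the pipeline call returns
sample_output = [{"generated_text": "some phrase goes here and continues"}]
for text in extract_texts(sample_output):
    print(text)
```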

In [None]:
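from datasets import load_dataset
# list_datasets is deprecated in newer releases of datasets in favor of huggingface_hub.list_datasets
from huggingface_hub import list_datasets

In [None]:
# huggingface_hub.list_datasets yields DatasetInfo objects; limit=1000 is an arbitrary cap to keep the listing fast
available = [info.id for info in list_datasets(limit=1000)]
# Keep only dataset IDs without a user or organization namespace
print([i for i in available if '/' not in i])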