<h1>Hugging Face</h1>

In [1]:
from huggingface_hub import HfApi

<h3>Searching for a model</h3>

In [2]:
# Create an instance of the API client
api = HfApi()

# Return the filtered list from the Hub
models = api.list_models(
    filter="text-classification",
    sort="downloads",
    direction=-1,
    limit=5
)

# Store as a list
modelList = list(models)

# View top search result
print(modelList[0].modelId)

1231czx/llama3_it_ultra_list_and_bold500


<h3>Saving a model</h3>

In [3]:
from transformers import AutoModel

In [4]:
modelId = "distilbert-base-uncased-finetuned-sst-2-english"

In [5]:
# Instantiate the AutoModel class
model = AutoModel.from_pretrained(modelId)

In [6]:
# Save the model
model.save_pretrained(save_directory=f"models/{modelId}")

<h3>Inspecting a dataset</h3>

In [7]:
from datasets import load_dataset_builder

In [8]:
data_builder = load_dataset_builder("imdb")

In [9]:
print(data_builder.info.description)




In [10]:
print(data_builder.info.features)

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}
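The features output above shows that `label` is a `ClassLabel` with names `['neg', 'pos']`, so each integer label is an index into that list. Below is a minimal plain-Python sketch of that mapping; the helpers `int_to_name` and `name_to_int` are hypothetical and only illustrate the idea (the `ClassLabel` feature itself offers `int2str` and `str2int` for the same purpose).

```python
# Label names copied from the features output above.
label_names = ["neg", "pos"]


def int_to_name(label_id: int) -> str:
    """Hypothetical helper: map an integer label to its class name."""
    return label_names[label_id]


def name_to_int(name: str) -> int:
    """Hypothetical helper: map a class name back to its integer label."""
    return label_names.index(name)


print(int_to_name(0))
print(name_to_int("pos"))
```

Under this mapping, a filter on `label == 0` keeps the negative reviews.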


In [11]:
from datasets import load_dataset

In [12]:
data = load_dataset("imdb")

In [13]:
imdb = load_dataset("imdb", split="train")
#wikipedia = load_dataset("wikimedia/wikipedia", language="20231101.en", split="train")

In [14]:
# Filter imdb
filtered = imdb.filter(lambda row: row['label']==0)

In [15]:
sliced = filtered.select(range(2))

print(sliced)

print(sliced[0]['text'])
print(sliced[0]['label'])

Dataset({
    features: ['text', 'label'],
    num_rows: 2
})
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. 

<h4>Auto classes to directly download a model</h4>

<h4>AutoModel class for each type of task</h4>

In [16]:
from transformers import AutoModelForSequenceClassification

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

<h4>AutoTokenizers prepare text input data</h4>

<h4>Recommended to use the tokenizer paired with the model</h4>

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

<h3>The Pipeline Module</h3>

<h4>Contains all task-specific steps</h4>

<h4>Best for quickly performing tasks</h4>

<h4>Great for getting started</h4>

In [19]:
# Each task has its own pipeline class, which uses the Auto classes behind the scenes
from transformers import (
    pipeline,
    SummarizationPipeline,
    TextClassificationPipeline,
    AudioClassificationPipeline,
    ImageSegmentationPipeline,
    QuestionAnsweringPipeline,
    AutoModelForSequenceClassification,
)

<h4>Creating a pipeline</h4>

In [20]:
my_pipeline = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

In [23]:
input = "Hi, welcome to this introduction about Hugging Face"

my_pipeline(input)

[{'label': 'POSITIVE', 'score': 0.9997480511665344}]
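The pipeline returns a label and a score rather than raw model outputs. As a rough sketch of the final post-processing step, assuming the usual single-label setup where a softmax is applied to the model's logits, the plain-Python example below turns two raw scores into the same output shape. The logits are made-up values and the `id2label` mapping is an assumption about this checkpoint, not something read from the model.

```python
import math

# Hypothetical raw scores (logits) for one input; real values come from the model.
logits = [-4.2, 4.1]

# Assumed label mapping for this sentiment checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# Keep only the most likely class, as the pipeline output above does.
best = max(range(len(probs)), key=lambda i: probs[i])
result = [{"label": id2label[best], "score": probs[best]}]

print(result)
```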

In [24]:
# Download the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Create the pipeline
model_pipeline = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer)

# Predict the sentiment
output = model_pipeline(input)

print(f"Sentiment using AutoClasses: {output[0]['label']}")

Sentiment using AutoClasses: POSITIVE


In [25]:
# Import the AutoTokenizer
from transformers import AutoTokenizer

# Download the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased", 
    clean_up_tokenization_spaces=True
)

# Normalize the input string
output = tokenizer.backend_tokenizer.normalizer.normalize_str('HOWDY, how aré yoü?')

print(output)

howdy, how are you?
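To see what this normalization step amounts to for the example string, here is a rough standard-library approximation that lowercases the text and strips accents. It is only a sketch: the tokenizer's real normalizer may do more than this, and `normalize_uncased` is a hypothetical name.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Rough approximation of an uncased normalizer: lowercase, then strip accents."""
    # NFD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Combining marks have Unicode category "Mn"; dropping them removes the accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


normalized = normalize_uncased("HOWDY, how aré yoü?")
print(normalized)
```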


In [26]:
from transformers import GPT2Tokenizer, DistilBertTokenizer

# Download the gpt tokenizer
gpt_tokenizer = GPT2Tokenizer.from_pretrained(
    "gpt2", 
    clean_up_tokenization_spaces=True
)

# Tokenize the input
gpt_tokens = gpt_tokenizer.tokenize(input)

# Repeat for distilbert
distil_tokenizer = DistilBertTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", 
    clean_up_tokenization_spaces=True
)
distil_tokens = distil_tokenizer.tokenize(text=input)

# Compare the output
print(f"GPT tokenizer: {gpt_tokens}")
print(f"DistilBERT tokenizer: {distil_tokens}")

GPT tokenizer: ['Hi', ',', 'Ġwelcome', 'Ġto', 'Ġthis', 'Ġintroduction', 'Ġabout', 'ĠHug', 'ging', 'ĠFace']
DistilBERT tokenizer: ['hi', ',', 'welcome', 'to', 'this', 'introduction', 'about', 'hugging', 'face']
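In the GPT tokens above, `Ġ` marks a token that begins with a space, which is why the original casing and spacing can be recovered from them, while the uncased DistilBERT tokens have lost the capital letters. The sketch below rebuilds the sentence from the printed GPT tokens with plain string operations. This shortcut is only meant for simple ASCII text like this example; the tokenizer's own `convert_tokens_to_string` method is the general route.

```python
# Tokens copied from the GPT tokenizer output above.
gpt_tokens = [
    "Hi", ",", "Ġwelcome", "Ġto", "Ġthis", "Ġintroduction",
    "Ġabout", "ĠHug", "ging", "ĠFace",
]

# Join the pieces, then turn each "Ġ" marker back into the space it stands for.
restored = "".join(gpt_tokens).replace("Ġ", " ")

print(restored)
```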


In [27]:
# Create a pipeline
classifier = pipeline(
    task="text-classification",
    model="abdulmatinomotoso/English_Grammar_Checker"
)

# Predict classification
output = classifier("I will walk dog")

print(output)



[{'label': 'LABEL_0', 'score': 0.9956323504447937}]


In [28]:
# Create the pipeline
classifier = pipeline(task="text-classification", model="cross-encoder/qnli-electra-base")

# Predict the output
output = classifier("Where is the capital of France?, Brittany is known for their kouign-amann.")

print(output)

[{'label': 'LABEL_0', 'score': 0.0052389358170330524}]


In [29]:
# Build the zero-shot classifier
classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

# Create the list
candidate_labels = ["politics", "science", "sports"]

# Predict the output
output = classifier(sequences="A 75-million-year-old Gorgosaurus fossil is the first tyrannosaur skeleton ever found with a filled stomach.", candidate_labels=candidate_labels)

print(f"Top Label: {output['labels'][0]} with score: {output['scores'][0]}")

Top Label: science with score: 0.903059720993042


In [30]:
# Create the summarization pipeline
summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum")

original_text = "Greece has many islands, with estimates ranging from somewhere around 1,200 to 6,000, depending on the minimum size to take into account. \
                The number of inhabited islands is variously cited as between 166 and 227. The Greek islands are traditionally grouped into the following \
                clusters: the Argo-Saronic Islands in the Saronic Gulf near Athens; the Cyclades, a large but dense collection occupying the central part of \
                the Aegean Sea; the North Aegean islands, a loose grouping off the west coast of Turkey; the Dodecanese, another loose collection in the \
                southeast between Crete and Turkey; the Sporades, a small tight group off the coast of Euboea; and the Ionian Islands, chiefly located to \
                the west of the mainland in the Ionian Sea. Crete with its surrounding islets and Euboea are traditionally excluded from this grouping."

# Summarize the text
summary_text = summarizer(original_text)

# Compare the length
print(f"Original text length: {len(original_text)}")
print(f"Summary length: {len(summary_text[0]['summary_text'])}")

Original text length: 907
Summary length: 473
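Note that `len` counts characters, and `original_text` is built with backslash continuations inside a single string literal, so the indentation of each continued line becomes part of the string and is counted too. The sketch below shows the effect on a short illustrative string (`sample_text` is a stand-in, not the original text) and one way to collapse the extra whitespace before comparing lengths.

```python
# Short stand-in built the same way as original_text (backslash continuation
# inside the literal), so the indentation below ends up inside the string.
sample_text = "Greece has many islands, \
                with estimates ranging from 1,200 to 6,000."

# Collapse every run of whitespace into a single space.
clean_text = " ".join(sample_text.split())

print(len(sample_text), len(clean_text))
print(clean_text)
```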


In [31]:
# Create a short summarizer
short_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=1, max_length=10)

# Summarize the input text
short_summary_text = short_summarizer(original_text)

# Print the short summary
print(short_summary_text[0]["summary_text"])

Greece has many islands, with estimates ranging


In [32]:
# Create a short summarizer
short_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=1, max_length=10)

# Summarize the input text
short_summary_text = short_summarizer(original_text)

# Print the short summary
print(short_summary_text[0]["summary_text"])

# Repeat for a long summarizer
long_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=50, max_length=150)

long_summary_text = long_summarizer(original_text)

# Print the long summary
print(long_summary_text[0]["summary_text"])

Greece has many islands, with estimates ranging
Greece has many islands, with estimates ranging from somewhere around 1,200 to 6,000 depending on the minimum size to take into account. The number of inhabited islands is variously cited as between 166 and 227. The Greek islands are traditionally grouped into the following clusters: the Argo-Saronic Islands in the Saronic Gulf near Athens; the Cyclades, a large but dense collection occupying the central part of the Aegean Sea; the North Aegesan islands, an loose group


In [33]:
# Create the list
text_to_summarize = [w["text"] for w in imdb]

# Create the pipeline
summarizer = pipeline("summarization", model="cnicu/t5-small-booksum", min_length=20, max_length=50)

# Summarize each item in the list
summaries = summarizer(text_to_summarize[:3], truncation=True)

# Loop over the summaries and print each one
for i, summary in enumerate(summaries, start=1):
    print(f"Summary {i}: {summary['summary_text']}")

Summary 1: I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. The plot is centered around a young Swedish drama student named Lena
Summary 2: "I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. I
Summary 3: This film is interesting as an experiment but tells no cogent story. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless
