
In [8]:
from transformers import pipeline

# Text Generation

In [None]:
text_generator = pipeline("text-generation", model='distilbert/distilgpt2')
generated_text = text_generator("Today is a rainy day in London",
                                truncation=True,
                                num_return_sequences=2)

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
print(generated_text[0]['generated_text'])

Today is a rainy day in London, and the city's winter weather has been very dry.




But since the first storm of April, there have been several days of rain in the city on the North Tower Bridge.
This morning, the National Weather Service and the Met Office were called to the scene and made a report about the situation.
There have been no reports of any damage.
The National Weather Service said it had been working to reduce the rain by up to two inches per day.
The National Weather Service said they were not aware of any damage to the bridge and urged commuters to stay away from the bridge.
In a statement the National Weather Service said: "Although the National Weather Service is working to reduce the rain, we are working with the National Weather Service to ensure that all emergency services are in full contact with the National Weather Service on their website to identify the exact cause of the storm."
An emergency response will be launched within 30 days.
It is understood that peo

In [None]:
print(generated_text[1]['generated_text'])

Today is a rainy day in London, and it's something that we all have to take seriously.



The recent financial meltdown, a series of financial crises that have already cost the economy trillions of pounds, left millions of our jobs and raised the cost of public services to almost 500 million people.
They've given us more money than any other one of our most important public services in recent history, as we've seen during the financial crisis.
In the last seven years, the cost of public services has declined by about 50%.
The crisis has also left thousands of jobless and unemployed.
When we talk about the financial crisis, there's a lot to be done.
The Government has been saying that there's a massive budget deficit in the budget, but that's not entirely accurate.
There's a lot to be done. There's a lot to be done, but it's not fully done.
We're going to go to a special conference in London on 24 July - we're going to have to talk about the budget deficit.
In fact, one of our most impo

# Question Answering

In [None]:
qa_model = pipeline("question-answering")
question = 'What is my job?'
context = 'I build machine learning models and perform statistic analysis'

qa_model(question=question, context=context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'score': 0.2939864695072174,
 'start': 2,
 'end': 62,
 'answer': 'build machine learning models and perform statistic analysis'}
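The `start` and `end` fields in the result above are character offsets into `context`, so slicing the context with them should recover the `answer` string. A quick check, using only the values printed above:

```python
# Values copied from the question-answering output above.
context = 'I build machine learning models and perform statistic analysis'
result = {'start': 2,
          'end': 62,
          'answer': 'build machine learning models and perform statistic analysis'}

# `start` is inclusive and `end` is exclusive, like a Python slice.
span = context[result['start']:result['end']]
print(span == result['answer'])
```

Here the answer span runs to the end of the context, which is why `end` equals the length of the context string.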

# Tokenization

In [None]:
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          DistilBertTokenizer,
                          DistilBertForSequenceClassification)

In [None]:
model_name2 = "nlptown/bert-base-multilingual-uncased-sentiment"
mymodel2 = AutoModelForSequenceClassification.from_pretrained(model_name2)
mytokenizer2 = AutoTokenizer.from_pretrained(model_name2)

# No model is passed here, so the pipeline falls back to its default checkpoint
classifier = pipeline("sentiment-analysis")
res = classifier('I was so not happy with the movie')
print(res)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9998102784156799}]


In [None]:
model_name2 = "nlptown/bert-base-multilingual-uncased-sentiment"
mymodel2 = AutoModelForSequenceClassification.from_pretrained(model_name2)
mytokenizer2 = AutoTokenizer.from_pretrained(model_name2)

classifier = pipeline("sentiment-analysis", model=mymodel2, tokenizer=mytokenizer2)
res = classifier('I was so happy with the movie')
print(res)

Device set to use cuda:0


[{'label': '5 stars', 'score': 0.6779806017875671}]


In [None]:
model_name2 = "nlptown/bert-base-multilingual-uncased-sentiment"
mymodel2 = AutoModelForSequenceClassification.from_pretrained(model_name2)
mytokenizer2 = AutoTokenizer.from_pretrained(model_name2)

classifier = pipeline("sentiment-analysis", model=mymodel2, tokenizer=mytokenizer2)
res = classifier('I was so not happy with the movie')
print(res)

Device set to use cuda:0


[{'label': '2 stars', 'score': 0.5216465592384338}]
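The two classifiers above use different label schemes: the default checkpoint returns `POSITIVE`/`NEGATIVE`, while the nlptown model returns a star rating such as `'2 stars'`. The sketch below shows one possible way to bring star labels onto a coarse sentiment scale. The helper names and the thresholds (1-2 negative, 3 neutral, 4-5 positive) are assumptions made for illustration, not something the model or the notebook defines.

```python
def stars_to_int(label):
    """Parse a label such as '1 star' or '5 stars' into an integer (hypothetical helper)."""
    return int(label.split()[0])

def stars_to_sentiment(label):
    """Map a star label to a coarse sentiment; the thresholds are an assumption."""
    stars = stars_to_int(label)
    if stars <= 2:
        return 'NEGATIVE'
    if stars == 3:
        return 'NEUTRAL'
    return 'POSITIVE'

# Labels copied from the two outputs above.
for res in [{'label': '5 stars', 'score': 0.6779806017875671},
            {'label': '2 stars', 'score': 0.5216465592384338}]:
    print(res['label'], '->', stars_to_sentiment(res['label']))
```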


In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# text example
text = "I was so not happy with the movie"

# tokenize the text
tokens = tokenizer.tokenize(text)

# print the tokens
print("Tokens:", tokens)

Tokens: ['i', 'was', 'so', 'not', 'happy', 'with', 'the', 'movie']


In [None]:
# convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# print the input IDs
print("Input IDs:", input_ids)

Input IDs: [1045, 2001, 2061, 2025, 3407, 2007, 1996, 3185]


In [None]:
# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)

# print the encoded input
print("Encoded Input:", encoded_input)

Encoded Input: {'input_ids': [101, 1045, 2001, 2061, 2025, 3407, 2007, 1996, 3185, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
# Decode the text
decoded_text = tokenizer.decode(input_ids)

# print the decoded text
print("Decoded Text:", decoded_text)

Decoded Text: i was so not happy with the movie
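The outputs above show two things: `convert_tokens_to_ids` is a plain vocabulary lookup, and calling the tokenizer directly also wraps the sequence in `[CLS]` (101) and `[SEP]` (102). The toy sketch below reproduces that behaviour with a vocabulary built only from the token/ID pairs printed above. It splits on whitespace, which is a simplification: the real tokenizer uses WordPiece and a far larger vocabulary.

```python
# Toy vocabulary taken from the token/ID pairs printed above.
vocab = {'[CLS]': 101, '[SEP]': 102,
         'i': 1045, 'was': 2001, 'so': 2061, 'not': 2025,
         'happy': 3407, 'with': 2007, 'the': 1996, 'movie': 3185}
id_to_token = {idx: tok for tok, idx in vocab.items()}
special_ids = {vocab['[CLS]'], vocab['[SEP]']}

def encode(text, add_special_tokens=True):
    """Lowercase, split on whitespace, and look each token up in the vocabulary."""
    ids = [vocab[tok] for tok in text.lower().split()]
    if add_special_tokens:
        ids = [vocab['[CLS]']] + ids + [vocab['[SEP]']]
    return ids

def decode(ids, skip_special_tokens=True):
    """Map IDs back to tokens, optionally dropping [CLS] and [SEP]."""
    tokens = [id_to_token[i] for i in ids
              if not (skip_special_tokens and i in special_ids)]
    return ' '.join(tokens)

text = "I was so not happy with the movie"
print(encode(text, add_special_tokens=False))
print(encode(text))
print(decode(encode(text)))
```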


- Token type IDs

These IDs distinguish between different sequences in tasks that involve multiple sentences, such as question answering and sentence-pair classification.

BERT uses this mechanism to tell which tokens belong to which segment. For single-sequence tasks like sentiment analysis, the `token_type_ids` are all zeros.
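As a rough illustration of the segment mechanism, the sketch below lays out two ID sequences in BERT's `[CLS] A [SEP] B [SEP]` format and assigns segment 0 to the first part and segment 1 to the second. `build_pair_inputs` is a hypothetical helper written for this example, and the input IDs are arbitrary values borrowed from the outputs above.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the encoded output above

def build_pair_inputs(ids_a, ids_b):
    """Lay out two ID sequences as [CLS] A [SEP] B [SEP] with matching segment IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and its [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}

pair = build_pair_inputs([2001, 2061], [3407, 3185])
print(pair['input_ids'])
print(pair['token_type_ids'])
```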

- Attention mask

The attention mask differentiates between actual tokens and padding tokens (if any). It helps the model focus on non-padding tokens and ignore padding tokens. A value of 1 indicates that the token should be attended to, while a value of 0 indicates padding.

- Why are padding tokens used?

Uniform sequence length: Deep learning models typically process input data in batches. To process these batches efficiently, all sequences in a batch must have the same length. Padding tokens ensure this by extending shorter sequences to match the length of the longest sequence in the batch.

Efficient computation: Fixed-length sequences allow more efficient use of hardware resources, as the model can process all sequences in parallel without having to handle variable-length sequences individually.
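The padding and attention-mask ideas described above can be sketched in a few lines of plain Python. This is an illustration rather than the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the padding ID of 0 is assumed to match the `[PAD]` token of `bert-base-uncased`.

```python
PAD_ID = 0  # assumption: ID of the [PAD] token, as in bert-base-uncased

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

padded = pad_batch([[101, 1045, 2001, 3407, 102],
                    [101, 3185, 102]])
print(padded['input_ids'])
print(padded['attention_mask'])
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.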

# Fine-Tuning on IMDB

## Step 1: Install Necessary Libraries

In [None]:
!pip install -U datasets huggingface_hub fsspec
!pip install transformers

Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


## Step 2: Load and prepare the dataset

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('imdb')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
dataset['train']['text'][0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

## Step 3: Preprocess the Data

Tokenize the dataset using the tokenizer associated with the pre-trained model.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

In [None]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})
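With `padding='max_length'` and `truncation=True`, every review ends up with the same number of input IDs: long reviews are cut off and short ones are padded. The sketch below mimics that on plain lists. It is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the padding ID 0 and the 512-token limit are those of `bert-base-uncased`, and the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this version does not.

```python
MAX_LENGTH = 512  # maximum sequence length of bert-base-uncased
PAD_ID = 0        # assumption: ID of the [PAD] token

def pad_or_truncate(ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Force a sequence of IDs to exactly `max_length` entries."""
    ids = ids[:max_length]          # truncation: drop anything past the limit
    n_pad = max_length - len(ids)   # padding: fill up to the limit
    return {'input_ids': ids + [pad_id] * n_pad,
            'attention_mask': [1] * len(ids) + [0] * n_pad}

short = pad_or_truncate([1, 2, 3], max_length=5)
long = pad_or_truncate(list(range(1, 8)), max_length=5)
print(short)
print(long)
```

Padding every example to the model maximum is simple but can waste computation on short reviews; padding per batch is a common alternative.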

In [None]:
tokenized_datasets['train']['text'][0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

In [None]:
tokenized_datasets['train']['label'][0]

0

## Step 4: Set Up the Training Arguments

In [None]:
from transformers import TrainingArguments

In [None]:
training_args = TrainingArguments(output_dir='./results',
                                  eval_strategy='epoch',
                                  learning_rate=2e-5,
                                  num_train_epochs=1,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16,
                                  weight_decay=0.01,
                                  )
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False

## Step 5: Initialize the Model

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer

In [None]:
# Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=tokenized_datasets['train'],
                  eval_dataset=tokenized_datasets['test'])

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 6: Train the Model

Fine-tune the pre-trained model on your specific dataset.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.199,0.177632


TrainOutput(global_step=1563, training_loss=0.2514614457894958, metrics={'train_runtime': 3097.7809, 'train_samples_per_second': 8.07, 'train_steps_per_second': 0.505, 'total_flos': 6577776384000000.0, 'train_loss': 0.2514614457894958, 'epoch': 1.0})
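The numbers in `TrainOutput` can be cross-checked against the training arguments with a little arithmetic. The sketch below assumes a single GPU (`_n_gpu=1` in the printed arguments), no gradient accumulation (the library default), and that the last partial batch is kept (`dataloader_drop_last=False`).

```python
import math

num_train_examples = 25_000          # size of the IMDB train split
per_device_train_batch_size = 16
num_train_epochs = 1
n_gpu = 1                            # from the TrainingArguments repr
train_runtime = 3097.7809            # seconds, from TrainOutput

# The last batch is smaller than 16 but still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * n_gpu))
total_steps = steps_per_epoch * num_train_epochs

samples_per_second = round(num_train_examples * num_train_epochs / train_runtime, 2)
steps_per_second = round(total_steps / train_runtime, 3)

print(total_steps)         # 1563, matching global_step
print(samples_per_second)  # 8.07, matching train_samples_per_second
print(steps_per_second)    # 0.505, matching train_steps_per_second
```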

## Step 7: Evaluate the Model

Assess the model's performance on held-out data. Here the IMDB test split, passed to the `Trainer` as `eval_dataset`, serves as the validation set.

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)
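The `Trainer` above was created without a `compute_metrics` function, so `trainer.evaluate()` reports the evaluation loss and runtime statistics but no accuracy. One possible way to add accuracy is sketched below. The function is not part of the original notebook, and it is written without NumPy so that it also runs on plain nested lists.

```python
def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair; a hypothetical helper for illustration."""
    logits, labels = eval_pred
    # Argmax over the class dimension: index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(preds, labels))
    return {'accuracy': correct / len(preds)}

# Tiny made-up example: three predictions, two of them correct.
demo = compute_metrics(([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]], [0, 1, 1]))
print(demo)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the returned keys should then appear in the evaluation results with an `eval_` prefix.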

## Step 8: Save the fine-tuned Model

- Save the fine-tuned model for later use

In [None]:
# Save the model and tokenizer so both can be reloaded with from_pretrained()
model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_model')

# ArXiv Project

In [1]:
!pip install arxiv

Collecting arxiv
  Downloading arxiv-2.2.0-py3-none-any.whl.metadata (6.3 kB)
Collecting feedparser~=6.0.10 (from arxiv)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Downloading arxiv-2.2.0-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6046 sha256=e40c879a953f4335fa60574d97f7d021730528cff16b2c11e2909c07dcfe85f1
  Stored in directory: /root/.cache/pip/wheels/3b/25/2a/105d6a15df6914f4d15047691c6c28f9052cc1173e40285d03
Successfully built sgmllib3k
Installing collected packag

In [2]:
import arxiv
import pandas as pd

In [3]:
query = 'ai OR artificial intelligence OR machine learning'
search = arxiv.Search(query=query,
                      max_results=10,
                      sort_by=arxiv.SortCriterion.SubmittedDate)

In [5]:
papers = []

# Search.results() is deprecated in arxiv 2.x; fetch results through a Client instead
client = arxiv.Client()
for result in client.results(search):
    papers.append({'published': result.published,
                   'title': result.title,
                   'abstract': result.summary,
                   'categories': result.categories})

df = pd.DataFrame(papers)
pd.set_option('display.max_colwidth', None)


In [6]:
df.head()

,published,title,abstract,categories
0,2025-05-29 17:59:56+00:00,Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought,"Recent advances in multimodal large language models (MLLMs) have demonstrated\nremarkable capabilities in vision-language tasks, yet they often struggle with\nvision-centric scenarios where precise visual focus is needed for accurate\nreasoning. In this paper, we introduce Argus to address these limitations with\na new visual attention grounding mechanism. Our approach employs object-centric\ngrounding as visual chain-of-thought signals, enabling more effective\ngoal-conditioned visual attention during multimodal reasoning tasks.\nEvaluations on diverse benchmarks demonstrate that Argus excels in both\nmultimodal reasoning tasks and referring object grounding tasks. Extensive\nanalysis further validates various design choices of Argus, and reveals the\neffectiveness of explicit language-guided visual region-of-interest engagement\nin MLLMs, highlighting the importance of advancing multimodal intelligence from\na visual-centric perspective. Project page: https://yunzeman.github.io/argus/",[cs.CV]
1,2025-05-29 17:59:55+00:00,From Chat Logs to Collective Insights: Aggregative Question Answering,"Conversational agents powered by large language models (LLMs) are rapidly\nbecoming integral to our daily interactions, generating unprecedented amounts\nof conversational data. Such datasets offer a powerful lens into societal\ninterests, trending topics, and collective concerns. Yet, existing approaches\ntypically treat these interactions as independent and miss critical insights\nthat could emerge from aggregating and reasoning across large-scale\nconversation logs. In this paper, we introduce Aggregative Question Answering,\na novel task requiring models to reason explicitly over thousands of\nuser-chatbot interactions to answer aggregative queries, such as identifying\nemerging concerns among specific demographics. To enable research in this\ndirection, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative\nquestions derived from 182,330 real-world chatbot conversations. Experiments\nshow that existing methods either struggle to reason effectively or incur\nprohibitive computational costs, underscoring the need for new approaches\ncapable of extracting collective insights from large-scale conversational data.","[cs.CL, cs.AI, cs.LG]"
2,2025-05-29 17:59:52+00:00,MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence,"Spatial intelligence is essential for multimodal large language models\n(MLLMs) operating in the complex physical world. Existing benchmarks, however,\nprobe only single-image relations and thus fail to assess the multi-image\nspatial reasoning that real-world deployments demand. We introduce MMSI-Bench,\na VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision\nresearchers spent more than 300 hours meticulously crafting 1,000 challenging,\nunambiguous multiple-choice questions from over 120,000 images, each paired\nwith carefully designed distractors and a step-by-step reasoning process. We\nconduct extensive experiments and thoroughly evaluate 34 open-source and\nproprietary MLLMs, observing a wide gap: the strongest open-source model\nattains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while\nhumans score 97%. These results underscore the challenging nature of MMSI-Bench\nand the substantial headroom for future research. Leveraging the annotated\nreasoning processes, we also provide an automated error analysis pipeline that\ndiagnoses four dominant failure modes, including (1) grounding errors, (2)\noverlap-matching and scene-reconstruction errors, (3) situation-transformation\nreasoning errors, and (4) spatial-logic errors, offering valuable insights for\nadvancing multi-image spatial intelligence. Project page:\nhttps://runsenxu.com/projects/MMSI_Bench .","[cs.CV, cs.CL]"
3,2025-05-29 17:59:51+00:00,ZeroGUI: Automating Online GUI Learning at Zero Human Cost,"The rapid advancement of large Vision-Language Models (VLMs) has propelled\nthe development of pure-vision-based GUI Agents, capable of perceiving and\noperating Graphical User Interfaces (GUI) to autonomously fulfill user\ninstructions. However, existing approaches usually adopt an offline learning\nframework, which faces two core limitations: (1) heavy reliance on high-quality\nmanual annotations for element grounding and action supervision, and (2)\nlimited adaptability to dynamic and interactive environments. To address these\nlimitations, we propose ZeroGUI, a scalable, online learning framework for\nautomating GUI Agent training at Zero human cost. Specifically, ZeroGUI\nintegrates (i) VLM-based automatic task generation to produce diverse training\ngoals from the current environment state, (ii) VLM-based automatic reward\nestimation to assess task success without hand-crafted evaluation functions,\nand (iii) two-stage online reinforcement learning to continuously interact with\nand learn from GUI environments. Experiments on two advanced GUI Agents\n(UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance\nacross OSWorld and AndroidLab environments. The code is available at\nhttps://github.com/OpenGVLab/ZeroGUI.","[cs.AI, cs.CL, cs.CV]"
4,2025-05-29 17:59:50+00:00,Differential Information: An Information-Theoretic Perspective on Preference Optimization,"Direct Preference Optimization (DPO) has become a standard technique for\naligning language models with human preferences in a supervised manner. Despite\nits empirical success, the theoretical justification behind its log-ratio\nreward parameterization remains incomplete. In this work, we address this gap\nby utilizing the Differential Information Distribution (DID): a distribution\nover token sequences that captures the information gained during policy\nupdates. First, we show that when preference labels encode the differential\ninformation required to transform a reference policy into a target policy, the\nlog-ratio reward in DPO emerges as the uniquely optimal form for learning the\ntarget policy via preference optimization. This result naturally yields a\nclosed-form expression for the optimal sampling distribution over rejected\nresponses. Second, we find that the condition for preferences to encode\ndifferential information is fundamentally linked to an implicit assumption\nregarding log-margin ordered policies-an inductive bias widely used in\npreference optimization yet previously unrecognized. Finally, by analyzing the\nentropy of the DID, we characterize how learning low-entropy differential\ninformation reinforces the policy distribution, while high-entropy differential\ninformation induces a smoothing effect, which explains the log-likelihood\ndisplacement phenomenon. We validate our theoretical findings in synthetic\nexperiments and extend them to real-world instruction-following datasets. Our\nresults suggest that learning high-entropy differential information is crucial\nfor general instruction-following, while learning low-entropy differential\ninformation benefits knowledge-intensive question answering. Overall, our work\npresents a unifying perspective on the DPO objective, the structure of\npreference data, and resulting policy behaviors through the lens of\ndifferential information.","[cs.LG, cs.AI, cs.CL]"


In [10]:
from transformers import pipeline

abstract = df['abstract'][0]

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

# Summarization
summary = summarizer(abstract)

Device set to use cuda:0


In [12]:
summary[0]['summary_text']

'Argus is a new visual attention grounding mechanism. It uses object-centricgrounding as visual chain-of-thought signals. Argus excels in both multimodal reasoning tasks and referring object grounding tasks. It highlights the importance of advancing multimodal intelligence from a visual-centric perspective.'
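The abstracts returned by arXiv contain hard line breaks (`\n`), as the DataFrame output above shows. One optional preprocessing step, not part of the original notebook, is to collapse all whitespace into single spaces before passing the text to the summarizer. Whether this changes the generated summary has not been verified here; `normalize_whitespace` is a hypothetical helper.

```python
def normalize_whitespace(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return ' '.join(text.split())

# Fragment copied from the first abstract above, including its line break.
raw = "Our approach employs object-centric\ngrounding as visual chain-of-thought signals"
clean = normalize_whitespace(raw)
print(clean)
```

In the notebook this would be applied as `summarizer(normalize_whitespace(abstract))`.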