In [None]:
!nvidia-smi

Thu Aug 28 13:47:46 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!pip install transformers



Hugging Face Transformers is a Python library for loading, training, and using pretrained models (mainly NLP, now also vision and audio) with very little code.

## Hugging Face Tasks

In [None]:
from transformers import pipeline

#NLP Tasks

classifier = pipeline("text-classification") #Assigning a category to a piece of text(like positive/negative, spam/ham, topic labels).

token_classifier = pipeline("token-classification") #Classifies individual tokens (words/subwords) in a sentence. Ex: "John lives in Paris" → John → PERSON, Paris → LOCATION.

question_answerer = pipeline("question-answering") #Finds the answer to a question from a given context/passage. Ex: "Where is John?" → Paris.

text_generator = pipeline("text-generation") #Generates new text based on a prompt(like GPT). Ex: "Once upon a time," → "there was a young boy who loved to code!".

summarizer = pipeline("summarization") #Produces a short summary of a long passage. Ex: a multi-paragraph news article → a one- or two-sentence summary.

translator = pipeline("translation",
                      model="Helsinki-NLP/opus-mt-en-fr") #Translates text from one language to another. Ex: English → French (this checkpoint translates en → fr).

text2text_generator = pipeline("text2text-generation") #Converts one text into another, depending on the model. Ex: with a T5 model, input "translate English to German: Hello world" → output "Hallo Welt".

fill_mask = pipeline("fill-mask") #Predicts missing words in a sentence (masked language modeling). Ex: "I love [MASK]" → "I love pizza" / "I love coding".

feature_extractor = pipeline("feature-extraction") #Converts text into numerical embeddings (vectors) for use in ML models. Ex: "I love AI" → [0.12, -0.33, 0.85, ...].

#Computer Vision Tasks

image_classifier = pipeline("image-classification") #Classifying the main content of an image. Ex: Dog photo - "Dog".

object_detector = pipeline("object-detection") #Identifying objects within an image and their bounding boxes. Ex: Street photo → "Car", "Person".

image_segmenter = pipeline("image-segmentation") #Splits an image into regions (pixel-level labeling). Ex: Cat on sofa → "Cat mask", "Sofa mask".

#Speech Processing Tasks

speech_recognizer = pipeline("automatic-speech-recognition") #Converts speech → text.

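#Speech translation converts speech in one language → text in another language. The "translation" pipeline accepts text only,
#so this is usually done through the speech recognition pipeline, e.g. Whisper asked to translate into English (the model choice here is an assumption).
speech_translator = pipeline("automatic-speech-recognition", model="openai/whisper-small", generate_kwargs={"task": "translate"})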

audio_classifier = pipeline("audio-classification") #Labels sounds/audio clips. Ex: Dog barking → "Dog sound".

#Multimodal Tasks

image_captioner = pipeline("image-to-text") #Generates a text caption for an image. Ex: Dog photo → "A brown dog".

vqa = pipeline("visual-question-answering") #Answers questions about an image. Ex: Cat photo + "What animal is this?" → "Cat".

#Other Tasks

table_qa = pipeline("table-question-answering") #Answers questions from structured tabular data (like CSVs).

doc_qa = pipeline("document-question-answering") #Extracts answers from documents (PDFs, scanned files).

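#Time Series Forecasting: predicts future values from historical time series data. There is no pipeline() task for it; Transformers includes time series models (e.g. TimeSeriesTransformerForPrediction) that are used directly, and separate libraries such as pytorch-forecasting also cover it.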


In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I was so not happy with the last Mission Impossible Movie")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9997795224189758}]
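
The `score` above is the model's probability for the predicted label. For a single-label classifier like this one, that probability typically comes from applying a softmax to the model's raw logits. Here is a minimal sketch of that step in plain Python; the logit values are hypothetical and not taken from the run above.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two classes (illustrative values only).
labels = ["NEGATIVE", "POSITIVE"]
logits = [4.2, -3.9]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most probable class
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

A score close to 1.0 therefore means the two logits were far apart, not that the prediction is guaranteed to be correct.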


In [None]:
pipeline(task = "sentiment-analysis")("I was confused what type of clothes to wear in function")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9990531802177429}]

In [None]:
pipeline(task = "sentiment-analysis")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we can actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9964345693588257}]

In [None]:
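# Note: facebook/bart-large-mnli is a natural language inference (NLI) model, so its labels
# (contradiction / neutral / entailment) are not sentiment classes.
pipeline(task = "sentiment-analysis", model="facebook/bart-large-mnli")\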
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")

Device set to use cuda:0


[{'label': 'neutral', 'score': 0.7693331837654114}]

## Batch Sentiment Analysis:

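We analyze multiple texts in one call by passing a list to the pipeline. The result is a list with one prediction per input, in the same order. The default model returns only POSITIVE or NEGATIVE; other models define their own label sets.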

In [None]:
classifier = pipeline(task = "sentiment-analysis")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we can actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",\
            "I hate long Meetings."]

classifier(task_list)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9978686571121216},
 {'label': 'NEGATIVE', 'score': 0.9995476603507996},
 {'label': 'NEGATIVE', 'score': 0.9983084201812744},
 {'label': 'NEGATIVE', 'score': 0.9969881176948547}]
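
Because the results come back in input order, they can be zipped with the original texts. The sketch below copies the output shown above and keeps only the texts that meet a chosen label and confidence threshold; the helper name `confident_texts` and the threshold value are illustrative assumptions.

```python
texts = [
    "I really like Autoencoders, best models for Anomaly Detection",
    "I am not sure if we can actually Evaluate LLMs.",
    "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",
    "I hate long Meetings.",
]

# Predictions copied from the cell output above.
results = [
    {"label": "POSITIVE", "score": 0.9978686571121216},
    {"label": "NEGATIVE", "score": 0.9995476603507996},
    {"label": "NEGATIVE", "score": 0.9983084201812744},
    {"label": "NEGATIVE", "score": 0.9969881176948547},
]

def confident_texts(texts, results, label, threshold):
    """Return the texts predicted as `label` with score >= threshold (hypothetical helper)."""
    return [t for t, r in zip(texts, results)
            if r["label"] == label and r["score"] >= threshold]

negatives = confident_texts(texts, results, label="NEGATIVE", threshold=0.998)
for text in negatives:
    print(text)
```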

In [None]:
classifier = pipeline(task = "sentiment-analysis", model = "SamLowe/roberta-base-go_emotions")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we CAN actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know. It is pretty funny name for a Regression Model.",\
            "I hate long Meetings."]

classifier(task_list)

Device set to use cuda:0


[{'label': 'admiration', 'score': 0.7406536340713501},
 {'label': 'confusion', 'score': 0.9066852331161499},
 {'label': 'amusement', 'score': 0.9083253145217896},
 {'label': 'anger', 'score': 0.7870621085166931}]

## Text Generation:
Generates new text based on a prompt (like GPT). Ex: "Once upon a time," → "there was a young boy who loved to code!".

In [None]:
from transformers import pipeline

text_generator = pipeline("text-generation", model="distilbert/distilgpt2")
generated_text = text_generator("Today is a rainy day in Bangalore",
                                truncation=True,          #Ensures that if the input text exceeds the model’s maximum token limit, it is automatically trimmed so the model can process it without errors.
                                num_return_sequences = 2) #Instead of just one continuation, the model generates 2 different possible outputs for the same prompt.

print("Generated_text:\n ", generated_text[0]['generated_text'])  # Prints only the first of the two returned sequences

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated_text:
  Today is a rainy day in Bangalore.


The roads are crowded with people on the streets, and they are crowded with people on the streets.
The streets are crowded with people on the streets, and they are crowded with people on the streets.
A lot of people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the buildings, and people are sleeping next to the

## Question Answering:
Finds the answer to a question from a given context/passage. Ex: "Where is John?" → Paris.

In [None]:
from transformers import pipeline

qa_model = pipeline("question-answering")
question = "What is my job?"
context = "I am developing AI models with Python."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'score': 0.7823827266693115,
 'start': 5,
 'end': 25,
 'answer': 'developing AI models'}
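
The `start` and `end` values are character offsets into the context string, so the answer can be recovered by slicing. A small check using the output above:

```python
context = "I am developing AI models with Python."

# Result copied from the cell output above.
result = {
    "score": 0.7823827266693115,
    "start": 5,
    "end": 25,
    "answer": "developing AI models",
}

# Slice the context with the returned character offsets (end is exclusive).
span = context[result["start"]:result["end"]]
print(span)
```

This is why extractive question answering can only return text that literally appears in the context.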

## Tokenization:
Tokenization is the process of splitting text into smaller units called tokens (like words, subwords, or characters) so that a model can understand and process it. Tokens are the basic input units for NLP models. Each token is usually converted into a numeric ID that the model can work with.
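
To make the idea of subword tokens concrete, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece, which BERT-family tokenizers use. The vocabulary is a tiny hypothetical one made up for illustration; real tokenizers also handle casing, punctuation, and normalization, which this sketch ignores.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk]  # nothing matched: the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping tokens to numeric IDs.
toy_vocab = {"[UNK]": 0, "token": 1, "##ization": 2, "un": 3, "##happy": 4, "happy": 5}

tokens = [p for w in "unhappy tokenization".split() for p in wordpiece_tokenize(w, toy_vocab)]
ids = [toy_vocab[t] for t in tokens]
print(tokens)
print(ids)
```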

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

In [None]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
mymodel = AutoModelForSequenceClassification.from_pretrained(model_name)
mytokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model = mymodel , tokenizer = mytokenizer)
res = classifier("I was so not happy with the Barbie Movie")
print(res)

Device set to use cuda:0


[{'label': '2 stars', 'score': 0.5099301934242249}]


In [None]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

Tokens: ['I', 'was', 'so', 'not', 'happy', 'with', 'the', 'Barbie', 'Movie']
Input IDs: [146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275]
Encoded Input: {'input_ids': [101, 146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode Output:  I was so not happy with the Barbie Movie


### token_type_ids
These IDs are used to distinguish between different sequences in tasks that involve multiple sentences, such as question-answering and sentence-pair classification. BERT uses this mechanism to understand which tokens belong to which segment. For single-sequence tasks like sentiment analysis, token_type_ids are all zeros.
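
As a rough sketch of how a sentence pair is laid out for BERT, the function below joins two ID lists as `[CLS] A [SEP] B [SEP]` and marks the first segment with 0 and the second with 1. The IDs 101 and 102 are the `[CLS]` and `[SEP]` values visible in the encoded output above; the function name and the short ID lists are hypothetical.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the encoded output above

def encode_pair(ids_a, ids_b):
    """Build input_ids and token_type_ids for a two-segment input (illustrative helper)."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A, and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}

question_ids = [1, 2, 3]  # hypothetical token IDs for a question
context_ids = [4, 5]      # hypothetical token IDs for a context passage

pair = encode_pair(question_ids, context_ids)
print(pair)
```

In practice the tokenizer produces this layout itself when it is called with two texts, e.g. `tokenizer(question, context)`.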

### attention_mask
The attention mask is used to differentiate between actual tokens and padding tokens (if any). It helps the model focus on non-padding tokens and ignore padding tokens. A value of 1 indicates that the token should be attended to, while a value of 0 indicates padding.

### Why Padding Tokens Are Used
- Uniform Sequence Length: Deep learning models typically process input data in batches. To efficiently process these batches, all sequences in a batch must have the same length. Padding tokens ensure this by extending shorter sequences to match the length of the longest sequence in the batch.

- Efficient Computation: Fixed-length sequences allow for more efficient use of hardware resources, as the model can process all sequences in parallel without needing to handle variable-length sequences individually.
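
The padding and attention-mask ideas above can be sketched together in a few lines. This example pads each sequence to the longest one in the batch and builds the matching mask. It assumes a padding ID of 0, which is the `[PAD]` ID in the BERT vocabularies used here; the helper name `pad_batch` is made up for illustration.

```python
PAD_ID = 0  # assumed [PAD] token ID (true for BERT vocabularies; other models differ)

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest length in the batch and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two sequences of different lengths (IDs borrowed from the tokenizer output above).
batch = [
    [101, 146, 1108, 102],
    [101, 146, 1108, 1177, 1136, 2816, 102],
]

padded = pad_batch(batch)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Padding to the longest sequence in each batch wastes fewer positions than padding everything to the model's maximum length.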



## Fine-Tuning on IMDB:

Fine-tuning is the process of taking a pretrained model and training it further on a specific task or dataset to make it perform better for that task. The model already knows general patterns (from pretraining). Fine-tuning adjusts the weights slightly to specialize it for your problem.

## Step 1: Install Necessary Libraries

In [None]:
!pip install datasets



## Step 2: Load and Prepare the Dataset

In [None]:
from datasets import load_dataset
dataset = load_dataset('imdb')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## Step 3: Preprocess the Data
Tokenize the dataset using the tokenizer associated with the pre-trained model.

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [None]:
tokenized_datasets["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## Step 4: Set Up the Training Arguments
Specify the hyperparameters and training settings.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    eval_strategy="epoch",           # Evaluate at the end of every epoch
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    num_train_epochs=1,              # Number of training epochs
    weight_decay=0.01,               # Strength of weight decay
    report_to=['tensorboard']        # Log to TensorBoard only (no Weights & Biases)
)
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False

## Step 5: Initialize the Model
Load the pre-trained model and define the training procedure.

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 6: Train the Model
Fine-tune the pre-trained model on your specific dataset.

In [None]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.2015,0.178599


TrainOutput(global_step=1563, training_loss=0.2511283590178877, metrics={'train_runtime': 3290.5503, 'train_samples_per_second': 7.598, 'train_steps_per_second': 0.475, 'total_flos': 6577776384000000.0, 'train_loss': 0.2511283590178877, 'epoch': 1.0})
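
The numbers in `TrainOutput` can be sanity-checked with simple arithmetic. The sketch below assumes a single GPU, no gradient accumulation, and that the last partial batch is kept (`dataloader_drop_last=False`, as in the arguments printed earlier).

```python
import math

train_examples = 25_000      # size of the IMDB train split
batch_size = 16              # per_device_train_batch_size
epochs = 1                   # num_train_epochs
train_runtime = 3290.5503    # seconds, from the TrainOutput above

# The final partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

These match `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.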

## Step 7: Evaluate the Model
Assess the model's performance on the held-out split passed as `eval_dataset` (here, the IMDB test split).

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)

{'eval_loss': 0.17859867215156555, 'eval_runtime': 793.7004, 'eval_samples_per_second': 31.498, 'eval_steps_per_second': 1.969, 'epoch': 1.0}
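
The output above reports only the loss because no metric function was given to the `Trainer`. One common way to get accuracy is to pass a function through the `compute_metrics` argument. The sketch below shows such a function in plain Python, tried on made-up logits and labels; it is an illustration under those assumptions, not part of the run above.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels), the pair Trainer hands to compute_metrics."""
    logits, labels = eval_pred
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=lambda i: row[i])  # argmax over class logits
        correct += int(predicted == label)
    return {"accuracy": correct / len(labels)}

# Hypothetical logits and labels for four reviews (illustrative values only).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.4]]
toy_labels = [0, 1, 1, 1]

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

With `Trainer(..., compute_metrics=compute_metrics)`, the returned key would be expected to appear in the evaluation results with an `eval_` prefix.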


## Step 8: Save the Fine-Tuned Model
Save the fine-tuned model for later use.

In [None]:
# Save the model
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-tokenizer')

('./fine-tuned-tokenizer/tokenizer_config.json',
 './fine-tuned-tokenizer/special_tokens_map.json',
 './fine-tuned-tokenizer/vocab.txt',
 './fine-tuned-tokenizer/added_tokens.json',
 './fine-tuned-tokenizer/tokenizer.json')

## ArXiv Project:

`arxiv` is a lightweight Python wrapper around the arXiv API. It lets us search, retrieve, and download research papers (titles, abstracts, authors, metadata, and PDFs) from the arXiv repository directly from Python.
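
arXiv search queries accept field prefixes such as `all:`, `ti:`, `abs:`, and `cat:`, combined with boolean operators, and multi-word phrases are matched as a unit only when wrapped in double quotes. The helper below is a hypothetical sketch of building such a query string. It is not part of the `arxiv` package, and a query built this way may return different papers from an unquoted one.

```python
def build_query(terms, field="all"):
    """Join search terms with OR, quoting multi-word phrases (hypothetical helper)."""
    parts = []
    for term in terms:
        term = term.strip()
        if " " in term:
            term = f'"{term}"'  # quote phrases so they are matched as a unit
        parts.append(f"{field}:{term}")
    return " OR ".join(parts)

query = build_query(["ai", "artificial intelligence", "machine learning"])
print(query)
```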

In [None]:
!pip install arxiv



In [None]:
import arxiv
import pandas as pd

In [None]:
# Query to fetch AI-related papers
query = 'ai OR artificial intelligence OR machine learning'
search = arxiv.Search(query=query, max_results=10, sort_by=arxiv.SortCriterion.SubmittedDate)

# Fetch papers (Search.results() is deprecated in recent arxiv releases; use a Client instead)
client = arxiv.Client()
papers = []
for result in client.results(search):
    papers.append({
        'published': result.published,
        'title': result.title,
        'abstract': result.summary,
        'categories': result.categories
    })

# Convert to DataFrame
df = pd.DataFrame(papers)

pd.set_option('display.max_colwidth', None)
df.head(10)


Unnamed: 0,published,title,abstract,categories
0,2025-08-28 17:59:55+00:00,Dress&Dance: Dress up and Dance as You Like It - Technical Preview,"We present Dress&Dance, a video diffusion framework that generates high\nquality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a\nuser wearing desired garments while moving in accordance with a given reference\nvideo. Our approach requires a single user image and supports a range of tops,\nbottoms, and one-piece garments, as well as simultaneous tops and bottoms\ntry-on in a single pass. Key to our framework is CondNet, a novel conditioning\nnetwork that leverages attention to unify multi-modal inputs (text, images, and\nvideos), thereby enhancing garment registration and motion fidelity. CondNet is\ntrained on heterogeneous training data, combining limited video data and a\nlarger, more readily available image dataset, in a multistage progressive\nmanner. Dress&Dance outperforms existing open source and commercial solutions\nand enables a high quality and flexible try-on experience.","[cs.CV, cs.LG]"
1,2025-08-28 17:59:46+00:00,OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning,"In this paper, we introduce OneReward, a unified reinforcement learning\nframework that enhances the model's generative capabilities across multiple\ntasks under different evaluation criteria using only \textit{One Reward} model.\nBy employing a single vision-language model (VLM) as the generative reward\nmodel, which can distinguish the winner and loser for a given task and a given\nevaluation criterion, it can be effectively applied to multi-task generation\nmodels, particularly in contexts with varied data and diverse task objectives.\nWe utilize OneReward for mask-guided image generation, which can be further\ndivided into several sub-tasks such as image fill, image extend, object\nremoval, and text rendering, involving a binary mask as the edit area. Although\nthese domain-specific tasks share same conditioning paradigm, they differ\nsignificantly in underlying data distributions and evaluation metrics. Existing\nmethods often rely on task-specific supervised fine-tuning (SFT), which limits\ngeneralization and training efficiency. Building on OneReward, we develop\nSeedream 3.0 Fill, a mask-guided generation model trained via multi-task\nreinforcement learning directly on a pre-trained base model, eliminating the\nneed for task-specific SFT. Experimental results demonstrate that our unified\nedit model consistently outperforms both commercial and open-source\ncompetitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across\nmultiple evaluation dimensions. Code and model are available at:\nhttps://one-reward.github.io",[cs.CV]
2,2025-08-28 17:59:34+00:00,Learning on the Fly: Rapid Policy Adaptation via Differentiable Simulation,"Learning control policies in simulation enables rapid, safe, and\ncost-effective development of advanced robotic capabilities. However,\ntransferring these policies to the real world remains difficult due to the\nsim-to-real gap, where unmodeled dynamics and environmental disturbances can\ndegrade policy performance. Existing approaches, such as domain randomization\nand Real2Sim2Real pipelines, can improve policy robustness, but either struggle\nunder out-of-distribution conditions or require costly offline retraining. In\nthis work, we approach these problems from a different perspective. Instead of\nrelying on diverse training conditions before deployment, we focus on rapidly\nadapting the learned policy in the real world in an online fashion. To achieve\nthis, we propose a novel online adaptive learning framework that unifies\nresidual dynamics learning with real-time policy adaptation inside a\ndifferentiable simulation. Starting from a simple dynamics model, our framework\nrefines the model continuously with real-world data to capture unmodeled\neffects and disturbances such as payload changes and wind. The refined dynamics\nmodel is embedded in a differentiable simulation framework, enabling gradient\nbackpropagation through the dynamics and thus rapid, sample-efficient policy\nupdates beyond the reach of classical RL methods like PPO. All components of\nour system are designed for rapid adaptation, enabling the policy to adjust to\nunseen disturbances within 5 seconds of training. We validate the approach on\nagile quadrotor control under various disturbances in both simulation and the\nreal world. Our framework reduces hovering error by up to 81% compared to\nL1-MPC and 55% compared to DATT, while also demonstrating robustness in\nvision-based control without explicit state estimation.",[cs.RO]
3,2025-08-28 17:59:05+00:00,Prompt-to-Product: Generative Assembly via Bimanual Manipulation,"Creating assembly products demands significant manual effort and expert\nknowledge in 1) designing the assembly and 2) constructing the product. This\npaper introduces Prompt-to-Product, an automated pipeline that generates\nreal-world assembly products from natural language prompts. Specifically, we\nleverage LEGO bricks as the assembly platform and automate the process of\ncreating brick assembly structures. Given the user design requirements,\nPrompt-to-Product generates physically buildable brick designs, and then\nleverages a bimanual robotic system to construct the real assembly products,\nbringing user imaginations into the real world. We conduct a comprehensive user\nstudy, and the results demonstrate that Prompt-to-Product significantly lowers\nthe barrier and reduces manual effort in creating assembly products from\nimaginative ideas.","[cs.RO, cs.AI]"
4,2025-08-28 17:58:29+00:00,OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models,"As multi-turn dialogues with large language models (LLMs) grow longer and\nmore complex, how can users better evaluate and review progress on their\nconversational goals? We present OnGoal, an LLM chat interface that helps users\nbetter manage goal progress. OnGoal provides real-time feedback on goal\nalignment through LLM-assisted evaluation, explanations for evaluation results\nwith examples, and overviews of goal progression over time, enabling users to\nnavigate complex dialogues more effectively. Through a study with 20\nparticipants on a writing task, we evaluate OnGoal against a baseline chat\ninterface without goal tracking. Using OnGoal, participants spent less time and\neffort to achieve their goals while exploring new prompting strategies to\novercome miscommunication, suggesting tracking and visualizing goals can\nenhance engagement and resilience in LLM dialogues. Our findings inspired\ndesign implications for future LLM chat interfaces that improve goal\ncommunication, reduce cognitive load, enhance interactivity, and enable\nfeedback to improve LLM performance.","[cs.HC, cs.AI, cs.LG]"
5,2025-08-28 17:57:55+00:00,Mixture of Contexts for Long Video Generation,"Long video generation is fundamentally a long context memory problem: models\nmust retain and retrieve salient events across a long range without collapsing\nor drifting. However, scaling diffusion transformers to generate long-context\nvideos is fundamentally limited by the quadratic cost of self-attention, which\nmakes memory and computation intractable and difficult to optimize for long\nsequences. We recast long-context video generation as an internal information\nretrieval task and propose a simple, learnable sparse attention routing module,\nMixture of Contexts (MoC), as an effective long-term memory retrieval engine.\nIn MoC, each query dynamically selects a few informative chunks plus mandatory\nanchors (caption, local windows) to attend to, with causal routing that\nprevents loop closures. As we scale the data and gradually sparsify the\nrouting, the model allocates compute to salient history, preserving identities,\nactions, and scenes over minutes of content. Efficiency follows as a byproduct\nof retrieval (near-linear scaling), which enables practical training and\nsynthesis, and the emergence of memory and consistency at the scale of minutes.","[cs.GR, cs.AI, cs.CV]"
6,2025-08-28 17:55:25+00:00,Activity propagation with Hebbian learning,"We investigate the impact of Hebbian learning on the contact process, a\nparadigmatic model for infection spreading, which has been also proposed as a\nsimple model to capture the dynamics of inter-regional brain activity\npropagation as well as population spreading. Each of these contexts calls for\nan extension of the contact process with local learning. We introduce Hebbian\nlearning as a positive or negative reinforcement of the activation rate between\na pair of sites after each successful activation event. Learning can happen\neither in both directions motivated by social distancing (mutual learning\nmodel), or in only one of the directions motivated by brain and population\ndynamics (source or target learning models). Hebbian learning leads to a rich\nclass of emergent behavior, where local incentives can lead to the opposite\nglobal effects. In general, positive reinforcement (increasing activation\nrates) leads to a loss of the active phase, while negative reinforcement\n(reducing activation rates) can turn the inactive phase into a globally active\nphase. In two dimensions and above, the effect of negative reinforcement is\ntwofold: it promotes the spreading of activity, but at the same time gives rise\nto the appearance of effectively immune regions, entailing the emergence of two\ndistinct critical points. Positive reinforcement can lead to Griffiths effects\nwith non-universal power-law scaling, through the formation of random loops of\nactivity, a manifestation of the ``ant mill"" phenomenon.","[cond-mat.stat-mech, cond-mat.dis-nn, q-bio.PE]"
7,2025-08-28 17:55:14+00:00,FakeParts: a New Family of AI-Generated DeepFakes,"We introduce FakeParts, a new class of deepfakes characterized by subtle,\nlocalized manipulations to specific spatial regions or temporal segments of\notherwise authentic videos. Unlike fully synthetic content, these partial\nmanipulations, ranging from altered facial expressions to object substitutions\nand background modifications, blend seamlessly with real elements, making them\nparticularly deceptive and difficult to detect. To address the critical gap in\ndetection capabilities, we present FakePartsBench, the first large-scale\nbenchmark dataset specifically designed to capture the full spectrum of partial\ndeepfakes. Comprising over 25K videos with pixel-level and frame-level\nmanipulation annotations, our dataset enables comprehensive evaluation of\ndetection methods. Our user studies demonstrate that FakeParts reduces human\ndetection accuracy by over 30% compared to traditional deepfakes, with similar\nperformance degradation observed in state-of-the-art detection models. This\nwork identifies an urgent vulnerability in current deepfake detection\napproaches and provides the necessary resources to develop more robust methods\nfor partial video manipulations.","[cs.CV, cs.AI, cs.MM]"
8,2025-08-28 17:55:07+00:00,Enabling Equitable Access to Trustworthy Financial Reasoning,"According to the United States Internal Revenue Service, ''the average\nAmerican spends $\$270$ and 13 hours filing their taxes''. Even beyond the\nU.S., tax filing requires complex reasoning, combining application of\noverlapping rules with numerical calculations. Because errors can incur costly\npenalties, any automated system must deliver high accuracy and auditability,\nmaking modern large language models (LLMs) poorly suited for this task. We\npropose an approach that integrates LLMs with a symbolic solver to calculate\ntax obligations. We evaluate variants of this system on the challenging\nStAtutory Reasoning Assessment (SARA) dataset, and include a novel method for\nestimating the cost of deploying such a system based on real-world penalties\nfor tax errors. We further show how combining up-front translation of\nplain-text rules into formal logic programs, combined with intelligently\nretrieved exemplars for formal case representations, can dramatically improve\nperformance on this task and reduce costs to well below real-world averages.\nOur results demonstrate the promise and economic feasibility of neuro-symbolic\narchitectures for increasing equitable access to reliable tax assistance.","[cs.CL, cs.AI, cs.CY]"
9,2025-08-28 17:53:05+00:00,Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning,"Deepfake detection remains a formidable challenge due to the complex and\nevolving nature of fake content in real-world scenarios. However, existing\nacademic benchmarks suffer from severe discrepancies from industrial practice,\ntypically featuring homogeneous training sources and low-quality testing\nimages, which hinder the practical deployments of current detectors. To\nmitigate this gap, we introduce HydraFake, a dataset that simulates real-world\nchallenges with hierarchical generalization testing. Specifically, HydraFake\ninvolves diversified deepfake techniques and in-the-wild forgeries, along with\nrigorous training and evaluation protocol, covering unseen model architectures,\nemerging forgery techniques and novel data domains. Building on this resource,\nwe propose Veritas, a multi-modal large language model (MLLM) based deepfake\ndetector. Different from vanilla chain-of-thought (CoT), we introduce\npattern-aware reasoning that involves critical reasoning patterns such as\n""planning"" and ""self-reflection"" to emulate human forensic process. We further\npropose a two-stage training pipeline to seamlessly internalize such deepfake\nreasoning capacities into current MLLMs. Experiments on HydraFake dataset\nreveal that although previous detectors show great generalization on\ncross-model scenarios, they fall short on unseen forgeries and data domains.\nOur Veritas achieves significant gains across different OOD scenarios, and is\ncapable of delivering transparent and faithful detection outputs.","[cs.CV, cs.AI]"


In [None]:
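from transformers import pipeline

# Example abstract from the API results: first row of df
abstract = df['abstract'].iloc[0]

# facebook/bart-large-cnn is a BART model fine-tuned on CNN/DailyMail for summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarization: the pipeline returns a list with one dict per input text
summarization_result = summarizer(abstract)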

Device set to use cuda:0


In [None]:
summarization_result[0]['summary_text']

'Dress&Dance is a video diffusion framework that generates 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution. Key to our framework is CondNet, a novel conditioningnetwork that leverages attention to unify multi-modal inputs.'
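
The summary above contains the fused word "conditioningnetwork". One plausible cause (an assumption, not verified here) is that the abstracts carry hard line breaks in the middle of sentences, as the rows printed earlier show. A minimal sketch of normalizing whitespace before summarizing follows; `clean_abstract` is a hypothetical helper introduced for illustration and is not part of transformers.

```python
def clean_abstract(text: str) -> str:
    """Collapse line breaks, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces with one space normalizes the whole string.
    return " ".join(text.split())


# Fragment taken from one of the abstracts shown earlier, with its hard line break
sample = (
    "Long video generation is fundamentally a long context memory problem: models\n"
    "must retain and retrieve salient events across a long range"
)

cleaned = clean_abstract(sample)
print(cleaned)
```

The helper could then be applied as `summarizer(clean_abstract(abstract))`. Whether this removes the fused word would need to be confirmed by rerunning the summarization cell.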