# Installation

In [1]:
!nvidia-smi

Fri Nov 22 04:55:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install transformers



# Hugging Face Tasks

In [None]:
from transformers import pipeline
#---------------------------------------------------#
#                     NLP TASKS                     #
#---------------------------------------------------#

'''
1. Text Classification: Assigning a category to a piece of text.
Sentiment Analysis
Topic Classification
Spam Detection '''

classifier = pipeline("text-classification")

'''
2. Token Classification: Assigning labels to individual tokens in a sequence.
Named Entity Recognition (NER)
Part-of-Speech Tagging
'''

token_classifier = pipeline("token-classification")

'''
3. Question Answering: Extracting an answer from a given context based on a question.
'''
question_answerer = pipeline("question-answering")

'''
4. Text Generation: Generating text based on a given prompt.
Language Modeling
Story Generation

'''

text_generator = pipeline("text-generation")

'''
5. Summarization: Condensing long documents into shorter summaries.
'''

summarizer = pipeline("summarization")

'''
6. Translation: Translating text from one language to another.
'''

translator = pipeline("translation",
                      model="Helsinki-NLP/opus-mt-en-fr")

'''
7. Text2Text Generation: General-purpose text transformation, including summarization and translation.
'''

text2text_generator = pipeline("text2text-generation")

'''
8. Fill-Mask: Predicting the masked token in a sequence.
'''

fill_mask = pipeline("fill-mask")

'''
9. Feature Extraction: Extracting hidden states or features from text.
'''

feature_extractor = pipeline("feature-extraction")

'''
10. Sentence Similarity: Measuring the similarity between two sentences.
"sentence-similarity" is a Hub task tag, not a pipeline() task, so there is no
pipeline call here; sentence embeddings from "feature-extraction" can be compared instead.
'''

#---------------------------------------------------#
#             Computer Vision TASKS                 #
#---------------------------------------------------#

'''
1. Image Classification: Classifying the main content of an image.

'''

image_classifier = pipeline("image-classification")

'''
2. Object Detection: Identifying objects within an image and their bounding boxes.
'''

object_detector = pipeline("object-detection")

'''
3. Image Segmentation: Segmenting different parts of an image into classes.
'''

image_segmenter = pipeline("image-segmentation")

'''
4. Image Generation: Generating images from textual descriptions (using DALL-E or similar models).
'''

#---------------------------------------------------#
#             Speech Processing TASKS               #
#---------------------------------------------------#

'''
1. Automatic Speech Recognition (ASR): Converting spoken language into text.
'''

speech_recognizer = pipeline("automatic-speech-recognition")

'''
2. Speech Translation: Translating spoken language from one language to another.
3. Audio Classification: Classifying audio signals into predefined categories.
'''

#---------------------------------------------------#
#                   Multimodal TASKS                #
#---------------------------------------------------#

'''
1. Image Captioning: Generating a textual description of an image.
'''
image_captioner = pipeline("image-to-text")
'''
2. Visual Question Answering (VQA): Answering questions about the content of an image.
'''

#---------------------------------------------------#
#                     Other TASKS                   #
#---------------------------------------------------#
'''
1. Table Question Answering: Answering questions based on tabular data.
'''
table_qa = pipeline("table-question-answering")

'''
2. Document Question Answering: Extracting answers from documents like PDFs.

'''
doc_qa = pipeline("document-question-answering")
'''
3. Time Series Forecasting: Predicting future values in time series data (not directly supported in the main Transformers library but available through extensions).
'''

# NLP Tasks

## Sentiment Analysis

In [6]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I was so not happy with the last Mission Impossible Movie")
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'NEGATIVE', 'score': 0.9997795224189758}]
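
The `score` in this output is a probability. For a single-label classifier such as this one, the pipeline turns the model's raw outputs (logits) into probabilities with a softmax and reports the largest one. A minimal sketch of that step, with made-up logit values chosen only for illustration:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two SST-2 classes (not taken from the model)
labels = ["NEGATIVE", "POSITIVE"]
logits = [4.2, -4.1]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

A score close to 1.0 therefore means the gap between the two logits is large, not that the prediction is guaranteed to be right.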


In [4]:
pipeline(task = "sentiment-analysis")("I was confused with the Barbie Movie")



[{'label': 'NEGATIVE', 'score': 0.9992005228996277}]

In [5]:
pipeline(task = "sentiment-analysis")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")



[{'label': 'POSITIVE', 'score': 0.9964345693588257}]

In [6]:
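# Note: facebook/bart-large-mnli is a natural language inference (NLI) model, so its labels are
# contradiction / neutral / entailment; the 'neutral' below is an NLI label, not a sentiment.
pipeline(task = "sentiment-analysis", model="facebook/bart-large-mnli")\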
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")




[{'label': 'neutral', 'score': 0.7693336606025696}]

# **There are 2 Types of Models**

**LLM (Large Language Model):** A neural network with a very large number of parameters, trained on vast amounts of text to generate human-like language. A single LLM can typically handle many tasks.

**LM (Language Model):** A general term for models that predict or generate text, ranging from simple statistical models to large-scale systems such as LLMs. Smaller LMs are typically trained or fine-tuned for a single task.

### Batch Sentiment Analysis

In [7]:
classifier = pipeline(task = "sentiment-analysis")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we CAN actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",\
            "I hate long Meetings."]
classifier(task_list)



[{'label': 'POSITIVE', 'score': 0.9978686571121216},
 {'label': 'NEGATIVE', 'score': 0.9995476603507996},
 {'label': 'NEGATIVE', 'score': 0.9983083009719849},
 {'label': 'NEGATIVE', 'score': 0.9969879984855652}]
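
The pipeline returns one result per input, in input order, so the two lists can be zipped together. A small sketch using the inputs and outputs recorded above (the `paired` and `negatives` names are illustrative, not part of the library):

```python
# Inputs and predictions copied from the cell above
task_list = [
    "I really like Autoencoders, best models for Anomaly Detection",
    "I am not sure if we CAN actually Evaluate LLMs.",
    "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",
    "I hate long Meetings.",
]
results = [
    {"label": "POSITIVE", "score": 0.9978686571121216},
    {"label": "NEGATIVE", "score": 0.9995476603507996},
    {"label": "NEGATIVE", "score": 0.9983083009719849},
    {"label": "NEGATIVE", "score": 0.9969879984855652},
]

# Attach each input text to its prediction
paired = [{"text": text, **pred} for text, pred in zip(task_list, results)]

# Keep only the texts classified as NEGATIVE
negatives = [row["text"] for row in paired if row["label"] == "NEGATIVE"]

for row in paired:
    print(f'{row["label"]:<8} {row["score"]:.4f}  {row["text"]}')
```

Printed side by side, the results make it easy to spot questionable labels, such as the neutral-sounding third sentence being scored as strongly negative.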

In [8]:
classifier = pipeline(task = "sentiment-analysis", model = "SamLowe/roberta-base-go_emotions")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we CAN actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know. It is pretty funny name for a Regression Model.",\
            "I hate long Meetings."]
classifier(task_list)



[{'label': 'admiration', 'score': 0.7406534552574158},
 {'label': 'confusion', 'score': 0.9066852331161499},
 {'label': 'amusement', 'score': 0.9083252549171448},
 {'label': 'anger', 'score': 0.7870614528656006}]

**Note: With `model="SamLowe/roberta-base-go_emotions"`, the pipeline returns fine-grained emotion labels (such as admiration, confusion, amusement, and anger) instead of only POSITIVE or NEGATIVE.**

## Text Generation

In [9]:
# Use a pipeline as a high-level helper
from transformers import pipeline

text_generator = pipeline("text-generation", model="distilbert/distilgpt2")
generated_text = text_generator("Today is a rainy day in London",
                                truncation=True,
                                num_return_sequences = 2)
print("Generated_text:\n ", generated_text[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Generated_text:
  Today is a rainy day in London and it can feel chilly and rainy. But as with most other outdoor spaces on the planet, it is the same at home. In fact, if you walk in a few metres away, your foot will pick it


## Question Answering

In [10]:
from transformers import pipeline

qa_model = pipeline("question-answering")
question = "What is my job?"
context = "I am developing AI models with Python."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.




{'score': 0.7823829054832458,
 'start': 5,
 'end': 25,
 'answer': 'developing AI models'}
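
This is an extractive model: the answer is a span copied from the context, and `start` and `end` are character offsets into it. A quick check using the output above:

```python
context = "I am developing AI models with Python."

# Result copied from the cell output above
result = {
    "score": 0.7823829054832458,
    "start": 5,
    "end": 25,
    "answer": "developing AI models",
}

# `start` and `end` index into the context string (end is exclusive)
span = context[result["start"]:result["end"]]
print(span)
```

Because the answer is always a slice of the context, the model cannot produce an answer that does not appear there verbatim.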

# Tokenization

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DistilBertTokenizer, DistilBertForSequenceClassification

In [12]:
model_name2 = "nlptown/bert-base-multilingual-uncased-sentiment"
mymodel2 = AutoModelForSequenceClassification.from_pretrained(model_name2)
mytokenizer2 = AutoTokenizer.from_pretrained(model_name2)

classifier = pipeline("sentiment-analysis", model = mymodel2 , tokenizer = mytokenizer2)
res = classifier("I was so not happy with the Barbie Movie")
print(res)



[{'label': '2 stars', 'score': 0.5099301338195801}]


In [13]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)


Tokens: ['i', 'was', 'so', 'not', 'happy', 'with', 'the', 'barbie', 'movie']


In [14]:
# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)


Input IDs: [1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185]


In [15]:

# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)


Encoded Input: {'input_ids': [101, 1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
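
Comparing the two outputs above shows what `tokenizer(text)` adds: the same IDs wrapped in `101` and `102`, which are BERT's `[CLS]` and `[SEP]` special tokens. A toy sketch of that lookup, using a hand-built vocabulary limited to the tokens and IDs printed above (the real tokenizer also splits unknown words into subwords, which this sketch ignores; `toy_encode` is a hypothetical helper):

```python
# Toy vocabulary built from the tokens and IDs printed above (bert-base-uncased)
vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "was": 2001, "so": 2061, "not": 2025, "happy": 3407,
    "with": 2007, "the": 1996, "barbie": 22635, "movie": 3185,
}

def toy_encode(text):
    """Lowercase, split on whitespace, look up IDs, and add special tokens."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [vocab[token] for token in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sequence: all zeros
        "attention_mask": [1] * len(input_ids),  # no padding: all ones
    }

encoded = toy_encode("I was so not happy with the Barbie Movie")
print(encoded["input_ids"])
```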


In [16]:

# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)


Decode Output:  i was so not happy with the barbie movie


In [17]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

Tokens: ['I', 'was', 'so', 'not', 'happy', 'with', 'the', 'Barbie', 'Movie']
Input IDs: [146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275]
Encoded Input: {'input_ids': [101, 146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode Output:  I was so not happy with the Barbie Movie


**token_type_ids**<br>
These IDs are used to distinguish between different sequences in tasks that involve multiple sentences, such as question-answering and sentence-pair classification. BERT uses this mechanism to understand which tokens belong to which segment. For single-sequence tasks like sentiment analysis, token_type_ids are all zeros.
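
A minimal sketch of how segment IDs are laid out for a sentence pair, assuming BERT's `[CLS] A [SEP] B [SEP]` layout and plain whitespace splitting in place of the real WordPiece tokenizer (the `pair_token_type_ids` helper is hypothetical):

```python
def pair_token_type_ids(first, second):
    """Build tokens and segment IDs in BERT's [CLS] A [SEP] B [SEP] layout."""
    first_tokens = ["[CLS]"] + first.lower().split() + ["[SEP]"]
    second_tokens = second.lower().split() + ["[SEP]"]
    tokens = first_tokens + second_tokens
    # Segment 0 covers [CLS], the first sequence, and its [SEP];
    # segment 1 covers the second sequence and the final [SEP].
    token_type_ids = [0] * len(first_tokens) + [1] * len(second_tokens)
    return tokens, token_type_ids

tokens, token_type_ids = pair_token_type_ids(
    "What is my job?", "I am developing AI models with Python."
)
for token, segment in zip(tokens, token_type_ids):
    print(segment, token)
```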

**attention_mask** <br>
The attention mask is used to differentiate between actual tokens and padding tokens (if any). It helps the model focus on non-padding tokens and ignore padding tokens. A value of 1 indicates that the token should be attended to, while a value of 0 indicates padding.

**Why Padding Tokens Are Used**<br>
**Uniform sequence length:** Deep learning models typically process input data in batches. To efficiently process these batches, all sequences in a batch must have the same length. Padding tokens ensure this by extending shorter sequences to match the length of the longest sequence in the batch.<br>
**Efficient computation:** Fixed-length sequences allow for more efficient use of hardware resources, as the model can process all sequences in parallel without needing to handle variable-length sequences individually.
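
The padding and masking described above can be sketched in a few lines. This is an illustration rather than the tokenizer's actual implementation; it assumes a pad ID of 0, which is the `[PAD]` token in BERT's vocabulary, and `pad_batch` is a hypothetical helper:

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-encoded sequences of different lengths
batch = [
    [101, 1045, 2001, 2061, 2025, 3407, 102],
    [101, 3185, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

With a real tokenizer, passing `padding=True` to the tokenizer call on a list of texts produces the same kind of output.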



# Fine-Tuning on IMDB

## Step 1: Install Necessary Libraries

In [4]:
!pip install datasets



## Step 2: Load and Prepare the Dataset

In [5]:
from datasets import load_dataset
dataset = load_dataset('imdb')

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [21]:
dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## Step 3: Preprocess the Data
Tokenize the dataset using the tokenizer associated with the pre-trained model.

In [7]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)



In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [8]:
tokenized_datasets["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## Step 4: Set Up the Training Arguments
Specify the hyperparameters and training settings.

In [9]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    eval_strategy="epoch",           # Evaluate every epoch
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    num_train_epochs=1,              # Number of training epochs
    weight_decay=0.01,               # Strength of weight decay
)
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
eval_use_gather_object
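
These settings determine how many optimizer steps training will take. A rough back-of-the-envelope calculation, assuming a single GPU (`_n_gpu=1` in the output above), no gradient accumulation (an assumption, since the printed output is cut off before that field), and the 25,000-example IMDB train split:

```python
import math

num_train_examples = 25_000       # size of the IMDB train split
per_device_batch_size = 16        # per_device_train_batch_size
num_devices = 1                   # _n_gpu=1 in the printed arguments
gradient_accumulation_steps = 1   # assumed default; not visible in the output
num_train_epochs = 1

effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
# dataloader_drop_last=False, so the final partial batch still counts as a step
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```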

## Step 5: Initialize the Model
Load the pre-trained model and define the training procedure.

In [10]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 6: Train the Model
Fine-tune the pre-trained model on your specific dataset.

In [None]:
# Train the model
trainer.train()

## Step 7: Evaluate the Model
Assess the model's performance on the evaluation set (here, the IMDB `test` split passed as `eval_dataset`).

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)
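
Because the `Trainer` above was created without a `compute_metrics` function, `trainer.evaluate()` reports the evaluation loss and timing figures but no accuracy. A minimal sketch of an accuracy metric that could be passed as `Trainer(..., compute_metrics=compute_metrics)`; it is written in plain Python so it runs on its own, and the sample logits below are made up:

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels), as the Trainer supplies them."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for four reviews (columns: negative, positive) and true labels
sample_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
sample_labels = [0, 1, 1, 1]
metrics = compute_metrics((sample_logits, sample_labels))
print(metrics)
```

The `Trainer` prefixes each returned key with `eval_`, so this metric would show up in the results as `eval_accuracy`.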

## Step 8: Save the Fine-Tuned Model
Save the fine-tuned model for later use.

In [None]:
# Save the model and tokenizer to the same directory so both can be reloaded from one path
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-model')


# ArXiv Project

In [1]:
!pip install arxiv

Collecting arxiv
  Downloading arxiv-2.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting feedparser~=6.0.10 (from arxiv)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Downloading arxiv-2.1.3-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6047 sha256=e9e704e8d761648e64c1bd02176e88500fac4927820af86e7075235a1d720f1b
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packag

In [2]:
import arxiv
import pandas as pd

In [3]:
# Query to fetch AI-related papers
query = 'ai OR artificial intelligence OR machine learning'
search = arxiv.Search(query=query, max_results=10, sort_by=arxiv.SortCriterion.SubmittedDate)

# Fetch papers through a Client; Search.results() is deprecated in arxiv 2.x
client = arxiv.Client()
papers = []
for result in client.results(search):
    papers.append({
        'published': result.published,
        'title': result.title,
        'abstract': result.summary,
        'categories': result.categories
    })

# Convert to DataFrame
df = pd.DataFrame(papers)

# Show full abstracts instead of truncating long cells
pd.set_option('display.max_colwidth', None)
df.head(10)


Unnamed: 0,published,title,abstract,categories
0,2024-11-21 18:59:51+00:00,Stable Flow: Vital Layers for Training-Free Image Editing,"Diffusion models have revolutionized the field of content synthesis and\nediting. Recent models have replaced the traditional UNet architecture with the\nDiffusion Transformer (DiT), and employed flow-matching for improved training\nand sampling. However, they exhibit limited generation diversity. In this work,\nwe leverage this limitation to perform consistent image edits via selective\ninjection of attention features. The main challenge is that, unlike the\nUNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it\nunclear in which layers to perform the injection. Therefore, we propose an\nautomatic method to identify ""vital layers"" within DiT, crucial for image\nformation, and demonstrate how these layers facilitate a range of controlled\nstable edits, from non-rigid modifications to object addition, using the same\nmechanism. Next, to enable real-image editing, we introduce an improved image\ninversion method for flow models. Finally, we evaluate our approach through\nqualitative and quantitative comparisons, along with a user study, and\ndemonstrate its effectiveness across multiple applications. The project page is\navailable at https://omriavrahami.com/stable-flow","[cs.CV, cs.GR, cs.LG]"
1,2024-11-21 18:59:08+00:00,Revisiting the Integration of Convolution and Attention for Vision Backbone,"Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically\nconsidered alternatives to each other for building vision backbones. Although\nsome works try to integrate both, they apply the two operators simultaneously\nat the finest pixel granularity. With Convs responsible for per-pixel feature\nextraction already, the question is whether we still need to include the heavy\nMHSAs at such a fine-grained level. In fact, this is the root cause of the\nscalability issue w.r.t. the input resolution for vision transformers. To\naddress this important problem, we propose in this work to use MSHAs and Convs\nin parallel \textbf{at different granularity levels} instead. Specifically, in\neach layer, we use two different ways to represent an image: a fine-grained\nregular grid and a coarse-grained set of semantic slots. We apply different\noperations to these two representations: Convs to the grid for local features,\nand MHSAs to the slots for global features. A pair of fully differentiable soft\nclustering and dispatching modules is introduced to bridge the grid and set\nrepresentations, thus enabling local-global fusion. Through extensive\nexperiments on various vision tasks, we empirically verify the potential of the\nproposed integration scheme, named \textit{GLMix}: by offloading the burden of\nfine-grained features to light-weight Convs, it is sufficient to use MHSAs in a\nfew (e.g., 64) semantic slots to match the performance of recent\nstate-of-the-art backbones, while being more efficient. Our visualization\nresults also demonstrate that the soft clustering module produces a meaningful\nsemantic grouping effect with only IN1k classification supervision, which may\ninduce better interpretability and inspire new weakly-supervised semantic\nsegmentation approaches. Code will be available at\n\url{https://github.com/rayleizhu/GLMix}.","[cs.CV, cs.AI]"
2,2024-11-21 18:58:32+00:00,Transformer-based Heuristic for Advanced Air Mobility Planning,"Safety is extremely important for urban flights of autonomous Unmanned Aerial\nVehicles (UAVs). Risk-aware path planning is one of the most effective methods\nto guarantee the safety of UAVs. This type of planning can be represented as a\nConstrained Shortest Path (CSP) problem, which seeks to find the shortest route\nthat meets a predefined safety constraint. Solving CSP problems is NP-hard,\npresenting significant computational challenges. Although traditional methods\ncan accurately solve CSP problems, they tend to be very slow. Previously, we\nintroduced an additional safety dimension to the traditional A* algorithm,\nknown as ASD A*, to effectively handle Constrained Shortest Path (CSP)\nproblems. Then, we developed a custom learning-based heuristic using\ntransformer-based neural networks, which significantly reduced computational\nload and enhanced the performance of the ASD A* algorithm. In this paper, we\nexpand our dataset to include more risk maps and tasks, improve the proposed\nmodel, and increase its performance. We also introduce a new heuristic strategy\nand a novel neural network, which enhance the overall effectiveness of our\napproach.",[cs.RO]
3,2024-11-21 18:57:17+00:00,Whack-a-Chip: The Futility of Hardware-Centric Export Controls,"U.S. export controls on semiconductors are widely known to be permeable, with\nthe People's Republic of China (PRC) steadily creating state-of-the-art\nartificial intelligence (AI) models with exfiltrated chips. This paper presents\nthe first concrete, public evidence of how leading PRC AI labs evade and\ncircumvent U.S. export controls. We examine how Chinese companies, notably\nTencent, are not only using chips that are restricted under U.S. export\ncontrols but are also finding ways to circumvent these regulations by using\nsoftware and modeling techniques that maximize less capable hardware.\nSpecifically, we argue that Tencent's ability to power its Hunyuan-Large model\nwith non-export controlled NVIDIA H20s exemplifies broader gains in efficiency\nin machine learning that have eroded the moat that the United States initially\nbuilt via its existing export controls. Finally, we examine the implications of\nthis finding for the future of the United States' export control strategy.","[cs.CY, cs.AI]"
4,2024-11-21 18:56:33+00:00,Learning Fair Robustness via Domain Mixup,"Adversarial training is one of the predominant techniques for training\nclassifiers that are robust to adversarial attacks. Recent work, however has\nfound that adversarial training, which makes the overall classifier robust, it\ndoes not necessarily provide equal amount of robustness for all classes. In\nthis paper, we propose the use of mixup for the problem of learning fair robust\nclassifiers, which can provide similar robustness across all classes.\nSpecifically, the idea is to mix inputs from the same classes and perform\nadversarial training on mixed up inputs. We present a theoretical analysis of\nthis idea for the case of linear classifiers and show that mixup combined with\nadversarial training can provably reduce the class-wise robustness disparity.\nThis method not only contributes to reducing the disparity in class-wise\nadversarial risk, but also the class-wise natural risk. Complementing our\ntheoretical analysis, we also provide experimental results on both synthetic\ndata and the real world dataset (CIFAR-10), which shows improvement in class\nwise disparities for both natural and adversarial risks.","[cs.LG, cs.CR, cs.IT, math.IT]"
5,2024-11-21 18:54:43+00:00,From RNNs to Foundation Models: An Empirical Study on Commercial Building Energy Consumption,"Accurate short-term energy consumption forecasting for commercial buildings\nis crucial for smart grid operations. While smart meters and deep learning\nmodels enable forecasting using past data from multiple buildings, data\nheterogeneity from diverse buildings can reduce model performance. The impact\nof increasing dataset heterogeneity in time series forecasting, while keeping\nsize and model constant, is understudied. We tackle this issue using the\nComStock dataset, which provides synthetic energy consumption data for U.S.\ncommercial buildings. Two curated subsets, identical in size and region but\ndiffering in building type diversity, are used to assess the performance of\nvarious time series forecasting models, including fine-tuned open-source\nfoundation models (FMs). The results show that dataset heterogeneity and model\narchitecture have a greater impact on post-training forecasting performance\nthan the parameter count. Moreover, despite the higher computational cost,\nfine-tuned FMs demonstrate competitive performance compared to base models\ntrained from scratch.",[cs.LG]
6,2024-11-21 18:46:45+00:00,Adversarial Poisoning Attack on Quantum Machine Learning Models,"With the growing interest in Quantum Machine Learning (QML) and the\nincreasing availability of quantum computers through cloud providers,\naddressing the potential security risks associated with QML has become an\nurgent priority. One key concern in the QML domain is the threat of data\npoisoning attacks in the current quantum cloud setting. Adversarial access to\ntraining data could severely compromise the integrity and availability of QML\nmodels. Classical data poisoning techniques require significant knowledge and\ntraining to generate poisoned data, and lack noise resilience, making them\nineffective for QML models in the Noisy Intermediate Scale Quantum (NISQ) era.\nIn this work, we first propose a simple yet effective technique to measure\nintra-class encoder state similarity (ESS) by analyzing the outputs of encoding\ncircuits. Leveraging this approach, we introduce a quantum indiscriminate data\npoisoning attack, QUID. Through extensive experiments conducted in both\nnoiseless and noisy environments (e.g., IBM\_Brisbane's noise), across various\narchitectures and datasets, QUID achieves up to $92\%$ accuracy degradation in\nmodel performance compared to baseline models and up to $75\%$ accuracy\ndegradation compared to random label-flipping. We also tested QUID against\nstate-of-the-art classical defenses, with accuracy degradation still exceeding\n$50\%$, demonstrating its effectiveness. This work represents the first attempt\nto reevaluate data poisoning attacks in the context of QML.","[quant-ph, cs.CR, cs.CV]"
7,2024-11-21 18:46:23+00:00,Multi-Agent Environments for Vehicle Routing Problems,"Research on Reinforcement Learning (RL) approaches for discrete optimization\nproblems has increased considerably, extending RL to an area classically\ndominated by Operations Research (OR). Vehicle routing problems are a good\nexample of discrete optimization problems with high practical relevance where\nRL techniques have had considerable success. Despite these advances,\nopen-source development frameworks remain scarce, hampering both the testing of\nalgorithms and the ability to objectively compare results. This ultimately\nslows down progress in the field and limits the exchange of ideas between the\nRL and OR communities.\n Here we propose a library composed of multi-agent environments that simulates\nclassic vehicle routing problems. The library, built on PyTorch, provides a\nflexible modular architecture design that allows easy customization and\nincorporation of new routing problems. It follows the Agent Environment Cycle\n(""AEC"") games model and has an intuitive API, enabling rapid adoption and easy\nintegration into existing reinforcement learning frameworks.\n The library allows for a straightforward use of classical OR benchmark\ninstances in order to narrow the gap between the test beds for algorithm\nbenchmarking used by the RL and OR communities. Additionally, we provide\nbenchmark instance sets for each environment, as well as baseline RL models and\ntraining code.",[cs.LG]
8,2024-11-21 18:45:03+00:00,Engineering spectro-temporal light states with physics-trained deep learning,"Frequency synthesis and spectro-temporal control of optical wave packets are\ncentral to ultrafast science, with supercontinuum (SC) generation standing as\none remarkable example. Through passive manipulation, femtosecond (fs) pulses\nfrom nJ-level lasers can be transformed into octave-spanning spectra,\nsupporting few-cycle pulse outputs when coupled with external pulse\ncompressors. While strategies such as machine learning have been applied to\ncontrol the SC's central wavelength and bandwidth, their success has been\nlimited by the nonlinearities and strong sensitivity to measurement noise.\nHere, we propose and demonstrate how a physics-trained convolutional neural\nnetwork (P-CNN) can circumvent such challenges, showing few-fold speedups over\nthe direct approaches. We highlight three key advancements enabled by the P-CNN\napproach: (i) on-demand control over spectral features of SC, (ii) direct\ngeneration of sub-3-cycle pulses from the highly nonlinear fiber, and (iii) the\nproduction of high-order solitons, capturing distinct ""breather"" dynamics in\nboth spectral and temporal domains. This approach heralds a new era of\narbitrary spectro-temporal state engineering, with transformative implications\nfor ultrafast and quantum science.","[physics.optics, nlin.PS, physics.class-ph, quant-ph]"
9,2024-11-21 18:39:15+00:00,Exploring Methods for Integrating and Augmenting Multimodal Data to Improve Prognostic Accuracy in Imbalanced Datasets for Intraoperative Aneurysm Occlusion,"This study evaluates a multimodal machine learning framework for predicting\ntreatment outcomes in intracranial aneurysms (IAs). Combining angiographic\nparametric imaging (API), patient biomarkers, and disease morphology, the\nframework aims to enhance prognostic accuracy. Data from 340 patients were\nanalyzed, with separate deep neural networks processing quantitative and\ncategorical data. These networks' pre decision layers were concatenated and\ninputted into a final predictive network. Various data augmentation strategies,\nincluding Synthetic Minority Oversampling Technique for Nominal and Continuous\ndata (SMOTE NC), addressed dataset imbalances. Performance metrics, evaluated\nthrough Monte Carlo cross validation, showed significant improvements with\naugmentation, particularly in intermediate fusion models. This study validates\nthe framework's efficacy in accurately predicting IA treatment outcomes,\ndemonstrating that data augmentation techniques can substantially enhance model\nperformance.",[physics.med-ph]


In [7]:
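# Take the first abstract from the DataFrame of API results
abstract = df['abstract'][0]

# Load a summarization pipeline backed by BART fine-tuned on CNN/DailyMail
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize; the result is a list with one dict per input, keyed by 'summary_text'
summarization_result = summarizer(abstract)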

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
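
The warning above says the pipeline was created without a `device` argument, so the model runs on the CPU even though a GPU is present. A minimal sketch of one way to address it follows. The helper names `pick_device` and `build_summarizer` are hypothetical and not part of the notebook; the sketch assumes the PyTorch backend and relies on the `device` parameter of `pipeline`, where `0` selects the first CUDA GPU and `-1` selects the CPU.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if PyTorch can see a GPU, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper degrades to CPU without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def build_summarizer(model_name: str = "facebook/bart-large-cnn"):
    """Create a summarization pipeline on the device chosen by pick_device()."""
    from transformers import pipeline

    # Passing `device` explicitly is what the warning asks for.
    return pipeline("summarization", model=model_name, device=pick_device())


print("Selected device index:", pick_device())
```

Calling `summarizer = build_summarizer()` in place of the bare `pipeline(...)` call should then place the model on the Tesla T4 reported by `nvidia-smi`, assuming CUDA is visible to PyTorch in the session.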


In [8]:
summarization_result[0]['summary_text']

'Diffusion models have revolutionized the field of content synthesis andediting. The main challenge is that, unlike the traditional UNet-based models, DiT lacks a coarse-to-fine synthesis structure. We propose anautomatic method to identify "vital layers" within DiT, crucial for imageformation.'
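
The run-together words in the summary above ("andediting", "anautomatic", "imageformation") most likely come from the hard line breaks inside the abstract text: the abstracts in the DataFrame are wrapped with `\n`, and the model appears to drop the newline without inserting a space. A minimal sketch of a preprocessing step, assuming this is the cause, is shown below. The helper `normalize_whitespace` and the sample string are hypothetical and used only for illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into a single space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical fragment that mimics a hard-wrapped abstract
sample = "content synthesis and\nediting"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

With this helper, the call in the earlier cell could become `summarizer(normalize_whitespace(abstract))`, so the model receives the abstract as a single unbroken line.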