 **Installation**


In [1]:
!nvidia-smi

Thu Jan  2 03:16:34 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:

!pip install transformers



In [None]:
from transformers import pipeline
#---------------------------------------------------#
#                     NLP TASKS                     #
#---------------------------------------------------#

'''
1. Text Classification: Assigning a category to a piece of text.
Sentiment Analysis
Topic Classification
Spam Detection '''

classifier = pipeline("text-classification")

'''
2. Token Classification: Assigning labels to individual tokens in a sequence.
Named Entity Recognition (NER)
Part-of-Speech Tagging
'''

token_classifier = pipeline("token-classification")

'''
3. Question Answering: Extracting an answer from a given context based on a question.
'''
question_answerer = pipeline("question-answering")

'''
4. Text Generation: Generating text based on a given prompt.
Language Modeling
Story Generation

'''

text_generator = pipeline("text-generation")

'''
5. Summarization: Condensing long documents into shorter summaries.
'''

summarizer = pipeline("summarization")

'''
Translation: Translating text from one language to another.
'''

translator = pipeline("translation",
                      model="Helsinki-NLP/opus-mt-en-fr")

'''
6. Text2Text Generation: General-purpose text transformation, including summarization and translation.
'''

text2text_generator = pipeline("text2text-generation")

'''
7. Fill-Mask: Predicting the masked token in a sequence.
'''

fill_mask = pipeline("fill-mask")

'''
8. Feature Extraction: Extracting hidden states or features from text.
'''

feature_extractor = pipeline("feature-extraction")

'''
9. Sentence Similarity: Measuring the similarity between two sentences.
'''
sentence_similarity = pipeline("sentence-similarity")

#---------------------------------------------------#
#             Computer Vision TASKS                 #
#---------------------------------------------------#

'''
1. Image Classification: Classifying the main content of an image.

'''

image_classifier = pipeline("image-classification")

'''
2. Object Detection: Identifying objects within an image and their bounding boxes.
'''

object_detector = pipeline("object-detection")

'''
3. Image Segmentation: Segmenting different parts of an image into classes.
'''

image_segmenter = pipeline("image-segmentation")

'''
4. Image Generation: Generating images from textual descriptions (using DALL-E or similar models).
'''

#---------------------------------------------------#
#             Speech Processing TASKS               #
#---------------------------------------------------#

'''
1. utomatic Speech Recognition (ASR): Converting spoken language into text.
'''

speech_recognizer = pipeline("automatic-speech-recognition")

'''
2. Speech Translation: Translating spoken language from one language to another.
3. Audio Classification: Classifying audio signals into predefined categories.
'''

#---------------------------------------------------#
#                   Multimodal TASKS                #
#---------------------------------------------------#

'''
1. Image Captioning: Generating a textual description of an image.
'''
image_captioner = pipeline("image-to-text")
'''
2. Visual Question Answering (VQA): Answering questions about the content of an image.
'''

#---------------------------------------------------#
#                     Other TASKS                   #
#---------------------------------------------------#
'''
1. Table Question Answering: Answering questions based on tabular data.
'''
table_qa = pipeline("table-question-answering")

'''
2. Document Question Answering: Extracting answers from documents like PDFs.

'''
doc_qa = pipeline("document-question-answering")
'''
3. Time Series Forecasting: Predicting future values in time series data (not directly supported in the main Transformers library but available through extensions).
'''

**# NLP Tasks**
# **Sentiment Analysis**

In [7]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I was so not happy with the last Mission Impossible Movie")
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9997795224189758}]


In [9]:
pipeline(task = "sentiment-analysis")("I was confused with the Barbie Movie")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9992005228996277}]

In [10]:
pipeline(task = "sentiment-analysis")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9964345693588257}]

In [20]:
pipeline(task = "sentiment-analysis", model="facebook/bart-large-mnli")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")



Device set to use cuda:0


[{'label': 'neutral', 'score': 0.7693331837654114}]

# Batch Senteniment Analysis

In [21]:
classifier = pipeline(task = "sentiment-analysis")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we CAN actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",\
            "I hate long Meetings."]
classifier(task_list)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9978686571121216},
 {'label': 'NEGATIVE', 'score': 0.9995476603507996},
 {'label': 'NEGATIVE', 'score': 0.9983084201812744},
 {'label': 'NEGATIVE', 'score': 0.9969881176948547}]

In [22]:
classifier = pipeline(task = "sentiment-analysis", model = "SamLowe/roberta-base-go_emotions")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we CAN actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",\
            "I hate long Meetings."]
classifier(task_list)

config.json:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/380 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': 'admiration', 'score': 0.7406536340713501},
 {'label': 'confusion', 'score': 0.9066852927207947},
 {'label': 'neutral', 'score': 0.7837758660316467},
 {'label': 'anger', 'score': 0.7870620489120483}]

# Text Generation

In [31]:
from transformers import pipeline

text_generator = pipeline("text-generation", model="distilbert/distilgpt2")
generate_text = text_generator("Today is a rainy day in Dallas.", truncation=True,
                                num_return_sequences = 2)

print(generate_text[0]['generated_text'])


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today is a rainy day in Dallas. But the first thing to notice is some nice weather. Just imagine what would happen if the tornado or hail would come and knock down and storm Texas.


# Question Answering

In [37]:
from transformers import pipeline

questions = pipeline("question-answering")
question = "whats my faourite car?"
context = "I love sports cars"
questions(question = question, context = context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'score': 0.9025944471359253, 'start': 7, 'end': 18, 'answer': 'sports cars'}

# Tokenization

In [41]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DistilBertTokenizer, DistilBertForSequenceClassification


In [42]:
model_name2 = "nlptown/bert-base-multilingual-uncased-sentiment"
mymodel2 = AutoModelForSequenceClassification.from_pretrained(model_name2)
mytokenizer2 = AutoTokenizer.from_pretrained(model_name2)

classifier = pipeline("sentiment-analysis", model = mymodel2 , tokenizer = mytokenizer2)
res = classifier("I was so not happy with the Barbie Movie")
print(res)


config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/669M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': '2 stars', 'score': 0.5099301934242249}]


In [43]:
from transformers import pipeline
# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Tokens: ['i', 'was', 'so', 'not', 'happy', 'with', 'the', 'barbie', 'movie']


In [44]:
# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

Input IDs: [1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185]


In [45]:
# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

Encoded Input: {'input_ids': [101, 1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [46]:
# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

Decode Output:  i was so not happy with the barbie movie


In [47]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Tokens: ['I', 'was', 'so', 'not', 'happy', 'with', 'the', 'Barbie', 'Movie']
Input IDs: [146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275]
Encoded Input: {'input_ids': [101, 146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode Output:  I was so not happy with the Barbie Movie


token_type_ids
These IDs are used to distinguish between different sequences in tasks that involve multiple sentences, such as question-answering and sentence-pair classification. BERT uses this mechanism to understand which tokens belong to which segment. For single-sequence tasks like sentiment analysis, token_type_ids are all zeros.

attention_mask
The attention mask is used to differentiate between actual tokens and padding tokens (if any). It helps the model focus on non-padding tokens and ignore padding tokens. A value of 1 indicates that the token should be attended to, while a value of 0 indicates padding.

Why Padding Tokens Are Used
Uniform Sequence Length: Deep learning models typically process input data in batches. To efficiently process these batches, all sequences in a batch must have the same length. Padding tokens ensure this by extending shorter sequences to match the length of the longest sequence in the batch. Efficient Computation: Fixed-length sequences allow for more efficient use of hardware resources, as the model can process all sequences in parallel without needing to handle variable-length sequences individually.

# **Fine Tunning IMDB**

In [48]:
!pip install datasets




# Step 2: Load and Prepare the Dataset

In [49]:
from datasets import load_dataset
dataset = load_dataset('imdb')

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [50]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [52]:
dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# Step 3: Preprocess the Data
Tokenize the dataset using the tokenizer associated with the pre-trained model.

In [54]:
from transformers import AutoTokenizer
#load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

#tokenize the dataset
def tokenize_function(examples):
  return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [55]:

tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [56]:
tokenized_datasets["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# Step 4: Set Up the Training Arguments
Specify the hyperparameters and training settings

In [57]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    eval_strategy ="epoch",     # Evaluate every epoch
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    num_train_epochs=1,              # Number of training epochs
    weight_decay=0.01,               # Strength of weight decay
)
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
eval_use_gather_object

# Step 5: Initialize the Model




In [58]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Step 6: Train the Model

In [59]:
# Train the model
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 37


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 37


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 37


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 37


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 37


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Epoch,Training Loss,Validation Loss
1,0.4382,0.414707


TrainOutput(global_step=1563, training_loss=0.5235612482423593, metrics={'train_runtime': 3578.6201, 'train_samples_per_second': 6.986, 'train_steps_per_second': 0.437, 'total_flos': 6577776384000000.0, 'train_loss': 0.5235612482423593, 'epoch': 1.0})

# Step 7: Evaluate the Model
Assess the model's performance on a validation set.

In [61]:
# Evaluate the model
results = trainer.evaluate()
print(results)

{'eval_loss': 0.4147070646286011, 'eval_runtime': 728.2757, 'eval_samples_per_second': 34.328, 'eval_steps_per_second': 2.146, 'epoch': 1.0}


# Step 8: Save the Fine-Tuned Model

In [62]:
# Save the model
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-tokenizer')

('./fine-tuned-tokenizer/tokenizer_config.json',
 './fine-tuned-tokenizer/special_tokens_map.json',
 './fine-tuned-tokenizer/vocab.txt',
 './fine-tuned-tokenizer/added_tokens.json',
 './fine-tuned-tokenizer/tokenizer.json')

# ArXiv Project

In [60]:

!pip install arxiv

Collecting arxiv
  Downloading arxiv-2.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting feedparser~=6.0.10 (from arxiv)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Downloading arxiv-2.1.3-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.3/81.3 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... [?25l[?25hdone
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6047 sha256=971e61cbb95b0666c58d19f24040811da677951f960e7922fcbf6d7a753e4042
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packag

In [63]:

import arxiv
import pandas as pd

In [64]:
# Query to fetch AI-related papers
query = 'ai OR artificial intelligence OR machine learning'
search = arxiv.Search(query=query, max_results=10, sort_by=arxiv.SortCriterion.SubmittedDate)

# Fetch papers
papers = []
for result in search.results():
    papers.append({
      'published': result.published,
        'title': result.title,
        'abstract': result.summary,
        'categories': result.categories
    })

# Convert to DataFrame
df = pd.DataFrame(papers)

pd.set_option('display.max_colwidth', None)
df.head(10)


  for result in search.results():


Unnamed: 0,published,title,abstract,categories
0,2024-12-30 18:59:58+00:00,PERSE: Personalized 3D Generative Avatars from A Single Portrait,"We present PERSE, a method for building an animatable personalized generative\navatar from a reference portrait. Our avatar model enables facial attribute\nediting in a continuous and disentangled latent space to control each facial\nattribute, while preserving the individual's identity. To achieve this, our\nmethod begins by synthesizing large-scale synthetic 2D video datasets, where\neach video contains consistent changes in the facial expression and viewpoint,\ncombined with a variation in a specific facial attribute from the original\ninput. We propose a novel pipeline to produce high-quality, photorealistic 2D\nvideos with facial attribute editing. Leveraging this synthetic attribute\ndataset, we present a personalized avatar creation method based on the 3D\nGaussian Splatting, learning a continuous and disentangled latent space for\nintuitive facial attribute manipulation. To enforce smooth transitions in this\nlatent space, we introduce a latent space regularization technique by using\ninterpolated 2D faces as supervision. Compared to previous approaches, we\ndemonstrate that PERSE generates high-quality avatars with interpolated\nattributes while preserving identity of reference person.",[cs.CV]
1,2024-12-30 18:59:55+00:00,Action-Agnostic Point-Level Supervision for Temporal Action Detection,"We propose action-agnostic point-level (AAPL) supervision for temporal action\ndetection to achieve accurate action instance detection with a lightly\nannotated dataset. In the proposed scheme, a small portion of video frames is\nsampled in an unsupervised manner and presented to human annotators, who then\nlabel the frames with action categories. Unlike point-level supervision, which\nrequires annotators to search for every action instance in an untrimmed video,\nframes to annotate are selected without human intervention in AAPL supervision.\nWe also propose a detection model and learning method to effectively utilize\nthe AAPL labels. Extensive experiments on the variety of datasets (THUMOS '14,\nFineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed\napproach is competitive with or outperforms prior methods for video-level and\npoint-level supervision in terms of the trade-off between the annotation cost\nand detection performance.","[cs.CV, cs.AI, cs.LG]"
2,2024-12-30 18:59:46+00:00,"SoS Certificates for Sparse Singular Values and Their Applications: Robust Statistics, Subspace Distortion, and More","We study $\textit{sparse singular value certificates}$ for random rectangular\nmatrices. If $M$ is an $n \times d$ matrix with independent Gaussian entries,\nwe give a new family of polynomial-time algorithms which can certify upper\nbounds on the maximum of $\|M u\|$, where $u$ is a unit vector with at most\n$\eta n$ nonzero entries for a given $\eta \in (0,1)$. This basic algorithmic\nprimitive lies at the heart of a wide range of problems across algorithmic\nstatistics and theoretical computer science.\n Our algorithms certify a bound which is asymptotically smaller than the naive\none, given by the maximum singular value of $M$, for nearly the widest-possible\nrange of $n,d,$ and $\eta$. Efficiently certifying such a bound for a range of\n$n,d$ and $\eta$ which is larger by any polynomial factor than what is achieved\nby our algorithm would violate lower bounds in the SQ and low-degree\npolynomials models. Our certification algorithm makes essential use of the\nSum-of-Squares hierarchy. To prove the correctness of our algorithm, we develop\na new combinatorial connection between the graph matrix approach to analyze\nrandom matrices with dependent entries, and the Efron-Stein decomposition of\nfunctions of independent random variables.\n As applications of our certification algorithm, we obtain new efficient\nalgorithms for a wide range of well-studied algorithmic tasks. In algorithmic\nrobust statistics, we obtain new algorithms for robust mean and covariance\nestimation with tradeoffs between breakdown point and sample complexity, which\nare nearly matched by SQ and low-degree polynomial lower bounds (that we\nestablish). We also obtain new polynomial-time guarantees for certification of\n$\ell_1/\ell_2$ distortion of random subspaces of $\mathbb{R}^n$ (also with\nnearly matching lower bounds), sparse principal component analysis, and\ncertification of the $2\rightarrow p$ norm of a random matrix.","[cs.DS, cs.LG]"
3,2024-12-30 18:59:06+00:00,Distributed Mixture-of-Agents for Edge Inference with Large Language Models,"Mixture-of-Agents (MoA) has recently been proposed as a method to enhance\nperformance of large language models (LLMs), enabling multiple individual LLMs\nto work together for collaborative inference. This collaborative approach\nresults in improved responses to user prompts compared to relying on a single\nLLM. In this paper, we consider such an MoA architecture in a distributed\nsetting, where LLMs operate on individual edge devices, each uniquely\nassociated with a user and equipped with its own distributed computing power.\nThese devices exchange information using decentralized gossip algorithms,\nallowing different device nodes to talk without the supervision of a\ncentralized server. In the considered setup, different users have their own LLM\nmodels to address user prompts. Additionally, the devices gossip either their\nown user-specific prompts or augmented prompts to generate more refined answers\nto certain queries. User prompts are temporarily stored in the device queues\nwhen their corresponding LLMs are busy. Given the memory limitations of edge\ndevices, it is crucial to ensure that the average queue sizes in the system\nremain bounded. In this paper, we address this by theoretically calculating the\nqueuing stability conditions for the device queues under reasonable\nassumptions, which we validate experimentally as well. Further, we demonstrate\nthrough experiments, leveraging open-source LLMs for the implementation of\ndistributed MoA, that certain MoA configurations produce higher-quality\nresponses compared to others, as evaluated on AlpacaEval 2.0 benchmark. The\nimplementation is available at:\nhttps://github.com/purbeshmitra/distributed_moa.","[cs.IT, cs.CL, cs.DC, cs.LG, cs.NI, math.IT]"
4,2024-12-30 18:55:35+00:00,Sparse chaos in cortical circuits,"Nerve impulses, the currency of information flow in the brain, are generated\nby an instability of the neuronal membrane potential dynamics. Neuronal\ncircuits exhibit collective chaos that appears essential for learning, memory,\nsensory processing, and motor control. However, the factors controlling the\nnature and intensity of collective chaos in neuronal circuits are not well\nunderstood. Here we use computational ergodic theory to demonstrate that basic\nfeatures of nerve impulse generation profoundly affect collective chaos in\nneuronal circuits. Numerically exact calculations of Lyapunov spectra,\nKolmogorov-Sinai-entropy, and upper and lower bounds on attractor dimension\nshow that changes in nerve impulse generation in individual neurons moderately\nimpact information encoding rates but qualitatively transform phase space\nstructure. Specifically, we find a drastic reduction in the number of unstable\nmanifolds, Kolmogorov-Sinai entropy, and attractor dimension. Beyond a critical\npoint, marked by the simultaneous breakdown of the diffusion approximation, a\npeak in the largest Lyapunov exponent, and a localization transition of the\nleading covariant Lyapunov vector, networks exhibit sparse chaos: prolonged\nperiods of near stable dynamics interrupted by short bursts of intense chaos.\nAnalysis of large, more realistically structured networks supports the\ngenerality of these findings. In cortical circuits, biophysical properties\nappear tuned to this regime of sparse chaos. Our results reveal a close link\nbetween fundamental aspects of single-neuron biophysics and the collective\ndynamics of cortical circuits, suggesting that nerve impulse generation\nmechanisms are adapted to enhance circuit controllability and information flow.","[q-bio.NC, cond-mat.dis-nn, cs.LG, nlin.CD]"
5,2024-12-30 18:55:12+00:00,Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs,"The remarkable performance of models like the OpenAI o1 can be attributed to\ntheir ability to emulate human-like long-time thinking during inference. These\nmodels employ extended chain-of-thought (CoT) processes, exploring multiple\nstrategies to enhance problem-solving capabilities. However, a critical\nquestion remains: How to intelligently and efficiently scale computational\nresources during testing. This paper presents the first comprehensive study on\nthe prevalent issue of overthinking in these models, where excessive\ncomputational resources are allocated for simple problems with minimal benefit.\nWe introduce novel efficiency metrics from both outcome and process\nperspectives to evaluate the rational use of computational resources by o1-like\nmodels. Using a self-training paradigm, we propose strategies to mitigate\noverthinking, streamlining reasoning processes without compromising accuracy.\nExperimental results show that our approach successfully reduces computational\noverhead while preserving model performance across a range of testsets with\nvarying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.",[cs.CL]
6,2024-12-30 18:50:37+00:00,Two-component spatiotemporal template for activation-inhibition of speech in ECoG,"I compute the average trial-by-trial power of band-limited speech activity\nacross epochs of multi-channel high-density electrocorticography (ECoG)\nrecorded from multiple subjects during a consonant-vowel speaking task. I show\nthat previously seen anti-correlations of average beta frequency activity\n(12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement\nare observable between individual ECoG channels in the sensorimotor cortex\n(SMC). With this I fit a variance-based model using principal component\nanalysis to the band-powers of individual channels of session-averaged ECoG\ndata in the SMC and project SMC channels onto their lower-dimensional principal\ncomponents.\n Spatiotemporal relationships between speech-related activity and principal\ncomponents are identified by correlating the principal components of both\nfrequency bands to individual ECoG channels over time using windowed\ncorrelation. Correlations of principal component areas to sensorimotor areas\nreveal a distinct two-component activation-inhibition-like representation for\nspeech that resembles distinct local sensorimotor areas recently shown to have\ncomplex interplay in whole-body motor control, inhibition, and posture. Notably\nthe third principal component shows insignificant correlations across all\nsubjects, suggesting two components of ECoG are sufficient to represent SMC\nactivity during speech movement.","[q-bio.NC, cs.CL, cs.LG, eess.AS, eess.SP]"
7,2024-12-30 18:44:16+00:00,Cavity-QED Simulation of a Maser beyond the Mean-Field Approximation,"We here introduce a method for simulating, quantum mechanically, the dynamics\nof a maser where the strength of the magnetic field of the microwave mode being\namplified by stimulated emission varies over the volume of the maser's\nspatially extended gain medium. This is very often the case in real systems.\nOur method generalizes the well-known Tavis-Cummings (T-C) model of cavity\nquantum electrodynamics (QED) to encompass quantum emitters whose coupling\nstrengths to the maser's amplified mode vary over a distribution that can be\naccurately determined using an electromagnetic-field solver applied to the\nmaser cavity's geometry and composition. We then solve our generalized T-C\nmodel to second order in cumulant expansion using publicly available\nPython-based software. We apply our methodology to a specific, experimentally\nmeasured maser based on an optically pumped crystal of pentacene-doped\npara-terphenyl. We demonstrate that certain distinct quantum-mechanical\nfeatures exhibited by this maser's dynamics, most notably the observation of\nRabi-like flopping associated with the generation of spin-photon Dicke states,\ncan be accurately reproduced using our numerically solved model. The equivalent\nsimpler model, that invokes the mean-field approximation, fails to do so. By\nconstructing then solving for artificial (perfectly Gaussian) distributions, we\ngo on to explore how the performance of this type of maser is affected by the\nspread in spin-photon coupling strengths. Our methodology thereby enables the\nmaser's anatomy to be more rationally engineered.","[quant-ph, physics.app-ph]"
8,2024-12-30 18:43:21+00:00,Adversarial Attack and Defense for LoRa Device Identification and Authentication via Deep Learning,"LoRa provides long-range, energy-efficient communications in Internet of\nThings (IoT) applications that rely on Low-Power Wide-Area Network (LPWAN)\ncapabilities. Despite these merits, concerns persist regarding the security of\nLoRa networks, especially in situations where device identification and\nauthentication are imperative to secure the reliable access to the LoRa\nnetworks. This paper explores a deep learning (DL) approach to tackle these\nconcerns, focusing on two critical tasks, namely (i) identifying LoRa devices\nand (ii) classifying them to legitimate and rogue devices. Deep neural networks\n(DNNs), encompassing both convolutional and feedforward neural networks, are\ntrained for these tasks using actual LoRa signal data. In this setting, the\nadversaries may spoof rogue LoRa signals through the kernel density estimation\n(KDE) method based on legitimate device signals that are received by the\nadversaries. Two cases are considered, (i) training two separate classifiers,\none for each of the two tasks, and (ii) training a multi-task classifier for\nboth tasks. The vulnerabilities of the resulting DNNs to manipulations in input\nsamples are studied in form of untargeted and targeted adversarial attacks\nusing the Fast Gradient Sign Method (FGSM). Individual and common perturbations\nare considered against single-task and multi-task classifiers for the LoRa\nsignal analysis. To provide resilience against such attacks, a defense approach\nis presented by increasing the robustness of classifiers with adversarial\ntraining. Results quantify how vulnerable LoRa signal classification tasks are\nto adversarial attacks and emphasize the need to fortify IoT applications\nagainst these subtle yet effective threats.","[cs.NI, cs.AI, cs.CR, cs.LG, eess.SP]"
9,2024-12-30 18:41:29+00:00,Open RAN-Enabled Deep Learning-Assisted Mobility Management for Connected Vehicles,"Connected Vehicles (CVs) can leverage the unique features of 5G and future\n6G/NextG networks to enhance Intelligent Transportation System (ITS) services.\nHowever, even with advancements in cellular network generations, CV\napplications may experience communication interruptions in high-mobility\nscenarios due to frequent changes of serving base station, also known as\nhandovers (HOs). This paper proposes the adoption of Open Radio Access Network\n(Open RAN/O-RAN) and deep learning models for decision-making to prevent\nQuality of Service (QoS) degradation due to HOs and to ensure the timely\nconnectivity needed for CV services. The solution utilizes the O-RAN Software\nCommunity (OSC), an open-source O-RAN platform developed by the collaboration\nbetween the O-RAN Alliance and Linux Foundation, to develop xApps that are\nexecuted in the near-Real-Time RIC of OSC. To demonstrate the proposal's\neffectiveness, an integrated framework combining the OMNeT++ simulator and OSC\nwas created. Evaluations used real-world datasets in urban application\nscenarios, such as video streaming transmission and over-the-air (OTA) updates.\nResults indicate that the proposal achieved superior performance and reduced\nlatency compared to the standard 3GPP HO procedure.","[cs.NI, cs.AI]"


In [65]:
# Example abstract from API
abstract = df['abstract'][0]

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarization
summarization_result = summarizer(abstract)

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cuda:0


In [66]:

summarization_result[0]['summary_text']

"PERSE is a method for building an animatable personalized generative avatar from a reference portrait. Our avatar model enables facial attribute editing in a continuous and disentangled latent space to control each facial attribute, while preserving the individual's identity. Compared to previous approaches, wedemonstrate that PERSE generates high-quality avatars with interpolated \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0attributes."