## Step 1 : Load and Prepare The Dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [4]:
dataset['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be
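
The sample review above still contains HTML line breaks (`<br /><br />`). The notebook does not clean them, so the following is only a minimal sketch of one possible preparation step. The helper name `clean_review` is hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper (an assumption, not in the original notebook):
# matches "<br>", "<br/>" and "<br />" in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def clean_review(text: str) -> str:
    """Replace HTML line breaks with a space and collapse repeated whitespace."""
    without_breaks = BR_TAG.sub(" ", text)
    return " ".join(without_breaks.split())


sample = "see this for myself.<br /><br />The plot is centered around Lena."
print(clean_review(sample))
```

If this cleaning is wanted, it could be applied before tokenization with something like `dataset.map(lambda ex: {"text": clean_review(ex["text"])})`.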

## Step 2 : Process The Data

In [5]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset: pad every example to the model's maximum length
# and truncate longer reviews
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [6]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [7]:
tokenized_datasets['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be
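
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch of what happens to one sequence of token ids. It is an illustration only, not the tokenizer's actual implementation: the function name `pad_and_truncate` and the example ids are assumptions, and the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
from typing import Dict, List


def pad_and_truncate(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, then right-pad it with pad_id.

    The attention mask is 1 for real tokens and 0 for padding.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Illustrative ids only; max_length is tiny here so the result is readable.
short = pad_and_truncate([101, 7592, 2088, 102], max_length=6)
long = pad_and_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Every row therefore ends up with the same length, which is why `input_ids` and `attention_mask` in the tokenized dataset can be batched directly.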

## Step 3 : Set Up Training Arguments

In [8]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

training_args




TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False
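
As a rough sanity check on these settings, the number of optimizer steps can be estimated by hand. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch is kept (`dataloader_drop_last=False`, as in the output above). The function name is hypothetical.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Estimate optimizer steps for one device without gradient accumulation."""
    # The last, smaller batch still counts as a step, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 25,000 training reviews, batch size 16, 3 epochs (values from this notebook)
steps = total_training_steps(num_examples=25000, batch_size=16, num_epochs=3)
print(steps)  # 1563 steps per epoch * 3 epochs = 4689
```

With `_n_gpu=0` the run is on CPU, so several thousand BERT steps at sequence length 512 will take a long time.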

## Step 4 : Initialize The Model

In [23]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load The Pre-Trained Model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Initialize The Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 6 : Train The Model

In [26]:
trainer.train()

KeyboardInterrupt: 

## Step 7 : Evaluate The Model

In [None]:
results = trainer.evaluate()

print(results)
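
Because the `Trainer` above is created without a `compute_metrics` function, `trainer.evaluate()` reports the evaluation loss and timing figures but no accuracy. The following is a minimal sketch of an accuracy metric, not something the original notebook defines. It is written in plain Python so it runs without extra dependencies; `numpy.argmax` would be the more usual choice.

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair.

    Works with nested lists as well as array-like inputs that support
    indexing and iteration.
    """
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Made-up logits for four examples, used only to exercise the function.
fake_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.5]]
fake_labels = [0, 1, 1, 1]
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.75}
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`.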

## Step 8 : Save The Fine-Tuned Model

In [None]:
model.save_pretrained('./fine-tuned-model')

tokenizer.save_pretrained('./fine-tuned-tokenizer')

In [35]:
from transformers import pipeline
import arxiv
import pandas as pd

In [33]:
# Query To Fetch AI-Related Papers
query = 'ai OR artificial intelligence OR machine learning'

search = arxiv.Search(query=query, max_results=10, sort_by=arxiv.SortCriterion.SubmittedDate)

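# Fetch papers. Search.results() is deprecated in recent versions of the
# arxiv package, so the results are read through a Client instead.
client = arxiv.Client()
papers = []

for result in client.results(search):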
    papers.append({
        'published': result.published,
        'title': result.title,
        'abstract': result.summary,
        'categories': result.categories,
    })

# Convert To DataFrame
df = pd.DataFrame(papers)
pd.set_option('display.max_colwidth', None)
df.head(10)


Unnamed: 0,published,title,abstract,categories
0,2025-06-06 17:59:50+00:00,TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation,"Modern Earth observation (EO) increasingly leverages deep learning to harness\nthe scale and diversity of satellite imagery across sensors and regions. While\nrecent foundation models have demonstrated promising generalization across EO\ntasks, many remain limited by the scale, geographical coverage, and spectral\ndiversity of their training data, factors critical for learning globally\ntransferable representations. In this work, we introduce TerraFM, a scalable\nself-supervised learning model that leverages globally distributed Sentinel-1\nand Sentinel-2 imagery, combined with large spatial tiles and land-cover aware\nsampling to enrich spatial and semantic coverage. By treating sensing\nmodalities as natural augmentations in our self-supervised approach, we unify\nradar and optical inputs via modality-specific patch embeddings and adaptive\ncross-attention fusion. Our training strategy integrates local-global\ncontrastive learning and introduces a dual-centering mechanism that\nincorporates class-frequency-aware regularization to address long-tailed\ndistributions in land cover.TerraFM achieves strong generalization on both\nclassification and segmentation tasks, outperforming prior models on GEO-Bench\nand Copernicus-Bench. Our code and pretrained models are publicly available at:\nhttps://github.com/mbzuai-oryx/TerraFM .",[cs.CV]
1,2025-06-06 17:59:28+00:00,Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias,"Diagnosing deep neural networks (DNNs) through the eigenspectrum of weight\nmatrices has been an active area of research in recent years. At a high level,\neigenspectrum analysis of DNNs involves measuring the heavytailness of the\nempirical spectral densities (ESD) of weight matrices. It provides insight into\nhow well a model is trained and can guide decisions on assigning better\nlayer-wise training hyperparameters. In this paper, we address a challenge\nassociated with such eigenspectrum methods: the impact of the aspect ratio of\nweight matrices on estimated heavytailness metrics. We demonstrate that\nmatrices of varying sizes (and aspect ratios) introduce a non-negligible bias\nin estimating heavytailness metrics, leading to inaccurate model diagnosis and\nlayer-wise hyperparameter assignment. To overcome this challenge, we propose\nFARMS (Fixed-Aspect-Ratio Matrix Subsampling), a method that normalizes the\nweight matrices by subsampling submatrices with a fixed aspect ratio. Instead\nof measuring the heavytailness of the original ESD, we measure the average ESD\nof these subsampled submatrices. We show that measuring the heavytailness of\nthese submatrices with the fixed aspect ratio can effectively mitigate the\naspect ratio bias. We validate our approach across various optimization\ntechniques and application domains that involve eigenspectrum analysis of\nweights, including image classification in computer vision (CV) models,\nscientific machine learning (SciML) model training, and large language model\n(LLM) pruning. Our results show that despite its simplicity, FARMS uniformly\nimproves the accuracy of eigenspectrum analysis while enabling more effective\nlayer-wise hyperparameter assignment in these application domains. 
In one of\nthe LLM pruning experiments, FARMS reduces the perplexity of the LLaMA-7B model\nby 17.3% when compared with the state-of-the-art method.","[cs.LG, cs.AI]"
2,2025-06-06 17:58:54+00:00,Distillation Robustifies Unlearning,"Current LLM unlearning methods are not robust: they can be reverted easily\nwith a few steps of finetuning. This is true even for the idealized unlearning\nmethod of training to imitate an oracle model that was never exposed to\nunwanted information, suggesting that output-based finetuning is insufficient\nto achieve robust unlearning. In a similar vein, we find that training a\nrandomly initialized student to imitate an unlearned model transfers desired\nbehaviors while leaving undesired capabilities behind. In other words,\ndistillation robustifies unlearning. Building on this insight, we propose\nUnlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an\nunlearned model into a partially noised copy of itself. UNDO introduces a\ntunable tradeoff between compute cost and robustness, establishing a new Pareto\nfrontier on synthetic language and arithmetic tasks. At its strongest setting,\nUNDO matches the robustness of a model retrained from scratch with perfect data\nfiltering while using only 60-80% of the compute and requiring only 0.01% of\nthe pretraining data to be labeled. We also show that UNDO robustifies\nunlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP)\nbenchmark. Since distillation is widely used in practice, incorporating an\nunlearning step beforehand offers a convenient path to robust capability\nremoval.","[cs.LG, cs.AI]"
3,2025-06-06 17:58:36+00:00,Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding,"Despite recent progress in vision-language models (VLMs), holistic\nunderstanding of long-form video content remains a significant challenge,\npartly due to limitations in current benchmarks. Many focus on peripheral,\n``needle-in-a-haystack'' details, encouraging context-insensitive retrieval\nover deep comprehension. Others rely on large-scale, semi-automatically\ngenerated questions (often produced by language models themselves) that are\neasier for models to answer but fail to reflect genuine understanding. In this\npaper, we introduce MF$^2$, a new benchmark for evaluating whether models can\ncomprehend, consolidate, and recall key narrative information from full-length\nmovies (50-170 minutes long). MF$^2$ includes over 50 full-length,\nopen-licensed movies, each paired with manually constructed sets of claim pairs\n-- one true (fact) and one plausible but false (fib), totalling over 850 pairs.\nThese claims target core narrative elements such as character motivations and\nemotions, causal chains, and event order, and refer to memorable moments that\nhumans can recall without rewatching the movie. Instead of multiple-choice\nformats, we adopt a binary claim evaluation protocol: for each pair, models\nmust correctly identify both the true and false claims. This reduces biases\nlike answer ordering and enables a more precise assessment of reasoning. Our\nexperiments demonstrate that both open-weight and closed state-of-the-art\nmodels fall well short of human performance, underscoring the relative ease of\nthe task for humans and their superior ability to retain and reason over\ncritical narrative information -- an ability current VLMs lack.","[cs.CV, cs.CL, cs.LG]"
4,2025-06-06 17:51:26+00:00,Accurately simulating core-collapse self-interacting dark matter halos,"The properties of satellite halos provide a promising probe for dark matter\n(DM) physics. Observations motivate current efforts to explain surprisingly\ncompact DM halos. If DM is not collisionless but has strong self-interactions,\nhalos can undergo gravothermal collapse, leading to higher densities in the\ncentral region of the halo. However, it is challenging to model this collapse\nphase from first principles. To improve on this, we seek to better understand\nnumerical challenges and convergence properties of self-interacting dark matter\n(SIDM) N-body simulations in the collapse phase. Especially we aim for a better\nunderstanding of the evolution of satellite halos. To do so, we run SIDM N-body\nsimulations of a low mass halo in isolation and within an external\ngravitational potential. The simulation setup is motivated by the perturber of\nthe stellar stream GD-1. We find that the halo evolution is very sensitive to\nenergy conservation errors, and a too large SIDM kernel size can artificially\nspeed up the collapse. Moreover, we demonstrate that the King model can\ndescribe the density profile at small radii for the late stages that we have\nsimulated. Furthermore, for our highest-resolved simulation (N = 5x10^7) we\nmake the data public. It can serve as a benchmark. Overall, we find that the\ncurrent numerical methods do not suffer from convergence problems in the late\ncollapse phase and provide guidance on how to choose numerical parameters, e.g.\nthat the energy conservation error is better kept well below 1%. This allows to\nrun simulations of halos becoming concentrated enough to explain observations\nof GD-1 like stellar streams or strong gravitational lensing systems.","[astro-ph.CO, astro-ph.GA, hep-ph]"
5,2025-06-06 17:48:23+00:00,Cartridges: Lightweight and general-purpose long context representations via self-study,"Large language models are often used to answer queries grounded in large text\ncorpora (e.g. codebases, legal documents, or chat histories) by placing the\nentire corpus in the context window and leveraging in-context learning (ICL).\nAlthough current models support contexts of 100K-1M tokens, this setup is\ncostly to serve because the memory consumption of the KV cache scales with\ninput length. We explore an alternative: training a smaller KV cache offline on\neach corpus. At inference time, we load this trained KV cache, which we call a\nCartridge, and decode a response. Critically, the cost of training a Cartridge\ncan be amortized across all the queries referencing the same corpus. However,\nwe find that the naive approach of training the Cartridge with next-token\nprediction on the corpus is not competitive with ICL. Instead, we propose\nself-study, a training recipe in which we generate synthetic conversations\nabout the corpus and train the Cartridge with a context-distillation objective.\nWe find that Cartridges trained with self-study replicate the functionality of\nICL, while being significantly cheaper to serve. On challenging long-context\nbenchmarks, Cartridges trained with self-study match ICL performance while\nusing 38.6x less memory and enabling 26.4x higher throughput. Self-study also\nextends the model's effective context length (e.g. from 128k to 484k tokens on\nMTOB) and surprisingly, leads to Cartridges that can be composed at inference\ntime without retraining.","[cs.CL, cs.AI, cs.LG]"
6,2025-06-06 17:47:27+00:00,Integrating Complexity and Biological Realism: High-Performance Spiking Neural Networks for Breast Cancer Detection,"Spiking Neural Networks (SNNs) event-driven nature enables efficient encoding\nof spatial and temporal features, making them suitable for dynamic\ntime-dependent data processing. Despite their biological relevance, SNNs have\nseen limited application in medical image recognition due to difficulties in\nmatching the performance of conventional deep learning models. To address this,\nwe propose a novel breast cancer classification approach that combines SNNs\nwith Lempel-Ziv Complexity (LZC) a computationally efficient measure of\nsequence complexity. LZC enhances the interpretability and accuracy of\nspike-based models by capturing structural patterns in neural activity. Our\nstudy explores both biophysical Leaky Integrate-and-Fire (LIF) and\nprobabilistic Levy-Baxter (LB) neuron models under supervised, unsupervised,\nand hybrid learning regimes. Experiments were conducted on the Breast Cancer\nWisconsin dataset using numerical features derived from medical imaging.\nLB-based models consistently exceeded 90.00% accuracy, while LIF-based models\nreached over 85.00%. The highest accuracy of 98.25% was achieved using an\nANN-to-SNN conversion method applied to both neuron models comparable to\ntraditional deep learning with back-propagation, but at up to 100 times lower\ncomputational cost. This hybrid approach merges deep learning performance with\nthe efficiency and plausibility of SNNs, yielding top results at lower\ncomputational cost. We hypothesize that the synergy between temporal-coding,\nspike-sparsity, and LZC-driven complexity analysis enables more-efficient\nfeature extraction. 
Our findings demonstrate that SNNs combined with LZC offer\npromising, biologically plausible alternative to conventional neural networks\nin medical diagnostics, particularly for resource-constrained or real-time\nsystems.","[cs.NE, eess.IV, q-bio.NC]"
7,2025-06-06 17:43:00+00:00,PyGemini: Unified Software Development towards Maritime Autonomy Systems,"Ensuring the safety and certifiability of autonomous surface vessels (ASVs)\nrequires robust decision-making systems, supported by extensive simulation,\ntesting, and validation across a broad range of scenarios. However, the current\nlandscape of maritime autonomy development is fragmented -- relying on\ndisparate tools for communication, simulation, monitoring, and system\nintegration -- which hampers interdisciplinary collaboration and inhibits the\ncreation of compelling assurance cases, demanded by insurers and regulatory\nbodies. Furthermore, these disjointed tools often suffer from performance\nbottlenecks, vendor lock-in, and limited support for continuous integration\nworkflows. To address these challenges, we introduce PyGemini, a permissively\nlicensed, Python-native framework that builds on the legacy of Autoferry Gemini\nto unify maritime autonomy development. PyGemini introduces a novel\nConfiguration-Driven Development (CDD) process that fuses Behavior-Driven\nDevelopment (BDD), data-oriented design, and containerization to support\nmodular, maintainable, and scalable software architectures. The framework\nfunctions as a stand-alone application, cloud-based service, or embedded\nlibrary -- ensuring flexibility across research and operational contexts. We\ndemonstrate its versatility through a suite of maritime tools -- including 3D\ncontent generation for simulation and monitoring, scenario generation for\nautonomy validation and training, and generative artificial intelligence\npipelines for augmenting imagery -- thereby offering a scalable, maintainable,\nand performance-oriented foundation for future maritime robotics and autonomy\nresearch.","[cs.RO, cs.SE, cs.SY, eess.SY, D.2.11; I.6.2; I.2.9]"
8,2025-06-06 17:40:12+00:00,Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens,"Offline reinforcement learning (RL) is crucial when online exploration is\ncostly or unsafe but often struggles with high epistemic uncertainty due to\nlimited data. Existing methods rely on fixed conservative policies, restricting\nadaptivity and generalization. To address this, we propose Reflect-then-Plan\n(RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach.\nRefPlan unifies uncertainty modeling and MB planning by recasting planning as\nBayesian posterior estimation. At deployment, it updates a belief over\nenvironment dynamics using real-time observations, incorporating uncertainty\ninto MB planning via marginalization. Empirical results on standard benchmarks\nshow that RefPlan significantly improves the performance of conservative\noffline RL policies. In particular, RefPlan maintains robust performance under\nhigh epistemic uncertainty and limited data, while demonstrating resilience to\nchanging environment dynamics, improving the flexibility, generalizability, and\nrobustness of offline-learned policies.","[cs.AI, cs.LG]"
9,2025-06-06 17:39:32+00:00,An Optimized Franz-Parisi Criterion and its Equivalence with SQ Lower Bounds,"Bandeira et al. (2022) introduced the Franz-Parisi (FP) criterion for\ncharacterizing the computational hard phases in statistical detection problems.\nThe FP criterion, based on an annealed version of the celebrated Franz-Parisi\npotential from statistical physics, was shown to be equivalent to low-degree\npolynomial (LDP) lower bounds for Gaussian additive models, thereby connecting\ntwo distinct approaches to understanding the computational hardness in\nstatistical inference. In this paper, we propose a refined FP criterion that\naims to better capture the geometric ``overlap"" structure of statistical\nmodels. Our main result establishes that this optimized FP criterion is\nequivalent to Statistical Query (SQ) lower bounds -- another foundational\nframework in computational complexity of statistical inference. Crucially, this\nequivalence holds under a mild, verifiable assumption satisfied by a broad\nclass of statistical models, including Gaussian additive models, planted sparse\nmodels, as well as non-Gaussian component analysis (NGCA), single-index (SI)\nmodels, and convex truncation detection settings. For instance, in the case of\nconvex truncation tasks, the assumption is equivalent with the Gaussian\ncorrelation inequality (Royen, 2014) from convex geometry.\n In addition to the above, our equivalence not only unifies and simplifies the\nderivation of several known SQ lower bounds -- such as for the NGCA model\n(Diakonikolas et al., 2017) and the SI model (Damian et al., 2024) -- but also\nyields new SQ lower bounds of independent interest, including for the\ncomputational gaps in mixed sparse linear regression (Arpino et al., 2023) and\nconvex truncation (De et al., 2023).","[math.ST, cond-mat.stat-mech, cs.CC, stat.ML, stat.TH]"
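
The free-text query also returns papers outside machine learning (row 4 above is tagged only `astro-ph.CO`, `astro-ph.GA`, and `hep-ph`). One way to narrow the list is to filter on the `categories` field. This is a sketch: the set `AI_CATEGORIES` and the helper `is_ai_paper` are assumptions chosen for illustration, not part of the original notebook.

```python
# Assumed set of arXiv categories to treat as "AI-related"; adjust as needed.
AI_CATEGORIES = {"cs.AI", "cs.LG", "cs.CL", "cs.CV", "cs.NE", "stat.ML"}


def is_ai_paper(categories, wanted=AI_CATEGORIES):
    """True if at least one of the paper's categories is in the wanted set."""
    return any(cat in wanted for cat in categories)


# Two rows taken from the table above.
papers = [
    {
        "title": "Distillation Robustifies Unlearning",
        "categories": ["cs.LG", "cs.AI"],
    },
    {
        "title": "Accurately simulating core-collapse self-interacting dark matter halos",
        "categories": ["astro-ph.CO", "astro-ph.GA", "hep-ph"],
    },
]

kept = [p["title"] for p in papers if is_ai_paper(p["categories"])]
print(kept)  # ['Distillation Robustifies Unlearning']
```

On the DataFrame, the same check could be applied with `df[df['categories'].apply(is_ai_paper)]`.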


In [37]:
# Example abstract from the arXiv results
abstract = df['abstract'][0]

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize the abstract
summarization_result = summarizer(abstract)

KeyboardInterrupt



In [None]:
summarization_result[0]['summary_text']