New submissions for Friday, 21 June 2024 (showing 794 of 794 entries) #1153

Open
LeeKyungwook opened this issue Jun 24, 2024 · 0 comments

Keyword: detection

Title:

      The Significance of Latent Data Divergence in Predicting System Degradation
  • Authors: Miguel Fernandes, Catarina Silva, Alberto Cardoso, Bernardete Ribeiro
  • Subjects:
    Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Condition-Based Maintenance is pivotal in enabling the early detection of potential failures in engineering systems, where precise prediction of the Remaining Useful Life is essential for effective maintenance and operation. However, a predominant focus in the field centers on predicting the Remaining Useful Life using unprocessed or minimally processed data, frequently neglecting the intricate dynamics inherent in the dataset. In this work we introduce a novel methodology grounded in the analysis of statistical similarity within latent data from system components. Leveraging a specifically designed architecture based on a Vector Quantized Variational Autoencoder, we create a sequence of discrete vectors which is used to estimate system-specific priors. We infer the similarity between systems by evaluating the divergence of these priors, offering a nuanced understanding of individual system behaviors. The efficacy of our approach is demonstrated through experiments on the NASA commercial modular aero-propulsion system simulation (C-MAPSS) dataset. Our validation not only underscores the potential of our method in advancing the study of latent statistical divergence but also demonstrates its superiority over existing techniques.

Title:

      GROD: Enhancing Generalization of Transformer with Out-of-Distribution Detection
  • Authors: Yijin Zhou, Yuguang Wang
  • Subjects:
    Machine Learning (cs.LG); Probability (math.PR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Transformer networks excel in natural language processing (NLP) and computer vision (CV) tasks. However, they face challenges in generalizing to Out-of-Distribution (OOD) datasets, that is, data whose distribution differs from that seen during training. OOD detection aims to distinguish data that deviates from the expected distribution, while maintaining optimal performance on in-distribution (ID) data. This paper introduces a novel approach based on OOD detection, termed the Generate Rounded OOD Data (GROD) algorithm, which significantly bolsters the generalization performance of transformer networks across various tasks. GROD is motivated by our new Probably Approximately Correct (PAC) theory of OOD detection for transformers. The transformer is learnable in terms of OOD detection; that is, when the data is sufficient, the outlier can be well represented. By penalizing the misclassification of OOD data within the loss function and generating synthetic outliers, GROD guarantees learnability and refines the decision boundaries between inlier and outlier. This strategy demonstrates robust adaptability and general applicability across different data types. Evaluated across diverse OOD detection tasks in NLP and CV, GROD achieves SOTA regardless of data format. On average, it reduces the SOTA FPR@95 from 21.97% to 0.12%, and improves AUROC from 93.62% to 99.98% on image classification tasks, and the SOTA FPR@95 by 12.89% and AUROC by 2.27% in detecting semantic text outliers. The code is available at https://anonymous.4open.science/r/GROD-OOD-Detection-with-transformers-B70F.
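
    The FPR@95 metric quoted in this abstract is commonly read as the false positive rate at the threshold where 95% of in-distribution samples are still accepted. The following is a minimal sketch of that common definition, not the GROD authors' evaluation code. It assumes that a higher score means "more in-distribution" and that ID is the positive class; the function name `fpr_at_tpr` and its parameters are hypothetical.

    ```python
    import math


    def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
        """Fraction of OOD samples accepted at the threshold that keeps `tpr` of ID samples.

        Assumption: higher score = more in-distribution, ID is the positive class.
        """
        if not id_scores or not ood_scores:
            raise ValueError("both score lists must be non-empty")
        ranked = sorted(id_scores, reverse=True)
        # Smallest number of top-ranked ID samples that reaches the target TPR.
        k = math.ceil(tpr * len(ranked))
        threshold = ranked[k - 1]
        # An OOD sample scoring at or above the threshold is a false positive.
        false_positives = sum(1 for s in ood_scores if s >= threshold)
        return false_positives / len(ood_scores)
    ```

    A lower value is better: it means fewer outliers slip through when the detector is tuned to keep most in-distribution inputs.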

Title:

      Skin Cancer Images Classification using Transfer Learning Techniques
  • Authors: Md Sirajul Islam, Sanjeev Panta
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Skin cancer is one of the most common and deadliest types of cancer. Early diagnosis of skin cancer at a benign stage is critical to reducing cancer mortality. To detect skin cancer at an earlier stage, an automated system that can save the lives of many patients is essential. Many previous studies have addressed the problem of skin cancer diagnosis using various deep learning and transfer learning models. However, the existing literature is limited in accuracy and relies on time-consuming procedures. In this work, we applied five different pre-trained transfer learning approaches for binary classification of skin cancer into benign and malignant stages. To increase the accuracy of these models, we fine-tune different layers and activation functions. We used a publicly available ISIC dataset to evaluate the transfer learning approaches. For model stability, data augmentation techniques are applied to improve the randomness of the input dataset. These approaches are evaluated using different hyperparameters such as batch sizes, epochs, and optimizers. The experimental results show that the ResNet-50 model provides an accuracy of 0.935, F1-score of 0.86, and precision of 0.94.

Title:

      As Advertised? Understanding the Impact of Influencer VPN Ads
  • Authors: Omer Akgul, Richard Roberts, Emma Shroyer, Dave Levin, Michelle L. Mazurek
  • Subjects:
    Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Influencer VPN ads (sponsored segments) on YouTube often disseminate misleading information about both VPNs, and security & privacy more broadly. However, it remains unclear how (or whether) these ads affect users' perceptions and knowledge about VPNs. In this work, we explore the relationship between YouTube VPN ad exposure and users' mental models of VPNs, security, and privacy. We use a novel VPN ad detection model to calculate the ad exposure of 217 participants via their YouTube watch histories, and we develop scales to characterize their mental models in relation to claims commonly made in VPN ads. Through (pre-registered) regression-based analysis, we find that exposure to VPN ads is significantly correlated with familiarity with VPN brands and increased belief in (hyperbolic) threats. While not specific to VPNs, these threats are often discussed in VPN ads. In contrast, although many participants agree with both factual and misleading mental models of VPNs that often appear in ads, we find no significant correlation between exposure to VPN ads and these mental models. These findings suggest that, if VPN ads do impact mental models, then it is predominantly emotional (i.e., threat perceptions) rather than technical.

Title:

      A machine learning pipeline for automated insect monitoring
  • Authors: Aditya Jain, Fagner Cunha, Michael Bunsen, Léonard Pasi, Anna Viklund, Maxim Larrivée, David Rolnick
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Climate change and other anthropogenic factors have led to a catastrophic decline in insects, endangering both biodiversity and the ecosystem services on which human society depends. Data on insect abundance, however, remains woefully inadequate. Camera traps, conventionally used for monitoring terrestrial vertebrates, are now being modified for insects, especially moths. We describe a complete, open-source machine learning-based software pipeline for automated monitoring of moths via camera traps, including object detection, moth/non-moth classification, fine-grained identification of moth species, and tracking individuals. We believe that our tools, which are already in use across three continents, represent the future of massively scalable data collection in entomology.

Title:

      Real-time Yemeni Currency Detection
  • Authors: Edrees AL-Edreesi, Ghaleb Al-Gaphari
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Banknote recognition is a major problem faced by visually challenged people. We therefore propose an application to help visually challenged people identify the different types of Yemeni currency using deep learning techniques. As money has a significant role in daily life for any business transaction, real-time detection and recognition of banknotes becomes necessary for a person, especially one who is blind or visually impaired, or for a system that sorts the data. This paper presents a real-time Yemeni currency detection system for visually impaired persons. The proposed system exploits a deep learning approach to help visually impaired people successfully recognize banknotes. For real-time recognition, we have deployed the system in a mobile application.

Title:

      NoiSec: Harnessing Noise for Security against Adversarial and Backdoor Attacks
  • Authors: Md Hasan Shahriar, Ning Wang, Y. Thomas Hou, Wenjing Lou
  • Subjects:
    Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    The exponential adoption of machine learning (ML) is propelling the world into a future of intelligent automation and data-driven solutions. However, the proliferation of malicious data manipulation attacks against ML, namely adversarial and backdoor attacks, jeopardizes its reliability in safety-critical applications. The existing detection methods against such attacks are built upon assumptions, limiting them in diverse practical scenarios. Thus, motivated by the need for a more robust and unified defense mechanism, we investigate the shared traits of adversarial and backdoor attacks and propose NoiSec that leverages solely the noise, the foundational root cause of such attacks, to detect any malicious data alterations. NoiSec is a reconstruction-based detector that disentangles the noise from the test input, extracts the underlying features from the noise, and leverages them to recognize systematic malicious manipulation. Experimental evaluations conducted on the CIFAR10 dataset demonstrate the efficacy of NoiSec, achieving AUROC scores exceeding 0.954 and 0.852 under white-box and black-box adversarial attacks, respectively, and 0.992 against backdoor attacks. Notably, NoiSec maintains a high detection performance, keeping the false positive rate within only 1%. Comparative analyses against MagNet-based baselines reveal NoiSec's superior performance across various attack scenarios.

Title:

      A transformer boosted UNet for smoke segmentation in complex backgrounds in multispectral LandSat imagery
  • Authors: Jixue Liu, Jiuyong Li, Stefan Peters, Liang Zhao
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Many studies have been done to detect smoke from satellite imagery. However, these prior methods are still not effective in detecting various kinds of smoke in complex backgrounds. Smoke presents challenges in detection due to variations in density, color, lighting, and backgrounds such as clouds, haze, and/or mist, as well as the contextual nature of thin smoke. This paper addresses these challenges by proposing a new segmentation model called VTrUNet, which consists of a virtual band construction module to capture spectral patterns and a transformer boosted UNet to capture long range contextual features. The model takes imagery of six bands as input: red, green, blue, near infrared, and two shortwave infrared bands. To show the advantages of the proposed model, the paper presents extensive results for various possible model architectures improving UNet and draws interesting conclusions, including that adding more modules to a model does not always lead to better performance. The paper also compares the proposed model with very recently proposed and related models for smoke segmentation and shows that the proposed model performs the best and makes significant improvements in prediction performance.

Title:

      PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
  • Authors: Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang
  • Subjects:
    Computation and Language (cs.CL); Machine Learning (cs.LG); Genomics (q-bio.GN)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.

Title:

      Utility Pole Fire Risk Inspection from 2D Street-Side Images
  • Authors: Rajanie Prabha, Kopal Nihar
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    In recent years, California's electrical grid has confronted mounting challenges stemming from aging infrastructure and a landscape increasingly susceptible to wildfires. This paper presents a comprehensive framework utilizing computer vision techniques to address wildfire risk within the state's electrical grid, with a particular focus on vulnerable utility poles. These poles are susceptible to fire outbreaks or structural failure during extreme weather events. The proposed pipeline harnesses readily available Google Street View imagery to identify utility poles and assess their proximity to surrounding vegetation, as well as to determine any inclination angles. The early detection of potential risks associated with utility poles is pivotal for forestalling wildfire ignitions and informing strategic investments, such as undergrounding vulnerable poles and powerlines. Moreover, this study underscores the significance of data-driven decision-making in bolstering grid resilience, particularly concerning Public Safety Power Shutoffs. By fostering collaboration among utilities, policymakers, and researchers, this pipeline aims to solidify the electric grid's resilience and safeguard communities against the escalating threat of wildfires.

Title:

      Enhancing supply chain security with automated machine learning
  • Authors: Haibo Wang, Lutfu S.Sua, Bahram Alidaee
  • Subjects:
    Machine Learning (cs.LG); General Economics (econ.GN); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    This study tackles the complexities of global supply chains, which are increasingly vulnerable to disruptions caused by port congestion, material shortages, and inflation. To address these challenges, we explore the application of machine learning methods, which excel in predicting and optimizing solutions based on large datasets. Our focus is on enhancing supply chain security through fraud detection, maintenance prediction, and material backorder forecasting. We introduce an automated machine learning framework that streamlines data analysis, model construction, and hyperparameter optimization for these tasks. By automating these processes, our framework improves the efficiency and effectiveness of supply chain security measures. Our research identifies key factors that influence machine learning performance, including sampling methods, categorical encoding, feature selection, and hyperparameter optimization. We demonstrate the importance of considering these factors when applying machine learning to supply chain challenges. Traditional mathematical programming models often struggle to cope with the complexity of large-scale supply chain problems. Our study shows that machine learning methods can provide a viable alternative, particularly when dealing with extensive datasets and complex patterns. The automated machine learning framework presented in this study offers a novel approach to supply chain security, contributing to the existing body of knowledge in the field. Its comprehensive automation of machine learning processes makes it a valuable contribution to the domain of supply chain management.

Title:

      Transferable Watermarking to Self-supervised Pre-trained Graph Encoders by Trigger Embeddings
  • Authors: Xiangyu Zhao, Hanzhou Wu, Xinpeng Zhang
  • Subjects:
    Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Recent years have witnessed the prosperous development of Graph Self-supervised Learning (GSSL), which enables the pre-training of transferable foundation graph encoders. However, the easy-to-plug-in nature of such encoders makes them vulnerable to copyright infringement. To address this issue, we develop a novel watermarking framework to protect graph encoders in GSSL settings. The key idea is to force the encoder to map a set of specially crafted trigger instances into a unique compact cluster in the output embedding space during model pre-training. Consequently, when the encoder is stolen and concatenated with any downstream classifiers, the resulting model inherits the backdoor of the encoder and predicts the trigger instances to be in a single category with high probability regardless of the ground truth. Experimental results have shown that the embedded watermark can be transferred to various downstream tasks in black-box settings, including node classification, link prediction and community detection, which forms a reliable watermark verification system for GSSL in reality. This approach also shows satisfactory performance in terms of model fidelity, reliability and robustness.

Title:

      A Federated Learning Approach for Multi-stage Threat Analysis in Advanced Persistent Threat Campaigns
  • Authors: Florian Nelles, Abbas Yazdinejad, Ali Dehghantanha, Reza M. Parizi, Gautam Srivastava
  • Subjects:
    Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Multi-stage threats like advanced persistent threats (APT) pose severe risks by stealing data and destroying infrastructure, with detection being challenging. APTs use novel attack vectors and evade signature-based detection by obfuscating their network presence, often going unnoticed due to their novelty. Although machine learning models offer high accuracy, they still struggle to identify true APT behavior, overwhelming analysts with excessive data. Effective detection requires training on multiple datasets from various clients, which introduces privacy issues under regulations like GDPR. To address these challenges, this paper proposes a novel 3-phase unsupervised federated learning (FL) framework to detect APTs. It identifies unique log event types, extracts suspicious patterns from related log events, and orders them by complexity and frequency. The framework ensures privacy through a federated approach and enhances security using Paillier's partial homomorphic encryption. Tested on the SoTM 34 dataset, our framework compares favorably against traditional methods, demonstrating efficient pattern extraction and analysis from log files, reducing analyst workload, and maintaining stringent data privacy. This approach addresses significant gaps in current methodologies, offering a robust solution to APT detection in compliance with privacy laws.

Title:

      Data Contamination Can Cross Language Barriers
  • Authors: Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang
  • Subjects:
    Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be "not even wrong", as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from this https URL.
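
    The benchmark modification described in this abstract (replacing each question's false choices with correct answers taken from other questions) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' released code: the question format (dicts with `question`, `choices`, and an integer `answer` index) and the function name `swap_in_foreign_answers` are hypothetical.

    ```python
    import random


    def swap_in_foreign_answers(index, questions, rng):
        """Return a copy of questions[index] whose wrong choices are replaced
        by correct answers borrowed from other questions.

        Assumed (hypothetical) format: each question is a dict with keys
        'question', 'choices' (list of str) and 'answer' (index of the correct choice).
        """
        q = questions[index]
        correct = q["choices"][q["answer"]]
        # Correct answers of all other questions, de-duplicated, order preserved.
        pool = [o["choices"][o["answer"]] for j, o in enumerate(questions) if j != index]
        pool = [c for c in dict.fromkeys(pool) if c != correct]
        n_wrong = len(q["choices"]) - 1
        if len(pool) < n_wrong:
            raise ValueError("not enough other questions to fill the choices")
        borrowed = iter(rng.sample(pool, n_wrong))
        # Keep the correct choice in place; replace every other slot.
        new_choices = [
            c if i == q["answer"] else next(borrowed)
            for i, c in enumerate(q["choices"])
        ]
        return {"question": q["question"], "choices": new_choices, "answer": q["answer"]}
    ```

    Passing a seeded `random.Random` instance as `rng` keeps the modified benchmark reproducible, so accuracy on the original and modified versions can be compared across runs.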

Title:

      Media Forensics and Deepfake Systematic Survey
  • Authors: Nadeem Jabbar CH, Aqib Saghir, Ayaz Ahmad Meer, Salman Ahmad Sahi, Bilal Hassan, Siddiqui Muhammad Yasir
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Deepfake is a generative deep learning algorithm that creates or changes facial features in a very realistic way, making it hard to differentiate the real from the fake features. It can be used to make movies look better as well as to spread false information by imitating famous people. In this paper, many different ways to make a Deepfake are explained, analyzed, and separated categorically. Using Deepfake datasets, models are trained and tested for reliability through experiments. Deepfakes are a type of facial manipulation that allow people to change their entire faces, identities, attributes, and expressions. The trends in the available Deepfake datasets are also discussed, with a focus on how they have changed. Using deep learning, a general Deepfake detection model is made. Moreover, the problems in making and detecting Deepfakes are also mentioned. As a result of this survey, it is expected that the development of new Deepfake-based imaging tools will speed up in the future. This survey gives an in-depth review of methods for manipulating face images and various techniques to spot altered face images. Four types of facial manipulation are specifically discussed: attribute manipulation, expression swap, entire face synthesis, and identity swap. For every manipulation category, we provide information on manipulation techniques, significant benchmarks for the technical evaluation of counterfeit detection techniques, available public databases, and a summary of the outcomes of all such analyses. From all of the topics in the survey, we focus on the most recent development of Deepfake, showing its advances and obstacles in detecting fake images.

Title:

      M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere
  • Authors: Mengqiu Xu, Ming Wu, Kaixin Chen, Yixiang Huang, Mingrui Xu, Yujia Yang, Yiqing Feng, Yiying Guo, Bin Huang, Dongliang Chang, Zhenwei Shi, Chuang Zhang, Zhanyu Ma, Jun Guo
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Marine fog poses a significant hazard to global shipping, necessitating effective detection and forecasting to reduce economic losses. In recent years, several machine learning (ML) methods have demonstrated superior detection accuracy compared to traditional meteorological methods. However, most of these works are developed on proprietary datasets, and the few publicly accessible datasets are often limited to simplistic toy scenarios for research purposes. To advance the field, we have collected nearly a decade's worth of multi-modal data related to continuous marine fog stages from four series of geostationary meteorological satellites, along with meteorological observations and numerical analysis, covering 15 marine regions globally where maritime fog frequently occurs. Through pixel-level manual annotation by meteorological experts, we present the most comprehensive marine fog detection and forecasting dataset to date, named M4Fog, to bridge ocean and atmosphere. The dataset comprises 68,000 "super data cubes" along four dimensions: elements, latitude, longitude and time, with a temporal resolution of half an hour and a spatial resolution of 1 kilometer. Considering practical applications, we have defined and explored three meaningful tracks with multi-metric evaluation systems: static or dynamic marine fog detection, and spatio-temporal forecasting for cloud images. Extensive benchmarking and experiments demonstrate the rationality and effectiveness of the construction concept for the proposed M4Fog. The data and code are available to all researchers through cloud platforms to develop ML-driven marine fog solutions and mitigate adverse impacts on human activities.

Title:

      PPT-GNN: A Practical Pre-Trained Spatio-Temporal Graph Neural Network for Network Security
  • Authors: Louis Van Langendonck, Ismael Castell-Uroz, Pere Barlet-Ros
  • Subjects:
    Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Recent works have demonstrated the potential of Graph Neural Networks (GNN) for network intrusion detection. Despite their advantages, a significant gap persists between real-world scenarios, where detection speed is critical, and existing proposals, which operate on large graphs representing several hours of traffic. This gap results in unrealistic operational conditions and impractical detection delays. Moreover, existing models do not generalize well across different networks, hampering their deployment in production environments. To address these issues, we introduce PPTGNN, a practical spatio-temporal GNN for intrusion detection. PPTGNN enables near real-time predictions, while better capturing the spatio-temporal dynamics of network attacks. PPTGNN employs self-supervised pre-training for improved performance and reduced dependency on labeled data. We evaluate PPTGNN on three public datasets and show that it significantly outperforms state-of-the-art models, such as E-ResGAT and E-GraphSAGE, with an average accuracy improvement of 10.38%. Finally, we show that a pre-trained PPTGNN can easily be fine-tuned to unseen networks with minimal labeled examples. This highlights the potential of PPTGNN as a general, large-scale pre-trained model that can effectively operate in diverse network environments.

Title:

      Effective Edge-wise Representation Learning in Edge-Attributed Bipartite Graphs
  • Authors: Hewen Wang, Renchi Yang, Xiaokui Xiao
  • Subjects:
    Machine Learning (cs.LG); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Graph representation learning (GRL) aims to encode graph elements into informative vector representations, which can be used in downstream tasks for analyzing graph-structured data, and has seen extensive applications in various domains. However, the majority of extant studies on GRL are geared towards generating node representations, which cannot be readily employed to perform edge-based analytics tasks in edge-attributed bipartite graphs (EABGs) that pervade the real world, e.g., spam review detection in customer-product reviews and identifying fraudulent transactions in user-merchant networks. Compared to node-wise GRL, learning edge representations (ERL) on such graphs is challenging due to the need to incorporate the structure and attribute semantics from the perspective of edges while considering the separate influence of two heterogeneous node sets U and V in bipartite graphs. To our knowledge, despite its importance, limited research has been devoted to this frontier, and existing workarounds all suffer from sub-par results. Motivated by this, this paper designs EAGLE, an effective ERL method for EABGs. Building on an in-depth and rigorous theoretical analysis, we propose the factorized feature propagation (FFP) scheme for edge representations with adequate incorporation of long-range dependencies of edges/features without incurring tremendous computation overheads. We further ameliorate FFP as a dual-view FFP by taking into account the influences from nodes in U and V separately in ERL. Extensive experiments on 5 real datasets showcase the effectiveness of the proposed EAGLE models in semi-supervised edge classification tasks. In particular, EAGLE can attain a considerable gain of at most 38.11% in AP and 1.86% in AUC when compared to the best baselines.

Title:

      Strengthening Layer Interaction via Dynamic Layer Attention
  • Authors: Kaishen Wang, Xun Xia, Jian Liu, Zhang Yi, Tao He
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    In recent years, employing layer attention to enhance interaction among hierarchical layers has proven to be a significant advancement in building network structures. In this paper, we delve into the distinction between layer attention and the general attention mechanism, noting that existing layer attention methods achieve layer interaction on fixed feature maps in a static manner. These static layer attention methods limit the ability for context feature extraction among layers. To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture. The DLA comprises dual paths, where the forward path utilizes an improved recurrent neural network block, named Dynamic Sharing Unit (DSU), for context feature extraction. The backward path updates features using these shared context representations. Finally, the attention mechanism is applied to these dynamically refreshed feature maps among layers. Experimental results demonstrate the effectiveness of the proposed DLA architecture, outperforming other state-of-the-art methods in image recognition and object detection tasks. Additionally, the DSU block has been evaluated as an efficient plugin in the proposed DLA architecture. The code is available at this https URL.

Title:

      Towards a multimodal framework for remote sensing image change retrieval and captioning
  • Authors: Roger Ferrod, Luigi Di Caro, Dino Ienco
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Recently, there has been increasing interest in multimodal applications that integrate text with other modalities, such as images, audio and video, to facilitate natural language interactions with multimodal AI systems. While applications involving standard modalities have been extensively explored, there is still a lack of investigation into specific data modalities such as remote sensing (RS) data. Despite the numerous potential applications of RS data, including environmental protection, disaster monitoring and land planning, available solutions are predominantly focused on specific tasks like classification, captioning and retrieval. These solutions often overlook the unique characteristics of RS data, such as its capability to systematically provide information on the same geographical areas over time. This ability enables continuous monitoring of changes in the underlying landscape. To address this gap, we propose a novel foundation model for bi-temporal RS image pairs, in the context of change detection analysis, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval. By jointly training a contrastive encoder and captioning decoder, our model adds text-image retrieval capabilities, in the context of bi-temporal change detection, while maintaining captioning performance that is comparable to the state of the art. We release the source code and pretrained weights at: this https URL.

Title:

      Lost in UNet: Improving Infrared Small Target Detection by Underappreciated Local Features
  • Authors: Wuzhou Quan, Wei Zhao, Weiming Wang, Haoran Xie, Fu Lee Wang, Mingqiang Wei
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Many targets are often very small in infrared images due to the long-distance imaging mechanism. UNet and its variants, as popular detection backbone networks, downsample the local features early and cause the irreversible loss of these local features, leading to both missed and false detections of small targets in infrared images. We propose HintU, a novel network to recover the local features lost by various UNet-based methods for effective infrared small target detection. HintU has two key contributions. First, it introduces the "Hint" mechanism for the first time, i.e., leveraging the prior knowledge of target locations to highlight critical local features. Second, it improves the mainstream UNet-based architecture to preserve target pixels even after downsampling. HintU can shift the focus of various networks (e.g., vanilla UNet, UNet++, UIUNet, MiM+, and HCFNet) from the irrelevant background pixels to a more restricted area from the beginning. Experimental results on three datasets NUDT-SIRST, SIRSTv2 and IRSTD1K demonstrate that HintU enhances the performance of existing methods with only an additional 1.88 ms cost (on RTX Titan). Additionally, the explicit constraints of HintU enhance the generalization ability of UNet-based methods. Code is available at this https URL.

Title:

      Snowy Scenes, Clear Detections: A Robust Model for Traffic Light Detection in Adverse Weather Conditions
  • Authors: Shivank Garg, Abhishek Baghel, Amit Agarwal, Durga Toshniwal
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    With the rise of autonomous vehicles and advanced driver-assistance systems (ADAS), ensuring reliable object detection in all weather conditions is crucial for safety and efficiency. Adverse weather like snow, rain, and fog presents major challenges for current detection systems, often resulting in failures and potential safety risks. This paper introduces a novel framework and pipeline designed to improve object detection under such conditions, focusing on traffic signal detection where traditional methods often fail due to domain shifts caused by adverse weather. We provide a comprehensive analysis of the limitations of existing techniques. Our proposed pipeline significantly enhances detection accuracy in snow, rain, and fog. Results show a 40.8% improvement in average IoU and F1 scores compared to naive fine-tuning and a 22.4% performance increase in domain shift scenarios, such as training on artificial snow and testing on rain images.

Title:

      DF40: Toward Next-Generation Deepfake Detection
  • Authors: Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Li Yuan, Chengjie Wang, Shouhong Ding, Yunsheng Wu
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    We propose a new comprehensive benchmark to revolutionize the current deepfake detection field to the next generation. Predominantly, existing works identify top-notch detection algorithms and models by adhering to the common practice: training detectors on one specific dataset (e.g., FF++) and testing them on other prevalent deepfake datasets. This protocol is often regarded as a "golden compass" for navigating SoTA detectors. But can these stand-out "winners" be truly applied to tackle the myriad of realistic and diverse deepfakes lurking in the real world? If not, what underlying factors contribute to this gap? In this work, we found the dataset (both train and test) can be the "primary culprit" due to: (1) forgery diversity: Deepfake techniques are commonly referred to as both face forgery (face-swapping and face-reenactment) and entire image synthesis (AIGC). Most existing datasets only contain partial types, with limited forgery methods implemented; (2) forgery realism: The dominant training dataset, FF++, contains old forgery techniques from the past five years. "Honing skills" on these forgeries makes it difficult to guarantee effective detection of nowadays' SoTA deepfakes; (3) evaluation protocol: Most detection works perform evaluations on one type, e.g., train and test on face-swapping only, which hinders the development of universal deepfake detectors. To address this dilemma, we construct a highly diverse and large-scale deepfake dataset called DF40, which comprises 40 distinct deepfake techniques. We then conduct comprehensive evaluations using 4 standard evaluation protocols and 7 representative detectors, resulting in over 2,000 evaluations. Through these evaluations, we analyze from various perspectives, leading to 12 new insightful findings contributing to the field. We also open up 5 valuable yet previously underexplored research questions to inspire future works.

Title:

      Semantic Enhanced Few-shot Object Detection
  • Authors: Zheng Wang, Yingjie Gao, Qingjie Liu, Yunhong Wang
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Few-shot object detection (FSOD), which aims to detect novel objects with limited annotated instances, has made significant progress in recent years. However, existing methods still suffer from biased representations, especially for novel classes in extremely low-shot scenarios. During fine-tuning, a novel class may exploit knowledge from similar base classes to construct its own feature distribution, leading to classification confusion and performance degradation. To address these challenges, we propose a fine-tuning based FSOD framework that utilizes semantic embeddings for better detection. In our proposed method, we align the visual features with class name embeddings and replace the linear classifier with our semantic similarity classifier. Our method trains each region proposal to converge to the corresponding class embedding. Furthermore, we introduce a multimodal feature fusion to augment the vision-language communication, enabling a novel class to draw support explicitly from well-trained similar base classes. To prevent class confusion, we propose a semantic-aware max-margin loss, which adaptively applies a margin beyond similar classes. As a result, our method allows each novel class to construct a compact feature space without being confused with similar base classes. Extensive experiments on Pascal VOC and MS COCO demonstrate the superiority of our method.

Title:

      ModSec-Learn: Boosting ModSecurity with Machine Learning
  • Authors: Christian Scano, Giuseppe Floris, Biagio Montaruli, Luca Demetrio, Andrea Valenza, Luca Compagna, Davide Ariu, Luca Piras, Davide Balzarotti, Battista Biggio
  • Subjects: Subjects:
    Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    ModSecurity is widely recognized as the standard open-source Web Application Firewall (WAF), maintained by the OWASP Foundation. It detects malicious requests by matching them against the Core Rule Set (CRS), identifying well-known attack patterns. Each rule is manually assigned a weight based on the severity of the corresponding attack, and a request is blocked if the sum of the weights of matched rules exceeds a given threshold. However, we argue that this strategy is largely ineffective against web attacks, as detection is only based on heuristics and not customized on the application to protect. In this work, we overcome this issue by proposing a machine-learning model that uses the CRS rules as input features. Through training, ModSec-Learn is able to tune the contribution of each CRS rule to predictions, thus adapting the severity level to the web applications to protect. Our experiments show that ModSec-Learn achieves a significantly better trade-off between detection and false positive rates. Finally, we analyze how sparse regularization can reduce the number of rules that are relevant at inference time, by discarding more than 30% of the CRS rules. We release our open-source code and the dataset at this https URL and this https URL, respectively.
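
The baseline decision rule described in this abstract (sum the weights of the matched rules and block the request when the sum exceeds a threshold) can be sketched as follows. This is a minimal illustration and not ModSecurity's or ModSec-Learn's actual code: the rule names, weights, and threshold are made-up values, and `anomaly_score` and `is_blocked` are hypothetical helper names.

```python
from typing import Dict, Iterable

# Hypothetical rule weights (severity scores); real CRS rule IDs and weights differ.
RULE_WEIGHTS: Dict[str, float] = {
    "rule_sqli": 5.0,
    "rule_xss": 5.0,
    "rule_scanner": 2.0,
}


def anomaly_score(matched_rules: Iterable[str], weights: Dict[str, float]) -> float:
    """Sum the weights of all rules that matched the request.

    Rules without a known weight contribute nothing to the score.
    """
    return sum(weights.get(rule, 0.0) for rule in matched_rules)


def is_blocked(matched_rules: Iterable[str], weights: Dict[str, float],
               threshold: float) -> bool:
    """Block when the summed weight strictly exceeds the threshold.

    The strict comparison follows the abstract's wording ("exceeds").
    """
    return anomaly_score(matched_rules, weights) > threshold
```

In a learned variant, the hand-assigned weights would instead be fit on labeled traffic, with one input feature per rule; the abstract does not state which model family is used, so none is assumed here.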

Title:

      satsuma: Structure-based Symmetry Breaking in SAT
  • Authors: Markus Anders, Sofia Brenner, Gaurav Rattan
  • Subjects: Subjects:
    Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Symmetry reduction is crucial for solving many interesting SAT instances in practice. Numerous approaches have been proposed, which try to strike a balance between symmetry reduction and computational overhead. Arguably the most readily applicable method is the computation of static symmetry breaking constraints: a constraint restricting the search-space to non-symmetrical solutions is added to a given SAT instance. A distinct advantage of static symmetry breaking is that the SAT solver itself is not modified. A disadvantage is that the strength of symmetry reduction is usually limited. In order to boost symmetry reduction, the state-of-the-art tool BreakID [Devriendt et al.] pioneered the identification and tailored breaking of a particular substructure of symmetries, the so-called row interchangeability groups. In this paper, we propose a new symmetry breaking tool called satsuma. The core principle of our tool is to exploit more diverse but frequently occurring symmetry structures. This is enabled by new practical detection algorithms for row interchangeability, row-column symmetry, Johnson symmetry, and various combinations. Based on the resulting structural description, we then produce symmetry breaking constraints. We compare this new approach to BreakID on a range of instance families exhibiting symmetry. Our benchmarks suggest improved symmetry reduction in the presence of Johnson symmetry and comparable performance in the presence of row-column symmetry. Moreover, our implementation runs significantly faster, even though it identifies more diverse structures.

Title:

      Automated Bioacoustic Monitoring for South African Bird Species on Unlabeled Data
  • Authors: Michael Doell, Dominik Kuehn, Vanessa Suessle, Matthew J. Burnett, Colleen T. Downs, Andreas Weinmann, Elke Hergenroether
  • Subjects: Subjects:
    Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Analysis for biodiversity monitoring based on passive acoustic monitoring (PAM) recordings is time-consuming and challenged by the presence of background noise in recordings. Existing models for sound event detection (SED) worked only on certain avian species and the development of further models required labeled data. The developed framework automatically extracted labeled data from available platforms for selected avian species. The labeled data were embedded into recordings, including environmental sounds and noise, and were used to train convolutional recurrent neural network (CRNN) models. The models were evaluated on unprocessed real-world data recorded in urban KwaZulu-Natal habitats. The Adapted SED-CRNN model reached an F1 score of 0.73, demonstrating its efficiency under noisy, real-world conditions. The proposed approach to automatically extract labeled data for chosen avian species enables an easy adaptation of PAM to other species and habitats for future conservation projects.

Title:

      DDLNet: Boosting Remote Sensing Change Detection with Dual-Domain Learning
  • Authors: Xiaowen Ma, Jiawei Yang, Rui Che, Huanting Zhang, Wei Zhang
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Remote sensing change detection (RSCD) aims to identify the changes of interest in a region by analyzing multi-temporal remote sensing images, and has an outstanding value for local development monitoring. Existing RSCD methods are devoted to contextual modeling in the spatial domain to enhance the changes of interest. Despite the satisfactory performance achieved, the lack of knowledge in the frequency domain limits the further improvement of model performance. In this paper, we propose DDLNet, an RSCD network based on dual-domain learning (i.e., frequency and spatial domains). In particular, we design a Frequency-domain Enhancement Module (FEM) to capture frequency components from the input bi-temporal images using Discrete Cosine Transform (DCT) and thus enhance the changes of interest. Besides, we devise a Spatial-domain Recovery Module (SRM) to fuse spatiotemporal features for reconstructing spatial details of change representations. Extensive experiments on three benchmark RSCD datasets demonstrate that the proposed method achieves state-of-the-art performance and reaches a more satisfactory accuracy-efficiency trade-off. Our code is publicly available at this https URL.
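
As background for the DCT step mentioned in this abstract, the sketch below computes a one-dimensional orthonormal DCT-II in plain Python. It only illustrates the transform itself: the abstract applies the DCT to bi-temporal images inside a learned module whose details are not given here, and `dct2` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def dct2(signal: Sequence[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D signal (naive O(n^2) implementation).

    Coefficient 0 captures the mean level; higher indices capture
    progressively faster oscillations.
    """
    n = len(signal)
    coeffs = []
    for k in range(n):
        # Orthonormal scaling so that the transform preserves signal energy.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs
```

A constant signal puts all of its energy into coefficient 0, which is why removing or re-weighting low-frequency coefficients is a common way to emphasize local changes.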

Title:

      Concept Drift Visualization of SVM with Shifting Window
  • Authors: Honorius Galmeanu, Razvan Andonie
  • Subjects: Subjects:
    Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    In machine learning, concept drift is an evolution of information that invalidates the current data model. It happens when the statistical properties of the input data change over time in unforeseen ways. Concept drift detection is crucial when dealing with dynamically changing data. Its visualization can bring valuable insight into the data dynamics, especially for multidimensional data, and is related to visual knowledge discovery. We propose a novel visualization model based on parallel coordinates, denoted as parallel histograms through time. Our model represents histograms of feature distributions for successive time-shifted windows. The drift is shown as variations of these histograms, obtained by connecting the means of the distribution for successive time windows. We show how these diagrams can be used to explain the decision made by the machine learning model in choosing the drift point. By isolating the drift at the edges of successive time windows, there will be no (or reduced) drift within the adjacent windows. We illustrate this concept on both synthetic and real datasets. In our experiments, we use an incremental/decremental SVM with shifting window, introduced by us in previous work. With our proposed technique, in addition to detecting the presence of concept drift, we can also depict it. This information can be further used to explain the change. mental results, opening the possibility for further investigations.
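
The data behind such a plot (one histogram and one mean per time window) can be sketched as below. This is a simplified illustration for a single feature with fixed-size windows; it does not reproduce the authors' parallel-coordinates rendering or their SVM-based choice of drift point, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def sliding_windows(values: Sequence[float], size: int, step: int) -> List[Sequence[float]]:
    """Return successive windows of `size` items, shifted by `step` items."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, step)]


def histogram(window: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Count values per equal-width bin on [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in window:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp to the first/last bin
        counts[idx] += 1
    return counts


def window_summaries(values: Sequence[float], size: int, step: int,
                     lo: float, hi: float, bins: int) -> List[Tuple[List[int], float]]:
    """For each window, return (histogram, mean) of one feature."""
    return [(histogram(w, lo, hi, bins), sum(w) / len(w))
            for w in sliding_windows(values, size, step)]
```

Connecting the per-window means in order gives a polyline whose jumps indicate where the feature distribution shifts between adjacent windows.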

Title:

      Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN
  • Authors: Pablo Moriano, Steven C. Hespeler, Mingyan Li, Robert A. Bridges
  • Subjects: Subjects:
    Cryptography and Security (cs.CR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Vehicular controller area networks (CANs) are susceptible to masquerade attacks by malicious adversaries. In masquerade attacks, adversaries silence a targeted ID and then send malicious frames with forged content at the expected timing of benign frames. As masquerade attacks could seriously harm vehicle functionality and are the stealthiest attacks to detect in CAN, recent work has devoted attention to comparing frameworks for detecting masquerade attacks in CAN. However, most existing works report offline evaluations using CAN logs already collected using simulations that do not comply with the domain's real-time constraints. Here we contribute to advancing the state of the art by introducing a benchmark study of four different non-deep learning (DL)-based unsupervised online intrusion detection systems (IDS) for masquerade attacks in CAN. Our approach differs from existing benchmarks in that we analyze the effect of controlling streaming data conditions in a sliding window setting. In doing so, we use realistic masquerade attacks being replayed from the ROAD dataset. We show that although benchmarked IDS are not effective at detecting every attack type, the method that relies on detecting changes in the hierarchical structure of clusters of time series produces the best results at the expense of higher computational overhead. We discuss limitations, open challenges, and how the benchmarked methods can be used for practical unsupervised online CAN IDS for masquerade attacks.

Title:

      A Graph Model and a Layout Algorithm for Knitting Patterns
  • Authors: Kathryn Gray, Brian Bell, Stephen Kobourov
  • Subjects: Subjects:
    Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Knitting, an ancient fiber art, creates a structured fabric consisting of loops or stitches. Publishing hand knitting patterns involves lengthy testing periods and numerous knitters. Modeling knitting patterns with graphs can help expedite error detection and pattern validation. In this paper, we describe how to model simple knitting patterns as planar graphs. We then design, implement, and evaluate a layout algorithm to visualize knitting patterns. Knitting patterns correspond to graphs with pre-specified edge lengths (e.g., uniform lengths, two lengths, etc.). This yields a natural graph layout optimization problem: realize a planar graph with pre-specified edge lengths, while ensuring there are no edge crossings. We quantitatively evaluate our algorithm using real knitting patterns of various sizes against three others: one created for knitting patterns, one that maintains planarity and optimizes edge lengths, and a popular force-directed algorithm.

Title:

      Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control
  • Authors: Alexander Blatt, Aravind Krishnan, Dietrich Klakow
  • Subjects: Subjects:
    Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Utilizing air-traffic control (ATC) data for downstream natural-language processing tasks requires preprocessing steps. Key steps are the transcription of the data via automatic speech recognition (ASR) and speaker diarization or, more specifically, speaker role detection (SRD) to divide the transcripts into pilot and air-traffic controller (ATCO) transcripts. While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. Our study shows in which cases our joint system can outperform the two traditional approaches and in which cases the other architectures are preferable. We additionally evaluate how acoustic and lexical differences influence all architectures and show how to overcome them for our joint architecture.

Title:

      DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection
  • Authors: Zhuoxiao Chen, Zixin Wang, Sen Wang, Zi Huang, Yadan Luo
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    LiDAR-based 3D object detection has seen impressive advances in recent times. However, deploying trained 3D detectors in the real world often yields unsatisfactory performance when the distribution of the test data significantly deviates from the training data due to different weather conditions, object sizes, etc. A key factor in this performance degradation is the diminished generalizability of pre-trained models, which creates a sharp loss landscape during training. Such sharpness, when encountered during testing, can precipitate significant performance declines, even with minor data variations. To address the aforementioned challenges, we propose dual-perturbation optimization (DPO) for Test-time Adaptation in 3D Object Detection (TTA-3OD). We minimize the sharpness to cultivate a flat loss landscape to ensure model resiliency to minor data variations, thereby enhancing the generalization of the adaptation process. To fully capture the inherent variability of the test point clouds, we further introduce adversarial perturbation to the input BEV features to better simulate the noisy test environment. As the dual perturbation strategy relies on trustworthy supervision signals, we utilize a reliable Hungarian matcher to filter out pseudo-labels sensitive to perturbations. Additionally, we introduce early Hungarian cutoff to avoid error accumulation from incorrect pseudo-labels by halting the adaptation process. Extensive experiments across three types of transfer tasks demonstrate that the proposed DPO significantly surpasses previous state-of-the-art approaches, specifically on Waymo → KITTI, outperforming the most competitive baseline by 57.72% in AP_3D and reaching 91% of the fully supervised upper bound.

Title:

      Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events
  • Authors: Mohammad Abu Tami, Huthaifa I. Ashqar, Mohammed Elhenawy
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning models and extensive datasets for high accuracy and reliability. However, the advent of Multimodal Large Language Models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities, thereby providing automated analyses of driving videos. Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts to ensure accurate, reliable, and actionable insights for hazard detection. By incorporating models like Gemini-Pro-Vision 1.5 and Llava, our methodology aims to automate the detection of safety-critical events and mitigate common issues such as hallucinations in MLLM outputs. Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis, though further validation on larger datasets is necessary. Furthermore, more investigations are required to explore the performance enhancements of the proposed framework through few-shot learning and fine-tuned models. This research underscores the significance of MLLMs in advancing the analysis of naturalistic driving videos by improving safety-critical event detection and the understanding of interactions with complex environments.

Title:

      The Use of Multimodal Large Language Models to Detect Objects from Thermal Images: Transportation Applications
  • Authors: Huthaifa I. Ashqar, Taqwa I. Alhadidi, Mohammed Elhenawy, Nour O. Khanfar
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    The integration of thermal imaging data with Multimodal Large Language Models (MLLMs) constitutes an exciting opportunity for improving the safety and functionality of autonomous driving systems and many Intelligent Transportation Systems (ITS) applications. This study investigates whether MLLMs can understand complex images from RGB and thermal cameras and detect objects directly. Our goals were to 1) assess the ability of the MLLM to learn from information from various sets, 2) detect objects and identify elements in thermal cameras, 3) determine whether two independent modality images show the same scene, and 4) learn all objects using different modalities. The findings showed that both GPT-4 and Gemini were effective in detecting and classifying objects in thermal images. Similarly, the Mean Absolute Percentage Error (MAPE) for pedestrian classification was 70.39% and 81.48%, respectively. Moreover, the MAPEs for bike, car, and motorcycle detection were 78.4%, 55.81%, and 96.15%, respectively. Gemini produced MAPEs of 66.53%, 59.35% and 78.18%, respectively. This finding further demonstrates that MLLMs can identify thermal images and can be employed in advanced imaging automation technologies for ITS applications.
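
For reference, the standard definition of the Mean Absolute Percentage Error is 100/n times the sum of |actual - predicted| / |actual|. The sketch below implements that definition; how the authors derive the actual and predicted values (for example, per-image object counts) is not stated in the abstract, so the inputs here are purely illustrative and `mape` is a hypothetical helper name.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (lower means smaller error).

    Raises ValueError for empty or mismatched inputs, or when an actual
    value is zero (the percentage error is undefined in that case).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)
```

With actual values [2, 4] and predictions [1, 4], the per-item errors are 50% and 0%, so the MAPE is 25%.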

Title:

      A-OctoMap: An Adaptive OctoMap for Online Motion Planning
  • Authors: Yihui Mao, Shuo Liu
  • Subjects: Subjects:
    Robotics (cs.RO); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Traditional robotic motion planning methods often struggle with fixed resolutions in dynamically changing environments. To address these challenges, we introduce the A-OctoMap, an adaptive Octo-Tree structure that enhances spatial representation and facilitates real-time, efficient motion planning. This novel framework allows for dynamic space partitioning and multi-resolution queries, significantly improving computational efficiency and precision. Key innovations include a tree-based data structure for enhanced geometric processing, real-time map updating for accurate trajectory planning, and efficient collision detection. Our extensive testing shows superior navigation safety and efficiency in complex settings compared to conventional methods. A-OctoMap sets a new standard for adaptive spatial mapping in autonomous systems, promising significant advancements in navigating unpredictable environments.

Title:

      EnTruth: Enhancing the Traceability of Unauthorized Dataset Usage in Text-to-image Diffusion Models with Minimal and Robust Alterations
  • Authors: Jie Ren, Yingqian Cui, Chen Chen, Vikash Sehwag, Yue Xing, Jiliang Tang, Lingjuan Lyu
  • Subjects: Subjects:
    Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Generative models, especially text-to-image diffusion models, have significantly advanced in their ability to generate images, benefiting from enhanced architectures, increased computational power, and large-scale datasets. While the datasets play an important role, their protection has remained an unsolved issue. Current protection strategies, such as watermarks and membership inference, either require a high poison rate, which is detrimental to image quality, or suffer from low accuracy and robustness. In this work, we introduce a novel approach, EnTruth, which Enhances Traceability of unauthorized dataset usage utilizing template memorization. By strategically incorporating the template memorization, EnTruth can trigger the specific behavior in unauthorized models as the evidence of infringement. Our method is the first to investigate the positive application of memorization and use it for copyright protection, which turns a curse into a blessing and offers a pioneering perspective for unauthorized usage detection in generative models. Comprehensive experiments are provided to demonstrate its effectiveness in terms of data-alteration rate, accuracy, robustness and generation quality.

Title:

      Towards the in-situ Trunk Identification and Length Measurement of Sea Cucumbers via Bézier Curve Modelling
  • Authors: Shuaixin Liu, Kunqian Li, Yilin Ding, Kuangwei Xu, Qianli Jiang, Q. M. Jonathan Wu, Dalei Song
  • Subjects: Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    We introduce a novel vision-based framework for in-situ trunk identification and length measurement of sea cucumbers, which plays a crucial role in the monitoring of marine ranching resources and mechanized harvesting. To model sea cucumber trunk curves with varying degrees of bending, we utilize the parametric Bézier curve due to its computational simplicity, stability, and extensive range of transformation possibilities. Then, we propose an end-to-end unified framework that combines parametric Bézier curve modeling with the widely used You-Only-Look-Once (YOLO) pipeline, abbreviated as TISC-Net, and incorporates effective funnel activation and efficient multi-scale attention modules to enhance curve feature perception and learning. Furthermore, we propose incorporating trunk endpoint loss as an additional constraint to effectively mitigate the impact of endpoint deviations on the overall curve. Finally, by utilizing the depth information of pixels located along the trunk curve captured by a binocular camera, we propose accurately estimating the in-situ length of sea cucumbers through space curve integration. We established two challenging benchmark datasets for curve-based in-situ sea cucumber trunk identification. These datasets consist of over 1,000 real-world marine environment images of sea cucumbers, accompanied by Bézier format annotations. We conduct evaluation on SC-ISTI, for which our method achieves mAP50 above 0.9 on both object detection and trunk identification tasks. Extensive length measurement experiments demonstrate that the average absolute relative error is around 0.15.
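
The length-by-curve-integration idea in this abstract can be illustrated with a small numerical sketch: evaluate a Bézier curve at many parameter values and sum the distances between consecutive samples. This is an assumption-laden illustration and not the authors' pipeline: the control points are taken to be 3-D coordinates already (the paper obtains depth from a binocular camera), and the function names and sample count are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def bezier_point(ctrl: Sequence[Point3], t: float) -> Point3:
    """Evaluate a Bézier curve of any degree at t in [0, 1] (De Casteljau)."""
    pts: List[Tuple[float, ...]] = [tuple(p) for p in ctrl]
    while len(pts) > 1:
        # Repeated linear interpolation between neighbouring points.
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]  # type: ignore[return-value]


def bezier_length(ctrl: Sequence[Point3], samples: int = 200) -> float:
    """Approximate arc length by summing chord lengths between samples."""
    pts = [bezier_point(ctrl, i / samples) for i in range(samples + 1)]
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))
```

The polyline sum slightly underestimates the true length of a bent curve and converges as `samples` grows; for collinear control points it matches the straight-line distance.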

Title:

      SSAD: Self-supervised Auxiliary Detection Framework for Panoramic X-ray based Dental Disease Diagnosis
  • Authors: Zijian Cai, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuechen Li, Linlin Shen, He Meng, Yongqiang Deng
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Panoramic X-ray is a simple and effective tool for diagnosing dental diseases in clinical practice. When deep learning models are developed to assist dentists in interpreting panoramic X-rays, most suffer in performance from limited annotated data, since annotation requires a dentist's expertise and considerable time. Although self-supervised learning (SSL) has been proposed to address this challenge, the two-stage process of pretraining and fine-tuning requires even more training time and computational resources. In this paper, we present a self-supervised auxiliary detection (SSAD) framework, which is plug-and-play and compatible with any detector. It consists of a reconstruction branch and a detection branch. Both branches are trained simultaneously, sharing the same encoder, without the need for fine-tuning. The reconstruction branch learns to restore the tooth texture of healthy or diseased teeth, while the detection branch utilizes these learned features for diagnosis. To enhance the encoder's ability to capture fine-grained features, we incorporate the image encoder of SAM to construct a texture consistency (TC) loss, which extracts image embeddings from the input and output of the reconstruction branch and then enforces both embeddings into the same feature space. Extensive experiments on the public DENTEX dataset through three detection tasks demonstrate that the proposed SSAD framework achieves state-of-the-art performance compared to mainstream object detection methods and SSL methods. The code is available at this https URL

Title:

      Image anomaly detection and prediction scheme based on SSA optimized ResNet50-BiGRU model
  • Authors: Qianhui Wan, Zecheng Zhang, Liheng Jiang, Zhaoqi Wang, Yan Zhou
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Image anomaly detection is a popular research direction, with many methods emerging in recent years due to rapid advancements in computing. The use of artificial intelligence for image anomaly detection has been widely studied. By analyzing images of athlete posture and movement, it is possible to predict injury status and suggest necessary adjustments. Most existing methods rely on convolutional networks to extract information from irrelevant pixel data, limiting model accuracy. This paper introduces a network combining Residual Network (ResNet) and Bidirectional Gated Recurrent Unit (BiGRU), which can predict potential injury types and provide early warnings by analyzing changes in muscle and bone poses from video images. To address the high complexity of this network, the Sparrow search algorithm was used for optimization. Experiments conducted on four datasets demonstrated that our model has the smallest error in image anomaly detection compared to other models, showing strong adaptability. This provides a new approach for anomaly detection and predictive analysis in images, contributing to the sustainable development of human health and performance.

Title:

      Seeing Through AI's Lens: Enhancing Human Skepticism Towards LLM-Generated Fake News
  • Authors: Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee
  • Subjects:
    Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    LLMs offer valuable capabilities, yet they can be utilized by malicious users to disseminate deceptive information and generate fake news. The growing prevalence of LLMs poses difficulties in crafting detection approaches that remain effective across various text domains. Additionally, the absence of precautionary measures for AI-generated news on online social platforms is concerning. Therefore, there is an urgent need to improve people's ability to differentiate between news articles written by humans and those produced by LLMs. By providing cues in human-written and LLM-generated news, we can help individuals increase their skepticism towards fake LLM-generated news. This paper aims to elucidate simple markers that help individuals distinguish between articles penned by humans and those created by LLMs. To achieve this, we first collected a dataset comprising 39k news articles authored by humans or generated by four distinct LLMs with varying degrees of fakeness. We then devise a metric named Entropy-Shift Authorship Signature (ESAS) based on information theory and entropy principles. The proposed ESAS ranks terms or entities, like POS tagging, within news articles based on their relevance in discerning article authorship. We demonstrate the effectiveness of our metric by showing the high accuracy attained by a basic method, i.e., TF-IDF combined with a logistic regression classifier, using a small set of terms with the highest ESAS scores. Consequently, we introduce and scrutinize these top ESAS-ranked terms to aid individuals in strengthening their skepticism towards LLM-generated fake news.

Title:

      Leveraging eBPF and AI for Ransomware Nose Out
  • Authors: Arjun Sekar, Sameer G. Kulkarni, Joy Kuri
  • Subjects:
    Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    In this work, we propose a two-phased approach for real-time detection and deterrence of ransomware. To achieve this, we leverage the capabilities of eBPF (Extended Berkeley Packet Filter) and artificial intelligence to develop both proactive and reactive methods. In the first phase, we utilize signature-based detection, where we employ custom eBPF programs to trace the execution of new processes and perform hash-based analysis against a known ransomware dataset. In the second phase, we employ a behavior-based technique that uses a custom eBPF program to monitor process activities and uses Natural Language Processing (NLP) to detect the creation of ransom notes, a prominent indicator of ransomware activity. By leveraging the low-level tracing capabilities of eBPF and integrating NLP-based machine learning algorithms, our solution achieves an impressive 99.76% accuracy in identifying ransomware incidents within a few seconds of the onset of zero-day attacks.
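The hash-based signature check from the first phase can be sketched in user space as follows. This covers only the lookup step; the eBPF process tracing is not shown. SHA-256 and the helper names are assumptions, since the abstract does not state which hash function or data layout the authors use.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_known_ransomware(path, known_hashes):
    """Return True if the file's digest appears in the set of known-bad digests."""
    return sha256_of_file(path) in known_hashes
```

In a deployment like the one described, the path would presumably come from a process-execution event, and `known_hashes` would be loaded from the ransomware dataset as a set for constant-time lookup.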

Title:

      How to design a dataset compliant with an ML-based system ODD?
  • Authors: Cyril Cappi, Noémie Cohen, Mélanie Ducoffe, Christophe Gabreau, Laurent Gardes, Adrien Gauffriau, Jean-Brice Ginestet, Franck Mamalet, Vincent Mussot, Claire Pagetti, David Vigouroux
  • Subjects:
    Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    This paper focuses on a Vision-based Landing task and presents the design and the validation of a dataset that would comply with the Operational Design Domain (ODD) of a Machine-Learning (ML) system. Relying on emerging certification standards, we describe the process for establishing ODDs at both the system and image levels. In the process, we present the translation of high-level system constraints into actionable image-level properties, allowing for the definition of verifiable Data Quality Requirements (DQRs). To illustrate this approach, we use the Landing Approach Runway Detection (LARD) dataset which combines synthetic imagery and real footage, and we focus on the steps required to verify the DQRs. The replicable framework presented in this paper addresses the challenges of designing a dataset compliant with the stringent needs of ML-based systems certification in safety-critical applications.

Title:

      Prompt Injection Attacks in Defended Systems
  • Authors: Daniil Khomsky, Narek Maloyan, Bulat Nutfullin
  • Subjects:
    Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Large language models play a crucial role in modern natural language processing technologies. However, their extensive use also introduces potential security risks, such as the possibility of black-box attacks. These attacks can embed hidden malicious features into the model, leading to adverse consequences during its deployment. This paper investigates methods for black-box attacks on large language models with a three-tiered defense mechanism. It analyzes the challenges and significance of these attacks, highlighting their potential implications for language processing system security. Existing attack and defense methods are examined, evaluating their effectiveness and applicability across various scenarios. Special attention is given to the detection algorithm for black-box attacks, identifying hazardous vulnerabilities in language models and retrieving sensitive information. This research presents a methodology for vulnerability detection and the development of defensive strategies against black-box attacks on large language models.

Title:

      Detecting sexually explicit content in the context of the child sexual abuse materials (CSAM): end-to-end classifiers and region-based networks
  • Authors: Weronika Gutfeter, Joanna Gajewska, Andrzej Pacut
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Child sexual abuse materials (CSAM) pose a significant threat to the safety and well-being of children worldwide. Detecting and preventing the distribution of such materials is a critical task for law enforcement agencies and technology companies. As content moderation is often manual, developing an automated detection system can help reduce human reviewers' exposure to potentially harmful images and accelerate the process of counteracting. This study presents methods for classifying sexually explicit content, which plays a crucial role in the automated CSAM detection system. Several approaches are explored to solve the task: an end-to-end classifier, a classifier with person detection and a private body parts detector. All proposed methods are tested on the images obtained from the online tool for reporting illicit content. Due to legal constraints, access to the data is limited, and all algorithms are executed remotely on the isolated server. The end-to-end classifier yields the most promising results, with an accuracy of 90.17%, after augmenting the training set with the additional neutral samples and adult pornography. While detection-based methods may not achieve higher accuracy rates and cannot serve as a final classifier on their own, their inclusion in the system can be beneficial. Human body-oriented approaches generate results that are easier to interpret, and obtaining more interpretable results is essential when analyzing models that are trained without direct access to data.

Title:

      Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation Services
  • Authors: David Hartmann, Amin Oueslati, Dimitri Staufer
  • Subjects:
    Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC, remain.

Title:

      Definition generation for lexical semantic change detection
  • Authors: Mariia Fedorova, Andrey Kutuzov, Yves Scherrer
  • Subjects:
    Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as 'senses', and the change score of a target word is obtained by comparing their distributions in the two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows inspection of the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.
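Comparing sense distributions across two periods can be sketched as below. Jensen-Shannon divergence is used here as one common choice; the abstract does not say which distance the authors apply, so the measure, the function names, and the toy definitions are all assumptions. Both input lists are assumed to be non-empty.

```python
import math
from collections import Counter

def distribution(labels, support):
    """Relative frequency of each sense label over a shared support."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [counts[s] / total for s in support]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2

def change_score(defs_t1, defs_t2):
    """Score semantic change from the definitions assigned to usages in two periods."""
    support = sorted(set(defs_t1) | set(defs_t2))
    return js_divergence(distribution(defs_t1, support), distribution(defs_t2, support))

# Toy definitions-as-senses for one target word (illustrative only).
period_1 = ["a small rodent"] * 9 + ["a pointing device"] * 1
period_2 = ["a small rodent"] * 4 + ["a pointing device"] * 6
print(round(change_score(period_1, period_2), 3))
```

Because each sense is a readable definition, the labels whose frequencies shift most can be inspected directly, which is the interpretability property the abstract emphasizes.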

Title:

      A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
  • Authors: Kyungbok Lee, You Zhang, Zhiyao Duan
  • Subjects:
    Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake video (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.

Title:

      Live Video Captioning
  • Authors: Eduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-Sastre
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Dense video captioning is the task that involves the detection and description of events within video sequences. While traditional approaches focus on offline solutions where the entire video under analysis is available to the captioning model, in this work we introduce a paradigm shift towards Live Video Captioning (LVC). In LVC, dense video captioning models must generate captions for video streams in an online manner, facing important constraints such as having to work with partial observations of the video, the need for temporal anticipation and, of course, ideally ensuring a real-time response. In this work we formally introduce the novel problem of LVC and propose new evaluation metrics tailored for the online scenario, demonstrating their superiority over traditional metrics. We also propose an LVC model integrating deformable transformers and temporal filtering to address the new challenges of LVC. Experimental evaluations on the ActivityNet Captions dataset validate the effectiveness of our approach, highlighting its performance in LVC compared to state-of-the-art offline methods. Results of our model as well as an evaluation kit with the novel metrics integrated are made publicly available to encourage further research on LVC.

Title:

      aeon: a Python toolkit for learning from time series
  • Authors: Matthew Middlehurst, Ali Ismail-Fawaz, Antoine Guillaume, Christopher Holder, David Guijo Rubio, Guzal Bulatova, Leonidas Tsaprounis, Lukasz Mentel, Martin Walter, Patrick Schäfer, Anthony Bagnall
  • Subjects:
    Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    aeon is a unified Python 3 library for all machine learning tasks involving time series. The package contains modules for time series forecasting, classification, extrinsic regression and clustering, as well as a variety of utilities, transformations and distance measures designed for time series data. aeon also has a number of experimental modules for tasks such as anomaly detection, similarity search and segmentation. aeon follows the scikit-learn API as much as possible to help new users and enable easy integration of aeon estimators with useful tools such as model selection and pipelines. It provides a broad library of time series algorithms, including efficient implementations of the very latest advances in research. Using a system of optional dependencies, aeon integrates a wide variety of packages into a single interface while keeping the core framework with minimal dependencies. The package is distributed under the 3-Clause BSD license and is available at this https URL aeon-toolkit/aeon. This version was submitted to the JMLR journal on 02 Nov 2023 for v0.5.0 of aeon. At the time of this preprint aeon has released v0.9.0, and has had substantial changes.

Title:

      LeYOLO, New Scalable and Efficient CNN Architecture for Object Detection
  • Authors: Lilian Hollard, Lucas Mohimont, Nathalie Gaveau, Luiz-Angelo Steffenel
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Computational efficiency in deep neural networks is critical for object detection, especially as newer models prioritize speed over efficient computation (FLOP). This evolution has somewhat left behind embedded and mobile-oriented AI object detection applications. In this paper, we focus on design choices of neural network architectures for efficient object detection computation based on FLOP and propose several optimizations to enhance the efficiency of YOLO-based models. Firstly, we introduce an efficient backbone scaling inspired by inverted bottlenecks and theoretical insights from the Information Bottleneck principle. Secondly, we present the Fast Pyramidal Architecture Network (FPAN), designed to facilitate fast multiscale feature sharing while reducing computational resources. Lastly, we propose a Decoupled Network-in-Network (DNiN) detection head engineered to deliver rapid yet lightweight computations for classification and regression tasks. Building upon these optimizations and leveraging more efficient backbones, this paper contributes to a new scaling paradigm for object detection and YOLO-centric models called LeYOLO. Our contribution consistently outperforms existing models in various resource constraints, achieving an unprecedented accuracy-to-FLOP ratio. Notably, LeYOLO-Small achieves a competitive mAP score of 38.2% on COCO val with just 4.5 FLOP(G), representing a 42% reduction in computational load compared to the latest state-of-the-art YOLOv9-Tiny model while achieving similar accuracy. Our novel model family achieves a FLOP-to-accuracy ratio previously unattained, offering scalability that spans from ultra-low neural network configurations (< 1 GFLOP) to efficient yet demanding object detection setups (> 4 GFLOPs) with 25.2, 31.3, 35.2, 38.2, 39.3 and 41 mAP for 0.66, 1.47, 2.53, 4.51, 5.8 and 8.4 FLOP(G).
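The accuracy and compute figures quoted in the abstract can be put side by side with a few lines of arithmetic. The sketch below uses only the numbers listed above; the helper names are illustrative, and the baseline cost is merely what the stated 42% reduction implies, not a figure reported in the abstract.

```python
# mAP (%) and compute (GFLOP) pairs as listed in the abstract.
flops_g = [0.66, 1.47, 2.53, 4.51, 5.8, 8.4]
map_pct = [25.2, 31.3, 35.2, 38.2, 39.3, 41.0]

def map_per_gflop(maps, flops):
    """Accuracy obtained per unit of compute for each model size."""
    return [m / f for m, f in zip(maps, flops)]

def implied_baseline(flop, reduction):
    """Compute cost of a baseline, given a model's cost and its relative reduction."""
    return flop / (1.0 - reduction)

ratios = map_per_gflop(map_pct, flops_g)
for f, m, r in zip(flops_g, map_pct, ratios):
    print(f"{f:5.2f} GFLOP  {m:4.1f} mAP  {r:5.1f} mAP/GFLOP")

# 4.5 GFLOP at a 42% reduction implies a baseline of roughly 7.8 GFLOP.
print(f"implied baseline: {implied_baseline(4.5, 0.42):.2f} GFLOP")
```

The ratio falls steadily as the models grow, which is the usual diminishing-returns pattern when scaling a detector.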

Title:

      Revisiting Modularity Maximization for Graph Clustering: A Contrastive Learning Perspective
  • Authors: Yunfei Liu, Jintang Li, Yuehe Chen, Ruofan Wu, Ericbk Wang, Jing Zhou, Sheng Tian, Shuheng Shen, Xing Fu, Changhua Meng, Weiqiang Wang, Liang Chen
  • Subjects:
    Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Graph clustering, a fundamental and challenging task in graph mining, aims to classify nodes in a graph into several disjoint clusters. In recent years, graph contrastive learning (GCL) has emerged as a dominant line of research in graph clustering and advances the new state-of-the-art. However, GCL-based methods heavily rely on graph augmentations and contrastive schemes, which may introduce challenges such as semantic drift and scalability issues. Another promising line of research involves the adoption of modularity maximization, a popular and effective measure for community detection, as the guiding principle for clustering tasks. Despite the recent progress, the underlying mechanism of modularity maximization is still not well understood. In this work, we dig into the hidden success of modularity maximization for graph clustering. Our analysis reveals the strong connections between modularity maximization and graph contrastive learning, where positive and negative examples are naturally defined by modularity. In light of our results, we propose a community-aware graph clustering framework, coined MAGI, which leverages modularity maximization as a contrastive pretext task to effectively uncover the underlying information of communities in graphs, while avoiding the problem of semantic drift. Extensive experiments on multiple graph datasets verify the effectiveness of MAGI in terms of scalability and clustering performance compared to state-of-the-art graph clustering methods. Notably, MAGI easily scales to a sufficiently large graph with 100M nodes while outperforming strong baselines.
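For readers unfamiliar with the quantity being maximized, the following is a minimal sketch of the standard (Newman) modularity score for an undirected, unweighted graph without self-loops. It shows only how the score is computed for a given partition; it is not the MAGI training objective, and the function name and toy graph are illustrative.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q = sum over communities c of L_c/m - (d_c / 2m)^2,
    where m is the edge count, L_c the number of edges inside c,
    and d_c the total degree of nodes in c."""
    m = len(edges)
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # summed node degree per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge edge (toy graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, split), 4))
```

Each term compares the observed fraction of edges inside a community with the fraction expected under a degree-preserving random graph, which is the sense in which modularity separates node pairs that are more connected than chance from those that are less.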

Title:

      Examining the Implications of Deepfakes for Election Integrity
  • Authors: Hriday Ranka, Mokshit Surana, Neel Kothari, Veer Pariawala, Pratyay Banerjee, Aditya Surve, Sainath Reddy Sankepally, Raghav Jain, Jhagrut Lalwani, Swapneel Mehta
  • Subjects:
    Computers and Society (cs.CY); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    It is becoming cheaper to launch disinformation operations at scale using AI-generated content, in particular 'deepfake' technology. We have observed instances of deepfakes in political campaigns, where generated content is employed to both bolster the credibility of certain narratives (reinforcing outcomes) and manipulate public perception to the detriment of targeted candidates or causes (adversarial outcomes). We discuss the threats from deepfakes in politics, highlight model specifications underlying different types of deepfake generation methods, and contribute an accessible evaluation of the efficacy of existing detection methods. We provide this as a summary for lawmakers and civil society actors to understand how the technology may be applied in light of existing policies regulating its use. We highlight the limitations of existing detection mechanisms and discuss the areas where policies and regulations are required to address the challenges of deepfakes.

Title:

      HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?
  • Authors: Ivan Karpukhin, Foma Shipilov, Andrey Savchenko
  • Subjects:
    Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    In sequential event prediction, which finds applications in finance, retail, social networks, and healthcare, a crucial task is forecasting multiple future events within a specified time horizon. Traditionally, this has been addressed through autoregressive generation using next-event prediction models, such as Marked Temporal Point Processes. However, autoregressive methods use their own output for future predictions, potentially reducing quality as the prediction horizon extends. In this paper, we challenge traditional approaches by introducing a novel benchmark, HoTPP, specifically designed to evaluate a model's ability to predict event sequences over a horizon. This benchmark features a new metric inspired by object detection in computer vision, addressing the limitations of existing metrics in assessing models with imprecise time-step predictions. Our evaluations on established datasets employing various models demonstrate that high accuracy in next-event prediction does not necessarily translate to superior horizon prediction, and vice versa. HoTPP aims to serve as a valuable tool for developing more robust event sequence prediction methods, ultimately paving the way for further advancements in the field.

Title:

      Enhanced Bank Check Security: Introducing a Novel Dataset and Transformer-Based Approach for Detection and Verification
  • Authors: Muhammad Saif Ullah Khan, Tahira Shehzadi, Rabeya Noor, Didier Stricker, Muhammad Zeshan Afzal
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Automated signature verification on bank checks is critical for fraud prevention and ensuring transaction authenticity. This task is challenging due to the coexistence of signatures with other textual and graphical elements on real-world documents. Verification systems must first detect the signature and then validate its authenticity, a dual challenge often overlooked by current datasets and methodologies focusing only on verification. To address this gap, we introduce a novel dataset specifically designed for signature verification on bank checks. This dataset includes a variety of signature styles embedded within typical check elements, providing a realistic testing ground for advanced detection methods. Moreover, we propose a novel approach for writer-independent signature verification using an object detection network. Our detection-based verification method treats genuine and forged signatures as distinct classes within an object detection framework, effectively handling both detection and verification. We employ a DINO-based network augmented with a dilation module to detect and verify signatures on check images simultaneously. Our approach achieves an AP of 99.2 for genuine and 99.4 for forged signatures, a significant improvement over the DINO baseline, which scored 93.1 and 89.3 for genuine and forged signatures, respectively. This improvement highlights our dilation module's effectiveness in reducing both false positives and negatives. Our results demonstrate substantial advancements in detection-based signature verification technology, offering enhanced security and efficiency in financial document processing.

Title:

      Computation-Efficient Semi-Supervised Learning for ECG-based Cardiovascular Diseases Detection
  • Authors: Rushuang Zhou, Zijun Liu, Lei Clifton, David A. Clifton, Kannie W. Y. Chan, Yuan-Ting Zhang, Yining Dong
  • Subjects:
    Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Label scarcity is the main challenge that hinders the wide application of deep learning systems in automatic cardiovascular diseases (CVDs) detection using electrocardiography (ECG). Tuning pre-trained models alleviates this problem by transferring knowledge learned from large datasets to downstream small datasets. However, bottlenecks in computational efficiency and CVDs detection performance limit its clinical applications. It is difficult to improve the detection performance without significantly sacrificing model computational efficiency. Here, we propose a computation-efficient semi-supervised learning paradigm (FastECG) for robust and computation-efficient CVDs detection using ECG. It enables a robust adaptation of pre-trained models on downstream datasets with limited supervision and high computational efficiency. First, a random-deactivation technique is developed to achieve robust and fast low-rank adaptation of pre-trained weights. Subsequently, we propose a one-shot rank allocation module to determine the optimal ranks for the update matrices of the pre-trained weights. Finally, a lightweight semi-supervised learning pipeline is introduced to enhance model performance by leveraging labeled and unlabeled data with high computational efficiency. Extensive experiments on four downstream ECG datasets demonstrate that FastECG not only outperforms the state-of-the-art methods in multi-label CVDs detection but also requires a smaller GPU footprint, less training time, and less parameter storage space. As such, this paradigm provides an effective solution for achieving high computational efficiency and robust detection performance in the clinical applications of pre-trained models under limited supervision.
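The reason rank matters for parameter storage can be shown with simple counting. The sketch assumes the usual low-rank adaptation form, in which the update to a weight matrix of shape (d_out, d_in) is factored as B·A with B of shape (d_out, r) and A of shape (r, d_in); the layer size and rank below are hypothetical, and the paper's random-deactivation and rank-allocation steps are not modeled.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def low_rank_update_params(d_out, d_in, rank):
    """Trainable parameters when the update is factored as B (d_out x r) times A (r x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 768, 768, 8
full = full_update_params(d_out, d_in)
low = low_rank_update_params(d_out, d_in, rank)
print(f"full: {full}, low-rank: {low}, fraction: {low / full:.4f}")
```

Since the cost grows linearly with the rank, choosing a rank per weight matrix trades adaptation capacity against storage and training cost.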

Title:

      ATAC-Net: Zoomed view works better for Anomaly Detection
  • Authors: Shaurya Gupta, Neil Gautam, Anurag Malyala
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    The application of deep learning in visual anomaly detection has gained widespread popularity due to its potential use in quality control and manufacturing. Current standard methods are unsupervised, where a clean dataset is utilised to detect deviations and flag anomalies during testing. However, incorporating a few samples when the type of anomalies is known beforehand can significantly enhance performance. Thus, we propose ATAC-Net, a framework that trains to detect anomalies from a minimal set of known prior anomalies. Furthermore, we introduce attention-guided cropping, which provides a closer view of suspect regions during the training phase. Our framework is a reliable and easy-to-understand system for detecting anomalies, and we substantiate its superiority to some of the current state-of-the-art techniques in a comparable setting.

Title:

      Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines
  • Authors: Xinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Wei An, Weidong Sheng, Li Liu
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at this https URL.

Title:

      Fantastic Copyrighted Beasts and How (Not) to Generate Them
  • Authors: Luxi He, Yangsibo Huang, Weijia Shi, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.

Keyword: face recognition

Title:

      Liveness Detection in Computer Vision: Transformer-based Self-Supervised Learning for Face Anti-Spoofing
  • Authors: Arman Keresh, Pakizar Shamoi
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Face recognition systems are increasingly used in biometric security for convenience and effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos, or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the Vision Transformer (ViT) architecture, fine-tuned with the DINO framework. The DINO framework facilitates self-supervised learning, enabling the model to learn distinguishing features from unlabeled data. We compared the performance of the proposed fine-tuned ViT model using the DINO framework against a traditional CNN model, EfficientNet b2, on the face anti-spoofing task. Numerous tests on standard datasets show that the ViT model performs better than the CNN model in terms of accuracy and resistance to different spoofing methods. Additionally, we collected our own dataset from a biometric application to validate our findings further. This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.

Keyword: augmentation

Title:

      T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity Computation
  • Authors: Lihuan Li, Hao Xue, Yang Song, Flora Salim
  • Subjects:
    Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Trajectory similarity computation is an essential technique for analyzing moving patterns of spatial data across various applications such as traffic management, wildlife tracking, and location-based services. Modern methods often apply deep learning techniques to approximate heuristic metrics but struggle to learn more robust and generalized representations from the vast amounts of unlabeled trajectory data. Recent approaches focus on self-supervised learning methods such as contrastive learning, which have made significant advancements in trajectory representation learning. However, contrastive learning-based methods heavily depend on manually pre-defined data augmentation schemes, limiting the diversity of generated trajectories and resulting in learning from such variations in 2D Euclidean space, which prevents capturing high-level semantic variations. To address these limitations, we propose T-JEPA, a self-supervised trajectory similarity computation method employing Joint-Embedding Predictive Architecture (JEPA) to enhance trajectory representation learning. T-JEPA samples and predicts trajectory information in representation space, enabling the model to infer the missing components of trajectories at high-level semantics without relying on domain knowledge or manual effort. Extensive experiments conducted on three urban trajectory datasets and two Foursquare datasets demonstrate the effectiveness of T-JEPA in trajectory similarity computation.

Title:

      Skin Cancer Images Classification using Transfer Learning Techniques
  • Authors: Md Sirajul Islam, Sanjeev Panta
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Skin cancer is one of the most common and deadliest types of cancer. Early diagnosis of skin cancer at a benign stage is critical to reducing cancer mortality. Detecting skin cancer at an earlier stage requires an automated system, which could save the lives of many patients. Many previous studies have addressed the problem of skin cancer diagnosis using various deep learning and transfer learning models. However, existing approaches are limited in accuracy and rely on time-consuming procedures. In this work, we applied five different pre-trained transfer learning approaches for binary classification of skin cancer detection at benign and malignant stages. To increase the accuracy of these models we fine-tune different layers and activation functions. We used a publicly available ISIC dataset to evaluate transfer learning approaches. For model stability, data augmentation techniques are applied to improve the randomness of the input dataset. These approaches are evaluated using different hyperparameters such as batch sizes, epochs, and optimizers. The experimental results show that the ResNet-50 model provides an accuracy of 0.935, F1-score of 0.86, and precision of 0.94.

Title:

      Class-specific Data Augmentation for Plant Stress Classification
  • Authors: Nasla Saleem, Aditya Balu, Talukder Zaki Jubery, Arti Singh, Asheesh K. Singh, Soumik Sarkar, Baskar Ganapathysubramanian
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Data augmentation is a powerful tool for improving deep learning-based image classifiers for plant stress identification and classification. However, selecting an effective set of augmentations from a large pool of candidates remains a key challenge, particularly in imbalanced and confounding datasets. We propose an approach for automated class-specific data augmentation using a genetic algorithm. We demonstrate the utility of our approach on soybean [Glycine max (L.) Merr] stress classification where symptoms are observed on leaves; a particularly challenging problem due to confounding classes in the dataset. Our approach yields substantial performance, achieving a mean-per-class accuracy of 97.61% and an overall accuracy of 98% on the soybean leaf stress dataset. Our method significantly improves the accuracy of the most challenging classes, with notable enhancements from 83.01% to 88.89% and from 85.71% to 94.05%, respectively. A key observation we make in this study is that high-performing augmentation strategies can be identified in a computationally efficient manner. We fine-tune only the linear layer of the baseline model with different augmentations, thereby reducing the computational burden associated with training classifiers from scratch for each augmentation policy while achieving exceptional performance. This research represents an advancement in automated data augmentation strategies for plant stress classification, particularly in the context of confounding datasets. Our findings contribute to the growing body of research in tailored augmentation techniques and their potential impact on disease management strategies, crop yields, and global food security. The proposed approach holds the potential to enhance the accuracy and efficiency of deep learning-based tools for managing plant stresses in agriculture.

Title:

      A New Approach for Evaluating and Improving the Performance of Segmentation Algorithms on Hard-to-Detect Blood Vessels
  • Authors: João Pedro Parella, Matheus Viana da Silva, Cesar Henrique Comin
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Many studies regarding the vasculature of biological tissues involve the segmentation of the blood vessels in a sample followed by the creation of a graph structure to model the vasculature. The graph is then used to extract relevant vascular properties. Small segmentation errors can lead to largely distinct connectivity patterns and a high degree of variability of the extracted properties. Nevertheless, global metrics such as Dice, precision, and recall are commonly applied for measuring the performance of blood vessel segmentation algorithms. These metrics might conceal important information about the accuracy at specific regions of a sample. To tackle this issue, we propose a local vessel salience (LVS) index to quantify the expected difficulty in segmenting specific blood vessel segments. The LVS index is calculated for each vessel pixel by comparing the local intensity of the vessel with the image background around the pixel. The index is then used for defining a new accuracy metric called low-salience recall (LSRecall), which quantifies the performance of segmentation algorithms on blood vessel segments having low salience. The perspective provided by the LVS index is used to define a data augmentation procedure that can be used to improve the segmentation performance of convolutional neural networks. We show that segmentation algorithms having high Dice and recall values can display very low LSRecall values, which reveals systematic errors of these algorithms for vessels having low salience. The proposed data augmentation procedure is able to improve the LSRecall of some samples by as much as 25%. The developed methodology opens up new possibilities for comparing the performance of segmentation algorithms regarding hard-to-detect blood vessels as well as their capabilities for vascular topology preservation.
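The contrast between global metrics and a region-restricted recall can be illustrated with a minimal sketch. It computes the standard Dice, precision, and recall over flat binary masks, then recomputes them only inside a region mask. The `faint` mask below is a hypothetical stand-in for low-salience vessel pixels; the abstract does not give the LVS formula, so this is not the authors' LSRecall definition, only an illustration of why a restricted recall can differ sharply from the global one.

```python
def segmentation_scores(truth, pred, region=None):
    """Return (dice, precision, recall) for flat 0/1 masks.

    If `region` is given, only pixels where region == 1 are counted.
    """
    if region is None:
        region = [1] * len(truth)
    tp = fp = fn = 0
    for t, p, r in zip(truth, pred, region):
        if not r:
            continue  # pixel lies outside the region of interest
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return dice, precision, recall


# Toy data (assumed): 9 bright vessel pixels, 1 faint vessel pixel, 10 background.
truth = [1] * 10 + [0] * 10
pred = [1] * 9 + [0] * 11           # the faint vessel pixel is missed
faint = [0] * 9 + [1] + [0] * 10    # hypothetical low-salience mask

print(segmentation_scores(truth, pred))         # global scores
print(segmentation_scores(truth, pred, faint))  # scores on the faint pixel only
```

In this toy case the global recall is 0.9 while the recall restricted to the faint pixel is 0, which is the kind of gap a global score hides.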

Title:

      Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching
  • Authors: Zhuoran Li, Chunming Hu, Junfan Chen, Zhijun Chen, Xiaohui Guo, Richong Zhang
  • Subjects:
    Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Code-switching is a data augmentation scheme mixing words from multiple languages into source lingual text. It has achieved considerable generalization performance on cross-lingual transfer tasks by aligning cross-lingual contextual word representations. However, uncontrolled and over-replaced code-switching would introduce dirty samples into model training. In other words, excessive code-switching text samples will hurt the models' cross-lingual transferability. To this end, we propose a Progressive Code-Switching (PCS) method to gradually generate moderately difficult code-switching examples for the model to discriminate from easy to hard. The idea is to incorporate progressively the preceding learned multilingual knowledge using easier code-switching data to guide model optimization on succeeding harder code-switching data. Specifically, we first design a difficulty measurer to measure the impact of replacing each word in a sentence based on the word relevance score. Then a code-switcher generates the code-switching data of increasing difficulty via a controllable temperature variable. In addition, a training scheduler decides when to sample harder code-switching data for model training. Experiments show our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages.

Title:

      Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation
  • Authors: Zidong Cao, Jinjing Zhu, Weiming Zhang, Lin Wang
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Recently, Depth Anything Model (DAM) - a type of depth foundation model - reveals impressive zero-shot capacity for diverse perspective images. Despite its success, it remains an open question regarding DAM's performance on 360 images that enjoy a large field-of-view (180x360) but suffer from spherical distortions. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. For this, we conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. This way, our benchmark unveils some key findings, e.g., DAM is less effective for diverse 360 scenes and sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset including diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision. Then, we train the student model by uncovering the potential of large-scale unlabeled data with pseudo labels from the teacher model. Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and spatially transformed ones. This subtly improves the student model's robustness to various spatial transformations even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capacity for being a 360 depth foundation model.
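For readers unfamiliar with the transformation named in MTSA, a Möbius transformation maps a complex number z to (az + b) / (cz + d) with ad - bc ≠ 0. The sketch below shows only this general formula and its inverse on plain complex numbers. How the paper applies it to 360 images is not described in the abstract, so the function name and the sample coefficients are assumptions for illustration.

```python
def mobius(z, a, b, c, d):
    """Apply the Möbius transformation z -> (a*z + b) / (c*z + d).

    Requires a*d - b*c != 0. The pole z = -d/c (mapped to infinity) is not
    handled and raises ZeroDivisionError.
    """
    if a * d - b * c == 0:
        raise ValueError("degenerate transformation: a*d - b*c must be non-zero")
    return (a * z + b) / (c * z + d)


# Illustrative coefficients (assumed); here a*d - b*c = 1.
a, b, c, d = 1 + 1j, 2, 0.5, 1 - 1j
z = 0.3 + 0.4j

w = mobius(z, a, b, c, d)
# The inverse map has coefficients (d, -b, -c, a).
z_back = mobius(w, d, -b, -c, a)

print(w)
print(abs(z_back - z) < 1e-12)
```

Because every non-degenerate Möbius transformation is invertible, a transformed sample can in principle be mapped back to the original coordinates, which is convenient when comparing predictions on an input and on its transformed version.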

Title:

      Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing
  • Authors: Xinbo Zhao, Yingxue Zhang, Xin Zhang, Yu Yang, Yiqun Xie, Yanhua Li, Jun Luo
  • Subjects:
    Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Enhancing diverse human decision-making processes in an urban environment is a critical issue across various applications, including ride-sharing vehicle dispatching, public transportation management, and autonomous driving. Offline reinforcement learning (RL) is a promising approach to learn and optimize human urban strategies (or policies) from pre-collected human-generated spatial-temporal urban data. However, standard offline RL faces two significant challenges: (1) data scarcity and data heterogeneity, and (2) distributional shift. In this paper, we introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach. MODA addresses the challenges of data scarcity and heterogeneity in a multi-task urban setting through Contrastive Data Sharing among tasks. This technique involves extracting latent representations of human behaviors by contrasting positive and negative data pairs. It then shares data presenting similar representations with the target task, facilitating data augmentation for each task. Moreover, MODA develops a novel model-based multi-task offline RL algorithm. This algorithm constructs a robust Markov Decision Process (MDP) by integrating a dynamics model with a Generative Adversarial Network (GAN). Once the robust MDP is established, any online RL or planning algorithm can be applied. Extensive experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA. The results demonstrate that MODA exhibits significant improvements compared to state-of-the-art baselines, showcasing its capability in advancing urban decision-making processes. We also made our code available to the research community.

Title:

      Semi Supervised Heterogeneous Domain Adaptation via Disentanglement and Pseudo-Labelling
  • Authors: Cassio F. Dantas (EVERGREEN, INRAE), Raffaele Gaetano (EVERGREEN), Dino Ienco (EVERGREEN)
  • Subjects:
    Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Semi-supervised domain adaptation methods leverage information from a source labelled domain with the goal of generalizing over a scarcely labelled target domain. While this setting already poses challenges due to potential distribution shifts between domains, an even more complex scenario arises when source and target data differ in modality representation (e.g. they are acquired by sensors with different characteristics). For instance, in remote sensing, images may be collected via various acquisition modes (e.g. optical or radar), different spectral characteristics (e.g. RGB or multi-spectral) and spatial resolutions. Such a setting is denoted as Semi-Supervised Heterogeneous Domain Adaptation (SSHDA) and it exhibits an even more severe distribution shift due to modality heterogeneity across domains. To cope with the challenging SSHDA setting, here we introduce SHeDD (Semi-supervised Heterogeneous Domain Adaptation via Disentanglement), an end-to-end neural framework tailored to learning a target domain classifier by leveraging both labelled and unlabelled data from heterogeneous data sources. SHeDD is designed to effectively disentangle domain-invariant representations, relevant for the downstream task, from domain-specific information, that can hinder the cross-modality transfer. Additionally, SHeDD adopts an augmentation-based consistency regularization mechanism that takes advantage of reliable pseudo-labels on the unlabelled target samples to further boost its generalization ability on the target domain. Empirical evaluations on two remote sensing benchmarks, encompassing heterogeneous data in terms of acquisition modes and spectral/spatial resolutions, demonstrate the quality of SHeDD compared to both baseline and state-of-the-art competing approaches. Our code is publicly available here: this https URL

Title:

      Augmenting Query and Passage for Retrieval-Augmented Generation using LLMs for Open-Domain Question Answering
  • Authors: Minsang Kim, Cheoneum Park, Seungjun Baek
  • Subjects:
    Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Retrieval-augmented generation (RAG) has received much attention for Open-domain question-answering (ODQA) tasks as a means to compensate for the parametric knowledge of large language models (LLMs). While previous approaches focused on processing retrieved passages to remove irrelevant context, they still rely heavily on the quality of retrieved passages which can degrade if the question is ambiguous or complex. In this paper, we propose a simple yet efficient method called question and passage augmentation via LLMs for open-domain QA. Our method first decomposes the original questions into multiple-step sub-questions. By augmenting the original question with detailed sub-questions and planning, we are able to make the query more specific on what needs to be retrieved, improving the retrieval performance. In addition, to compensate for the case where the retrieved passages contain distracting information or divided opinions, we augment the retrieved passages with self-generated passages by LLMs to guide the answer extraction. Experimental results show that the proposed scheme outperforms the previous state-of-the-art and achieves significant performance gain over existing RAG methods.

Title:

      Revisiting Modularity Maximization for Graph Clustering: A Contrastive Learning Perspective
  • Authors: Yunfei Liu, Jintang Li, Yuehe Chen, Ruofan Wu, Ericbk Wang, Jing Zhou, Sheng Tian, Shuheng Shen, Xing Fu, Changhua Meng, Weiqiang Wang, Liang Chen
  • Subjects:
    Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Graph clustering, a fundamental and challenging task in graph mining, aims to classify nodes in a graph into several disjoint clusters. In recent years, graph contrastive learning (GCL) has emerged as a dominant line of research in graph clustering and advances the new state-of-the-art. However, GCL-based methods heavily rely on graph augmentations and contrastive schemes, which may introduce challenges such as semantic drift and scalability issues. Another promising line of research involves the adoption of modularity maximization, a popular and effective measure for community detection, as the guiding principle for clustering tasks. Despite the recent progress, the underlying mechanism of modularity maximization is still not well understood. In this work, we dig into the hidden success of modularity maximization for graph clustering. Our analysis reveals the strong connections between modularity maximization and graph contrastive learning, where positive and negative examples are naturally defined by modularity. In light of our results, we propose a community-aware graph clustering framework, coined MAGI, which leverages modularity maximization as a contrastive pretext task to effectively uncover the underlying information of communities in graphs, while avoiding the problem of semantic drift. Extensive experiments on multiple graph datasets verify the effectiveness of MAGI in terms of scalability and clustering performance compared to state-of-the-art graph clustering methods. Notably, MAGI easily scales to a sufficiently large graph with 100M nodes while outperforming strong baselines.
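As background for the quantity being maximized, the sketch below computes the standard Newman modularity Q of a given partition of an undirected, unweighted graph: for each community, the fraction of edges falling inside it minus the fraction expected from its total degree. This is the textbook definition only, not the MAGI objective or training procedure; the function name and the toy graph are assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of nodes in the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (assumed): two disjoint triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {n: "a" for n in range(6)}

print(modularity(edges, split))   # 0.5
print(modularity(edges, merged))  # 0.0
```

Putting each triangle in its own community scores higher than merging everything into one, which is the behaviour a modularity-maximizing clustering exploits.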

Title:

      PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions
  • Authors: Sihan Ma, Jing Zhang, Qiong Cao, Dacheng Tao
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images, which is crucial for various applications such as human-machine interaction, embodied AI, and autonomous driving. While current models show promising results, they are typically trained and tested on clean data, potentially overlooking the corruption during real-world deployment and thus posing safety risks in practical scenarios. To address this issue, we introduce PoseBench, a comprehensive benchmark designed to evaluate the robustness of pose estimation models against real-world corruption. We evaluated 60 representative models, including top-down, bottom-up, heatmap-based, regression-based, and classification-based methods, across three datasets for human and animal pose estimation. Our evaluation involves 10 types of corruption in four categories: 1) blur and noise, 2) compression and color loss, 3) severe lighting, and 4) masks. Our findings reveal that state-of-the-art models are vulnerable to common real-world corruptions and exhibit distinct behaviors when tackling human and animal pose estimation tasks. To improve model robustness, we delve into various design considerations, including input resolution, pre-training datasets, backbone capacity, post-processing, and data augmentations. We hope that our benchmark will serve as a foundation for advancing research in robust pose estimation. The benchmark and source code will be released at this https URL

Title:

      SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
  • Authors: Josef Dai, Tianle Chen, Xuyao Wang, Ziran Yang, Taiye Chen, Jiaming Ji, Yaodong Yang
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    To mitigate the risk of harmful outputs from large vision models (LVMs), we introduce the SafeSora dataset to promote research on aligning text-to-video generation with human values. This dataset encompasses human preferences in text-to-video generation tasks along two primary dimensions: helpfulness and harmlessness. To capture in-depth human preferences and facilitate structured reasoning by crowdworkers, we subdivide helpfulness into 4 sub-dimensions and harmlessness into 12 sub-categories, serving as the basis for pilot annotations. The SafeSora dataset includes 14,711 unique prompts, 57,333 unique videos generated by 4 distinct LVMs, and 51,691 pairs of preference annotations labeled by humans. We further demonstrate the utility of the SafeSora dataset through several applications, including training the text-video moderation model and aligning LVMs with human preference by fine-tuning a prompt augmentation module or the diffusion model. These applications highlight its potential as the foundation for text-to-video alignment research, such as human preference modeling and the development and validation of alignment algorithms.

Title:

      Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation
  • Authors: Eyal Michaeli, Ohad Fried
  • Subjects:
    Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract
    Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on Text2Image generation or Img2Img methods, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data. Code is available at this https URL.
@LeeKyungwook LeeKyungwook self-assigned this Jun 24, 2024