# - Part 01: Initial Data Scraping and Loading 

## 🗒️ This notebook is divided in 4 sections:

1. Scraping the arXiv website for scientific papers using the arXiv API,
2. Performing some basic data cleaning and preprocessing,
3. Connect to the Hopsworks feature store,
4. Create feature groups and upload them to the feature store.

### arXiv Scraping

In this section, we scrape the arXiv website for papers in the category "cs.CV" (Computer Vision), "stat.ML" / "cs.LG" (Machine Learning) and "cs.AI" (Artificial Intelligence). The papers are then saved in a csv file.

In [2]:
import arxiv
import pandas as pd
from tqdm import tqdm

Let's start by defining a list of keywords that we will use to query the arXiv API.

In [3]:
query_keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training",
    "\"generative adversarial networks\"",
    "\"model compressions\"",
    "\"image segmentation\"",
    "\"few-shot learning\"",
    "\"natural language\"",
    "\"graph\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention\"",
    "\"tabular\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series\"",
    "\"molecule\"",
    "\"large language models\"",
    "\"llms\"",
    "\"language models\"",
    "\"image classification\"",
    "\"document image classification\"",
    "\"encoder\"",
    "\"decoder\"",
    "\"multimodal\"",
    "\"multimodal deep learning\"",
]


Afterwards, we define a function that creates a search object using the given query. It sets the maximum number of results for each category to 6000 and sorts them by the last updated date.

In [4]:
client = arxiv.Client(num_retries=20, page_size=500)


def query_with_keywords(query) -> tuple:
    """
    Query the arXiv API for research papers based on a specific query and filter results by selected categories.
    
    Args:
        query (str): The search query to be used for fetching research papers from arXiv.
    
    Returns:
        tuple: A tuple containing three lists - terms, titles, and abstracts of the filtered research papers.
        
            terms (list): A list of lists, where each inner list contains the categories associated with a research paper.
            titles (list): A list of titles of the research papers.
            abstracts (list): A list of abstracts (summaries) of the research papers.
            urls (list): A list of URLs for the papers' detail page on the arXiv website.
    """
    
    # Create a search object with the query and sorting parameters.
    search = arxiv.Search(
        query=query,
        max_results=6000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    
    # Initialize empty lists for terms, titles, abstracts, and urls.
    terms = []
    titles = []
    abstracts = []
    urls = []

    # For each result in the search...
    for res in tqdm(client.results(search), desc=query):
        # Check if the primary category of the result is in the specified list.
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG", "cs.AI"]:
            # If it is, append the result's categories, title, summary, and url to their respective lists.
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
            urls.append(res.entry_id)

    # Return the four lists.
    return terms, titles, abstracts, urls

In [5]:
all_titles = []
all_abstracts = []
all_terms = []
all_urls = []

for query in query_keywords:
    terms, titles, abstracts, urls = query_with_keywords(query)
    all_titles.extend(titles)
    all_abstracts.extend(abstracts)
    all_terms.extend(terms)
    all_urls.extend(urls)

"image segmentation": 0it [00:00, ?it/s]

"image segmentation": 4744it [01:46, 44.43it/s]
"self-supervised learning": 0it [00:02, ?it/s]
"representation learning": 6000it [02:46, 36.14it/s]
"image generation": 5105it [02:19, 36.50it/s]
"object detection": 6000it [02:13, 44.91it/s]
"transfer learning": 6000it [02:11, 45.54it/s]
"transformers": 4501it [01:49, 44.67it/s]Bozo feed; consider handling: document declared as utf-8, but parsed as iso-8859-2
"transformers": 6000it [02:09, 46.47it/s]
"adversarial training: 0it [00:02, ?it/s]
"generative adversarial networks": 6000it [02:03, 48.47it/s]
"model compressions": 1154it [00:22, 50.70it/s]
"image segmentation": 4744it [01:25, 55.55it/s]
"few-shot learning": 0it [00:03, ?it/s]
"natural language": 6000it [02:14, 44.74it/s]
"graph": 6000it [02:08, 46.77it/s]
"colorization": 6000it [02:11, 45.55it/s]
"depth estimation": 2039it [00:45, 45.07it/s]
"point cloud": 6000it [02:09, 46.49it/s]
"structured data": 2810it [01:34, 29.62it/s]
"optical flow": 2087it [00:35, 58.03it/s]
"reinforcem

Now, we create a pandas.DataFrame object to store the results.

In [6]:
arxiv_data = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls
})

In [7]:
arxiv_data_indexed = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls
})

In [8]:
arxiv_data_indexed.reset_index(inplace=True)
arxiv_data_indexed.rename(columns = {'index':'id'}, inplace=True)

### Data Preprocessing


In this part, we preprocess the data collected in the previous section. We start by removing duplicates and then we clean the text by removing punctuation, stopwords and lemmatizing the words.

In [9]:

import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aldir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aldir\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\aldir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [10]:
# Setting pandas option to display the full content of DataFrame columns without truncation
pd.set_option('display.max_colwidth', None)

arxiv_data.head()

Unnamed: 0,titles,abstracts,terms,urls
0,HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization,"Tissue semantic segmentation is one of the key tasks in computational\npathology. To avoid the expensive and laborious acquisition of pixel-level\nannotations, a wide range of studies attempt to adopt the class activation map\n(CAM), a weakly-supervised learning scheme, to achieve pixel-level tissue\nsegmentation. However, CAM-based methods are prone to suffer from\nunder-activation and over-activation issues, leading to poor segmentation\nperformance. To address this problem, we propose a novel weakly-supervised\nsemantic segmentation framework for histopathological images based on\nimage-mixing synthesis and consistency regularization, dubbed HisynSeg.\nSpecifically, synthesized histopathological images with pixel-level masks are\ngenerated for fully-supervised model training, where two synthesis strategies\nare proposed based on Mosaic transformation and B\'ezier mask generation.\nBesides, an image filtering module is developed to guarantee the authenticity\nof the synthesized images. In order to further avoid the model overfitting to\nthe occasional synthesis artifacts, we additionally propose a novel\nself-supervised consistency regularization, which enables the real images\nwithout segmentation masks to supervise the training of the segmentation model.\nBy integrating the proposed techniques, the HisynSeg framework successfully\ntransforms the weakly-supervised semantic segmentation problem into a\nfully-supervised one, greatly improving the segmentation accuracy. Experimental\nresults on three datasets prove that the proposed method achieves a\nstate-of-the-art performance. Code is available at\nhttps://github.com/Vison307/HisynSeg.","[cs.CV, cs.AI]",http://arxiv.org/abs/2412.20924v1
1,Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation,"Accurate segmentation of wind turbine blade (WTB) images is critical for\neffective assessments, as it directly influences the performance of automated\ndamage detection systems. Despite advancements in large universal vision\nmodels, these models often underperform in domain-specific tasks like WTB\nsegmentation. To address this, we extend Intrinsic LoRA for image segmentation,\nand propose a novel dual-space augmentation strategy that integrates both\nimage-level and latent-space augmentations. The image-space augmentation is\nachieved through linear interpolation between image pairs, while the\nlatent-space augmentation is accomplished by introducing a noise-based latent\nprobabilistic model. Our approach significantly boosts segmentation accuracy,\nsurpassing current state-of-the-art methods in WTB image segmentation.","[cs.CV, cs.AI, cs.LG]",http://arxiv.org/abs/2412.20838v1
2,Solar Filaments Detection using Active Contours Without Edges,"In this article, an active contours without edges (ACWE)-based algorithm has\nbeen proposed for the detection of solar filaments in H-alpha full-disk solar\nimages. The overall algorithm consists of three main steps of image processing.\nThese are image pre-processing, image segmentation, and image post-processing.\nHere in the work, contours are initialized on the solar image and allowed to\ndeform based on the energy function. As soon as the contour reaches the\nboundary of the desired object, the energy function gets reduced, and the\ncontour stops evolving. The proposed algorithm has been applied to few\nbenchmark datasets and has been compared with the classical technique of object\ndetection. The results analysis indicates that the proposed algorithm\noutperforms the results obtained using the existing classical algorithm of\nobject detection.","[cs.CV, astro-ph.IM, astro-ph.SR, cs.AI, cs.LG]",http://arxiv.org/abs/2412.20749v1
3,TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation,"While large visual models (LVM) demonstrated significant potential in image\nunderstanding, due to the application of large-scale pre-training, the Segment\nAnything Model (SAM) has also achieved great success in the field of image\nsegmentation, supporting flexible interactive cues and strong learning\ncapabilities. However, SAM's performance often falls short in cross-domain and\nfew-shot applications. Previous work has performed poorly in transferring prior\nknowledge from base models to new applications. To tackle this issue, we\npropose a task-adaptive auto-visual prompt framework, a new paradigm for\nCross-dominan Few-shot segmentation (CD-FSS). First, a Multi-level Feature\nFusion (MFF) was used for integrated feature extraction as prior knowledge.\nBesides, we incorporate a Class Domain Task-Adaptive Auto-Prompt (CDTAP) module\nto enable class-domain agnostic feature extraction and generate high-quality,\nlearnable visual prompts. This significant advancement uses a unique generative\napproach to prompts alongside a comprehensive model structure and specialized\nprototype computation. While ensuring that the prior knowledge of SAM is not\ndiscarded, the new branch disentangles category and domain information through\nprototypes, guiding it in adapting the CD-FSS. Comprehensive experiments across\nfour cross-domain datasets demonstrate that our model outperforms the\nstate-of-the-art CD-FSS approach, achieving an average accuracy improvement of\n1.3\% in the 1-shot setting and 11.76\% in the 5-shot setting.",[cs.CV],http://arxiv.org/abs/2409.05393v2
4,Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation,"Although recent years have witnessed significant advancements in medical\nimage segmentation, the pervasive issue of domain shift among medical images\nfrom diverse centres hinders the effective deployment of pre-trained models.\nMany Test-time Adaptation (TTA) methods have been proposed to address this\nissue by fine-tuning pre-trained models with test data during inference. These\nmethods, however, often suffer from less-satisfactory optimization due to\nsuboptimal optimization direction (dictated by the gradient) and fixed\nstep-size (predicated on the learning rate). In this paper, we propose the\nGradient alignment-based Test-time adaptation (GraTa) method to improve both\nthe gradient direction and learning rate in the optimization procedure. Unlike\nconventional TTA methods, which primarily optimize the pseudo gradient derived\nfrom a self-supervised objective, our method incorporates an auxiliary gradient\nwith the pseudo one to facilitate gradient alignment. Such gradient alignment\nenables the model to excavate the similarities between different gradients and\ncorrect the gradient direction to approximate the empirical gradient related to\nthe current segmentation task. Additionally, we design a dynamic learning rate\nbased on the cosine similarity between the pseudo and auxiliary gradients,\nthereby empowering the adaptive fine-tuning of pre-trained models on diverse\ntest data. Extensive experiments establish the effectiveness of the proposed\ngradient alignment and dynamic learning rate and substantiate the superiority\nof our GraTa method over other state-of-the-art TTA methods on a benchmark\nmedical image segmentation task. The code and weights of pre-trained source\nmodels are available at https://github.com/Chen-Ziyang/GraTa.",[cs.CV],http://arxiv.org/abs/2408.07343v4


In [11]:
arxiv_data_indexed.head()

Unnamed: 0,id,titles,abstracts,terms,urls
0,0,HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization,"Tissue semantic segmentation is one of the key tasks in computational\npathology. To avoid the expensive and laborious acquisition of pixel-level\nannotations, a wide range of studies attempt to adopt the class activation map\n(CAM), a weakly-supervised learning scheme, to achieve pixel-level tissue\nsegmentation. However, CAM-based methods are prone to suffer from\nunder-activation and over-activation issues, leading to poor segmentation\nperformance. To address this problem, we propose a novel weakly-supervised\nsemantic segmentation framework for histopathological images based on\nimage-mixing synthesis and consistency regularization, dubbed HisynSeg.\nSpecifically, synthesized histopathological images with pixel-level masks are\ngenerated for fully-supervised model training, where two synthesis strategies\nare proposed based on Mosaic transformation and B\'ezier mask generation.\nBesides, an image filtering module is developed to guarantee the authenticity\nof the synthesized images. In order to further avoid the model overfitting to\nthe occasional synthesis artifacts, we additionally propose a novel\nself-supervised consistency regularization, which enables the real images\nwithout segmentation masks to supervise the training of the segmentation model.\nBy integrating the proposed techniques, the HisynSeg framework successfully\ntransforms the weakly-supervised semantic segmentation problem into a\nfully-supervised one, greatly improving the segmentation accuracy. Experimental\nresults on three datasets prove that the proposed method achieves a\nstate-of-the-art performance. Code is available at\nhttps://github.com/Vison307/HisynSeg.","[cs.CV, cs.AI]",http://arxiv.org/abs/2412.20924v1
1,1,Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation,"Accurate segmentation of wind turbine blade (WTB) images is critical for\neffective assessments, as it directly influences the performance of automated\ndamage detection systems. Despite advancements in large universal vision\nmodels, these models often underperform in domain-specific tasks like WTB\nsegmentation. To address this, we extend Intrinsic LoRA for image segmentation,\nand propose a novel dual-space augmentation strategy that integrates both\nimage-level and latent-space augmentations. The image-space augmentation is\nachieved through linear interpolation between image pairs, while the\nlatent-space augmentation is accomplished by introducing a noise-based latent\nprobabilistic model. Our approach significantly boosts segmentation accuracy,\nsurpassing current state-of-the-art methods in WTB image segmentation.","[cs.CV, cs.AI, cs.LG]",http://arxiv.org/abs/2412.20838v1
2,2,Solar Filaments Detection using Active Contours Without Edges,"In this article, an active contours without edges (ACWE)-based algorithm has\nbeen proposed for the detection of solar filaments in H-alpha full-disk solar\nimages. The overall algorithm consists of three main steps of image processing.\nThese are image pre-processing, image segmentation, and image post-processing.\nHere in the work, contours are initialized on the solar image and allowed to\ndeform based on the energy function. As soon as the contour reaches the\nboundary of the desired object, the energy function gets reduced, and the\ncontour stops evolving. The proposed algorithm has been applied to few\nbenchmark datasets and has been compared with the classical technique of object\ndetection. The results analysis indicates that the proposed algorithm\noutperforms the results obtained using the existing classical algorithm of\nobject detection.","[cs.CV, astro-ph.IM, astro-ph.SR, cs.AI, cs.LG]",http://arxiv.org/abs/2412.20749v1
3,3,TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation,"While large visual models (LVM) demonstrated significant potential in image\nunderstanding, due to the application of large-scale pre-training, the Segment\nAnything Model (SAM) has also achieved great success in the field of image\nsegmentation, supporting flexible interactive cues and strong learning\ncapabilities. However, SAM's performance often falls short in cross-domain and\nfew-shot applications. Previous work has performed poorly in transferring prior\nknowledge from base models to new applications. To tackle this issue, we\npropose a task-adaptive auto-visual prompt framework, a new paradigm for\nCross-dominan Few-shot segmentation (CD-FSS). First, a Multi-level Feature\nFusion (MFF) was used for integrated feature extraction as prior knowledge.\nBesides, we incorporate a Class Domain Task-Adaptive Auto-Prompt (CDTAP) module\nto enable class-domain agnostic feature extraction and generate high-quality,\nlearnable visual prompts. This significant advancement uses a unique generative\napproach to prompts alongside a comprehensive model structure and specialized\nprototype computation. While ensuring that the prior knowledge of SAM is not\ndiscarded, the new branch disentangles category and domain information through\nprototypes, guiding it in adapting the CD-FSS. Comprehensive experiments across\nfour cross-domain datasets demonstrate that our model outperforms the\nstate-of-the-art CD-FSS approach, achieving an average accuracy improvement of\n1.3\% in the 1-shot setting and 11.76\% in the 5-shot setting.",[cs.CV],http://arxiv.org/abs/2409.05393v2
4,4,Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation,"Although recent years have witnessed significant advancements in medical\nimage segmentation, the pervasive issue of domain shift among medical images\nfrom diverse centres hinders the effective deployment of pre-trained models.\nMany Test-time Adaptation (TTA) methods have been proposed to address this\nissue by fine-tuning pre-trained models with test data during inference. These\nmethods, however, often suffer from less-satisfactory optimization due to\nsuboptimal optimization direction (dictated by the gradient) and fixed\nstep-size (predicated on the learning rate). In this paper, we propose the\nGradient alignment-based Test-time adaptation (GraTa) method to improve both\nthe gradient direction and learning rate in the optimization procedure. Unlike\nconventional TTA methods, which primarily optimize the pseudo gradient derived\nfrom a self-supervised objective, our method incorporates an auxiliary gradient\nwith the pseudo one to facilitate gradient alignment. Such gradient alignment\nenables the model to excavate the similarities between different gradients and\ncorrect the gradient direction to approximate the empirical gradient related to\nthe current segmentation task. Additionally, we design a dynamic learning rate\nbased on the cosine similarity between the pseudo and auxiliary gradients,\nthereby empowering the adaptive fine-tuning of pre-trained models on diverse\ntest data. Extensive experiments establish the effectiveness of the proposed\ngradient alignment and dynamic learning rate and substantiate the superiority\nof our GraTa method over other state-of-the-art TTA methods on a benchmark\nmedical image segmentation task. The code and weights of pre-trained source\nmodels are available at https://github.com/Chen-Ziyang/GraTa.",[cs.CV],http://arxiv.org/abs/2408.07343v4


In [12]:
print(f"There are {len(arxiv_data_indexed)} rows in the dataset.")

There are 83939 rows in the dataset.


Real-world data is noisy. One of the most commonly observed source of noise is data duplication. Here we notice that our initial dataset has got about 20k duplicate entries.

In [13]:
total_duplicate_titles = sum(arxiv_data_indexed["titles"].duplicated())
print(f"There are {total_duplicate_titles} duplicate titles.")

There are 23376 duplicate titles.



Before proceeding further, we drop these entries.

In [14]:
arxiv_data_indexed = arxiv_data_indexed[~arxiv_data_indexed["titles"].duplicated()]
print(f"There are {len(arxiv_data_indexed)} rows in the deduplicated dataset.")

There are 60563 rows in the deduplicated dataset.


### Connecting to the Hopsworks Feature Store

Before creating a feature group, we need to connect to Hopsworks feature store.

In [3]:
from dotenv import load_dotenv
import os
import streamlit as st

In [4]:
# Load hopsworks API key from .env file or secrets.toml file
load_dotenv()

try:
    HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']
    # HOPSWORKS_API_KEY = st.secrets.HOPSWORKS.HOPSWORKS_API_KEY
except:
    raise Exception('Set environment variable HOPSWORKS_API_KEY')

In [8]:
import hopsworks
# import hsfs

# try:
#     project = hopsworks.login()
#     # connection = project.connection(
#     #     host='c.app.hopsworks.ai',
#     #     project='paperrecommendation',
#     #     api_key_value=HOPSWORKS_API_KEY,
#     # )
#     fs = project.get_feature_store()
#     print("Connected to the Hopsworks Feature Store")
# except Exception as e:
#     print(f"An error occurred: {e}")

try:
    project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
    print("Connected to the Hopsworks project")
    
    fs = project.get_feature_store()
    print("Connected to the Hopsworks Feature Store")
except Exception as e:
    print(f"An error occurred: {e}")

AttributeError: module 'hsfs' has no attribute 'hopsworks_udf'

### Creating feature groups and uploading them to the Feature Store

A feature group can be seen as a collection of conceptually related features. In this case, we will create 1 feature group representing the scientific paper information.

In [18]:
paper_info_fg = fs.get_or_create_feature_group(
    name="papers_info",
    version=1,
    description="Scientific papers info for recommendations.",
    primary_key=['id'],
)

NameError: name 'fs' is not defined

At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent, we need to populate it with its associated data using the `insert` function.

In [None]:
try:
    paper_info_fg.insert(arxiv_data_indexed, overwrite=True)
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
feature_descriptions = [
    {"name": "id", "description": "Scientific paper IDs"}, 
    {"name": "titles", "description": "Scientific paper titles"}, 
    {"name": "abstracts", "description": "Scientific paper abstracts"}, 
    {"name": "terms", "description": "Scientific paper categories"}, 
    {"name": "urls", "description": "URLs to scientific paper detail pages"}, 
]

for desc in feature_descriptions: 
    paper_info_fg.update_feature_description(desc["name"], desc["description"])

The feature group is now accessible and searchable in the UI