# Arxiv analysis for topic detection

The goal of the document is to analyze the database https://huggingface.co/datasets/real-jiakai/arxiver-with-category

The document is structured as following:

0) Load Prerequisites
1) Gaining Insights on the Database
2) Exploring the Relevant Pre-processing for Feeding a Machine Learning Model
3) Exploring a Few Hyperparameters
4) Training a Deep Learning model with Transformers/HuggingFace/PyTorch
5) Prompt engineering of OpenAI API

==> Files (stored in $MODEL_PATH) and inference codes of this document are integrated in the gradio code (ui.py).


In [1]:
# Change global variables to your environment
MODEL_PATH = "/home/pierrick/project/ARXIV/models/"
OPENAI_API_KEY = "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"

# Load Prerequisites

## Some imports for loading dataset / english

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from collections import Counter

import re
import nltk
from nltk.corpus import words

  from .autonotebook import tqdm as notebook_tqdm


## Load the dataset in memory

In [3]:
# Load the dataset from Hugging Face
dataset = load_dataset("real-jiakai/arxiver-with-category")

# Convert dataset to a Pandas DataFrame
df = pd.DataFrame(dataset['train'])  # Using the 'train' split

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'abstract', 'authors', 'published_date', 'link', 'markdown', 'primary_category', 'categories'],
        num_rows: 63357
    })
})

In [5]:
len(df)

63357

## Load english

In [6]:
# Download word list if not available
nltk.download("words")

# Load a set of valid English words for quick lookup
english_words = set(words.words())
print("# words:", len(english_words))

# words: 235892


[nltk_data] Downloading package words to /home/pierrick/nltk_data...
[nltk_data]   Package words is already up-to-date!


## My utilities for processing english

In [7]:
def clean_english(text): # <-------------- interpretable but create an accuracy loss. So i don t use it anymore (from 50% to 43% on titles)
    text=text.replace("-", " ") #  usefull in many situations ex: "zero-shot", "discrete-event"
    
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenize: split by whitespace
    tokens = cleaned_text.lower().split()
    
    # Filter: Keep only words in the English dictionary
    english_text = [word for word in tokens if word in english_words]
    return " ".join(english_text)

def ratio_english_words(text):
    text=text.replace("-", " ")
    num_start=len(text.split())
    num_after=len(clean_english(text).split())

    if num_start==0:
        return 1.
    ratio = num_after/float(num_start)
    return min(ratio, 1.)


# Testing my functions
raw_text = "   Hello 4ll ! I am a self-instruct robot. ##### 404 "
print("Raw text from document:", raw_text)
print("Ratio of clean text:", ratio_english_words(raw_text))
print("Clean text `clean_english`:", clean_english(raw_text))



Raw text from document:    Hello 4ll ! I am a self-instruct robot. ##### 404 
Ratio of clean text: 0.6363636363636364
Clean text `clean_english`: hello i am a self instruct robot


# Data quality check before training ML

## Simple insights

In [8]:
def title(txt):
    print(f"\033[31m \033[1m *** {txt.upper()} *** \033[0m \033[0m")

title("PANDAS INFO")
print(df.info())

title("FIRST 5 rows")
print(df.head(5))

title("MISSING VALUE COUNT")
missing_values = df.isnull().sum()
print("Missing Values:", missing_values)


title("ENGLISH WORDS RATIO")
text_columns = ["title", "abstract", "markdown"]
for col in text_columns:
    stats = df[col].apply(lambda v: ratio_english_words(v[:1000])).describe()
    print(f"Column '{col}' :")
    print(stats)


title("INTERESTING FIELD : CHARA. LENGTH")
text_columns = ["title", "abstract", "markdown"]
for col in text_columns:
    stats = df[col].str.len().describe()
    print(col, ": ")
    print(stats)

title("RIMARY CATEG. BALANCE")
primary_freq=Counter(df["primary_category"])
print(primary_freq)
print("#Primary category: ", len(primary_freq))

title("SUPER PRIMARY CATEG. BALANCE")
df["primary_category_super"]=df["primary_category"].apply(lambda v: v.split(".")[0])
super_primary_freq=Counter(df["primary_category_super"])
print(super_primary_freq)
print("#Super category: ", len(super_primary_freq))


[31m [1m *** PANDAS INFO *** [0m [0m
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63357 entries, 0 to 63356
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id                63357 non-null  object        
 1   title             63357 non-null  object        
 2   abstract          63357 non-null  object        
 3   authors           63357 non-null  object        
 4   published_date    63357 non-null  datetime64[ns]
 5   link              63357 non-null  object        
 6   markdown          63357 non-null  object        
 7   primary_category  63357 non-null  object        
 8   categories        63357 non-null  object        
dtypes: datetime64[ns](1), object(8)
memory usage: 4.4+ MB
None
[31m [1m *** FIRST 5 ROWS *** [0m [0m
           id                                              title  \
0  2302.12141  Characterizing the nucleus of comet 162P/Sidin...   
1  2301.08449  

# Processing

In [9]:
df['markdown'] = df['markdown'].str[:10000] # Free memory, keep the beginning (should contain the introduction)
df["title_plus_abstract"] = df["title"] + " " + df["abstract"]
df["title_plus_abstract_plus_intro"] = df["title"] + " " + df["abstract"] + " " + df["markdown"]

# Class Balancing

Many questions occur:
* How many data samples per class for balancing between granularity ?
* What is the best field to predict the class (title, abstract, markdown) ?
* Is there any low hanging fruit ? (Example: open source model doing it already)

In [10]:
threshold=100
reject_class="other"

df["balanced_category"] = df["primary_category"]
for category, count in primary_freq.items():
    if count < threshold:
        # retrieve the rare category and use the super_category label instead
        super_category = df.loc[df["primary_category"] == category, "primary_category_super"]
        df.loc[df["primary_category"] == category, "balanced_category"] = super_category

title("BALANCED CATEG. FREQUENCY : FROM SMALL NODE TO PARENT NODE")
balanced_freq = Counter(df["balanced_category"])
print(balanced_freq)

# WARNING: after this aggregation the following case may occur:
# "cond-mat.other"  is below the threshold and other cond-mat.XXX classes are beyond the threshold. 
# It creates a situation where we are only renaming the class from "cond-mat.other" to "cond-mat".


title("BALANCED CATEG. FREQUENCY 2 : FROM NODE TO REJECT NODE")
for category, count in balanced_freq.items():
    if count < threshold:
        df.loc[df["balanced_category"] == category] = reject_class
balanced_freq2 = Counter(df["balanced_category"])
print(balanced_freq2)
print("#classes :", len(balanced_freq2))
# We have now the guarantee to have 0 or 1 class below the threshold named "other"

[31m [1m *** BALANCED CATEG. FREQUENCY : FROM SMALL NODE TO PARENT NODE *** [0m [0m
Counter({'cs.CV': 5448, 'cs.LG': 5351, 'cs.CL': 4321, 'quant-ph': 2930, 'hep-ph': 1642, 'cs.RO': 1632, 'cs.AI': 1196, 'gr-qc': 1187, 'astro-ph.GA': 1138, 'cs.CR': 1122, 'cond-mat.mtrl-sci': 1041, 'math.CO': 1031, 'astro-ph.HE': 961, 'cond-mat.mes-hall': 960, 'hep-th': 918, 'math.AP': 914, 'eess.SY': 892, 'eess.IV': 864, 'astro-ph.SR': 794, 'physics.optics': 774, 'math.OC': 749, 'cs.SE': 742, 'cond-mat.str-el': 729, 'math.NA': 728, 'cs.HC': 708, 'math.NT': 702, 'physics.flu-dyn': 692, 'eess.SP': 689, 'stat.ME': 668, 'math.PR': 604, 'astro-ph.EP': 592, 'astro-ph.CO': 591, 'math.AG': 572, 'cond-mat.stat-mech': 567, 'cs.IR': 548, 'stat.ML': 542, 'cs.IT': 538, 'cond-mat.soft': 533, 'math.DG': 488, 'cs.SD': 471, 'math': 469, 'cs': 449, 'cs.DC': 445, 'math.DS': 434, 'cs.NI': 433, 'cs.DS': 424, 'cs.CY': 413, 'nucl-th': 412, 'physics': 403, 'cs.LO': 397, 'eess.AS': 394, 'cond-mat.supr-con': 381, 'astro-ph.IM

  df.loc[df["balanced_category"] == category] = reject_class


# ML Experiments

In [12]:
RS=42 # random seed

import warnings
import os
# Suppress the sklearn FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning)

## Evaluate different data sources

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, top_k_accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
import time

def ml_exp(X,y,model):
    # 3. Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=RS)
    

    # 5. Train the model
    model.fit(X_train, y_train)
    
    # 6. Make predictions on the test set
    start_time=time.time()
    y_test_pred = model.predict(X_test)
    y_test_pred_proba = model.predict_proba(X_test)  # Get probability scores
    enlapsed_time=time.time()-start_time
    
    # 7. Evaluate the model
    accuracy = accuracy_score(y_test, y_test_pred)
    top5_accuracy = top_k_accuracy_score(y_test, y_test_pred_proba, k=5)
    f1 = f1_score(y_test, y_test_pred, average="weighted")  # Weighted handles imbalance
    result = {"acc": accuracy, "top5": top5_accuracy, "f1":f1, "inftime": enlapsed_time}
    return result

def display_exp(result:dict[str,float]):
    accuracy=result["acc"]
    top5_accuracy=result["top5"]
    f1=result["f1"]
    enlapsed_time=result["inftime"]
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Top-5 Accuracy: {top5_accuracy:.4f}")
    print(f"  F1-Score (Weighted): {f1:.4f}")
    print(f"  Inference time (sec): {enlapsed_time:.4f}")
    

# 4. TF-IDF Vectorization + Logistic Regression Pipeline
def get_model():
    model = make_pipeline(
        TfidfVectorizer(stop_words='english', max_features=10000),  # TF-IDF vectorization with stop words removed
        LogisticRegression(max_iter=1000, multi_class='ovr', random_state=RS, n_jobs=-1)  # Logistic Regression model
    )
    return model

In [14]:
for data_source in ["title", "abstract", "markdown", "title_plus_abstract", "title_plus_abstract_plus_intro"]:
    # Warning: slow with "markdown"
    
    X = df[data_source]  # features to predict balanced_category
    y = df["balanced_category"]  # Target variable

    model = get_model()
    exp_result = ml_exp(X,y, model)
    title(f"Data Source: {data_source}")
    display_exp(exp_result)

[31m [1m *** DATA SOURCE: TITLE *** [0m [0m
  Accuracy: 0.5120
  Top-5 Accuracy: 0.7783
  F1-Score (Weighted): 0.4735
  Inference time (sec): 0.2118
[31m [1m *** DATA SOURCE: ABSTRACT *** [0m [0m
  Accuracy: 0.6394
  Top-5 Accuracy: 0.9072
  F1-Score (Weighted): 0.6116
  Inference time (sec): 0.7125
[31m [1m *** DATA SOURCE: MARKDOWN *** [0m [0m
  Accuracy: 0.6533
  Top-5 Accuracy: 0.9146
  F1-Score (Weighted): 0.6275
  Inference time (sec): 5.8566
[31m [1m *** DATA SOURCE: TITLE_PLUS_ABSTRACT *** [0m [0m
  Accuracy: 0.6441
  Top-5 Accuracy: 0.9138
  F1-Score (Weighted): 0.6168
  Inference time (sec): 0.7842
[31m [1m *** DATA SOURCE: TITLE_PLUS_ABSTRACT_PLUS_INTRO *** [0m [0m
  Accuracy: 0.6659
  Top-5 Accuracy: 0.9211
  F1-Score (Weighted): 0.6421
  Inference time (sec): 6.2802


## Evaluate cleaning

In [15]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Function to remove stop words
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))  # Set of English stopwords
    words = text.split()  # Split the text into individual words
    filtered_words = [word for word in words if word.lower() not in stop_words]  # Remove stopwords
    return ' '.join(filtered_words)  # Join the remaining words into a cleaned text

raw_text = "   Hello 4ll ! I am a self-instruct robot. ##### 404 "
print("Raw text from document:", raw_text)
print("Clean text `clean_english`:", clean_english(raw_text))
print("Clean text `remove_stopwords`:", remove_stopwords(raw_text))

Raw text from document:    Hello 4ll ! I am a self-instruct robot. ##### 404 
Clean text `clean_english`: hello i am a self instruct robot
Clean text `remove_stopwords`: Hello 4ll ! self-instruct robot. ##### 404


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pierrick/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
data_source="abstract"
df["clean1"] = df[data_source].apply(lambda v: clean_english(v))

df["clean2"] = df[data_source].apply(lambda v: remove_stopwords(v))

for source_name in [data_source, "clean1", "clean2"]:
    title(f"CLEANING STRATEGY: {source_name}")
    X = df[source_name]  # features to predict balanced_category
    y = df["balanced_category"]  # Target variable
    model = get_model()
    res = ml_exp(X,y,model)
    display_exp(res)

# Conclusion: The cleaning functions are not used afterwards because it does not really improve the accuracy

[31m [1m *** CLEANING STRATEGY: ABSTRACT *** [0m [0m
  Accuracy: 0.6394
  Top-5 Accuracy: 0.9072
  F1-Score (Weighted): 0.6116
  Inference time (sec): 0.8147
[31m [1m *** CLEANING STRATEGY: CLEAN1 *** [0m [0m
  Accuracy: 0.5956
  Top-5 Accuracy: 0.8832
  F1-Score (Weighted): 0.5661
  Inference time (sec): 0.5655
[31m [1m *** CLEANING STRATEGY: CLEAN2 *** [0m [0m
  Accuracy: 0.6402
  Top-5 Accuracy: 0.9069
  F1-Score (Weighted): 0.6128
  Inference time (sec): 0.5980


## Evaluate summarization

In [17]:
from transformers import pipeline

# I put on the CPU for avoiding tone of cuda warnings and it seems not slower
summarizer_model="t5-small"
#summarizer_model="facebook/bart-large-cnn" # good but slow
summarizer = pipeline("summarization", model=summarizer_model, device=-1) 

text = """
Hello 4ll ! I am a self-instruct robot. ##### 404  . I will tell you a story about a fox I have heard once.
The quick brown fox jumped over the lazy dog! But wait!! There's more: the fox didn't stop at the dog. The fox... bla bla.
"""

def summary_and_clean(text):
    max_length=100 # max #words
    min_length=50
    chucnk_size=512
    if len(text.split()) > chucnk_size:  # If it's too long, split into chunks
        chunks = [text[i:i + chucnk_size] for i in range(0, len(text), chucnk_size)]
        summaries = []
        for chunk in chunks:
            summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
            summaries.append(summary[0]['summary_text'])
        return ' '.join(summaries)  # Join all chunk summaries
    else:
        # Otherwise summarize normally
        summary = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
        return summary[0]['summary_text']
    return out

summarized=summary_and_clean(text)
print("Summary:", summarized, " #words in:", len(text), " #words out:", len(summarized))

Device set to use cpu
Your max_length is set to 100, but your input_length is only 82. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)


Summary: the quick brown fox jumped over the lazy dog! but wait!! There's more: the fox didn't stop at the dog . bla bla is a self-instructing robot. it's not a robot .  #words in: 232  #words out: 159


In [18]:
data_source = "abstract"

print("model training ...")
N = 3 # <--- It takes several minutes with N=3 !!! 
X = df["abstract"][N:]
y = df["balanced_category"][N:]
model = get_model()
model.fit(X, y)

print("summarization ...") # /!\ SLOW !!!
X2 = []
for i in range(N):
    markdown = df["markdown"][i]
    summarized_markdown = summary_and_clean(markdown)
    X2.append(summarized_markdown)


print("test markdown vs summarized_markdown")
num_correct_markdown = 0
num_correct_summarized = 0
for i in range(N):
    correct_class = df["balanced_category"][i]
    markdown = df["markdown"][i]
    markdown_summarized = X2[i]
    
    # Predict with the original markdown
    out1 = model.predict([markdown])[0]
    
    # Predict with the summarized markdown
    out2 = model.predict([markdown_summarized])[0]
    
    # Check if predictions match the correct class
    if out1 == correct_class:
        num_correct_markdown += 1
    if out2 == correct_class:
        num_correct_summarized += 1

print("num_correct_markdown: ", num_correct_markdown)
print("num_correct_summarized: ", num_correct_summarized)



model training ...
summarization ...


Your max_length is set to 100, but your input_length is only 68. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=34)
Your max_length is set to 100, but your input_length is only 82. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)


test markdown vs summarized_markdown
num_correct_markdown:  1
num_correct_summarized:  2


## Evaluate different ML algo (default hyperparameters)

In [19]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, top_k_accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import pandas as pd

X = df["abstract"] # <--- abstract is enough for quick experimentations
y = df["balanced_category"]

# 4. Model Definitions
tfidf=TfidfVectorizer(stop_words='english', max_features=10000)
models = {
    "LogisticRegression": make_pipeline(tfidf, LogisticRegression(max_iter=1000, multi_class='ovr', random_state=RS, n_jobs=-1)),
    "RandomForest": make_pipeline(tfidf, RandomForestClassifier(max_depth=50,random_state=RS, n_jobs=-1)),
    "MLP": make_pipeline(tfidf, MLPClassifier(max_iter=10, random_state=RS))
}

for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    res = ml_exp(X,y, model)
    display_exp(res)

# Conclusion MLP: good ratio accuracy vs time

Evaluating LogisticRegression...
  Accuracy: 0.6394
  Top-5 Accuracy: 0.9072
  F1-Score (Weighted): 0.6116
  Inference time (sec): 1.1218
Evaluating RandomForest...
  Accuracy: 0.4998
  Top-5 Accuracy: 0.7792
  F1-Score (Weighted): 0.4284
  Inference time (sec): 1.3379
Evaluating MLP...




  Accuracy: 0.6667
  Top-5 Accuracy: 0.9339
  F1-Score (Weighted): 0.6594
  Inference time (sec): 0.8981


## Hyperparameter tuning

In [20]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, top_k_accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
import numpy as np

# Assuming df is your dataframe and the target column is 'balanced_category'
X = df["abstract"]
y = df["balanced_category"]

# MLP Model Pipeline
mlp_model = make_pipeline(TfidfVectorizer(stop_words='english', max_features=10000), 
                          MLPClassifier(max_iter=1, random_state=42))

# Define the hyperparameter distributions to sample from
param_dist = {
    'mlpclassifier__hidden_layer_sizes': [(100,), (200,), (400,), (100, 100)],
    'mlpclassifier__alpha': [0., 0.0001, 0.001, 0.01, 0.1],
    'mlpclassifier__learning_rate': ['constant', 'invscaling', 'adaptive'],
    'mlpclassifier__learning_rate_init': [0.001, 0.01, 0.1],
}

# Set up RandomizedSearchCV to sample 200 combinations
random_search = RandomizedSearchCV(estimator=mlp_model, param_distributions=param_dist, 
                                   n_iter=10, cv=2, n_jobs=-1, verbose=1, random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X, y)

# Best parameters found by RandomizedSearchCV
print("Best parameters found: ", random_search.best_params_)
best_model = random_search.best_estimator_
print(best_model)

Fitting 2 folds for each of 10 candidates, totalling 20 fits




Best parameters found:  {'mlpclassifier__learning_rate_init': 0.01, 'mlpclassifier__learning_rate': 'constant', 'mlpclassifier__hidden_layer_sizes': (400,), 'mlpclassifier__alpha': 0.0}
Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(max_features=10000, stop_words='english')),
                ('mlpclassifier',
                 MLPClassifier(alpha=0.0, hidden_layer_sizes=(400,),
                               learning_rate_init=0.01, max_iter=1,
                               random_state=42))])




## FINAL TRAINING!!!

In [21]:
# Get the best estimator
best_model = random_search.best_estimator_
best_model.named_steps['mlpclassifier'].max_iter = 1 # <--- does not improve
X = df["markdown"]
y = df["balanced_category"]
res = ml_exp(X,y, best_model)
display_exp(res)



  Accuracy: 0.6769
  Top-5 Accuracy: 0.9315
  F1-Score (Weighted): 0.6679
  Inference time (sec): 5.8001


# Model saving / restoring

## Saving it

In [22]:
# Save the model (Pickle)
import pickle
with open(f"{MODEL_PATH}/model.pkl", 'wb') as f:
    pickle.dump(best_model, f)

# Save the class names
class_names = y.unique()  # Get the class names
with open(f"/{MODEL_PATH}/class_names.pkl", 'wb') as f:
    pickle.dump(class_names, f)

## Restoring it (code to use in gradio)

In [23]:
with open(f"/{MODEL_PATH}/model.pkl", 'rb') as f:
    loaded_model = pickle.load(f)

with open(f"/{MODEL_PATH}/class_names.pkl", 'rb') as f:
    class_names = pickle.load(f)

## Using it (code to use in gradio)

In [24]:
# Take some data
text=df["abstract"][0]
label=df["balanced_category"][0]
print(text)
print("Expected class:", label)

Comet 162P/Siding Spring is a large Jupiter-family comet with extensive
archival lightcurve data. We report new r-band nucleus lightcurves for this
comet, acquired in 2018, 2021 and 2022. With the addition of these lightcurves,
the phase angles at which the nucleus has been observed range from $0.39^\circ$
to $16.33^\circ$. We absolutely-calibrate the comet lightcurves to r-band
Pan-STARRS 1 magnitudes, and use these lightcurves to create a convex shape
model of the nucleus by convex lightcurve inversion. The best-fitting shape
model for 162P has axis ratios $a/b = 1.56$ and $b/c = 2.33$, sidereal period
$P = 32.864\pm0.001$ h, and a rotation pole oriented towards ecliptic longitude
$\lambda_E = 118^\circ \pm 26^\circ$ and latitude
$\beta_E=-50^\circ\pm21^\circ$. We constrain the possible nucleus elongation to
lie within $1.4 < a/b < 2.0$ and discuss tentative evidence that 162P may have
a bilobed structure. Using the shape model to correct the lightcurves for
rotational effects, we de

In [25]:
predicted_probs = loaded_model.predict_proba([text])
top_5_indices = predicted_probs.argsort()[0][-5:][::-1]  # Get the indices of top 5 predictions

top_5_classes = class_names[top_5_indices]  # Get the corresponding class names
top_5_probs = predicted_probs[0][top_5_indices]  # Get the corresponding probabilities

# Print the top 5 predicted classes and their probabilities
strings_out=[]
for cls, prob in zip(top_5_classes, top_5_probs):
    strings_out.append(f"{cls}: {prob*100:.2f}%")
out="\n".join(strings_out)
print(out)

physics.plasm-ph: 93.20%
cond-mat.supr-con: 1.10%
cs.SI: 0.54%
math.AP: 0.48%
cs.CY: 0.46%


# DEEP LEARNING

## Training preparation

In [11]:
import os
#os.environ["CUDA_LAUNCH_BLOCKING"]="1"

import torch
print(torch.cuda.is_available())  # Should be True
print(torch.cuda.device_count())  # Should be > 0
print(torch.cuda.get_device_name(0))  # Should print GPU name

True
1
NVIDIA RTX A1000 6GB Laptop GPU


In [12]:
from evaluate import load  # Correct way to load metrics in Hugging Face
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Load evaluation metrics
accuracy_metric = load("accuracy")
f1_metric = load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # Get top-1 prediction

    # Compute accuracy
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)

    # Compute F1-score (for classification)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")

    # Compute Top-5 accuracy (for multi-class classification)
    top5_acc = top_k_accuracy_score(labels, logits, k=5, labels=np.arange(logits.shape[1]))

    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"],
        "top5_accuracy": top5_acc,
    }


In [26]:
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# ===========================
# LOAD DATASET
# ===========================
X = df["abstract"].tolist()
y = df["balanced_category"].tolist()
batch_size=64

# ===========================
# CONVERT LABELS TO NUMBERS
# ===========================
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # Convert text labels to integers
num_labels = len(label_encoder.classes_)  # Number of unique classes

# Save mapping (optional)
label_mapping = {index: label for index, label in enumerate(label_encoder.classes_)}
print("Label Mapping:", label_mapping)

# ===========================
# TOKENIZATION (USING BERT TOKENIZER)
# ===========================
#MODEL_NAME = "bert-base-uncased"  # Pretrained BERT model . Too large
MODEL_NAME = "prajjwal1/bert-tiny"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Convert text into tokenized format
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Convert dataset into Hugging Face Dataset format
df_hf = pd.DataFrame({"text": X, "label": y_encoded})
dataset = Dataset.from_pandas(df_hf)

# Apply tokenization
dataset = dataset.map(tokenize_function, batched=True)

# ===========================
# TRAIN-TEST SPLIT
# ===========================
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

# ===========================
# LOAD BERT CLASSIFIER
# ===========================
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

# ===========================
# TRAINING PARAMETERS (Optimized for CPU)
# ===========================
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=batch_size,  
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5, # <--- tune it. 10 -> 20 minutes
    logging_dir=f"/{MODEL_PATH}/pytorch_logs",
    output_dir=f"/{MODEL_PATH}/pytorch_checkpoint",
    logging_steps=10,
    learning_rate=0.001,
    weight_decay=0.01,
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics # custom metrics
)

Label Mapping: {0: np.str_('astro-ph.CO'), 1: np.str_('astro-ph.EP'), 2: np.str_('astro-ph.GA'), 3: np.str_('astro-ph.HE'), 4: np.str_('astro-ph.IM'), 5: np.str_('astro-ph.SR'), 6: np.str_('cond-mat.dis-nn'), 7: np.str_('cond-mat.mes-hall'), 8: np.str_('cond-mat.mtrl-sci'), 9: np.str_('cond-mat.quant-gas'), 10: np.str_('cond-mat.soft'), 11: np.str_('cond-mat.stat-mech'), 12: np.str_('cond-mat.str-el'), 13: np.str_('cond-mat.supr-con'), 14: np.str_('cs'), 15: np.str_('cs.AI'), 16: np.str_('cs.AR'), 17: np.str_('cs.CC'), 18: np.str_('cs.CE'), 19: np.str_('cs.CL'), 20: np.str_('cs.CR'), 21: np.str_('cs.CV'), 22: np.str_('cs.CY'), 23: np.str_('cs.DB'), 24: np.str_('cs.DC'), 25: np.str_('cs.DL'), 26: np.str_('cs.DS'), 27: np.str_('cs.FL'), 28: np.str_('cs.GR'), 29: np.str_('cs.GT'), 30: np.str_('cs.HC'), 31: np.str_('cs.IR'), 32: np.str_('cs.IT'), 33: np.str_('cs.LG'), 34: np.str_('cs.LO'), 35: np.str_('cs.MA'), 36: np.str_('cs.NE'), 37: np.str_('cs.NI'), 38: np.str_('cs.PL'), 39: np.str_('

Map: 100%|███████████████████████| 63357/63357 [00:12<00:00, 5141.35 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Training

Takes ~20 minutes on with my Nvidia A1000 6G GPU Laptop

In [27]:
trainer.train()
results = trainer.evaluate()
print("Evaluation Results:", results)

Epoch,Training Loss,Validation Loss,Accuracy,F1,Top5 Accuracy
1,1.8396,1.680987,0.54072,0.488678,0.846275
2,1.3734,1.467601,0.593434,0.567939,0.875631
3,0.9804,1.421646,0.610322,0.593514,0.882734
4,0.6894,1.457634,0.611427,0.600043,0.882891
5,0.5699,1.515708,0.612847,0.603271,0.881155


Evaluation Results: {'eval_loss': 1.4216464757919312, 'eval_accuracy': 0.6103219696969697, 'eval_f1': 0.5935139287044443, 'eval_top5_accuracy': 0.8827335858585859, 'eval_runtime': 6.5289, 'eval_samples_per_second': 970.448, 'eval_steps_per_second': 15.163, 'epoch': 5.0}


In [28]:
import pickle
# Save the trained model
model.save_pretrained(f"/{MODEL_PATH}/saved_model")
tokenizer.save_pretrained(f"/{MODEL_PATH}/saved_model")
with open(f"/{MODEL_PATH}/label_encoder.pkl", "wb") as f:
    pickle.dump(label_encoder, f)

## Inference mode (CODE TO PUT IN GRADIO)

In [29]:
# Restore
model = AutoModelForSequenceClassification.from_pretrained(f"/{MODEL_PATH}/saved_model")
tokenizer = AutoTokenizer.from_pretrained(f"/{MODEL_PATH}/saved_model")
with open(f"/{MODEL_PATH}/label_encoder.pkl", "rb") as f:
    label_encoder = pickle.load(f)


# Inference function
def predict_text(text):
    tokens = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
    logits = output.logits
    top5_probs, top5_indices = torch.topk(logits, k=5, dim=1)  # Get top 5 classes
    top5_labels = label_encoder.inverse_transform(top5_indices.cpu().numpy().flatten())
    top5_probs = torch.nn.functional.softmax(top5_probs, dim=1).cpu().numpy().flatten()
    res = [(label, float(prob)) for label, prob in zip(top5_labels, top5_probs)]
    res = sorted(res, reverse=True, key=lambda p:p[1])

    lines=[]
    for pair in res:
        label=f"{pair[0]}: {pair[1]*100:.2f}%"
        lines.append(label)
    out="\n".join(lines)
    return out



In [30]:
# Take some data
text=df["abstract"][0]
label=df["balanced_category"][0]
print(text)
print("Expected class:", label)

# Example Prediction
predicted_category = predict_text(text)
print("Predicted Category:", predicted_category)


Comet 162P/Siding Spring is a large Jupiter-family comet with extensive
archival lightcurve data. We report new r-band nucleus lightcurves for this
comet, acquired in 2018, 2021 and 2022. With the addition of these lightcurves,
the phase angles at which the nucleus has been observed range from $0.39^\circ$
to $16.33^\circ$. We absolutely-calibrate the comet lightcurves to r-band
Pan-STARRS 1 magnitudes, and use these lightcurves to create a convex shape
model of the nucleus by convex lightcurve inversion. The best-fitting shape
model for 162P has axis ratios $a/b = 1.56$ and $b/c = 2.33$, sidereal period
$P = 32.864\pm0.001$ h, and a rotation pole oriented towards ecliptic longitude
$\lambda_E = 118^\circ \pm 26^\circ$ and latitude
$\beta_E=-50^\circ\pm21^\circ$. We constrain the possible nucleus elongation to
lie within $1.4 < a/b < 2.0$ and discuss tentative evidence that 162P may have
a bilobed structure. Using the shape model to correct the lightcurves for
rotational effects, we de

# LLM API

## Load model for pre-processing / post-processing (code to put in gradio directly)

In [21]:
import pickle
import json
import requests

with open(f"/tmp/class_names.pkl", 'rb') as f:
    class_names = pickle.load(f)
print(class_names)

['astro-ph.EP' 'physics.plasm-ph' 'math.RA' 'cs.CL' 'math.AP' 'eess.SY'
 'stat.ME' 'math.FA' 'physics.optics' 'eess.IV' 'cs.AI' 'hep-th' 'cs.CV'
 'nucl-th' 'cs' 'astro-ph.HE' 'cs.RO' 'cs.LG' 'astro-ph.SR' 'cs.CY'
 'cs.LO' 'astro-ph.GA' 'physics.app-ph' 'cs.IT' 'cs.SE' 'physics.bio-ph'
 'physics' 'cond-mat.mtrl-sci' 'quant-ph' 'econ.GN' 'physics.acc-ph'
 'math.PR' 'math.AG' 'cond-mat.str-el' 'hep-ph' 'cond-mat.soft' 'gr-qc'
 'q-bio.QM' 'math.OC' 'physics.flu-dyn' 'math.DG' 'cs.HC' 'cs.PL' 'cs.DB'
 'cs.CE' 'cs.NI' 'math.NA' 'cs.GT' 'cs.GR' 'math.AT' 'cs.DC' 'hep-ex'
 'math.NT' 'cs.DL' 'astro-ph.CO' 'eess.AS' 'cond-mat.quant-gas'
 'cond-mat.mes-hall' 'physics.chem-ph' 'q-bio.NC' 'cs.CR' 'math-ph'
 'stat.ML' 'math.GR' 'physics.ao-ph' 'nucl-ex' 'math' 'math.GM' 'q-fin'
 'astro-ph.IM' 'eess.SP' 'math.ST' 'nlin' 'physics.atom-ph' 'cs.DS'
 'cs.SD' 'math.DS' 'math.RT' 'cond-mat.stat-mech' 'stat' 'cs.CC'
 'physics.geo-ph' 'math.CO' 'cs.IR' 'math.LO' 'cond-mat.supr-con'
 'math.GT' 'math.CV' 'q-bi

## Inference (code to put in gradio)

In [32]:

def inference(sample_text):

    # OpenRouter API Endpoint
    API_URL = "https://openrouter.ai/api/v1/chat/completions"
    
    # Choose a free model (Mistral is free)
    MODEL_NAME = "mistralai/mistral-7b-instruct"
    
    
    
    # Define Prompt for ArXiv Category Classification
    prompt = f"""
    You are an expert in scientific paper classification. Given the abstract below, classify it into one of the official arXiv categories. 
    
    Below the paper:
    {sample_text}
    
    Return only a category code among: {class_names}.
    Your answer should be only 20 characters maximum.
    """
    
    # Prepare API Request
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,  # Lower temperature for more deterministic responses
        "max_tokens": 50
    }
    
    # Send Request
    response = requests.post(API_URL, headers=headers, json=payload)
    
    # Exctract the response
    if response.status_code == 200:
        result = response.json()
        llm_output = result["choices"][0]["message"]["content"].strip()
        class_name="other"
        for class_name in class_names:
            if class_name in llm_output:
                break
        out = class_name
    else:
        out = response.text

    return out


## Evaluation

/!\ WARNING: Call it only once !!!!

Don't spam too much my OpenAI account please :(

In [33]:
import time

num_good_answer=0
num_questions=10 

start_time=time.time()
for i in range(num_questions):
    text=df["abstract"][i]
    label=df["balanced_category"][i]
    pred = inference(text)
    if label==pred:
        num_good_answer+=1

enlapsed_time=time.time()-start_time
enlapsed_time_if_tested_like_other_models = (enlapsed_time/num_questions)*(0.1*len(df))

In [35]:
print("Estimated accuracy: ", float(num_good_answer)/num_questions)
print("Estimated inference time for test set:", enlapsed_time_if_tested_like_other_models)

Estimated accuracy:  0.7
Estimated inference time for test set: 6709.236008477212
