In [2]:
import pandas as pd

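# Load the full arXiv dataset; the file's unnamed first column (a saved row index)
# is read in as a regular column called 'Unnamed: 0'
df = pd.read_csv('data/arxiv_large.csv')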
df


Unnamed: 0,title,summary,comment,authors,category,split
0,Gamma-Ray Bursts as the Death Throes of Massiv...,It is proposed that gamma-ray bursts are creat...,14 pages,"Ramesh Narayan, Bohdan Paczyński, Tsvi Piran",physics,train
1,Gravitational Lensing and the Variability of G,The four observables associated with gravitati...,13 pages plus figures (not included),"Lawrence Krauss, Martin White",physics,test
2,The Ptolemaic Gamma-Ray Burst Universe,The BATSE experiment on GRO has demonstrated t...,10 pages (Replaced to provide omitted line.),J. I. Katz,physics,train
3,Expanding Photospheres of Type II Supernovae a...,We use the Expanding Photosphere Method to det...,21 pages,"B P Schmidt, R P Kirshner, R G Eastman",physics,val
4,Radiation Transfer in Gamma-Ray Bursts,We have calculated gamma-ray radiative transpo...,24 pages,"B. J. Carrigan, J. I. Katz",physics,train
...,...,...,...,...,...,...
665795,A Truthful Owner-Assisted Scoring Mechanism,Alice (owner) has knowledge of the underlying ...,A (significantly) extended version of arXiv: 2...,Weijie J. Su,computer science,test
665796,"Smooth Calibration, Leaky Forecasts, Finite Re...",We propose to smooth out the calibration score...,http://www.ma.huji.ac.il/hart/publ.html#calib-eq,"Dean P. Foster, Sergiu Hart",economics,train
665797,Forecast Hedging and Calibration,Calibration means that forecasts and average r...,http://www.ma.huji.ac.il/hart/publ.html#calib-int,"Dean P. Foster, Sergiu Hart",economics,train
665798,The Isotonic Mechanism for Exponential Family ...,"In 2023, the International Conference on Machi...",,"Yuling Yan, Weijie J. Su, Jianqing Fan",mathematics,train


In [4]:
# Create a mapping dictionary for categories
category_mapping = {
    'mathematics': 'mathematics',
    'computer science': 'computer science',
    'economics': 'economics',
    'statistics': 'statistics',
    'quantitative biology': 'quantitative biology',
    'quantitative finance': 'quantitative finance',
    'electrical engineering and systems science': 'electrical engineering and systems science'
}

# Apply mapping: the seven categories listed above are kept unchanged; any other
# label (including a missing value) falls back to 'physics'
df['category'] = df['category'].map(lambda x: category_mapping.get(x, 'physics'))
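
Because the dictionary above maps every key to itself, the same collapse can also be written as a membership test. The following is a minimal sketch on toy labels, not the notebook's own data: the `labels` Series, the `kept` set name, and the 'astrophysics' label are assumptions made up for illustration.

```python
import pandas as pd

# Toy labels for illustration only; 'astrophysics' is a hypothetical label,
# not one taken from the dataset.
labels = pd.Series(['mathematics', 'physics', 'astrophysics', 'economics', None])

# The seven categories that the mapping dictionary keeps unchanged.
kept = {
    'mathematics',
    'computer science',
    'economics',
    'statistics',
    'quantitative biology',
    'quantitative finance',
    'electrical engineering and systems science',
}

# Keep a label where it is in `kept`; replace everything else
# (including missing values) with 'physics'.
collapsed = labels.where(labels.isin(kept), 'physics')
print(collapsed.tolist())
```

This form avoids the per-row Python lambda, which may matter on a frame of this size, but the result should match the `map` call above.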


In [5]:
# Count samples per category
category_counts = df['category'].value_counts()

# Find the category with minimum count
min_category = category_counts.idxmin()
min_count = category_counts.min()

print(f"Category with fewest samples: {min_category}")
print(f"Number of samples: {min_count}")


Category with fewest samples: economics
Number of samples: 7787


In [6]:

# Create balanced dataset by taking the first min_count rows of each category.
# Note: head() keeps rows in file order, so this is not a random sample.
balanced_df = pd.concat([
    df[df['category'] == cat].head(min_count)
    for cat in df['category'].unique()
])

# Verify counts are equal
print("\nSamples per category in balanced dataset:")
print(balanced_df['category'].value_counts())



Samples per category in balanced dataset:
category
physics                                       7787
computer science                              7787
mathematics                                   7787
quantitative biology                          7787
quantitative finance                          7787
electrical engineering and systems science    7787
economics                                     7787
statistics                                    7787
Name: count, dtype: int64
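
Taking the first rows of each category keeps whatever ordering the source file has. If a random draw per category is preferred, one possible alternative is `groupby(...).sample(...)`, available in pandas 1.1 and later. The sketch below uses a made-up `toy` frame and an arbitrary `random_state`; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df: 6 physics rows, 4 economics rows.
toy = pd.DataFrame({
    'title': [f'paper {i}' for i in range(10)],
    'category': ['physics'] * 6 + ['economics'] * 4,
})

# Size of the smallest category, as in the cell that computes min_count.
n = toy['category'].value_counts().min()

# Draw n rows at random from each category; random_state makes the draw repeatable.
balanced = toy.groupby('category').sample(n=n, random_state=0)

print(balanced['category'].value_counts())
```

Every category ends up with `n` rows, as with `head(min_count)`, but the rows are no longer tied to their position in the file.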


In [7]:
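# Save balanced dataset to CSV file. index=False omits the DataFrame index;
# the 'Unnamed: 0' column loaded from the source file is still written.
balanced_df.to_csv('./data/arxiv_balanced.csv', index=False)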
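
The saved file still carries the `split` column, and balancing by category does not guarantee that each category keeps a similar train/val/test mix. A quick way to inspect this is a normalized cross-tabulation. The sketch below runs on a made-up `toy` frame, so the printed shares say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as balanced_df.
toy = pd.DataFrame({
    'category': ['physics'] * 4 + ['economics'] * 4,
    'split': ['train', 'train', 'test', 'val',
              'train', 'train', 'train', 'test'],
})

# Share of each split within each category (every row sums to 1).
shares = pd.crosstab(toy['category'], toy['split'], normalize='index')
print(shares)
```

On the real data the equivalent call would be `pd.crosstab(balanced_df['category'], balanced_df['split'], normalize='index')`.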
