In [1]:
import pandas as pd
import re
from ftfy import fix_text


In [2]:
df = pd.read_csv("results.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  100 non-null    int64 
 1   id          100 non-null    object
 2   title       100 non-null    object
 3   abstract    100 non-null    object
 4   published   100 non-null    object
 5   pdf_url     100 non-null    object
 6   full_text   99 non-null     object
dtypes: int64(1), object(6)
memory usage: 5.6+ KB


In [4]:
def remove_abstract(text):
    """Drop the front matter (title, authors, abstract) that precedes the first numbered section."""
    # Match an "Abstract" heading, optionally followed by a colon, spaces, or a newline.
    # Case-insensitivity comes from re.IGNORECASE, so no inline (?i) flag is needed.
    abstract_heading = re.compile(r'\babstract\b[:\s]*\n?', re.IGNORECASE)

    # Remove only the first occurrence of the heading
    text = abstract_heading.sub('', text, count=1)

    # Remove everything from the start of the text up to the first line that looks like
    # a numbered section heading ("1 Introduction", "2. Methods", ...).
    # If no such heading is found, the text is left unchanged.
    text = re.sub(r'(?s)^(.*?)(?=\n\s*(?:1\s*Introduction|\d+\.\s*\w+))', '', text)

    return text.strip()
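
The trimming step relies on a lazy match plus a lookahead, which is easier to see on a toy string. The sketch below reuses the same pattern on an invented sample (the sample text and the `SECTION_START` name are assumptions for illustration, not data from results.csv). Note that the `\d+\.\s*\w+` branch would also fire on a line such as "3.5 percent", so this is a heuristic rather than a reliable section detector.

```python
import re

# Same lookahead as in remove_abstract: stop just before a line that looks
# like "1 Introduction" or "2. Methods". SECTION_START is a name made up here.
SECTION_START = re.compile(r'(?s)^(.*?)(?=\n\s*(?:1\s*Introduction|\d+\.\s*\w+))')

# Invented sample resembling extracted PDF text
sample = (
    "A Title\n"
    "Abstract: We study things.\n"
    "1 Introduction\n"
    "Body text.\n"
)

# Everything before the first numbered heading is dropped
trimmed = SECTION_START.sub('', sample).strip()
print(trimmed)

# With no numbered heading, the pattern does not match and nothing is removed
no_heading = "Just a paragraph with no numbered heading."
unchanged = SECTION_START.sub('', no_heading).strip()
print(unchanged)
```

Because the text passes through unchanged when no heading matches, papers with unnumbered sections keep their title and abstract in the cleaned output.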

In [5]:
def clean_text(text):
    # Remove hyphenation from words split across lines
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)
    
    # Replace newlines within paragraphs with a space
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    
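    # Collapse every run of whitespace, including the blank lines between paragraphs,
    # into a single space, so the result is one continuous line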
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
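
Since `\s+` also matches blank lines, clean_text returns a single line with no paragraph boundaries. If paragraph breaks are needed downstream, one possible variant is sketched below. The function name `clean_text_keep_paragraphs` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def clean_text_keep_paragraphs(text):
    # Hypothetical variant of clean_text that keeps paragraph boundaries.
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)
    # Split on blank lines so each paragraph is handled separately.
    paragraphs = re.split(r'\n\s*\n', text)
    # Collapse whitespace inside each paragraph only.
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    # Drop empty paragraphs and rejoin with a single blank line.
    return '\n\n'.join(p for p in cleaned if p)

# Invented sample with a hyphenated word, a double space, and a paragraph break
sample = "The construc-\ntion uses  unit\nballs.\n\nA second\nparagraph."
result = clean_text_keep_paragraphs(sample)
print(result)
```

Whether to keep paragraph breaks depends on the downstream use: a single-line string is convenient for token-level processing, while chunking by paragraph needs the boundaries.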

In [6]:
def remove_headers_footers(text):
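    # Remove standalone numbers (page numbers). The pattern is not anchored to the start
    # of a line, so it also strips digits that end a line of regular text.
    text = re.sub(r'\n?\s*\d+\s*\n', '\n', text)
    # Remove lines that start with an arXiv identifier
    text = re.sub(r'^\s*arXiv:.*\n', '', text, flags=re.MULTILINE)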
    return text
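
The page-number pattern `\n?\s*\d+\s*\n` matches any digits followed by a newline, so a number that ends a sentence line is removed as well. The sketch below compares it with a stricter, line-anchored alternative on an invented sample. `remove_page_numbers` is a hypothetical helper, not the notebook's function, and it would still delete a superscript or formula fragment that the PDF extractor placed on a line of its own.

```python
import re

def remove_page_numbers(text):
    # Hypothetical stricter variant: drop a line only when it consists of
    # digits and optional spaces or tabs, and nothing else.
    return re.sub(r'^[ \t]*\d+[ \t]*\n', '', text, flags=re.MULTILINE)

# Invented sample: a sentence ending in a number, then a page number line
sample = "see Theorem 2\n17\nnext page\n"

# Pattern used in remove_headers_footers: also eats the trailing "2"
loose = re.sub(r'\n?\s*\d+\s*\n', '\n', sample)

# Line-anchored variant: only the standalone "17" line is removed
strict = remove_page_numbers(sample)

print(repr(loose))
print(repr(strict))
```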


In [7]:
def fix_encoding(text):
    return fix_text(text)

In [8]:
def preprocess_pdf_text(text):
    text = remove_abstract(text)
    text = fix_encoding(text)
    text = remove_headers_footers(text)
    text = clean_text(text)
    
    return text

In [9]:
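# Drop the one row whose full_text is missing (99 of 100 non-null in df.info() above);
# copy so that adding a column does not trigger a chained-assignment warning
df = df.dropna(subset=['full_text']).copy()
df['clean'] = df['full_text'].apply(preprocess_pdf_text)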

In [10]:
for _, row in df.head(10).iterrows():
    print(f"Paper ID: {row['id']}\n")  # Print the paper ID
    print(f"Abstract: {row['abstract']}\n")
    # print(f"Full Text:\n{row['full_text']}\n")  # Print the full text
    # print("-" * 80)  # Separator for readability
    print(f"Clean Text: \n{row['clean']}\n")
    print("-" * 100)  # Separator for readability

Paper ID: 9907025v1

Abstract: We provide a lower bound construction showing that the union of unit balls in
three-dimensional space has quadratic complexity, even if they all contain the
origin. This settles a conjecture of Sharir.

Clean Text: 
1 Introduction The union of a set of n balls in R 3 has quadratic complexity Θ(n ), even if they all have the same radius. All the already known constructions have balls scattered around, however, and Sharir posed the problem whether a quadratic complexity could be achieved if all the balls (of same radius) contained the origin. In this note, we show a construction of n unit balls, all containing the origin, whose union has complexity Θ(n ). As a trivial observation, we observe that the centers are arbitrarily close to the origin in our construction. In fact, if the centers are forced to be at least pairwise ε apart, for some constant ε > 0, then no more than O( ε 3 ) can meet in a single point, and hence the union has complexity at most O( ε 