In [58]:
# Importing the required libraries. 
import requests
import xml.etree.ElementTree as ET
import csv
import re
import pandas as pd

In this step, we import all the necessary libraries for fetching data from the ArXiv API, processing XML responses, cleaning LaTeX-formatted text, and managing CSV files. The libraries include:
- `requests` for sending HTTP requests to fetch data.
- `xml.etree.ElementTree` for parsing the XML response from the ArXiv API.
- `csv` for saving the filtered papers into a CSV file.
- `re` for regular expression-based text processing to clean LaTeX formatting.
- `pandas` for reading the saved CSV file back in.

In [59]:
# Strip LaTeX markup from a text string

def clean_latex(text):
    # Citations are handled first; the generic command rule below would
    # otherwise consume them before this line could run.
    text = re.sub(r'\\cite{.*?}', '', text)  # Remove citations
    text = re.sub(r'\$.*?\$', '', text)  # Remove inline math
    text = re.sub(r'\\begin{.*?}.*?\\end{.*?}', '', text, flags=re.DOTALL)  # Remove math environments
    text = re.sub(r'\\[a-zA-Z]+{.*?}', '', text)  # Remove commands together with their arguments
    text = re.sub(r'\\[a-zA-Z]+\s*', '', text)  # Remove standalone commands
    text = text.replace('{', '').replace('}', '')  # Remove leftover curly braces
    text = re.sub(r'(Figure|Table) \d+', '', text)  # Remove figure/table references
    return text

#### Cleaning the LaTeX Formatted Text

The `clean_latex()` function strips LaTeX markup from a paper's summary. It removes:
- Inline math expressions (`$...$`) and LaTeX environments (`\begin{...}...\end{...}`), such as equations.
- LaTeX commands such as `\textbf{}` or `\section{}`, together with their arguments, so any text inside the braces is deleted as well.
- Citations (`\cite{}`), leftover curly braces, and references of the form `Figure N` or `Table N`.

The goal is to make the summaries more readable by dropping markup that carries no meaning in plain text.
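
To make the effect of these substitutions concrete, here is a minimal sketch on a hand-written string. The sample sentence, its citation key, and the `unwrap_formatting()` helper are hypothetical and not part of the notebook. The helper shows one possible variant that keeps the text inside formatting commands instead of deleting it.

```python
import re

def unwrap_formatting(text):
    """Hypothetical variant: drop citations and inline math, but keep
    the text inside commands that take an argument."""
    text = re.sub(r'\\cite\{.*?\}', '', text)             # citations first
    text = re.sub(r'\$.*?\$', '', text)                   # inline math
    text = re.sub(r'\\[a-zA-Z]+\{(.*?)\}', r'\1', text)   # \cmd{arg} -> arg
    text = re.sub(r'\\[a-zA-Z]+\s*', '', text)            # bare commands
    return re.sub(r'\s+', ' ', text).strip()              # collapse whitespace

# Hypothetical sample text, written only for this illustration.
sample = r"We apply \textbf{dropout} \cite{srivastava14} with weight decay $\lambda$ on all layers."
cleaned = unwrap_formatting(sample)
print(cleaned)
```

Whether to keep or discard command arguments is a judgment call: keeping them preserves emphasized words, but it would also keep the contents of commands such as `\label{}` unless those are removed by a separate rule.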


In [60]:
# Define the ArXiv API endpoint and query parameters
# Category taxonomy: https://arxiv.org/category_taxonomy

url = "http://export.arxiv.org/api/query"
params = {
    'search_query': 'cat:cs.LG',
    'start': 0,
    'max_results': 230,  # Fetch 230 papers
    'sortBy': 'relevance',
    'sortOrder': 'descending'
}

#### Setting Up the ArXiv API Query

Here, we define the API endpoint and the query parameters. The `params` dictionary contains:
- `search_query`: Restricts results to the `cs.LG` (Machine Learning) category.
- `start`: The index of the first result to return (0, the top of the result list).
- `max_results`: Requests 230 papers so that enough remain after keyword filtering (20 papers in this case).
- `sortBy` and `sortOrder`: Sort papers by relevance in descending order.

With these settings, the API returns the 230 highest-ranked `cs.LG` papers under its relevance ordering.
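
The notebook fetches all 230 results in one request. As an illustration of how `start` and `max_results` interact, the sketch below splits the same query into smaller pages and prints the encoded URLs. The `paged_params()` helper and the page size of 100 are assumptions made for this example, not something the notebook uses.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"

def paged_params(total, page_size):
    """Yield one parameter dict per page (page_size is a hypothetical choice)."""
    for start in range(0, total, page_size):
        yield {
            'search_query': 'cat:cs.LG',
            'start': start,
            # The last page may be shorter than page_size.
            'max_results': min(page_size, total - start),
            'sortBy': 'relevance',
            'sortOrder': 'descending',
        }

pages = list(paged_params(total=230, page_size=100))
for page in pages:
    # urlencode percent-encodes the ':' in 'cat:cs.LG' as %3A.
    print(f"{BASE_URL}?{urlencode(page)}")
```

If the pages were fetched in a loop, a short pause between requests would be a reasonable courtesy to the server.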


In [61]:
# Send request to ArXiv API

response = requests.get(url, params=params)

#### Sending the Request to the ArXiv API

We call `requests.get()` with the endpoint and query parameters defined above. The response body is an Atom XML feed containing metadata about the papers.
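
The request above does not check whether it succeeded. A minimal sketch of a more defensive fetch follows, assuming a 30-second timeout is acceptable. The `fetch_feed()` wrapper, its `get` parameter, and the `_FakeResponse` stub are hypothetical and exist only so the example can run without network access.

```python
def fetch_feed(url, params, get=None, timeout=30):
    """Return the response body as bytes, raising on HTTP errors.

    `get` is a hypothetical injection point for offline use; it
    defaults to requests.get.
    """
    if get is None:
        import requests  # imported lazily so the sketch loads without it
        get = requests.get
    response = get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises an exception on 4xx/5xx responses
    return response.content

class _FakeResponse:
    """Offline stand-in that mimics the two attributes used above."""
    content = b"<feed/>"

    def raise_for_status(self):
        pass

# Offline demonstration with the stub instead of a real HTTP call.
body = fetch_feed("http://export.arxiv.org/api/query", {},
                  get=lambda url, **kwargs: _FakeResponse())
print(body)
```

In the notebook itself, the call would be `fetch_feed(url, params)`, and the returned bytes could be passed to `ET.fromstring()`.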


In [62]:
# Parsing the response

def parse_papers(root):
    papers = []
    for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
        title = entry.find('{http://www.w3.org/2005/Atom}title').text.strip()
        summary = entry.find('{http://www.w3.org/2005/Atom}summary').text.strip()
        link = entry.find('{http://www.w3.org/2005/Atom}id').text.strip()
        papers.append({'title': title, 'summary': summary, 'link': link})
    return papers

root = ET.fromstring(response.content)
papers = parse_papers(root)

#### Parsing the Response

The `parse_papers()` function extracts the title, summary, and link of each paper from the XML response. It loops over the Atom `entry` elements, reads each entry's `title`, `summary`, and `id` (the paper's arXiv URL, stored as `link`), and collects them in a list of dictionaries for further processing.
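
The same extraction can be tried on a small inline feed. The sketch below is an illustrative variant, not the notebook's function: it uses a namespace mapping instead of repeating the full Atom URI, and `findtext()` with a default so that a missing element yields an empty string rather than an `AttributeError`. The sample feed and its identifiers are made up.

```python
import xml.etree.ElementTree as ET

ATOM = {'atom': 'http://www.w3.org/2005/Atom'}

# Hypothetical, hand-written feed: one complete entry, one without a summary.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00001v1</id>
    <title>  A sample title </title>
    <summary>A sample summary.</summary>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/0000.00002v1</id>
    <title>No summary here</title>
  </entry>
</feed>"""

def parse_entries(xml_text):
    """Extract title, summary, and link; missing fields become ''."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall('atom:entry', ATOM):
        papers.append({
            'title': entry.findtext('atom:title', default='', namespaces=ATOM).strip(),
            'summary': entry.findtext('atom:summary', default='', namespaces=ATOM).strip(),
            'link': entry.findtext('atom:id', default='', namespaces=ATOM).strip(),
        })
    return papers

parsed = parse_entries(SAMPLE_FEED)
print(parsed)
```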


In [63]:
keywords = [
    'model selection', 'cross-validation', 'hyperparameter tuning', 'grid search',
    'Bayesian optimization', 'train-test split', 'performance metrics', 'k-fold cross-validation',
    'leave-one-out cross-validation', 'regularization', 'L1 regularization', 'L2 regularization',
    'AUC-ROC', 'hyperparameter optimization', 'early stopping', 'overfitting prevention',
    'bias-variance tradeoff', 'dropout', 'weight decay'
]

#### Defining Filtering Keywords

Here, we define a list of keywords related to model selection and machine learning techniques. These keywords are used to filter the papers' summaries and retain only the ones that discuss topics such as:
- Model selection techniques (e.g., cross-validation, hyperparameter tuning).
- Regularization methods and optimization strategies.


In [64]:
# Function to filter papers by keywords

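def filter_papers_by_keywords(papers, keywords):
    # Lowercase the keywords as well; otherwise mixed-case entries such as
    # 'Bayesian optimization' or 'AUC-ROC' can never match the lowercased summary.
    lowered = [kw.lower() for kw in keywords]
    filtered_papers = [paper for paper in papers if any(kw in paper['summary'].lower() for kw in lowered)]
    return filtered_papers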

filtered_papers = filter_papers_by_keywords(papers, keywords)

#### Filtering Papers by Keywords

The function `filter_papers_by_keywords()` keeps a paper if at least one keyword from the list occurs as a substring of its summary. Only papers whose summaries mention one of the listed topics are retained.
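
Substring tests are simple, but they do not report which keyword triggered a match. As a hedged alternative, the sketch below returns the matching keywords, ignoring case and requiring whole-word matches. The `matched_keywords()` helper and the sample data are hypothetical.

```python
import re

def matched_keywords(summary, keywords):
    """Return the keywords found in summary, ignoring case and
    requiring whole-word matches."""
    found = []
    for kw in keywords:
        # re.escape keeps characters such as '-' literal in the pattern.
        pattern = r'\b' + re.escape(kw) + r'\b'
        if re.search(pattern, summary, flags=re.IGNORECASE):
            found.append(kw)
    return found

# Hypothetical inputs for illustration.
sample_keywords = ['dropout', 'AUC-ROC', 'grid search']
sample_summary = "We report AUC-ROC for models trained with Dropout."
hits = matched_keywords(sample_summary, sample_keywords)
print(hits)
```

The whole-word requirement is a trade-off: it avoids accidental matches inside longer words, but it also skips plural or inflected forms such as "dropouts".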


In [65]:
for paper in filtered_papers:
    paper['cleaned_summary'] = clean_latex(paper['summary'])

#### Cleaning the Summaries of Filtered Papers

For each filtered paper, we run `clean_latex()` on the summary and store the result under the `cleaned_summary` key, leaving the original `summary` untouched.


In [66]:
# Display filtered papers

for idx, paper in enumerate(filtered_papers, 1):
    print(f"Filtered Paper {idx}: {paper['title']}\n")
    print(f"Link: {paper['link']}\n")
    print(f"Original Summary: {paper['summary']}\n")
    print(f"Cleaned Summary: {paper['cleaned_summary']}\n")

Filtered Paper 1: Efficient algorithms for decision tree cross-validation

Link: http://arxiv.org/abs/cs/0110036v1

Original Summary: Cross-validation is a useful and generally applicable technique often
employed in machine learning, including decision tree induction. An important
disadvantage of straightforward implementation of the technique is its
computational overhead. In this paper we show that, for decision trees, the
computational overhead of cross-validation can be reduced significantly by
integrating the cross-validation with the normal decision tree induction
process. We discuss how existing decision tree algorithms can be adapted to
this aim, and provide an analysis of the speedups these adaptations may yield.
The analysis is supported by experimental results.

Cleaned Summary: Cross-validation is a useful and generally applicable technique often
employed in machine learning, including decision tree induction. An important
disadvantage of straightforward implementation of t

#### Displaying Filtered and Cleaned Papers

This step prints the filtered and cleaned papers. For each paper, we display:
- The title
- A link to the paper
- The original summary
- The cleaned summary

This allows us to quickly verify that we have the correct papers and summaries.


In [67]:
def save_to_csv(papers, filename="filtered_papers.csv"):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        fieldnames = ['title', 'link', 'cleaned_summary']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for paper in papers:
            writer.writerow({'title': paper['title'], 'link': paper['link'], 'cleaned_summary': paper['cleaned_summary']})

save_to_csv(filtered_papers)

#### Saving Filtered Papers to a CSV File

After filtering and cleaning the papers, we save the results into a CSV file named `filtered_papers.csv`. Each row contains:
- The title of the paper.
- The link to the paper.
- The cleaned summary.

This ensures that the results can be easily accessed and analyzed later.
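
The write step can be checked without touching the file system. The sketch below is a minimal in-memory round trip, assuming the same three columns. The `papers_to_csv_text()` helper and the sample record are hypothetical. It uses `extrasaction='ignore'` so that dictionaries with extra keys (such as the raw `summary`) can be passed to the writer directly.

```python
import csv
import io

FIELDNAMES = ['title', 'link', 'cleaned_summary']

def papers_to_csv_text(papers):
    """Serialize papers to CSV text, ignoring keys outside FIELDNAMES."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(papers)
    return buffer.getvalue()

# Hypothetical record; the extra 'summary' key is skipped on write.
sample = [{
    'title': 'A sample title',
    'link': 'http://arxiv.org/abs/0000.00001v1',
    'summary': 'raw text',
    'cleaned_summary': 'Line one,\nline two',
}]
csv_text = papers_to_csv_text(sample)

# Read the text back to confirm that commas and line breaks survive.
rows = list(csv.DictReader(io.StringIO(csv_text, newline='')))
print(rows[0]['cleaned_summary'])
```

Commas and line breaks inside a summary are quoted by the `csv` module, so each paper still occupies exactly one record.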


In [68]:
def count_papers(filename="filtered_papers.csv"):
    df = pd.read_csv(filename)
    return len(df)

total_papers = count_papers()
print(f"Total number of filtered papers: {total_papers}")

Total number of filtered papers: 20


#### Counting the Number of Papers in the CSV

In this final step, we read the CSV file back and count its rows. This is a quick check that the expected number of filtered papers (20) was saved.
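
Because summaries contain line breaks, counting text lines in the file would overcount; the records have to be parsed. As a sketch of the same check without pandas, the hypothetical `count_csv_rows()` helper below counts parsed records using the `csv` module, shown on a small temporary file created only for this example.

```python
import csv
import os
import tempfile

def count_csv_rows(filename):
    """Count data rows (excluding the header) using the csv module."""
    with open(filename, newline='', encoding='utf-8') as file:
        return sum(1 for _ in csv.DictReader(file))

# Hypothetical two-row file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['title', 'link', 'cleaned_summary'])
        writer.writerow(['Paper A', 'link-a', 'Summary, with a comma'])
        writer.writerow(['Paper B', 'link-b', 'Multi-line\nsummary'])
    row_count = count_csv_rows(path)

print(row_count)
```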
