In [1]:
%load_ext autoreload
%autoreload 2

import sys; sys.path.insert(0, '..')
from src import config, preprocessing, arxiv_utils

import pandas as pd
import numpy as np
import time

First, load the two parts of the dataset:
- `zotero_raw_data.csv` contains selected papers from my Zotero archive
- `papers_cited.csv` contains the list of papers I have cited when writing my own articles

In [82]:
# Import data
all_df = pd.read_csv(config.path_data_raw / "zotero_raw_data.csv")
cited_df = pd.read_csv(config.path_data_raw / "papers_cited.csv")

# Filter data to only journal articles and keep relevant information
all_only_articles_df = all_df.loc[all_df["Item Type"] == "journalArticle"][
    ["Author", "Title", "Publication Year", "Extra"]
]
# Extract arXiv id from the Extra section
all_only_articles_df["arxiv_id"] = all_only_articles_df.apply(
    lambda x: arxiv_utils.extract_arxiv_identifier(x["Extra"]), axis=1
)
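The helper `arxiv_utils.extract_arxiv_identifier` is not shown in this notebook. As a rough sketch of what such a helper might do, assuming the Zotero `Extra` field contains a fragment such as `arXiv: 1706.03762 [cs]`, one could use a regular expression. The function name `extract_arxiv_id_sketch` and the pattern are illustrative assumptions, not the actual implementation:

```python
import re

# Assumed formats: new-style ids such as 1706.03762 (optionally with a version
# suffix like v2) and old-style ids such as hep-th/9711200.
_ARXIV_ID_RE = re.compile(
    r"arxiv:\s*((?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?)",
    re.IGNORECASE,
)

def extract_arxiv_id_sketch(extra):
    """Return the first arXiv identifier found in an 'Extra' string, or None."""
    if not isinstance(extra, str):  # pandas stores missing values as NaN (a float)
        return None
    match = _ARXIV_ID_RE.search(extra)
    return match.group(1) if match else None

print(extract_arxiv_id_sketch("arXiv: 1706.03762 [cs]"))  # 1706.03762
```

Returning `None` for rows without an identifier leaves room for a fallback lookup by title.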

Match every article from the Zotero library to arXiv and return a standardized output. At roughly 3 s per request this is very slow, so it should only be run once per raw data sample. The output is saved to the `interim` data folder.

In [84]:
n_articles = len(all_only_articles_df)
article_per_pass = 15
library_arxiv_df = None
# Slow: run once per raw data sample, then reload the saved CSV in the next cell
for nmin in range(0, n_articles, article_per_pass):
    # iloc excludes the end index, so cap at n_articles to keep the last row
    nmax = min(nmin + article_per_pass, n_articles)
    df_tmp = arxiv_utils.get_all_article_by_id_or_title(all_only_articles_df.iloc[nmin:nmax])
    library_arxiv_df = df_tmp if library_arxiv_df is None else pd.concat([library_arxiv_df, df_tmp])
    # Checkpoint after every pass so an interrupted run keeps its progress
    library_arxiv_df.to_csv(config.path_data_interim / "zotero_arxiv_data.csv", index=False)
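Because `iloc` slices exclude their end index, chunk bounds are easy to get off by one. A small generator makes the intended bounds explicit and can be checked without any network request. The name `batch_bounds` is a hypothetical helper used only for illustration:

```python
def batch_bounds(n_items, batch_size):
    """Yield (start, stop) pairs covering range(n_items) in chunks of batch_size.

    stop is exclusive, matching the convention of DataFrame.iloc[start:stop].
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

# 32 items in batches of 15: the last batch is shorter but still reaches the final row
print(list(batch_bounds(32, 15)))  # [(0, 15), (15, 30), (30, 32)]
```

When the number of items is an exact multiple of the batch size, no empty trailing batch is produced.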

Loading the arXiv entries matching the library

In [85]:
library_arxiv_df = pd.read_csv(config.path_data_interim / "zotero_arxiv_data.csv")

In [86]:
# Flag citation status and library membership
cited_titles = list(cited_df["Title"])
library_arxiv_df["is_cited_by_my_papers"] = library_arxiv_df["title"].isin(cited_titles)
library_arxiv_df["is_in_library"] = True
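Matching on the exact title string can miss papers whose titles differ only in case, punctuation, or spacing between Zotero and arXiv. A possible mitigation, sketched here under the assumption that such differences occur in the data (the helper `normalize_title` is hypothetical), is to compare normalized titles instead:

```python
import re
import unicodedata

def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", str(title))
    # Drop combining marks so that accented letters compare equal to plain ones
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace every run of non-alphanumeric characters with a single space
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()

cited = {normalize_title(t) for t in ["Black Holes and Entropy"]}
print(normalize_title("Black holes  and entropy.") in cited)  # True
```

With pandas this could be applied as `library_arxiv_df["title"].map(normalize_title).isin(cited)`, where `cited` is built from `cited_df["Title"]` in the same way.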

In [87]:
# Check for missing values in the fields of interest
# info() prints its report and returns None, so wrapping it in display() is unnecessary
library_arxiv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 667 entries, 0 to 666
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id                     667 non-null    object
 1   title                  667 non-null    object
 2   authors                667 non-null    object
 3   primary_category       667 non-null    object
 4   categories             667 non-null    object
 5   summary                667 non-null    object
 6   published              667 non-null    object
 7   doi                    667 non-null    object
 8   is_cited_by_my_papers  667 non-null    bool  
 9   is_in_library          667 non-null    bool  
dtypes: bool(2), object(8)
memory usage: 43.1+ KB

So far I have extracted from my Zotero library the papers I was interested in (papers I cited, papers on TODO lists, etc.). Now I want a random sample of papers to pad the dataset and get a less skewed distribution. While not a perfect method, I draw random arXiv papers (`Npapers` in total, set in the next cell) from my usual categories (hep-th, gr-qc, cond-mat.str-el) with random years, 25 papers per query. Papers already in my collection are removed when the datasets are merged.

In [88]:
Npapers = 5000
n_per_sample = 25
n_requests = Npapers // n_per_sample
print(f"Estimated time to run: {3 * n_requests / 60} min.")
random_samples = [arxiv_utils.get_random_articles(n_articles=n_per_sample) for i in range(n_requests)]
random_samples_dfs = [arxiv_utils.build_arxiv_df(entries) for entries in random_samples]

Estimated time to run: 10.0 min.
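The helper `arxiv_utils.get_random_articles` is not shown here. One way such a random query might be assembled against the public arXiv API is sketched below. The function name, the default year range, and the choice to only build the URL (without sending the request) are assumptions made for illustration:

```python
import random
from urllib.parse import urlencode

CATEGORIES = ["hep-th", "gr-qc", "cond-mat.str-el"]  # categories named in the text
ARXIV_API = "http://export.arxiv.org/api/query"

def build_random_query_url(n_articles=25, first_year=2000, last_year=2020, rng=random):
    """Build an arXiv API query URL for one random category and one random year.

    The year range defaults are assumed values, not taken from the notebook.
    """
    category = rng.choice(CATEGORIES)
    year = rng.randint(first_year, last_year)
    # submittedDate takes a [YYYYMMDDHHMM TO YYYYMMDDHHMM] range
    search = f"cat:{category} AND submittedDate:[{year}01010000 TO {year}12312359]"
    params = {"search_query": search, "start": 0, "max_results": n_articles}
    return f"{ARXIV_API}?{urlencode(params)}"

print(build_random_query_url(rng=random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` makes the sampled categories and years reproducible between runs.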


Now we can build the full dataset containing the papers from my library and the random samples. We first concatenate all entries and remove duplicates by arXiv id.

In [105]:
random_arxiv_df = pd.concat(random_samples_dfs, ignore_index=True)
random_arxiv_df["is_cited_by_my_papers"] = False
random_arxiv_df["is_in_library"] = False

# Find ids that appear in both the library and the random sample
dup_ids = list(set(library_arxiv_df["id"]).intersection(set(random_arxiv_df["id"])))
print(f"Number of random papers already in the library to drop: {len(dup_ids)}.")

# Use ~ (not unary minus) to negate the boolean mask
full_df = pd.concat([library_arxiv_df, random_arxiv_df.loc[~random_arxiv_df["id"].isin(dup_ids)]], ignore_index=True)
full_df_unique = full_df.drop_duplicates(subset="id", keep="first")
print(f"Number of duplicates dropped: {len(full_df) - len(full_df_unique)}.")

# Save to file
full_df_unique.to_csv(config.path_data_merged, index=False)

Number of random papers already in the library to drop: 4.
Number of duplicates dropped: 17.
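The two counts measure different things: the first is the overlap between the random sample and the library, the second is the number of repeats within the random sample itself. A toy example with made-up ids (not data from this project) illustrates the two steps:

```python
import pandas as pd

library = pd.DataFrame({"id": ["a", "b"], "is_in_library": True})
sample = pd.DataFrame({"id": ["b", "c", "c"], "is_in_library": False})

# Anti-join: keep only sampled rows whose id is not already in the library
new_rows = sample.loc[~sample["id"].isin(library["id"])]
merged = pd.concat([library, new_rows], ignore_index=True)

# Drop repeats within the sample itself, keeping the first occurrence
merged = merged.drop_duplicates(subset="id", keep="first")
print(merged["id"].tolist())  # ['a', 'b', 'c']
```

Here `b` is removed by the anti-join, so its library flags are preserved, and the second `c` is removed by `drop_duplicates`.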
