# Using paperXai to get your personal arXiv daily digest

The goal of this notebook is to help you use this package and understand the different components of the pipeline. The pipeline breaks down into the following steps, each covered by one section of this notebook.

- [Fetching the latest arXiv papers](#fetching-the-latest-arxiv-papers)
- [Embedding predefined user questions and sections](#embedding-predefined-user-questions-and-sections)
- [Semantic retrieval](#semantic-retrieval)
- [Generating an automatic report](#generating-an-automatic-report)
- [Sending the personalized newsletter](#sending-the-personalized-newsletter)

For comments, feel free to reach out directly (see [here](https://sebastianpartarrieu.github.io/)) or open an issue on the GitHub repository.

In [1]:
# initial imports
%load_ext autoreload
%autoreload 2

from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import openai

import paperxai.constants as constants
import paperxai.credentials as credentials
from paperxai.llms import OpenAI

In [2]:
openai.api_key = credentials.OPENAI_API_KEY

## Fetching the latest arXiv papers

See the script `scripts/get_arxiv_papers.py`, which fetches the documents using the arXiv API.
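
The script itself is not reproduced in this notebook. As a rough illustration of what a fetch against the arXiv API involves, here is a minimal sketch using only the standard library. The helper names `build_query_url` and `parse_feed`, the chosen output fields, and the sample feed are assumptions made for this example, not the package's actual implementation.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category, max_results=10):
    """Build an arXiv API URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API_URL}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml):
    """Extract a few fields from each <entry> of an Atom feed."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        url = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        papers.append(
            {
                # Titles can span several lines in the feed: collapse whitespace.
                "Title": " ".join(title.split()),
                "URL": url,
                "Paper ID": url.rsplit("/", 1)[-1],
                "Published Date": entry.findtext(
                    "atom:published", default="", namespaces=ATOM_NS
                ),
            }
        )
    return papers


# Made-up feed with placeholder values, so the sketch runs without network access.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2000-01-01T00:00:00Z</published>
    <title>A placeholder
      title</title>
  </entry>
</feed>"""

url = build_query_url("cs.CL", max_results=5)
papers = parse_feed(SAMPLE_FEED)
print(url)
print(papers)
```

For a live query, the response body could be obtained with `urllib.request.urlopen(url).read()` and passed to `parse_feed`.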

In [2]:
#!python ../scripts/get_arxiv_papers.py --max_results 1000

In [6]:
df_articles = pd.read_csv(constants.ROOT_DIR + "/data/arxiv/base_papers.csv")
df_new_articles = pd.read_csv(constants.ROOT_DIR + "/data/arxiv/current_papers.csv")

## Embedding predefined user questions and sections

### Embedding articles

Let's embed the articles we've obtained. We pay little attention to performance here. If you want to run this on a larger dataset, it may be worth batching calls and saving the embeddings in a dedicated array (a vector database is not needed at this scale).
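
The batching idea mentioned above can be sketched as follows. This is only an illustration: `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake function stands in for a real embedding call that accepts a list of texts and returns one vector per text.

```python
import numpy as np


def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed `texts` chunk by chunk and stack the results into one 2-D array."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_batch(batch), dtype=np.float32))
    return np.vstack(chunks)


def fake_embed_batch(batch):
    """Stand-in for a real embedding call: returns one 4-d vector per text."""
    return [[len(text), text.count(" "), 0.0, 1.0] for text in batch]


texts = [f"Title: paper {i}" for i in range(10)]
embeddings = embed_in_batches(texts, fake_embed_batch, batch_size=4)
print(embeddings.shape)  # (10, 4)
```

Each batch costs one call instead of one call per article, and the result is already a single array that can be saved directly.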

In [10]:
openai_model = OpenAI(
    chat_model="gpt-3.5-turbo",
    embedding_model="text-embedding-ada-002",
    temperature=0.0,
    max_tokens=1000,
)

ConnectionError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f74f4b14c70>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -3] Temporary failure in name resolution)"))

In [None]:
# one embedding call per article (unbatched)
df_new_articles["Embeddings"] = df_new_articles["String_representation"].apply(
    openai_model.get_embeddings
)

In [27]:
# df_new_articles.to_csv(
#     constants.ROOT_DIR + "/data/arxiv/current_papers_with_embeddings.csv", index=False
# )

In [36]:
article_embeddings = np.vstack(df_new_articles["Embeddings"].values)
np.save(constants.ROOT_DIR + "/data/arxiv/article_embeddings.npy", article_embeddings)
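
Keeping the embeddings in a separate `.npy` file preserves them as a numeric array. Writing them into a CSV column would typically store each vector as text instead, and NumPy's printed form abbreviates long arrays with `...`. A minimal round-trip sketch, using made-up data and a temporary directory rather than the project's data folder:

```python
import os
import tempfile

import numpy as np

# Made-up embeddings: 5 articles, 1536 dimensions each.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(5, 1536)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "article_embeddings.npy")
    np.save(path, demo_embeddings)
    reloaded = np.load(path)

# The binary round trip is exact.
print(np.array_equal(demo_embeddings, reloaded))  # True

# The printed form of a long vector is abbreviated, so it cannot be parsed back.
as_text = str(demo_embeddings[0])
print("..." in as_text)  # True
```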

## Semantic retrieval

In [8]:
from paperxai.report.retriever import ReportRetriever

In [None]:
# assumption: the retriever takes the language model defined earlier in this notebook
report = ReportRetriever(language_model=openai_model)
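
The internals of `ReportRetriever` are not shown in this notebook. As a generic illustration of semantic retrieval over embeddings, here is a minimal cosine-similarity sketch. The function `top_k_by_cosine` and the toy 2-d vectors are assumptions made for this example and may differ from what the package does.

```python
import numpy as np


def top_k_by_cosine(query, doc_embeddings, k=3):
    """Return indices and scores of the k documents most similar to `query`."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = docs @ q  # cosine similarity of each document with the query
    top = np.argsort(scores)[::-1][:k]  # highest scores first
    return top, scores[top]


# Toy 2-d "embeddings"; real ones would come from the embedding model.
doc_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])

indices, scores = top_k_by_cosine(query, doc_embeddings, k=2)
print(indices)  # [0 2]
```

In the pipeline, the query would be the embedding of a user question and the rows would be the article embeddings computed earlier.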

In [7]:
df_new_articles

| Unnamed: 0 | Title | URL | Abstract | Authors | Published Date | Category | Paper ID | String_representation | Embeddings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Benchmarking and Analyzing Generative Data for... | http://arxiv.org/abs/2307.13697v1 | Advancements in large pre-trained generative m... | Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee... | 2023-07-25T17:59:59Z | cs.CV | 2307.13697v1 | Title: Benchmarking and Analyzing Generative D... | [-0.03855739 0.01224458 -0.00347183 ... -0.01... |
| 1 | Evaluating Large Language Models for Radiology... | http://arxiv.org/abs/2307.13693v1 | The rise of large language models (LLMs) has m... | Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yuto... | 2023-07-25T17:57:18Z | cs.CL | 2307.13693v1 | Title: Evaluating Large Language Models for Ra... | [-0.01048704 0.03709306 0.02674473 ... -0.00... |
| 2 | ARB: Advanced Reasoning Benchmark for Large La... | http://arxiv.org/abs/2307.13692v1 | Large Language Models (LLMs) have demonstrated... | Tomohiro Sawada, Daniel Paleka, Alexander Havr... | 2023-07-25T17:55:19Z | cs.CL | 2307.13692v1 | Title: ARB: Advanced Reasoning Benchmark for L... | [ 0.0065059 0.0015553 -0.00759198 ... -0.03... |
| 3 | The Visual Language of Fabrics | http://arxiv.org/abs/2307.13681v1 | We introduce text2fabric, a novel dataset that... | Valentin Deschaintre, Julia Guerrero-Viu, Dieg... | 2023-07-25T17:39:39Z | cs.GR | 2307.13681v1 | Title: The Visual Language of Fabrics\nAbstrac... | [-0.01458119 0.01483235 0.00258485 ... -0.01... |
| 4 | High Probability Analysis for Non-Convex Stoch... | http://arxiv.org/abs/2307.13680v1 | Gradient clipping is a commonly used technique... | Shaojie Li, Yong Liu | 2023-07-25T17:36:56Z | cs.LG | 2307.13680v1 | Title: High Probability Analysis for Non-Conve... | [-0.00973936 -0.01733373 0.02210196 ... -0.01... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 700 | Technical Challenges of Deploying Reinforcemen... | http://arxiv.org/abs/2307.11105v1 | Going from research to production, especially ... | Jonas Gillberg, Joakim Bergdahl, Alessandro Se... | 2023-07-19T18:19:23Z | cs.SE | 2307.11105v1 | Title: Technical Challenges of Deploying Reinf... | [ 0.00047117 -0.0138482 0.01550069 ... -0.04... |
| 701 | TokenFlow: Consistent Diffusion Features for C... | http://arxiv.org/abs/2307.10373v2 | The generative AI revolution has recently expa... | Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali D... | 2023-07-19T18:00:03Z | cs.CV | 2307.10373v2 | Title: TokenFlow: Consistent Diffusion Feature... | [-0.02155961 -0.00051334 -0.02008521 ... -0.01... |
| 702 | A Decision Making Framework for Recommended Ma... | http://arxiv.org/abs/2307.10085v2 | With the rapid development of global road tran... | Haoyu Sun, Yan Yan | 2023-07-19T15:55:25Z | cs.AI | 2307.10085v2 | Title: A Decision Making Framework for Recomme... | [ 0.0290125 0.00111448 -0.00454689 ... -0.01... |
| 703 | An Empirical Study on Fertility Proposals Usin... | http://arxiv.org/abs/2307.10025v2 | Fertility issues are closely related to popula... | Yulin Zhou | 2023-07-19T15:09:50Z | cs.HC | 2307.10025v2 | Title: An Empirical Study on Fertility Proposa... | [ 0.00313848 -0.0060077 0.0023272 ... -0.00... |


## Generating an automatic report

## Sending the personalized newsletter