# Using paperXai to get your personal arXiv daily digest

This notebook shows how to use the package and walks through the components of its pipeline. The pipeline breaks down into the following steps, each of which has its own section below.

- [Fetching the latest arXiv papers](#fetching-the-latest-arxiv-papers)
- [Embedding predefined user questions and sections](#embedding-predefined-user-questions-and-sections)
- [Semantic retrieval](#semantic-retrieval--generating-an-automatic-report)
- [Sending the personalized newsletter](#sending-the-personalized-newsletter)

For any comments, feel free to reach out directly (see [here](https://sebastianpartarrieu.github.io/)) or open an issue on the GitHub repository.


In [69]:
# initial imports
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import openai

import paperxai.constants as constants
import paperxai.credentials as credentials
from paperxai.llms import OpenAI
from paperxai.papers import Arxiv
from paperxai.report.retriever import ReportRetriever
from paperxai.prompt.base import Prompt
from paperxai.loading import load_config

In [2]:
openai.api_key = credentials.OPENAI_API_KEY

## Fetching the latest arXiv papers

To run this step as a script instead, see `scripts/get_arxiv_papers.py`. **Make sure** to edit the `config.yml` file so that the questions match what you want to learn or track in the latest papers.


In [6]:
#!python ../scripts/get_arxiv_papers.py --max_results 1000

In [None]:
config = load_config("../config.yml")
arxiv = Arxiv()
arxiv.get_papers(categories=config["arxiv-categories"], max_results=1000)
arxiv.write_papers()

In [16]:
df_papers = pd.read_csv(constants.ROOT_DIR + "/data/arxiv/current_papers.csv",
                        parse_dates=["Published Date"])

## Embedding predefined user questions and sections


### Embedding articles

Let's embed the articles we've obtained. We pay little attention to performance here. If you want to run this on a larger dataset, it may be worth batching the embedding calls and saving the embeddings in a dedicated array (a vector database is unnecessary at this scale).
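
As a rough illustration of the batching idea, here is a minimal sketch. It assumes you have some function that embeds a whole list of texts in one call; `embed_in_batches` and its `embed_batch` argument are hypothetical names for this example, not part of paperXai.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed `texts` chunk by chunk instead of one call per text.

    `embed_batch` is a hypothetical callable: it takes a list of strings
    and must return one vector per string, in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_batch(batch)
        # Guard against a backend that silently drops or adds items.
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_batch returned a wrong number of vectors")
        vectors.extend(batch_vectors)
    return vectors
```

To use it with the DataFrame from this notebook, you would pass a plain list such as `df_papers["String_representation"].tolist()`, then stack the result with `np.vstack` and store it with `np.save`, as the cells below do for the unbatched version.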


In [40]:
openai_model = OpenAI(
    chat_model="gpt-3.5-turbo",
    embedding_model="text-embedding-ada-002",
    temperature=0.0,
    max_tokens=1000,
)

ConnectionError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f74f4b14c70>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -3] Temporary failure in name resolution)"))

In [41]:
df_papers["Embeddings"] = df_papers["String_representation"].apply(
    lambda x: openai_model.get_embeddings(text=x)
)
paper_embeddings = df_papers["Embeddings"].values
paper_embeddings = np.vstack(paper_embeddings)
np.save(constants.ROOT_DIR + "/data/arxiv/papers_embeddings.npy", paper_embeddings)

In [45]:
paper_embeddings.shape

(981, 1536)

In [3]:
# Optional: reload previously saved papers and embeddings instead of recomputing them.
# df_papers = pd.read_csv(constants.ROOT_DIR + "/data/arxiv/current_papers_with_embeddings.csv")
# paper_embeddings = np.load(constants.ROOT_DIR + "/data/arxiv/papers_embeddings.npy")

## Semantic retrieval & Generating an automatic report
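
The internals of `ReportRetriever` are not shown in this notebook. A common way to do semantic retrieval over embeddings, and the assumption made in this sketch, is to rank papers by the cosine similarity between a question embedding and each paper embedding. The function names `cosine_similarity` and `top_k_papers` are hypothetical and only serve the illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_papers(
    question_embedding: Sequence[float],
    paper_embeddings: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (row index, similarity) for the k most similar papers."""
    scores = [
        (index, cosine_similarity(question_embedding, embedding))
        for index, embedding in enumerate(paper_embeddings)
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

The returned row indices can be used to look up the matching rows of the papers DataFrame. With the `(981, 1536)` array above, a vectorized NumPy version would be faster, but the ranking logic stays the same.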


In [47]:
prompter = Prompt()
report_retriever = ReportRetriever(
    language_model=openai_model,
    prompter=prompter,
    papers_embedding=paper_embeddings,
    df_papers=df_papers,
)

In [51]:
report = report_retriever.create_report()
report_retriever.print_report()

Getting responses for section: Large Language Model inference optimization
Answering question: What are the latest developments around large language inference and quantization?
Answering question: What are the latest developments around large language inference memory optimization?
Getting responses for section: Large Language Model training optimization
Answering question: What are the latest developments around large language models and distributed training across multiple GPUs?
Getting responses for section: Large Language Model and medicine
Answering question: What are the latest developments around large language models and medicine?
Section: Large Language Model inference optimization

Question: What are the latest developments around large language inference and quantization?
LLM response: The latest developments around large language inference and quantization include exploring the feasibility of quantum natural language processing algorithms on noisy intermediate-scale quantu

In [72]:
report_retriever.write_report(format="html")

HTML is saved to /display/reports/2023-08-04-report.html, open it in your browser to view the report


## Sending the personalized newsletter
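
The package's own mailing code is not shown in this notebook. As a stand-in, here is a minimal sketch that uses only the Python standard library to wrap an HTML report in an email and send it over SMTP. The function names, addresses, host, and port are all hypothetical placeholders, and the sketch assumes your mail provider accepts SMTP over SSL with a username and password.

```python
import smtplib
from email.message import EmailMessage


def build_newsletter(html: str, sender: str, recipient: str, subject: str) -> EmailMessage:
    """Build a multipart email with a plain-text fallback and an HTML body."""
    message = EmailMessage()
    message["Subject"] = subject
    message["From"] = sender
    message["To"] = recipient
    # Plain-text part for mail clients that do not render HTML.
    message.set_content("Your paper digest is attached as HTML.")
    # HTML part; this turns the message into multipart/alternative.
    message.add_alternative(html, subtype="html")
    return message


def send_newsletter(message: EmailMessage, host: str, port: int,
                    username: str, password: str) -> None:
    """Send the message over an SSL-wrapped SMTP connection (assumed setup)."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(username, password)
        server.send_message(message)
```

In practice you would read the HTML file written by `write_report(format="html")`, pass its contents to `build_newsletter`, and call `send_newsletter` with your provider's settings, keeping the password out of the notebook (for example in the credentials module).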
