# Using paperXai to get your personal arXiv daily digest

This notebook shows how to use the package and explains the components of its pipeline. The pipeline breaks down into the following steps, each covered by a section of this notebook.

- [Fetching the latest arXiv papers](#fetching-the-latest-arxiv-papers)
- [Embedding predefined user questions and sections](#embedding-predefined-user-questions-and-sections)
- [Semantic retrieval](#semantic-retrieval--generating-an-automatic-report)
- [Sending the personalized newsletter](#sending-the-personalized-newsletter)

For any comments, feel free to reach out directly (see [here](https://sebastianpartarrieu.github.io/)) or open an issue on the GitHub repository.


In [1]:
# initial imports
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import openai

import paperxai.constants as constants
import paperxai.credentials as credentials
from paperxai.llms import OpenAI
from paperxai.papers import Arxiv
from paperxai.report.retriever import ReportRetriever
from paperxai.prompt.base import Prompt
from paperxai.loading import load_config

In [2]:
openai.api_key = credentials.OPENAI_API_KEY

## Fetching the latest arXiv papers

See `scripts/get_arxiv_papers.py` if you would rather run this step as a script. **Make sure** to edit the `config.yml` file so that the questions reflect what you want to learn or track in the latest papers.


In [4]:
#!python ../scripts/get_arxiv_papers.py --max_results 1000

In [None]:
config = load_config("../config.yml")
arxiv = Arxiv()
arxiv.get_papers(categories=config["arxiv-categories"], max_results=1000)
arxiv.write_papers()

In [5]:
df_articles = pd.read_csv(constants.ROOT_DIR + "/data/arxiv/base_papers.csv")
df_new_articles = pd.read_csv(constants.ROOT_DIR + "/data/arxiv/current_papers.csv")

## Embedding predefined user questions and sections


### Embedding articles

Let's embed the articles we've obtained. We pay little attention to performance here. If you want to run this on a larger dataset, it may be worth batching calls and saving the embeddings in a dedicated array (a vector database is unnecessary at this scale).
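As a rough illustration of the batching idea mentioned above, here is a minimal sketch. It assumes a hypothetical `embed_batch` callable that takes a list of strings and returns one vector per string. This notebook only calls `openai_model.get_embeddings` on a single string, so whether that method accepts a list is not shown here. `fake_embed_batch` is a stand-in so the example runs offline.

```python
from typing import Callable, List, Sequence

import numpy as np


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> np.ndarray:
    """Embed texts in fixed-size batches and stack the results into one array."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # One call per batch instead of one call per text.
        chunks.append(np.asarray(embed_batch(batch), dtype=np.float32))
    if not chunks:
        return np.empty((0, 0), dtype=np.float32)
    return np.vstack(chunks)


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a real embedding call: [characters, words]."""
    return [[float(len(text)), float(len(text.split()))] for text in batch]


demo_embeddings = embed_in_batches(
    ["a b", "cde", "f g h", "ij", "k"], fake_embed_batch, batch_size=2
)
print(demo_embeddings.shape)
```

The resulting array can be written with `np.save`, as the cells below do for the per-article embeddings.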


In [4]:
openai_model = OpenAI(
    chat_model="gpt-3.5-turbo",
    embedding_model="text-embedding-ada-002",
    temperature=0.0,
    max_tokens=1000,
)

In [11]:
df_new_articles["Embeddings"] = df_new_articles["String_representation"].apply(
    openai_model.get_embeddings
)

df_new_articles.to_csv(
    constants.ROOT_DIR + "/data/arxiv/current_papers_with_embeddings.csv", index=False
)

article_embeddings = np.vstack(df_new_articles["Embeddings"].values)
np.save(constants.ROOT_DIR + "/data/arxiv/article_embeddings.npy", article_embeddings)

In [3]:
df_new_articles = pd.read_csv(constants.ROOT_DIR + "/data/arxiv/current_papers_with_embeddings.csv")
article_embeddings = np.load(constants.ROOT_DIR + "/data/arxiv/article_embeddings.npy")
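Because the papers table and the embedding matrix are stored in two separate files, it can be worth checking that they still line up row for row after reloading. The helper below is a small sketch of such a check; `check_alignment` is a hypothetical name that is not part of paperXai, and the toy data only mimics the shape of the real files.

```python
import numpy as np
import pandas as pd


def check_alignment(df_papers: pd.DataFrame, embeddings: np.ndarray) -> None:
    """Raise ValueError if the embedding matrix does not match the papers table."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embeddings.ndim} dimension(s)")
    if len(df_papers) != embeddings.shape[0]:
        raise ValueError(
            f"{len(df_papers)} papers but {embeddings.shape[0]} embedding rows"
        )


# Toy data standing in for the reloaded CSV and .npy files.
toy_papers = pd.DataFrame({"String_representation": ["paper one", "paper two"]})
toy_embeddings = np.zeros((2, 4))
check_alignment(toy_papers, toy_embeddings)  # passes silently
```

In the notebook this would be called as `check_alignment(df_new_articles, article_embeddings)` right after loading.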

## Semantic retrieval & Generating an automatic report


In [5]:
prompter = Prompt()
report_retriever = ReportRetriever(
    language_model=openai_model,
    prompter=prompter,
    papers_embedding=article_embeddings,
    df_papers=df_new_articles,
)
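The internals of `ReportRetriever` are not shown in this notebook. As an illustration of what semantic retrieval typically involves, the sketch below ranks paper embeddings by cosine similarity to a question embedding and keeps the top `k`. The function name, the choice of cosine similarity, and the toy vectors are all assumptions made for this example, not the package's confirmed behaviour.

```python
import numpy as np


def top_k_by_cosine(query_embedding, paper_embeddings, k=3):
    """Return indices and scores of the k papers most similar to the query."""
    query = np.asarray(query_embedding, dtype=float)
    papers = np.asarray(paper_embeddings, dtype=float)
    # Normalise so that the dot product equals the cosine similarity.
    query_unit = query / np.linalg.norm(query)
    papers_unit = papers / np.linalg.norm(papers, axis=1, keepdims=True)
    similarities = papers_unit @ query_unit
    k = min(k, len(similarities))
    order = np.argsort(-similarities)[:k]  # highest similarity first
    return order, similarities[order]


# Toy 2-D "embeddings": three papers and one question.
toy_paper_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_question_embedding = np.array([1.0, 0.1])
indices, scores = top_k_by_cosine(toy_question_embedding, toy_paper_embeddings, k=2)
print(indices, scores)
```

The retrieved rows would then be looked up in the papers table (for example with `df_new_articles.iloc[indices]`) and passed to the chat model as context.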

In [6]:
report_retriever.create_report()

Getting responses for section: Large Language Model inference optimization
Answering question: What are the latest developments around large language inference and quantization?
Answering question: What are the latest developments around large language inference and Mixture of Experts?
Answering question: What are the latest developments around large language inference memory optimization?
Getting responses for section: Large Language Model training optimization
Answering question: What are the latest developments around large language models and distributed training across multiple GPUs?
Answering question: What are the latest developments around large language models and the tradeoff between model size and number of training tokens?


{'Large Language Model inference optimization': {'questions': ['What are the latest developments around large language inference and quantization?',
   'What are the latest developments around large language inference and Mixture of Experts?',
   'What are the latest developments around large language inference memory optimization?'],
  'chat_responses': ['The latest developments around large language inference and quantization include the introduction of the QuIP method, which utilizes incoherence processing to improve quantization algorithms for large language models (Chee, 2023), the proposal of multilevel large language models that unify generic and specific models to improve performance based on user input and internet information (Gong, 2023), and the development of myQASR, a personalized mixed-precision quantization method for automatic speech recognition models that tailors quantization schemes for diverse users under any memory requirement (Fish, 2023).',
   'The latest develo

In [9]:
report_retriever.print_report()

Section: Large Language Model inference optimization

Question: What are the latest developments around large language inference and quantization?
LLM response: The latest developments around large language inference and quantization include the introduction of the QuIP method, which utilizes incoherence processing to improve quantization algorithms for large language models (Chee, 2023), the proposal of multilevel large language models that unify generic and specific models to improve performance based on user input and internet information (Gong, 2023), and the development of myQASR, a personalized mixed-precision quantization method for automatic speech recognition models that tailors quantization schemes for diverse users under any memory requirement (Fish, 2023).
Question: What are the latest developments around large language inference and Mixture of Experts?
LLM response: The latest developments around large language inference and Mixture of Experts include the unification of ge

In [21]:
## HTML newsletter template to format the report with (i) the sections, questions and answers and (ii) all the papers retrieved for each question
HTML_newsletter_template = """
<html>
<head>
<style>
body {
    font-family: Arial, Helvetica, sans-serif;
    font-size: 14px;
    line-height: 1.5;
    color: #333333;
    background-color: #ffffff;
    margin: 0;
    padding: 0;
    -webkit-text-size-adjust: none;
    -ms-text-size-adjust: none;
}
table {
    border-collapse: collapse;
    border-spacing: 0;
}
td {
    padding: 0;
}
img {
    border: 0;
    -ms-interpolation-mode: bicubic;
}
a {
    color: #ee6a56;
    text-decoration: underline;
}
h1 {
    font-family: Arial, Helvetica, sans-serif;
    font-size: 28px;
    line-height: 1.2;
    color: #333333;
    font-weight: bold;
    margin-top: 0;
    margin-bottom: 0;
    text-align: center;
}
</style>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
    <tr>
        <td align="center" bgcolor="#ffffff" style="padding: 40px 0 30px 0;">
            <img src="https://images.unsplash.com/photo-1580584126903-c17d41830450?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1639&q=80" width="300" height="230" style="display: block;" />
        </td>
    </tr>
    <tr>
        <td bgcolor="#ffffff" style="padding: 40px 30px 40px 30px;">
            <table border="0" cellpadding="0" cellspacing="0" width="100%">
                <tr>
                    <td style="padding: 20px 0 30px 0;">
                        <h1>PaperXAI Report</h1>
                    </td>
                </tr>
                <tr>
                    <td style="padding: 0 0 20px 0;">
                        <h2>Introduction</h2>
                        <p>Hi there,</p>
                        <p>Here is your PaperXAI report. We hope you enjoy it!</p>
                    </td>
                </tr>
                <tr>
                    <td style="padding: 0 0 20px 0;">
                        <h2>Report</h2>
                        <p>{report_string}</p>
                    </td>
                </tr>
            </table>
        </td>
    </tr>
</table>
</body>
</html>
"""

In [22]:
## format the report using the report template

def format_save_report(report_string: str, HTML_template: str) -> None:
    """
    Insert the report into the HTML template and save the result to disk.
    """
    # str.replace is used rather than str.format because the template's CSS contains literal braces
    report_HTML = HTML_template.replace("{report_string}", report_string)
    # save HTML to file
    with open("../display/report.html", "w", encoding="utf-8") as f:
        f.write(report_HTML)
    print("HTML is saved to display/report.html, open it in your browser to view the report")

In [23]:
format_save_report(report_retriever.format_report(), HTML_newsletter_template)
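A note on the design choice: the template is filled with `str.replace` on the literal `{report_string}` marker. The short sketch below, using a cut-down template invented for this example, shows why `str.format` would not work here: the CSS rules contain braces, which `str.format` tries to interpret as replacement fields.

```python
# Cut-down stand-in for the newsletter template: CSS braces plus the marker.
mini_template = "<style>body { color: #333333; }</style><p>{report_string}</p>"


def fill_template(html_template: str, report_string: str) -> str:
    """Substitute the report into the template without touching the CSS braces."""
    return html_template.replace("{report_string}", report_string)


filled = fill_template(mini_template, "<h2>Section</h2>")
print(filled)

# str.format treats "{ color: #333333; }" as a replacement field and fails.
try:
    mini_template.format(report_string="<h2>Section</h2>")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print("str.format failed:", format_failed)
```

Using `str.format` would require doubling every CSS brace in the template (`{{` and `}}`), which makes the HTML harder to edit.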


In [19]:
report_retriever.format_report()

"<h2> Section: Large Language Model inference optimization</h2><h3> Question: What are the latest developments around large language inference and quantization?</h3><p> LLM response: The latest developments around large language inference and quantization include the introduction of the QuIP method, which utilizes incoherence processing to improve quantization algorithms for large language models (Chee, 2023), the proposal of multilevel large language models that unify generic and specific models to improve performance based on user input and internet information (Gong, 2023), and the development of myQASR, a personalized mixed-precision quantization method for automatic speech recognition models that tailors quantization schemes for diverse users under any memory requirement (Fish, 2023).</p><h4> Papers </h4><ul><li> QuIP: 2-Bit Quantization of Large Language Models With Guarantees. Chee et al. 2023-07-25</li><li> Multilevel Large Language Models for Everyone. Gong et al. 2023-07-25</

## Sending the personalized newsletter
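
This section is left empty in the notebook. As a starting point, here is a minimal sketch of how the generated HTML could be wrapped in an email using only the Python standard library. The function name, the addresses, and the subject line are placeholders chosen for this example, and nothing here is part of paperXai itself.

```python
from email.message import EmailMessage


def build_newsletter_email(
    html_body: str,
    sender: str,
    recipient: str,
    subject: str = "PaperXAI Report",
) -> EmailMessage:
    """Build a multipart email with a plain-text fallback and an HTML body."""
    message = EmailMessage()
    message["Subject"] = subject
    message["From"] = sender
    message["To"] = recipient
    # Plain-text part for mail clients that do not render HTML.
    message.set_content(
        "This is the plain-text fallback for your PaperXAI report; "
        "view it in an HTML-capable mail client."
    )
    # HTML part holding the formatted report.
    message.add_alternative(html_body, subtype="html")
    return message


demo_message = build_newsletter_email(
    "<html><body><h1>PaperXAI Report</h1></body></html>",
    sender="sender@example.com",        # placeholder address
    recipient="reader@example.com",     # placeholder address
)
print(demo_message["Subject"])
```

Actually sending the message would go through `smtplib`, for example by opening `smtplib.SMTP_SSL(host, port)`, calling `login(...)`, and then `send_message(demo_message)`, with the host, port, and credentials supplied by your own mail provider.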
