# Data Prep for Research RAG

Seed the RAG with the [jamescalam/ai-arxiv2](https://huggingface.co/datasets/jamescalam/ai-arxiv2) dataset, then convert the data to the [Canopy format](https://github.com/pinecone-io/canopy/blob/main/README.md#2-uploading-data).

Also see:
- [canopy-data-prep.ipynb](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/canopy/00-canopy-data-prep.ipynb)

In [None]:
!pip install datasets

In [5]:
from datasets import load_dataset

In [10]:
data = load_dataset("jamescalam/ai-arxiv2", split="train")

In [11]:
data

Dataset({
    features: ['id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'content', 'references'],
    num_rows: 2673
})

## Convert data for Canopy

Canopy expects the following columns:
- id: unique identifier for the document
- text: the text of the document, in UTF-8 encoding.
- source: the source of the document, can be any string, or null. This will be used as a reference in the generated context.
- metadata: optional metadata for the document
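
The column descriptions above can be turned into a small sanity check to run on a record before upload. This is a minimal sketch, not part of Canopy: the helper name `check_canopy_record` is hypothetical, and treating `id` and `text` as required strings while `source` and `metadata` may be absent is an assumption read off the list above.

```python
from typing import Any

# Assumption: "id" and "text" are required; "source" and "metadata" are optional.
REQUIRED_KEYS = {"id", "text"}
OPTIONAL_KEYS = {"source", "metadata"}


def check_canopy_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    keys = set(record)

    missing = REQUIRED_KEYS - keys
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    unexpected = keys - REQUIRED_KEYS - OPTIONAL_KEYS
    if unexpected:
        problems.append(f"unexpected keys: {sorted(unexpected)}")

    # id and text must be strings when present
    for key in ("id", "text"):
        if key in record and not isinstance(record[key], str):
            problems.append(f"{key} must be a string")

    # source may be any string, or null
    source = record.get("source")
    if source is not None and not isinstance(source, str):
        problems.append("source must be a string or None")

    # metadata is optional, but should be a mapping when given
    metadata = record.get("metadata")
    if metadata is not None and not isinstance(metadata, dict):
        problems.append("metadata must be a dict or None")

    return problems
```

Once the dataset has been converted, calling `check_canopy_record(data[0])` should return an empty list if the row matches the expected shape.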

In [13]:
data = data.map(lambda x: {
    "id": x["id"],
    "text": x["content"],
    "source": x["source"],
    "metadata": {
        "title": x["title"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
    }
})
# drop unneeded columns
data = data.remove_columns([
    "title", "summary", "content",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references"
])
data

Dataset({
    features: ['id', 'source', 'text', 'metadata'],
    num_rows: 2673
})

In [14]:
data.to_json("ai_arxiv2.jsonl", orient="records", lines=True)

213337664
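
Before uploading, it can be worth spot-checking the exported file. The sketch below reads the first few lines of a JSON Lines file with the standard library only; the helper name `peek_jsonl` is hypothetical and is not part of `datasets` or Canopy.

```python
import json
from itertools import islice


def peek_jsonl(path: str, n: int = 3) -> list[dict]:
    """Parse and return up to the first n non-empty records of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the whole file is never loaded
        return [json.loads(line) for line in islice(f, n) if line.strip()]
```

For example, `peek_jsonl("ai_arxiv2.jsonl")` should return dictionaries with the keys `id`, `source`, `text`, and `metadata`, assuming the export above completed successfully.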

## Use Canopy to upload

From here, switch to the Canopy CLI (or another upload method) and run:

```
canopy
canopy upsert ./ai_arxiv2.jsonl
```
