# Download the data

We download FinGPT's preprocessed data directly instead of rebuilding it, following these instructions from the README in the outer directory:

For the datasets we used, download our processed instruction-tuning data from Hugging Face. Taking the FinRED dataset as an example:
```
import datasets

dataset = datasets.load_dataset('FinGPT/fingpt-finred')
# save to local disk space (recommended)
dataset.save_to_disk('data/fingpt-finred')
```
After that, `finred` becomes an available task option for training.
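
Once a dataset has been saved, it can be read back from disk without contacting the Hub again. The following is a minimal sketch, assuming the `data/` layout used in the snippet above. `local_dir` and `load_saved` are hypothetical helper names introduced here for illustration and are not part of FinGPT; `datasets.load_from_disk` is the library counterpart of `save_to_disk`.

```python
import os


def local_dir(repo_id: str, root: str = "data") -> str:
    """Map a Hub id such as 'FinGPT/fingpt-finred' to a local folder under `root`."""
    # Keep only the part after the organisation prefix.
    return os.path.join(root, repo_id.split("/")[-1])


def load_saved(repo_id: str, root: str = "data"):
    """Reload a dataset previously written with `save_to_disk` (hypothetical helper)."""
    import datasets  # imported lazily so the path helper works without the library

    return datasets.load_from_disk(local_dir(repo_id, root))
```

With the FinRED copy saved as shown above, `load_saved('FinGPT/fingpt-finred')` should return the splits that were written to `data/fingpt-finred`.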

We use different datasets at different phases of our instruction tuning paradigm.
- Task-specific Instruction Tuning: `sentiment-train / finred-re / ner / headline`
- Multi-task Instruction Tuning: `sentiment-train & finred & ner & headline`
- Zero-shot Aimed Instruction Tuning: `finred-cls & ner-cls & headline-cls -> sentiment-cls (test)`

You may download the datasets according to your needs. We also provide processed datasets for ConvFinQA and FinEval, but they are not used in our final work.

### Prepare data from scratch
To prepare training data from raw data, follow `data/prepare_data.ipynb`.

We don't include any source data from other open-source financial datasets in our repository. If you want to build the data from scratch, you need to find the corresponding source data and put it in `data/` before you start.


## Instruction Tuning Datasets and Models
The datasets we used and the **multi-task financial LLM** models are available at <https://huggingface.co/FinGPT>.

[Our Code](https://github.com/AI4Finance-Foundation/FinGPT/tree/master/fingpt/FinGPT_Benchmark)
  
| Dataset | Train Rows | Test Rows | Description |
| --- | --- | --- | --- |
| [fingpt-sentiment-train](https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train) | 76.8k | N/A | Sentiment Analysis Training Instructions |
| [fingpt-finred](https://huggingface.co/datasets/FinGPT/fingpt-finred) | 27.6k | 5.11k | Financial Relation Extraction Instructions |
| [fingpt-headline](https://huggingface.co/datasets/FinGPT/fingpt-headline) | 82.2k | 20.5k | Financial Headline Analysis Instructions |
| [fingpt-ner](https://huggingface.co/datasets/FinGPT/fingpt-ner) | 511 | 98 | Financial Named-Entity Recognition Instructions |
| [fingpt-fiqa_qa](https://huggingface.co/datasets/FinGPT/fingpt-fiqa_qa) | 17.1k | N/A | Financial Q&A Instructions |
| [fingpt-fineval](https://huggingface.co/datasets/FinGPT/fingpt-fineval) | 1.06k | 265 | Chinese Multiple-Choice Questions Instructions |
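
The row counts in the table are abbreviated to three significant digits, so a downloaded copy can be sanity-checked against them. Below is a rough sketch of such a check. `abbreviate` and `split_sizes` are hypothetical helpers, and the formatting rule is an assumption inferred from the table entries. It is only meant for counts in the range the table covers.

```python
def abbreviate(n: int) -> str:
    """Format a row count like the table: three significant digits, 'k' for thousands."""
    if n < 1000:
        return str(n)
    return f"{n / 1000:.3g}k"


def split_sizes(dataset_dict) -> dict:
    """Return {split name: abbreviated row count} for a mapping of split name to a sized object."""
    # len() of a `datasets.Dataset` is its number of rows, so a DatasetDict works here too.
    return {name: abbreviate(len(split)) for name, split in dataset_dict.items()}
```

For example, the FinRED save log further down in this notebook reports 27558 and 5112 examples, which this rule formats as `27.6k` and `5.11k`, the same values as the `fingpt-finred` row.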




In [3]:
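import os
import shutil

# Local dataset directories used by this notebook.
# 'fingpt-finred' is included because a later cell saves to it.
file_paths = ['fingpt-sentiment-train', 'fingpt-finred-re', 'fingpt-finred', 'fingpt-headline', 'fingpt-ner', 'fingpt-fiqa_qa', 'fingpt-fineval']

# Remove stale copies so that every run starts from a clean state.
for file_path in file_paths:
    if os.path.exists(file_path):
        shutil.rmtree(file_path)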


In [4]:
# Load and save the datasets
import datasets

dataset = datasets.load_dataset("FinGPT/fingpt-finred")
dataset.save_to_disk('fingpt-finred-re')

Saving the dataset (1/1 shards): 100%|██████████| 27558/27558 [00:00<00:00, 765793.87 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 5112/5112 [00:00<00:00, 104355.93 examples/s]


In [None]:
dataset = datasets.load_dataset("FinGPT/fingpt-finred")
dataset.save_to_disk('fingpt-finred')

dataset = datasets.load_dataset("FinGPT/fingpt-headline")
dataset.save_to_disk('fingpt-headline')

dataset = datasets.load_dataset("FinGPT/fingpt-ner")
dataset.save_to_disk('fingpt-ner')

dataset = datasets.load_dataset("FinGPT/fingpt-fiqa_qa")
dataset.save_to_disk('fingpt-fiqa_qa')

dataset = datasets.load_dataset("FinGPT/fingpt-fineval")
dataset.save_to_disk('fingpt-fineval')
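
The repeated load-and-save pairs above can also be written as one loop. The sketch below is an alternative to the delete-then-redownload approach used in this notebook, not a copy of it: it skips any folder that already exists. `download_all` is a hypothetical helper, and the loader is passed in as `load_fn` (for example `datasets.load_dataset`) so the function can be tried without network access.

```python
import os


def download_all(repo_ids, load_fn, root="."):
    """Save each Hub dataset under `root`, skipping folders that already exist."""
    saved = []
    for repo_id in repo_ids:
        # 'FinGPT/fingpt-ner' is stored in '<root>/fingpt-ner'.
        target = os.path.join(root, repo_id.split("/")[-1])
        if os.path.isdir(target):
            continue  # assume an earlier run already saved this dataset
        load_fn(repo_id).save_to_disk(target)
        saved.append(target)
    return saved
```

A call such as `download_all(["FinGPT/fingpt-headline", "FinGPT/fingpt-ner"], datasets.load_dataset)` would then save only the datasets that are not yet on disk and return the list of newly written folders.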