# Download the data

We only download FinGPT's processed data, following this excerpt from the README in the outer directory:

For the datasets we used, download our processed instruction-tuning data from Hugging Face. Taking the FinRED dataset as an example:
```
import datasets

dataset = datasets.load_dataset('FinGPT/fingpt-finred')
# save to local disk space (recommended)
dataset.save_to_disk('data/fingpt-finred')
```
After this, `finred` becomes an available task option for training.
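
To avoid re-downloading on every run, the download-and-save step above can be wrapped in a small helper. This is a minimal sketch and not part of FinGPT: `fetch_once` and its parameters are hypothetical names. The only library calls it relies on are `datasets.load_dataset`, `datasets.load_from_disk`, and the `save_to_disk` method already used above.

```python
import os


def fetch_once(repo_id, local_dir, load_fn=None, load_local_fn=None):
    """Return the dataset saved in local_dir, downloading repo_id first if needed.

    fetch_once is a hypothetical helper, not a FinGPT or `datasets` API.
    """
    if load_fn is None or load_local_fn is None:
        # Imported lazily so the function can be exercised with stand-in loaders.
        import datasets
        load_fn = load_fn or datasets.load_dataset
        load_local_fn = load_local_fn or datasets.load_from_disk

    if os.path.isdir(local_dir):
        # A saved copy already exists: read it from disk, no network access.
        return load_local_fn(local_dir)

    dataset = load_fn(repo_id)        # download from the Hugging Face Hub
    dataset.save_to_disk(local_dir)   # same call as in the README excerpt
    return dataset
```

For example, `fetch_once('FinGPT/fingpt-finred', 'data/fingpt-finred')` would download on the first call and read from `data/fingpt-finred` afterwards. An interrupted save can leave a partial directory behind, which this sketch does not detect.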

We use different datasets at different phases of our instruction tuning paradigm.
- Task-specific Instruction Tuning: `sentiment-train / finred-re / ner / headline`
- Multi-task Instruction Tuning: `sentiment-train & finred & ner & headline`
- Zero-shot Aimed Instruction Tuning: `finred-cls & ner-cls & headline-cls -> sentiment-cls (test)`
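
The phase-to-dataset mapping above can be written down as data, which makes it easier to download only what a given phase needs. The sketch below is illustrative: the dictionary keys and the `repo_ids` function are hypothetical names, and the `FinGPT/fingpt-<task>` naming pattern is an assumption extrapolated from `FinGPT/fingpt-finred`, so check each repository name on Hugging Face before relying on it.

```python
# Task names are copied from the list above; the phase keys are illustrative.
PHASE_TASKS = {
    "task-specific": ["sentiment-train", "finred-re", "ner", "headline"],
    "multi-task": ["sentiment-train", "finred", "ner", "headline"],
    # The first three are used for training; sentiment-cls is the held-out test task.
    "zero-shot": ["finred-cls", "ner-cls", "headline-cls", "sentiment-cls"],
}


def repo_ids(phase):
    """List candidate Hugging Face repository IDs for one tuning phase.

    ASSUMPTION: every task is published as 'FinGPT/fingpt-<task>'.
    Only 'FinGPT/fingpt-finred' is confirmed by the README excerpt.
    """
    return [f"FinGPT/fingpt-{task}" for task in PHASE_TASKS[phase]]
```

Each returned ID can then be passed to `datasets.load_dataset` in the same way as in the FinRED example.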

You may download the datasets according to your needs. We also provide processed datasets for ConvFinQA and FinEval, but they are not used in our final work.

### Prepare data from scratch
To prepare training data from raw data, follow `data/prepate_data.ipynb`.

Our repository does not include source data from other open-source financial datasets. To prepare the data from scratch, find the corresponding source datasets yourself and place them in `data/` before you start.
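
Before running the preparation notebook, it can help to check that the source data is actually in place. The following is a minimal sketch under stated assumptions: `missing_sources` and `EXPECTED_SOURCES` are hypothetical names, and the directory names are the ones this notebook uses when saving the raw datasets, which may not be the complete set the preparation notebook expects.

```python
from pathlib import Path

# ASSUMPTION: these are the local directory names used by the download cell
# in this notebook; adjust the list to match your own layout.
EXPECTED_SOURCES = [
    "twitter-financial-news-sentiment",
    "news_with_gpt_instructions",
    "financial_phrasebank-sentences_50agree",
    "fiqa-2018",
]


def missing_sources(data_dir="data", names=EXPECTED_SOURCES):
    """Return the expected source directories that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_dir()]
```

An empty list from `missing_sources()` means every expected directory exists. It does not verify that the contents are complete.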


In [8]:
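import os
import shutil
# Local directories written by the download cell; delete them for a clean re-download.
file_paths = [
    'twitter-financial-news-sentiment',
    'news_with_gpt_instructions',
    'financial_phrasebank-sentences_50agree',
    'fiqa-2018',
]

for file_path in file_paths:
    # Skip directories that were never created.
    if os.path.exists(file_path):
        shutil.rmtree(file_path)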


In [7]:
# Load and save the datasets
from datasets import load_dataset

dataset = load_dataset('pauri32/fiqa-2018')
dataset.save_to_disk('fiqa-2018')

dataset = load_dataset('financial_phrasebank', 'sentences_50agree')
dataset.save_to_disk('financial_phrasebank-sentences_50agree')

dataset = load_dataset('zeroshot/twitter-financial-news-sentiment')
dataset.save_to_disk('twitter-financial-news-sentiment')

dataset = load_dataset('oliverwang15/news_with_gpt_instructions')
dataset.save_to_disk('news_with_gpt_instructions')

Downloading data: 100%|██████████| 161k/161k [00:00<00:00, 644kB/s]
Downloading data: 100%|██████████| 16.7k/16.7k [00:00<00:00, 153kB/s]
Downloading data: 100%|██████████| 25.3k/25.3k [00:00<00:00, 247kB/s]
Generating train split: 961 examples [00:00, 8255.54 examples/s]
Generating validation split: 102 examples [00:00, 23061.78 examples/s]
Generating test split: 150 examples [00:00, 28996.89 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 961/961 [00:00<00:00, 124832.80 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 102/102 [00:00<00:00, 18930.88 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 150/150 [00:00<00:00, 27232.20 examples/s]
Downloading data: 100%|██████████| 392k/392k [00:00<00:00, 1.17MB/s]
Generating train split: 100%|██████████| 4846/4846 [00:00<00:00, 161690.26 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 4846/4846 [00:00<00:00, 183019.51 examples/s]
Downloading readme: 100%|██████████| 1.39k/1.39k [00:00<00:

In [6]:
# dataset = datasets.load_dataset('FinGPT/fingpt-finred')
# dataset.save_to_disk('fingpt-finred-re')

# dataset = datasets.load_dataset('FinGPT/fingpt-headline')
# dataset.save_to_disk('fingpt-headline')

# dataset = datasets.load_dataset('FinGPT/fingpt-ner')
# dataset.save_to_disk('fingpt-ner')

# dataset = datasets.load_dataset('pauri32/fiqa-2018')
# dataset.save_to_disk('fiqa-2018')


# dataset = datasets.load_dataset('FinGPT/fingpt-fineval')
# dataset.save_to_disk('fingpt-fineval')

Saving the dataset (1/1 shards): 100%|██████████| 27558/27558 [00:00<00:00, 854696.79 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 5112/5112 [00:00<00:00, 548566.80 examples/s]
