### In this section we'll see how the 🤗 Datasets library can be used to load datasets that aren't available on the Hugging Face Hub

In [1]:
# install git-lfs (Git Large File Storage, needed to handle large files in Git repositories)
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs

# necessary packages to use huggingface with paperspace
!pip install -q --upgrade transformers torch torchvision torchaudio
!pip install -q tokenizers==0.13.3 evaluate
!pip install -q bitsandbytes transformers accelerate gradio thread6


### Working with local and remote datasets

- 🤗 Datasets provides loading scripts to handle the loading of local and remote datasets. It supports several common data formats, such as:

In [3]:
# we'll use the 'prettytable' package for formatting our data
!pip install prettytable

Collecting prettytable
  Downloading prettytable-3.9.0-py3-none-any.whl (27 kB)
Installing collected packages: prettytable
Successfully installed prettytable-3.9.0

In [31]:
from IPython.display import HTML, display

def display_table(data):
    # Add some basic CSS for styling
    html = """
    <style>
        table {
            border-collapse: collapse;
            width: 46%;
            margin: 20px 0;
            font-size: 16px;
        }
        th, td {
            padding: 10px !important;
            border-bottom: 1px solid #ddd;
            background-color: #f2f2f2;
            text-align: center !important;
        }
        /* Center-align the column labels */
        th {
            background-color: #4CAF50;
            color: white;
            text-align: center !important;
        }
        tr:hover {
            background-color: #e0e0e0;
        }
        /* Add these styles to adjust column widths */
        td:nth-child(1), th:nth-child(1) {
            width: 20%;  /* Adjust as needed */
        }
        td:nth-child(2), th:nth-child(2) {
            width: 10%;  /* Adjust as needed */
        }
    </style>
    <table>
    """
    
    for i, row in enumerate(data):
        html += "<tr>"
        for field in row:
            if i == 0:
                html += "<th><h4>%s</h4></th>" % field
            else:
                html += "<td><h4>%s</h4></td>" % field
        html += "</tr>"
    html += "</table>"
    
    display(HTML(html))

# Define your table data
data = [
    ["Data format", "Loading script", "Example"],
    ["CSV & TSV", "csv", 'load_dataset("csv", data_files="my_file.csv")'],
    ["Text files", "text", 'load_dataset("text", data_files="my_file.txt")'],
    ["JSON & JSON Lines", "json", 'load_dataset("json", data_files="my_file.jsonl")'],
    ["Pickled DataFrames", "pandas", 'load_dataset("pandas", data_files="my_dataframe.pkl")']
]

# Display the table using the function
display_table(data)


| Data format | Loading script | Example |
|---|---|---|
| CSV & TSV | csv | load_dataset("csv", data_files="my_file.csv") |
| Text files | text | load_dataset("text", data_files="my_file.txt") |
| JSON & JSON Lines | json | load_dataset("json", data_files="my_file.jsonl") |
| Pickled DataFrames | pandas | load_dataset("pandas", data_files="my_dataframe.pkl") |


- As shown in the table, for each data format we just need to specify the type of loading script in the 'load_dataset()' function, along with a 'data_files' argument that specifies the path to one or more files.

- Let's start by loading a dataset from local files; later we'll see how to do the same with remote files

### Loading a local dataset

- For this example we'll use the [SQuAD-it](https://github.com/crux82/squad-it/) dataset, which is a large-scale dataset for question answering in Italian

- The training and test splits are hosted on GitHub, so we can download them with a simple 'wget' command

In [35]:
# Create the 'data' directory if it doesn't already exist
!mkdir -p data

# Navigate into the 'data' directory
# the '%' magic makes the directory change persistent across cells in Jupyter notebooks
%cd data

# This will download two compressed files called 'SQuAD_it-train.json.gz' and 'SQuAD_it-test.json.gz',
# which we can decompress with the Linux gzip command
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

/notebooks/NLP_huggingface/Chapter_5/data
--2023-09-15 20:53:03--  https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-train.json.gz [following]
--2023-09-15 20:53:03--  https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-train.json.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7725286 (7.4M) [application/octet-stream]
Saving to: ‘SQuAD_it-train.json.gz’


2023-09-15 20:53:03 (295 MB/s) - ‘SQuAD_it-train.json.gz’ saved [7725286/7725286]

--2023-09-15 20:53:04--  https://github.com/

In [37]:
# decompress the '.json.gz' files (-d: decompress, -k: keep the archives, -v: verbose)
!gzip -dkv SQuAD_it-*.json.gz

SQuAD_it-test.json.gz:	 87.5% -- created SQuAD_it-test.json
SQuAD_it-train.json.gz:	 82.3% -- created SQuAD_it-train.json


- We can see that the decompressed files 'SQuAD_it-train.json' and 'SQuAD_it-test.json' now sit alongside the compressed archives (the '-k' flag keeps the originals), and that the data is stored in the JSON format

In [38]:
ls -lh

total 58M
-rw-r--r-- 1 root root 8.0M Sep 15 20:53 SQuAD_it-test.json
-rw-r--r-- 1 root root 1.1M Sep 15 20:53 SQuAD_it-test.json.gz
-rw-r--r-- 1 root root  42M Sep 15 20:53 SQuAD_it-train.json
-rw-r--r-- 1 root root 7.4M Sep 15 20:53 SQuAD_it-train.json.gz


##### To load a JSON file with the 'load_dataset()' function, we just need to know if we're dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-separated JSON)
- Like many question answering datasets, 'SQuAD-it' uses the nested format, with all the text stored in a 'data' field. This means we can load the dataset by specifying the 'field' argument as follows.
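
To make the difference between the two layouts concrete, here is a minimal sketch using only Python's standard 'json' module. The records and the "version" value are made up for illustration and are not taken from SQuAD-it; the sketch only shows how the same records look in each layout, not how 🤗 Datasets parses them internally.

```python
import json

# Nested JSON: one top-level object; the records live under the "data" key.
# (The "version" value and the records are hypothetical.)
nested_text = json.dumps({
    "version": "hypothetical-1.0",
    "data": [
        {"title": "A", "paragraphs": []},
        {"title": "B", "paragraphs": []},
    ],
})

# JSON Lines: one independent JSON object per line, with no enclosing object.
jsonl_text = "\n".join(
    json.dumps(record) for record in json.loads(nested_text)["data"]
)

# Nested form: parse the whole document, then pick out the field.
nested_records = json.loads(nested_text)["data"]

# JSON Lines form: parse each non-empty line on its own.
jsonl_records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(nested_records == jsonl_records)
```

In the nested case the records are only reachable through the enclosing key, which is why the 'field' argument is needed when loading a file laid out this way.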

In [39]:
from datasets import load_dataset

squad_it_dataset = load_dataset('json', data_files = "SQuAD_it-train.json", field = "data")

Using custom data configuration default-3654a07a1266403d


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-3654a07a1266403d/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-3654a07a1266403d/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

- By default, loading local files creates a DatasetDict object with a train split. We can see this by inspecting the squad_it_dataset object:

In [43]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

In [None]:
squad_it_dataset["train"][1]

- We've loaded our first local dataset!
- But while this worked for the training set, what we really want is to include both the train and test splits in a single 'DatasetDict' object so we can apply 'Dataset.map()' functions across both splits at once
- To do this, we can provide a dictionary to the data_files argument that maps each split name to a file associated with that split:

In [46]:
# merging both datasets into one
data_files = {'train': 'SQuAD_it-train.json', 'test': 'SQuAD_it-test.json'}
squad_it_dataset = load_dataset('json', data_files = data_files, field = 'data')
squad_it_dataset


Using custom data configuration default-ebe61a3d1552e88f


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-ebe61a3d1552e88f/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-ebe61a3d1552e88f/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

- This is exactly what we wanted. Now, we can apply various preprocessing techniques to clean up the data, tokenize the text, and so on.

- The 'data_files' argument of the 'load_dataset()' function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths.

- You can also glob files that match a specified pattern according to the rules used by the Unix shell (for example, you can glob all the JSON files in a directory as a single split by setting 'data_files = "*.json"')

- The loading scripts in 🤗 Datasets actually support automatic decompression of the input files, so we could have skipped the use of gzip by pointing the 'data_files' argument directly to the compressed files
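
As a rough illustration of the shell-style globbing mentioned in the list above, the sketch below uses Python's standard 'glob' module as a stand-in. The file names are hypothetical, and the library resolves patterns on its own, so treat this only as a picture of which names a pattern like "*.json" would select.

```python
import glob
import os
import tempfile

# Create a few hypothetical files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["a.json", "b.json", "notes.txt", "c.json.gz"]:
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("{}")

    # "*.json" must match the whole file name, so "c.json.gz" is not selected.
    matched = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmp_dir, "*.json"))
    )

print(matched)
```

Note that the pattern only selects names ending in '.json'; compressed files would need their own pattern.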

In [47]:
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files = data_files, field = "data")

Using custom data configuration default-49a878f2f428dbbf


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-49a878f2f428dbbf/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-49a878f2f428dbbf/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [50]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

- Now you don't have to decompress gzip files manually anymore!
    - Automatic decompression also applies to other common formats such as '.zip' and '.tar'
    - You just need to point 'data_files' to the compressed files and you're good to go
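
For intuition about what reading a compressed file directly involves, here is a small standalone sketch with Python's standard 'gzip' module. It is not how 🤗 Datasets implements decompression; the file name 'sample.json.gz' and the payload are hypothetical.

```python
import gzip
import json
import os
import tempfile

# Hypothetical payload in the nested layout, with records under "data".
payload = {"data": [{"title": "example", "paragraphs": []}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json.gz")

    # Write the JSON straight into a gzip-compressed file.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back without ever creating an uncompressed copy on disk.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = json.load(f)["data"]

print(len(records))
```

The compressed file is opened as a text stream and decompressed on the fly, so no separate gzip step is required before parsing.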
    
### Loading a remote dataset

- If you're working as a data scientist or coder in a company, there's a good chance the datasets you want to analyze are stored on some remote server
- Loading remote files is as simple as loading local ones
- Instead of pointing 'data_files' to local file paths, we point it to one or more URLs

In [51]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    'train': url + "SQuAD_it-train.json.gz",
    'test': url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files = data_files, field = "data")

Using custom data configuration default-57dcee3ea6992346


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-57dcee3ea6992346/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/7.73M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-57dcee3ea6992346/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [54]:
# Exercise
# Repeat what was taught but using your own dataset found online

# get the URL (I didn't know what was inside the .zip file, so I didn't want to run load_dataset() on it)
# !wget https://archive.ics.uci.edu/static/public/109/wine.zip # commented out

# gzip is only for .gz files; since this is a .zip file, we'll use 'unzip'
!unzip wine.zip

# Note to self: come back to this one. Don't use the wine dataset; probably because it is old (1991),
# there was an issue converting a .db file to a JSON file and something broke.

Archive:  wine.zip
  inflating: Index                   
  inflating: wine.data               
  inflating: wine.names              
