# Handling Local Data
To load datasets stored on your laptop or on a remote server, we can still use the `load_dataset()` function. This time, we specify the type of loading script, along with a `data_files` argument that gives the path to one or more files.

!["load_dataset()"](data/chapter_5/load_dataset.png "load_dataset()")

### Loading a local dataset

| Data format | Loading script | Example |
|-------------|----------------|---------|
| CSV & TSV |`csv`|`load_dataset("csv", data_files="my_file.csv")`|
| Text files |`text`|`load_dataset("text", data_files="my_file.txt")`|
| JSON & JSON Lines |`json`|`load_dataset("json", data_files="my_file.json")`|
| Pickled DataFrames |`pandas`|`load_dataset("pandas", data_files="my_dataframe.pkl")`|

For this example, let's use the [SQuAD-it](https://github.com/crux82/squad-it/) dataset, a large-scale JSON dataset for question answering in Italian. It's hosted on GitHub, so let's first download the compressed files `SQuAD_it-train.json.gz` and `SQuAD_it-test.json.gz` into our `data/chapter_5` directory using `wget`, and then decompress them using `gzip`:

In [None]:
!cd data/chapter_5 && wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!cd data/chapter_5 && wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

!cd data/chapter_5 && gzip -dkv SQuAD_it-*.json.gz

Now that we have our data in JSON format, we can load it with the `load_dataset()` function. We just need to know whether we're dealing with **ordinary JSON** (*similar to a nested dictionary*) or **JSON Lines** (*line-separated JSON*). Like many question answering datasets, **SQuAD-it** uses the *nested format*, with all the text stored in a `data` field. This means we can load the dataset by specifying the `field="data"` argument:

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="data/chapter_5/SQuAD_it-train.json", field="data")

squad_it_dataset
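
To make the difference between the two JSON layouts concrete, here is a minimal sketch that uses only Python's standard `json` module. The toy records are invented for illustration and are not taken from SQuAD-it; the sketch only shows why a nested file needs a field name to locate its rows while a JSON Lines file does not.

```python
import json

# Hypothetical toy records, invented for illustration only.
records = [
    {"title": "Doc A", "paragraphs": ["first"]},
    {"title": "Doc B", "paragraphs": ["second"]},
]

# Ordinary (nested) JSON: one top-level object, with the rows stored under a key.
nested_text = json.dumps({"version": "1.0", "data": records})

# JSON Lines: one self-contained JSON object per line, with no wrapper object.
jsonl_text = "\n".join(json.dumps(record) for record in records)

# Reading the nested form means parsing the whole document and then
# selecting the field that holds the rows.
rows_from_nested = json.loads(nested_text)["data"]

# Reading JSON Lines means parsing each non-empty line on its own.
rows_from_jsonl = [json.loads(line) for line in jsonl_text.splitlines() if line]

print(rows_from_nested == rows_from_jsonl)
```

Both approaches recover the same rows; the only difference is where the rows live inside the file.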

By default, loading local files creates a `DatasetDict` object with only a **train** split. What we really want is to include both the **train** and **test** splits in a single `DatasetDict` object so we can apply `Dataset.map()` functions across both splits at once. To do this, we can pass the `data_files` argument a dictionary that maps each split name to a file associated with that split:

In [None]:
data_files = {
    "train":"data/chapter_5/SQuAD_it-train.json",
    "test":"data/chapter_5/SQuAD_it-test.json"
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

The loading scripts in Datasets actually support automatic decompression of the input files, so we could have skipped the use of gzip by pointing the `data_files` argument directly to the compressed files:
```python
data_files = {
    "train": "data/chapter_5/SQuAD_it-train.json.gz", 
    "test": "data/chapter_5/SQuAD_it-test.json.gz"
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
This can be useful if you don’t want to manually decompress many `GZIP` files. The automatic decompression also applies to other common formats like `ZIP` and `TAR`, so you just need to point `data_files` to the compressed files.

> The `data_files` argument is also quite flexible and can be either *a single file path*, *a list of file paths*, or *a dictionary* that maps split names to file paths. You can also *glob files* that match a *specified pattern* according to the rules used by the `Unix shell` (e.g., you can glob all the `JSON` files in a directory as a single split by setting `data_files="*.json"`). See the [Datasets documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) for more details.
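
As a rough illustration of shell-style globbing, the sketch below uses Python's standard `glob` module on a temporary directory. The file names are hypothetical, and the library's own pattern resolution may differ in its details; the sketch only shows which names a pattern like `*.json` selects.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty, hypothetical files to match against.
    for name in ["part-1.json", "part-2.json", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()

    # "*.json" matches every file name in the directory that ends in ".json".
    matched = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmp_dir, "*.json"))
    )

print(matched)  # ['part-1.json', 'part-2.json']
```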

### Loading a remote dataset

Fortunately, loading *remote files* is just as simple as loading *local* ones!
<br />
Instead of providing a path to *local files*, we point the `data_files` argument to **one or more URLs** where the *remote files* are stored.

In [None]:
url =  "https://github.com/crux82/squad-it/raw/master/"

data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

# Data Manipulation

The `DatasetDict` object comes with many methods for manipulating the original dataset.
<br />
For this example, we’ll use the [Drug Review Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29) that’s hosted on the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient’s satisfaction.

In [None]:
!cd data/chapter_5/ && wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!cd data/chapter_5/ && unzip drugsCom_raw.zip

The data is in the `TSV` format, a variant of `CSV` that uses tabs instead of commas as the separator. So, when loading these files with `load_dataset()`, we specify `csv` as the *loading script* and, most importantly, pass the `delimiter="\t"` argument:

In [None]:
from datasets import load_dataset

data_files = {
    "train" : "data/chapter_5/drugsComTrain_raw.tsv",
    "test" : "data/chapter_5/drugsComTest_raw.tsv"
}

drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
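
To see why the delimiter matters, here is a small sketch using Python's standard `csv` module rather than `load_dataset()`. The two-line TSV snippet is made up for illustration; only the column names `condition` and `review` come from this section.

```python
import csv
import io

# Hypothetical TSV snippet: a header line and one row, separated by tabs.
tsv_text = "condition\treview\nHeadache\tWorked well, no side effects\n"

# Splitting on tabs keeps the comma inside the review text intact.
tab_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

# With the default comma delimiter, the row is split at the comma instead,
# and the tab stays embedded in the first value.
comma_rows = list(csv.reader(io.StringIO(tsv_text)))

print(tab_rows[1])    # ['Headache', 'Worked well, no side effects']
print(comma_rows[1])  # ['Headache\tWorked well', ' no side effects']
```

Parsing a tab-separated file with the wrong delimiter does not fail loudly; it silently produces the wrong columns, which is why the delimiter has to be set explicitly.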

Now that we have the `DatasetDict` object, we can create a random sample to get a quick feel for the type of data we're working with. To do so, we chain the `Dataset.shuffle()` and `Dataset.select()` methods: the first randomly shuffles the data (passing the `seed` argument makes the shuffle reproducible), and the second selects the first *n* examples:

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))

drug_sample[:3]

From the sample above, we can see that a few preprocessing steps are needed before tokenising the data or passing it to a model:
  + The `Unnamed: 0` column needs to be renamed to `patient_id`.
  + The `condition` column includes a mix of *uppercase* and *lowercase* labels.
  + The reviews in the `review` column are of varying length and contain a mix of line separators (`\r\n`) as well as HTML character codes like `&#039;`.

So, we can use built-in methods: `rename_column()` to rename the column, `filter()` to drop the rows where `condition` is missing, and `map()` to lowercase all the `condition` values and to unescape the HTML character codes.

In [None]:
import html

# rename the column name
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0",
    new_column_name="patient_id"
)

# map condition column values to lowercase
def lowercase_condition(data):
    # with batched=True, data["condition"] is a list of strings;
    # without it, use: return {"condition": data["condition"].lower()}
    return {"condition": [row.lower() for row in data["condition"]]}


# let's first remove all the rows with null values, otherwise the above
# function will throw an error
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

# lowercase the condition column
drug_dataset = drug_dataset.map(lowercase_condition, batched=True)


# unescape all the HTML special characters in our corpus
drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(row) for row in x["review"]]},
    batched=True
)


drug_dataset["train"][:2]
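
The difference between batched and non-batched mapping functions can be sketched with plain dictionaries standing in for what `map()` would pass to the function. The rows and the two function names below are hypothetical and exist only for this illustration.

```python
import html

# Hypothetical rows standing in for dataset examples.
single_example = {"condition": "Left Ventricular Dysfunction"}
batch = {
    "condition": ["ADHD", "Birth Control"],
    "review": ["It didn&#039;t help", "No issues"],
}

def lowercase_condition_single(example):
    # Without batched=True, the function sees one row: each value is a scalar.
    return {"condition": example["condition"].lower()}

def lowercase_condition_batched(examples):
    # With batched=True, the function sees many rows: each value is a list.
    return {"condition": [value.lower() for value in examples["condition"]]}

print(lowercase_condition_single(single_example))
# {'condition': 'left ventricular dysfunction'}
print(lowercase_condition_batched(batch))
# {'condition': ['adhd', 'birth control']}
print([html.unescape(text) for text in batch["review"]])
# ["It didn't help", 'No issues']
```

In both cases the function returns a dictionary keyed by column name; only the shape of the values changes.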

> In Python, `lambda` functions are small functions that you can define without explicitly naming them. They take the general form `lambda <arguments> : <expression>`, where `lambda` is one of Python's special keywords, `<arguments>` is a comma-separated list of names that define the *inputs* to the function, and `<expression>` represents the operation you wish to execute. For example, we can define a simple lambda function that squares a number as follows: `lambda x : x * x`. To apply this function to an input, we need to wrap it and the input in parentheses: `(lambda x: x * x)(3)`, which evaluates to `9`.
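
As a further small sketch, a lambda can take several comma-separated inputs, and it is most often passed directly to another function. The values below are arbitrary and chosen only for illustration.

```python
# Several inputs are separated by commas before the colon.
area = (lambda base, height: 0.5 * base * height)(4, 8)

# Lambdas are most useful as short, throwaway arguments to other functions,
# for example as a sort key.
words = ["tablet", "gel", "capsule"]
by_length = sorted(words, key=lambda word: len(word))

print(area)       # 16.0
print(by_length)  # ['gel', 'tablet', 'capsule']
```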

### From Datasets to DataFrames and back

We can use the `set_format()` method of the `DatasetDict` object to change its output format to one of *Pandas*, *NumPy*, *PyTorch*, *TensorFlow*, or *JAX*. This changes only the format in which rows are returned; to switch back to the default format, we simply call the `reset_format()` method:

In [None]:
drug_dataset.set_format("pandas")

drug_dataset["train"][:3]

In [None]:
drug_dataset.reset_format()

drug_dataset["train"][:3]

### Creating a validation set
The `Dataset` object also provides a `Dataset.train_test_split()` method, modeled on the well-known function of the same name in `scikit-learn`, which we can use to further split the data into train, validation, and test sets:


In [None]:
# 80-20 percent train-validation split on the training dataset
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=41)

# rename the 20% split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")

# add the original test dataset
drug_dataset_clean["test"] = drug_dataset["test"]

drug_dataset_clean
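
The idea behind a seeded split can be sketched in a few lines of plain Python. This is not how the library implements `train_test_split()`; the helper `split_indices` is hypothetical and only illustrates shuffling row indices with a fixed seed and cutting them at the requested proportion.

```python
import random

def split_indices(num_rows, train_size, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    cut = int(num_rows * train_size)
    return indices[:cut], indices[cut:]

train_idx, validation_idx = split_indices(num_rows=10, train_size=0.8, seed=41)

print(len(train_idx), len(validation_idx))  # 8 2
```

Because the seed is fixed, calling the helper again with the same arguments returns the same two index lists, which is what makes a split reproducible.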

### Saving a dataset
To save a dataset to disk, we can use one of the following functions, depending on the format:

| Data format | Function |
|-------------|----------|
|*Arrow*|`Dataset.save_to_disk()`|
|*CSV*|`Dataset.to_csv()`|
|*JSON*|`Dataset.to_json()`|

For example, let’s save our cleaned dataset in the Arrow format:

In [None]:
drug_dataset_clean.save_to_disk("data/chapter_5/drug-reviews")

!ls data/chapter_5/drug-reviews/*

Once the dataset is saved, we can load it by using the `load_from_disk()` function as follows:

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("data/chapter_5/drug-reviews")
drug_dataset_reloaded

For the **CSV** and **JSON** formats, we have to store each split as a separate file. One way to do this is to iterate over the keys and values in the `DatasetDict` object and call `to_json()` on each split, which saves it in JSON Lines format, where each row in the dataset is stored as a single line of JSON:

In [None]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"data/chapter_5/drug-reviews-{split}.jsonl")

And to load the data we can simply use the `load_dataset()` function:

In [None]:
data_files = {
    "train": "data/chapter_5/drug-reviews-train.jsonl",
    "validation": "data/chapter_5/drug-reviews-validation.jsonl",
    "test": "data/chapter_5/drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

drug_dataset_reloaded