### TorchText with our custom data
In the previous notebooks we used the `IMDB` dataset for sentiment analysis classification. In the real world we will want to work with our own datasets. In this notebook we cover how to do that with the TorchText helper functions we have been using all along. So far we have:

1. Defined the Fields
2. Loaded the Dataset
3. Created the Splits

Recall:
```
TEXT = data.Field()
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
train_data, valid_data = train_data.split()
```
TorchText is capable of reading three file formats:

1. json -> JavaScript Object Notation
2. csv -> comma-separated values
3. tsv -> tab-separated values

**`json` is the best of the three; we will explain why later on.**

We have files in the `data` folder. The `train.json` file has the following format, with one json object per line:

```json
{"name": "Jocko", "quote": "You must own everything in your world. There is no one else to blame.", "score":1}
```
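
The sketch below shows one way such a file could be produced and read back using only the standard library. The `examples` list and the in-memory buffer are illustrative stand-ins for your own data and for a real file such as `data/train.json`; this is not part of TorchText itself.

```python
import io
import json

# Hypothetical examples; in practice these come from your own dataset.
examples = [
    {"name": "Jocko", "quote": "You must own everything in your world. There is no one else to blame.", "score": 1},
    {"name": "Potato guy", "quote": "Stand tall, and rice like a potato!", "score": 0},
]

# Write one json object per line. The StringIO buffer stands in for
# a file opened with open("data/train.json", "w").
buffer = io.StringIO()
for example in examples:
    buffer.write(json.dumps(example) + "\n")

# Read the lines back, parsing each line independently.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded[0]["quote"])
```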

In [3]:
from torchtext.legacy import data, datasets
import torch

### Define the Fields.

In [17]:
tokenizer = lambda x: x.split()

In [20]:
QOUTE = data.Field(tokenize=tokenizer, lower=True)
LABEL = data.LabelField(dtype=torch.float32)

Next, we must tell TorchText which fields apply to which elements of the json object.

For `json` data, we must create a dictionary where:

* the key matches the key of the json object
* the value is a tuple where:
    * the first element becomes the batch object's attribute name
    * the second element is the name of the Field
    
**What do we mean when we say "becomes the batch object's attribute name"?**

Recall from the previous notebooks that we accessed the `TEXT` and `LABEL` fields in the `train/evaluation` loop through `batch.text` and `batch.label`. This works because `TorchText` gives the batch object a `text` and a `label` attribute, each a tensor containing either the text or the labels.

**Take Home Notes**:
1. The order of the keys in the fields dictionary does not matter, as long as its keys match the json data keys.
2. The Field name does not have to match the key in the json object, e.g. you can use `LABEL` for the "score" field.
3. When dealing with json data, not all of the keys have to be used, e.g. we did not use the "name" field.
4. If the value of a json field is a string, the Field's tokenization is applied (the default is to split the string on spaces). If the value is a list, no tokenization is applied. It is usually a good idea for the data to already be tokenized into a list, as this saves time: you don't have to wait for TorchText to do it.
5. The values of a json field do not have to be the same type. Some examples can have their "quote" as a string, and some as a list. Tokenization is only applied to the ones whose "quote" is a string.
6. If you are using a json field, every single example must have an instance of that field.
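
To make these notes concrete, here is a small pure-Python sketch of the behaviour they describe. It is an approximation for illustration only, not TorchText's actual implementation; `make_example` and the simplified `fields` dictionary (which maps to a tokenizer function instead of a Field) are hypothetical.

```python
from types import SimpleNamespace

def tokenizer(text):
    return text.split()

# key in the json object -> (attribute name, tokenizer or None)
# The key order differs from the json object on purpose (note 1).
fields = {
    "score": ("s", None),
    "quote": ("q", tokenizer),
}

def make_example(obj, fields):
    """Build an object whose attribute names come from the fields dictionary."""
    example = SimpleNamespace()
    for key, (attr, tokenize) in fields.items():
        value = obj[key]  # raises KeyError if an example lacks a used key (note 6)
        if tokenize is not None and isinstance(value, str):
            value = tokenize(value)  # strings are tokenized (note 4)
        setattr(example, attr, value)  # lists pass through untouched (note 5)
    return example

raw = {"name": "Jocko", "quote": "There is no one else to blame.", "score": 1}
pre_tokenized = {"name": "Jocko", "quote": ["There", "is", "no", "one"], "score": 1}

# The unused "name" key is simply ignored (note 3).
print(vars(make_example(raw, fields)))
print(vars(make_example(pre_tokenized, fields)))
```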

In [35]:
fields = {
    'quote': ('q' , QOUTE),
    'score': ('s', LABEL)
}


Now, in a training loop we can iterate over the data iterator and access the quote via `batch.q` and the score/label via `batch.s`.

We then create our datasets (train_data and test_data) with the TabularDataset.splits function.

The `path` argument specifies the **top-level folder** (in our case `data`) common to both datasets, and the `train` and `test` arguments specify the filename of each dataset, e.g. here the train dataset is located at `data/train.json`. **We can also pass a `validation` argument if we have a file containing validation data.**

We tell the function we are using json data, and pass in our fields dictionary defined previously.

In [36]:
train_data, test_data = data.TabularDataset.splits(
            path = 'data',
            train="train.json", 
            test="test.json", 
            format="json", 
            fields=fields
)

We can then view an example to make sure it has worked correctly.
**Notice how the field names (q and s) match up with what was defined in the fields dictionary.**

In [37]:
print(vars(train_data[0]))

{'q': ['you', 'must', 'own', 'everything', 'in', 'your', 'world.', 'there', 'is', 'no', 'one', 'else', 'to', 'blame.'], 's': 1}


We can now use `train_data`, `test_data` and `valid_data` (if available) to build a vocabulary and create iterators, as in the previous notebooks. The attributes of each batch are accessed with `batch.s` and `batch.q`.


### Reading CSV/TSV
`csv` and `tsv` are very similar, except `csv` has elements separated by commas and `tsv` by tabs.

Using the same example above, our tsv data will be in the form of:

```tsv
name	quote	score
Jocko	You must own everything in your world. There is no one else to blame.	1
Bruce Lee	Do not pray for an easy life, pray for the strength to endure a difficult one.	1
Potato guy	Stand tall, and rice like a potato!	0
```

That is, on each row the elements are separated by tabs and we have one example per row. The first row is usually a header (i.e. the name of each of the columns), but sometimes there is no header.

**You cannot have lists within tsv or csv data.**

The way the fields are defined is a bit different from json. We now use a list of tuples, one per column. The first element of each tuple becomes the batch object's attribute name, and the second element is the Field name.

Unlike with json data, the tuples have to be in the same order as the columns in the tsv data. Because of this, a tuple of `None`s must be used for any column you want to skip.

However, if you only wanted to use the name and quote columns, you could just use two tuples, as they are the first two columns.

We change our TabularDataset to read the correct `.tsv` or `.csv` files, and set the `format` argument to `'tsv'` or `'csv'` accordingly.

If your data has a header, which ours does, it must be skipped by passing `skip_header = True`. If not, TorchText will think the header is an example. By default, `skip_header` is `False`.
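
The positional mapping and the effect of skipping the header can be illustrated with the standard `csv` module. This is a rough sketch under simplifying assumptions, not how TorchText is implemented; `read_rows` and `column_names` are hypothetical names, and plain attribute names stand in for the `(name, Field)` tuples.

```python
import csv
import io

# In-memory stand-in for a tsv file with a header row.
tsv_text = (
    "name\tquote\tscore\n"
    "Jocko\tYou must own everything in your world.\t1\n"
    "Potato guy\tStand tall, and rice like a potato!\t0\n"
)

# One entry per column, in column order; None means "skip this column".
column_names = [None, "q", "s"]

def read_rows(text, column_names, skip_header=False):
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    if skip_header:
        next(reader)  # discard the header so it is not treated as an example
    rows = []
    for row in reader:
        rows.append({name: value for name, value in zip(column_names, row) if name is not None})
    return rows

with_skip = read_rows(tsv_text, column_names, skip_header=True)
without_skip = read_rows(tsv_text, column_names, skip_header=False)
print(with_skip[0])
print(without_skip[0])  # the header row shows up as a bogus example
```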

In [42]:
fields = [(None, None), ('q', QOUTE) , ('s', LABEL)]

In [44]:
train_data,  test_data = data.TabularDataset.splits(
                                        path = 'data',
                                        train = 'train.csv',
                                        test = 'test.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

### If you decide to specify the fields as a dictionary like before, you can do it as follows:

```python
fields = {
    'quote': ('q' , QOUTE),
    'score': ('s', LABEL)
}
train_data,  test_data = data.TabularDataset.splits(
                                        path = 'data',
                                        train = 'train.csv',
                                        test = 'test.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = False  # must be False: the header row is used to match the dictionary keys to columns
)
```

In [46]:
print(vars(train_data[0]))

{'q': ['you', 'must', 'own', 'everything', 'in', 'your', 'world.', 'there', 'is', 'no', 'one', 'else', 'to', 'blame.'], 's': '1'}


### Why `JSON` over `CSV/TSV`?
1. Your csv or tsv data cannot store lists. This means the data cannot be tokenized ahead of time, so every time you run a Python script that reads this data via TorchText, it has to be tokenized again. Using advanced tokenizers, such as the spaCy tokenizer, takes a non-negligible amount of time. Thus, it is better to tokenize your datasets once and store them in the json lines format.

2. If tabs appear in your tsv data, or commas appear in your csv data, TorchText will think they are delimiters between columns. This will cause your data to be parsed incorrectly. Worst of all, TorchText will not alert you to this, as it cannot tell the difference between a tab/comma in a field and a tab/comma as a delimiter. As json data is essentially a dictionary, you access the data within the fields via its key, so you do not have to worry about "surprise" delimiters.
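
The second point can be seen in a few lines of plain Python. The sketch below splits a tsv line naively on tabs, which is an assumption made for illustration (a real parser may cope with fields that were written with quoting), and compares it with a json round trip.

```python
import json

# A quote that happens to contain a tab character.
quote = "Stand tall,\tand rice like a potato!"

# Naive tsv line: the embedded tab looks exactly like a column delimiter.
tsv_line = "\t".join(["Potato guy", quote, "0"])
columns = tsv_line.split("\t")
print(len(columns))  # one column more than the three we wrote

# json line: the tab is escaped inside the string and survives a round trip.
json_line = json.dumps({"name": "Potato guy", "quote": quote, "score": 0})
restored = json.loads(json_line)
print(restored["quote"] == quote)
```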

### Building the Vocabularies.

In [47]:
QOUTE.build_vocab(train_data)
LABEL.build_vocab(train_data)

In [51]:
print(QOUTE.vocab.itos)

['<unk>', '<pad>', 'a', 'for', 'pray', 'to', 'an', 'and', 'blame.', 'difficult', 'do', 'easy', 'else', 'endure', 'everything', 'in', 'is', 'life,', 'like', 'must', 'no', 'not', 'one', 'one.', 'own', 'potato!', 'rice', 'stand', 'strength', 'tall,', 'the', 'there', 'world.', 'you', 'your']


In [54]:
print(LABEL.vocab.stoi)
print(LABEL.vocab.itos)

defaultdict(None, {'1': 0, '0': 1})
['1', '0']


In [58]:
QOUTE.vocab.freqs.most_common(2)

[('to', 2), ('pray', 2)]
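
To see where these indices come from, the following pure-Python sketch rebuilds a comparable vocabulary from the three training quotes shown earlier. It assumes the ordering is special tokens first, then tokens by descending frequency with ties broken alphabetically; that assumption matches the output above, but the code is an illustration and not TorchText's implementation.

```python
from collections import Counter

quotes = [
    "You must own everything in your world. There is no one else to blame.",
    "Do not pray for an easy life, pray for the strength to endure a difficult one.",
    "Stand tall, and rice like a potato!",
]

# Lowercased whitespace tokens, mirroring the QOUTE field settings.
tokenized = [quote.lower().split() for quote in quotes]
freqs = Counter(token for tokens in tokenized for token in tokens)

# Special tokens first, then descending frequency, ties broken alphabetically.
specials = ["<unk>", "<pad>"]
itos = specials + sorted(freqs, key=lambda token: (-freqs[token], token))
stoi = {token: index for index, token in enumerate(itos)}

print(itos[:6])
print([stoi[token] for token in tokenized[0]])
```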

### Iterating over a dataset using the `BucketIterator`.

* Then, we can create the iterators after defining our `batch size` and `device`.

* By default, the `train` data is shuffled each epoch, but the `validation/test` data is sorted. However, TorchText doesn't know what to sort our data by, and it will throw an error if we don't tell it.

There are two ways to handle this: you can either tell the iterator not to sort the `validation/test` data by passing `sort = False`, or you can tell it how to sort the data by passing a `sort_key`. **A sort key is a function that returns the key on which to sort the data.** For example:
```py
lambda x: x.q 
```
will sort the examples by their `q` attribute, i.e. their quote. Ideally, you want to use a sort key, as the `BucketIterator` can then sort your examples and minimize the amount of padding within each batch.
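
Why sorting helps with padding can be shown with a small pure-Python sketch. It assumes examples are sorted by length (`len(x)`), which is one common choice of sort key rather than the `lambda x: x.q` key shown here, and it only counts pad tokens; `total_padding` is a hypothetical helper, not a TorchText function.

```python
def total_padding(examples, batch_size):
    """Count pad tokens needed when each batch is padded to its longest example."""
    padding = 0
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        longest = max(len(tokens) for tokens in batch)
        padding += sum(longest - len(tokens) for tokens in batch)
    return padding

# Hypothetical tokenized examples of lengths 14, 2, 16 and 3.
examples = [["tok"] * n for n in (14, 2, 16, 3)]

unsorted_padding = total_padding(examples, batch_size=2)
sorted_padding = total_padding(sorted(examples, key=lambda x: len(x)), batch_size=2)
print(unsorted_padding, sorted_padding)
```

Grouping examples of similar length into the same batch means each batch is padded to a length close to that of its members, so fewer pad tokens are needed.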

We can then iterate over our iterator to get batches of data. **Note how by default TorchText has the batch dimension second.**

In [60]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

BATCH_SIZE = 1

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    device = DEVICE,
    batch_size = BATCH_SIZE,
    sort_key = lambda x: x.q,
)

cpu


### Train Data.

In [62]:
for batch in train_iterator:
    print(batch, batch.q)


[torchtext.legacy.data.batch.Batch of size 1]
	[.q]:[torch.LongTensor of size 14x1]
	[.s]:[torch.FloatTensor of size 1] tensor([[33],
        [19],
        [24],
        [14],
        [15],
        [34],
        [32],
        [31],
        [16],
        [20],
        [22],
        [12],
        [ 5],
        [ 8]])

[torchtext.legacy.data.batch.Batch of size 1]
	[.q]:[torch.LongTensor of size 16x1]
	[.s]:[torch.FloatTensor of size 1] tensor([[10],
        [21],
        [ 4],
        [ 3],
        [ 6],
        [11],
        [17],
        [ 4],
        [ 3],
        [30],
        [28],
        [ 5],
        [13],
        [ 2],
        [ 9],
        [23]])

[torchtext.legacy.data.batch.Batch of size 1]
	[.q]:[torch.LongTensor of size 7x1]
	[.s]:[torch.FloatTensor of size 1] tensor([[27],
        [29],
        [ 7],
        [26],
        [18],
        [ 2],
        [25]])


### Test Data

In [63]:
# Note: this loop iterates over train_iterator; use test_iterator to see the test batches.
for batch in train_iterator:
    print(batch.q)

tensor([[27],
        [29],
        [ 7],
        [26],
        [18],
        [ 2],
        [25]])
tensor([[10],
        [21],
        [ 4],
        [ 3],
        [ 6],
        [11],
        [17],
        [ 4],
        [ 3],
        [30],
        [28],
        [ 5],
        [13],
        [ 2],
        [ 9],
        [23]])
tensor([[33],
        [19],
        [24],
        [14],
        [15],
        [34],
        [32],
        [31],
        [16],
        [20],
        [22],
        [12],
        [ 5],
        [ 8]])


### That's how we can load our own dataset using `TorchText`

### Credits.
* [bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/A%20-%20Using%20TorchText%20with%20Your%20Own%20Datasets.ipynb)