### TorchText
* The torchtext package consists of data processing utilities and popular datasets for natural language.

In [29]:
import spacy, os
from torchtext.legacy.data import Field, TabularDataset, BucketIterator
import pandas as pd


1. **Field** - Defines a datatype together with instructions for converting to Tensor.
```python
torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
```
* [Docs](https://text-docs.readthedocs.io/en/latest/data.html#fields)
2. **TabularDataset** - Defines a Dataset of columns stored in CSV, TSV, or JSON format.
```python
torchtext.data.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
```
3. **BucketIterator** - Defines an iterator that batches examples of similar lengths together. Minimizes amount of padding needed while producing freshly shuffled batches for each new epoch. See pool for the bucketing procedure used.
```python
torchtext.data.BucketIterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
```
* [Docs](https://text-docs.readthedocs.io/en/latest/data.html#iterators)
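
> To make the idea behind `BucketIterator` concrete, here is a minimal plain-Python sketch of length-based bucketing. It is not torchtext's implementation: the helper names `bucket_batches` and `padding_needed`, the `pool_factor` parameter and the toy sentences are all assumptions made for illustration.

```python
import random


def bucket_batches(examples, batch_size, pool_factor=100, seed=0):
    """Group token lists of similar length into batches (hypothetical helper)."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    batches = []
    pool_size = batch_size * pool_factor
    for start in range(0, len(examples), pool_size):
        # Sort only inside a pool so that batches still vary between epochs.
        pool = sorted(examples[start:start + pool_size], key=len)
        batches.extend(pool[i:i + batch_size] for i in range(0, len(pool), batch_size))
    # Shuffle the finished batches so short ones do not always come first.
    rng.shuffle(batches)
    return batches


def padding_needed(batches):
    """Count the pad tokens required to make every batch rectangular."""
    return sum(max(len(ex) for ex in batch) - len(ex) for batch in batches for ex in batch)


# Toy "sentences" of different lengths (illustrative data only).
sentences = [["w"] * n for n in (2, 9, 3, 8, 2, 10, 3, 9)]
naive = [sentences[i:i + 2] for i in range(0, len(sentences), 2)]
bucketed = bucket_batches(sentences, batch_size=2)
print(padding_needed(naive), padding_needed(bucketed))  # 26 2
```

> In this toy case, batching in file order needs 26 pad tokens, while batching examples of similar length needs only 2.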

In [19]:
spacy_english = spacy.load('en_core_web_sm')
def tokenize(sent):
    return [tok.text for tok in spacy_english.tokenizer(sent)]

tokenizer2 = lambda x: x.split()
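
> The two tokenizers above treat punctuation differently. The sketch below shows the difference without needing the spaCy model: `regex_tokenize` is a hypothetical stand-in that only approximates what a real tokenizer does by splitting punctuation from words, and the sample sentence is taken from the data used later.

```python
import re


def whitespace_tokenize(sent):
    # Same behaviour as the lambda above: split on whitespace only.
    return sent.split()


def regex_tokenize(sent):
    # Rough stand-in for a real tokenizer: words and single
    # punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sent)


sample = "Stand tall, and rice like a potato!"
print(whitespace_tokenize(sample))
# ['Stand', 'tall,', 'and', 'rice', 'like', 'a', 'potato!']
print(regex_tokenize(sample))
# ['Stand', 'tall', ',', 'and', 'rice', 'like', 'a', 'potato', '!']
```

> With whitespace splitting, `tall,` and `potato!` become vocabulary entries of their own, which is usually not what we want when looking up pretrained word vectors.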

> The `data` directory contains the files for this text-processing example. We are going to use one of them here.




In [21]:
df = pd.read_csv('data/train.csv')
df

| Unnamed: 0 | name | quote | score |
|---|---|---|---|
| 0 | Jocko | You must own everything in your world. There i... | 1 |
| 1 | Bruce Lee | Do not pray for an easy life, pray for the str... | 1 |
| 2 | Potato guy | Stand tall, and rice like a potato! | 0 |


> Here we are interested in the `quote` and `score` columns, so we are going to create a `Field` for each of them.

In [22]:
qoute = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True)
score = Field(sequential=False, use_vocab=False)
qoute, score

(<torchtext.legacy.data.field.Field at 0x1cb58b154c0>,
 <torchtext.legacy.data.field.Field at 0x1cb58b15fa0>)

In [35]:
fields = {
    "quote": ("quote", qoute),
    "score": ("score", score)
}
fields

{'quote': ('quote', <torchtext.legacy.data.field.Field at 0x1cb58b154c0>),
 'score': ('score', <torchtext.legacy.data.field.Field at 0x1cb58b15fa0>)}

> Creating the train and test datasets using `TabularDataset`.

In [36]:
train, test = TabularDataset.splits(path = 'data',
                             train="train.csv", 
                             test="test.csv", 
                             format="csv", 
                             fields=fields
                             )

> Note that `json` and `tsv` files can be loaded the same way. When a `validation` file is also passed, `splits` returns three datasets, for example:

##### Loading `json`
```python
# splits returns the datasets in the order (train, validation, test)
train, validation, test = TabularDataset.splits(
    'data',
    train="train.json",
    validation="validation.json",
    test="test.json",
    format="json",
    fields=fields,
)
```

##### Loading `tsv`
```python
# splits returns the datasets in the order (train, validation, test)
train, validation, test = TabularDataset.splits(
    'data',
    train="train.tsv",
    validation="validation.tsv",
    test="test.tsv",
    format="tsv",
    fields=fields,
)
```
##### Loading `csv`
```python
# splits returns the datasets in the order (train, validation, test)
train, validation, test = TabularDataset.splits(
    'data',
    train="train.csv",
    validation="validation.csv",
    test="test.csv",
    format="csv",
    fields=fields,
)
```

In [40]:
train[0].quote

['you',
 'must',
 'own',
 'everything',
 'in',
 'your',
 'world',
 '.',
 'there',
 'is',
 'no',
 'one',
 'else',
 'to',
 'blame',
 '.']

In [42]:
qoute.build_vocab(train, max_size=10000, min_freq=1, vectors="glove.6B.100d")

.vector_cache\glove.6B.zip: 862MB [34:19, 419kB/s]                                                                     
100%|███████████████████████████████████████████████████████████████████████▉| 399999/400000 [00:53<00:00, 7513.45it/s]
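
> To see roughly what `build_vocab` does with `max_size` and `min_freq`, here is a minimal plain-Python sketch. It is an approximation under stated assumptions, not torchtext's code: the helper names `build_vocab` and `numericalize` and the two-sentence corpus are made up for illustration, and the special tokens `<unk>` and `<pad>` are assumed to take indices 0 and 1.

```python
from collections import Counter


def build_vocab(token_lists, max_size=None, min_freq=1, specials=("<unk>", "<pad>")):
    """Build index<->token mappings from tokenized text (hypothetical helper)."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials)
    for token, freq in ordered:
        if token in specials:
            continue
        # max_size counts regular tokens only, not the special tokens.
        if freq < min_freq or (max_size is not None and len(itos) >= max_size + len(specials)):
            break
        itos.append(token)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    # Tokens that are not in the vocabulary map to the <unk> index.
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


corpus = [
    ["you", "must", "own", "everything", "."],
    ["there", "is", "no", "one", "else", "to", "blame", "."],
]
itos, stoi = build_vocab(corpus, max_size=10000, min_freq=1)
print(itos[:3])                                         # ['<unk>', '<pad>', '.']
print(numericalize(["you", "must", "win", "."], stoi))  # [13, 7, 0, 2]
```

> The word `win` never appears in the toy corpus, so it is mapped to index 0, the `<unk>` token.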


In [44]:
train_set, test_set = BucketIterator.splits(
    (train, test), batch_size=2
)

In [47]:
for X in train_set:
    pass

In [57]:
X


[torchtext.legacy.data.batch.Batch of size 2]
	[.quote]:[torch.LongTensor of size 16x2]
	[.score]:[torch.LongTensor of size 2]
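
> The `.quote` tensor has size `16x2`, that is (sequence length, batch size), because the `Field` was created with the default `batch_first=False`. The sketch below mimics that layout with plain lists: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=1` assumes that `<pad>` sits at index 1 in the vocabulary.

```python
def pad_batch(token_id_lists, pad_id=1):
    """Pad to the longest example, then lay the batch out as (seq_len, batch)."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    # Transpose: each row is one time step, each column is one example.
    return [list(step) for step in zip(*padded)]


batch = pad_batch([[13, 7, 10, 2], [11, 6]])
print(len(batch), len(batch[0]))  # 4 2
print(batch)                      # [[13, 11], [7, 6], [10, 1], [2, 1]]
```

> Reading down a column gives one padded example, so the first dimension is time and the second dimension is the batch.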

> Creating an `RNN_LSTM` model

In [63]:
len(qoute.vocab)

37

In [64]:
import torch
from torch import nn
from torch.nn import functional as F

In [85]:
input_size = len(qoute.vocab)
hidden_size = 512
num_layers = 2
embedding_size = 100
learning_rate = 0.005
num_epochs = 10
class RNN_LSTM(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size, num_layers):
        super(RNN_LSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.emb = nn.Embedding(input_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers)
        self.fc = nn.Linear(hidden_size, 1)
        
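    def forward(self, x):
        # x has shape (seq_len, batch) because the Field uses batch_first=False
        h_0 = torch.zeros(self.num_layers, x.size(1), self.hidden_size)
        c_0 = torch.zeros(self.num_layers, x.size(1), self.hidden_size)
        
        embedded = self.emb(x)  # (seq_len, batch, embed_size)
        output, _ = self.lstm(embedded, (h_0, c_0))  # (seq_len, batch, hidden_size)
        # NOTE: this indexes the batch dimension, not the time dimension,
        # so the result is (seq_len, hidden_size); see the error further down.
        output = output[:, -1, :]
        return self.fc(output)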
    
net = RNN_LSTM(input_size,embedding_size, hidden_size, num_layers)
net

RNN_LSTM(
  (emb): Embedding(37, 100)
  (lstm): LSTM(100, 512, num_layers=2)
  (fc): Linear(in_features=512, out_features=1, bias=True)
)

In [86]:
a = torch.zeros(32, 10)
a.size(1)

10

In [87]:
pretrained_embeddings = qoute.vocab.vectors
net.emb.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.3398,  0.2094,  0.4635,  ..., -0.2339,  0.4730, -0.0288],
        ...,
        [ 0.4918,  1.1164,  1.1424,  ..., -0.5088,  0.6256,  0.4392],
        [-0.4989,  0.7660,  0.8975,  ..., -0.4118,  0.4054,  0.7850],
        [-0.5718,  0.0463,  0.8673,  ..., -0.3566,  0.9293,  0.8995]])

In [88]:
pretrained_embeddings[0]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])

In [89]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
# Train Network
for epoch in range(num_epochs):
    for batch_idx, batch in enumerate(train_set):
        # Get the inputs and targets from the batch (everything stays on the CPU here)
        data = batch.quote
        targets = batch.score

        # forward
        scores = net(data)
        loss = criterion(scores.squeeze(1), targets.type_as(scores))

        # backward
        optimizer.zero_grad()
        loss.backward()

        # gradient descent
        optimizer.step()
    print("loss", loss)

ValueError: Target size (torch.Size([2])) must be the same as input size (torch.Size([18]))
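
> The sizes in this error point at the indexing in `forward`. With the default `batch_first=False`, `nn.LSTM` returns `output` with shape (seq_len, batch, hidden_size), so `output[:, -1, :]` picks the last example in the batch at every time step and produces one score per time step (18 here) instead of one per example (2). Taking the last time step with `output[-1, :, :]` would give one score per example. The sketch below reproduces both shapes with nested lists as a stand-in for tensors; the small `hidden_size` and the fill values are assumptions for readability.

```python
seq_len, batch_size, hidden_size = 18, 2, 4  # hidden_size shrunk for readability

# Stand-in for the LSTM output: output[t][b] is the hidden vector for
# example b at time step t, i.e. shape (seq_len, batch, hidden_size).
output = [[[float(t)] * hidden_size for _ in range(batch_size)]
          for t in range(seq_len)]

# Equivalent of output[:, -1, :]: the last *example* at every time step.
last_example_all_steps = [step[-1] for step in output]

# Equivalent of output[-1, :, :]: every example at the last *time step*.
all_examples_last_step = output[-1]

print(len(last_example_all_steps))  # 18, matches the input size in the error
print(len(all_examples_last_step))  # 2, matches the target size
```

> Only the second slice has one row per example, which is what `BCEWithLogitsLoss` needs to line up with the two targets in the batch.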