### Objective:
- Using a simple text classification task on the AG_NEWS sample dataset, we will learn the basic steps of text processing.

### About Dataset:
- The main objective of the AG_NEWS dataset is to classify news headlines into one of 4 categories: ***World, Sports, Business and Sci/Tech***.
- This dataset is available through PyTorch's torchtext module.

#### Importing libraries

In [1]:
import torch
import torchtext
import os
import collections
os.makedirs('./data',exist_ok=True)



In [2]:
train_dataset,test_dataset = torchtext.datasets.AG_NEWS(root='./data')
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

#### train_dataset and test_dataset are iterators that return pairs of label and text

In [3]:
train_dataset,test_dataset

(<torchtext.data.datasets_utils._RawTextIterableDataset at 0x21d38c55a00>,
 <torchtext.data.datasets_utils._RawTextIterableDataset at 0x21d38ca4700>)

#### Display the first 3 news items

In [4]:
for index, (label, text) in zip(range(3), train_dataset):
    # AG_NEWS labels run from 1 to 4, so subtract 1 to index into `classes`
    print(f"<{classes[label - 1]}> <-> {text}\n")

<Business> <-> Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.

<Business> <-> Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.

<Business> <-> Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.



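#### Converting the train_dataset and test_dataset iterators to lists
- Note: the display loop above has already drawn three items from the train iterator, so the resulting list may start from the item that follows them rather than from the first headline.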

In [5]:
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)

#### The first step is to convert text into tokens (tokenization).

#### `Tokenization`:
- The process of converting text into a sequence of tokens is called tokenization.

In [6]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

In [7]:
ex_sentence = train_dataset[0][1]
tokens = tokenizer(ex_sentence)
print(f'\nToken list:\n{tokens}')


Token list:
['iraq', 'halts', 'oil', 'exports', 'from', 'main', 'southern', 'pipeline', '(', 'reuters', ')', 'reuters', '-', 'authorities', 'have', 'halted', 'oil', 'export\\flows', 'from', 'the', 'main', 'pipeline', 'in', 'southern', 'iraq', 'after\\intelligence', 'showed', 'a', 'rebel', 'militia', 'could', 'strike\\infrastructure', ',', 'an', 'oil', 'official', 'said', 'on', 'saturday', '.']
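
For intuition, the lowercasing and punctuation splitting visible in the token list can be roughly imitated with a regular expression. The sketch below is only an approximation and is not torchtext's `basic_english` implementation; `simple_tokenize` is a hypothetical helper name, and it will not match `basic_english` in every case (for example, it splits on backslashes).

```python
import re

def simple_tokenize(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Oil prices rise (Reuters), analysts say.")
print(tokens)
```

Each punctuation mark becomes its own token, which is why `(`, `)` and `.` appear as separate entries in the token list above.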


#### The next step in converting text to numbers is to build a vocabulary, for which we will use a Counter object
#### `Vectorization`:
- The process of converting each token into a number, so that text can be represented as tensors and fed to a neural network, is called vectorization.

In [8]:
counter = collections.Counter()
for (label,line) in train_dataset:
    # Count the occurrences of each token across the entire training dataset.
    counter.update(tokenizer(line))

#### Create a Vocab object that would help us deal with vectorization

In [9]:
vocab = torchtext.vocab.Vocab(counter, min_freq=1)

#### Using the vocabulary object, we can easily encode our tokenized string into a sequence of numbers
`vocab.stoi` converts a string token into its integer index (the name stoi stands for "string to integer").

In [10]:
vocab_size = len(vocab)
print("Vocab Size", vocab_size)
def encode(x):
    return [vocab.stoi[s] for s in tokenizer(x)]
vector = encode(ex_sentence)
print(vector)

Vocab Size 95809
[71, 7377, 59, 1811, 30, 906, 538, 2847, 14, 28, 15, 28, 16, 839, 40, 4979, 59, 68867, 30, 3, 906, 2847, 8, 538, 71, 58871, 704, 6, 913, 2521, 94, 89166, 4, 31, 59, 294, 27, 11, 115, 2]
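
To make the string-to-index mapping concrete, here is a minimal pure-Python sketch of the same idea: count tokens, assign each one an integer, and map anything unseen to a reserved unknown index. This is not torchtext's implementation; `build_vocab`, `encode_tokens`, the `<unk>` placeholder and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, unk_token="<unk>"):
    """Build token->index (stoi) and index->token (itos) mappings."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    # Order tokens by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Index 0 is reserved for tokens that are not in the vocabulary.
    itos = [unk_token] + [tok for tok, c in ordered if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode_tokens(tokens, stoi, unk_token="<unk>"):
    """Map each token to its index, falling back to the unknown index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

# Toy corpus (hypothetical), already tokenized.
corpus = [["oil", "prices", "rise"], ["oil", "exports", "halted"]]
stoi, itos = build_vocab(corpus)

ids = encode_tokens(["oil", "prices", "fall"], stoi)
decoded = [itos[i] for i in ids]
print(ids)      # [1, 4, 0]
print(decoded)  # ['oil', 'prices', '<unk>']
```

The word `fall` never appeared in the toy corpus, so it collapses to the unknown index. This is the out-of-vocabulary problem in miniature.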


#### Limitations of word tokenization:
- 1. Ambiguity
- 2. Compound words
- 3. Punctuation - Tokenizing words based solely on whitespace or punctuation marks can also be problematic. 
- 4. Out-of-vocabulary words
- 5. Languages without whitespace

#### To address some limitations of word tokenization we can use N-gram representations
- N-grams are contiguous sequences of N tokens from a text (for example, bigrams for N=2 and trigrams for N=3).
- N-gram representations are a commonly used technique in NLP for capturing some local context, such as multi-word expressions, that single tokens miss.
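
As a small illustration of the definition above, the following sketch builds n-grams with plain Python slicing. The helper name `ngrams` and the example tokens are hypothetical. Note that torchtext's `ngrams_iterator` also yields the individual tokens in addition to the n-grams, whereas this sketch returns only sequences of exactly `n` tokens.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined by a space."""
    # There are len(tokens) - n + 1 windows; range() is empty if n is too large.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["oil", "prices", "rise", "again"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)   # ['oil prices', 'prices rise', 'rise again']
print(trigrams)  # ['oil prices rise', 'prices rise again']
```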

In [11]:
from torchtext.data.utils import ngrams_iterator

In [12]:
bi_gram_counter = collections.Counter()
for (label,line) in train_dataset:
    bi_gram_counter.update(ngrams_iterator(tokenizer(line), ngrams=2))
bi_gram_vocab = torchtext.vocab.Vocab(bi_gram_counter, min_freq=2)

#### Limitations of N-gram representations:
- 1. Fixed window size
- 2. Sparsity
- 3. Limited context 
- 4. Fixed vocabulary
- 5. Curse of dimensionality

In practice, the n-gram vocabulary is still too large to represent tokens as one-hot vectors, so we need to combine this representation with a dimensionality reduction technique such as *embeddings*.
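
The size difference can be sketched with a toy example. A one-hot vector has one entry per vocabulary item, while an embedding is a short dense vector looked up by token index. The values below (`vocab_size`, `embedding_dim`, the random table) are assumptions chosen only for illustration; in a real model the table would be learned, for example with `torch.nn.Embedding`.

```python
import random

vocab_size = 6      # toy value; the vocabulary built above has 95809 entries
embedding_dim = 3   # hypothetical embedding size

def one_hot(index, size):
    """Return a vector of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Random stand-in for a learned embedding table: one row per vocabulary item.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def one_hot_times_table(index):
    """Multiply the one-hot row vector by the embedding table."""
    oh = one_hot(index, vocab_size)
    return [sum(oh[r] * embedding_table[r][c] for r in range(vocab_size))
            for c in range(embedding_dim)]

token_id = 4
sparse = one_hot(token_id, vocab_size)      # length equals vocab_size
dense = embedding_table[token_id]           # length equals embedding_dim
via_matmul = one_hot_times_table(token_id)  # same values as the row lookup
print(len(sparse), len(dense))
```

Looking up row `token_id` gives the same result as multiplying the one-hot vector by the table, which is why an embedding layer can skip building the large one-hot vector entirely.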