In [0]:
!pip install torchtext==0.4

Collecting torchtext==0.4
  Downloading https://files.pythonhosted.org/packages/43/94/929d6bd236a4fb5c435982a7eb9730b78dcd8659acf328fd2ef9de85f483/torchtext-0.4.0-py3-none-any.whl (53kB)
Installing collected packages: torchtext
  Found existing installation: torchtext 0.3.1
    Uninstalling torchtext-0.3.1:
      Successfully uninstalled torchtext-0.3.1
Successfully installed torchtext-0.4.0


# Project 3: Text Classification in PyTorch

## Instructions

* All the tasks that you need to complete in this project are either coding tasks (mentioned inside the code cells of the notebook with `#TODO` notations) or theoretical questions that you need to answer by editing the markdown question cells.
* **Please make sure you read the [Notes](#Important-Notes) section carefully before you start the project.**

## Introduction
This project deals with neural text classification using PyTorch. Text classification is the process of assigning tags or categories to text according to its content. It's one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Text classification algorithms are at the heart of a variety of software systems that process text data at scale. Email software uses text classification to determine whether incoming mail is sent to the inbox or filtered into the spam folder. Discussion forums use text classification to determine whether comments should be flagged as inappropriate.

**_Example:_** A simple example of text classification is spam classification. Consider all the emails that you would receive in your personal inbox if the email service provider did not have a spam filter. Because of the spam filter, spam emails get redirected to the Spam folder, while you receive only non-spam ("_ham_") emails in your inbox.

![](http://blog.yhat.com/static/img/spam-filter.png)

## Task
Here, we want you to focus on a specific type of text classification task, "Document Classification into Topics". This means classifying text data, or even large documents, into discrete topics or genres of interest.


![](https://miro.medium.com/max/700/1*YWEqFeKKKzDiNWy5UfrTsg.png)

In this project, you will classify text data into discrete topics or genres. You are given a collection of text samples, each with a label attached. Your task is to learn, from the words in each document, why its contents were given that label. You need to create a neural classifier and train it on this labeled data. Once trained, the classifier should be able to predict the label for any new document or text sample that is fed to it. The labels need not have any meaning to us, nor necessarily to you.

## Data
There are various datasets that we can use for this purpose. This tutorial shows how to use the text classification datasets in the PyTorch library ``torchtext``. There are different datasets in this library like `AG_NEWS`, `SogouNews`, `DBpedia`, and others. This project will deal with training a supervised learning algorithm for classification using one of these datasets. In task 1 of this project, we will work with the `AG_NEWS` dataset.

## Load Data

A bag-of-**ngrams** feature is used to capture some partial information about the local word order. In practice, bi-grams or tri-grams are used because groups of words carry more information than single words alone.

**Example:**

*"I love Neural Networks"*
* **Bi-grams:** "I love", "love Neural", "Neural Networks"
* **Tri-grams:** "I love Neural", "love Neural Networks"

In the code below, we load the `AG_NEWS` dataset from the ``torchtext.datasets.text_classification`` module with bi-gram features. The dataset loader supports the `ngrams` parameter. By setting `ngrams` to 2, each example text in the dataset becomes a list of single words plus bi-gram strings.
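As a rough illustration of what "single words plus bi-grams" looks like, the sketch below builds that feature list for the example sentence. The helper `word_ngrams` is a hypothetical name written for this note, not the torchtext implementation, and the lowercase whitespace split only stands in for a real tokenizer.

```python
def word_ngrams(tokens, max_n):
    """Return the single tokens followed by every n-gram of up to max_n words."""
    features = list(tokens)  # single words come first
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = "I love Neural Networks".lower().split()
print(word_ngrams(tokens, 2))
# ['i', 'love', 'neural', 'networks', 'i love', 'love neural', 'neural networks']
```

With `max_n` set to 3, the two tri-grams from the example would be appended after the bi-grams.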

In [0]:
"""
Load the AG_NEWS dataset in bi-gram features format.
"""

import torch
import torchtext
from torchtext.datasets import text_classification
import os

NGRAMS = 2

if not os.path.isdir('./.data'):
    os.mkdir('./.data')

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)

BATCH_SIZE = 16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

120000lines [00:06, 17539.50lines/s]
120000lines [00:14, 8318.45lines/s]
7600lines [00:00, 8879.91lines/s]


## Model

Our first simple model is composed of an [`EmbeddingBag`](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer and a linear layer.

``EmbeddingBag`` computes the mean value of a “bag” of embeddings. The text entries here have different lengths, but ``EmbeddingBag`` requires no padding because the starting position of each entry is stored in `offsets`. Additionally, since ``EmbeddingBag`` accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.
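To make the pooling concrete, here is a minimal sketch with toy sizes (10 vocabulary entries, 3 embedding dimensions, made-up token ids). It checks that the output of `nn.EmbeddingBag` in `"mean"` mode matches the average of the embedding rows of each entry, computed by hand. The sizes and ids are assumptions chosen for illustration, not values from this project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="mean")

# Two text entries of different lengths, concatenated into one flat tensor.
text = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 3])  # entry 0 is text[0:3], entry 1 is text[3:]

pooled = bag(text, offsets)  # one pooled row per entry

# The same result by hand: average the embedding rows of each entry.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:]].mean(dim=0),
])

print(pooled.shape)
print(torch.allclose(pooled, manual, atol=1e-6))
```

No padding is involved: the two entries have lengths 3 and 5, and only `offsets` tells the layer where each one starts.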

In [0]:
# TODO: Import the necessary libraries
import torch.nn as nn
import torch.nn.functional as F

# TODO: Create a class for your model (named text_classifier here).
class text_classifier(nn.Module):
    # TODO: Define the __init__() method with proper parameters
    # (vocabulary size, dimensions of the embeddings, number of classes)
    def __init__(self, vocab_size, dim_embed, num_class):
        super().__init__()
        # TODO: define the embedding layer
        self.embedding = nn.EmbeddingBag(vocab_size, dim_embed, sparse = True)
        # TODO: define the linear forward layer
        self.linear = nn.Linear(dim_embed, num_class)
        # TODO: Initialize weights
        self.init_weights()

    # TODO: Define a method to initialize weights.
    def init_weights(self):
        # The weights should be random in the range of -0.5 to 0.5.
        # You can initialize bias values as zero.
        self.embedding.weight.data.uniform_(-0.5, 0.5)
        self.linear.weight.data.uniform_(-0.5, 0.5)
        self.linear.bias.data.zero_()
    
    # TODO: Define the forward function.
    def forward(self, text, offsets):
        # Calculate the pooled embeddings, then return the output of the
        # linear layer applied to them.
        embeddings = self.embedding(text, offsets)
        return self.linear(embeddings)

## Check your data before you proceed!

Okay, so we know that we are using the `AG_NEWS` dataset in this project, but do you know what the data contains? What is the format of the data? How many classes are there in this dataset? We do not know yet. Let's find out!


## Question 1:
Create a new cell in this notebook and try to analyze the dataset that we loaded for you before. Report the following:
* Vocabulary size (VOCAB_SIZE)
* Number of classes (NUM_CLASS)
* Names of the classes


## Answer 1:

In [0]:
VOCAB_SIZE = len(train_dataset.get_vocab())
NUM_CLASS = len(train_dataset.get_labels())
NAME_CLASSES = {"World", "Sports", "Business", "Sci/Tec"}  # hard-coded AG_NEWS topic names; a set, so the print order may vary

In [0]:
print(VOCAB_SIZE)
print(NUM_CLASS)
print(NAME_CLASSES)

1308844
4
{'Sports', 'Sci/Tec', 'World', 'Business'}


Vocabulary size (VOCAB_SIZE) = 1308844

Number of classes (NUM_CLASS) = 4

Names of the classes = {'Sports', 'World', 'Business', 'Sci/Tec'}

## Create an instance for your model

Great! You have successfully completed a basic analysis of the data that you are going to work with. The vocabulary size is equal to the length of the vocab (including single words and ngrams). The number of classes is equal to the number of labels. Copy and paste the code statements you used in your analysis to complete the code below. Then, using these parameters, create an instance `model` of your text classifier class (defined above as `text_classifier`).

In [0]:
'''
Parameters and model instance creation.
'''

# TODO: Instantiate the Vocabulary size and the number of classes
# from the training dataset that we loaded for you.

# Hint: Remember that these are PyTorch datasets. So, there should be 
# readily available functions that you can use to save time. ;)

VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())

# TODO: Instantiate the model with the parameters you defined above. 
# Remember to allocate it to your 'device' variable.

model = text_classifier(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
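To get a feel for the size of this model, the parameter count can be worked out by hand. The sketch below assumes the architecture described earlier (one embedding table plus one linear layer with a bias) and uses the vocabulary size printed by the analysis cell.

```python
VOCAB_SIZE = 1308844  # value printed by the analysis cell
EMBED_DIM = 32
NUM_CLASS = 4

# One EMBED_DIM-sized vector per vocabulary entry (single words and bi-grams).
embedding_params = VOCAB_SIZE * EMBED_DIM

# Weight matrix of the linear layer plus one bias per class.
linear_params = EMBED_DIM * NUM_CLASS + NUM_CLASS

total_params = embedding_params + linear_params
print(f"{embedding_params:,} + {linear_params:,} = {total_params:,}")
# 41,883,008 + 132 = 41,883,140
```

Almost all of the parameters sit in the embedding table, which is driven by the large bi-gram vocabulary.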

## Generate batch

Since the text entries have different lengths, you need to create a custom function to generate data batches and offsets. This function should be passed to the ``collate_fn`` parameter of the PyTorch ``DataLoader``, which you will use to create the batches later on. The input to ``collate_fn`` is a list of `batch_size` dataset entries, and ``collate_fn`` packs them into a mini-batch. Make sure that ``collate_fn`` is declared as a top-level definition, so that the function is available in each worker process. This is why you need to define this custom function before you call `DataLoader()`.

The text entries in the original batch are packed into a list and concatenated into a single tensor, which is the input to ``EmbeddingBag``. The `offsets` tensor holds the starting index of each individual sequence within the concatenated text tensor. The `label` tensor stores the labels of the individual text entries.
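The relationship between entry lengths and offsets can be sketched in plain Python before moving to tensors. The helper name `bag_offsets` and the token ids are made up for this illustration.

```python
from itertools import accumulate


def bag_offsets(lengths):
    """Start index of each entry once all entries are concatenated."""
    # Entry i starts where entries 0..i-1 end, so prepend 0 and drop the last length.
    return list(accumulate([0] + lengths[:-1]))


entries = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # made-up token ids
flat = [token for entry in entries for token in entry]
offsets = bag_offsets([len(entry) for entry in entries])
print(offsets)  # [0, 3, 5]

# Slicing the flat list at the offsets recovers the original entries.
bounds = offsets + [len(flat)]
recovered = [flat[bounds[i]:bounds[i + 1]] for i in range(len(entries))]
print(recovered == entries)  # True
```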

Finish the function definition below. The function should take batch as an input parameter. Each entry in the batch contains a pair of values of the text and the corresponding label.

In [0]:
# TODO: Finish the function definition.

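def generate_batch(batch):
    # Each entry in the batch is a (label, text_tensor) pair.
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]

    # Entry i starts where entries 0..i-1 end:
    # cumulative sum of [0, len_0, ..., len_{n-2}].
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

    # Concatenate all entries into one flat tensor for EmbeddingBag.
    text = torch.cat(text)

    return text, offsets, label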

## Define the train function

Here, you need to define a function which you will use later on in the project to train your model. This is very similar to the training steps that you have encountered in previous coding assignment(s). The outline of the function is roughly as follows:

* load the data as batches
* iterate over the batches
* find the model output for a forward pass
* calculate the loss
* perform backpropagation on the loss (optimize it)
* find the training accuracy

In addition, you need to accumulate the total loss and the total number of correct predictions over all batches, and return the average loss and average accuracy.
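One detail is worth keeping in mind when reading the reported averages. If the loss function returns the mean over each batch (the default reduction of `torch.nn.CrossEntropyLoss`), then summing the per-batch values and dividing by the number of samples gives a figure roughly `BATCH_SIZE` times smaller than the per-sample mean loss. The sketch below uses made-up per-batch losses to show the difference; the numbers are not results from this project.

```python
# Made-up per-batch mean losses, for illustration only.
batch_mean_losses = [0.50, 0.40, 0.30, 0.20]
batch_size = 16
num_samples = batch_size * len(batch_mean_losses)  # 64, assuming every batch is full

total = sum(batch_mean_losses)

per_sample_total = total / num_samples               # sum of batch means / number of samples
per_batch_average = total / len(batch_mean_losses)   # sum of batch means / number of batches

print(round(per_sample_total, 6), round(per_batch_average, 6))
```

Either convention is fine for comparing epochs against each other, as long as the same one is used for training, validation and test.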

In [0]:
from torch.utils.data import DataLoader

def train(train_data):

    # Initial values of training loss and training accuracy
    
    train_loss = 0
    train_acc = 0
    total_acc = 0

    # TODO: Use the PyTorch DataLoader class to load the data 
    # into shuffled batches of appropriate sizes into the variable 'data'.
    # Remember, this is the place where you need to generate batches.
    data = DataLoader(train_data, batch_size = BATCH_SIZE, shuffle = True, collate_fn = generate_batch)
    
    
    for i, (text, offsets, cls) in enumerate(data):
        
        # TODO: What do you need to do in order to perform backprop on the optimizer?
        optimizer.zero_grad()
        
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        
        # TODO: Store the output of the model in variable 'output'
        output = model(text, offsets)
        
        # TODO: Define the 'loss' variable (with respect to 'output' and 'cls').
        # Also calculate the total loss in variable 'train_loss'
        loss = criterion(output, cls)
        train_loss += loss.item()
        
        # TODO: Perform the backward propagation on 'loss' and 
        # optimize it through the 'optimizer' step
        loss.backward()
        optimizer.step()
        
        # TODO: Calculate and store the total training accuracy
        # in the variable 'total_acc'.
        # Remember, you need to find the 
        train_acc = (output.argmax(1) == cls).sum()
        total_acc += train_acc.item()

    # TODO: Adjust the learning rate here using the scheduler step
    scheduler.step()
    

    return train_loss / len(train_data), total_acc / len(train_data)

## Define the test function

Using the framework of the `train()` function in the previous cell, try to figure out the structure of the test function below.

In [0]:
def test(test_data):
    
    # Initial values of test loss and test accuracy
    
    loss = 0
    acc = 0
    
    # TODO: Use DataLoader class to load the data
    # into non-shuffled batches of appropriate sizes.
    # Remember, you need to generate batches here too.
    data = DataLoader(test_data, batch_size = BATCH_SIZE, collate_fn = generate_batch)
    
    
    for text, offsets, cls in data:
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        
        # Hint: There is a 'hidden hint' here. Let's see if you can find it :)
        with torch.no_grad():  # evaluation only: no gradients are needed, which saves memory and time
            
            # TODO: Get the model output
            output = model(text, offsets)
            
            
            # TODO: Calculate and add the loss to find total 'loss'
            ls = criterion(output,cls)
            loss += ls.item()
            
            
            # TODO: Calculate the accuracy and store it in the 'acc' variable
            a = (output.argmax(1) == cls).sum()
            acc += a.item()
            

    return loss / len(test_data), acc / len(test_data)

## Split the dataset and run the model

The original `AG_NEWS` has no validation dataset. For this reason, you need to split the training dataset into training and validation sets with a proper split ratio. The `random_split()` function in `torch.utils.data` should be able to help you with this. We have already imported it for you. :)

* Consider the initial learning rate as 4.0, number of epochs as 5, training data ratio as 0.9.
* You need to define and use a proper loss function
* Define an Optimization algorithm (Suggestion: SGD)
* Define a scheduler function to adjust the learning rate through epochs (gamma parameter = 0.9).
(Hint: Look at the `StepLR` function)
* Monitor the loss and accuracy values for both training and validation data sets.
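The hyperparameters listed above can be worked out by hand before running anything. The sketch below assumes the learning rate is multiplied by `gamma` once after every epoch (a `StepLR` scheduler with a step size of 1) and that the training set has the 120000 samples reported when `AG_NEWS` was loaded.

```python
LEARNING_RATE = 4.0
GAMMA = 0.9
N_EPOCHS = 5
TRAIN_RATIO = 0.9
NUM_TRAIN_SAMPLES = 120000  # line count reported when AG_NEWS was loaded

# Learning rate in effect during each epoch, decayed by GAMMA after every epoch.
learning_rates = [LEARNING_RATE * GAMMA ** epoch for epoch in range(N_EPOCHS)]

# Sizes of the two subsets for a 90/10 split.
len_train = int(NUM_TRAIN_SAMPLES * TRAIN_RATIO)
len_valid = NUM_TRAIN_SAMPLES - len_train

print([round(lr, 4) for lr in learning_rates])  # [4.0, 3.6, 3.24, 2.916, 2.6244]
print(len_train, len_valid)                     # 108000 12000
```

Computing the validation size as the remainder guarantees that the two lengths add up to the full dataset, which `random_split()` requires.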

In [0]:
import time
from torch.utils.data.dataset import random_split

# TODO: Set the number of epochs and the learning rate to 
# their initial values here

N_EPOCHS = 5
LEARNING_RATE = 4.0
TRAIN_RATIO = 0.9

# TODO: Set the initial validation loss to positive infinity
init_valid_loss = float('inf')


# TODO: Use the appropriate loss function
criterion = torch.nn.CrossEntropyLoss().to(device)


# TODO: Use the appropriate optimization algorithm with parameters (Suggested: SGD)
optimizer = torch.optim.SGD(model.parameters(), lr = LEARNING_RATE)


# TODO: Use a scheduler function
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)


# TODO: Split the data into train and validation sets using random_split()
len_trainset = int(len(train_dataset) * TRAIN_RATIO)
len_validset = len(train_dataset) - len_trainset
train_set , valid_set = random_split(train_dataset, [len_trainset, len_validset])


# TODO: Finish the rest of the code below

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train(train_set)
    valid_loss, valid_acc = test(valid_set)

    secs = int(time.time() - start_time)
    mins = secs // 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0 minutes, 9 seconds
	Loss: 0.0261(train)	|	Acc: 84.8%(train)
	Loss: 0.0195(valid)	|	Acc: 89.3%(valid)
Epoch: 2  | time in 0 minutes, 8 seconds
	Loss: 0.0119(train)	|	Acc: 93.6%(train)
	Loss: 0.0178(valid)	|	Acc: 90.8%(valid)
Epoch: 3  | time in 0 minutes, 9 seconds
	Loss: 0.0069(train)	|	Acc: 96.3%(train)
	Loss: 0.0199(valid)	|	Acc: 90.1%(valid)
Epoch: 4  | time in 0 minutes, 9 seconds
	Loss: 0.0038(train)	|	Acc: 98.1%(train)
	Loss: 0.0200(valid)	|	Acc: 91.1%(valid)
Epoch: 5  | time in 0 minutes, 9 seconds
	Loss: 0.0022(train)	|	Acc: 99.0%(train)
	Loss: 0.0211(valid)	|	Acc: 91.1%(valid)


## Let's check the test loss and test accuracy

So you have trained your model and seen how well it performs on the training and validation datasets. Now, you need to check your model's performance against the test dataset. Using the test dataset as input, report the test loss and test accuracy scores of your model.

In [0]:
# TODO: Complete the code below to find
# the results (loss and accuracy) on the test data

print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.0236(test)	|	Acc: 90.6%(test)


In [0]:
# importing necessary libraries

import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

# labels for the AG_NEWS dataset

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        # vocab[token] maps each token or n-gram to its id in the vocabulary
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

# If you have done everything correctly in this task,
# then the output of this cell should be - "This is a 'Sports' news".

This is a 'Sports' news


# Congratulations! You just designed your first neural classifier!

And probably you have achieved a good accuracy score too. Great job!

## Question 2:
You just tested your model with a new sample text. Try to feed some more random examples of similar text (which you think are related to at least one of the four topics _"World", "Sports", "Business", "Sci/Tec"_ of our problem) to the model and see how your model reacts. Give at least 3 such examples (You are free to include more examples if you wish to).

## Answer 2:

In [0]:
#first example

ex_text_str = "Two of Australia's bush fires are likely to merge into a so-called 'mega blaze' on Friday evening, authorities have warned.\
The merger is expected at the border of New South Wales and Victoria and has been feared for days.\
Prime Minister Scott Morrison warned Friday would be 'a difficult day in the eastern states' amid forecasts of heat, strong winds and dry lightning.\
In South Australia, Kangaroo Island also faced an abrupt threat.\
A spokesman for the New South Wales (NSW) Rural Fire Service told the BBC the merger of two fires - both of which are out of control - was 'imminent' and expected at about 8pm (09:00 GMT).\
The two fires at Dunns Road and East Ournie Creek has firefighters bracing for a difficult night - and aircraft won't be able to operate after dark.\
More than 100 bushfires are burning in worst-hit NSW alone, but the danger is equally great in Victoria.\
Victoria's Country Fire Authority issued several emergency warnings on Friday, telling people to evacuate before it became too dangerous.\
In parts of both Victoria and NSW, authorities urged people to leave their homes 'to avoid tragedy'.\
Fires in NSW have destroyed about 1,000 homes since the New Year.\
Mr Morrison said that two ships remained off the coast of NSW ready to evacuate towns if needed."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

# then the output of this cell should be - "This is a 'World' news".

This is a 'World' news


In [0]:
#Second example

ex_text_str = "Demand for iPhones appears to be flourishing once again in China, a year after Apple had to warn investors that the Chinese market was facing a serious slow down.\
IPhone sales in China were up 18% in December from the same month a year earlier, an even better performance than Wall Street had projected, according to an investor note from Wedbush analyst Dan Ives. Apple (AAPL) shipped around 3.2 million iPhones to China during the month compared to 2.7 million in December 2018, Ives reported, citing data from the China Academy of Information and Communication Technology.\
It's good news for Apple, after iPhone sales tumbled in China over the past year.\
'Our belief that China will continue this positive upward trajectory with renewed growth and share gains on the heels of an iPhone 11 product cycle which the skeptics continue to underestimate,' Ives said in the Thursday note.\
The good news was reflected in Apple's stock, which was up nearly 2% to a record high on Thursday.\
China is a key market for Apple — the region makes up nearly 17% of the company's total sales. And the iPhone is Apple's biggest profit driver.\
In early January 2019, Apple CEO Tim Cook wrote a letter to investors warning them to expect lower sales from the holiday quarter due primarily to iPhone sales in China falling short of what the company had expected. It was the first time since June 2002 that Apple issued a reduction in its quarterly revenue forecast. When the company reported earnings for that quarter later in January, iPhone sales had fallen 15% from the prior year.\
A number of factors contributed to the drop, namely slower growth in the Chinese economy and the US-China trade war.\
The trend continued throughout much of last year.\
In April, Apple said its iPhone sales in the first three months of 2019 dropped 17% from the same period a year earlier, again because of sluggish demand in China. A month later, Citi analysts warned that the trade war could cause Apple's iPhone sales in China to be cut in half.\
In the three months ending in June 2019, iPhones made up less than half of the company's revenue for the first time in years, though the slump also coincided with a greater focus at Apple on subscription-based services such as Apple Music.\
But the iPhone 11, which Apple introduced in September with better camera technology and battery life, as well as lower-than-expected prices, has helped with the rebound, Ives said. Early demand for the new model was strong, and in Apple's October earnings call, Cook noted that the company's prospects in China were turning around.\
Now, Ives estimates that there are roughly 60 million to 70 million iPhone users in China who are likely to upgrade their phones in the coming months.\
The momentum probably will continue this year, as Apple analysts widely expect Apple to release a 5G-enabled version of the iPhone in the fall.\
'Many investors are asking us: Is all the good news baked into shares after an historic upward move over the last year?' Ives said in the note. 'The answer from our vantage point is a resounding NO, as we view [this as] only the first part of this massive upgrade opportunity."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

# The expected output is "This is a 'Business' news"; the text is taken from CNN Business. (Depending on the training run, the model may report 'Sci/Tec' instead.)

This is a 'Business' news


In [0]:
#Third example

ex_text_str = "Amazon, Alphabet, Alibaba, Facebook, Tencent - five of the world's 10 most valuable companies,\
 all less than 25 years old - and all got rich, in their own ways, on data.No wonder it's become common to call data the 'new oil'.\
  As recently as 2011, five of the top 10 were oil companies. Now, only ExxonMobil clings on.\
The analogy isn't perfect. Data can be used many times, oil only once.\
But data is like oil in that the crude, unrefined stuff is not much use to anyone.\
You have to process it to get something valuable. You refine oil to make diesel, to put it in an engine.\
With data, you need to analyse it to provide insights that can inform decisions - which advert to insert in a social media timeline, which search result to put at the top of the page.\
Imagine you were asked to make just one of those decisions.\
Someone is watching a video on YouTube, which is run by Google, which is owned by Alphabet. What should the system suggest they watch next?\
 Pique their interest, and YouTube gets to serve them another advert. Lose their attention, and they will click away.\
You have all the data you need. Consider every other YouTube video they have ever watched - what are they interested in? Now, look at what other users have gone on to watch after this video.\
Weigh up the options, calculate probabilities. If you choose wisely, and they view another ad, well done - you've earned Alphabet all of, ooh, maybe 20 cents (15p).\
Clearly, relying on humans to process data would be impossibly inefficient. These business models need machines.\
In the data economy, power comes not from data alone but from the interplay of data and algorithm."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

# One might expect "This is a 'Sci/Tec' news" here, but the model reports 'Business'; the text is taken from BBC News.

This is a 'Business' news


In [0]:
#Fourth example

ex_text_str = "Nreal, the Chinese start-up involved, has confounded the expectations of many industry watchers with the quality of the images its Light glasses produces.\
The firm still faces issues.\
One tester said the glasses looked a bit 'clunky', and the company is being sued by Magic Leap, a rival.\
But long-time CES attendee Ben Wood, an influential tech consultant, declared them the 'product of the show'.\
For years people have over-promised and under-delivered on augmented reality glasses,' the CCS Insight analyst told the BBC.\
'Nreal seem to have quietly got on with delivering the product and are now set to ship it by the middle of the year.\
'I'm not gong to pretend the glasses will be to everybody's taste - this is still a first-generation product. But they are a lot closer to a normal pair of sunglasses than some of the other bulky smart glasses I've seen, and they definitely provide the best experience of augmented reality glasses at CES.'"

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

# The output of this cell should be "This is a 'Sci/Tec' news".

This is a 'Sci/Tec' news


## Question 3:
Okay, the model probably still works well with the examples you fed to it in the previous question. How about a plot twist? Let's feed it some more random text data from completely different genres/topics (not belonging to the 4 topics we talked about in the first question). How does your model react now? Give at least 3 such examples (you are free to include more examples if you wish).

Of course the predictions will be limited to the four class labels that your model is trained on. Can you somehow justify the labels that your model predicted now for the given text inputs?
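One way to look beyond the predicted label is to turn the four raw model outputs into probabilities with a softmax and see how flat the distribution is. The sketch below is plain Python with hypothetical output values, not real results from the model, and a low top probability is only a rough hint rather than a reliable test for out-of-topic text.

```python
import math

AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tec"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical model outputs for an out-of-topic text; not real results.
logits = [0.2, -0.1, 0.4, 0.5]
probs = softmax(logits)

# argmax always returns one of the four classes, however flat the distribution is.
best = max(range(len(probs)), key=probs.__getitem__)
print(AG_NEWS_LABELS[best], [round(p, 3) for p in probs])
```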

## Answer 3:

In [0]:
#example 1:

ex_text_str = "More than 50 people have been infected. Seven are currently in a critical condition.\
A new virus arriving on the scene, leaving patients with pneumonia, is always a worry and health officials around the world are on high alert.\
But is this a brief here-today-gone-tomorrow outbreak or the first sign of something far more dangerous?\
What is this virus?\
Viral samples have been taken from patients and analysed in the laboratory.\
And officials in China and the World Health Organization have concluded the infection is a coronavirus.\
Coronaviruses are a broad family of viruses, but only six (the new one would make it seven) are known to infect people.\
Severe acute respiratory syndrome (Sars), which is caused by a coronavirus, killed 774 of the 8,098 people infected in an outbreak that started in China in 2002.\
'There is a strong memory of Sars, that's where a lot of fear comes from, but we're a lot more prepared to deal with those types of diseases,' says Dr Josie Golding, from the Wellcome Trust.\
Where has it come from?\
New viruses are detected all the time.\
They jump from one species, where they went unnoticed, into humans.\
'If we think about outbreaks in the past, if it is a new coronavirus, it will have come from an animal reservoir,' says Prof Jonathan Ball, a virologist at the University of Nottingham.\
Sars jumped from the civet cat into humans.\
And Middle East respiratory syndrome (Mers), which has killed 858 out of the 2,494 recorded cases since it emerged in 2012, regularly makes the jump from the dromedary camel."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

# The actual topic is 'Health', which is not one of the four classes, so the model has to pick one of the labels it knows.

This is a 'Sci/Tec' news


In [0]:
#example 2:

ex_text_str = "Architecture can flirt with nature in expressive yet subtle ways. The idea is, often, to harmonize, not dominate, the landscapes.\
This can prove a challenge, however, when faced with steep slopes, cliff faces and mountainsides.\
Some of today's most interesting architects are out to prove the discipline can be edgy -- quite literally. Here is one examples of houses that overcome difficult environments to offer extraordinary an experience for owners and onlookers alike.\
Cliff House, on the Atlantic coast in Nova Scotia, is an inventive and playful intervention in the landscape.\
From hill height, the house looks absolutely normal. But from the\ coast, you can see it's actually perched on a cliff, which the architects say is intended 'to heighten one's experience of the landscape through a sense of vertigo and a sense of floating on the sea.'\
The galvanized steel superstructure provides solid support and is fixed to the cliff, while wooden elements introduce cosiness inside and out.\
The cube is not divided into levels, so the large living space fills the entire area. Only a small part of it is transformed into sleeping quarters."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

#The output of this cell should be - "This is a 'Architecture / Style/ Art' news".

This is a 'Business' news


In [0]:
#example 3:

ex_text_str = "Cambodian cuisine has a long history and a diverse range of influences, yet it's only now becoming known beyond the country's borders. In fact, the only place you can experience all it has to offer is in the country itself. \
Here is one of the 30 best dishes to try.\
Samlor korkor\
While amok is sometimes called the country's national dish, and might be the one most familiar to tourists, samlor korkor has a better claim to being the true national dish of Cambodia. It has been eaten for hundreds of years and today can be found in restaurants, roadside stands and family homes alike.\
The ingredients list for this nourishing soup is versatile and easily adapted to whatever is seasonal and abundant; it often includes more than a dozen vegetables. It can be made with almost any type of meat, but most commonly it's a hearty soup made from catfish and pork belly. The soup always includes two quintessential Cambodian ingredients -- prahok, a type of fermented fish, and kroeung, a fragrant curry paste -- and is then thickened with toasted ground rice."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

#The output of this cell should be - "This is a 'Travel' news".

This is a 'Business' news


## Question 4:
Your model has probably achieved a good accuracy score. However, there are still many things you could try in order to improve your classifier. Can you list some changes that you think would improve the above model's performance?

_(Hint: Think about alternate architectures, the number of layers, hyperparameters, etc., but try not to come up with anything too complex! :) )_

## Answer 4:

To improve our classifier model:

1) We can use a different architecture:

* Convolutional Neural Networks for Sentence Classification (CNN)
* Recurrent Neural Networks (RNN)
* Gated Recurrent Units (GRU)
* Long Short-Term Memory (LSTM)
* Hierarchical Attention Networks for Document Classification
* Hierarchical Deep Learning for Text (HDLTex)

2) We can change the hyperparameters to get better results: a learning rate of 4.0 is very high, so with a learning rate of 0.01 and 50 epochs we could get better accuracy.
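
The learning-rate point in Answer 4 can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which is an assumption made purely for illustration and is not the notebook's model or optimizer setup. It only shows the general effect: a very small step converges slowly, a moderate step converges quickly, and a step that is too large diverges. Whether 4.0 is actually too high for this classifier would have to be checked on the validation set.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # one gradient-descent update
    return x


# Compare a small, a moderate, and an overly large learning rate.
for lr in (0.01, 0.1, 1.5):
    print(f"lr={lr}: final |x| = {abs(gradient_descent(lr)):.4g}")
```

On this toy function each update multiplies x by (1 - 2·lr), so any learning rate above 1.0 makes |x| grow instead of shrink.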

# Task 2: Try the better option that you proposed

In Question 4, you proposed some alternative solutions that you think will improve your model. Following one of the options below, build and train a new model, and report the new loss and accuracy scores. Is it better than your initial classifier model on the same data?

For your reference, here are some neural models that researchers have used for text classification:

* Recurrent Neural Networks (RNNs)
* Long Short-Term Memory (LSTM)
* Bi-directional LSTM (BiLSTM)
* Gated Recurrent Units (GRUs)
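
As a starting point for one of the options above, here is a minimal sketch of a bidirectional LSTM classifier in PyTorch. It is not a reference solution: the class name `BiLSTMClassifier`, the layer sizes, the use of index 0 for padding, and the choice of 4 output classes are all assumptions made for illustration. It also assumes the input is a padded batch of token ids with shape `(batch, seq_len)`, which differs from how the first model in this notebook may receive its input.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """Hypothetical BiLSTM text classifier; all sizes are illustrative defaults."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_classes=4):
        super().__init__()
        # Index 0 is assumed to be the padding token.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)             # (batch, seq, embed)
        _, (h_n, _) = self.lstm(embedded)                # h_n: (2, batch, hidden)
        # Concatenate the final forward and backward hidden states.
        features = torch.cat([h_n[0], h_n[1]], dim=1)    # (batch, 2 * hidden)
        return self.fc(features)                         # unnormalised class scores


# Smoke test on random token ids (no real data involved).
model_sketch = BiLSTMClassifier(vocab_size=100)
dummy_batch = torch.randint(1, 100, (8, 12))
logits = model_sketch(dummy_batch)
print(logits.shape)
```

With real, padded batches the final forward hidden state would also have read the padding positions, so packing the sequences (for example with `torch.nn.utils.rnn.pack_padded_sequence`) is worth considering before comparing accuracy against the first model.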



# Task 3: Let your creativity flow!

As discussed earlier, you are free to come up with anything in Task 3. Try to design a unique (but not too complex!) neural architecture of your own. This model should be as novel as possible, so do not copy other people's existing work. Using the same data, train the new model and report the accuracy scores. How much better or worse is this model than the previous two models? Why do you think that is?

# Important Notes

## NOTE 1:
If you want, you can try out the models on other datasets too for comparison. Although this is not mandatory, it would be interesting to see how your model performs on data from different domains. Note that you may need to tweak the code a little when working with other datasets and formats.

## NOTE 2:
Any form of plagiarism is strictly prohibited. If it is found that you have copied sample code from the internet, the entire team will be penalized.

## NOTE 3:
Jupyter Notebooks often stop working or crash when memory is overloaded (many variables, big neural models, memory-intensive training, etc.). Moreover, the more tasks you work on, the more variables you will be using. Therefore, it is recommended that you use a separate notebook for each _Task_ in this project.

## NOTE 4:
You are expected to write well-documented code, that is, with proper comments wherever you think they are needed. Make sure you write a comprehensive report for the entire project, covering data analysis, your model architecture, the methods used, a discussion and comparison of the models in terms of the accuracy and loss metrics, and a final conclusion. If you want to prepare separate reports for each _Task_, you could do this in the Jupyter Notebook itself using Markdown and $\LaTeX$ code if needed. If you want to submit a single report for the entire project, you could submit a PDF file instead (prepared in Word or $\LaTeX$).

All the very best for Project 3. Wishing you happy holidays and a very happy new year in advance! :)