<a href="https://colab.research.google.com/github/parsa-abbasi/NLP981/blob/master/NLP981_Phase2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP981 Final Project - Phase #2

*   Instructor: Javad PourMostafa
*   Teaching Assistant: Parsa Abbasi
*   University of Guilan, 1st semester of 2019
*   GitHub repository : *https://github.com/JoyeBright/NLP*

![alt text](https://api.ning.com/files/zVkZJuBsVmCiwDFZI4D6EjB5JmygDkzEXdXRDm2i7IAblU56Hb0Ed65qbxUrsRD3vxYQ0jung4Cv4*wmIrVvN7Gu*2njAPkV/deeep.png)

Welcome to the final phase of the *NLP981* project. The task in this phase is very similar to the first phase. The only difference is that you are going to use a simple Deep Learning model instead of a Machine Learning algorithm.

The good news is you don't need to implement anything from scratch. You just need to know the main concepts of Neural Networks and be familiar with a Deep Learning library such as Keras, TensorFlow, or PyTorch. If you haven't worked with any of them before, I suggest you use Keras. It's a high-level, easy-to-use deep learning library. The description of this phase is also written around the Keras library and its functions.

You have to write your code inside this Python notebook. *Google Colab* gives you a free GPU with 12GB of memory to run your models, so I highly recommend using the Google Colab environment to save time.

If you have any questions, feel free to ask.
You can use the [*Quera*](https://quera.ir/course/4385/) platform for your general questions.



## Introduction

In this phase of the project, you will build a category predictor (similar to the first phase).

The predictor takes a text as input and predicts a category for it.

For this purpose, you need to:

1.   Load the data
2.   Preprocess the text data
3.   Prepare the texts so they can be fed into a deep learning model

      1.   Tokenize each text
      2.   Encode each text as a numeric vector
      3.   Pad each encoded text so that all vectors have the same length

4.   Build and train a simple deep learning model
5.   Predict a category for each validation sample using the trained model
6.   Evaluate your model's performance using the F1-score
7.   Conclusion



## Google Colab Setup

If you are using the Google Colab environment, make sure the GPU runtime type is enabled. To do this:


*   Click the **“Runtime”** dropdown menu.
*   Select **“Change runtime type”**.
*   Select **“GPU”** in the **“Hardware accelerator”** dropdown menu.

![alt text](https://media.geeksforgeeks.org/wp-content/uploads/20190430121157/Screenshot-4910.png)

## 1) Dataset

The dataset you will use in this phase is called *Divar*, which was released by the *CafeBazaar* research team.

It contains more than 900,000 posts from the *Divar* ads platform. We split this dataset into training, validation, and testing sets.

The testing set is not accessible to you; we will use it to evaluate your work on the presentation day.

You can download the dataset files (training and validation sets) directly from the following link:

> *https://drive.google.com/open?id=1oj-fqpymjDr8QsOK-zQliiqXbVqakrFo*


### 1.1) Import

In [0]:
# Import the training and validation sets here

### 1.2) Extract data

You need two types (columns) of information from this dataset. The descriptions (*desc*) will be used as the input of your model, and the First-level category (*cat1*) will be used as the classes or output of your model.

Each class/category must be represented by a unique integer. As there are 6 different first-level categories in this dataset, you can assign a number from 0 to 5 to each of them.
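As a rough illustration of the extraction and label-encoding step, here is a minimal sketch using only the standard library. The in-memory CSV and the category strings are hypothetical stand-ins; only the column names *desc* and *cat1* come from the dataset description above.

```python
import csv
import io

# Tiny in-memory stand-in for a dataset file. The rows and the category
# strings are made up for illustration; only the column names are real.
raw = io.StringIO(
    "desc,cat1\n"
    "red car for sale,vehicles\n"
    "two-seat sofa,for-the-home\n"
    "blue car,vehicles\n"
)
rows = list(csv.DictReader(raw))

descriptions = [row["desc"] for row in rows]   # model inputs
categories = [row["cat1"] for row in rows]     # class names

# Sort the distinct names so the name -> integer mapping is reproducible.
label_to_id = {name: i for i, name in enumerate(sorted(set(categories)))}
labels = [label_to_id[name] for name in categories]
```

Whatever mapping you build from the training set should be reused unchanged for the validation set, so that the same integer always means the same category.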

In [0]:
# Your code
x_train = # descriptions extracted from training set
y_train = # first-level categories extracted from training set
x_val = # descriptions extracted from validation set
y_val = # first-level categories extracted from validation set

## 2) Preprocessing

You can use any preprocessing steps you want. I recommend trying different methods and observing their effects on the final result, then choosing the combination of preprocessing steps that gives you the best prediction score.
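To give an idea of what such steps might look like, here is a minimal sketch. The specific steps (unifying Arabic and Persian letter variants, dropping tags, punctuation, and digits) and the name `preprocessing_sketch` are only assumptions for illustration, not a required recipe.

```python
import re

# Map two Arabic letter forms to their Persian counterparts:
# U+064A (Arabic yeh) -> U+06CC (Persian yeh), U+0643 (Arabic kaf) -> U+06A9 (Persian keheh).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

def preprocessing_sketch(text):
    text = text.translate(ARABIC_TO_PERSIAN)   # unify letter variants
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML-like tags
    text = re.sub(r"[^\w\s]", " ", text)       # replace punctuation with spaces
    text = re.sub(r"\d+", " ", text)           # drop runs of digits
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```

Note that the punctuation step also replaces the zero-width non-joiner (U+200C) with a space, which may or may not be what you want for Persian text. Whether each step helps is something to check against your validation score.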

In [0]:
def preprocessing(text):
  # Your preprocessing steps
  cleared_text = text  # placeholder: replace with the result of your steps
  return cleared_text

## 3) Word Encoding

There are different word embedding approaches that are suitable for deep learning models. The embedding vectors can be learned while the model is training, or they can be pre-trained word vectors. We recommend the first option for this phase, which means you need to use an embedding layer inside your Neural Network architecture.

You can add an embedding layer very easily in Keras. It requires that the input data be integer encoded so that each word is represented by a unique integer. You can use the Keras Tokenizer to produce this representation. The Tokenizer vectorizes a text corpus by turning each text into a sequence of integers.
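To make the idea of integer encoding concrete, here is a plain-Python sketch of the same idea: more frequent words get smaller indices, starting at 1, with 0 left free for padding. It is only an illustration on a made-up corpus and is not a replacement for the Keras Tokenizer.

```python
from collections import Counter

# Toy corpus (hypothetical); in the notebook this would be the preprocessed training texts.
texts = ["red car for sale", "blue car", "sale of a red sofa"]

# Count words, then give the most frequent word index 1, the next one 2, and so on.
# Index 0 is left unused so it can serve as the padding value later.
counts = Counter(word for text in texts for word in text.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text):
    # Words that were never seen while building the index are skipped here.
    return [word_index[w] for w in text.split() if w in word_index]

encoded = [encode(t) for t in texts]
max_len = max(len(seq) for seq in encoded)   # length of the longest encoded text
vocab_size = len(word_index)                 # number of distinct words
```

Because index 0 is reserved, an embedding layer fed with such sequences typically needs `vocab_size + 1` rows.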

### 3.1) Tokenization

Initialize a [Tokenizer](https://keras.io/preprocessing/text/).

In [0]:
# Your code

Fit the tokenizer on the training set texts.

In [0]:
# Your code

What is the length of the longest training text?

In [0]:
max_len = # Your code

How many words are in the tokenizer vocabulary? 

In [0]:
vocab_size = # Your code

### 3.2) Encoding

Encode your training and validation texts as integer sequences using the tokenizer's built-in functions.

In [0]:
encoded_x_train = # Your code
encoded_x_val = # Your code

### 3.3) Padding

Keras expects all inputs to have the same length, but our descriptions have different lengths. You can use another Keras built-in function called [pad_sequences](https://keras.io/preprocessing/sequence/) to pad the sequences to the same length.
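The effect of padding can be sketched in plain Python as follows. This illustration pads and truncates at the front of each sequence, which is my understanding of the `pad_sequences` defaults (`padding="pre"`, `truncating="pre"`); the helper name `pad` is hypothetical.

```python
def pad(sequences, max_len, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]                     # too long: keep the last max_len items
        filler = [value] * (max_len - len(seq))  # too short: fill the front with `value`
        padded.append(filler + seq)
    return padded

print(pad([[5, 2], [1, 2, 3, 4]], max_len=3))
```

Every returned row has exactly `max_len` entries, so the rows can be stacked into a single 2-D array.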

In [0]:
padded_x_train = # Your code
padded_x_val = # Your code

## 4) Deep Learning

Now it's time to build your Neural Network. You need to implement a simple NN architecture with just one hidden layer. We don't require any special architecture such as a CNN, RNN, or LSTM. But if you are skilled in deep learning, feel free to use any architecture you want.

![alt text](https://drive.google.com/uc?export=view&id=1NuxNnGnum1cWSoL4G8BYOdw7HUwrvcG5)

The proposed model has four layers:


1.   An Embedding layer whose output for each text has shape (*max_len*, *embedding_dimension*)

2.   A Dense layer, with a number of units of your choice, as the hidden layer

3.   A Global Max Pooling layer

4.   An Output layer with 6 units (Number of classes)

**Notes:**

*   You can choose any number as the embedding dimension. Our suggestion is 300.
*   You should know what each layer is doing.
*   You can be creative and add different hidden layers.
*   The activation functions, loss function, and optimizer are your choice, and you need to have strong reasons behind your choices.
*   Try different batch sizes, numbers of epochs, and learning rates.
*   If you have a problem with the dimension of the output (labels), you can convert the labels to a binary class matrix using the [to_categorical](https://keras.io/utils/) function.
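To see what each of the four layers does to the shape of the data, here is a NumPy sketch of a single forward pass with random, untrained weights. It is not the Keras model itself: the sizes, the ReLU and softmax activations, and the helper names are assumptions chosen for illustration. In Keras, the corresponding layers would be `Embedding`, `Dense`, `GlobalMaxPooling1D`, and a final `Dense`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
vocab_size, max_len, embedding_dim, hidden_units, num_classes = 50, 8, 16, 32, 6

# Randomly initialised weights; a real model learns these during training.
E = rng.normal(size=(vocab_size + 1, embedding_dim))  # +1 row for the padding index 0
W1 = rng.normal(size=(embedding_dim, hidden_units))
b1 = np.zeros(hidden_units)
W2 = rng.normal(size=(hidden_units, num_classes))
b2 = np.zeros(num_classes)

def forward(padded_batch):
    embedded = E[padded_batch]                    # (batch, max_len, embedding_dim)
    hidden = np.maximum(0.0, embedded @ W1 + b1)  # dense + ReLU per position: (batch, max_len, hidden_units)
    pooled = hidden.max(axis=1)                   # global max pooling: (batch, hidden_units)
    logits = pooled @ W2 + b2                     # (batch, num_classes)
    logits -= logits.max(axis=1, keepdims=True)   # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # softmax: each row sums to 1

def one_hot(labels, num_classes):
    # Binary class matrix: row i has a 1 in column labels[i] and 0 elsewhere.
    return np.eye(num_classes)[labels]

batch = rng.integers(0, vocab_size + 1, size=(4, max_len))  # 4 fake padded texts
probs = forward(batch)
print(probs.shape)
```

The `one_hot` helper shows the kind of binary class matrix that a categorical cross-entropy loss expects as targets.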





In [0]:
# Your model initialization

In [0]:
# Your training parameters

In [0]:
# Fit training set to the model

## 5) Prediction

Now you can predict a category for each validation sample using the trained classifier.
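Assuming your model outputs one score per class for each text (for example, softmax probabilities), the predicted category is simply the index of the largest score. A minimal sketch, with a hypothetical helper name and made-up scores:

```python
def predict_classes(probabilities):
    # For each row, return the index of the highest score.
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Made-up scores for two texts and three classes.
example = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(predict_classes(example))
```

With NumPy arrays, `np.argmax(probabilities, axis=1)` gives the same result.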

In [0]:
# Your code

## 6) Evaluation

As the dataset is imbalanced, it's better to use the F1-score as the evaluation metric. Therefore, evaluate your model using the F1-score, based on the predictions you made and the true labels.
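For reference, the per-class F1-score is $2 \cdot TP / (2 \cdot TP + FP + FN)$. The sketch below averages it over the classes (macro averaging); the choice of macro averaging is an assumption here, since the averaging mode is not specified above.

```python
def macro_f1(y_true, y_pred, num_classes):
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that never appears and is never predicted gets a score of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Made-up labels for three classes.
print(macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3))
```

In practice you would probably use a library function instead, such as scikit-learn's `f1_score(y_true, y_pred, average="macro")`.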

In [0]:
# Your code

## 7) Conclusion

You need to have a solid understanding of what you did in these two phases of the project. We want strong answers from you on the presentation day, so spend some time reviewing your work and drawing conclusions. Some of the topics we may ask about are listed below:

*   Comparing the results or limits of the Machine Learning algorithm and your Deep Learning model.
*   Comparing the advantages and disadvantages of different word embedding approaches (Neural Embedding vs TF-IDF Vectorization).
*   Reasons for using these layers, activation functions, loss function, etc.
*   The performance of your Deep Learning model based on the training loss over each epoch.
*   And anything else we might ask :)

Good Luck!
