## Homework (Mini-project)

The objective of this assignment is to implement the basic building blocks of a Deep Learning pipeline on a sample supervised-learning problem in **PyTorch**.



Name: Omar Elsobky

Matriculation No.: 03737994 

**Important:** Do not forget to fill in the places where you see `### Your code goes here ###`

______________

# Task and Setup

In this assignment, we want you to experience doing a mini-project in PyTorch. You are supposed to build the different parts of the pipeline as illustrated in the course notebooks (DataLoading, Model, Loss, Training, Evaluation). For this purpose, we will use a dataset from Quora containing question pairs and labels indicating whether each pair is a duplicate or not.

The data can be found in Moodle as a CSV file. This data is subject to Quora's [Terms of Service](https://www.quora.com/about/tos), allowing for non-commercial use. The dataset was downloaded from https://www.kaggle.com/quora/question-pairs-dataset.



**Classifying Duplicate Question Pairs**


We want to build a DL Model that can predict whether two questions from a Quora dataset are duplicates or not. Note that the two questions need not be identical, as you will see in the dataset; rather, they mean almost the same thing semantically.

To make the setup a bit simpler, we extracted and prepared a small subset of the original data consisting of 50k examples. Additionally, we removed questions that are too long or too short and kept only questions between 30 and 50 characters long. These 50k examples should serve as training and validation data, so please make a reasonable split. Do not train on the validation data; just use it to evaluate your model.

**Model Inputs and Label**:

Input Format: 2 questions. For each question you will have an input of `BATCH_SIZE X SEQ_LEN`, where SEQ_LEN is the number of tokens in the question. Of course, if you stack the inputs into batches, you will need to pad the questions (i.e. add a padding token or zeros at the end of the question to make all questions equal in length).

Label Format: `BATCH_SIZE X 1`. Please note that the extra dimension (`X 1`) is optional and depends on your implementation; you could also have a simple 1D tensor of length `BATCH_SIZE`, where each value is either 0 or 1, indicating that the two questions are non-duplicates or duplicates, respectively.
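
As a rough illustration of these shapes, the following sketch pads three already-encoded questions to a common length with `torch.nn.utils.rnn.pad_sequence`. The token ids and the choice of `0` as padding index are assumptions made only for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumption: index 0 is reserved for the padding token

# Three hypothetical questions, already encoded as integer token ids.
encoded = [
    torch.tensor([5, 8, 2]),
    torch.tensor([7, 3, 9, 4, 6]),
    torch.tensor([1, 2]),
]

# batch_first=True yields BATCH_SIZE x SEQ_LEN; shorter rows are padded at the end.
batch = pad_sequence(encoded, batch_first=True, padding_value=PAD_IDX)

# Labels as a 1D float tensor of length BATCH_SIZE (1 = duplicate, 0 = non-duplicate).
labels = torch.tensor([1.0, 0.0, 1.0])

print(batch.shape)                # torch.Size([3, 5])
print(labels.shape)               # torch.Size([3])
print(labels.unsqueeze(1).shape)  # torch.Size([3, 1]), the optional extra dimension
```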

**Hints:**

1. Please read the CSV file and explore the dataset a bit in order to familiarize yourself with the problem before working on it.

2. In your custom Dataset you have to make sure that you provide two questions for each example; this should be done in the `def __getitem__(self, idx)` method.

3. Please work at the word-level, your tokens are words. You will need to preprocess the data accordingly. Feel free to write simple Python code that can do the job, but also consider using tokenizers, stemmers, and lemmatizers from known NLP libraries such as [NLTK](https://www.nltk.org/) or [SpaCy](https://spacy.io/).

4. You will need to encode the words into integers to be able to pass them to the model, you will also need to keep track of the vocabulary. For this purpose, you can also write your own Python code or use an out-of-the-box module such as `torchtext.data.vocab` (see example in the data loading notebook). You can include this part in your Dataset class if you like.

5. You will most probably need to use an embedding layer as input to the model; it will take the sequence of integers and return numeric vectors representing each word. Please consider using pre-trained embeddings. There are multiple ways to load these into your newly created Embedding layer, and `torchtext` also provides some easy ways to do that.

6. With Embeddings, you will have two options:
    - train your own embeddings on the task, either by starting from random weights or after loading pre-trained embeddings (this will take more time and probably need Colab or GPU)
    - or freeze the pre-trained embeddings and train the rest of the network (make sure the embedding layer is frozen, i.e. `requires_grad` is set to `False`).

7. Note that you will need to encode the questions as integers based on the vocabulary you are using. This sequence of integers will be fed as input to the model (embeddings lookup, then the following layers).

8. Take care that the model has two inputs (two sentences in parallel). This should be done in your implementation of the `def forward(self, question1, question2)` method in your custom model class.

9. Since you need to feed both questions to your model, in the `forward` method you will have to let each question go through a couple of layers to get a representation for each question. Then, you will have to combine the two representations in any way you see fit (e.g. multiply them, subtract them, concatenate them). Finally, you will have to let this combined representation go through a couple of layers (mostly fully-connected layers) and then predict the outcome (2 classes).

10. This is generally a binary classification problem, you can use a classification loss to train your model. There are more advanced loss functions that are related to Siamese Networks (which is this architecture since it has multiple parallel inputs), feel free to use or explore them.

11. A nice lecture about the topic is here: Siamese Networks and Similarity Learning Lecture, Prof. Dr. Laura Leal-Taixé, Advanced Deep Learning for Computer Vision Course: https://www.youtube.com/watch?v=6e65XfwmIWE

12. Good summary and course notes of Deep Learning specialization on Coursera: https://github.com/mbadry1/DeepLearning.ai-Summary
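
To make hints 3, 4 and 7 more concrete, here is a minimal pure-Python sketch of word-level tokenization, vocabulary building and integer encoding. The regex tokenizer is only a stand-in for NLTK or SpaCy, and the names `tokenize`, `build_vocab`, `encode`, `<pad>` and `<unk>` are hypothetical choices for this example, not part of any library.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenize(text):
    """Lowercase the text and split it into word tokens (simple regex stand-in)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts, min_freq=1):
    """Map every word seen at least min_freq times to an integer id."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {PAD: 0, UNK: 1}  # reserve ids for padding and unknown words
    for word, freq in counts.most_common():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn raw text into a list of ids; unseen words map to the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]


questions = ["How do I learn Python?", "What is the best way to learn Python?"]
vocab = build_vocab(questions)
print(encode("How do I learn Rust?", vocab))  # the last id is the UNK id
```

Keeping `encode` as a standalone function means the same preprocessing can later be applied to a raw question at prediction time.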

### Task 1: Data Loading (30 Points)

1. Write code to read the dataset after you download it from Moodle.
2. Explore some examples and check if you need to do some data cleaning or remove some bad examples.
3. Decide on what preprocessing steps you will do to the text of the questions.
4. Build a custom PyTorch dataset where you implement the required methods `__getitem__` and `__len__`. Do not forget to integrate any preprocessing steps in the class. Make sure that you also have a function that applies the whole preprocessing to a raw example, this will be very helpful when you want to predict for test examples later.
5. Split the data into train and validation data. Use a reasonable split ratio.
6. Create PyTorch dataloaders for train and validation datasets.
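
One possible shape for steps 4 to 6 is sketched below. It assumes the questions have already been encoded as lists of integer ids; `QuestionPairDataset`, `collate` and the toy `pairs` list are hypothetical names and data used only for illustration, and the real CSV reading and preprocessing are left out.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, random_split


class QuestionPairDataset(Dataset):
    """Hypothetical dataset over (question1_ids, question2_ids, label) triples."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        q1, q2, label = self.pairs[idx]
        return torch.tensor(q1), torch.tensor(q2), torch.tensor(float(label))


def collate(batch, pad_idx=0):
    """Pad each question column separately so both are BATCH_SIZE x SEQ_LEN."""
    q1, q2, labels = zip(*batch)
    q1 = pad_sequence(q1, batch_first=True, padding_value=pad_idx)
    q2 = pad_sequence(q2, batch_first=True, padding_value=pad_idx)
    return q1, q2, torch.stack(labels)


# Toy encoded data standing in for the real CSV rows.
pairs = [([2, 3, 4], [2, 3], 1), ([5, 6], [7, 8, 9, 10], 0)] * 5
dataset = QuestionPairDataset(pairs)

# 80/20 split with a fixed seed so the split is reproducible.
generator = torch.Generator().manual_seed(0)
train_set, val_set = random_split(dataset, [8, 2], generator=generator)

train_loader = DataLoader(train_set, batch_size=4, shuffle=True, collate_fn=collate)
val_loader = DataLoader(val_set, batch_size=4, shuffle=False, collate_fn=collate)

q1, q2, labels = next(iter(train_loader))
print(q1.shape, q2.shape, labels.shape)
```

The two questions in a batch may end up with different `SEQ_LEN` values because each column is padded independently, which is fine as long as the model encodes them separately.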

### Your code goes here ####

### Task 2: Model (20 Points)

1. Explore what possible models for the task could be. You do not need to come up with a very complex model; a relatively small model consisting of the following sequence would be fine: {*Embeddings for the input - LSTM or CNN to process the sequence - Linear Layers to learn features from the combined representation of the questions - Output Layer*}. Just take care of the sizing of the different layers. Please always check the sizing after each layer and make sure you understand the dimensions correctly and that they map to what you have in mind.
2. Build a model class.
3. Test your model with one batch from your dataloader and check the input and output shapes.
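
The sequence described in step 1 could look roughly like the sketch below: a shared embedding and LSTM encode each question, the two representations are combined, and linear layers produce one logit per pair. The class name `SiameseLSTM`, the layer sizes and the particular way of combining the representations are assumptions for this example, not a required design. For simplicity the LSTM also reads the padding positions; packing the sequences with `torch.nn.utils.rnn.pack_padded_sequence` would avoid that.

```python
import torch
import torch.nn as nn


class SiameseLSTM(nn.Module):
    """Hypothetical two-input model: shared encoder, combined features, one logit."""

    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64, pad_idx=0, pretrained=None):
        super().__init__()
        if pretrained is not None:
            # freeze=True sets requires_grad=False on the embedding weights.
            self.embedding = nn.Embedding.from_pretrained(
                pretrained, freeze=True, padding_idx=pad_idx
            )
            embed_dim = pretrained.size(1)
        else:
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def encode(self, question):
        embedded = self.embedding(question)      # B x T x E
        _, (hidden, _) = self.encoder(embedded)  # hidden: 1 x B x H
        return hidden[-1]                        # B x H

    def forward(self, question1, question2):
        u, v = self.encode(question1), self.encode(question2)
        # Concatenate both representations with their difference and product.
        combined = torch.cat([u, v, torch.abs(u - v), u * v], dim=1)  # B x 4H
        return self.classifier(combined).squeeze(1)                   # B logits


# Shape check with one random batch (the two questions may differ in length).
model = SiameseLSTM(vocab_size=100)
q1 = torch.randint(1, 100, (4, 7))
q2 = torch.randint(1, 100, (4, 9))
logits = model(q1, q2)
print(logits.shape)  # torch.Size([4])
```

Returning raw logits instead of probabilities pairs naturally with `nn.BCEWithLogitsLoss`; applying `torch.sigmoid` to the logits gives the duplicate probability at inference time.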

### Your code goes here ####

### Task 3: Training (30 Points)

1. Develop Training and Validation code.
2. Choose a suitable loss function.
3. Refactor the code so that it can be easily modified and adapted (use methods, classes, etc.).
4. Make sure to save your trained model when you reach a good score on the validation dataset.
5. Plot the training and validation losses.
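
A compact way to cover these steps is a single `run_epoch` helper used for both training and validation, as sketched below. `TinyPairModel`, the random tensors and the checkpoint location in the system temp directory are placeholders chosen only so that the sketch runs on its own; they are assumptions, not part of the assignment.

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def run_epoch(model, loader, loss_fn, optimizer=None):
    """One pass over loader; trains if an optimizer is given, otherwise evaluates."""
    training = optimizer is not None
    model.train(training)
    total, count = 0.0, 0
    with torch.set_grad_enabled(training):
        for q1, q2, labels in loader:
            loss = loss_fn(model(q1, q2), labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * labels.size(0)
            count += labels.size(0)
    return total / count  # mean loss per example


class TinyPairModel(nn.Module):
    """Placeholder model so the loop can run; swap in your own model."""

    def __init__(self, vocab_size=20, dim=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.out = nn.Linear(dim, 1)

    def forward(self, q1, q2):
        u, v = self.embedding(q1).mean(1), self.embedding(q2).mean(1)
        return self.out(torch.abs(u - v)).squeeze(1)


# Random stand-in data: 48 training and 16 validation pairs.
torch.manual_seed(0)
q1 = torch.randint(1, 20, (64, 6))
q2 = torch.randint(1, 20, (64, 6))
labels = torch.randint(0, 2, (64,)).float()
train_loader = DataLoader(TensorDataset(q1[:48], q2[:48], labels[:48]), batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(q1[48:], q2[48:], labels[48:]), batch_size=16)

model = TinyPairModel()
loss_fn = nn.BCEWithLogitsLoss()  # expects raw logits and float labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
checkpoint_path = os.path.join(tempfile.gettempdir(), "best_model.pt")

history = {"train": [], "val": []}
best_val = float("inf")
for epoch in range(5):
    history["train"].append(run_epoch(model, train_loader, loss_fn, optimizer))
    history["val"].append(run_epoch(model, val_loader, loss_fn))
    if history["val"][-1] < best_val:  # keep only the best model so far
        best_val = history["val"][-1]
        torch.save(model.state_dict(), checkpoint_path)
```

The two lists in `history` can then be plotted against the epoch index, for example with matplotlib's `plt.plot(history["train"], label="train")` and the same call for `"val"`.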

### Your code goes here ####

### Task 4: Evaluation (20 Points)

1. Report some suitable evaluation metrics. If you stick to standard classification, please report the classification metrics we discussed in the evaluation notebook.
2. Check some examples and results from the training data.
3. Check some examples and results from the validation data (not used for training).
4. Come up with one pair of questions and see if your model can produce a reasonable prediction for them. For this, you need to apply the preprocessing pipeline and encoding on the questions' text and make inference to see if the model predicts that they are duplicates.
5. Conclude with some comments.
6. Give us your feedback about the task (at least a sentence).
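
For step 1, the standard binary classification metrics can be computed directly from the confusion counts, as in this small pure-Python sketch. The function name `classification_metrics`, the 0.5 threshold and the example probabilities are assumptions for illustration; `scikit-learn` offers equivalent ready-made functions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = duplicate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical predicted duplicate probabilities and true labels.
probs = [0.9, 0.2, 0.4, 0.7, 0.1, 0.6, 0.8, 0.3]
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Threshold the probabilities at 0.5 to obtain hard predictions.
y_pred = [int(p >= 0.5) for p in probs]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # every metric is 0.75 for this toy example
```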

### Your code goes here ####
### do not forget to add your comments and feedback ###

**Submission Notes:**

1. You can surely use external files to organize your code in classes or modules (e.g. `.py` files). However, please make sure that the notebook can run without errors and all the required files are attached in your submission (e.g. .zip file). If everything is confined in one single notebook, just submit the notebook.
2. If you use special libraries or packages, please indicate that clearly and add a `requirements.txt` file to your submission.
3. Do not upload the dataset nor submit it in any way.
4. Please do not copy code from someone else; the idea here is that you get a chance to write some code in PyTorch and solve the problem on your own. On the other hand, discuss with your colleagues and support them as much as needed.
5. You can of course reuse the code in all the notebooks of the course.
6. You can also use code and/or ideas from the internet (e.g. Kaggle notebooks), but please always do the following:
    - make sure you understand the code so that you can use it correctly
    - cite the source in your notebook as markdown or as a comment in the code (`# adapted from .......`)


**Evaluation Criteria:**

The most important idea we will use for evaluation is that you should get every component of the complete pipeline (1) doing what it is supposed to do and (2) fitting with the other components.

Concrete Examples are:
1. Your code runs and we can reproduce the results.
2. Dataset: Your code reads the dataset, preprocesses it, splits the text into tokens, etc.
3. Dataloading: Your dataset and dataloader produce correct inputs and labels (two questions as input, one binary label).
4. Model: Your model takes the input (processes each question correctly) and produces a binary prediction (or score).
5. Loss: Your chosen loss is suitable for the task (e.g. binary cross entropy in case of the simple classification setup).
6. Training: Your code for training and validation works without errors and the loss on the training and validation data decreases over the course of training. You do not have to achieve a specific performance.
7. Evaluation: You discuss the results, correctly pick some questions from the training and validation data, and show the respective predictions (preferably showing the original question text and not the encoded integer sequence).

**Bonus:**
If you implement any of the following ideas, you get a bonus. Additionally, you get to learn more, which is better than the bonus:

1. Build a baseline model to compare and benchmark your deep learning model against. You can in this case use a classical machine learning model from `scikit-learn`. A starter example is here: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
2. Use W&B or TensorBoard to visualize your training. A TensorBoard tutorial is available here: https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html
3. Achieve a relatively good performance on the task.
4. Use a learning rate scheduler to improve training. (https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)
5. Use pre-trained embeddings correctly.
6. Use an advanced loss suitable for a Siamese network.
7. Implement any relevant new idea or discuss an interesting insight about the task.
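
As one example of a loss suited to Siamese networks (bonus idea 6), the sketch below implements a common formulation of the contrastive loss on the two question representations. The function name `contrastive_loss`, the `margin` value and the 0.5 scaling are assumptions of this example; other variants exist, and this is not necessarily the formulation used in the lecture linked in the hints.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(u, v, labels, margin=1.0):
    """Pull duplicate pairs (label 1) together and push others at least margin apart."""
    distance = F.pairwise_distance(u, v)  # Euclidean distance per pair, shape B
    positive = labels * distance.pow(2)
    negative = (1 - labels) * torch.clamp(margin - distance, min=0).pow(2)
    return 0.5 * (positive + negative).mean()


# Hypothetical question representations of size 4.
u = torch.zeros(2, 4)
same = torch.zeros(2, 4)
far = torch.full((2, 4), 10.0)

print(contrastive_loss(u, same, torch.ones(2)))   # close to 0: duplicates that coincide
print(contrastive_loss(u, far, torch.zeros(2)))   # 0: non-duplicates beyond the margin
print(contrastive_loss(u, same, torch.zeros(2)))  # about 0.5: non-duplicates that coincide
```

With this loss the model outputs representations rather than a logit, so a distance threshold is needed at prediction time to decide whether a pair counts as a duplicate.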