# Classification of Questions

## Introduction

This notebook tackles a binary text-classification task in deep learning: labeling questions as sincere or insincere. The dataset contains over 250,000 training questions, each labeled as either sincere or insincere, which gives us ample data for training and testing our models.

### Objective

The primary goal is to implement and train models on this dataset and to evaluate how accurately they categorize the questions. Beyond applying deep learning models in practice, the task highlights how much careful data preprocessing and model tuning affect the results.

### Approach

We will explore the dataset, implement various models, and train them to classify questions effectively. The use of pre-trained embeddings is allowed, although the models themselves should be built and trained from scratch. Special attention will be paid to the preprocessing of text data, ensuring that every decision is data-driven and well-justified.

### Evaluation Metrics

Model performance will be evaluated with accuracy, precision, recall, and F1 score. The F1 score receives particular focus because the dataset is imbalanced: accuracy alone can look high for a model that rarely predicts the minority class.
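To make the point about imbalance concrete, here is a minimal, dependency-free sketch of how the four metrics are derived from confusion-matrix counts. The helper name `binary_metrics`, the toy labels, and the choice of label `1` as the rarer class are assumptions made for illustration; they do not come from the dataset. The scikit-learn functions imported later in the notebook (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) compute the same quantities.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("y_true and y_pred must not be empty")

    # Confusion-matrix counts for the positive class
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Fall back to 0.0 when a denominator is zero (no positive predictions or labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example (assumed data): 9 majority-class labels and 1 minority-class label
y_true = [0] * 9 + [1]
always_majority = [0] * 10  # a "model" that never predicts the minority class

scores = binary_metrics(y_true, always_majority)
print(scores)
```

In this toy case the always-majority predictor reaches 0.9 accuracy while its F1 score is 0.0, which is why F1 is the more informative headline metric here.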

### Tools and Technologies

- Programming Language: Python
- Main Libraries: PyTorch, Pandas, NumPy, Matplotlib, scikit-learn

This notebook aims to not only build a robust model for question classification but also to serve as a comprehensive guide through the process of model selection, training, and evaluation in a real-world deep learning application.

---

**Note**: This project is an academic exercise, part of the "Taller de Deep Learning" course. It is intended for educational purposes, focusing on applying deep learning techniques to a real-world problem. The dataset and the task are designed to provide hands-on experience in building and evaluating deep learning models, particularly in the field of Natural Language Processing (NLP).

---


## Environment Setup

### Setting Up the Conda Environment

To ensure consistency across different platforms and manage dependencies effectively, we will use a Conda environment. If you haven't already installed Conda, please follow the instructions from [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/products/distribution).

#### Setup Steps

1. **Create the Conda Environment**: Navigate to the root of the project where the `environment.yml` file is located and run the following command in your terminal:   
```bash
conda env create -f environment.yml
```
> This command will create a new Conda environment with all the necessary packages as specified in `environment.yml`.

2. **Activate the Environment**: Once the environment is created, activate it using:
```bash
conda activate question-classifier-nlp
```

### Importing Libraries

```python
# Standard library
import re
import time
from collections import Counter
from itertools import chain

# Third-party: data handling and plotting
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd

# Third-party: text processing
from bs4 import BeautifulSoup
from gensim.models.word2vec import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Third-party: classical ML utilities and metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Third-party: deep learning
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

# Download the NLTK resources used for stopword removal, lemmatization, and tokenization
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
```

### Checking the GPU

```python
# Use the first CUDA GPU when one is available; otherwise fall back to the CPU
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(DEVICE)
```

### Setting the Random Seed for Reproducibility

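```python
SEED = 43
np.random.seed(SEED)              # NumPy random number generator
torch.manual_seed(SEED)           # PyTorch CPU random number generator
torch.cuda.manual_seed_all(SEED)  # PyTorch random number generators on every visible GPU

# Ask cuDNN to use deterministic algorithms for reproducibility
torch.backends.cudnn.deterministic = True
```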
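The seeding above can also be wrapped in a reusable helper. The sketch below is one possible version, not part of the original notebook: the name `seed_everything` is hypothetical, and two steps are additions beyond the cell above, namely seeding Python's built-in `random` module and disabling the cuDNN autotuner (`torch.backends.cudnn.benchmark`), which may otherwise select different kernels from run to run. The imports of NumPy and PyTorch are guarded only so that the helper also runs in an environment where they are not installed.

```python
import importlib.util
import random


def seed_everything(seed: int) -> None:
    """Seed every random number generator the notebook may rely on (hypothetical helper)."""
    # Python's built-in generator (not seeded in the original cell)
    random.seed(seed)

    # NumPy, if it is installed
    if importlib.util.find_spec("numpy") is not None:
        import numpy as np
        np.random.seed(seed)

    # PyTorch, if it is installed
    if importlib.util.find_spec("torch") is not None:
        import torch
        torch.manual_seed(seed)
        # Safe to call without a GPU: the call is ignored when CUDA is unavailable
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        # Assumption: trade some speed for run-to-run consistency
        torch.backends.cudnn.benchmark = False


# Re-seeding with the same value reproduces the same random draws
seed_everything(43)
first_draw = random.random()
seed_everything(43)
second_draw = random.random()
print(first_draw == second_draw)
```

Even with these settings, some GPU operations can remain nondeterministic, so small differences between runs are still possible.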