# **Project: Quora Insincere Question Classifier**
---

# **Introduction** 
--- 




## Background: 

In today's digital age, social platforms have become hubs for information sharing and community engagement. Quora, one such platform, lets users ask questions and receive answers from a diverse community. However, as with any open forum, there is potential for misuse, where users may pose insincere or deceptive questions.

The classification of insincere questions is a significant challenge in natural language processing (NLP). It requires the ability to discern the underlying intent and identify questions that may be misleading, inflammatory, or offensive. Accurately detecting and categorizing these insincere questions is crucial to maintaining the quality and credibility of a platform like Quora.

In this project, we delve into the task of insincere question classification on Quora, using machine learning and NLP techniques. Our objective is to develop a robust and efficient model that can automatically differentiate between sincere and insincere questions.

## Dataset:

For this task, we will use the Quora Insincere Questions Classification dataset, which is publicly available on Kaggle. This dataset comprises a large collection of questions from Quora, along with corresponding labels indicating whether each question is sincere or insincere. The dataset is annotated by human reviewers, providing valuable ground truth for training and evaluation purposes.

Our dataset is divided into a training set and a testing set. The data contains the following columns:

1. **qid**: A unique identifier for each question in our datasets.
2. **question_text**: The full text of a Quora question. 
3. **target**: The label indicating whether a question is insincere (1) or sincere (0).
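
To make the column layout concrete, here is a minimal sketch that parses a few rows shaped like the three columns above and counts the labels. The sample rows, the IDs, and the helper name `label_counts` are all invented for illustration, and the sketch assumes that a `target` of 1 marks an insincere question. It uses only the standard library so it can run without the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical rows that mimic the three columns described above.
# The IDs and questions are invented for illustration only.
SAMPLE_CSV = """qid,question_text,target
a1,How do I learn Python quickly?,0
a2,What is the capital of France?,0
a3,Why are all people from X so stupid?,1
"""


def label_counts(csv_text: str) -> Counter:
    """Count how many rows carry each target label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["target"]) for row in reader)


counts = label_counts(SAMPLE_CSV)
print(counts)
```

Running the same kind of count on the real training file is a quick way to see how balanced the two classes are before choosing evaluation metrics.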

## Approach:

To tackle this problem, we will adopt a supervised learning approach. We will explore various NLP techniques to build a classification model that can effectively distinguish between sincere and insincere questions. This will involve several key steps:

1. Data Collection: The Quora Insincere Questions Classification dataset is collected and downloaded from Kaggle. The dataset will be imported and processed using Google Colab. The data structures and features will be explored to gain a better understanding of the dataset.

2. Data Preprocessing: The text data will be preprocessed through tokenization, lowercasing, and removal of stop words and punctuation. Stemming or lemmatization may be applied for further normalization.

3. Feature Extraction: The preprocessed text data will be transformed into numerical representations suitable for machine learning algorithms. For this project, we will be using word embeddings, such as Word2Vec or GloVe, to convert the text into dense vector representations that capture semantic relationships between words.

4. Model Selection and Training: For this project, we will explore various NLP models suitable for insincere question classification, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), or transformer models like BERT. These models have shown promising results in NLP tasks and can capture complex patterns and dependencies in text data. We will select the most appropriate model based on its performance on the validation dataset and train it using the labeled training dataset.

5. Model Evaluation: The trained NLP model will be evaluated using appropriate evaluation metrics, such as accuracy, precision, recall, and F1-score. The performance of the model will be assessed on the test dataset.
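
Steps 2 and 5 above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the pipeline used later in the notebook: the stop-word list is a tiny made-up set, tokenization is a plain whitespace split, and the helper names `preprocess` and `precision_recall_f1` are hypothetical.

```python
import string

# A tiny illustrative stop-word list (an assumption for this sketch);
# a real pipeline would more likely use a curated list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "do", "i", "why"}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


tokens = preprocess("Why is the sky blue?")

# Made-up labels and predictions, with 1 standing for "insincere".
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(tokens, precision, recall, f1)
```

If insincere questions turn out to be a small minority of the data, accuracy alone can look high for a model that never flags anything, which is why precision, recall, and F1 on the insincere class are worth tracking alongside it.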



## Importing Packages:

In [None]:
# Data manipulation packages.
import pandas as pd
import numpy as np

# Data visualization packages.
import matplotlib.pyplot as plt
import seaborn as sns

# File manager packages.
import os
import shutil
from zipfile import ZipFile
from google.colab import files

# TensorFlow packages.
import tensorflow as tf
from tensorflow import keras
from keras.layers import Flatten, Conv1D, MaxPooling1D, Bidirectional, LSTM, RNN, GRU, Dropout
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, Model

# Other packages.
%matplotlib inline

# **Data Collection**
---

For this project we will be working from a Kaggle notebook, and since the data is already available in that environment, we only need to load the dataset from the Kaggle input directory.

To download this dataset from Kaggle for use in Colab or on a local system, I have provided the code below.

## Kaggle Data Import:

The code below is intended for Colab notebooks only.

In [None]:
# Defining data directory.
!mkdir '1. Dataset'

In [None]:
# Install the Kaggle API with pip.
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Uploading Kaggle api token key.
files.upload()

In [None]:
# Changing api token location.
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/


mkdir: cannot create directory ‘/root/.kaggle’: File exists
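
The "File exists" message above is harmless, but it shows that the cell is not safe to re-run as written. As a hedged alternative, the sketch below does the same token setup in Python in a way that can be repeated without errors. The function name `install_kaggle_token` is hypothetical, and the sketch assumes the token was uploaded as `kaggle.json` into the current working directory.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path: str, kaggle_dir: str) -> Path:
    """Copy a Kaggle API token into kaggle_dir and restrict its permissions."""
    target_dir = Path(kaggle_dir).expanduser()
    # exist_ok=True makes this a no-op when the directory already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Owner read/write only, the same effect as `chmod 600`.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target


# Usage in Colab, after files.upload() has saved kaggle.json locally:
# install_kaggle_token("kaggle.json", "~/.kaggle")
```

Because the directory creation tolerates an existing folder, running the cell twice leaves the token in place instead of printing an error.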


In [None]:
# set the appropriate permissions 
!chmod 600 ~/.kaggle/kaggle.json

# verify api key.
!kaggle datasets list

ref                                                       title                                          size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------  --------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
arnabchaki/data-science-salaries-2023                     Data Science Salaries 2023 💸                   25KB  2023-04-13 09:55:16          19436        545  1.0              
fatihb/coffee-quality-data-cqi                            Coffee Quality Data (CQI May-2023)             22KB  2023-05-12 13:06:39           1414         43  1.0              
ashpalsingh1525/imdb-movies-dataset                       IMDB movies dataset                             3MB  2023-04-28 23:18:15           2103         45  1.0              
iammustafatz/diabetes-prediction-dataset                  Diabetes prediction dataset                   734KB  2023-0

In [None]:
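# Downloading dataset.
# Note: 'quora-question-pairs' is the Quora Question Pairs competition; the
# Insincere Questions competition slug is 'quora-insincere-questions-classification'.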
!kaggle competitions download 'quora-question-pairs'

Downloading quora-question-pairs.zip to /content
 96% 297M/309M [00:01<00:00, 252MB/s]
100% 309M/309M [00:01<00:00, 227MB/s]


In [None]:
# Unzip folder. 
! unzip quora-question-pairs.zip -d '1. Dataset'

Archive:  quora-question-pairs.zip
  inflating: 1. Dataset/sample_submission.csv.zip  
  inflating: 1. Dataset/test.csv     
  inflating: 1. Dataset/test.csv.zip  
  inflating: 1. Dataset/train.csv.zip  


We will now inspect the downloaded files before loading our training and testing datasets.

In [None]:
# Checking input folder. 

## Defining a path exploration function.

def path_exploral(dir_path: str):
    for dirname, _, filenames in os.walk(dir_path):
        print(f"Directory name: {dirname}")
        print(f"File name: {filenames}\n\n")
    
        for filename in filenames:
            print(os.path.join(dirname, filename))

## Checking the input folder. 
data_dir = '/content/1. Dataset'

path_exploral(data_dir)

Directory name: /content/1. Dataset
File name: ['test.csv', 'test.csv.zip', 'sample_submission.csv.zip', 'train.csv.zip']


/content/1. Dataset/test.csv
/content/1. Dataset/test.csv.zip
/content/1. Dataset/sample_submission.csv.zip
/content/1. Dataset/train.csv.zip


In [None]:
# Unzipping file. 

## Training zip directory.
train_zip_dir = os.path.join(data_dir,'train.csv.zip' )

## Extracting training data file
def unzip_folder(source_dir: str, destination_dir: str):
    with ZipFile(source_dir) as zip_dir:
        zip_dir.extractall(destination_dir)

unzip_folder(train_zip_dir, data_dir)

## Testing and sample directory.
testing_zip_dir = os.path.join(data_dir, 'test.csv.zip')
sample_zip_dir = os.path.join(data_dir, 'sample_submission.csv.zip')

## Extracting testing and sample file. 
unzip_folder(testing_zip_dir, data_dir)
unzip_folder(sample_zip_dir, data_dir)

## Checking dataset folder on updated files.
path_exploral(data_dir)

Directory name: /content/1. Dataset
File name: ['test.csv', 'test.csv.zip', 'sample_submission.csv.zip', 'sample_submission.csv', 'train.csv', 'train.csv.zip']


/content/1. Dataset/test.csv
/content/1. Dataset/test.csv.zip
/content/1. Dataset/sample_submission.csv.zip
/content/1. Dataset/sample_submission.csv
/content/1. Dataset/train.csv
/content/1. Dataset/train.csv.zip


In [None]:
# Deleting all zip files. 
!rm /content/*.zip
!rm /content/1.\ Dataset/*.zip


rm: cannot remove '/content/*.zip': No such file or directory
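
The `rm` message above appears because the shell glob matched no files in `/content`. A minimal sketch of the same cleanup in Python, using `pathlib`, simply does nothing when there is nothing to delete. The helper name `remove_zip_files` is hypothetical.

```python
from pathlib import Path


def remove_zip_files(directory: str) -> list[str]:
    """Delete every *.zip file directly inside directory; return the names removed."""
    removed = []
    for zip_path in sorted(Path(directory).glob("*.zip")):
        zip_path.unlink()
        removed.append(zip_path.name)
    return removed


# Usage with the directories from this notebook:
# remove_zip_files('/content')
# remove_zip_files('/content/1. Dataset')
```

Returning the removed names makes it easy to confirm which archives were cleaned up without listing the folder again.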


In [None]:
# Checking folder for update.
path_exploral(data_dir)

Directory name: /content/1. Dataset
File name: ['test.csv', 'sample_submission.csv', 'train.csv']


/content/1. Dataset/test.csv
/content/1. Dataset/sample_submission.csv
/content/1. Dataset/train.csv


## Loading Dataset:

In [None]:
#  Loading training and testing dataframe.

## File directories. 
train_dir = os.path.join(data_dir, 'train.csv')
test_dir = os.path.join(data_dir, 'test.csv')
sample_sub_dir = os.path.join(data_dir, 'sample_submission.csv')

## Importing files to dataframes.
train_df = pd.read_csv(train_dir)
test