<a href="https://colab.research.google.com/github/Benjamin-Ojo/Quora-Insincere-Question-Classifier/blob/main/2.%20Notebooks/quora_insincere_question_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Quora Insincere Question Classifier**
---

# **Introduction** 
--- 




## Background: 

In today's digital age, social platforms have become hubs for information sharing and community engagement. Quora is one such platform: it lets users ask questions and receive answers from a diverse community. However, as with any open forum, it can be misused by users who pose insincere or deceptive questions.

The classification of insincere questions is a significant challenge in natural language processing (NLP). It requires the ability to discern the underlying intent and identify questions that may be misleading, inflammatory, or offensive. Accurately detecting and categorizing these insincere questions is crucial to maintaining the quality and credibility of a platform like Quora.

In this project, we delve into the task of insincere question classification on Quora, using machine learning and NLP techniques. Our objective is to develop a robust and efficient model that can automatically differentiate between sincere and insincere questions.

## Dataset:

For this task, we will use the Quora Insincere Questions Classification dataset, which is publicly available on Kaggle. This dataset comprises a large collection of questions from Quora, along with corresponding labels indicating whether each question is sincere or insincere. The dataset is annotated by human reviewers, providing valuable ground truth for training and evaluation purposes.

The dataset is divided into a training set and a testing set. The data contains the following columns (the testing set omits **target**):

1. **qid**: A unique identifier for each question in the dataset.
2. **question_text**: The full text of a Quora question. 
3. **target**: The label indicating whether a question is insincere (`1`) or sincere (`0`).
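
As a quick sanity check on the **target** column, the share of each label can be inspected with pandas. The sketch below runs on a tiny made-up frame (the rows are invented for illustration and are not taken from the dataset), and the helper name `label_share` is hypothetical. It assumes that `1` marks an insincere question.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are invented for illustration only.
toy_df = pd.DataFrame({
    "qid": ["a1", "a2", "a3", "a4"],
    "question_text": ["q one?", "q two?", "q three?", "q four?"],
    "target": [0, 0, 0, 1],
})

def label_share(df: pd.DataFrame, label_col: str = "target") -> dict:
    """Return the fraction of rows carrying each label value."""
    return df[label_col].value_counts(normalize=True).to_dict()

shares = label_share(toy_df)
print(shares)
```

The same call on the real training frame would show how balanced the two classes are, which matters when choosing evaluation metrics.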

## Approach: 

To tackle this problem, we will adopt a supervised learning approach. We will explore various NLP techniques to build a classification model that can effectively distinguish between sincere and insincere questions. This will involve several key steps:

1. Data Collection: The Quora Insincere Questions Classification dataset is collected and downloaded from Kaggle. The dataset will be imported and processed using Google Colab. The data structures and features will be explored to gain a better understanding of the dataset.

2. Data Preprocessing: Text data is preprocessed by tokenization, lowercasing, and removal of stop words and punctuation. Techniques like stemming or lemmatization may be applied for further normalization.

3. Feature Extraction: The preprocessed text data will be transformed into numerical representations suitable for machine learning algorithms. For this project, we will be using word embeddings, such as Word2Vec or GloVe, to convert the text into dense vector representations that capture semantic relationships between words.

4. Model Selection and Training: For this project, we will explore various NLP models suitable for insincere question classification, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), or transformer models like BERT. These models have shown promising results in NLP tasks and can capture complex patterns and dependencies in text data. We will select the most appropriate model based on its performance on the validation dataset and train it using the labeled training dataset.

5. Model Evaluation: The trained NLP model will be evaluated using appropriate evaluation metrics, such as accuracy, precision, recall, and F1-score. The performance of the model will be assessed on the test dataset.
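
The metrics named in step 5 can be computed directly from the true and predicted labels. The following is a minimal pure-Python sketch, assuming label `1` (insincere) is treated as the positive class. The function name `binary_metrics` and the toy labels are hypothetical; in practice a library such as scikit-learn would likely be used instead.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, invented for illustration.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy input accuracy is higher than precision, recall and F1, which illustrates why accuracy alone can look flattering when one class dominates.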



## Importing Packages:

In [2]:
# Data manipulation packages. 
import pandas as pd
import numpy as np

# Data visualization packages.
import matplotlib.pyplot as plt
import seaborn as sns

# File manager packages.
import os
import shutil
from zipfile import ZipFile
from google.colab import files
from IPython.display import display

# TensorFlow packages.
import tensorflow as tf
from tensorflow import keras
from keras.layers import Flatten, Conv1D, MaxPooling1D, Bidirectional, LSTM, RNN, GRU, Dropout
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, Model

# Other packages.
from gensim.models import KeyedVectors
%matplotlib inline

# **Data Collection**
---

For this project we will be working in a Kaggle notebook. Since the data is already available in that environment, we only need to load the dataset from the Kaggle input directory.

To help with downloading this dataset from Kaggle for use in Colab or on a local system, I provide the code below.

## Kaggle Data Import:

The code below is only for Colab notebooks.

In [None]:
# Defining data directory.
!mkdir '1. Dataset'

In [None]:
# Install kaggle api with Pip. 
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Uploading Kaggle api token key.
files.upload()

In [None]:
# Changing api token location.
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/


In [None]:
# set the appropriate permissions 
!chmod 600 ~/.kaggle/kaggle.json

# verify api key.
!kaggle datasets list

ref                                                       title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
arnabchaki/data-science-salaries-2023                     Data Science Salaries 2023 💸                         25KB  2023-04-13 09:55:16          21378        591  1.0              
fatihb/coffee-quality-data-cqi                            Coffee Quality Data (CQI May-2023)                   22KB  2023-05-12 13:06:39           2185         52  1.0              
anxods/spotify-top-50-playlist-songs-anxods               Spotify Top 50 Playlist Songs | @anxods              62KB  2023-05-27 11:39:30            783         26  0.9411765        
darshanprabhu09/stock-prices-for                          Stock prices of Amazon , Microso

In [None]:
# Downloading dataset. 
!kaggle competitions download 'quora-insincere-questions-classification'

Downloading quora-insincere-questions-classification.zip to /content
100% 6.02G/6.03G [01:19<00:00, 94.5MB/s]
100% 6.03G/6.03G [01:19<00:00, 81.4MB/s]


In [None]:
# Unzip folder. 
! unzip quora-insincere-questions-classification.zip -d '1. Dataset'

Archive:  quora-insincere-questions-classification.zip
  inflating: 1. Dataset/embeddings.zip  
  inflating: 1. Dataset/sample_submission.csv  
  inflating: 1. Dataset/test.csv     
  inflating: 1. Dataset/train.csv    


We will now import and load our training and testing datasets.

In [3]:
# Checking input folder. 

## Defining a path exploration function. 

def path_exploral(dir_path:str): 
    for dirname, _, filenames in os.walk(dir_path):
        print(f"Directory name: {dirname}")
        print(f"File name: {filenames}\n")
    
        for filename in filenames:
            print(os.path.join(dirname, filename))
        print('\n\n')

## Checking the input folder. 
data_dir = '/content/1. Dataset'

path_exploral(data_dir)

Directory name: /content/1. Dataset
File name: ['sample_submission.csv', 'test.csv', 'train.csv']

/content/1. Dataset/sample_submission.csv
/content/1. Dataset/test.csv
/content/1. Dataset/train.csv



Directory name: /content/1. Dataset/wiki-news-300d-1M
File name: ['wiki-news-300d-1M.vec']

/content/1. Dataset/wiki-news-300d-1M/wiki-news-300d-1M.vec



Directory name: /content/1. Dataset/GoogleNews-vectors-negative300
File name: ['GoogleNews-vectors-negative300.bin']

/content/1. Dataset/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin



Directory name: /content/1. Dataset/glove.840B.300d
File name: ['glove.840B.300d.txt']

/content/1. Dataset/glove.840B.300d/glove.840B.300d.txt



Directory name: /content/1. Dataset/paragram_300_sl999
File name: ['README.txt', 'paragram_300_sl999.txt']

/content/1. Dataset/paragram_300_sl999/README.txt
/content/1. Dataset/paragram_300_sl999/paragram_300_sl999.txt





In [None]:
# Unzipping file. 

## Embedding zip directory.
emb_zip_dir = os.path.join(data_dir, 'embeddings.zip')

## Extracting embedding data file.
def unzip_folder(source_dir: str, destination_dir: str):
    with ZipFile(source_dir) as zip_dir:
        zip_dir.extractall(destination_dir)

unzip_folder(emb_zip_dir, data_dir)

## Checking dataset folder on updated files.
path_exploral(data_dir)

Directory name: /content/1. Dataset
File name: ['sample_submission.csv', 'embeddings.zip', 'test.csv', 'train.csv']

/content/1. Dataset/sample_submission.csv
/content/1. Dataset/embeddings.zip
/content/1. Dataset/test.csv
/content/1. Dataset/train.csv



Directory name: /content/1. Dataset/wiki-news-300d-1M
File name: ['wiki-news-300d-1M.vec']

/content/1. Dataset/wiki-news-300d-1M/wiki-news-300d-1M.vec



Directory name: /content/1. Dataset/GoogleNews-vectors-negative300
File name: ['GoogleNews-vectors-negative300.bin']

/content/1. Dataset/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin



Directory name: /content/1. Dataset/glove.840B.300d
File name: ['glove.840B.300d.txt']

/content/1. Dataset/glove.840B.300d/glove.840B.300d.txt



Directory name: /content/1. Dataset/paragram_300_sl999
File name: ['README.txt', 'paragram_300_sl999.txt']

/content/1. Dataset/paragram_300_sl999/README.txt
/content/1. Dataset/paragram_300_sl999/paragram_300_sl999.txt





In [None]:
# Deleting all zip files. 
!rm /content/*.zip
!rm /content/1.\ Dataset/*.zip


In [None]:
# Checking folder for update.
path_exploral(data_dir)

Directory name: /content/1. Dataset
File name: ['sample_submission.csv', 'test.csv', 'train.csv']

/content/1. Dataset/sample_submission.csv
/content/1. Dataset/test.csv
/content/1. Dataset/train.csv



Directory name: /content/1. Dataset/wiki-news-300d-1M
File name: ['wiki-news-300d-1M.vec']

/content/1. Dataset/wiki-news-300d-1M/wiki-news-300d-1M.vec



Directory name: /content/1. Dataset/GoogleNews-vectors-negative300
File name: ['GoogleNews-vectors-negative300.bin']

/content/1. Dataset/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin



Directory name: /content/1. Dataset/glove.840B.300d
File name: ['glove.840B.300d.txt']

/content/1. Dataset/glove.840B.300d/glove.840B.300d.txt



Directory name: /content/1. Dataset/paragram_300_sl999
File name: ['README.txt', 'paragram_300_sl999.txt']

/content/1. Dataset/paragram_300_sl999/README.txt
/content/1. Dataset/paragram_300_sl999/paragram_300_sl999.txt





## Data import & review:

In [3]:
#  Loading training and testing dataframe.

## File directories. 
train_dir = os.path.join(data_dir, 'train.csv')
test_dir = os.path.join(data_dir, 'test.csv')
sample_sub_dir = os.path.join(data_dir, 'sample_submission.csv')

## Importing files to dataframes.
train_df = pd.read_csv(train_dir)
test_df = pd.read_csv(test_dir)
sample_sub_df = pd.read_csv(sample_sub_dir)


In [4]:
# View data.

## Training dataset.
print("\t\t\t##### Training Dataset #####")
display(train_df.head(10))
print('\n\n')

## Testing dataset.
print("\t\t\t##### Testing Dataset #####")
display(test_df.head(10))
print('\n\n')

## Sample Submission dataset. 
print("\t\t\t##### Sample Submission Dataset #####")
display(sample_sub_df.head(10))
print('\n\n')

			##### Training Dataset #####


| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |
| 5 | 00004f9a462a357c33be | Is Gaza slowly becoming Auschwitz, Dachau or T... | 0 |
| 6 | 00005059a06ee19e11ad | Why does Quora automatically ban conservative ... | 0 |
| 7 | 0000559f875832745e2e | Is it crazy if I wash or wipe my groceries off... | 0 |
| 8 | 00005bd3426b2d0c8305 | Is there such a thing as dressing moderately, ... | 0 |
| 9 | 00006e6928c5df60eacb | Is it just me or have you ever been in this ph... | 0 |





			##### Testing Dataset #####


| | qid | question_text |
|---|---|---|
| 0 | 0000163e3ea7c7a74cd7 | Why do so many women become so rude and arroga... |
| 1 | 00002bd4fb5d505b9161 | When should I apply for RV college of engineer... |
| 2 | 00007756b4a147d2b0b3 | What is it really like to be a nurse practitio... |
| 3 | 000086e4b7e1c7146103 | Who are entrepreneurs? |
| 4 | 0000c4c3fbe8785a3090 | Is education really making good people nowadays? |
| 5 | 000101884c19f3515c1a | How do you train a pigeon to send messages? |
| 6 | 00010f62537781f44a47 | What is the currency in Langkawi? |
| 7 | 00012afbd27452239059 | What is the future for Pandora, can the busine... |
| 8 | 00014894849d00ba98a9 | My voice range is A2-C5. My chest voice goes u... |
| 9 | 000156468431f09b3cae | How much does a tutor earn in Bangalore? |





			##### Sample Submission Dataset #####


| | qid | prediction |
|---|---|---|
| 0 | 0000163e3ea7c7a74cd7 | 0 |
| 1 | 00002bd4fb5d505b9161 | 0 |
| 2 | 00007756b4a147d2b0b3 | 0 |
| 3 | 000086e4b7e1c7146103 | 0 |
| 4 | 0000c4c3fbe8785a3090 | 0 |
| 5 | 000101884c19f3515c1a | 0 |
| 6 | 00010f62537781f44a47 | 0 |
| 7 | 00012afbd27452239059 | 0 |
| 8 | 00014894849d00ba98a9 | 0 |
| 9 | 000156468431f09b3cae | 0 |







In [5]:
# Data structure

## Data columns. 
print("\t\t---------> Dataset Columns <---------")
print(f"Training Columns: \n\t{train_df.columns}\n")
print(f"Testing Columns: \n\t{test_df.columns}\n\n")

## Data Shape.
print("\t\t---------> Data Shape <---------")
print(f"Train data Shape: \n\tRows -> {train_df.shape[0]} \n\tColumns -> {train_df.shape[1]}\n")
print(f"Test data Shape: \n\tRows -> {test_df.shape[0]} \n\tColumns -> {test_df.shape[1]}\n\n")

## Null data. 
print("\t\t---------> Null Data <---------")
print(f"Number of train data null values: \n{train_df.isnull().sum()}\n")
print(f"Number of test data null values: \n{test_df.isnull().sum()}\n\n")

## Data info (DataFrame.info() prints directly and returns None, so it is called on its own line).
print("\t\t---------> Data Info <---------")
print("Train data info:")
train_df.info()
print("\nTest data info:")
test_df.info()

		---------> Dataset Columns <---------
Training Columns: 
	Index(['qid', 'question_text', 'target'], dtype='object')

Testing Columns: 
	Index(['qid', 'question_text'], dtype='object')


		---------> Data Shape <---------
Train data Shape: 
	Rows -> 1306122 
	Columns -> 3

Test data Shape: 
	Rows -> 375806 
	Columns -> 2


		---------> Null Data <---------
Number of train data null values: 
qid              0
question_text    0
target           0
dtype: int64

Number of test data null values: 
qid              0
question_text    0
dtype: int64


		---------> Data Info <---------
Train data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1306122 entries, 0 to 1306121
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   qid            1306122 non-null  object
 1   question_text  1306122 non-null  object
 2   target         1306122 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 29.9+ MB



# **Data Preprocessing**
---

In this phase of the project we will preprocess our data and import the pretrained word embeddings.
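
The text cleaning described in the approach (lowercasing, punctuation removal, tokenization and stop-word removal) could look roughly like the sketch below. It uses only the standard library; the `STOP_WORDS` set is a small hypothetical sample and the function name `preprocess` is an assumption, not the notebook's final pipeline.

```python
import string

# Small illustrative stop-word set (hypothetical); a real run would likely use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does", "to", "of", "in", "and", "or"}

# Map every ASCII punctuation character to a space so words stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text: str) -> list:
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("Why does velocity affect time?")
print(tokens)
```

One design caveat: `glove.840B.300d` is a cased vocabulary, so lowercasing changes which pretrained vectors a token can match. Whether to lowercase is worth checking against embedding coverage.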

I will use the pretrained GloVe embeddings to convert words to vectors. The embeddings were provided with our dataset, and we will use the provided version.
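
Once pretrained vectors are available as a word-to-vector mapping, they are typically arranged into a matrix whose row `i` holds the vector of the word with integer index `i`, so that an embedding layer can look them up. The sketch below shows this with toy 3-dimensional vectors. The names `build_embedding_matrix`, `toy_vectors` and `toy_word_index` are hypothetical, and row 0 is left as zeros on the assumption that index 0 is reserved for padding (as the Keras `Tokenizer` does).

```python
import numpy as np

def build_embedding_matrix(word_index: dict, vectors: dict, dim: int) -> np.ndarray:
    """Place each known word's vector at its index; unknown words keep a zero row."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:  # out-of-vocabulary words are skipped
            matrix[idx] = vec
    return matrix

# Toy vectors and indices, invented for illustration (real GloVe vectors have 300 dimensions).
toy_vectors = {
    "velocity": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "time": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
toy_word_index = {"velocity": 1, "time": 2, "zzzunknown": 3}

emb = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
print(emb.shape)
```

Counting the zero rows (other than row 0) gives a quick estimate of how much of the vocabulary the pretrained embeddings fail to cover.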

## Pretrained Embeddings: 

In [4]:
# Embedding directory.
glove_dir = os.path.join(data_dir, 'glove.840B.300d/glove.840B.300d.txt')
ggle_vec_dir = os.path.join(data_dir, 'GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin')

In [7]:
# Import pretrained Word2Vec embeddings.

## Load the pretrained word vectors
ggle_word_vec = KeyedVectors.load_word2vec_format(ggle_vec_dir, binary=True)

## Access the vector for a specific word
word_vector = ggle_word_vec['hello']

print(word_vector)


[-0.05419922  0.01708984 -0.00527954  0.33203125 -0.25       -0.01397705
 -0.15039062 -0.265625    0.01647949  0.3828125  -0.03295898 -0.09716797
 -0.16308594 -0.04443359  0.00946045  0.18457031  0.03637695  0.16601562
  0.36328125 -0.25585938  0.375       0.171875    0.21386719 -0.19921875
  0.13085938 -0.07275391 -0.02819824  0.11621094  0.15332031  0.09082031
  0.06787109 -0.0300293  -0.16894531 -0.20800781 -0.03710938 -0.22753906
  0.26367188  0.012146    0.18359375  0.31054688 -0.10791016 -0.19140625
  0.21582031  0.13183594 -0.03515625  0.18554688 -0.30859375  0.04785156
 -0.10986328  0.14355469 -0.43554688 -0.0378418   0.10839844  0.140625
 -0.10595703  0.26171875 -0.17089844  0.39453125  0.12597656 -0.27734375
 -0.28125     0.14746094 -0.20996094  0.02355957  0.18457031  0.00445557
 -0.27929688 -0.03637695 -0.29296875  0.19628906  0.20703125  0.2890625
 -0.20507812  0.06787109 -0.43164062 -0.10986328 -0.2578125  -0.02331543
  0.11328125  0.23144531 -0.04418945  0.10839844 -0.28

In [None]:
# Import pretrained GloVe embeddings.

## Define an empty GloVe dictionary mapping word -> vector.
glove_dict = {}
emb_dim = 300  # glove.840B.300d vectors have 300 dimensions.

## Read each word and its coefficients from the GloVe text file.
## Note: holding the full vocabulary in memory needs several GB of RAM.
with open(glove_dir, 'r', encoding='utf-8') as gd:
    for line in gd:
        values = line.rstrip().split(' ')
        # A few tokens in this file contain spaces, so take the last
        # `emb_dim` fields as the vector and everything before as the word.
        word = ' '.join(values[:-emb_dim])
        glove_dict[word] = np.asarray(values[-emb_dim:], dtype=np.float32)

## Testing: look up each word of one training question,
## skipping words that are missing from the GloVe vocabulary.
train_words = train_df['question_text'][3]
vectors = [glove_dict[word] for word in train_words.split() if word in glove_dict]
print(f'Text "{train_words}" --> {len(vectors)} vectors of size {emb_dim}')
