# quora-insincere-questions-classification-with-deep-learning

Reference: https://jovian.ai/learn/nautral-language-processing-zero-to-nlp/lesson/neural-networks-and-deep-learning

In [1]:
import os

In [2]:
IS_KAGGLE='KAGGLE_KERNEL_RUN_TYPE' in os.environ

In [4]:
if IS_KAGGLE:
    data_dir = '../input/quora-insincere-questions-classification'
    train_fname = data_dir + '/train.csv'
    test_fname = data_dir + '/test.csv'
    sub_fname = data_dir + '/sample_submission.csv'
else:
    os.environ['KAGGLE_CONFIG_DIR'] = '.'
    !kaggle competitions download -c quora-insincere-questions-classification -f train.csv -p data
    !kaggle competitions download -c quora-insincere-questions-classification -f test.csv -p data
    !kaggle competitions download -c quora-insincere-questions-classification -f sample_submission.csv -p data
    train_fname = 'data/train.csv.zip'
    test_fname = 'data/test.csv.zip'
    sub_fname = 'data/sample_submission.csv.zip' 

Downloading train.csv.zip to data
100% 54.9M/54.9M [00:00<00:00, 185MB/s]
Downloading test.csv.zip to data
100% 15.8M/15.8M [00:00<00:00, 105MB/s]
Downloading sample_submission.csv.zip to data
100% 4.09M/4.09M [00:00<00:00, 109MB/s]


In [5]:
import pandas as pd

In [7]:
raw_df=pd.read_csv(train_fname)
test_df=pd.read_csv(test_fname)
sub_df=pd.read_csv(sub_fname)
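
`pd.read_csv` can read the downloaded `.csv.zip` files directly because pandas infers the compression from the file extension (its default is `compression='infer'`) when the archive holds a single file. A minimal sketch, using a hypothetical one-row `toy.csv.zip` written to a temporary directory rather than the competition data:

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical archive name and contents, for illustration only
    zip_path = os.path.join(tmp_dir, 'toy.csv.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr(
            'toy.csv',
            'qid,question_text,target\na1,Is foam insulation toxic?,0\n',
        )
    # No explicit compression argument: pandas infers 'zip' from the extension
    toy_df = pd.read_csv(zip_path)

print(toy_df.shape)
print(list(toy_df.columns))
```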

In [8]:
raw_df

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0
...,...,...,...
1306117,ffffcc4e2331aaf1e41e,What other technical skills do you need as a c...,0
1306118,ffffd431801e5a2f4861,Does MS in ECE have good job prospects in USA ...,0
1306119,ffffd48fb36b63db010c,Is foam insulation toxic?,0
1306120,ffffec519fa37cf60c78,How can one start a research project based on ...,0


In [9]:
test_df

Unnamed: 0,qid,question_text
0,0000163e3ea7c7a74cd7,Why do so many women become so rude and arroga...
1,00002bd4fb5d505b9161,When should I apply for RV college of engineer...
2,00007756b4a147d2b0b3,What is it really like to be a nurse practitio...
3,000086e4b7e1c7146103,Who are entrepreneurs?
4,0000c4c3fbe8785a3090,Is education really making good people nowadays?
...,...,...
375801,ffff7fa746bd6d6197a9,How many countries listed in gold import in in...
375802,ffffa1be31c43046ab6b,Is there an alternative to dresses on formal p...
375803,ffffae173b6ca6bfa563,Where I can find best friendship quotes in Tel...
375804,ffffb1f7f1a008620287,What are the causes of refraction of light?


In [10]:
sub_df

Unnamed: 0,qid,prediction
0,0000163e3ea7c7a74cd7,0
1,00002bd4fb5d505b9161,0
2,00007756b4a147d2b0b3,0
3,000086e4b7e1c7146103,0
4,0000c4c3fbe8785a3090,0
...,...,...
375801,ffff7fa746bd6d6197a9,0
375802,ffffa1be31c43046ab6b,0
375803,ffffae173b6ca6bfa563,0
375804,ffffb1f7f1a008620287,0


In [11]:
if IS_KAGGLE:
  sample_df=raw_df
else:
  # Outside Kaggle, use only the first 10,000 rows
  sample_df=raw_df[:10_000]

In [12]:
sample_df

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0
...,...,...,...
9995,01f3d4c0c7566f2f7f1c,Where can one find an online video which demon...,0
9996,01f3e0e7c52adb6d84f6,"Can someone be ""emotionally logic""?",0
9997,01f3ebd3f7bfac05eb37,What are you using for text messaging?,0
9998,01f3ed6a3313dfc76999,How much ml is 16 oz?,0


## Prepare data for training



Outline:
- Convert text to TF-IDF Vectors
- Split training & validation set
- Convert to PyTorch tensors

In [14]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [17]:
stemmer=SnowballStemmer(language='english')
def tokenize(text):
  # Split the text into word tokens, then reduce each token to its stem
  return [stemmer.stem(token) for token in word_tokenize(text)]
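
To see what stemming does to individual tokens, here is a small sketch that applies the same `SnowballStemmer` to two words. It splits on whitespace with `str.split` instead of `word_tokenize` (an assumption made so the snippet runs without the `punkt` download); `toy_stemmer` and `toy_tokenize` are hypothetical names, not part of the notebook.

```python
from nltk.stem import SnowballStemmer

toy_stemmer = SnowballStemmer(language='english')

def toy_tokenize(text):
    # Lowercase, split on whitespace (a simplification), and stem each token
    return [toy_stemmer.stem(token) for token in text.lower().split()]

toy_tokens = toy_tokenize('Running questions')
print(toy_tokens)  # ['run', 'question']
```

Stemming maps inflected forms such as "running" and "run" to one token, so they share a single TF-IDF column.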

In [18]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [19]:
english_stopwords=stopwords.words('english')

In [21]:
vectorizer=TfidfVectorizer(
    lowercase=True,
    tokenizer=tokenize,
    stop_words=english_stopwords,
    max_features=1000
)

In [22]:
%%time
vectorizer.fit(sample_df.question_text)



CPU times: user 2.45 s, sys: 11 ms, total: 2.46 s
Wall time: 2.57 s


TfidfVectorizer(max_features=1000,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function tokenize at 0x7f79872ae9d0>)

In [23]:
%%time
inputs=vectorizer.transform(sample_df.question_text)

CPU times: user 2.9 s, sys: 11.8 ms, total: 2.91 s
Wall time: 3.29 s


In [24]:
inputs.shape

(10000, 1000)
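
The shape `(10000, 1000)` means one row per question and one column per retained vocabulary term (`max_features=1000`). A minimal sketch on three short questions taken from the tables above shows the same relationship at small scale. It uses scikit-learn's default tokenizer rather than the stemming `tokenize` function, and `toy_questions` and `toy_vectorizer` are hypothetical names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_questions = [
    'Is foam insulation toxic?',
    'Who are entrepreneurs?',
    'Is education really making good people nowadays?',
]

# Keep only the 5 most frequent terms across the toy corpus
toy_vectorizer = TfidfVectorizer(lowercase=True, max_features=5)
toy_matrix = toy_vectorizer.fit_transform(toy_questions)

# (rows, columns) = (number of questions, number of kept terms)
print(toy_matrix.shape)
# The terms that survived the max_features cut
print(sorted(toy_vectorizer.vocabulary_))
```

The result is a SciPy sparse matrix, which is why it has to be densified before it can become a PyTorch tensor.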

In [25]:
targets=sample_df.target

In [27]:
targets.shape

(10000,)

In [29]:
%%time
test_inputs=vectorizer.transform(test_df.question_text)

CPU times: user 1min 23s, sys: 295 ms, total: 1min 23s
Wall time: 1min 25s


## Split training and validation sets

In [30]:
from sklearn.model_selection import train_test_split

In [31]:
train_inputs,val_inputs,train_targets,val_targets=train_test_split(inputs,targets,random_state=42,shuffle=True,test_size=0.3)

In [32]:
train_inputs.shape, val_inputs.shape

((7000, 1000), (3000, 1000))

In [33]:
train_targets.shape , val_targets.shape

((7000,), (3000,))
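
The split above is purely random. If one class is rare, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The notebook does not use it; the sketch below applies it to synthetic labels (10 positives out of 100, a ratio assumed only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 10% positive labels (illustrative only)
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([1] * 10 + [0] * 90)

toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X,
    toy_y,
    test_size=0.3,
    random_state=42,
    stratify=toy_y,  # preserve the 10% positive rate in both splits
)

print(toy_y_train.sum(), toy_y_val.sum())  # 7 3
```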

## Convert to PyTorch tensors

In [35]:
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn.functional as F

In [None]:
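# Densify the sparse TF-IDF matrix, convert it to a float tensor, and
# L2-normalize each column (dim=0 is an assumed choice)
train_tensors=F.normalize(torch.tensor(train_inputs.toarray()).float(),dim=0)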
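
A minimal sketch of how the conversion could proceed end to end, on a tiny stand-in for the sparse TF-IDF matrix. Everything prefixed with `toy_` is hypothetical, and normalizing along `dim=0` and the batch size of 2 are assumptions, not choices confirmed by the notebook:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix
from torch.utils.data import TensorDataset, DataLoader

# Stand-ins for the sparse TF-IDF inputs and the 0/1 targets
toy_inputs = csr_matrix(np.array([[0.0, 3.0], [4.0, 0.0], [0.0, 0.0], [3.0, 4.0]]))
toy_targets = np.array([0, 1, 0, 1])

# Sparse matrix -> dense array -> float tensor, then L2-normalize each column
toy_input_tensors = F.normalize(torch.tensor(toy_inputs.toarray()).float(), dim=0)
# For a pandas Series such as train_targets, pass its .values instead
toy_target_tensors = torch.tensor(toy_targets).float()

# Pair inputs with targets and iterate over them in batches
toy_ds = TensorDataset(toy_input_tensors, toy_target_tensors)
toy_dl = DataLoader(toy_ds, batch_size=2, shuffle=False)

for batch_inputs, batch_targets in toy_dl:
    print(batch_inputs.shape, batch_targets.shape)
```

Note that `toarray()` materializes the full dense matrix, so memory use grows with rows × features.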