## Importing libraries and reading the training and testing data

In [67]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt 
train_df = pd.read_csv("train_df.csv")
test_df = pd.read_csv("test_df.csv")

## Preprocessing training data

### Checking for null values, invalid target values and duplicated rows

In [68]:
train_df 

Unnamed: 0,qid,question_text,target
0,dda0b0efc8ba86e81ec4,What are interesting facts about Microsoft his...,0
1,dc708b74a108d0fc0ad9,What are those things which are not gonna happ...,0
2,06a27ec5d82dacd8bfe0,"What should I know to avoid being ""upsold"" whe...",0
3,00cbb6b17e3ceb7c5358,How I add any account with payment bank?,0
4,7c304888973a701585a0,Which Multi level marketing products are actua...,0
...,...,...,...
999995,4bd96088d0b5f0f2c4f4,How is CSE at VIT Chennai?,0
999996,e80edbfc086f7125940f,"How can we prevent a holocaust by robots, AI, ...",0
999997,1506dfad6bd340782a1f,How can I help a student remember key steps an...,0
999998,b56c60fd407f2f85553c,What is the difference between lace closure & ...,0


In [69]:
train_df.describe()

Unnamed: 0,target
count,1000000.0
mean,0.06187
std,0.240919
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [70]:
train_df.isna().sum()

qid              0
question_text    0
target           0
dtype: int64

In [71]:
train_df["target"].unique()

array([0, 1])
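Since `target` only takes the values 0 and 1, the mean of 0.06187 reported by `describe()` is also the share of questions labelled 1, so roughly 6.2% of the training rows are positive and the classes are heavily imbalanced. The sketch below shows the same relationship on a tiny made-up frame; `toy` and its values are assumptions for illustration, not rows from the real data.

```python
import pandas as pd

# Toy stand-in for train_df["target"]; the values are made up for illustration.
toy = pd.DataFrame({"target": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# For a 0/1 column the mean equals the fraction of rows labelled 1.
positive_share = toy["target"].mean()

# value_counts(normalize=True) reports the share of every class.
class_shares = toy["target"].value_counts(normalize=True)

print(positive_share)
print(class_shares)
```

Running the same two lines on `train_df["target"]` would give the class shares for the real data.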

In [72]:
train_df.index[train_df.duplicated()]

Int64Index([], dtype='int64')
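The three checks above (nulls, invalid target values, duplicated rows) could also be bundled into one helper so that the training and testing frames are checked the same way. This is only a sketch: `summarize_quality` and the `toy` frame are hypothetical names introduced here, and the toy rows are deliberately dirty so that each check finds something.

```python
import pandas as pd

def summarize_quality(df, target_col=None, valid_targets=(0, 1)):
    """Return counts of null cells, duplicated rows and invalid target values."""
    summary = {
        "null_cells": int(df.isna().sum().sum()),
        "duplicated_rows": int(df.duplicated().sum()),
    }
    if target_col is not None:
        # Count rows whose target is outside the allowed set of labels.
        invalid = ~df[target_col].isin(list(valid_targets))
        summary["invalid_targets"] = int(invalid.sum())
    return summary

# Made-up rows: one missing question, one duplicated row, one invalid label.
toy = pd.DataFrame({
    "qid": ["a1", "b2", "b2", "c3"],
    "question_text": ["how are you", "what is this", "what is this", None],
    "target": [0, 1, 1, 2],
})

report = summarize_quality(toy, target_col="target")
print(report)
```

For a frame without labels, such as the test set, the helper can be called without `target_col`.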

## Preprocessing testing data

### Checking for null values and duplicated rows

In [73]:
test_df

Unnamed: 0,qid,question_text
0,a4f3da3a3df9dd881edd,My period is due on my wedding day. How can I ...
1,9914c62ed3f69684d549,How many numbers higher than a million can be ...
2,8138ae48649e37091a91,"How come I feel nothing for my family, but sti..."
3,981b4753d17ef14d09f7,"In case of collapse of the Democratic party, w..."
4,452e2c705276ba16b7b7,Who is Émile Naoumoff?
...,...,...
306117,a352dff4fcc2571815ce,Did anyone get an update on Maruti Suzuki All ...
306118,ad4a8498d97c536c67b9,What 5 people in history do you find the most ...
306119,19784a27b55d4b453fda,How can I remove the tan on my forehead?
306120,370191dba26465997879,"If you are a well known hacker, will you be mo..."


In [74]:
test_df.isna().sum()

qid              0
question_text    0
dtype: int64

In [75]:
test_df.index[test_df.duplicated()]

Int64Index([], dtype='int64')

## Cleaning the text
When dealing with numerical data, data cleaning often involves removing null values and duplicate rows, handling outliers, and so on. Text data has its own set of common cleaning techniques, also known as text pre-processing techniques.

Cleaning text can go on forever, because every cleaning step has exceptions. So we follow the MVP (minimum viable product) approach: start simple and iterate. Many cleaning steps are possible; we apply only the common ones here and leave the rest for later iterations if the results need improving.

Common data cleaning steps on all text:

- Make text all lower case
- Remove punctuation
- Remove numerical values
- Remove common nonsensical text (such as `\n`)
- Tokenize text
- Remove stop words

More data cleaning steps after tokenization:

- Stemming / lemmatization
- Parts of speech tagging
- Create bi-grams or tri-grams
- Deal with typos
- And more...
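Of the later steps in the list, bi-grams and tri-grams are not used in this notebook, so here is a minimal pure-Python sketch of what they are: every run of `n` consecutive tokens. The helper name `ngrams` is an assumption made for this example, not a function from the notebook.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    # range() is empty when the list is shorter than n, so the result is [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["multi", "level", "marketing", "products"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

N-grams keep some word order that single tokens lose, for example `("multi", "level")` as one unit.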

### In round 1, we do the following:
1. Make the text lower case.
2. Remove text in square brackets.
3. Remove punctuation marks.
4. Remove words containing numbers.

In [76]:
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# The function can be passed to apply() directly; the alias keeps the later cells unchanged
round1 = clean_text_round1
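To see what round 1 does to a single string, the sketch below repeats the same four substitutions on a made-up sentence. The sample text is an assumption chosen to trigger every step; it is not taken from the dataset.

```python
import re
import string

def clean_text_round1(text):
    """Same four steps as the cell above, applied in the same order."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Made-up sample containing brackets, punctuation, capitals and digits.
sample = "What's [citation] the 2nd BEST way, in 2019?"
cleaned = clean_text_round1(sample)

# Removed pieces leave their surrounding spaces behind, so the result
# contains repeated and trailing spaces; split() shows the surviving words.
print(repr(cleaned))
print(cleaned.split())
```

Note that the apostrophe in "What's" is deleted rather than replaced by a space, so the word becomes "whats". The leftover extra spaces are harmless here because the later tokenization step splits on whitespace.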

In [77]:
train_df.question_text= train_df.question_text.apply(round1)
train_df.question_text

0         what are interesting facts about microsoft his...
1         what are those things which are not gonna happ...
2         what should i know to avoid being upsold when ...
3                   how i add any account with payment bank
4         which multi level marketing products are actua...
                                ...                        
999995                            how is cse at vit chennai
999996    how can we prevent a holocaust by robots ai or...
999997    how can i help a student remember key steps an...
999998    what is the difference between lace closure  l...
999999      what happens when you look into a broken mirror
Name: question_text, Length: 1000000, dtype: object

### In round 2, we do the following:
1. Remove additional punctuation (curly quotes and ellipses).
2. Remove nonsensical text (newline characters).

In [78]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    # Curly quotes and ellipses are not in string.punctuation, so round 1 leaves them in
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2
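Round 2 is needed because `string.punctuation` contains only ASCII symbols, so the round 1 punctuation pattern never matches typographic characters such as curly quotes. The sketch below checks this on a made-up sample; the sample string is an assumption for illustration.

```python
import re
import string

# Made-up sample with curly quotes, an ellipsis character and a newline.
sample = "“smart” people…\n"

# None of the typographic characters are part of string.punctuation.
leftover = [ch for ch in "‘’“”…" if ch in string.punctuation]

# The round 1 punctuation pattern therefore leaves the sample untouched.
after_round1_pattern = re.sub('[%s]' % re.escape(string.punctuation), '', sample)

def clean_text_round2(text):
    """Same two substitutions as the cell above."""
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

cleaned = clean_text_round2(sample)
print(leftover)
print(repr(cleaned))
```

Because the newline is replaced by an empty string rather than a space, words on either side of a line break would be joined together; that matters only for questions that actually contain line breaks.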

In [79]:
train_df.question_text= train_df.question_text.apply(round2)
train_df.question_text

0         what are interesting facts about microsoft his...
1         what are those things which are not gonna happ...
2         what should i know to avoid being upsold when ...
3                   how i add any account with payment bank
4         which multi level marketing products are actua...
                                ...                        
999995                            how is cse at vit chennai
999996    how can we prevent a holocaust by robots ai or...
999997    how can i help a student remember key steps an...
999998    what is the difference between lace closure  l...
999999      what happens when you look into a broken mirror
Name: question_text, Length: 1000000, dtype: object

### Tokenization
Tokenization is the process of segmenting running text into sentences and words. In essence, it is the task of cutting a text into pieces called tokens.

Here we use a word tokenizer, so each word is a token.

In [81]:
# Tokenization with NLTK's word tokenizer
import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize

def tokenization(text):
    '''Split a cleaned question into a list of word tokens.'''
    return word_tokenize(text)

# Apply the tokenizer to every question
train_df['question_text'] = train_df['question_text'].apply(tokenization)

[nltk_data] Downloading package punkt to
[nltk_data]     /home/karanjitsaha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
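As a rough picture of what word tokenization produces, here is a regex-based stand-in that needs no downloaded models. It is not NLTK's `word_tokenize`, and the name `simple_word_tokenize` is an assumption for this sketch. On already-cleaned text the two mostly agree, but NLTK applies extra rules: the stop-word output further down shows "gonna" split into "gon" and "na", which a plain regex split would not do.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters (a stand-in, not NLTK's word_tokenize)."""
    return re.findall(r"\w+", text)

tokens = simple_word_tokenize("how is cse at vit chennai")

# Repeated spaces left over from cleaning do not produce empty tokens.
spaced = simple_word_tokenize("lace closure  lace frontals")

print(tokens)
print(spaced)
```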


### Stop Words removal
Stop words are very common words (such as "what", "are" and "about") that carry little information for many computational tasks and can add noise, so they are usually removed.

In [82]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Use a separate name so the imported stopwords module is not overwritten,
# and a set so each membership test is fast
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    '''Drop every token that is an English stop word.'''
    return [token for token in tokens if token not in stop_words]

# Apply the function to every tokenized question
train_df['question_text'] = train_df['question_text'].apply(remove_stopwords)
train_df

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/karanjitsaha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,qid,question_text,target
0,dda0b0efc8ba86e81ec4,"[interesting, facts, microsoft, history]",0
1,dc708b74a108d0fc0ad9,"[things, gon, na, happen, ever]",0
2,06a27ec5d82dacd8bfe0,"[know, avoid, upsold, getting, car, brakes, ch...",0
3,00cbb6b17e3ceb7c5358,"[add, account, payment, bank]",0
4,7c304888973a701585a0,"[multi, level, marketing, products, actually, ...",0
...,...,...,...
999995,4bd96088d0b5f0f2c4f4,"[cse, vit, chennai]",0
999996,e80edbfc086f7125940f,"[prevent, holocaust, robots, ai, aliens]",0
999997,1506dfad6bd340782a1f,"[help, student, remember, key, steps, informat...",0
999998,b56c60fd407f2f85553c,"[difference, lace, closure, lace, frontals]",0
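The filtering itself is just a membership test per token. The sketch below reproduces the first row of the table with a small hand-picked stop-word set; `TOY_STOP_WORDS` is an assumption for illustration, and NLTK's English list is much longer.

```python
# Small hand-picked stop-word set for illustration only.
TOY_STOP_WORDS = {"what", "are", "about", "how", "is", "at", "i", "with", "any"}

def remove_stopwords(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

tokens = ["what", "are", "interesting", "facts", "about", "microsoft", "history"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Storing the stop words in a set rather than a list keeps each lookup cheap, which adds up over a million questions.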


In [83]:
# import nltk
# from nltk.stem import PorterStemmer
# ps = PorterStemmer()

In [84]:
# train_df['question_text'] = train_df['question_text'].apply(lambda x: [ps.stem(y) for y in x]) # Stem every word.
# # train_df = train_df.drop(columns=['question_text']) # Get rid of the unstemmed column.


### Lemmatization
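Lemmatization uses morphological analysis to reduce a word to its dictionary form, or “lemma”, which is usually more accurate than the suffix stripping done by stemming. NLTK's `WordNetLemmatizer` treats each word as a noun unless a part-of-speech tag is passed.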

In [86]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
#defining the object for Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
#defining the function for lemmatization
def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text
train_df['question_text']=train_df['question_text'].apply(lambda x:lemmatizer(x))
# train_df

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/karanjitsaha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
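To illustrate the difference between stemming and lemmatization without downloading WordNet, here is a toy sketch. Both helpers are assumptions made for this example: `TOY_LEMMAS` is a tiny hand-written lookup table standing in for a real lexical database, and `crude_stem` is a deliberately naive suffix stripper, not the Porter stemmer.

```python
# Tiny hand-written lookup standing in for a real lexical database such as WordNet.
TOY_LEMMAS = {"mice": "mouse", "studies": "study", "cars": "car"}

def toy_lemmatize(tokens):
    """Map each token to its dictionary form when the toy table knows it."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]

def crude_stem(token):
    """Strip a plural-looking suffix without checking that the result is a word."""
    for suffix in ("ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

tokens = ["studies", "mice", "cars"]
lemmas = toy_lemmatize(tokens)
stems = [crude_stem(token) for token in tokens]

print(lemmas)
print(stems)
```

The stemmer produces the non-word "stud" and cannot handle the irregular plural "mice", while the lookup returns real dictionary forms. A real lemmatizer replaces the toy table with a full vocabulary and morphological rules.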


In [87]:
# listOfWords = train_df.question_text[:][0]
# listOfWords.extend(train_df.question_text[:][1])
# listOfWords

In [88]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
# CountVectorizer expects one string per document, so join each token list back into a sentence
documents = train_df['question_text'].apply(' '.join)
# The result is a sparse matrix with one row per question and one column per term
data_cv = cv.fit_transform(documents)
# Converting all 1,000,000 rows to a dense array would exhaust memory, so preview only the first rows
data_dtm = pd.DataFrame(data_cv[:5].toarray(), columns=cv.get_feature_names_out())
data_dtm
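For readers new to the term, a document-term matrix has one row per document, one column per distinct term, and term counts in the cells. The pure-Python sketch below builds one by hand for three made-up token lists; it illustrates the idea only and is not how `CountVectorizer` is implemented, which returns the same kind of table as a sparse matrix.

```python
from collections import Counter

# Three made-up, already tokenized documents.
docs = [
    ["interesting", "fact", "microsoft", "history"],
    ["add", "account", "payment", "bank"],
    ["payment", "bank", "history", "bank"],
]

# Vocabulary: every distinct term, sorted so the column order is stable.
vocab = sorted({term for doc in docs for term in doc})

# One row per document, one column per term, each cell a term count.
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Most cells are zero even in this tiny example, which is why a sparse representation is used for a corpus of a million questions.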
