# Text Preprocessing

This notebook performs text preprocessing on the data present in the file: `quora-question-pairs/train.csv`

We use the following approach for text preprocessing:
1. Lowercasing
2. Tokenizing
3. Noise Removal
4. Stop-word Removal
5. Lemmatization
6. Normalisation

### To do:

- [ ] Provide description for all processes and reason to perform them
- [ ] Describe Directory Structure
- [x] Load Dataset into pandas
- [x] Perform Tokenizing
- [x] Perform Lowercasing
- [x] Perform Noise Removal
- [x] Remove Stop-words
- [x] Perform Lemmatization
- [ ] Perform Normalisation/Spelling correction
- [ ] Prepare Datasets:
    - [ ] With Lemmatization and without Normalisation
    - [ ] Without Lemmatization and with Normalisation
    - [ ] Without Lemmatization and without Normalisation
    - [ ] With Lemmatization and with Normalisation


### Import Statements and File Paths

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

import pandas as pd
from IPython.display import display, HTML
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# from symspellpy.symspellpy import SymSpell, Verbosity
import numpy as np
import re

data_dir = "/content/drive/MyDrive/Quora-Data/"
train_csv = data_dir + 'pre-processing/train.csv'
test_csv = data_dir + 'pre-processing/test.csv'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### [Step 0] Loading Dataset into Pandas Dataframe

In [None]:
data_train = pd.read_csv(train_csv)
data_test = pd.read_csv(test_csv)
display(data_train)
display(data_test)

Unnamed: 0,id,qid1,qid2,question1,question2,Y
0,394437,434361,527326,How do I install APK files on my Windows Phone?,"How can I backup a (.xap, / . APPX) file insta...",0
1,373988,8023,10567,What were the major effects of the cambodia ea...,What were the major effects of the cambodia ea...,1
2,183101,280083,280084,How is Stack Exchange better than Quora?,Is Stack Exchange better than Quora? Why or wh...,1
3,43553,78324,78325,How to prevent from pimples to break out insid...,How can I avoid getting pimples inside my nose?,1
4,213919,319381,46044,What are some good books and online courses to...,What is a good online course on probability an...,0
...,...,...,...,...,...,...
323427,346872,475259,475260,What is the benefit of using const for declari...,What is the benefit of using enum to declare a...,0
323428,143678,14804,40458,How can I get a complete list of all old Gmail...,How do I find my own gmail accounts list?,1
323429,128137,14317,34001,Where can I found modern colours and textures ...,"Where can I get wide range of floor tile, wall...",1
323430,323891,449906,449907,Support@ 1877#778#89.69 ACER Technical support...,@Support@ 1877#778#89.69 COMPAQ Technical supp...,0


Unnamed: 0,id,qid1,qid2,question1,question2,Y
0,350255,478982,478983,Studying: I have made easy handwritten notes t...,From where can I download different institutes...,0
1,49376,87885,87886,Why didn't Melisandre make more shadow babies?,Why wouldn't Stannis use Melisandre's shadow b...,0
2,276580,395562,395563,Why do we fix one gear in epicyclic gear train...,Does Antarctica have any geopolitical importance?,0
3,13757,26386,26387,I sent text to my friends on WhatsApp and ther...,"Lately, I sent a WhatsApp message to a friend ...",0
4,307764,431504,431505,Can machine learning be used for Borderline PD...,How does Borderline PD affect writing ability?,0
...,...,...,...,...,...,...
80853,159203,40624,27380,What are the safety precautions on handling sh...,What are the safety precautions on handling sh...,1
80854,200725,37131,302569,How can you find the molar mass of deuterium?,How do you find the molar mass of ionic compou...,0
80855,179551,275483,275484,Which key witnesses supported the death penalt...,Which key witnesses did not support the death ...,0
80856,115590,188458,188459,Should dams be built or not?,How are dams built?,0


In [None]:
# data.rename(columns = {'question1':'q1_orig', 'question2':'q2_orig'}, inplace=True)

In [None]:
data_train['Y'].value_counts()

0    203899
1    119533
Name: Y, dtype: int64

In [None]:
data_test['Y'].value_counts()

0    51128
1    29730
Name: Y, dtype: int64

As can be seen, there is a class imbalance of roughly 63:37 (label 0 : label 1) in both the train and test sets.
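
The ratio can be checked directly from the `value_counts()` output above. This is only a small arithmetic sketch: the counts are copied from the two cells above, and `class_shares` is a helper name introduced here for illustration.

```python
# Counts copied from the value_counts() outputs above.
train_counts = {0: 203899, 1: 119533}
test_counts = {0: 51128, 1: 29730}

def class_shares(counts):
    """Return each label's share of the total, in percent (1 decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

print(class_shares(train_counts))  # {0: 63.0, 1: 37.0}
print(class_shares(test_counts))   # {0: 63.2, 1: 36.8}
```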

We have also been given question ids, so we check whether any question id is repeated.
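
One possible way to check for repeated question ids is to count every id across both `qid1` and `qid2`. The sketch below uses made-up toy pairs rather than the real dataframe, and `pairs` and `repeated` are illustrative names. On the real data the same idea would be applied to the two id columns of `data_train`.

```python
from collections import Counter

# Toy (qid1, qid2) pairs, made up for illustration: qid 8023 appears twice.
pairs = [(434361, 527326), (8023, 10567), (280083, 280084), (8023, 46044)]

# Count how often each id occurs in either position.
qid_counts = Counter(qid for pair in pairs for qid in pair)

# Keep only ids that occur more than once.
repeated = {qid: n for qid, n in qid_counts.items() if n > 1}
print(repeated)  # {8023: 2}
```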

### [Step 1] Lowercasing

In [None]:
data_train['question1'] = data_train['question1'].str.lower()
data_train['question2'] = data_train['question2'].str.lower()

data_test['question1'] = data_test['question1'].str.lower()
data_test['question2'] = data_test['question2'].str.lower()

### [Step 2] Tokenizing

Tokenizing with NLTK's `word_tokenize`, which splits punctuation into separate tokens rather than splitting on white space only.
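
To see why the choice of tokenizer matters, the sketch below contrasts a plain whitespace split with a punctuation-aware split. The regex is only a rough stand-in so the example runs without NLTK data. It is not the algorithm `word_tokenize` uses, and `simple_tokenize` is a hypothetical helper.

```python
import re

def simple_tokenize(text):
    # Runs of word characters, or single punctuation marks, as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

question = "how do i install apk files on my windows phone?"

# Whitespace splitting leaves the question mark glued to the last word.
print(question.split()[-1])            # phone?

# The punctuation-aware split separates it into its own token.
print(simple_tokenize(question)[-2:])  # ['phone', '?']
```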

### [Step 3]  Noise Removal

Not sure what to do for this step. I think the data we have is clean, because there aren't any HTML tags, emojis, etc. However, there are symbols such as: ^ * { } ( ) \[ \] \ & シ し
I am not sure whether we want to eliminate them or keep them, since some of these symbols are also used in math equations. For example: [math]y=\frac{4x^2 - 36x}{ x-9}[/math]
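
One option (an assumption, not something this notebook does) is to replace each `[math]...[/math]` span with a single placeholder token before stripping symbols, so the equation does not dissolve into stray characters. In the sketch below, `mask_math`, the `mathexpr` placeholder, and the surrounding sentence are all made up for illustration. Only the equation comes from the data.

```python
import re

# Non-greedy match of everything between [math] and [/math].
MATH_SPAN = re.compile(r"\[math\].*?\[/math\]", flags=re.DOTALL)

def mask_math(text, placeholder=" mathexpr "):
    """Replace each [math]...[/math] span with one placeholder token."""
    masked = MATH_SPAN.sub(placeholder, text)
    return re.sub(r"\s{2,}", " ", masked).strip()

example = r"what is the limit of [math]y=\frac{4x^2 - 36x}{ x-9}[/math] as x approaches 9?"
print(mask_math(example))
# what is the limit of mathexpr as x approaches 9?
```

Whether masking helps or hurts a downstream model would need to be tested. Two questions with different equations become identical after masking.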

### [Step 4] Removing Stop-words

Using NLTK stop-words
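
Stop-word removal can erase the difference between two questions. The last row of the test output at the end of this notebook shows it: "Should dams be built or not?" and "How are dams built?" both end up as `dam built`, although the pair is labelled 0. The sketch below reproduces that effect with a hand-picked subset of NLTK's English stop words, hard-coded so it runs without downloading the corpus. `drop_stop_words` is an illustrative name.

```python
# Small subset of NLTK's English stop-word list, hard-coded for illustration.
STOP_WORDS = {"should", "be", "or", "not", "how", "are"}

def drop_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

q1 = "should dams be built or not".split()
q2 = "how are dams built".split()

print(drop_stop_words(q1))  # ['dams', 'built']
print(drop_stop_words(q2))  # ['dams', 'built']
```

If this turns out to hurt performance, keeping question words and negations (for example by removing them from `stop_words`) may be worth trying.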

### [Step 5] Lemmatization

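Using NLTK's `WordNetLemmatizer`. Called without a part-of-speech tag, it treats every token as a noun, which is why plurals are reduced (`windows` becomes `window`) while verb forms such as `installed` and `getting` pass through unchanged in the output further down.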

### [Step 6] Normalisation: Spelling correction 

Spelling correction should be performed before lemmatization. Reference: https://towardsdatascience.com/text-normalization-7ecc8e084e31


In [None]:
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(question_body):
    tokens = word_tokenize(question_body)
    return tokens

def remove_symbols(tokens):
    review_text = " ".join(tokens)
    # review_text = re.sub(r"[^A-Za-z0-9(),!.?\'\`]", " ", review_text)
    # review_text = re.sub(r"\'s", " 's ", review_text)
    # review_text = re.sub(r"\'ve", " 've ", review_text)
    # review_text = re.sub(r"n\'t", " 't ", review_text)
    # review_text = re.sub(r"\'re", " 're ", review_text)
    # review_text = re.sub(r"\'d", " 'd ", review_text)
    # review_text = re.sub(r"\'ll", " 'll ", review_text)
    # review_text = re.sub(r",", " ", review_text)
    # review_text = re.sub(r"\.", " ", review_text)
    # review_text = re.sub(r"!", " ", review_text)
    # review_text = re.sub(r"\(", " ( ", review_text)
    # review_text = re.sub(r"\)", " ) ", review_text)
    # review_text = re.sub(r"\?", " ", review_text)
    # review_text = re.sub(r"\s{2,}", " ", review_text)

    text = review_text

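    # Note: inside the character class, `+-=` is a range (from '+' to '='),
    # so ':', ';' and '<' are also kept by this substitution.
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)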
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
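    # r"\0s" would match a NUL character followed by "s"; the intent is
    # assumed to be turning e.g. "1990s" into "1990".
    text = re.sub(r"0s", "0", text)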
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

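    # Split the cleaned text (not the untouched review_text), otherwise all
    # substitutions above are discarded.
    words = text.split()
    return words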

def remove_stop_words(tokens):
    # Keep only tokens that are not in NLTK's English stop-word set.
    return [word for word in tokens if word not in stop_words]

def spelling_correction(tokens):
    # Placeholder: spelling correction is not implemented yet, so the tokens
    # are returned unchanged (returning None would break the later steps).
    return tokens

def lemmatize_text(tokens):
    # Without a POS tag, WordNetLemmatizer treats every token as a noun.
    return [lemmatizer.lemmatize(word) for word in tokens]

def concatenate_tokens(tokens):
    return " ".join(tokens)

def pre_processing_with_lemmatization(question_body):
    tokens = tokenize(question_body)
    keywords = remove_stop_words(tokens)
    keywords = remove_symbols(keywords)
    lemmatized_tokens = lemmatize_text(keywords)
    return concatenate_tokens(lemmatized_tokens)

def pre_processing_without_lemmatization_and_spell_correct(question_body):
    tokens = tokenize(question_body)
    keywords = remove_stop_words(tokens)
    keywords = remove_symbols(keywords)
    return concatenate_tokens(keywords)

def pre_processing_with_spell_correct(question_body):
    tokens = tokenize(question_body)
    keywords = remove_stop_words(tokens)
    keywords = remove_symbols(keywords)
    spell_checked = spelling_correction(keywords)
    # Join the spell-checked tokens, not the uncorrected ones.
    return concatenate_tokens(spell_checked)

def pre_processing_with_lemmatization_and_spell_correct(question_body):
    tokens = tokenize(question_body)
    keywords = remove_stop_words(tokens)
    keywords = remove_symbols(keywords)
    spell_checked = spelling_correction(keywords)
    lemmatized_tokens = lemmatize_text(spell_checked)
    # Join the final (lemmatized) tokens.
    return concatenate_tokens(lemmatized_tokens)

data_train['question1']=data_train['question1'].apply(str)
data_train['question2']=data_train['question2'].apply(str)
data_test['question1']=data_test['question1'].apply(str)
data_test['question2']=data_test['question2'].apply(str)
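
One thing to watch with the `.apply(str)` calls above: if a question cell is empty in the CSV, pandas loads it as a float NaN, and `str` turns it into the literal word `nan`, which then flows through the pipeline as if it were text. The notebook does not check whether such rows exist, so this is only a precaution. The sketch below shows the behaviour with plain Python, and `to_text` is a hypothetical helper that could be passed to `.apply` instead of `str`.

```python
import math

def to_text(value):
    """Return '' for missing values (None or NaN), str(value) otherwise."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)

print(repr(str(math.nan)))                   # 'nan'
print(repr(to_text(math.nan)))               # ''
print(repr(to_text("how are dams built?")))  # 'how are dams built?'
```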


In [None]:
print(stop_words)

{"aren't", "you'd", 'm', 'been', "you'll", 'here', 'your', 'ain', "you're", 't', 'wasn', 'but', 'me', 'they', 'all', 'will', 'with', 'once', "shouldn't", 'to', 'its', 'can', 'aren', 'an', 'her', 'did', 'too', 'is', "that'll", "she's", 'same', 'he', "couldn't", 'needn', 'ours', "don't", 'whom', 'nor', 'o', 'below', 'being', 'now', 'for', 'again', 'each', 'be', 'during', 'isn', 'shan', 'both', 'until', 'hers', 'was', 'not', "it's", 'about', 'mightn', 'between', 'doesn', 'our', 'have', 'a', 'itself', 'that', 'myself', "weren't", 'because', 'such', 'any', 'there', 'mustn', 'so', 'having', 'no', 's', 'himself', 'd', 'didn', 'down', 'ma', 'theirs', 'does', 'own', 'this', 'what', 'as', 'yours', 'these', "mustn't", "should've", 'shouldn', 'out', "didn't", 'my', 'themselves', 'further', 'than', 'y', 'on', "wouldn't", 'when', 'few', 'should', 'll', 'over', 'above', "haven't", 'just', 'she', 'had', "you've", 've', 'them', 're', 'how', 'wouldn', 'him', 'i', 'before', 'we', 'in', 'some', 'doing', '

### Preparing pre-processed datasets and writing to csv file

In [None]:
path_to_csv_train = data_dir+'pre-processing/'+'preprocessing_text_train.csv'
path_to_csv_test = data_dir+'pre-processing/'+'preprocessing_text_test.csv'
path_to_csv_train_without_lemma = data_dir+'preprocessing_text_train_without_lemma.csv'
path_to_csv_test_without_lemma = data_dir+'preprocessing_text_test_without_lemma.csv'

def write_data_to_csv(path_to_csv, data_frame_name):
    data_frame_name.to_csv(path_to_csv, index = False, header=True)

# data_train['question1'] = data_train['question1'].apply(pre_processing_without_lemmatization_and_spell_correct)
# data_train['question2'] = data_train['question2'].apply(pre_processing_without_lemmatization_and_spell_correct)
# data_train = data_train.drop(['Y'], axis=1)
# display(data_train)
# write_data_to_csv(path_to_csv_train_without_lemma, data_train)

data_train['question1'] = data_train['question1'].apply(pre_processing_with_lemmatization)
data_train['question2'] = data_train['question2'].apply(pre_processing_with_lemmatization)
data_train = data_train.drop(['Y'], axis=1)
display(data_train)
write_data_to_csv(path_to_csv_train, data_train)

# data_test['question1'] = data_test['question1'].apply(pre_processing_without_lemmatization_and_spell_correct)
# data_test['question2'] = data_test['question2'].apply(pre_processing_without_lemmatization_and_spell_correct)
# data_test = data_test.drop(['Y'], axis=1)
# display(data_test)
# write_data_to_csv(path_to_csv_test_without_lemma, data_test)

data_test['question1'] = data_test['question1'].apply(pre_processing_with_lemmatization)
data_test['question2'] = data_test['question2'].apply(pre_processing_with_lemmatization)
data_test = data_test.drop(['Y'], axis=1)
display(data_test)
write_data_to_csv(path_to_csv_test, data_test)



Unnamed: 0,id,qid1,qid2,question1,question2
0,394437,434361,527326,install apk file window phone,backup ( xap appx ) file installed window phone
1,373988,8023,10567,major effect cambodia earthquake effect compar...,major effect cambodia earthquake effect compar...
2,183101,280083,280084,stack exchange better quora,stack exchange better quora
3,43553,78324,78325,prevent pimple break inside nose,avoid getting pimple inside nose
4,213919,319381,46044,good book online course follow grab concept st...,good online course probability statistic
...,...,...,...,...,...
323427,346872,475259,475260,benefit using const declaring constant,benefit using enum declare constant
323428,143678,14804,40458,get complete list old gmail account name,find gmail account list
323429,128137,14317,34001,found modern colour texture floor tile sydney,get wide range floor tile wall tile porcelain ...
323430,323891,449906,449907,support 1877 778 89 69 acer technical support ...,support 1877 778 89 69 compaq technical suppor...


Unnamed: 0,id,qid1,qid2,question1,question2
0,350255,478982,478983,studying made easy handwritten note bought del...,download different institute handwritten class...
1,49376,87885,87886,'t melisandre make shadow baby,would 't stannis use melisandre 's shadow baby...
2,276580,395562,395563,fix one gear epicyclic gear train power input,antarctica geopolitical importance
3,13757,26386,26387,sent text friend whatsapp 2 tick soon uninstal...,lately sent whatsapp message friend online mes...
4,307764,431504,431505,machine learning used borderline pd diagnosis,borderline pd affect writing ability
...,...,...,...,...,...
80853,159203,40624,27380,safety precaution handling shotgun proposed nr...,safety precaution handling shotgun proposed nr...
80854,200725,37131,302569,find molar mass deuterium,find molar mass ionic compound
80855,179551,275483,275484,key witness supported death penalty dzhokhar t...,key witness support death penalty dzhokhar tsa...
80856,115590,188458,188459,dam built,dam built
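
The four dataset variants listed in the to-do differ only in whether spelling correction and lemmatization run, so they could be generated from two flags instead of four near-identical functions. The sketch below is one way to arrange that. The two `toy_` steps are deliberately crude stand-ins (assumptions made so the example runs without NLTK), and `build_pipeline` and `variants` are illustrative names.

```python
from itertools import product

def build_pipeline(steps):
    """Compose token-level steps into one str -> str function."""
    def run(text):
        tokens = text.split()  # stand-in for the notebook's tokenize()
        for step in steps:
            tokens = step(tokens)
        return " ".join(tokens)
    return run

def toy_spell_correct(tokens):
    fixes = {"wich": "which"}  # made-up correction table
    return [fixes.get(t, t) for t in tokens]

def toy_lemmatize(tokens):
    # Crude plural stripping, only to make the effect visible.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

variants = {}
for use_spell, use_lemma in product([False, True], repeat=2):
    steps = []
    if use_spell:
        steps.append(toy_spell_correct)  # spelling correction runs first
    if use_lemma:
        steps.append(toy_lemmatize)
    variants[(use_spell, use_lemma)] = build_pipeline(steps)

sample = "wich dams leak"
for flags, pipeline in variants.items():
    print(flags, "->", pipeline(sample))
# (False, False) -> wich dams leak
# (False, True) -> wich dam leak
# (True, False) -> which dams leak
# (True, True) -> which dam leak
```

In the notebook, the real `spelling_correction` and `lemmatize_text` functions would take the place of the toy steps, with one output CSV written per flag combination.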
