## POS Tagging
* POS tagging, or part-of-speech tagging, is the process of assigning a grammatical category (part of speech) to each word in a text. It is a fundamental task in natural language processing (NLP) that supports analysis of the syntactic structure and meaning of sentences.
> Eg: `Original Sentence:` "The cat is sitting on the mat."
> `POS-tagged Sentence:` "The/DT cat/NN is/VBZ sitting/VBG on/IN the/DT mat/NN ./."
* The tags that a POS tagger can assign to words, based on their grammatical role, are listed below.
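
The `word/TAG` notation in the example above is easy to convert into (word, tag) pairs, which is the shape of the tagged output produced later in this notebook. The following is a minimal sketch under that assumption; `parse_tagged` is a hypothetical helper name, and the sentence-final punctuation is left out of the input string for simplicity. NLTK ships a similar helper, `nltk.tag.str2tuple`, for single tokens.

```python
def parse_tagged(text):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the LAST '/', so a word containing '/' stays intact
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


# Tagged words taken from the example sentence above (hypothetical helper in use)
tagged = parse_tagged("The/DT cat/NN is/VBZ sitting/VBG on/IN the/DT mat/NN")
print(tagged)
```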

## Part of Speech Tags
Note: these are the 'modified' `Penn Treebank` part-of-speech tags, as used in the Jet system. NP, NPS, PP, and PP$ from the original Penn part-of-speech tag set were changed to NNP, NNPS, PRP, and PRP$ to avoid clashes with standard syntactic categories.
* CC	Coordinating conjunction
* CD	Cardinal number
* DT	Determiner
* EX	Existential there
* FW	Foreign word
* IN	Preposition or subordinating conjunction
* JJ	Adjective
* JJR	Adjective, comparative
* JJS	Adjective, superlative
* LS	List item marker
* MD	Modal
* NN	Noun, singular or mass
* NNS	Noun, plural
* NNP	Proper noun, singular
* NNPS	Proper noun, plural
* PDT	Predeterminer
* POS	Possessive ending
* PRP	Personal pronoun
* PRP$	Possessive pronoun
* WP$	Possessive wh-pronoun
* WRB	Wh-adverb
* RB	Adverb
* RBR	Adverb, comparative
* RBS	Adverb, superlative
* RP	Particle
* SYM	Symbol
* TO	to
* UH	Interjection
* VB	Verb, base form
* VBD	Verb, past tense
* VBG	Verb, gerund or present participle
* VBN	Verb, past participle
* VBP	Verb, non-3rd person singular present
* VBZ	Verb, 3rd person singular present
* WDT	Wh-determiner
* WP	Wh-pronoun
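
Several tags in the list above share a prefix (`NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives, `RB*` for adverbs), so a fine-grained tag can be reduced to a coarse category with a prefix check. This is only an illustrative sketch: `COARSE_PREFIXES` and `coarse_category` are hypothetical names, and sending every remaining tag (including `WRB` and `RP`) to `"other"` is a simplifying assumption, not part of the tag set itself.

```python
# Hypothetical mapping from tag prefix to a coarse category
COARSE_PREFIXES = [
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]


def coarse_category(tag):
    """Return a coarse category for a Penn-style tag, or 'other'."""
    for prefix, name in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return "other"


for tag in ["NNPS", "VBZ", "JJR", "RBS", "DT", "WRB"]:
    print(tag, "->", coarse_category(tag))
```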

## Import Required libraries

In [3]:
import pandas as pd
import numpy as np

import re
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
# Download WordNet before applying the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')



sns.set_style('whitegrid')
plt.style.use('bmh')

import warnings
warnings.filterwarnings('ignore')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# for HD visualizations
%config InlineBackend.figure_format='retina'

[nltk_data] Downloading package stopwords to C:\Users\GUDLA
[nltk_data]     RAGUWING\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package wordnet to C:\Users\GUDLA
[nltk_data]     RAGUWING\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package omw-1.4 to C:\Users\GUDLA
[nltk_data]     RAGUWING\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Note:
* In this notebook, we first load the cleaned CSV file obtained after performing EDA on the original CSV file. We then apply POS tagging to create tags for the words in each question.

In [4]:
df_lb = pd.read_csv(r"C:\Users\GUDLA RAGUWING\Data Science Course\Internship_Project\cleaned_df.csv")

In [5]:
df_lb.head()

Unnamed: 0.1,Unnamed: 0,id,qid1,question1,qid2,question2,is_duplicate,Question_similarity
0,0,0,1,What is the step by step guide to invest in sh...,2,What is the step by step guide to invest in sh...,0,Not Similar
1,1,1,3,What is the story of Kohinoor (Koh-i-Noor) Dia...,4,What would happen if the Indian government sto...,0,Not Similar
2,2,2,5,How can I increase the speed of my internet co...,6,How can Internet speed be increased by hacking...,0,Not Similar
3,3,3,7,Why am I mentally very lonely? How can I solve...,8,Find the remainder when [math]23^{24}[/math] i...,0,Not Similar
4,4,4,9,"Which one dissolve in water quikly sugar, salt...",10,Which fish would survive in salt water?,0,Not Similar


In [6]:
y = df_lb['is_duplicate']
X = df_lb[['question1','question2']]

In [7]:
from sklearn.model_selection import train_test_split
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
from tqdm import tqdm
# tqdm.pandas() registers progress_apply on pandas Series and DataFrame objects.
# It shows a progress bar for the operation and an estimate of the time
# remaining to complete the pandas task.
tqdm.pandas()

In [9]:
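def preprocess(raw_text):
    # Replace every character that is not an ASCII letter (digits, punctuation,
    # accented letters) with a space
    sentence = re.sub("[^a-zA-Z]", " ", raw_text)

    # Convert the sentence to lower case
    sentence = sentence.lower()

    # Return a plain string so that Series.apply yields a Series, not a DataFrame
    return sentence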
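
The pattern `[^a-zA-Z]` used in `preprocess` keeps ASCII letters only, so accented letters are replaced by spaces as well; "Pokémon" becomes `pok mon` in the test output further down. The sketch below contrasts that rule with a Unicode-aware variant. `clean_ascii` and `clean_unicode` are hypothetical names introduced for illustration, and the second function is an alternative to consider, not what this notebook runs.

```python
import re
import unicodedata


def clean_ascii(text):
    # Same rule as preprocess: anything that is not an ASCII letter becomes a space
    return re.sub("[^a-zA-Z]", " ", text).lower()


def clean_unicode(text):
    # Hypothetical alternative: keep letters from any script.
    # NFC normalization composes base letter + combining accent into one character;
    # [\W\d_] then matches non-word characters, digits and the underscore.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[\W\d_]", " ", text).lower()


question = "How do I play Pokémon GO in Korea?"
ascii_tokens = clean_ascii(question).split()
unicode_tokens = clean_unicode(question).split()
print(ascii_tokens)
print(unicode_tokens)
```

Whether the Unicode-aware variant helps depends on the tagger and the data; both variants still drop digits, so "3rd" is reduced to `rd` either way.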


In [10]:
X_train['question1_clean'] = X_train['question1'].progress_apply(lambda x: preprocess(x))

100%|████████████████████████████████████████████████████████████████████████| 283003/283003 [00:52<00:00, 5393.75it/s]


In [11]:
X_train.head()

Unnamed: 0,question1,question2,question1_clean
20128,"How is the working environment at SBI Life, Mu...",How stressful is work of SBI clerk?,how is the working environment at sbi life mu...
296237,How can a US citizen work in Canada?,Will a US graduate degree help a non-US citize...,how can a us citizen work in canada
107095,What are the benefits of washing your hands wi...,Why is it important to wash your hands with soap?,what are the benefits of washing your hands wi...
27940,How do the holy scriptures of Hinduism compare...,How do the holy scriptures of Hinduism compare...,how do the holy scriptures of hinduism compare...
251434,Is the humanoid shape inevitable for any speci...,Once RNA evolves is it inevitable that eventua...,is the humanoid shape inevitable for any speci...


In [12]:
X_train['question2_clean'] = X_train['question2'].progress_apply(lambda x: preprocess(x))
X_train.head()

100%|████████████████████████████████████████████████████████████████████████| 283003/283003 [00:54<00:00, 5233.89it/s]


Unnamed: 0,question1,question2,question1_clean,question2_clean
20128,"How is the working environment at SBI Life, Mu...",How stressful is work of SBI clerk?,how is the working environment at sbi life mu...,how stressful is work of sbi clerk
296237,How can a US citizen work in Canada?,Will a US graduate degree help a non-US citize...,how can a us citizen work in canada,will a us graduate degree help a non us citize...
107095,What are the benefits of washing your hands wi...,Why is it important to wash your hands with soap?,what are the benefits of washing your hands wi...,why is it important to wash your hands with soap
27940,How do the holy scriptures of Hinduism compare...,How do the holy scriptures of Hinduism compare...,how do the holy scriptures of hinduism compare...,how do the holy scriptures of hinduism compare...
251434,Is the humanoid shape inevitable for any speci...,Once RNA evolves is it inevitable that eventua...,is the humanoid shape inevitable for any speci...,once rna evolves is it inevitable that eventua...


### Word Tokenizing

In [13]:
X_train['tokenizeQ1'] = X_train['question1_clean'].progress_apply(nltk.word_tokenize)
X_train.head()

100%|████████████████████████████████████████████████████████████████████████| 283003/283003 [00:39<00:00, 7118.57it/s]


Unnamed: 0,question1,question2,question1_clean,question2_clean,tokenizeQ1
20128,"How is the working environment at SBI Life, Mu...",How stressful is work of SBI clerk?,how is the working environment at sbi life mu...,how stressful is work of sbi clerk,"[how, is, the, working, environment, at, sbi, ..."
296237,How can a US citizen work in Canada?,Will a US graduate degree help a non-US citize...,how can a us citizen work in canada,will a us graduate degree help a non us citize...,"[how, can, a, us, citizen, work, in, canada]"
107095,What are the benefits of washing your hands wi...,Why is it important to wash your hands with soap?,what are the benefits of washing your hands wi...,why is it important to wash your hands with soap,"[what, are, the, benefits, of, washing, your, ..."
27940,How do the holy scriptures of Hinduism compare...,How do the holy scriptures of Hinduism compare...,how do the holy scriptures of hinduism compare...,how do the holy scriptures of hinduism compare...,"[how, do, the, holy, scriptures, of, hinduism,..."
251434,Is the humanoid shape inevitable for any speci...,Once RNA evolves is it inevitable that eventua...,is the humanoid shape inevitable for any speci...,once rna evolves is it inevitable that eventua...,"[is, the, humanoid, shape, inevitable, for, an..."


In [14]:
X_train['tokenizeQ2'] = X_train['question2_clean'].progress_apply(nltk.word_tokenize)
X_train.head()

100%|████████████████████████████████████████████████████████████████████████| 283003/283003 [00:40<00:00, 7072.87it/s]


Unnamed: 0,question1,question2,question1_clean,question2_clean,tokenizeQ1,tokenizeQ2
20128,"How is the working environment at SBI Life, Mu...",How stressful is work of SBI clerk?,how is the working environment at sbi life mu...,how stressful is work of sbi clerk,"[how, is, the, working, environment, at, sbi, ...","[how, stressful, is, work, of, sbi, clerk]"
296237,How can a US citizen work in Canada?,Will a US graduate degree help a non-US citize...,how can a us citizen work in canada,will a us graduate degree help a non us citize...,"[how, can, a, us, citizen, work, in, canada]","[will, a, us, graduate, degree, help, a, non, ..."
107095,What are the benefits of washing your hands wi...,Why is it important to wash your hands with soap?,what are the benefits of washing your hands wi...,why is it important to wash your hands with soap,"[what, are, the, benefits, of, washing, your, ...","[why, is, it, important, to, wash, your, hands..."
27940,How do the holy scriptures of Hinduism compare...,How do the holy scriptures of Hinduism compare...,how do the holy scriptures of hinduism compare...,how do the holy scriptures of hinduism compare...,"[how, do, the, holy, scriptures, of, hinduism,...","[how, do, the, holy, scriptures, of, hinduism,..."
251434,Is the humanoid shape inevitable for any speci...,Once RNA evolves is it inevitable that eventua...,is the humanoid shape inevitable for any speci...,once rna evolves is it inevitable that eventua...,"[is, the, humanoid, shape, inevitable, for, an...","[once, rna, evolves, is, it, inevitable, that,..."


## POS Tagging on Train data

In [15]:
X_train['pos_tagsQ1'] = X_train['tokenizeQ1'].progress_apply(nltk.pos_tag)
X_train.head()

100%|█████████████████████████████████████████████████████████████████████████| 283003/283003 [08:17<00:00, 568.93it/s]


Unnamed: 0,question1,question2,question1_clean,question2_clean,tokenizeQ1,tokenizeQ2,pos_tagsQ1
20128,"How is the working environment at SBI Life, Mu...",How stressful is work of SBI clerk?,how is the working environment at sbi life mu...,how stressful is work of sbi clerk,"[how, is, the, working, environment, at, sbi, ...","[how, stressful, is, work, of, sbi, clerk]","[(how, WRB), (is, VBZ), (the, DT), (working, V..."
296237,How can a US citizen work in Canada?,Will a US graduate degree help a non-US citize...,how can a us citizen work in canada,will a us graduate degree help a non us citize...,"[how, can, a, us, citizen, work, in, canada]","[will, a, us, graduate, degree, help, a, non, ...","[(how, WRB), (can, MD), (a, DT), (us, PRP), (c..."
107095,What are the benefits of washing your hands wi...,Why is it important to wash your hands with soap?,what are the benefits of washing your hands wi...,why is it important to wash your hands with soap,"[what, are, the, benefits, of, washing, your, ...","[why, is, it, important, to, wash, your, hands...","[(what, WDT), (are, VBP), (the, DT), (benefits..."
27940,How do the holy scriptures of Hinduism compare...,How do the holy scriptures of Hinduism compare...,how do the holy scriptures of hinduism compare...,how do the holy scriptures of hinduism compare...,"[how, do, the, holy, scriptures, of, hinduism,...","[how, do, the, holy, scriptures, of, hinduism,...","[(how, WRB), (do, VB), (the, DT), (holy, NN), ..."
251434,Is the humanoid shape inevitable for any speci...,Once RNA evolves is it inevitable that eventua...,is the humanoid shape inevitable for any speci...,once rna evolves is it inevitable that eventua...,"[is, the, humanoid, shape, inevitable, for, an...","[once, rna, evolves, is, it, inevitable, that,...","[(is, VBZ), (the, DT), (humanoid, JJ), (shape,..."
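
Tagging the 283,003 training rows took about eight minutes in the run above. If the same question text appears in more than one row, one possible way to reduce that cost is to tag each distinct token sequence only once and reuse the result. The sketch below assumes this; `tag_with_cache` is a hypothetical helper, and `fake_tagger` is a stand-in used only so the example runs without NLTK data (in the notebook the tagger argument would be `nltk.pos_tag`).

```python
def tag_with_cache(token_lists, tagger):
    """Tag each distinct token sequence once and reuse the cached result."""
    cache = {}
    results = []
    for tokens in token_lists:
        key = tuple(tokens)  # lists are unhashable, tuples can be dict keys
        if key not in cache:
            cache[key] = tagger(list(key))
        results.append(cache[key])
    return results


# Stand-in tagger for illustration only; it labels every token as NN
calls = []

def fake_tagger(tokens):
    calls.append(tokens)
    return [(tok, "NN") for tok in tokens]


rows = [["how", "are", "you"], ["what", "is", "nlp"], ["how", "are", "you"]]
tagged_rows = tag_with_cache(rows, fake_tagger)
print(len(rows), "rows,", len(calls), "tagger calls")
```

Rows with identical tokens share the same result list here, so the cached lists should be treated as read-only. NLTK also provides `nltk.pos_tag_sents` for tagging a list of tokenized sentences in one call, which may be worth comparing.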


In [16]:
X_train['pos_tagsQ2'] = X_train['tokenizeQ2'].progress_apply(nltk.pos_tag)
X_train.head()

100%|█████████████████████████████████████████████████████████████████████████| 283003/283003 [06:46<00:00, 696.43it/s]


Unnamed: 0,question1,question2,question1_clean,question2_clean,tokenizeQ1,tokenizeQ2,pos_tagsQ1,pos_tagsQ2
20128,"How is the working environment at SBI Life, Mu...",How stressful is work of SBI clerk?,how is the working environment at sbi life mu...,how stressful is work of sbi clerk,"[how, is, the, working, environment, at, sbi, ...","[how, stressful, is, work, of, sbi, clerk]","[(how, WRB), (is, VBZ), (the, DT), (working, V...","[(how, WRB), (stressful, JJ), (is, VBZ), (work..."
296237,How can a US citizen work in Canada?,Will a US graduate degree help a non-US citize...,how can a us citizen work in canada,will a us graduate degree help a non us citize...,"[how, can, a, us, citizen, work, in, canada]","[will, a, us, graduate, degree, help, a, non, ...","[(how, WRB), (can, MD), (a, DT), (us, PRP), (c...","[(will, MD), (a, DT), (us, PRP), (graduate, NN..."
107095,What are the benefits of washing your hands wi...,Why is it important to wash your hands with soap?,what are the benefits of washing your hands wi...,why is it important to wash your hands with soap,"[what, are, the, benefits, of, washing, your, ...","[why, is, it, important, to, wash, your, hands...","[(what, WDT), (are, VBP), (the, DT), (benefits...","[(why, WRB), (is, VBZ), (it, PRP), (important,..."
27940,How do the holy scriptures of Hinduism compare...,How do the holy scriptures of Hinduism compare...,how do the holy scriptures of hinduism compare...,how do the holy scriptures of hinduism compare...,"[how, do, the, holy, scriptures, of, hinduism,...","[how, do, the, holy, scriptures, of, hinduism,...","[(how, WRB), (do, VB), (the, DT), (holy, NN), ...","[(how, WRB), (do, VB), (the, DT), (holy, NN), ..."
251434,Is the humanoid shape inevitable for any speci...,Once RNA evolves is it inevitable that eventua...,is the humanoid shape inevitable for any speci...,once rna evolves is it inevitable that eventua...,"[is, the, humanoid, shape, inevitable, for, an...","[once, rna, evolves, is, it, inevitable, that,...","[(is, VBZ), (the, DT), (humanoid, JJ), (shape,...","[(once, RB), (rna, JJ), (evolves, NNS), (is, V..."
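
Each `pos_tags` cell holds a list of (word, tag) tuples, so it can be summarized with ordinary Python. The sketch below counts tags and pulls out the words of one category; `tag_counts` and `words_with_tag` are hypothetical helper names, and the input is the example sentence from the introduction rather than a row of this dataset. When reading such summaries, note that in the output above `us` (from "US") is tagged `PRP`, probably because lowercasing removed the capitalization cue.

```python
from collections import Counter


def tag_counts(tagged_tokens):
    """Count how often each tag occurs in one tagged sentence."""
    return Counter(tag for _, tag in tagged_tokens)


def words_with_tag(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix, e.g. 'NN'."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# Tagged form of "the cat is sitting on the mat", from the introduction
tagged = [("the", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sitting", "VBG"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]

counts = tag_counts(tagged)
nouns = words_with_tag(tagged, "NN")
verbs = words_with_tag(tagged, "VB")
print(counts)
print(nouns, verbs)
```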


## POS Tagging on Test data

In [17]:
X_test['question1_clean'] = X_test['question1'].progress_apply(lambda x: preprocess(x)) # Preprocessing Q1
X_test['question2_clean'] = X_test['question2'].progress_apply(lambda x: preprocess(x)) # Preprocessing Q2
X_test['tokenizeQ1'] = X_test['question1_clean'].progress_apply(nltk.word_tokenize) # Tokenizing Q1_clean
X_test['tokenizeQ2'] = X_test['question2_clean'].progress_apply(nltk.word_tokenize) # Tokenizing Q2_clean
X_test['pos_tagsQ1'] = X_test['tokenizeQ1'].progress_apply(nltk.pos_tag) # Applying POS tagging to Q1 tokens
X_test['pos_tagsQ2'] = X_test['tokenizeQ2'].progress_apply(nltk.pos_tag) # Applying POS tagging to Q2 tokens

100%|████████████████████████████████████████████████████████████████████████| 121287/121287 [00:16<00:00, 7436.37it/s]
100%|████████████████████████████████████████████████████████████████████████| 121287/121287 [00:16<00:00, 7256.36it/s]
100%|███████████████████████████████████████████████████████████████████████| 121287/121287 [00:10<00:00, 11126.38it/s]
100%|███████████████████████████████████████████████████████████████████████| 121287/121287 [00:11<00:00, 10846.53it/s]
100%|█████████████████████████████████████████████████████████████████████████| 121287/121287 [02:14<00:00, 899.76it/s]
100%|█████████████████████████████████████████████████████████████████████████| 121287/121287 [02:29<00:00, 810.21it/s]


In [18]:
X_test.head()

Unnamed: 0,question1,question2,question1_clean,question2_clean,tokenizeQ1,tokenizeQ2,pos_tagsQ1,pos_tagsQ2
8067,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,how do i play pok mon go in korea,how do i play pok mon go in china,"[how, do, i, play, pok, mon, go, in, korea]","[how, do, i, play, pok, mon, go, in, china]","[(how, WRB), (do, VB), (i, VB), (play, VB), (p...","[(how, WRB), (do, VB), (i, VB), (play, VB), (p..."
368101,What are some of the best side dishes for crab...,What are some good side dishes for buffalo chi...,what are some of the best side dishes for crab...,what are some good side dishes for buffalo chi...,"[what, are, some, of, the, best, side, dishes,...","[what, are, some, good, side, dishes, for, buf...","[(what, WDT), (are, VBP), (some, DT), (of, IN)...","[(what, WDT), (are, VBP), (some, DT), (good, J..."
70497,Which is more advisable and better material fo...,What is the best server setup for buddypress?,which is more advisable and better material fo...,what is the best server setup for buddypress,"[which, is, more, advisable, and, better, mate...","[what, is, the, best, server, setup, for, budd...","[(which, WDT), (is, VBZ), (more, RBR), (advisa...","[(what, WP), (is, VBZ), (the, DT), (best, JJS)..."
226567,How do I improve logical programming skills?,How can I improve my logical skills for progra...,how do i improve logical programming skills,how can i improve my logical skills for progra...,"[how, do, i, improve, logical, programming, sk...","[how, can, i, improve, my, logical, skills, fo...","[(how, WRB), (do, VBP), (i, VB), (improve, VB)...","[(how, WRB), (can, MD), (i, VB), (improve, VB)..."
73186,How close we are to see 3rd world war?,How close is a World War III?,how close we are to see rd world war,how close is a world war iii,"[how, close, we, are, to, see, rd, world, war]","[how, close, is, a, world, war, iii]","[(how, WRB), (close, JJ), (we, PRP), (are, VBP...","[(how, WRB), (close, JJ), (is, VBZ), (a, DT), ..."


In [20]:
# Saving the files in csv format for future use:
X_train.to_csv('Postaging_Train.csv')
X_test.to_csv('Postaging_Test.csv')
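
One thing to keep in mind when these files are loaded again: CSV stores every cell as text, so the token lists and (word, tag) lists come back as strings such as `"[('how', 'WRB'), ...]"` rather than Python lists. The sketch below shows the round trip with the standard library only, using an in-memory buffer instead of the real files; the row contents are illustrative.

```python
import ast
import csv
import io

# Simulate one saved row: the list of tuples is written as its string form
tags = [("how", "WRB"), ("can", "MD"), ("a", "DT")]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question1_clean", "pos_tagsQ1"])
writer.writerow(["how can a", str(tags)])

# Reading the row back yields a plain string for the tag column
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["pos_tagsQ1"]

# ast.literal_eval safely parses that string back into a list of tuples
restored = ast.literal_eval(raw)
print(type(raw).__name__, "->", type(restored).__name__)
```

With pandas, the same idea can presumably be applied at load time by passing `converters={'pos_tagsQ1': ast.literal_eval}` to `pd.read_csv`, and `index_col=0` should keep the saved index from reappearing as an `Unnamed: 0` column.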