# Learning the basics of NLP
## Embedding / Encoding / Vectorization

**Embedding** means converting words, dates, times or other features into a meaningful numeric representation.

**One Hot Encoding**
For example: if we have a sequence -> **orange - red - green - blue - yellow**
then to represent orange,
the sequence can be written as -> **1 - 0 - 0 - 0 - 0**

However, this is not efficient, because each vector has to be as long as the vocabulary (the number of unique words in the whole corpus), and almost all of its entries are zero.
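As a minimal sketch of the colour example, the snippet below builds a one hot vector by hand. The `vocabulary` list and the `one_hot_vector` helper are illustrative names, not part of any library.

```python
# Toy vocabulary taken from the colour example above
vocabulary = ["orange", "red", "green", "blue", "yellow"]

def one_hot_vector(word, vocabulary):
    """Return a list with 1 at the word's position and 0 everywhere else."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

orange_vector = one_hot_vector("orange", vocabulary)
print(orange_vector)  # [1, 0, 0, 0, 0]
```

The vector always has one entry per vocabulary word, which is why this representation grows quickly on a real corpus.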

**Embedding requires**
- a neural network that learns the vectors
- cleaned text data
- numerically encoded words (for example one hot encoded data) as input
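Conceptually, an embedding can be pictured as a lookup table with one row of numbers per word index. The sketch below is a simplified illustration in plain Python, not how Keras implements it; `vocab_size`, `embedding_dim` and `embed` are hypothetical names, and the random values stand in for weights that a network would learn during training.

```python
import random

random.seed(0)  # fixed seed so the toy example is repeatable

vocab_size = 5      # hypothetical: five words in the vocabulary
embedding_dim = 3   # hypothetical: each word becomes a 3-number vector

# One row of weights per word index; training would adjust these values
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(indices):
    """Look up the dense vector for each integer word index."""
    return [embedding_table[i] for i in indices]

sentence = [0, 3, 3]  # integer-encoded words
vectors = embed(sentence)
print(len(vectors), len(vectors[0]))  # 3 3
```

The same index always maps to the same row, so repeated words share one vector.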

I really like the tutorial [here](https://stackabuse.com/python-for-nlp-word-embeddings-for-deep-learning-in-keras/) and I am using it to practice and learn more.

In [3]:
import pandas as pd
import numpy as np

import re
from nltk.corpus import stopwords

from numpy import array
from keras import regularizers, optimizers
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten, LSTM, Dropout
from keras.layers import Embedding

In [4]:
dataset = pd.read_csv('train.csv')
dataset.head()

| Unnamed: 0 | Id | Title | Body | Tags | CreationDate | Y |
|---|---|---|---|---|---|---|
| 0 | 34552656 | Java: Repeat Task Every Random Seconds | `<p>I'm already familiar with repeating tasks e...` | `<java><repeat>` | 2016-01-01 00:21:59 | LQ_CLOSE |
| 1 | 34553034 | Why are Java Optionals immutable? | `<p>I'd like to understand why Java 8 Optionals...` | `<java><optional>` | 2016-01-01 02:03:20 | HQ |
| 2 | 34553174 | Text Overlay Image with Darkened Opacity React... | `<p>I am attempting to overlay a title over an ...` | `<javascript><image><overlay><react-native><opa...` | 2016-01-01 02:48:24 | HQ |
| 3 | 34553318 | Why ternary operator in swift is so picky? | `<p>The question is very simple, but I just cou...` | `<swift><operators><whitespace><ternary-operato...` | 2016-01-01 03:30:17 | HQ |
| 4 | 34553755 | hide/show fab with scale animation | `<p>I'm using custom floatingactionmenu. I need...` | `<android><material-design><floating-action-but...` | 2016-01-01 05:21:48 | HQ |


In [5]:
print(len(dataset))

45000


In [6]:
dataset = dataset.rename(columns = {'Y' : 'Category'})
dataset.head()

| Unnamed: 0 | Id | Title | Body | Tags | CreationDate | Category |
|---|---|---|---|---|---|---|
| 0 | 34552656 | Java: Repeat Task Every Random Seconds | `<p>I'm already familiar with repeating tasks e...` | `<java><repeat>` | 2016-01-01 00:21:59 | LQ_CLOSE |
| 1 | 34553034 | Why are Java Optionals immutable? | `<p>I'd like to understand why Java 8 Optionals...` | `<java><optional>` | 2016-01-01 02:03:20 | HQ |
| 2 | 34553174 | Text Overlay Image with Darkened Opacity React... | `<p>I am attempting to overlay a title over an ...` | `<javascript><image><overlay><react-native><opa...` | 2016-01-01 02:48:24 | HQ |
| 3 | 34553318 | Why ternary operator in swift is so picky? | `<p>The question is very simple, but I just cou...` | `<swift><operators><whitespace><ternary-operato...` | 2016-01-01 03:30:17 | HQ |
| 4 | 34553755 | hide/show fab with scale animation | `<p>I'm using custom floatingactionmenu. I need...` | `<android><material-design><floating-action-but...` | 2016-01-01 05:21:48 | HQ |


> In this notebook, I want to work with text.
I want to practice the concept of embedding.
I will extract only the Body column from the dataset, practice cleaning the text, and try the different tokenization techniques that NLP libraries offer.

In [7]:
from nltk.tokenize import word_tokenize
from gensim.utils import simple_preprocess



In [8]:
dataset_body = dataset['Body']
dataset_body.head()

0    <p>I'm already familiar with repeating tasks e...
1    <p>I'd like to understand why Java 8 Optionals...
2    <p>I am attempting to overlay a title over an ...
3    <p>The question is very simple, but I just cou...
4    <p>I'm using custom floatingactionmenu. I need...
Name: Body, dtype: object

## Cleaning Data

In [9]:
stop_words = set(stopwords.words('english'))
print(stop_words)

{"mightn't", 'ourselves', 'most', 'won', 'under', 'to', 'been', 'ours', 'that', 'in', 'didn', "haven't", 'his', 'and', 'herself', 'further', 'but', 'through', 'above', "don't", "couldn't", 'aren', 'they', 'isn', 'until', 'yourself', 'or', 'them', 'then', 'each', 'where', 'of', 'which', 'should', 'some', "you'd", 'don', 'nor', 'as', 'into', 'what', 'theirs', 'the', "shouldn't", 'hers', 'it', 'down', 'once', 'its', 'very', 'she', 'were', 'your', 'ma', 'me', 'out', 'a', 'hadn', "aren't", 'him', 'so', 'for', 's', 'my', 'yourselves', "it's", 'he', 'myself', 'whom', 'has', 'with', 'couldn', 'doesn', 'be', 'am', "mustn't", 'yours', "needn't", "hadn't", 'have', 'hasn', "shan't", 'too', 'an', 'these', 'at', "doesn't", 'between', 'during', 'needn', 'just', 'i', "hasn't", 'such', 'having', 'few', "wouldn't", 'themselves', "you'll", 'over', 'o', 'shouldn', 'mustn', 'does', 'you', 'himself', 'had', 'about', 'same', 'weren', 'being', 'again', 'do', 'wouldn', 'off', "weren't", 'no', "that'll", 'wasn'

In [10]:
# Raw string so that \[ and \] are not treated as invalid string escapes
symbols = re.compile(r'[/<>(){}\[\]\|@,;]')
# Link fragments that are stripped out of the text
tags = ['href', 'http', 'https', 'www']

def text_clean(s: str) -> str:
    """
    Removes unwanted symbols, punctuation and stop words from a given string.
    """
    s = symbols.sub(' ', s)
    for tag in tags:
        s = s.replace(tag, ' ')
    # simple_preprocess lowercases, strips accents (deacc=True) and tokenizes the text
    cleaned_text = ' '.join(word for word in simple_preprocess(s, deacc=True) if word not in stop_words)
    return cleaned_text

# Applying the function to the Body column
dataset_body = dataset_body.apply(text_clean)
dataset_body.head()


0    already familiar repeating tasks every seconds...
1    like understand java optionals designed immuta...
2    attempting overlay title image image darkened ...
3    question simple could find answer pre code ret...
4    using custom need implement scale animation sh...
Name: Body, dtype: object
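To see the cleaning steps on a single string without NLTK or gensim installed, here is a rough standard-library approximation. It is only a sketch: `simple_clean` is a hypothetical helper, the stop-word set is a tiny stand-in for NLTK's English list, and the regex tokenizer only imitates what `simple_preprocess` does.

```python
import re

SYMBOLS = re.compile(r"[/<>(){}\[\]|@,;]")
# Hypothetical, tiny stop-word list; the notebook uses NLTK's full English list
STOP_WORDS = {"i", "am", "to", "a", "the", "with"}

def simple_clean(text):
    """Replace symbols with spaces, lowercase, and drop stop words and 1-letter tokens."""
    text = SYMBOLS.sub(" ", text)
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(w for w in words if len(w) > 1 and w not in STOP_WORDS)

cleaned = simple_clean("<p>I am trying to repeat a task, with Java.</p>")
print(cleaned)  # trying repeat task java
```

Because `<` and `>` are replaced by spaces, HTML tag names survive as ordinary words unless they are very short, which is why tokens such as `pre` and `code` still appear in the cleaned output above.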

In [11]:
# making training sets
dataset_body_X_train = dataset_body

In [12]:
dataset_body_X_train

0        already familiar repeating tasks every seconds...
1        like understand java optionals designed immuta...
2        attempting overlay title image image darkened ...
3        question simple could find answer pre code ret...
4        using custom need implement scale animation sh...
                               ...                        
44995    new asking help convert string type data made ...
44996    working learning python wondering way scripts ...
44997    looks like costs days per month azure change b...
44998    questions want implement quiz clicks parenthes...
44999    new programming teaching made calculator calcu...
Name: Body, Length: 45000, dtype: object

In [13]:
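# NOTE: column 0 is `Id`, so this stores question IDs, not the `Category` labels.
# A meaningful training target would have to be built from dataset['Category'] instead.
dataset_body_Y_train = dataset.iloc[:, 0].values.reshape(-1,1)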

In [14]:
dataset_body_Y_train

array([[34552656],
       [34553034],
       [34553174],
       ...,
       [60462001],
       [60465318],
       [60468018]])
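The array above contains values from the `Id` column, so it would not work as a training target. One possible way to turn the `Category` labels into integers is sketched below. The `categories` list is a hypothetical sample that uses only the two label values visible in the preview; the full dataset may contain others.

```python
# Hypothetical sample of labels, using only values visible in the preview above
categories = ["LQ_CLOSE", "HQ", "HQ", "HQ", "HQ"]

# Assign each distinct label an integer, in sorted order for repeatability
label_to_index = {label: i for i, label in enumerate(sorted(set(categories)))}
encoded_labels = [label_to_index[label] for label in categories]

print(label_to_index)   # {'HQ': 0, 'LQ_CLOSE': 1}
print(encoded_labels)   # [1, 0, 0, 0, 0]
```

With pandas, the same dictionary could be applied to the whole column with `dataset['Category'].map(label_to_index)`.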

## Tokenization

In [15]:
all_words = []
for sentence in dataset_body_X_train:
    tokenize_word = word_tokenize(sentence)
    for word in tokenize_word:
        all_words.append(word)

In [16]:
print(len(all_words))

3759217


In [17]:
# words are repeated throughout the corpus, so I am keeping only the unique ones
unique_words = set(all_words)
print(len(unique_words))

135602
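The gap between the total number of tokens and the number of unique words can be reproduced on a toy corpus. This sketch uses `collections.Counter` and a plain `split()` instead of `word_tokenize`; the three sentences are made up for illustration.

```python
from collections import Counter

# Hypothetical mini corpus of already-cleaned sentences
toy_corpus = ["java repeat task", "java optional immutable", "repeat task java"]

all_tokens = [word for sentence in toy_corpus for word in sentence.split()]
counts = Counter(all_tokens)

print(len(all_tokens), len(counts))  # 9 5
print(counts.most_common(1))         # [('java', 3)]
```

A `Counter` gives the vocabulary size like `set` does, and also shows which words dominate the corpus.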


# Embedding
## One Hot Encoding

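The embedding layer requires the words to be in numerical form. That's why each word is encoded as an integer first. Note that the Keras `one_hot` function hashes each word to an integer index rather than building a true one hot vector, so two different words can occasionally end up with the same index.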

In [18]:
# vocabulary size: 135602 unique words plus a buffer of 5
embedded_sentence = [one_hot(sentence, 135607) for sentence in dataset_body_X_train]
embedded_sentence[1] # the data at index 1

[79805, 126217, 130615, 20968, 11423, 18003, 86967, 112384]
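Keras documents `one_hot` as a wrapper around the hashing trick, so the integers above are hashed word indices. The sketch below shows the general idea in plain Python. It is an approximation, not the Keras implementation: `hashed_indices` is a hypothetical helper, and `md5` is used here only because it gives the same result on every run.

```python
import hashlib

def hashed_indices(text, n):
    """Map each whitespace-separated word to an integer in the range 1..n-1."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free to act as the padding value
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

encoded = hashed_indices("like understand java optionals java", n=50)
print(encoded)
```

The same word always gets the same index, but nothing prevents two different words from colliding, which is one reason to choose `n` somewhat larger than the vocabulary.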

##### Trying to find the longest sentence. Embedding requires all sentences to be of the same length. Once the longest sentence is found, we can pad all sentences to the same size as the longest one.


In [19]:
def word_count(sentence):
    """Number of tokens in one cleaned sentence."""
    return len(word_tokenize(sentence))

largest_sentence = max(dataset_body_X_train, key=word_count)
length_of_longest_sentence = word_count(largest_sentence)
print(length_of_longest_sentence)

7145


##### Now adding padding to fill up the blank spaces for sentences that are not the same size as the longest one
The entries in `embedded_sentence` are encoded, but they are not all the same size.

Define the padded length: the same as the size of the longest sentence.

`post` means adding the padding at the end of the sentence.
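What post-padding does can be sketched in a few lines of plain Python. `pad_post` is a hypothetical helper written for illustration; it cuts sequences that are too long at the end, which is a simplification and may not match the truncation defaults of the Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end until it has `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]  # simplification: drop anything beyond maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy_sequences = [[7, 2, 9], [4], [5, 1]]
longest = max(len(seq) for seq in toy_sequences)
toy_padded = pad_post(toy_sequences, longest)
print(toy_padded)  # [[7, 2, 9], [4, 0, 0], [5, 1, 0]]
```

Every row ends up with the same length, which is what lets the sentences be stacked into a single 2-D array.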

In [20]:
padded_sentence = pad_sequences(embedded_sentence, maxlen=length_of_longest_sentence, padding='post')

In [21]:
print(padded_sentence)

[[  3739  90791  82468 ...      0      0      0]
 [ 79805 126217 130615 ...      0      0      0]
 [ 14398 127366  45106 ...      0      0      0]
 ...
 [ 86898  79805   1135 ...      0      0      0]
 [ 21192 116289 131973 ...      0      0      0]
 [ 19147  63681  17694 ...      0      0      0]]


#### Now everything is ready to run embedding.

In [26]:
volume = 135607
model = Sequential()
model.add(Embedding(volume, 5, input_length = length_of_longest_sentence))
model.add(Flatten())
model.add(Dense(units = 1, activation = 'sigmoid'))


model.summary()

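# A single sigmoid output pairs with binary cross-entropy;
# categorical cross-entropy expects one output unit per class.
model.compile(optimizer = 'adam', metrics = ['accuracy'], loss = 'binary_crossentropy')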

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 7145, 5)           678035    
_________________________________________________________________
flatten_3 (Flatten)          (None, 35725)             0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 35726     
Total params: 713,761
Trainable params: 713,761
Non-trainable params: 0
_________________________________________________________________
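The parameter counts in the summary can be checked by hand. The arithmetic below uses the values from the cells above; the variable names are only for illustration.

```python
vocab_size = 135607      # `volume` in the model cell
embedding_dim = 5        # second argument to Embedding
sequence_length = 7145   # length_of_longest_sentence

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim
# Flatten has no weights; it only reshapes (7145, 5) into one long vector
flatten_units = sequence_length * embedding_dim
# Dense with 1 unit: one weight per input plus one bias
dense_params = flatten_units * 1 + 1
total_params = embedding_params + dense_params

print(embedding_params, flatten_units, dense_params, total_params)
# 678035 35725 35726 713761
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main driver of model size here.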


In [None]:
#model.fit(padded_sentence, dataset_body_Y_train, epochs = 20, batch_size = 512, verbose = 1)