# Natural Language Processing (NLP)

Natural Language Processing (NLP) is about applying computers to the nuances of human language and building real-world applications with NLP techniques. We can think of NLP as analogous to teaching a new language to a child. The most common tasks a child learns from teachers are understanding words and sentences and forming grammatically and structurally correct sentences. NLP can be classified as a subset of the broader field of speech and language processing.
NLP is the field dedicated to the automated understanding of human language. A few classification problems that are very common in NLP are:

* Given the characters of a corpus, predict where words start and end.
* Given the words of a corpus, predict where sentences start and end.
* Given the words in a sentence, predict the part of speech of each word.
* Given the words in a sentence, predict where phrases start and end.
* Given the words in a sentence, predict where named entity references (usually proper nouns) start and end.
* Given the words in a sentence, predict the sentiment of the sentence.

In brief, NLP is supposed to do one of these three tasks:

* Label a region of text (part-of-speech tagging or sentiment classification).
* Link two or more regions of text that refer to the same real-world thing.
* Fill in missing information (words) based on context.

With the popularity of deep learning, NLP now uses state-of-the-art architectures such as RNNs, LSTMs, and Transformers. Before that, NLP relied on advanced probabilistic and non-parametric models.

NLP is a subfield of Artificial Intelligence (AI) that deals with natural language and sits at the intersection of AI and linguistics. It enables machines to understand human language and thus bridges the communication gap between humans and machines.

## Natural Language Toolkit (NLTK)

NLTK was created in 2001 for a linguistics course at the University of Pennsylvania. It has been widely adopted as a toolkit for research projects in Natural Language Processing. NLTK comes with many modules for different language processing tasks such as string preprocessing, part-of-speech (POS) tagging, classification, parsing, chunking, and so on. Even though the toolkit is efficient enough for meaningful tasks, it has not been optimized for runtime performance.
NLTK is free and can be downloaded from http://www.nltk.org/. You can find detailed instructions on downloading and installing the toolkit on the BIG FLash Moodle page inside <u>Introduction to the ML/AI MOOC tile</u>.


## spaCy

spaCy is a library for advanced NLP in Python and Cython. It offers state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, and text classification. It also supports pretrained transformers like BERT. It is open-source software, and installation instructions are available at https://spacy.io/usage. You can find detailed instructions on downloading and installing the library on the BIG FLash Moodle page inside <u>Introduction to the ML/AI MOOC tile</u>.

# Processing Raw Data

When grading exam answers, a grader bases the grade on the relevant part of each answer, which carries most of the information related to the question, and ignores the irrelevant parts. The grader identifies the keywords in the question and tries to match them against the answer. Text processing follows a similar strategy: the machine does not need the irrelevant parts of the corpus (a collection of texts).

##  Tokenization

In NLP, we need to divide text into sentences or words. A computer usually reads a body of text as one single string, so we need to separate that string into individual string objects. Tokenization is the process of splitting each document into the words that appear in it, for example by splitting on whitespace and punctuation.
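
To make the idea of splitting on whitespace and punctuation concrete, here is a minimal sketch that uses only Python's standard `re` module. The function name `simple_word_tokenize` and the regular expression are illustrative assumptions, not how NLTK tokenizes internally.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustration only)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("I was born in Ontario, Canada.")
print(tokens)
# ['I', 'was', 'born', 'in', 'Ontario', ',', 'Canada', '.']
```

A rule this simple will mis-handle contractions, abbreviations, and similar cases, which is one reason to prefer a library tokenizer in practice.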

In the example below, we look at two tokenization functions from the NLTK library: word_tokenize and sent_tokenize.

In [1]:
sample_text = '''I am a student from the University of Alabama. I
was born in Ontario, Canada and I am a huge fan of the United States. I am going to get a degree in Philosophy to improve
my chances of becoming a Philosophy professor. I have been
working towards this goal for 4 years. I am currently enrolled
in a PhD program. It is very difficult, but I am confident that
it will be a good decision'''

In [2]:
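# word_tokenize and sent_tokenize need the NLTK Punkt tokenizer data, e.g. via nltk.download('punkt')
from nltk.tokenize import word_tokenize,sent_tokenize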
sample_word_tokens = word_tokenize(sample_text)
sample_sent_tokens = sent_tokenize(sample_text)

In [3]:
print(sample_word_tokens)

['I', 'am', 'a', 'student', 'from', 'the', 'University', 'of', 'Alabama', '.', 'I', 'was', 'born', 'in', 'Ontario', ',', 'Canada', 'and', 'I', 'am', 'a', 'huge', 'fan', 'of', 'the', 'United', 'States', '.', 'I', 'am', 'going', 'to', 'get', 'a', 'degree', 'in', 'Philosophy', 'to', 'improve', 'my', 'chances', 'of', 'becoming', 'a', 'Philosophy', 'professor', '.', 'I', 'have', 'been', 'working', 'towards', 'this', 'goal', 'for', '4', 'years', '.', 'I', 'am', 'currently', 'enrolled', 'in', 'a', 'PhD', 'program', '.', 'It', 'is', 'very', 'difficult', ',', 'but', 'I', 'am', 'confident', 'that', 'it', 'will', 'be', 'a', 'good', 'decision']


In [4]:
print(sample_sent_tokens)

['I am a student from the University of Alabama.', 'I\nwas born in Ontario, Canada and I am a huge fan of the United States.', 'I am going to get a degree in Philosophy to improve\nmy chances of becoming a Philosophy professor.', 'I have been\nworking towards this goal for 4 years.', 'I am currently enrolled\nin a PhD program.', 'It is very difficult, but I am confident that\nit will be a good decision']


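The difference between the two is granularity: word_tokenize splits the text into words and punctuation marks, whereas sent_tokenize splits it at sentence boundaries. Note that sent_tokenize keeps the line breaks (\n) that fall inside a sentence.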
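
Splitting by a sentence delimiter can be sketched with a naive rule: break after '.', '!' or '?' whenever whitespace follows. The helper `simple_sent_tokenize` is a hypothetical name, and this heuristic is only an assumption for illustration. It is not the algorithm NLTK uses, and it would wrongly split after abbreviations such as "Dr.".

```python
import re

def simple_sent_tokenize(text):
    """Naive sentence splitter: break after . ! or ? when whitespace follows."""
    # The lookbehind keeps the delimiter attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "I am a student. I was born in Ontario, Canada. It is very difficult"
sentences = simple_sent_tokenize(text)
print(sentences)
# ['I am a student.', 'I was born in Ontario, Canada.', 'It is very difficult']
```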

## Stemming

Similarly, some words in the corpus appear in both singular and plural forms. For example, the semantics of 'drawer' and 'drawers' are so close that treating them as separate words inflates the vocabulary and can lead to overfitting. This problem can be overcome by representing each word by its stem, so that all words sharing the same stem are treated alike. The process of removing the suffixes at the end of words is called stemming.

In [5]:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer 

In [6]:
stemmer=PorterStemmer()
words=['annoyed','levitated','was','cats','better']
stems=[ stemmer.stem(word=word) for word in words]
df_stem=pd.DataFrame({'given_word':words,
               'stem_word':stems
             })
print(df_stem)

  given_word stem_word
0    annoyed     annoy
1  levitated     levit
2        was        wa
3       cats       cat
4     better    better


In the result above, you can see that the stemmer only drops the suffix at the end of a word. However, the base form of 'was' is 'be', not 'wa'. This is a limitation of the stemming process.
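
The limitation is easy to reproduce with a toy suffix stripper. The sketch below is not the Porter algorithm. The suffix list and the function name `toy_stem` are assumptions chosen only to show why blind suffix removal turns 'was' into 'wa'.

```python
# Hypothetical, deliberately tiny rule set (the real Porter stemmer has many more rules).
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, without looking at meaning or context."""
    for suffix in SUFFIXES:
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

for w in ["annoyed", "cats", "was", "better"]:
    print(w, "->", toy_stem(w))
# annoyed -> annoy
# cats -> cat
# was -> wa
# better -> better
```

Because the rule only inspects the characters at the end of the word, it cannot know that 'was' is an irregular form of 'be'.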

## Lemmatization

Lemmatization is closely related to stemming. Stemming obtains the stem by simply dropping the suffix, without taking the context of the word in the sentence into account. Lemmatization, however, applies morphological analysis to the words and takes the role of the word (its part of speech) into account. Both stemming and lemmatization are forms of normalization that try to extract a normal form of words.
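
One way to picture why the role of a word matters is a lookup keyed by both the word and its part of speech. The miniature lexicon and the function `toy_lemmatize` below are hypothetical. A real lemmatizer relies on a full dictionary plus morphological rules, so treat this only as a sketch of the idea.

```python
# Hypothetical miniature lexicon keyed by (word, part of speech).
TOY_LEXICON = {
    ("was", "v"): "be",
    ("cats", "n"): "cat",
    ("better", "a"): "good",
    ("levitated", "v"): "levitate",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if the (word, pos) pair is known, else the word itself."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("was", "v"))     # be
print(toy_lemmatize("better", "a"))  # good   (as an adjective)
print(toy_lemmatize("better", "v"))  # better (no verb entry in the toy lexicon)
```

The same surface form can therefore map to different lemmas depending on the part of speech that is supplied.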

In the example below, we use the same list of words as in the stemming example and look at the differences.

In [7]:
from nltk.stem import WordNetLemmatizer

In [8]:
wrd_lmtz=WordNetLemmatizer()
words=['annoyed','levitated','was','cats','better']

In [9]:
# pos='v' treats every word as a verb; by default WordNetLemmatizer assumes pos='n' (noun)
lmtz= [wrd_lmtz.lemmatize(word,pos='v')  for word in words ]
df_lmtz=pd.DataFrame({'given_word':words,
               'lematize_word':lmtz
             })
print(df_lmtz)

  given_word lematize_word
0    annoyed         annoy
1  levitated      levitate
2        was            be
3       cats           cat
4     better        better


##  Stop Words 

While processing raw text data, it is important to get rid of uninformative words by discarding words that are too frequent to be informative. We can import the stop words from the NLTK package. All of the stop words available in NLTK are lowercase. <b>Therefore it is important to convert all the tokens to lowercase, because matching the words in the text against the list of stop words is case sensitive.</b>
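
The effect of case sensitivity can be seen in a small sketch. The stop word set `toy_stop` is a made-up subset used only for illustration; NLTK's English list is much longer.

```python
# A tiny, hypothetical stop word set (all lowercase, like NLTK's list).
toy_stop = {"i", "am", "a", "the", "of", "from"}

tokens = ["I", "am", "a", "student", "from", "The", "University"]

# Without lowercasing, 'I' and 'The' do not match their lowercase entries and slip through.
without_lowercasing = [t for t in tokens if t not in toy_stop]
# Lowercasing first makes the membership test behave as intended.
with_lowercasing = [t.lower() for t in tokens if t.lower() not in toy_stop]

print(without_lowercasing)  # ['I', 'student', 'The', 'University']
print(with_lowercasing)     # ['student', 'university']
```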

Below is an example of removing the stop words.

In [10]:
from nltk.corpus import stopwords

In [11]:
stop=stopwords.words('english') 
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Above we can see the list of stop words for the English corpus. We will now remove the stop words from sample_sent_tokens.

In [12]:
sent_stop_rmv=[]
for sent in sample_sent_tokens:
    clean_word=[]
    for word in sent.split():
        if word.lower() not in stop:
            clean_word.append(word.lower()) # keep the lowercased word only if it is not a stop word
    sent_stop_rmv.append('-'.join(clean_word)) # a white space would work as a separator as well
print(sent_stop_rmv)

['student-university-alabama.', 'born-ontario,-canada-huge-fan-united-states.', 'going-get-degree-philosophy-improve-chances-becoming-philosophy-professor.', 'working-towards-goal-4-years.', 'currently-enrolled-phd-program.', 'difficult,-confident-good-decision']


## Word Embeddings

Natural Language Processing is about preparing textual data for machine learning and deep learning models. However, ML/DL models work with numerical input, so it becomes important to transform the preprocessed textual data into numerical data. The embedded numerical data represent the textual data and consist of real-valued vectors. Words that have similar meanings are mapped to similar vectors. Word embeddings are generated once tokenization has been performed on the corpus.

One important reason for using embeddings is to make the machine understand synonyms the way humans do. For example, a machine has to learn to differentiate between positive and negative adjectives, or to become familiar with different words that have a similar meaning and carry a positive or negative connotation. Let us take two sentences:

* The food here is good.
* The food here is great.

Both sentences express a positive opinion about the food, and thus a word embedding maps 'good' and 'great' to two separate but similar real-valued vectors.
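
"Similar vectors" is commonly quantified with cosine similarity. The sketch below uses hand-made three-dimensional vectors that are pure assumptions for illustration. Real embeddings are learned from data and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 means a similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only.
toy_vectors = {
    "good": [0.9, 0.8, 0.1],
    "great": [1.0, 0.9, 0.2],
    "terrible": [-0.8, -0.9, 0.1],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_terrible = cosine_similarity(toy_vectors["good"], toy_vectors["terrible"])
print(round(sim_good_great, 3), round(sim_good_terrible, 3))
```

With these toy values, 'good' and 'great' point in almost the same direction, while 'good' and 'terrible' point in nearly opposite directions.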


###  CountVectorizer 

CountVectorizer transforms a text into a vector on the basis of the frequency (count) of each word present in the text or in the corpus. It is an implementation of bag-of-words, in which text data is encoded as a vector of word counts, one feature per word.

Bag-of-words (BoW) is an effective way to represent text for machine learning. With this representation, the algorithm only looks at the number of times a given word occurs in a body of text. We discard the structure of the text, such as chapters, paragraphs, sentences, and formatting.
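
The counting idea behind bag-of-words can be sketched in a few lines of plain Python. The two documents and the helper `bag_of_words` are assumptions for illustration. This is not how scikit-learn implements CountVectorizer, only the same idea at a small scale.

```python
from collections import Counter

docs = ["the food here is good", "the food here is great great"]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Count how often each vocabulary word occurs in doc; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)             # ['food', 'good', 'great', 'here', 'is', 'the']
print(bag_of_words(docs[0]))  # [1, 1, 0, 1, 1, 1]
print(bag_of_words(docs[1]))  # [1, 0, 2, 1, 1, 1]
```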

In [13]:
# Let us reuse the list of sentence tokens created in the tokenization section.
sample_sent_tokens=['I am a student from the University of Alabama.', 
                     'I \nwas born in Ontario, Canada and I am a huge fan of the United States.', 
                     'I am going to get a degree in Philosophy to improve my chances of \nbecoming a Philosophy professor.', 
                     'I have been working towards this goal\nfor 4 years.', 'I am currently enrolled in a PhD program.', 
                     'It is very difficult, \nbut I am confident that it will be a good decision']

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
c = CountVectorizer(stop_words='english', token_pattern=r'\w+')
converted_data = c.fit_transform(sample_sent_tokens).todense()

In [15]:
print(converted_data)

[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0]
 [0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 2 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]
 [0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]


Our converted data set is a 6$\times$28 array, which means we have six sentences, each represented by 28 features. Let us look at the features.

In [16]:
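# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
print(c.get_feature_names_out().tolist())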

['4', 'alabama', 'born', 'canada', 'chances', 'confident', 'currently', 'decision', 'degree', 'difficult', 'enrolled', 'fan', 'goal', 'going', 'good', 'huge', 'improve', 'ontario', 'phd', 'philosophy', 'professor', 'program', 'states', 'student', 'united', 'university', 'working', 'years']
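
To read the count matrix, pair each column with its feature name. The sketch below copies the feature list and the third row of the printed matrix by hand, so it runs without scikit-learn. The variable names are illustrative.

```python
# Feature names and the third row of the count matrix, copied from the outputs above.
feature_names = ['4', 'alabama', 'born', 'canada', 'chances', 'confident', 'currently',
                 'decision', 'degree', 'difficult', 'enrolled', 'fan', 'goal', 'going',
                 'good', 'huge', 'improve', 'ontario', 'phd', 'philosophy', 'professor',
                 'program', 'states', 'student', 'united', 'university', 'working', 'years']
row = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0]

# Keep only the words that actually occur in the third sentence.
present = {name: count for name, count in zip(feature_names, row) if count > 0}
print(len(feature_names))  # 28
print(present)
# {'chances': 1, 'degree': 1, 'going': 1, 'improve': 1, 'philosophy': 2, 'professor': 1}
```

The count of 2 for 'philosophy' reflects that the word appears twice in the third sentence.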


# Exercise for Students: NLP for SMS Spam Detection Using an MLP

In this exercise, we use the SMS Spam Collection dataset from the UCI ML repository, https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The dataset is already available in the Jupyter notebook environment under the text/spam path. It consists of SMS messages labeled as 'ham' (genuine) or 'spam'. Your task is to preprocess the text data, create an MLP model, and train it on the data.

Steps to do, in the following order:
- Read the file spam.csv located at path text/spam.csv.
- Use a label encoder to encode the target values 'ham' and 'spam'.
- Tokenize the text and convert the tokens to lowercase.
- Remove the stop words.
- Lemmatize the text.
- Use a count vectorizer as the word embedding.
- Split the dataset into training and test sets.
- Build an MLP model from the given specification:
     <table>
      <thead>
        <tr>
          <th>Layers</th>
          <th>Neurons</th>
          <th>Activation Function</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Dense Layer-I</td>
          <td>512</td>
          <td>tanh</td>
        </tr>
        <tr>
         <td>Dropout Layer (rate=0.5)</td>
          <td>-</td>
          <td>-</td>
        </tr>
        <tr>
         <td>Dense Layer-II</td>
          <td>2</td>
          <td>-</td>
        </tr> 
      </tbody>
    </table>
- Fit the model defined above to the training data and compare the predicted labels with the test labels.
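
The table lists no activation for Dense Layer-II, so its two outputs can be read as raw scores (logits). As a hedged sketch of how such scores turn into a predicted label, the code below applies a softmax and picks the larger score. The logit values are made up, and the mapping of index 0 to 'ham' and index 1 to 'spam' is an assumption based on alphabetical label encoding.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum first avoids overflow in exp().
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Index of the largest score; softmax does not change which index is largest."""
    return max(range(len(logits)), key=lambda i: logits[i])

toy_logits = [2.0, -1.0]  # made-up scores, assumed order: (ham, spam)
probs = softmax(toy_logits)
print(probs, predict_label(toy_logits))
```

Comparing predicted and true labels then amounts to checking, message by message, whether the predicted index equals the encoded label.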

<b>HINT: You can reuse ideas from Chapter 1, Introduction to Deep Learning, for this exercise.</b>

In [17]:
## Solution: Not to be disclosed during teaching process

In [18]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [19]:
# dropna(axis=1) removes every column that contains missing values
df=pd.read_csv('text/spam.csv',encoding="ISO-8859-1").dropna(axis=1)

In [20]:
# LabelEncoder assigns integers in sorted label order: 'ham' -> 0, 'spam' -> 1
le=LabelEncoder()
df['v1']=le.fit_transform(df.v1)

In [21]:
stop=stopwords.words('english')

In [22]:
df['wrd_tokenize']=df['v2'].apply(lambda x:[word.lower() # change to lower case
                                            for sent in nltk.sent_tokenize(x) 
                                            for word in nltk.word_tokenize(sent)])
df['wrd_stop']=df['wrd_tokenize'].apply(lambda x:[wrd  for wrd in x if wrd not in  stop])


In [23]:
wrd_lmtz=WordNetLemmatizer()

In [24]:
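# lemmatize() treats every token as a noun (pos='n') unless a pos argument is given
df['wrd_lmtz']=df['wrd_stop'].apply(lambda x:[wrd_lmtz.lemmatize(wrd_stp)  for wrd_stp in x ])
# join the tokens back into one string; CountVectorizer's default token pattern splits on '-' again
df['clean_join'] =df['wrd_lmtz'].apply(lambda x:'-'.join(x))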

In [25]:
df=df[['clean_join','v1']]

## WORD EMBEDDING

In [26]:
cnt_vtr=CountVectorizer()
X=cnt_vtr.fit_transform(df['clean_join'])

In [27]:
y=df.v1.to_numpy()
num_of_classes=len(np.unique(y))

In [28]:
num_of_classes

2

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, test_size=0.33, random_state=42)

In [30]:
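# hold out the first 1800 rows of the test split as a validation set
X_val=X_test[:1800,:]
y_val=y_test[:1800]
# the remaining rows form the final test set
X_test=X_test[1800:,:]
y_test=y_test[1800:]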

In [31]:
X_val.shape,y_val.shape,X_test.shape,y_test.shape

((1800, 8114), (1800,), (39, 8114), (39,))

In [32]:
# only tf.keras is used below, so a single import is enough
import tensorflow as tf

In [33]:
number_of_feats=X_train.shape[1]

In [34]:
batch_size=32
epochs=10

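# MLP from the specification: Dense(512, tanh) -> Dropout(0.5) -> Dense(2)
model=tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(number_of_feats,)))
model.add(tf.keras.layers.Dense(512, activation='tanh'))
model.add(tf.keras.layers.Dropout(0.5))
# no activation on the output layer: it returns raw scores (logits) for the two classes
model.add(tf.keras.layers.Dense(num_of_classes))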

In [35]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               4154880   
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 1026      
Total params: 4,155,906
Trainable params: 4,155,906
Non-trainable params: 0
_________________________________________________________________


In [36]:
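# from_logits=True because the output layer has no softmax activation
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              optimizer='adam', metrics=['accuracy'])

# validate on the held-out validation split so that the test set stays unseen during training
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,validation_data=(X_val,y_val))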

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1c82ab03940>

In [37]:
# predict all test samples in one call, then take the highest-scoring class per row
y_preds=np.argmax(model.predict(X_test),axis=1)
for y_tst,y_pred in zip(y_test,y_preds):
    print(f'true_label   {y_tst}----------------->pred_label   {y_pred}')

true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   1----------------->pred_label   1
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   1----------------->pred_label   1
true_label   1----------------->pred_label   1
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   1----------------->pred_label   1
true_label   0----------------->pred_label   0
true_label   0----------------->pred_label   0
true_label   