# NLP (Natural Language Processing)

Via the site 'Beingdatum.com', I followed the course 'Guide on Deep Learning for NLP'.

This notebook is a summary of that course, which I will use as a reference when questions come up in an NLP project.

In [1]:
!pip install nltk



In [2]:
import nltk
#nltk.download()

## 1. Tokenization
Tokenization is the process of segmenting running text into words and sentences.

### Manually tokenize a textfile 

In [3]:
filename = 'filename.txt'  # a text file in which we have written 'Hello World'
# read the whole file; the with-block closes it automatically
with open(filename, 'rt') as file:
    text = file.read()
# split into words by white space
words = text.split()
# convert to lowercase
words = [word.lower() for word in words]
print(words)

['hello', 'world']


### Tokenize a textfile with NLTK (Natural Language Toolkit)

In [4]:
from nltk.tokenize import word_tokenize  # requires NLTK's Punkt tokenizer data (see nltk.download above)

with open(filename, 'rt') as file:
    text = file.read()
# split into words
tokens = word_tokenize(text)
print(tokens)

['Hello', 'World']
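
Apart from the lowercasing, the two outputs above look alike because the file contains no punctuation. As a rough illustration of where the approaches diverge, the sketch below compares whitespace splitting with a small regex tokenizer. The sample sentence and the regex are assumptions made for illustration, and the pattern is far simpler than what `word_tokenize` actually does.

```python
import re

sentence = "Hello, World!"

# Whitespace splitting keeps punctuation glued to the words.
whitespace_tokens = sentence.split()

# Illustrative pattern: runs of word characters, or single punctuation symbols.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['Hello,', 'World!']
print(regex_tokens)       # ['Hello', ',', 'World', '!']
```

Separating punctuation into its own tokens is what lets later steps count "Hello" and "Hello," as the same word.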


## 2. Create a Bag of Words model

To create the bag of words model, we build a matrix in which each column corresponds to one of the most frequent words in our dictionary and each row corresponds to a document or sentence.

### Step 1 Tokenize a text:

In [6]:
text = 'Deep learning methods are popular for natural language, primarily because they are delivering on their promise. Some of the first large demonstrations of the power of deep learning were in natural language processing, specifically speech recognition. More recently in machine translation.'

We will first preprocess the data, in order to:

- Convert the text to lower case.
- Replace all non-word characters (including punctuation) with spaces.
- Collapse repeated whitespace into a single space.

In [7]:
# import all needed libraries:
import nltk
import re
import numpy as np
dataset = nltk.sent_tokenize(text)  # split the text into three sentences
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()  # all text in lowercase
    dataset[i] = re.sub(r'\W', ' ', dataset[i])  # replace every non-word character (e.g. punctuation) with a space
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # collapse runs of whitespace into a single space

In [8]:
#output
dataset

['deep learning methods are popular for natural language primarily because they are delivering on their promise ',
 'some of the first large demonstrations of the power of deep learning were in natural language processing specifically speech recognition ',
 'more recently in machine translation ']

### Step 2 Obtaining most frequent words in our text: 


We will apply the following steps to generate our model:
- We declare a dictionary to hold our bag of words.
- Next, we tokenize each sentence into words.
- For each word in a sentence, we check whether the word already exists in our dictionary.
- If it does, we increment its count by 1. If it doesn’t, we add it to our dictionary and set its count to 1.

In [16]:
word2count = {} 
for data in dataset: 
    words = nltk.word_tokenize(data) 
    for word in words: 
        if word not in word2count:
            word2count[word] = 1
        else: 
            word2count[word] += 1
#show the output
import pandas as pd
table = pd.DataFrame(word2count, index=[0])
table.T

|           | 0 |
|-----------|---|
| deep      | 2 |
| learning  | 2 |
| methods   | 1 |
| are       | 2 |
| popular   | 1 |
| for       | 1 |
| natural   | 2 |
| language  | 2 |
| primarily | 1 |
| because   | 1 |
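
The same "add or increment" bookkeeping can also be sketched with `collections.Counter` from the standard library. This is a minimal sketch rather than the course's code: the two shortened sentences are made up for illustration, and plain `split()` stands in for `nltk.word_tokenize`.

```python
from collections import Counter

# Shortened, already-cleaned sentences (illustrative only).
sentences = [
    "deep learning methods are popular",
    "deep learning were in natural language processing",
]

# Counter adds unseen words with count 1 and increments known ones.
word_counts = Counter()
for sentence in sentences:
    word_counts.update(sentence.split())

print(word_counts["deep"])         # 2
print(word_counts.most_common(2))  # [('deep', 2), ('learning', 2)]
```

`most_common(n)` returns the `n` highest counts, with ties kept in the order the words were first seen.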


Our text contains a total of 41 words, 32 of them distinct. However, when processing large texts, the number of words could reach millions. We do not need to use all of them. Hence, we select a fixed number of the most frequently used words. To implement this we use:

In [19]:
import heapq
freq_words = heapq.nlargest(20, word2count, key=word2count.get) #20 denotes the number of words we want. In larger datasets we can set this to larger numbers
freq_words

['of',
 'deep',
 'learning',
 'are',
 'natural',
 'language',
 'the',
 'in',
 'methods',
 'popular',
 'for',
 'primarily',
 'because',
 'they',
 'delivering',
 'on',
 'their',
 'promise',
 'some',
 'first']

### Step 3  Building the Bag of Words model:
In this step we construct a vector for each sentence that records which of the frequent words it contains. For every frequent word, we set the entry to 1 if the word occurs in the sentence, else we set it to 0.

This can be implemented with the following code:

In [20]:
X = []
for data in dataset:
    tokens = set(nltk.word_tokenize(data))  # tokenize each sentence only once
    vector = []
    for word in freq_words:
        if word in tokens:
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)
X = np.asarray(X)

X

array([[0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
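
The matrix above is binary: it only records presence or absence. A common variant stores how often each frequent word occurs instead. The following is a minimal sketch of that variant, not part of the course material. The function name `count_vectors`, the tiny vocabulary and the two sentences are assumptions chosen for illustration.

```python
def count_vectors(sentences, vocabulary):
    """Return one row per sentence holding the count of each vocabulary word."""
    rows = []
    for sentence in sentences:
        tokens = sentence.split()
        rows.append([tokens.count(word) for word in vocabulary])
    return rows

vocab = ["of", "deep", "in"]
docs = [
    "some of the power of deep learning",
    "more recently in machine translation",
]

print(count_vectors(docs, vocab))  # [[2, 1, 0], [0, 0, 1]]
```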

## 3. N-Grams

N-grams are used extensively in text mining and natural language processing tasks. An n-gram is a sequence of N co-occurring words within a given window. When computing n-grams you typically move the window one word forward at a time (although you can move X words forward in more advanced scenarios).
- When N=1, these are called unigrams: essentially the individual words in a sentence.
- When N=2, these are called bigrams.
- When N=3, these are called trigrams.
- When N>3, they are usually referred to as four-grams, five-grams and so on.


#### How many N-grams in a sentence?
If X is the number of words in a given sentence K, the number of N-grams for sentence K is:

${Ngrams_K}= X - (N-1)$
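
As a quick check of the formula, take the three-word sentence 'This is sparta' (X = 3) and count its bigrams (N = 2):

$$Ngrams_K = X - (N-1) = 3 - (2-1) = 2$$

These two bigrams are 'This is' and 'is sparta'.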

In [22]:
# Python code for N-grams
import re

def generate_ngrams(text, n):
    # split the text into tokens on whitespace
    tokens = re.split(r"\s+", text.strip())
    ngrams = []
    # collect the n-grams by sliding a window of n tokens over the text
    for i in range(len(tokens) - n + 1):
        ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


In [24]:
generate_ngrams('This is sparta', 2)

['This is', 'is sparta']
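
The text above mentions that the window can also move more than one word at a time. A minimal sketch of that idea follows. The function name `ngrams_with_step` and its `step` parameter are hypothetical and only meant to illustrate the stride.

```python
def ngrams_with_step(text, n, step=1):
    """Collect n-grams, moving the window `step` tokens forward each time."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for i in range(0, len(tokens) - n + 1, step)
    ]

sentence = "the quick brown fox jumps"
print(ngrams_with_step(sentence, 2))
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']
print(ngrams_with_step(sentence, 2, step=2))
# ['the quick', 'brown fox']
```

With `step=1` the result has X - (N-1) = 4 entries, matching the formula. Larger steps skip windows and so return fewer n-grams.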

## 4. Word Embedding

Word embedding is a dense representation of words in the form of numeric vectors.

The word embedding representation is able to reveal many hidden relationships between words. For example, vector(“cat”) – vector(“kitten”) is similar to vector(“dog”) – vector(“puppy”).
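
To make the "similar offsets" idea concrete, here is a toy sketch. The 3-dimensional vectors are made up by hand so that the third dimension loosely encodes "adult vs. young". Real embeddings are learned from data and have many more dimensions.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not learned embeddings).
vectors = {
    "cat":    [0.9, 0.1, 0.8],
    "kitten": [0.9, 0.1, 0.2],
    "dog":    [0.1, 0.9, 0.8],
    "puppy":  [0.1, 0.9, 0.2],
}

def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_offset = subtract(vectors["cat"], vectors["kitten"])
dog_offset = subtract(vectors["dog"], vectors["puppy"])

# The two offsets point in the same direction, so the similarity is close to 1.
print(cosine_similarity(cat_offset, dog_offset))
```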

### Why do we use word embedding?
Words aren’t things that computers naturally understand. By encoding them in numeric form, we can apply mathematical rules and matrix operations to them. This is what makes them so useful in machine learning.

Take deep learning for example. By encoding words in numerical form, we can apply many deep learning architectures to text. Convolutional neural networks, for instance, have been applied to NLP tasks using word embeddings and have achieved strong results on many of them.

Even better, we can pre-train word embeddings that are applicable to many tasks.

### Examples of word embedding:
-  <span style="text-decoration: underline">**One-Hot-Encoding (Count Vectorizing)**</span>:

Create a vector that has as many dimensions as your corpus has unique words. Each unique word gets its own dimension and is represented by a 1 in that dimension and 0s everywhere else.
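
A minimal sketch of one-hot encoding in plain Python. The two-document corpus and the helper name `one_hot` are assumptions made for illustration.

```python
corpus = ["deep learning", "machine learning"]

# Each unique word gets one dimension; sorting makes the order reproducible.
vocabulary = sorted({word for doc in corpus for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector with a single 1 in the dimension that belongs to `word`."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(vocabulary)           # ['deep', 'learning', 'machine']
print(one_hot("learning"))  # [0, 1, 0]
```
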
- <span style="text-decoration: underline">**TF-IDF Transform**</span>:

TF-IDF vectors are related to one-hot encoded vectors. However, instead of just recording a count, they use a numeric weight, so a word isn’t simply present or absent. Each word is represented by its term frequency multiplied by its inverse document frequency.
In simpler terms, words that occur a lot but occur everywhere should be given very little weight or significance. Think of words like “the” or “and” in the English language: they don’t provide much value.
However, if a word appears rarely, or appears frequently but only in one or two places, then it is probably more important and should be weighted as such.
Again, this suffers from the downside of very high-dimensional representations that don’t capture semantic relatedness.
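
The weighting can be sketched in a few lines. This uses one common textbook variant, tf × log(N / df). Libraries often use smoothed or normalized versions, so the exact numbers will differ from theirs. The three toy documents and the function name `tf_idf` are assumptions for illustration.

```python
import math

# Three tiny, pre-tokenized documents (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf_idf(term, doc, docs):
    """Term frequency in `doc` times the log inverse document frequency.

    Assumes `term` occurs in at least one document.
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" occurs in every document, so its weight is zero.
print(tf_idf("the", docs[0], docs))
# "mat" occurs in only one document, so it gets a positive weight.
print(tf_idf("mat", docs[0], docs))
```
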
- <span style="text-decoration: underline">**Co-Occurrence Matrix**</span>:

A co-occurrence matrix is exactly what it sounds like: a giant matrix that is as long and as wide as the vocabulary size. If words occur together, they are marked with a positive entry. Otherwise, they have a 0.
It boils down to a numeric representation that simply asks the question “Do words occur together? If yes, then count this.”
And what can we already see becoming a big problem? A very large representation! If we thought that one-hot encoding was high dimensional, then co-occurrence is high dimensional squared. That’s a lot of data to store in memory.
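
A minimal sketch of counting co-occurrences within a sliding window. To sidestep the memory problem for this toy case, it stores only the non-zero entries in a dictionary rather than a full vocabulary-by-vocabulary matrix. The function name `cooccurrence_counts`, the `window` parameter and the sample sentences are assumptions for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often two words appear within `window` tokens of each other."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # Look only to the right, then record both directions (symmetric).
            for neighbor in tokens[i + 1:i + 1 + window]:
                counts[(word, neighbor)] += 1
                counts[(neighbor, word)] += 1
    return counts

counts = cooccurrence_counts(["deep learning is fun", "deep learning is popular"])
print(counts[("deep", "learning")])    # 2
print(counts.get(("deep", "fun"), 0))  # 0
```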

Advantages of Co-occurrence Matrix

- It preserves the semantic relationship between words, e.g. “man” and “woman” tend to be closer than “man” and “apple”.
- It is typically factorized with SVD, which can produce more accurate word vector representations than simpler count-based methods.
- Factorization is a well-defined problem and can be solved efficiently.
- It has to be computed only once and can be reused afterwards. In this sense, it is faster than the alternatives.

Disadvantages of Co-Occurrence Matrix

- It requires a huge amount of memory to store the co-occurrence matrix.
- However, this problem can be circumvented by factorizing the matrix outside the system, for example on a Hadoop cluster, and saving the result.

#### Example using count vectorizing / one-hot encoding

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

sample_text = ["Machine Learning: Introduction to Machine learning and hands-on experience on the various applications of ML",
"Deep Learning: Introduction to Deep learning & NLP"]
# these are two book titles

# CountVectorizer plain and simple:
# to create a CountVectorizer, we simply need to instantiate one.
# The documents are passed to fit_transform, not to the constructor.
cv = CountVectorizer()
count_vector = cv.fit_transform(sample_text)

What happens above is that the two book titles are preprocessed, tokenized and represented as a sparse matrix. By default, CountVectorizer does the following:

- lowercases your text (set lowercase=False if you don’t want lowercasing)
- uses utf-8 encoding
- performs tokenization (converts raw text to smaller units of text)
- uses word level tokenization (meaning each word is treated as a separate token)
- ignores single characters during tokenization (so say bye bye to words like ‘a’ and ‘I’)

Now, let’s look at the vocabulary (collection of unique words from our documents):

In [26]:
cv.vocabulary_

{'machine': 7,
 'learning': 6,
 'introduction': 5,
 'to': 13,
 'and': 0,
 'hands': 4,
 'on': 11,
 'experience': 3,
 'the': 12,
 'various': 14,
 'applications': 1,
 'of': 10,
 'ml': 8,
 'deep': 2,
 'nlp': 9}

As we are using all the defaults, these are all word-level tokens, lowercased. Note that the numbers here are not counts; they are the column positions of each word in the sparse vector.
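
To see where these positions come from, the vocabulary can be reproduced by hand. The sketch below lowercases the titles, extracts tokens with the same regular expression that CountVectorizer uses as its default `token_pattern` (two or more word characters), and numbers the sorted unique terms. It is a simplified re-implementation for illustration, not the library's actual code path.

```python
import re

titles = [
    "Machine Learning: Introduction to Machine learning and hands-on "
    "experience on the various applications of ML",
    "Deep Learning: Introduction to Deep learning & NLP",
]

# Tokens are runs of two or more word characters, so '&' and
# single-character words are dropped.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

tokens = {tok for title in titles for tok in token_pattern.findall(title.lower())}

# Terms are sorted alphabetically and numbered in that order.
vocabulary = {term: i for i, term in enumerate(sorted(tokens))}

print(len(vocabulary))        # 15
print(vocabulary["machine"])  # 7
```

Because the pattern splits on the hyphen, 'hands-on' becomes the two tokens 'hands' and 'on'.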

In [27]:
#let's check the shape:
count_vector.shape

(2, 15)

We have two rows (two titles) and 15 unique words!

### CountVectorizer and Stop Words

Now, the first thing you may want to do is eliminate stop words from your text, as they have limited predictive power and may not help with downstream tasks such as text classification. Stop word removal is a breeze with CountVectorizer and can be done in several ways:

- Use a custom stop word list that you provide
- Use sklearn’s built-in English stop word list (not recommended)
- Create corpus-specific stop words using max_df and min_df (highly recommended)

Now let’s look at these 3 ways of using stop words.

In [28]:
#Custom stop word list:
cv = CountVectorizer(stop_words=["all","on","the","is","and","to"])
count_vector=cv.fit_transform(sample_text)
count_vector.shape

(2, 11)

In [29]:
# The shape has changed from 15 unique words to 11 because the stop words that occur in the titles have been removed.
# Let's see which stop word list the vectorizer has stored:
cv.stop_words

# Note that we can also load stop words from a file into a list and supply that as the stop word list.

['all', 'on', 'the', 'is', 'and', 'to']

#### Stop Words using MIN_DF:

The goal of MIN_DF is to ignore words that have too few occurrences to be considered meaningful. For example, in your text you may have names of people that appear in only one or two documents. In some applications, this may qualify as noise and could be eliminated from further analysis.

Instead of using a minimum term frequency (total occurrences of a word) to eliminate words, MIN_DF looks at _**how many documents contained a term**_, better known as _**document frequency**_. The MIN_DF value can be an _**absolute value**_ (e.g. 1, 2, 3, 4) or a _**value representing proportion of documents**_ (e.g. 0.25, meaning: ignore words that have appeared in fewer than 25% of the documents).

In [31]:
#Eliminating words that appeared in fewer than 2 documents:

cv = CountVectorizer(min_df=2)
count_vector=cv.fit_transform(sample_text)
cv.stop_words_

{'and',
 'applications',
 'deep',
 'experience',
 'hands',
 'machine',
 'ml',
 'nlp',
 'of',
 'on',
 'the',
 'various'}

In [32]:
#To see what’s remaining, all we need to do is check the vocabulary again
cv.vocabulary_

{'learning': 1, 'introduction': 0, 'to': 2}

#### Stop Words using MAX_DF:

Just as we ignored words that were too rare with MIN_DF, we can ignore words that are too common with MAX_DF. 

MAX_DF looks at _**how many documents contained a term**_, and if it exceeds the MAX_DF threshold, then the term is eliminated from consideration. The MAX_DF value can be an _**absolute value**_ (e.g. 1, 2, 3, 4) or a _**value representing proportion of documents**_ (e.g. 0.85, meaning: ignore words that appeared in more than 85% of the documents, as they are too common).

In [34]:
cv = CountVectorizer(max_df=0.50)
count_vector=cv.fit_transform(sample_text)
cv.stop_words_

{'introduction', 'learning', 'to'}

To see which words have been eliminated, you can use cv.stop_words_ (see output above).

In this example, all the words that appeared in both book titles have been eliminated.
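
The document-frequency filtering behind MIN_DF and MAX_DF can be sketched without sklearn. The helper `filter_by_df` is hypothetical, and the two token lists are abbreviated versions of the book titles. As in the description above, integer thresholds are treated as absolute document counts and float thresholds as proportions of the corpus.

```python
from collections import Counter

# Abbreviated, pre-tokenized versions of the two titles (illustrative only).
docs = [
    ["machine", "learning", "introduction", "to", "ml"],
    ["deep", "learning", "introduction", "to", "nlp"],
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def filter_by_df(doc_freq, n_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df."""
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

print(filter_by_df(doc_freq, len(docs), min_df=2))
# ['introduction', 'learning', 'to']
print(filter_by_df(doc_freq, len(docs), max_df=0.5))
# ['deep', 'machine', 'ml', 'nlp']
```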

### Prediction-based methods:

#### 1. Continuous Bag of Words (CBOW) model
CBOW learns to predict a word from its context. The context may be a single word or multiple words for a given target word.

- Example text “The man jumped over the wall.”

CBOW's approach is to treat {“The”, “man”, “over”, “the”, “wall”} as the context and, from these words, predict or generate the center word “jumped”.

#### 2. Skip-gram model

- Example text “The man jumped over the wall.”

Skip-gram's approach is to create a model such that, given the center word “jumped”, the model will be able to predict or generate the surrounding words “The”, “man”, “over”, “the”, “wall”. Here “jumped” is the input (center) word, and the surrounding words are the context it predicts.
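
The difference between the two models shows up in how training examples are built from the same sentence. The sketch below is an illustration only: the function name `training_pairs` and the `window` parameter are assumptions, and real implementations add sampling and other refinements on top. With `window=2` the word “wall” falls outside the context of “jumped”, while `window=3` would include it.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, center_word) for each position in `tokens`."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center

tokens = "the man jumped over the wall".split()

for context, center in training_pairs(tokens):
    if center == "jumped":
        # CBOW: predict the center word from all of its context words.
        cbow_example = (context, center)
        # Skip-gram: predict each context word from the center word.
        skipgram_examples = [(center, word) for word in context]

print(cbow_example)
# (['the', 'man', 'over', 'the'], 'jumped')
print(skipgram_examples)
# [('jumped', 'the'), ('jumped', 'man'), ('jumped', 'over'), ('jumped', 'the')]
```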

Advantages/Disadvantages of CBOW and Skip-gram:

- Being probabilistic in nature, these methods generally tend to perform better than deterministic methods.
- They are low on memory. They don’t have huge RAM requirements like the co-occurrence approach, which needs to store three huge matrices.
- Though CBOW (predict target from context) and skip-gram (predict context words from target) are simply inverses of each other, each has its advantages and disadvantages. Since CBOW can use many context words to predict a single target word, it can essentially _**smooth out**_ over the distribution. This acts like regularization and offers very good performance when the input data is not so large. The skip-gram model, however, is more _**fine grained**_, so we can extract more information and obtain more accurate embeddings when we have a large data set (large data is always the best regularizer). Skip-gram with negative sampling generally outperforms the other methods.

## Word2Vec using Gensim library

Let’s create a corpus from a single Wikipedia article. To do so, we need to scrape Wikipedia using BeautifulSoup.

In [38]:
!pip install lxml
#Used to parse XML and HTML



In [40]:
#The article we are going to scrape is the Wikipedia article on Artificial Intelligence. 
#Let’s write a Python Script to scrape the article from Wikipedia

import bs4 as bs
import urllib.request
import re
import nltk

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text

In the script above, we first download the Wikipedia article using the urlopen function of the request module of the urllib library. We then read the article content and parse it using an object of the BeautifulSoup class. Wikipedia stores the text content of the article inside p tags. We use the find_all method of the BeautifulSoup object to fetch all the paragraph tags of the article.

Finally, we join all the paragraphs together and store the scraped article in the article_text variable for later use.

**Preprocessing**

The next step is to preprocess the content for the Word2Vec model. The following script preprocesses the text:

In [41]:
# Cleaning the text
processed_article = article_text.lower()
processed_article = re.sub('[^a-zA-Z]', ' ', processed_article )
processed_article = re.sub(r'\s+', ' ', processed_article)

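# Preparing the dataset
# Note: the punctuation was stripped above, so sent_tokenize has no sentence
# boundaries left and returns the article as one long sentence
all_sentences = nltk.sent_tokenize(processed_article)

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing Stop Words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of once per sentence
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stop_words]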