# [Getting Started with NLP](https://dphi.tech/bootcamps/getting-started-with-natural-language-processing?utm_source=header)
by [CSpanias](https://cspanias.github.io/aboutme/), 28/01 - 06/02/2022 <br>

Bootcamp organized by **[DPhi](https://dphi.tech/community/)**, lectures given by [**Dipanjan (DJ) Sarkar**](https://www.linkedin.com/in/dipanzan/) ([GitHub repo](https://github.com/dipanjanS/nlp_essentials)) <br>

## Fundamental Tutorials for NLP:
* [NLTK Book](https://www.nltk.org/book/)
* [spaCy Tutorials](https://course.spacy.io/en/chapter1)

# CONTENT
1. Text Wrangling
2. [Text Representation with Feature Engineering - Statistical Models](#TextRepresentation)
    1. [Preparing a Sample Corpus](#Sample)
    1. [Text Preprocessing](#TextPre)
    1. [Feature Engineering Techniques](#FeaEng)
        1. [Bag of Words Model](#BoW)
        2. [Bag of N-grams Model](#Ngrams)
        3. [TF-IDF](#TfIdf)
    1. [Document Similarity](#Similarity)
    1. [Clustering using Document Similarity Features](#KMeans)

In [1]:
# Install Dependencies

import nltk
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
#!pip install contractions
#!pip install textsearch

<a name="TextRepresentation"></a>
# 2. Text Representation with Feature Engineering - Statistical Models
**Feature engineering** is particularly important when working with **unstructured, textual data** because we need to **convert free-flowing text into numeric representations** that machine learning algorithms can work with.

<a name="Sample"></a>
## 2.1 Prepare a Sample Corpus
A **corpus** is a **collection of text documents** usually belonging to one or more subjects or domains.

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = 200

# prepare a sample corpus
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]

# create a list of categories to match the corpus
labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']

# convert corpus into array
corpus = np.array(corpus)

# create a pandas DataFrame
corpus_df = pd.DataFrame(
    # col_name: value
    {'Document': corpus, 'Category': labels})

# show dataframe
corpus_df

| | Document | Category |
|---|---|---|
| 0 | The sky is blue and beautiful. | weather |
| 1 | Love this blue and beautiful sky! | weather |
| 2 | The quick brown fox jumps over the lazy dog. | animals |
| 3 | A king's breakfast has sausages, ham, bacon, eggs, toast and beans | food |
| 4 | I love green eggs, ham, sausages and bacon! | food |
| 5 | The brown fox is quick and the blue dog is lazy! | animals |
| 6 | The sky is very blue and the sky is very beautiful today | weather |
| 7 | The dog is lazy but the brown fox is quick! | animals |


<a name="TextPre"></a>
## 2.2 Text Pre-processing

Since the focus of this unit is on feature engineering, we will build a simple text pre-processor that removes special characters, extra whitespace, digits and stopwords, and lowercases the text corpus.

**`re.I`** / **`re.A`**: more about **re flags** [here](https://docs.python.org/3/library/re.html#contents-of-module-re). <br>
**`np.vectorize(custom_function)`**: more about the **why of vectorization** [here](https://www.geeksforgeeks.org/vectorization-in-python/) and the official **documentation** [here](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html).

*For a step-by-step tutorial on text pre-processing, see [lecture 1](https://github.com/CSpanias/nlp_resources/blob/main/dphi_nlp_bootcamp/1_text_wrangling.ipynb).*
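
To make the two flags concrete, here is a minimal standard-library sketch. The sample string is made up for illustration and is not part of the bootcamp corpus. It shows how `re.A` restricts `\w` to ASCII, and it passes the flags to `re.sub` by keyword, because the fourth positional parameter of `re.sub` is `count`, not `flags`.

```python
import re

# hypothetical sample text ("Café Über 123 costs 5€!"), written with escapes
text = "Caf\u00e9 \u00dcber 123 costs 5\u20ac!"

# re.A (ASCII) makes \w match only [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, flags=re.A)
# without re.A, \w also matches accented letters
unicode_words = re.findall(r"\w+", text)

# keep only ASCII letters and whitespace; re.I lets [a-z] match upper case too
cleaned = re.sub(r"[^a-z\s]", "", text, flags=re.I | re.A)

print(ascii_words)
print(unicode_words)
print(cleaned.split())
```

The accented letters, digits and the currency sign are all dropped by the last call, which is the behaviour the pre-processor in this section relies on.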

In [11]:
import re

# load NLTK's list of stopwords
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    """Normalize a document."""
    
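    # remove special characters and digits (keep ASCII letters and whitespace)
    # flags must be passed by keyword: the 4th positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    # lower case
    doc = doc.lower()
    # strip leading and trailing whitespace
    doc = doc.strip()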
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # remove stopwords
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    
    # return normalized doc
    return doc

# vectorize function
normalize_corpus = np.vectorize(normalize_document)

# apply the vectorized function
norm_corpus = normalize_corpus(corpus)

# show normalized corpus
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')

<a name="FeaEng"></a>
## 2.3 Feature Engineering Techniques

<a name="BoW"></a>
### 2.3.1 Bag of Words Model
**`from sklearn.feature_extraction.text import CountVectorizer`**

**The simplest vector space representational model for unstructured text**. 

A vector space model is simply a mathematical model to **represent unstructured text as numeric vectors**, such that each dimension of the vector is a specific feature/attribute.
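
As a rough illustration of what "each dimension is a feature" means, the sketch below builds Bag of Words vectors by hand with the standard library only. It uses three documents from the normalized corpus; the helper name `bow_vector` is hypothetical and this is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

# three documents from the normalized corpus
docs = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "sky blue sky beautiful today",
]

# the vocabulary defines the dimensions: one column per unique word, sorted
vocab = sorted({token for doc in docs for token in doc.split()})

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Word order is discarded: only the counts per vocabulary word survive, which is why the model is called a "bag" of words.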

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate CountVectorizer
cv = CountVectorizer(min_df=0., max_df=1.)

# fit & transform text data
cv_matrix = cv.fit_transform(norm_corpus)

# convert matrix to array
cv_matrix = cv_matrix.toarray()

# show matrix
cv_matrix

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]],
      dtype=int64)

In [15]:
# get all unique words in the corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vocab = cv.get_feature_names_out()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

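| | bacon | beans | beautiful | blue | breakfast | brown | dog | eggs | fox | green | ham | jumps | kings | lazy | love | quick | sausages | sky | toast | today |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |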


<a name="Ngrams"></a>
### 2.3.2 Bag of N-Grams Model
An **N-gram** is a **collection of word tokens** from a text document such that these tokens are **contiguous** and occur in a **sequence**: a bigram is two adjacent words, a trigram three, and so on. The Bag of N-Grams model is an **extension of the BoW** model that uses N-grams, instead of single words, as features.
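
A minimal sketch of how N-grams can be extracted from a token list, using only the standard library. The function name `ngrams` is hypothetical; `CountVectorizer(ngram_range=...)` does this for you, so the code is only meant to show the idea.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# one document from the normalized corpus
tokens = "quick brown fox jumps lazy dog".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A document with `m` tokens yields `m - n + 1` N-grams, so the feature space grows quickly as `n` increases.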

In [10]:
# set ngram_range=(1, 2) to get unigrams as well as bigrams

# instantiate CountVectorizer to obtain just bigrams
bv = CountVectorizer(ngram_range=(2,2))

# fit & transform data
bv_matrix = bv.fit_transform(norm_corpus)

# convert to array
bv_matrix = bv_matrix.toarray()

# get feature names (get_feature_names() was removed in scikit-learn 1.2)
vocab = bv.get_feature_names_out()

# create a pd DF
pd.DataFrame(bv_matrix, columns=vocab)

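| | bacon eggs | beautiful sky | beautiful today | blue beautiful | blue dog | blue sky | breakfast sausages | brown fox | dog lazy | eggs ham | ... | lazy dog | love blue | love green | quick blue | quick brown | sausages bacon | sausages ham | sky beautiful | sky blue | toast beans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |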


<a name="TfIdf"></a>
### 2.3.3 TF-IDF Model
**`from sklearn.feature_extraction.text import TfidfVectorizer`**

With **BoW**, **terms that occur frequently across all documents** tend to **overshadow rarer terms** in the feature set. 

The **TF-IDF** model counters this by scaling each term count with a factor built from two metrics:
1. **Term Frequency** (TF) is how often a word occurs in a document, which is exactly the count produced by the Bag of Words model. 
1. **Inverse Document Frequency** (IDF) is the log of the total number of documents in the corpus divided by the word's document frequency, that is, the number of documents in which the word occurs. 


>**TF-IDF** = TF x IDF

*More info about TF-IDF [here](https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3).*
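
The sketch below computes TF-IDF by hand for the normalized corpus, using only the standard library. It assumes scikit-learn's default settings for `TfidfVectorizer` rather than the textbook formula: a smoothed IDF, `ln((1 + N) / (1 + DF)) + 1`, followed by L2 normalization of each document vector. The helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# normalized corpus from section 2.2
docs = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "kings breakfast sausages ham bacon eggs toast beans",
    "love green eggs ham sausages bacon",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
    "dog lazy brown fox quick",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one document, L2-normalized."""
    tf = Counter(tokens)
    # smoothed IDF (assumed to mirror scikit-learn's smooth_idf=True default)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

doc0 = tfidf(tokenized[0])
print({t: round(w, 2) for t, w in sorted(doc0.items())})
# → {'beautiful': 0.6, 'blue': 0.53, 'sky': 0.6}
```

"blue" occurs in four documents while "sky" and "beautiful" occur in three, so "blue" gets the lower weight even though all three words appear once in the document.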

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# instantiate TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)

# fit & transform data
tv_matrix = tv.fit_transform(norm_corpus)

# convert to array
tv_matrix = tv_matrix.toarray()

# get feature names (get_feature_names() was removed in scikit-learn 1.2)
vocab = tv.get_feature_names_out()

# create a pd DF
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

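| | bacon | beans | beautiful | blue | breakfast | brown | dog | eggs | fox | green | ham | jumps | kings | lazy | love | quick | sausages | sky | toast | today |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.6 | 0.53 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.49 | 0.43 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57 | 0.0 | 0.0 | 0.49 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.38 | 0.38 | 0.0 | 0.38 | 0.0 | 0.0 | 0.53 | 0.0 | 0.38 | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.32 | 0.38 | 0.0 | 0.0 | 0.38 | 0.0 | 0.0 | 0.32 | 0.0 | 0.0 | 0.32 | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.32 | 0.0 | 0.38 | 0.0 |
| 4 | 0.39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.39 | 0.0 | 0.47 | 0.39 | 0.0 | 0.0 | 0.0 | 0.39 | 0.0 | 0.39 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.37 | 0.0 | 0.42 | 0.42 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.42 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.36 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.72 | 0.0 | 0.5 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.45 | 0.45 | 0.0 | 0.45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.45 | 0.0 | 0.45 | 0.0 | 0.0 | 0.0 | 0.0 |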


<a name="Similarity"></a>
## 2.4 Document Similarity
**`from sklearn.metrics.pairwise import cosine_similarity`**

**Document similarity** uses a **distance** or **similarity** metric to measure **how similar a text document is to any other document(s)**, based on features extracted from the documents, such as bag of words or TF-IDF.

**Pairwise document similarity** in a corpus involves computing document similarity for each pair of documents in a corpus. 

There are **several similarity and distance metrics** that are used to compute document similarity. 

In our analysis, we will use **cosine similarity** to compare documents pairwise, based on their **TF-IDF** feature vectors.

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

# compute cosine similarity for corpus
similarity_matrix = cosine_similarity(tv_matrix)

# create a pd DF
similarity_df = pd.DataFrame(similarity_matrix)

# show DF
similarity_df

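| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.820599 | 0.0 | 0.0 | 0.0 | 0.192353 | 0.817246 | 0.0 |
| 1 | 0.820599 | 1.0 | 0.0 | 0.0 | 0.225489 | 0.157845 | 0.670631 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.791821 | 0.0 | 0.850516 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.506866 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.225489 | 0.0 | 0.506866 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.192353 | 0.157845 | 0.791821 | 0.0 | 0.0 | 1.0 | 0.115488 | 0.930989 |
| 6 | 0.817246 | 0.670631 | 0.0 | 0.0 | 0.0 | 0.115488 | 1.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.850516 | 0.0 | 0.0 | 0.930989 | 0.0 | 1.0 |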


**Cosine similarity** is the cosine of the **angle between the feature vectors** of two text documents. **The smaller** the angle (the closer the cosine is to 1), the **more similar** the documents are; a value of 0 means the documents share no features at all.
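
The definition can be sketched in a few lines of standard-library Python. The function name `cosine` and the third document are hypothetical, and the vectors are raw word counts rather than TF-IDF weights, so the values differ from the matrix above.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# count vectors over the vocabulary [beautiful, blue, dog, fox, love, sky]
doc_a = [1, 1, 0, 0, 0, 1]  # 'sky blue beautiful'
doc_b = [1, 1, 0, 0, 1, 1]  # 'love blue beautiful sky'
doc_c = [0, 0, 1, 1, 0, 0]  # 'fox dog' (made-up document)

print(round(cosine(doc_a, doc_b), 3))  # three shared words out of 3 and 4
print(cosine(doc_a, doc_c))            # no shared words
```

Because the measure depends only on the angle, scaling a vector (for example, repeating a document twice) does not change its similarity to other documents.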

The similarity matrix shows that documents 0, 1 and 6 are very similar to one another, as are documents 2, 5 and 7. Documents 3 and 4 are only moderately similar to each other (about 0.51), but that is still clearly stronger than their similarity to any other document. These groups of **similar documents therefore share some features**. 
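
The same reading of the matrix can be done programmatically. The sketch below hard-codes the similarity values from the output above, rounded to two decimals, and lists every document pair whose similarity reaches a cut-off. The cut-off of 0.5 is an arbitrary assumption chosen for illustration.

```python
# cosine similarities copied from the matrix above, rounded to 2 decimals
sim = [
    [1.00, 0.82, 0.00, 0.00, 0.00, 0.19, 0.82, 0.00],
    [0.82, 1.00, 0.00, 0.00, 0.23, 0.16, 0.67, 0.00],
    [0.00, 0.00, 1.00, 0.00, 0.00, 0.79, 0.00, 0.85],
    [0.00, 0.00, 0.00, 1.00, 0.51, 0.00, 0.00, 0.00],
    [0.00, 0.23, 0.00, 0.51, 1.00, 0.00, 0.00, 0.00],
    [0.19, 0.16, 0.79, 0.00, 0.00, 1.00, 0.12, 0.93],
    [0.82, 0.67, 0.00, 0.00, 0.00, 0.12, 1.00, 0.00],
    [0.00, 0.00, 0.85, 0.00, 0.00, 0.93, 0.00, 1.00],
]

THRESHOLD = 0.5  # hypothetical cut-off for "similar"

# the matrix is symmetric, so only pairs with i < j are inspected
similar_pairs = [
    (i, j)
    for i in range(len(sim))
    for j in range(i + 1, len(sim))
    if sim[i][j] >= THRESHOLD
]

print(similar_pairs)
# → [(0, 1), (0, 6), (1, 6), (2, 5), (2, 7), (3, 4), (5, 7)]
```

The pairs fall into three groups, {0, 1, 6}, {2, 5, 7} and {3, 4}, which correspond to the weather, animals and food categories of the sample corpus.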

This is a typical **grouping or clustering** problem that **unsupervised learning** can solve, which is especially useful when you are dealing with huge corpora of millions of text documents.

<a name="KMeans"></a>
## 2.5 Clustering using Document Similarity Features
**`from sklearn.cluster import KMeans`**

A very popular partition-based clustering method is **K-means clustering**, which groups documents based on their similarity-based feature representations. 

In K-means clustering, we have an **input parameter** `k`, which specifies the **number of clusters** it will output using the document features. 

K-means is a **centroid-based clustering method** that tries to split the documents into **clusters of equal variance**. It does so by **minimizing the within-cluster sum of squares** (aka **inertia**), that is, the summed squared distance between each document and the centroid of its cluster. 
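
To show what "centroid-based" and "inertia" mean in practice, here is a minimal sketch of the classic K-means iteration (Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation: the function name `kmeans`, the 2-D toy points and the hand-picked initial centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    def nearest(p):
        # index of the centroid closest to point p (Euclidean distance)
        return min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))

    for _ in range(n_iter):
        labels = [nearest(p) for p in points]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # move the centroid to the mean of its members (keep it if the cluster is empty)
            new_centroids.append(
                [sum(col) / len(members) for col in zip(*members)] if members else old
            )
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids

    labels = [nearest(p) for p in points]
    # inertia: sum of squared distances from each point to its cluster centroid
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia

# toy data: three well-separated pairs of points
points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0], [9.2, 1.2]]
initial = [points[0], points[2], points[4]]  # hand-picked starting centroids

labels, centroids, inertia = kmeans(points, initial)
print(labels)
print(round(inertia, 2))
```

With real data the result depends on the starting centroids, which is why library implementations use smarter initialization and a `random_state` for reproducibility.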

In [21]:
from sklearn.cluster import KMeans

# instantiate KMeans
km = KMeans(n_clusters=3, random_state=0)

# fit the model (fit_transform would also return cluster distances, which are not needed here)
km.fit(similarity_matrix)

# assign the 3 cluster labels of the model to a variable
cluster_labels = km.labels_

# create a pandas DataFrame for cluster labels
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])

# concatenate corpus_df with the above dataframe
pd.concat([corpus_df, cluster_labels], axis=1)

| | Document | Category | ClusterLabel |
|---|---|---|---|
| 0 | The sky is blue and beautiful. | weather | 2 |
| 1 | Love this blue and beautiful sky! | weather | 2 |
| 2 | The quick brown fox jumps over the lazy dog. | animals | 1 |
| 3 | A king's breakfast has sausages, ham, bacon, eggs, toast and beans | food | 0 |
| 4 | I love green eggs, ham, sausages and bacon! | food | 0 |
| 5 | The brown fox is quick and the blue dog is lazy! | animals | 1 |
| 6 | The sky is very blue and the sky is very beautiful today | weather | 2 |
| 7 | The dog is lazy but the brown fox is quick! | animals | 1 |
