# Exploring common applications of NLP
While I scratch things down for this exploration, it will be messy. I'm looking to dive into functionality first and polish later.

Some ideas for NLP Applications:

* spell checking and grammatical correction
* text classification, summarization, mining
* information retrieval and information extraction
* question answering
* sentiment analysis
* support applications, such as: stemming, POS tagging, semantic tagging, and partial parsing
* natural language programming code generators, query generators
* machine translation
* speech analysis and generation systems
* conversational agents (e.g., chat bots)
* document generation (or computer support in document writing)

Ideas taken from my notes from CSCI 4152/6509 @ Dalhousie University, taught by Vlado Keselj

In [82]:
#Imports
import torch
import numpy as np
import pandas as pd
import spacy
import nltk
import matplotlib.pyplot as plt
nlp = spacy.load("en_core_web_sm")

# Let's Talk Tensors
I will likely be using PyTorch for my NLP models, so going over the basics is always useful.

### Basics

In [1]:
#Helper function for tensor info

def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

In [2]:
import torch
#Create an uninitialized tensor by specifying dimensions (values are whatever was in memory)
describe(torch.Tensor(2, 3))
#Randomly initialized tensors
describe(torch.rand(2,3)) # uniform random on [0, 1)
describe(torch.randn(2,3))  # standard normal

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[-5.5196e-13,  4.5886e-41, -1.4018e+05],
        [ 4.5886e-41, -5.4010e-13,  4.5886e-41]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0.6312, 0.2235, 0.7323],
        [0.5298, 0.6733, 0.7855]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[-0.1161, -0.3188,  0.1020],
        [-0.4853, -0.0160,  1.5869]])


In [3]:
#Tensors filled with same scalar
describe(torch.zeros(2,3))
#or
x = torch.ones(2,3)
describe(x)
x.fill_(5) # trailing _ indicates an in-place operation that modifies the tensor's content
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0., 0., 0.],
        [0., 0., 0.]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1., 1., 1.],
        [1., 1., 1.]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[5., 5., 5.],
        [5., 5., 5.]])


In [4]:
#Using a list
x = torch.Tensor([[1,2,3], [4,5,6]])
describe(x)

#To/from NumPy
import numpy as np
y = np.random.rand(2,3)
describe(torch.from_numpy(y))

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]])
Type: torch.DoubleTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0.8535, 0.0760, 0.6408],
        [0.4872, 0.6421, 0.9031]], dtype=torch.float64)


In [5]:
#Operations
x = torch.randn(2,3)
describe(x)
describe(torch.add(x, x))
#or
describe(x + x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[-1.3372, -0.0507,  1.3860],
        [-1.0321,  0.7932, -1.0762]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[-2.6744, -0.1014,  2.7720],
        [-2.0641,  1.5864, -2.1524]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[-2.6744, -0.1014,  2.7720],
        [-2.0641,  1.5864, -2.1524]])


In [6]:
#Dimension operations
x = torch.arange(6) #Initialization with a range of values
describe(x)
#changing the 'view' of the tensor
x = x.view(2,3)
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([6])
Values: 
tensor([0, 1, 2, 3, 4, 5])
Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])


In [7]:
#Dim here = dimension to reduce
describe(torch.sum(x, dim=0)) #Reduce rows = sum cols
describe(torch.sum(x, dim=1)) #Reduce cols = sum rows

Type: torch.LongTensor
Shape/size: torch.Size([3])
Values: 
tensor([3, 5, 7])
Type: torch.LongTensor
Shape/size: torch.Size([2])
Values: 
tensor([ 3, 12])


In [8]:
#Indexing/slicing

x = torch.arange(6).view(2,3)
describe(x)
describe(x[:1,:2]) #Up to row 1, up to col 2 (not inclusive)
describe(x[0][1]) #row 0, column 1

Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([1, 2])
Values: 
tensor([[0, 1]])
Type: torch.LongTensor
Shape/size: torch.Size([])
Values: 
1


In [9]:
#Fancy indexing (non contiguous)
indices = torch.LongTensor([0,2]) #Long tensor required for indexing
describe(torch.index_select(x, dim=1, index=indices)) 

indices = torch.LongTensor([0,0])
describe(torch.index_select(x, dim=0, index=indices)) 

Type: torch.LongTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[0, 2],
        [3, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [0, 1, 2]])


In [13]:
#Concat
x = torch.arange(6).view(2,3)
describe(x)
describe(torch.cat([x, x], dim=0)) #Append rows
describe(torch.cat([x,x], dim=1)) #Append columns
describe(torch.stack([x,x])) #Appending 2d tensor, creates new dimension

Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([4, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5],
        [0, 1, 2],
        [3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 6])
Values: 
tensor([[0, 1, 2, 0, 1, 2],
        [3, 4, 5, 3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 2, 3])
Values: 
tensor([[[0, 1, 2],
         [3, 4, 5]],

        [[0, 1, 2],
         [3, 4, 5]]])


In [16]:
#Linear algebra
x1 = torch.arange(6, dtype=torch.float32).view(2,3)
describe(x1)
x2 = torch.ones(3,2)
x2[:,1] += 1
describe(x2)
describe(torch.mm(x1, x2))


Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0., 1., 2.],
        [3., 4., 5.]])
Type: torch.FloatTensor
Shape/size: torch.Size([3, 2])
Values: 
tensor([[1., 2.],
        [1., 2.],
        [1., 2.]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[ 3.,  6.],
        [12., 24.]])


### Tensors and Computational Graphs

In [21]:
#Gradient bookkeeping

#requires_grad=True tells PyTorch to track operations on this tensor
#so it can record the gradient function and compute gradients later
x = torch.ones(2, 2, requires_grad=True)
describe(x)
print(x.grad is None)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
True


In [22]:
#After using x in computation
y = (x + 2) * (x + 5) + 3
describe(y)
print(x.grad is None)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[21., 21.],
        [21., 21.]], grad_fn=<AddBackward0>)
True


In [23]:
z = y.mean()
describe(z)
#Initiate a backward pass (would be training for a model)
z.backward()
print(x.grad is None)

Type: torch.FloatTensor
Shape/size: torch.Size([])
Values: 
21.0
False


"When you create a tensor with requires_grad=True, you are requiring PyTorch to manage bookkeeping information that computes gradients. First, PyTorch will keep track of the values of the forward pass. Then, at the end of the computations, a single scalar is used to compute a backward pass. The backward pass is initiated by using the backward() method on a tensor resulting from the evaluation of a loss function.
The backward pass computes a gradient value for a tensor object that participated in the forward pass.
In general, the gradient is a value that represents the slope of a function output with respect to the function input. In the computational graph setting, gradients exist for each parameter in the model and can be thought of as the parameter’s contribution to the error signal. In PyTorch, you can access the gradients for the nodes in the computational graph by using the .grad member variable. Optimizers use the .grad variable to update the values of the parameters." (pp. 23-24, Rao & McMahan, 2019)
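
To make the quote concrete, here is a minimal sketch that reruns the computation from the cells above and checks `x.grad` against the derivative worked out by hand. Since `y = (x + 2) * (x + 5) + 3 = x^2 + 7x + 13` and `z` is the mean over 4 elements, each entry of the gradient should be `(2x + 7) / 4`. The manual update at the end is only an illustration of what an optimizer does with `.grad`; the learning rate `lr` is a made-up value.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = (x + 2) * (x + 5) + 3
z = y.mean()
z.backward()  # fills x.grad with dz/dx

# Hand-derived gradient: d/dx of mean(x^2 + 7x + 13) over 4 elements
expected = (2 * x.detach() + 7) / x.numel()
grad_before = x.grad.clone()
print(grad_before)
print(torch.allclose(grad_before, expected))

# What an optimizer roughly does: step against the gradient, then reset it
lr = 0.1  # hypothetical learning rate, chosen only for illustration
with torch.no_grad():
    x -= lr * x.grad
x.grad.zero_()
print(x)
```

Every entry of the gradient is 2.25 here, so the manual step moves each value of `x` from 1.0 to 0.775.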

### Additional Tensor Exercises

In [48]:
#1. Create a 2D tensor and then add a dimension of size 1 inserted at dimension 0.
a = torch.rand(3, 3)
describe(a)
a = a.unsqueeze(0) #unsqueeze returns a new tensor, so reassign (unsqueeze_ is the in-place variant)
describe(a)
#2. Remove the extra dimension you just added to the previous tensor.
a = a.squeeze(0)
describe(a)
#3. Create a random tensor of shape 5x3 in the interval [3, 7)
a = torch.rand(5,3).uniform_(3,7)
describe(a)
#4. Create a tensor with values from a normal distribution (mean=0, std=1).
a = torch.randn(1,2)
describe(a)


Type: torch.FloatTensor
Shape/size: torch.Size([3, 3])
Values: 
tensor([[0.1103, 0.8470, 0.3410],
        [0.0517, 0.3652, 0.4733],
        [0.7898, 0.9979, 0.7675]])
Type: torch.FloatTensor
Shape/size: torch.Size([1, 3, 3])
Values: 
tensor([[[0.1103, 0.8470, 0.3410],
         [0.0517, 0.3652, 0.4733],
         [0.7898, 0.9979, 0.7675]]])
Type: torch.FloatTensor
Shape/size: torch.Size([3, 3])
Values: 
tensor([[0.1103, 0.8470, 0.3410],
        [0.0517, 0.3652, 0.4733],
        [0.7898, 0.9979, 0.7675]])
Type: torch.FloatTensor
Shape/size: torch.Size([5, 3])
Values: 
tensor([[3.4073, 6.6782, 4.9442],
        [5.7531, 4.3199, 6.4622],
        [3.0486, 6.2562, 6.1288],
        [6.6474, 4.1279, 5.8314],
        [5.7852, 3.5584, 3.9229]])
Type: torch.FloatTensor
Shape/size: torch.Size([1, 2])
Values: 
tensor([[0.8900, 1.6407]])


In [53]:
#5. Retrieve the indexes of all the nonzero elements in the tensor torch.Tensor([1, 1, 1, 0, 1]).
a = torch.Tensor([1, 1, 1, 0, 1])
print(torch.nonzero(a))
#6. Create a random tensor of size (3,1) and then horizontally stack four copies together.
a = torch.rand(3,1)
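a.expand(3, 4) #expand returns a new view and does not modify a; assign the result (a = a.expand(3, 4)) to keep the (3, 4) shape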
describe(a)
#7. Return the batch matrix-matrix product of two three-dimensional matrices (a=torch.rand(3,4,5), b=torch.rand(3,5,4)).
a = torch.rand(3, 4, 5)
b = torch.rand(3, 5, 4)
c = torch.bmm(a, b)
print('Batch matrix-matrix product:')
describe(c)

tensor([[0],
        [1],
        [2],
        [4]])
Type: torch.FloatTensor
Shape/size: torch.Size([3, 1])
Values: 
tensor([[0.8143],
        [0.8853],
        [0.9599]])
Batch matrix-matrix product:
Type: torch.FloatTensor
Shape/size: torch.Size([3, 4, 4])
Values: 
tensor([[[1.5171, 1.0205, 1.9006, 1.3833],
         [1.0590, 1.1029, 1.5360, 1.2021],
         [0.9381, 0.6295, 1.2227, 0.8907],
         [2.2295, 1.4544, 2.6187, 2.0427]],

        [[2.3655, 1.2813, 1.0573, 2.3138],
         [1.1838, 0.5576, 0.5598, 1.1609],
         [1.6539, 0.8487, 0.8366, 1.6434],
         [2.6953, 1.6405, 1.5004, 2.6969]],

        [[0.4632, 0.6717, 1.3089, 0.9911],
         [0.3524, 0.6966, 1.5733, 0.7359],
         [0.9599, 1.1777, 1.7903, 1.0912],
         [1.3014, 1.4758, 2.0686, 1.3581]]])
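
A note on exercise 6: the recorded output for `a` is still `(3, 1)` because the result of `expand` was never assigned. Here is a small sketch of three ways to get the four side-by-side copies. The variable names are my own, and which option is best depends on whether a real copy of the data is needed.

```python
import torch

a = torch.rand(3, 1)

# Option 1: concatenate four copies along the column dimension (copies the data)
stacked = torch.cat([a, a, a, a], dim=1)

# Option 2: expand returns a broadcast view without copying; the result must be assigned
expanded = a.expand(3, 4)

# Option 3: repeat tiles the tensor (also copies the data)
repeated = a.repeat(1, 4)

print(stacked.shape, expanded.shape, repeated.shape)
print(torch.equal(stacked, expanded) and torch.equal(stacked, repeated))
```

Since `expanded` shares memory with `a`, writing into it in place is unsafe; `cat` or `repeat` is the safer choice if the result will be modified.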


In [56]:
#8. Return the batch matrix-matrix product of a 3D matrix and a 2D matrix (a=torch.rand(3,4,5), b=torch.rand(5,4)).
b = torch.rand(5,4)
c = torch.bmm(a, b.unsqueeze(0).expand(a.size(0), *b.size()))
print('Batch matrix-matrix product of 3D & 2D:')
describe(c)

Batch matrix-matrix product of 3D & 2D:
Type: torch.FloatTensor
Shape/size: torch.Size([3, 4, 4])
Values: 
tensor([[[1.0435, 1.4942, 1.5699, 1.4485],
         [0.7644, 0.9854, 1.0839, 0.8961],
         [0.7661, 0.8133, 0.6325, 1.1026],
         [1.5191, 1.9652, 2.0571, 1.8838]],

        [[1.1463, 1.5522, 1.6381, 1.3833],
         [0.5495, 0.7978, 0.7697, 0.7563],
         [0.9601, 1.0818, 1.1293, 1.2079],
         [1.6021, 1.4269, 1.4266, 2.0614]],

        [[0.7948, 0.7197, 0.7924, 0.9401],
         [0.9730, 0.5689, 1.0061, 1.4761],
         [1.2304, 1.4277, 1.5134, 1.5557],
         [1.4274, 1.8596, 1.8209, 1.7938]]])


# NLP basics 

## Tokenizing

In [59]:
import spacy
#Load the small English pipeline (the package must already be installed, e.g. via python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
#Aurora is my hedgehog's name. tehe
text = "Auroras only come out at night, except in the winter."
#Using list comprehension for english language tokenization using spacy
print([str(token) for token in nlp(text.lower())])

['auroras', 'only', 'come', 'out', 'at', 'night', ',', 'except', 'in', 'the', 'winter', '.']


In [60]:
#Tokenizing a tweet using NLTK
from nltk.tokenize import TweetTokenizer
tweet = u"Snow White and the Seven Degrees #MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer() #Understands hashtags, @mentions, and emoticons
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


## N-Grams
Generating n-grams from a text is straightforward enough, as illustrated below, but packages like spaCy and NLTK provide convenient methods.

In [61]:
def n_grams(text, n):
    '''
    Takes a list of tokens and returns a list of n-grams
    '''
    #From index i to i+n (3 in our test case, non-inclusive)
    return [text[i:i+n] for i in range(len(text) - n+1)] #range -n+1 to not overflow indices

cleaned = ['mary', ',', "n't", 'slap', 'green', 'witch', '.']
print(n_grams(cleaned, 3))

[['mary', ',', "n't"], [',', "n't", 'slap'], ["n't", 'slap', 'green'], ['slap', 'green', 'witch'], ['green', 'witch', '.']]
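
As a variation on the slicing version above, here is a sketch that builds the n-grams by zipping `n` staggered views of the token list. It returns tuples instead of lists, which makes the n-grams hashable so they can be counted directly. The function name `n_grams_zip` and the short example sentence are my own; NLTK ships a similar helper, `nltk.util.ngrams`.

```python
from collections import Counter

def n_grams_zip(tokens, n):
    """Return n-grams as tuples by zipping n staggered slices of the token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['mary', ',', "n't", 'slap', 'green', 'witch', '.']
trigrams = n_grams_zip(tokens, 3)
print(trigrams)

# Tuples are hashable, so frequencies can be counted with a Counter
bigram_counts = Counter(n_grams_zip("the cat and the cat".split(), 2))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

`zip` stops at the shortest slice, so the range arithmetic from the slicing version is not needed.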


## Lemmas and Stems
Lemmas are the root (dictionary) forms of words.
Stemming uses handcrafted rules to strip the endings of words and reduce them to *stems*.

spaCy uses WordNet for lemmas and does not provide a stemmer.

NLTK does provide stemmers; the stemming example here follows these notes:
https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/

In [62]:
doc = nlp(u"he was running late")
for token in doc:
    print('{} --> {}'.format(token, token.lemma_))

he --> he
was --> be
running --> run
late --> late


In [63]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = ['compute', 'computer', 'computed', 'computing']
for token in tokens:
    print('{} --> {}'.format(token, stemmer.stem(token)))

compute --> comput
computer --> comput
computed --> comput
computing --> comput
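
To illustrate what "handcrafted rules to strip endings" means, here is a toy suffix stripper. It is not the Porter algorithm, which has many more rules and conditions. The rule list `SUFFIX_RULES` and the minimum stem length are assumptions chosen only so the four example words behave like the output above.

```python
# Hypothetical rule list, checked in order (longest suffixes first)
SUFFIX_RULES = ["ing", "ed", "er", "e"]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for token in ['compute', 'computer', 'computed', 'computing']:
    print('{} --> {}'.format(token, toy_stem(token)))
```

Rules this crude over-stem easily (for example, 'late' becomes 'lat'), which is part of why real stemmers carry so many special cases.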


## Part-of-Speech (POS) Tagging

In [66]:
doc = nlp(u"Mary slapped the green witch.")
for token in doc:
    print('{} - {}'.format(token, token.pos_))

Mary - PROPN
slapped - VERB
the - DET
green - PROPN
witch - NOUN
. - PUNCT


## Chunking and Named Entity Recognition (NER)

Additional reference used here:
https://blog.devgenius.io/named-entity-recognition-ner-nlp-python-6504d5843f98

In [72]:
#Identifying all the noun phrases (noun chunks) in the document
doc = nlp(u"Mary slapped the green witch.")
for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))


Mary - NP
the green witch - NP


In [73]:
doc = nlp(u"Mary Marigold slapped the green witch in Google Studios.")
#Named entities
for entities in doc.ents:
    print('{} - {} - {}'.format(entities, entities.label_, spacy.explain(entities.label_)))

Mary Marigold - PERSON - People, including fictional
Google Studios - ORG - Companies, agencies, institutions, etc.


In [80]:
spacy.displacy.render(doc, style="ent", jupyter=True)

## Parsing

https://spacy.io/usage/linguistic-features

In [79]:
spacy.displacy.render(doc, style="dep", jupyter=True)

# NLP with Neural Networks

I'm not going to go over the basics of neural networks, as I've already written up how they work in my portfolio using just NumPy for maximum component clarity.

See the Linear Regression page for a dive into gradient descent, and the
Neural Networks page for building a multilayer perceptron:
https://github.com/No-Arms/Portfolio

In [None]:
import torch.nn as nn
import torch.optim as optim

class Perceptron(nn.Module):
    def __init__(self, input_dim):
        #input_dim = number of input features
        super(Perceptron, self).__init__()
        self.fc1 = nn.Linear(input_dim,1) #First (and only) fully connected layer
    def forward(self, x_in):
        '''
        in shape = (batch, num_features)
        out shape = (batch,)
        '''
        #squeeze(-1) drops only the trailing size-1 dim; a bare squeeze() would also collapse a batch of 1 to a scalar
        return torch.sigmoid(self.fc1(x_in)).squeeze(-1)
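
Here is a minimal sketch of how the perceptron could be exercised: one forward pass and one optimization step on made-up data. The class is restated so the snippet runs on its own, and it squeezes only the last dimension. The batch size, feature count, labels, loss function, and learning rate are all assumptions picked for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class Perceptron(nn.Module):
    """Single linear layer followed by a sigmoid (restated for a standalone snippet)."""
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 1)

    def forward(self, x_in):
        # (batch, num_features) -> (batch, 1) -> (batch,)
        return torch.sigmoid(self.fc1(x_in)).squeeze(-1)

torch.manual_seed(0)
x_batch = torch.rand(4, 3)                # made-up batch: 4 samples, 3 features
y_true = torch.tensor([0., 1., 1., 0.])   # made-up binary labels

model = Perceptron(input_dim=3)
loss_fn = nn.BCELoss()                    # binary cross-entropy on sigmoid outputs
optimizer = optim.Adam(model.parameters(), lr=0.01)

optimizer.zero_grad()                     # clear any stale gradients
y_pred = model(x_batch)                   # forward pass
loss = loss_fn(y_pred, y_true)
loss.backward()                           # backward pass fills .grad on the parameters
optimizer.step()                          # update the parameters using .grad

print(y_pred.shape)
print(loss.item())
```

A real training loop would repeat the last five steps over many batches and epochs.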

## Sentiment Classification
Classifying the restaurant reviews on Yelp as positive or negative.

Dataset found at : https://www.kaggle.com/datasets/ilhamfp31/yelp-review-dataset

In [113]:
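'''Have to manually label our columns here with names=.
Leave index_col unset so pandas generates its own index; otherwise the "Class" column is used as the index, not as a data column.
Note: header=0 treats the first line of the file as a header row and replaces it with colNames;
if the file has no header row, use header=None instead so the first review is not dropped.'''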
colNames = ['Class','Review']
testD = pd.read_csv('datasets/yelp_review_polarity_csv/test.csv', header=0, names=colNames)
print(testD.head())
print(testD.columns)

   Class                                             Review
0      1  Last summer I had an appointment to get new ti...
1      2  Friendly staff, same starbucks fair you get an...
2      1  The food is good. Unfortunately the service is...
3      2  Even when we didn't have a car Filene's Baseme...
4      2  Picture Billy Joel's \"Piano Man\" DOUBLED mix...
Index(['Class', 'Review'], dtype='object')
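
Here is a small sketch of loading a header-less two-column file and turning the numeric class into a readable label. The three rows are made up and only mimic the layout of the real file. The mapping of class 1 to negative and class 2 to positive is an assumption about this dataset that should be checked against its documentation. The `Label` column name is my own.

```python
import io
import pandas as pd

# Made-up rows in the same header-less (class, review) layout
sample_csv = io.StringIO(
    '2,"Great coffee and friendly staff."\n'
    '1,"Waited an hour and the food was cold."\n'
    '2,"Would happily come back."\n'
)

# header=None keeps the first line as data; names= supplies the column labels
df = pd.read_csv(sample_csv, header=None, names=["Class", "Review"])

# Assumed mapping: 1 = negative, 2 = positive
df["Label"] = df["Class"].map({1: "negative", 2: "positive"})
print(df)
```

With `header=0` the first of the three rows would have been consumed as a header, leaving only two reviews.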


# References
* Rao, D., & McMahan, B. (2019). *Natural language processing with PyTorch: Build intelligent language applications using deep learning.* O'Reilly.