# Notebook available at 
# http://tiny.cc/9agfiz

## Natural Language Processing



### What can NLP do? 

Would you like to learn how to make a computer understand English? Are you starting out in Natural Language Processing and need a team to work with? The Drexel Data Science Club’s (DSC’s) Text Processing Working Group (TPWG) features interactive demos, projects, tutorials and discussions about text processing. We will start from the basics (TF-IDF, cosine similarity scores, regular expressions, topic modeling, stemming, tokenization and lemmatization) and then explore more advanced topics.

Throughout the term, the TPWG will also take on a text processing project. Together with ad hoc topics brought by the working group’s members, it will be approached as an open-ended skill-building exercise.

Project Description:

We will scrape ratemyprofessors.com for all the reviews that students have posted for professors and attempt to summarize the ratings for each professor. We will, e.g., use TF-IDF scores to find the most informative words, create word vectors and learn to generate insights from our text data.

Topics covered: an introduction to Natural Language Processing, basic regex, extracting patterns from text, and an introduction to TF-IDF scores.

# Overview and Expectations

![Text%20Processing%20Group%20Eventbrite.png](attachment:Text%20Processing%20Group%20Eventbrite.png)

This is a student-led working group: we get together every week to learn something new about text processing, and we close the term with a project where we apply what we have learned.

This is a very hands-on group, so feel free to discuss and help each other as we progress through the weeks.

## Question

Take this article (https://local.theonion.com/luddite-in-2070-refuses-to-merge-consciousness-with-sel-1840480408). If you wanted to represent this article so that the AI overlords could read it, how would you go about doing that?

Through the next few weeks we will find out exactly how to do this!! 

# Natural Language Processing is Ubiquitous 

Technologies like spell-check, autocorrect, machine translation and text summarization are all around us, applied in ways so subtle that we often don't even notice them, but there is a lot more to be done in this amazing field.

![NLP%20Applications.png](attachment:NLP%20Applications.png)

![MachineTranslation.jpg](attachment:MachineTranslation.jpg)

For someone who is just breaking into machine learning, NLP can be especially appealing because of the abundance of text data all over the internet; unstructured text is one of the most common kinds of data found on the web. Moreover, as OCR gets better every day, we are gaining access to historical knowledge, most of which is in text form.

One of the most interesting kinds of data for NLP is user-generated text, i.e. reviews and social media content. Some interesting applications for this kind of data are:

Tweet analysis, hate speech detection
![Blog-Twitter2.png](attachment:Blog-Twitter2.png)



# Semi-Structured Text data

There is a lot of other data out there as well that is quite fun to play with and has some structure built in already. One example of this is transcripts. Let's take this as an example (https://rickandmorty.fandom.com/wiki/Lawnmower_Dog/Transcript). This is the transcript for season 1, episode 2 of the show Rick and Morty. Let's say we wanted to extract all the dialogue from the transcript. It is pretty well structured, in the sense that it follows the same pattern throughout, and we can leverage that pattern using **regular expressions**.

![rickNMorty.jpg](attachment:rickNMorty.jpg)

Rick and Morty regex: https://regex101.com/r/WeQ7lA/1
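As a minimal sketch of that idea, assume each spoken line in a transcript has the form `Speaker: dialogue`. The sample text and the pattern below are illustrative assumptions; they are neither the actual page content nor the pattern saved at the regex101 link.

```python
import re

# Hypothetical excerpt in the assumed "Speaker: dialogue" layout.
transcript = """Rick: Morty, hand me the screwdriver.
Morty: Which one, Rick?
[Rick sighs.]
Rick: The one that looks like a screwdriver."""

# With re.MULTILINE, ^ and $ anchor every line instead of the whole string.
# Group 1 captures the speaker name, group 2 captures what they said.
dialogue_pattern = re.compile(r"^(\w+): (.+)$", re.MULTILINE)

dialogues = dialogue_pattern.findall(transcript)
for speaker, line in dialogues:
    print(speaker + " -> " + line)
```

The stage direction in square brackets is skipped because it does not start with a name followed by a colon.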



# Finding patterns wherever we can using Regular Expressions (Regex) 

Regex is built into Python through the `re` module, but if you just want to quickly play around with regex or practice, you can also use https://regex101.com/. 

![regex_example_1.png](attachment:regex_example_1.png)

### Basic Regex


### Matching Tokens

- `.` __(wild card)__ Match any character except a newline (by default). 

- `[...]` __(character class)__ Match any of these characters. 

- `[^...]` __(complementary character class)__ Match any character that is not one of these. Notice the caret at the beginning. https://regex101.com/r/cO8lqs/10

- `[a-z]` __(lowercase range)__ Prebuilt set for lowercase alphabet, match any lowercase character. 

- `[A-Z]` __(uppercase range)__ Prebuilt set for uppercase alphabet, match any uppercase character. https://regex101.com/r/cO8lqs/16697

- `[0-9]` __(numeric range)__ Prebuilt set for numbers, match any digit. 

- `|` __(or)__ Either first or second. https://regex101.com/r/cO8lqs/3

- `\s` __(whitespace character)__ Any whitespace character.

- `\d` __(Any digit)__ Any digit character. https://regex101.com/r/cO8lqs/4

- `\w` __(word)__ Any word character (a letter, a digit or an underscore).

- `\` __(escape character)__ Escape any pattern character. 
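The matching tokens above can be tried out directly with `re.findall`. This is a small sketch on a made-up sentence (the text is an assumption chosen so that each token has something to match):

```python
import re

# Made-up sample text for illustration.
text = "Order 66 shipped to Dr. Who on 2020-01-05!"

wildcard = re.findall(r"D.", text)        # "D" followed by any one character
digits = re.findall(r"\d", text)          # every single digit
capitals = re.findall(r"[A-Z]", text)     # uppercase letters only
non_word = re.findall(r"[^\w\s]", text)   # neither word characters nor whitespace
either = re.findall(r"Dr|Who", text)      # alternation: "Dr" or "Who"
literal_dot = re.findall(r"\.", text)     # escaped dot matches only a real "."

print(wildcard, capitals, non_word, either, literal_dot)
print(len(digits))
```

Note the difference between `.` (matches almost anything) and `\.` (matches only a period).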



### Quantifiers 

- `*` __(zero or more)__ Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. https://regex101.com/r/cO8lqs/16698

- `+` __(one or more)__ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

- `?` __(zero or one)__ Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.

- `{m}` __(exactly m times)__ Specifies that exactly m copies of the previous RE should be matched. https://regex101.com/r/cO8lqs/16700

- `{m,n}` __(m through n times)__ Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. https://regex101.com/r/cO8lqs/16701
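To see how the quantifiers differ, the sketch below runs each one against the same made-up string of `a` followed by a growing number of `b`s:

```python
import re

# Made-up sample: "a" followed by 0, 1, 2, 3 and 4 "b"s.
text = "a ab abb abbb abbbb"

zero_or_more = re.findall(r"ab*", text)      # "a" plus any number of "b"s
one_or_more = re.findall(r"ab+", text)       # needs at least one "b"
zero_or_one = re.findall(r"ab?", text)       # takes at most one "b"
exactly_two = re.findall(r"ab{2}", text)     # exactly two "b"s
two_to_three = re.findall(r"ab{2,3}", text)  # two or three "b"s, as many as possible

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
print(exactly_two)
print(two_to_three)
```

Because the quantifiers are greedy, `ab{2,3}` takes three `b`s whenever three are available.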



### Groupers and Extensions

- `(...)` __(group)__ Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the `\1`, `\2`, etc., special sequences.

- `(?:...)` __(non-capturing group)__ Matches `...` as in the parentheses, but does not capture it in a group. This becomes especially important when applying quantifiers.

- `(?=...)` __(positive lookahead)__ Matches if `...` matches next, but doesn’t consume any of the string. https://regex101.com/r/bPryxV/1

- `(?!...)` __(negative lookahead)__ Matches if `...` doesn’t match next.

- `(?<=...)` __(positive lookbehind)__ Matches if the current position in the string is preceded by a match for `...` that ends at the current position. 

- `(?<!...)` __(negative lookbehind)__ Matches if the current position in the string is not preceded by a match for `...`.
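A short sketch of how groups and lookarounds change what `re.findall` returns. The sample strings are made up for illustration:

```python
import re

# Made-up sample text.
text = "price: $40, discount: $5, weight: 40kg"

# Capturing groups: findall returns one tuple of groups per match.
pairs = re.findall(r"(\w+): \$(\d+)", text)

# With a quantified capturing group, findall returns only the last repetition...
captured = re.findall(r"(ab)+", "ab abab ababab")
# ...while a non-capturing group lets findall return the whole match.
whole = re.findall(r"(?:ab)+", "ab abab ababab")

# Lookahead: digits that are followed by "kg" ("kg" is not part of the match).
weights = re.findall(r"\d+(?=kg)", text)

# Lookbehind: digits that are preceded by "$" ("$" is not part of the match).
dollars = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digits preceded by neither "$" nor another digit.
# Excluding digits too stops the match from starting in the middle of "$40".
not_dollars = re.findall(r"(?<![$\d])\d+", text)

print(pairs, captured, whole, weights, dollars, not_dollars)
```

The negative lookbehind shows a common pitfall: without the extra `\d` in the class, the engine would simply skip the `4` in `$40` and match the `0`.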


### Anchors

- `^` __(start anchor)__ Forces match to be at the start of string.
- `$` __(end anchor)__ Forces match to be at the end of string.

https://regex101.com/r/cO8lqs/16696
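By default the anchors refer to the whole string; with the `re.MULTILINE` flag they apply to every line. A small sketch on a made-up two-line string:

```python
import re

# Made-up two-line sample.
lines = "Rick: Wubba lubba dub dub\nMorty: Oh geez"

first_word = re.findall(r"^\w+", lines)                           # start of the string only
line_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)      # start of every line
last_word = re.findall(r"\w+$", lines)                            # end of the string only
line_ends = re.findall(r"\w+$", lines, flags=re.MULTILINE)        # end of every line

print(first_word, line_starts)
print(last_word, line_ends)
```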

### Combined Example

In [13]:
import re
sample_string = "01-05-2020 is a valid date while 13/32/2020 is not."

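# Month between 01 and 12, day between 01 and 31, year between 1900 and 2099.
# The year alternation sits in a non-capturing group so that \d\d applies to both 19 and 20.
re.findall(r"(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-((?:19|20)\d\d)", sample_string)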

[('01', '05', '2020')]

# Motivation for numerical representation of text

The TF-IDF weight can be thought of as a measure of how unique and important a word is in a given document and context. 


We know that computers don't understand natural (human) language, so we need some processing before the computer can read the words, then understand the context, and finally the meaning. There are many ways to numerically represent natural language, but we will start with one of the most intuitive: a **bag of words**. 

A bag of words is exactly what it sounds like: an unordered collection of the unique words in a text, without any context. Bag of words (**BoW** for short) models rely on the frequency of unique words to represent the given text documents. 

The simplest bag of words model just takes the frequency of words.

Let's take the opening lines from A Tale of Two Cities: 

    It was the best of times,
    it was the worst of times,
    it was the age of wisdom,
    it was the age of foolishness,

Some concept definitions:

- **document** : a piece of text, treated as a collection of words.
- **t** : number of times the term occurs in a document.
- **d** : total number of words in that document.
- **N** : total number of documents in the corpus.
- **C** : number of documents with the term in them.
- **Corpus** : Fancy word for a set of documents.

Each line here is a document and the four lines make up our corpus.  

Now let's implement a bag of words model in Python.

First we need a list of unique words:

In [32]:
corpus = """It was the best of times,
    it was the worst of times,
    it was the age of wisdom,
    it was the age of foolishness,
"""

# Let's learn some metadata about our corpus!
print("Length of our corpus is " + str(len(corpus.split())))

# We need to take out punctuation and make all terms lowercase.
# Splitting on newlines gives one string per document (line).
normalized_corpus = corpus.replace(",", '').lower().split("\n")
print(normalized_corpus)

# For the vocabulary, flatten the whole corpus into one string.
# Newlines become spaces so that words on adjacent lines never merge.
normalized_corpus_string = corpus.replace(",", '').replace("\n", " ").lower()

# Let's finally get our list of unique words.
list_unique_words = list(set(normalized_corpus_string.split()))
print(list_unique_words)
print("Number of unique words is " + str(len(list_unique_words)))




Length of our corpus is 24
['it was the best of times', '    it was the worst of times', '    it was the age of wisdom', '    it was the age of foolishness', '']
['age', 'times', 'worst', 'wisdom', 'it', 'was', 'foolishness', 'of', 'the', 'best']
Number of unique words is 10


A very basic implementation in pure Python to get the binary vector for our bag of words model.

In [33]:
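corpus_list = []
print(list_unique_words)
for document in normalized_corpus:
    document_vector = []
    # Split into tokens so we test whole words, not substrings
    # (a plain substring test would also find "age" inside "page").
    document_tokens = document.split()

    for word in list_unique_words:
        if word in document_tokens:
            document_vector.append(1)
        else:
            document_vector.append(0)

    # The trailing empty string left by split("\n") yields an all-zero row.
    corpus_list.append(document_vector)


corpus_list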

['age', 'times', 'worst', 'wisdom', 'it', 'was', 'foolishness', 'of', 'the', 'best']


[[0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
 [0, 1, 1, 0, 1, 1, 0, 1, 1, 0],
 [1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
 [1, 0, 0, 0, 1, 1, 1, 1, 1, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

This approach has a few obvious flaws that you might have noticed by now. As the vocabulary grows, every vector in our bag of words model grows with it, while the individual document vectors stay very sparse (they contain a large number of zeros). 
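The vectors above only record whether a word is present. A frequency-based variant can be sketched with `collections.Counter`; the helper name `count_vector` and the extra sentence with repeated words are assumptions added for illustration, and the vocabulary is sorted only to give a stable column order.

```python
from collections import Counter

# The four documents from the corpus, already lowercased and without commas.
documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# Sorted vocabulary so the columns come out in the same order on every run.
vocabulary = sorted({word for doc in documents for word in doc.split()})


def count_vector(text, vocabulary):
    """Return how many times each vocabulary word occurs in text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


count_vectors = [count_vector(doc, vocabulary) for doc in documents]

# A made-up sentence with repeated words, to show counts above 1.
repeated = count_vector("the best of the best", vocabulary)

print(vocabulary)
print(count_vectors[0])
print(repeated)
```

In this tiny corpus every word occurs at most once per line, so the count vectors for the four documents look the same as the binary ones; the difference only shows up once a word repeats.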

Now as you can see in our model, we reduced the vectors' dimensions by converting the words to lowercase; this is the simplest approach to normalization. There are many other approaches to cleaning and normalizing text.

## Cleaning and Normalization Techniques

### Tokenization

Tokenization is a fancy word for getting just the words in a document. Words or terms are formally called tokens, so tokenization means getting rid of whitespace and keeping just the words from a document. 

The tokens don't always need to be single words; they can be groups of words as well. For example, **New York** should be one token when tokenized, as it represents a single concept, the city of New York. These groups of words are called **n-grams**, where **n** is the number of words in the phrase; a **bigram** is an n-gram with two words. 
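A minimal sketch of both ideas in plain Python. The helper names `tokenize` and `ngrams`, the token pattern and the sample sentence are assumptions for illustration; real tokenizers (for example those in NLTK or spaCy) handle many more cases.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample sentence.
tokens = tokenize("I moved to New York, and New York is loud.")
bigrams = ngrams(tokens, 2)

print(tokens)
print(bigrams)
```

This produces every adjacent pair of words, not only meaningful ones. Deciding that `("new", "york")` deserves to be a single token takes an extra step, such as counting how often each bigram occurs.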


### Stemming 

Stemming reduces words to their roots, but the reduced word does not have to be part of the language. Different stemming schemes use different methods; for example, the Porter stemming technique strips suffixes. So for example:

    structured -> structur  
    structure -> structur 
    
Please read this article explaining the algorithm (http://people.scs.carleton.ca/~armyunis/projects/KAPI/porter.pdf).



In [65]:
!pip install nltk
!pip install spacy
!python -m spacy download en

Collecting en_core_web_sm==2.2.5 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py): started
  Building wheel for en-core-web-sm (setup.py): finished with status 'done'
  Stored in directory: C:\Users\pxp142\AppData\Local\Temp\pip-ephem-wheel-cache-vpi6qsc6\wheels\6a\47\fb\6b5a0b8906d8e8779246c67d4658fd8a544d4a03a75520197a
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.2.0
    Uninstalling en-core-web-sm-2.2.0:
      Successfully uninstalled en-core-web-sm-2.2.0
Successfully installed en-core-web-sm-2.2.5
[+] Download and installation successful
You can now load the model via spacy.load('e

In [66]:
import nltk

# Select the punkt model and wordnet from the window
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
# Using NLTK

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

# Create one stemmer object of each class
porter = PorterStemmer()
lancaster = LancasterStemmer()


print("Porter Stemmer")
print(porter.stem("go"))
print(porter.stem("going"))
print(porter.stem("goes"))
print(porter.stem("gone"))

print("*"*10)

print("Lancaster Stemmer")
print(lancaster.stem("go"))
print(lancaster.stem("going"))
print(lancaster.stem("goes"))
print(lancaster.stem("gone"))

# You might want to try the different packages available to see which fits
# your requirements best. Let's try to do the same thing with spaCy, another
# NLP package.

Porter Stemmer
go
go
goe
gone
**********
Lancaster Stemmer
go
going
goe
gon




### Lemmatization

Lemmatization reduces words to their dictionary form (the lemma), so unlike a stem, the reduced word is always part of the language. So for example:

    go -> go
    going -> go
    goes -> go
    gone -> go


### Spacy is awesome!

The atomic unit in spaCy is a token, and each token has some very useful attributes. Look here for the complete list (https://spacy.io/api/token).

In [68]:
import spacy
nlp = spacy.load("en")

running_sentence = """I was sitting in the same place I sat before."""
doc = nlp(running_sentence)

for token in doc:
    print(token.text + "\t" + token.lemma_)

I	-PRON-
was	be
sitting	sit
in	in
the	the
same	same
place	place
I	-PRON-
sat	sit
before	before
.	.


In [53]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("Lemmatization using Wordnet")
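# lemmatize() treats the word as a noun unless pos says otherwise,
# so the result depends on the part of speech that is passed in.
print(lemmatizer.lemmatize("go"))
print(lemmatizer.lemmatize("going",  pos="v"))
print(lemmatizer.lemmatize("goes"))
# Treated as a noun, "gone" is left unchanged.
print(lemmatizer.lemmatize("gone",  pos="n"))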



Lemmatization using Wordnet
go
go
go
gone


## Understanding TF-IDF 

TF-IDF stands for Term Frequency-Inverse Document Frequency and is a way to score how important a word is to a document within a given corpus (set of documents). TF-IDF is used extensively in information retrieval, and thus by many search engines, to rank documents by their relevance to a given query. 

Some concept definitions: 

- **document** : a piece of text, treated as a collection of words.
- **t** : number of times the term occurs in a document.
- **d** : total number of words in that document.
- **N** : total number of documents in the corpus.
- **C** : number of documents with the term in them.
- **Corpus** : Fancy word for a set of documents.

The score increases as the frequency of the word increases in a document and decreases as the frequency increases in the corpus. This means that if a word is very frequent in all different contexts it is not that important. For example, stopwords ("a", "the", "of" etc.) are very common and are not related to the document's subject per se, while a word like "tf-idf" is very uncommon and would be frequent only in a document about tf-idf.

#### Definitions: 

##### Term Frequency / Normalized Term Frequency

Term Frequency (TF) is just the number of times a term occurs in a document divided by the total number of words in that document. Division by the total number of words normalizes the measure.

$$TF = t / d $$

##### Inverse Document Frequency

Inverse Document Frequency (IDF) is a measure of importance. It is the logarithm of the total number of documents in the corpus divided by the number of documents with the given term in them. It weights down the frequent terms and weights up the rare terms. 

$$IDF = \log( N / C) $$

The TF-IDF score of a term in a document is the product of the two, $TF \times IDF$.

##### Example:

Assume our corpus is all the BuzzFeed articles published in 2019, so we might have 10,000 total articles (documents) in our corpus. Now consider one of those articles that has 100 words and contains the word **kombucha** 10 times, and suppose there are 200 articles that mention **kombucha**. 

Now according to our formula:

$$TermFrequency = 10/100 = 0.1$$

and 

$$InverseDocumentFrequency = \log_{10}(10{,}000/200) \approx 1.70$$

therefore

$$TFidf = 1.70 * 0.1 = 0.17$$

Now if we calculate these scores for all unique words in all the articles, we can understand what words are the most important in any given article.
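The kombucha arithmetic can be checked in a few lines of Python. This is a sketch under one assumption: the value 1.70 in the example implies a base-10 logarithm, so `math.log10` is used. The function names are made up for illustration.

```python
import math


def term_frequency(term_count, doc_length):
    """Occurrences of the term divided by the number of words in the document."""
    return term_count / doc_length


def inverse_document_frequency(n_docs, n_docs_with_term):
    """Base-10 log of (documents in the corpus / documents containing the term)."""
    return math.log10(n_docs / n_docs_with_term)


# Numbers from the kombucha example.
tf = term_frequency(10, 100)
idf = inverse_document_frequency(10_000, 200)
tf_idf = tf * idf

print(round(tf, 2), round(idf, 2), round(tf_idf, 2))
```

Library implementations often use a different variant of the formula. scikit-learn's `TfidfTransformer`, for example, uses a natural logarithm with smoothing and normalizes each document vector, so its scores will not match this hand calculation.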

In [75]:
import pandas as pd
 
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
 
# this is a very toy example, do not try this at home unless you want to understand the usage differences
docs=["the house had a tiny little mouse",
      "the cat saw the mouse",
      "the mouse ran away from the house",
      "the cat finally ate the mouse",
      "the end of the mouse story"
     ]


#instantiate CountVectorizer()
cv=CountVectorizer()
 
# this step generates word counts for the words in your docs
word_count_vector=cv.fit_transform(docs)
 
word_count_vector.toarray()

array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 2, 0],
       [0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 2, 0],
       [1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 2, 0]], dtype=int64)

In [79]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit_transform(word_count_vector).toarray()


array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.49356209, 0.39820278, 0.49356209, 0.23518498,
        0.        , 0.        , 0.        , 0.        , 0.23518498,
        0.49356209],
       [0.        , 0.        , 0.48334378, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.28547062,
        0.        , 0.        , 0.59909216, 0.        , 0.57094124,
        0.        ],
       [0.        , 0.45709287, 0.        , 0.        , 0.        ,
        0.45709287, 0.        , 0.36877965, 0.        , 0.2178072 ,
        0.        , 0.45709287, 0.        , 0.        , 0.43561441,
        0.        ],
       [0.51392301, 0.        , 0.41462985, 0.        , 0.51392301,
        0.        , 0.        , 0.        , 0.        , 0.24488707,
        0.        , 0.        , 0.        , 0.        , 0.48977413,
        0.        ],
       [0.        , 0.        , 0.        , 0.49175319, 0.        ,
        0.        , 0.        , 