<a href="https://colab.research.google.com/github/IndraniMandal/CSC310-S20/blob/master/18a_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it 
!test ! -e ds-assets && git clone https://github.com/IndraniMandal/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/" 
import sys
sys.path.append(home)      # add home folder to module search path

Already up to date.


# Natural Language Processing (NLP)

Some of the most important data in our society is represented as unstructured text:

* Medical records
* Court cases
* Insurance documents

Other data perhaps not as fundamental but that provides interesting insights into trends and mindsets:

* Twitter and other online blogs
* News feeds


In all of these cases we want to extract meaning from the unstructured text:

* Perhaps we want to do classification (medical records - high risk/low risk)
* Perhaps we want to do a topic analysis of the twitter feeds
* Perhaps we would like to construct a recommendation engine for news feeds

Regardless of the task, we need to convert the unstructured text into something that we, and perhaps most importantly our models, can work with.

☞ The **Vector Model** of text (sometimes called the **Bag-of-Words model**)


## The Vector Model

The vector model converts a document with unstructured text into a **point in an n-dimensional coordinate system** where the coordinate system is defined by the words contained in the text.

Consider: the quick brown fox jumps over the lazy dog

This text can be represented as the tuple rearranged in alphabetical order,
```
(brown,dog,fox,jumps,lazy,over,quick,the)
```

Let’s consider the fact that we have multiple documents and represent them as tuples,

* Doc 1: the quick brown fox jumps over the lazy dog &rarr; `(brown,dog,fox,jumps,lazy,over,quick,the)`
* Doc 2: rudi is a lazy brown dog &rarr; `(a,brown,dog,is,lazy,rudi)`

In order to compare the two documents we create a tuple of the **union** of the words appearing in the 
two sentence tuples,
```
(a,brown,dog,fox,is,jumps,lazy,over,quick,rudi,the)
```
and represent each document as bit vectors with the same length as the tuple above and with 1's and 0's
indicating if the document contains a word at a particular tuple position or not,

* Doc 1: (0,1,1,1,0,1,1,1,1,0,1)
* Doc 2: (1,1,1,0,1,0,1,0,0,1,0)

Notice that our word tuple now has become our coordinate system, in this case with 11 dimensions, and each document is now a point in this 11-dimensional space.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Rectangular_coordinates.svg/1280px-Rectangular_coordinates.svg.png" width="350" height="300">
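The construction above can be sketched in a few lines of plain Python. This is only an illustration: the names `docs`, `vocab`, and `vectors` are ours, and splitting on whitespace is a simplifying assumption that happens to work for these two sentences.

```python
# two toy documents from the text above
docs = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "rudi is a lazy brown dog",
}

# the coordinate system: the sorted union of all words in all documents
vocab = sorted({word for text in docs.values() for word in text.split()})

# one bit per coordinate: 1 if the document contains the word, 0 otherwise
vectors = {
    name: tuple(int(word in text.split()) for word in vocab)
    for name, text in docs.items()
}

print(vocab)
for name, vector in vectors.items():
    print(name, vector)
```

Each document ends up as a tuple with one entry per word in `vocab`, which is exactly the point-in-space view described above.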

The nice thing about this vector model representation is that we can do mathematics on the documents!

Consider adding another document to our collection

* Doc 3: princess jumps over the lazy dog &rarr; `(dog,jumps,lazy,over,princess,the)`

Here we have the new word `princess`, so we need to extend our coordinate system to 12 dimensions by adding `princess`,
```
(a,brown,dog,fox,is,jumps,lazy,over,princess,quick,rudi,the)
```
Our three documents become vectors/points in this coordinate system,

* Doc 1: the quick brown fox jumps over the lazy dog &rarr; `(0,1,1,1,0,1,1,1,0,1,0,1)`
* Doc 2: rudi is a lazy brown dog &rarr; `(1,1,1,0,1,0,1,0,0,0,1,0)`
* Doc 3: princess jumps over the lazy dog &rarr; `(0,0,1,0,0,1,1,1,1,0,0,1)`


Given our vector model of the three docs we can ask questions like this, 

> Is doc2 or doc3 more similar to doc1?

Since all three documents are points in our coordinate system, we can use Euclidean distances in that coordinate system to answer the question. More specifically, we compare the Euclidean distances doc1 &harr; doc2 and doc1 &harr; doc3.

The Euclidean distance d in n-dimensional space between two points $p$ and $q$ is defined as:

$d(p,q) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\ldots+(p_n-q_n)^2}$

In our case the points $p$ and $q$ are document vectors and $p_i$ and $q_i$ are the components of the respective vectors.

In order to answer our question we have to perform the following computations,

* $d(doc1, doc2) = \sqrt{(0-1)^2+(1-1)^2+(1-1)^2+(1-0)^2+(0-1)^2+(1-0)^2+(1-1)^2+(1-0)^2+(0-0)^2+(1-0)^2+(0-1)^2+(1-0)^2} = \sqrt{1+0+0+1+1+1+0+1+0+1+1+1} = \sqrt{8} \approx 2.83$

* $d(doc1,doc3) = \sqrt{(0-0)^2+(1-0)^2+(1-1)^2+(1-0)^2+(0-0)^2+(1-1)^2+(1-1)^2+(1-1)^2+(0-1)^2+(1-0)^2+(0-0)^2+(1-1)^2} = \sqrt{0+1+0+1+0+0+0+0+1+1+0+0} = \sqrt{4} = 2.0$

> So, doc3 is more similar to doc1 than doc2!
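The hand computation can be double-checked with a short plain-Python sketch. The helper name `euclidean` is ours; the three vectors are the 12-dimensional bit vectors listed above.

```python
import math

# the three document vectors in the 12-dimensional coordinate system
doc1 = (0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1)
doc2 = (1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0)
doc3 = (0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1)

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

print("d(doc1, doc2) =", euclidean(doc1, doc2))
print("d(doc1, doc3) =", euclidean(doc1, doc3))
```

Because the vectors contain only 0's and 1's, each squared difference is 1 exactly where the two documents disagree about a word, so the distance is the square root of the number of disagreements.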


## The Vector Model in Sklearn

Let's try the above in sklearn.

In [4]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# set up our documents
doc_names = ["doc1", "doc2", "doc3"]
docs = ["the quick brown fox jumps over the lazy dog",
        "rudi is a lazy brown dog",
        "princess jumps over the lazy dog"]

# process documents
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(docs).toarray()

# print out the coordinate system
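# NOTE: sklearn's default tokenizer filters out single-character words -- it drops 'a'
print("Coordinates:")
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
coords = vectorizer.get_feature_names_out().tolist()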
print(coords)

# print out how each document is represented in this coordinate system
# NOTE: traditionally this mapping is called the 'docterm' matrix - the mapping
#       of each document into the set of terms/words.
print("\nDocterm:")
docterm = pandas.DataFrame(data=docarray,index=doc_names,columns=coords)
print(docterm)

# print pairwise distances between documents
distances = euclidean_distances(docterm)
distances_df = pandas.DataFrame(data=distances, index=doc_names, columns=doc_names)
print("\nPairwise Distances:")
print(distances_df)

Coordinates:
['brown', 'dog', 'fox', 'is', 'jumps', 'lazy', 'over', 'princess', 'quick', 'rudi', 'the']

Docterm:
      brown  dog  fox  is  jumps  lazy  over  princess  quick  rudi  the
doc1      1    1    1   0      1     1     1         0      1     0    1
doc2      1    1    0   1      0     1     0         0      0     1    0
doc3      0    1    0   0      1     1     1         1      0     0    1

Pairwise Distances:
          doc1      doc2      doc3
doc1  0.000000  2.645751  2.000000
doc2  2.645751  0.000000  2.645751
doc3  2.000000  2.645751  0.000000




> Just as we computed by hand - doc3 is more similar to doc1 than doc2.
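One detail is worth a closer look: the doc1 &harr; doc2 distance in the table is about 2.65, while the hand computation gave about 2.8. The ranking is the same, but sklearn works in 11 dimensions because its default token pattern, `(?u)\b\w\w+\b`, only matches tokens with two or more characters and therefore drops 'a'. The sketch below applies that pattern with Python's `re` module; the alternative pattern `(?u)\b\w+\b` is shown only as one possible way to keep one-letter words.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer's default: 2+ characters
KEEP_ALL_PATTERN = r"(?u)\b\w+\b"   # alternative: keeps one-letter tokens too

text = "rudi is a lazy brown dog"

default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(KEEP_ALL_PATTERN, text)

print(default_tokens)  # 'a' is missing
print(all_tokens)      # 'a' is kept
```

Passing the second pattern as `token_pattern` to `CountVectorizer` should bring 'a' back as a coordinate, at the cost of also admitting other one-letter tokens.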

# Real World Data: News Articles

The data set we will be using consists of articles from a newsgroup feed (think chat room before chat rooms existed).

We will look at two newsgroups: 
* Politics
* Space


In [5]:
import pandas as pd

# get the newsgroup data
newsgroups = pd.read_csv(home+"newsgroups.csv")
newsgroups.head(n=10)

                                                text     label
0  From: demon@desire.wright.edu (Not a Boomer)\n...     space
1  From: dreitman@oregon.uoregon.edu (Daniel R. R...     space
2  From: mcgoy@unicorn.acs.ttu.edu (David McGaugh...     space
3  From: blh@uiboise.idbsu.edu (Broward L. Horne)...     space
4  From: wiggins@cecer.army.mil (Don Wiggins)\nSu...     space
5  From: nickh@CS.CMU.EDU (Nick Haines)\nSubject:...  politics
6  From: mike@gordian.com (Michael A. Thomas)\nSu...     space
7  From: jbreed@doink.b23b.ingr.com (James B. Ree...  politics
8  From: baalke@kelvin.jpl.nasa.gov (Ron Baalke)\...  politics
9  From: DPierce@world.std.com (Richard D Pierce)...  politics


In [6]:
newsgroups['text'].iloc[0]



In [7]:
newsgroups.shape

(1058, 2)

In [8]:
newsgroups['label'].value_counts()

politics    593
space       465
Name: label, dtype: int64

## The Docterm Matrix


In [9]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# get the newsgroup data
newsgroups = pd.read_csv(home+"newsgroups.csv")

# process documents                                                                                               
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(newsgroups['text']).toarray()
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10].tolist()))

docarray shape: (1058, 23537)
first 10 coords: ['00', '000', '0000', '00000', '000000', '000007', '000021', '000062david42', '00041032', '0004136']




Looking at the shape of `docarray`, we see that we have more than 23,000 different features. That means our newsgroup articles "live" in a space with more than 23,000 dimensions. When we look at the features, it is clear that many of them are "nonsense" features. We need more filtering!

## More Filtering

From this it is clear that we want to do some additional filtering:
* Minimum doc frequency = 2 -- that is, any word has to appear in at least two documents of the collection
* Delete anything that is not a word - get rid of things like ‘000’ etc.; we use the token pattern arg for that.
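To make the two filters concrete, here is a small plain-Python sketch of what they do conceptually. It is not sklearn's implementation: the three documents are made up for illustration, and the names `tokenize`, `doc_freq`, and `vocab` are ours. Like `CountVectorizer`, the sketch lowercases the text before tokenizing.

```python
import re
from collections import Counter

# hypothetical mini-collection, made up for this illustration
docs = [
    "The shuttle launch is at 0900",
    "Launch delayed, shuttle 000 grounded",
    "Senate vote today",
]

def tokenize(text):
    """Keep only runs of letters, so tokens like '0900' and '000' disappear."""
    return re.findall(r"[a-zA-Z]+", text.lower())

# document frequency: in how many documents does each token occur?
doc_freq = Counter()
for text in docs:
    doc_freq.update(set(tokenize(text)))  # set() counts a token once per document

# min_df = 2: keep tokens that occur in at least two documents
vocab = sorted(token for token, count in doc_freq.items() if count >= 2)
print(vocab)
```

Note that the frequency being thresholded is the number of documents containing the word, not the total number of occurrences.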


In [10]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# get the newsgroup data
newsgroups = pd.read_csv(home+"newsgroups.csv")

# process documents                                                                                               
vectorizer = CountVectorizer(analyzer = "word", 
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True, 
                             min_df=2) # each word has to appear in at least two documents
docarray = vectorizer.fit_transform(newsgroups['text']).toarray()
                                                                                                 
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10].tolist()))

docarray shape: (1058, 11862)
first 10 coords: ['a', 'aa', 'aammmaaaazzzzzziinnnnggggg', 'aaron', 'aas', 'ab', 'abandon', 'abandoned', 'abandonment', 'abbey']




Notice that we cut the number of features to about half of the original (from 23,537 to 11,862) and that the features look more like words.

## Stop Words

Stop word filtering is a way to reduce the dimensionality of the feature space by removing words that do not add to the content/concept of a document. Words like 'its', 'an', 'the', 'for', 'that', etc. are so common in every document that they do not add any value during an analysis. We will filter them out.
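Conceptually, stop word filtering is just a set lookup. The sketch below uses a tiny hand-written stop list containing only the words mentioned above; it is not sklearn's built-in English list, and the sample sentence is made up.

```python
# tiny illustrative stop list (an assumption, far smaller than a real one)
STOP_WORDS = {"its", "an", "the", "for", "that"}

tokens = "the probe sent an image for the team".split()

# keep only the tokens that are not stop words
content_tokens = [token for token in tokens if token not in STOP_WORDS]
print(content_tokens)
```

Using a set for the stop list keeps each membership test cheap, which matters once the collection has thousands of documents.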

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# get the newsgroup data
newsgroups = pd.read_csv(home+"newsgroups.csv")

# process documents                                                                                               
vectorizer = CountVectorizer(analyzer = "word", 
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True, 
                             stop_words = 'english',
                             min_df=2) # each word has to appear in at least two documents
docarray = vectorizer.fit_transform(newsgroups['text']).toarray()
                                                                                                 
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10].tolist()))

docarray shape: (1058, 11563)
first 10 coords: ['aa', 'aammmaaaazzzzzziinnnnggggg', 'aaron', 'aas', 'ab', 'abandon', 'abandoned', 'abandonment', 'abbey', 'abc']




Notice that stop word filtering removed roughly another 300 dimensions (from 11,862 to 11,563).

## Stemming

The first few coordinates are now:

['aa', 'aammmaaaazzzzzziinnnnggggg', 'aaron', 'aas', 'ab', **'abandon'**, **'abandoned'**, **'abandonment'**, 'abbey', 'abc']

Here we see one more issue: three different forms of the same root word, in this case **abandon**.

> Solution: Stemming!


In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root form.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem". 

A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish".

The most popular stemming algorithm:

> The [Porter Stemmer](https://en.wikipedia.org/wiki/Stemming)
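To get a feel for the idea before using a real stemmer, here is a deliberately naive suffix stripper. It is a toy and not the Porter algorithm: the suffix list, the minimum stem length of 3, and the name `toy_stem` are all assumptions made for this illustration.

```python
# a few common English suffixes, longest first (illustrative, not exhaustive)
SUFFIXES = ("ment", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["abandon", "abandoned", "abandonment", "fishing", "fished", "cats"]:
    print(word, "->", toy_stem(word))
```

A toy like this breaks quickly, for example it leaves "fisher" untouched and would turn "stemming" into "stemm". The Porter Stemmer applies a much more careful sequence of rewrite rules, which is why we use it below.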


In [12]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

# get the newsgroup data
newsgroups = pd.read_csv(home+"newsgroups.csv")

# add doc names so that later analysis becomes more readable
doc_names = ['doc{}'.format(i) for i in range(newsgroups.shape[0])]
newsgroups = pd.DataFrame(newsgroups.values, index=doc_names,columns=newsgroups.columns)
print(newsgroups.head(n=10))

# build the stemmer object
stemmer = PorterStemmer()

# get the default text analyzer from CountVectorizer
analyzer = CountVectorizer(analyzer = "word", 
                           stop_words = 'english',
                           token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_words,
                                 binary=True,
                                 min_df=2)
docarray = vectorizer.fit_transform(newsgroups['text']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10].tolist()))

                                                   text     label
doc0  From: demon@desire.wright.edu (Not a Boomer)\n...     space
doc1  From: dreitman@oregon.uoregon.edu (Daniel R. R...     space
doc2  From: mcgoy@unicorn.acs.ttu.edu (David McGaugh...     space
doc3  From: blh@uiboise.idbsu.edu (Broward L. Horne)...     space
doc4  From: wiggins@cecer.army.mil (Don Wiggins)\nSu...     space
doc5  From: nickh@CS.CMU.EDU (Nick Haines)\nSubject:...  politics
doc6  From: mike@gordian.com (Michael A. Thomas)\nSu...     space
doc7  From: jbreed@doink.b23b.ingr.com (James B. Ree...  politics
doc8  From: baalke@kelvin.jpl.nasa.gov (Ron Baalke)\...  politics
doc9  From: DPierce@world.std.com (Richard D Pierce)...  politics
docarray shape: (1058, 8437)
first 10 coords: ['aa', 'aammmaaaazzzzzziinnnnggggg', 'aaron', 'ab', 'abandon', 'abbey', 'abc', 'abdkw', 'abett', 'abid']




Notice that 'abandon', 'abandoned', and 'abandonment' have all been mapped to the stem 'abandon'. Also notice that the final dimensionality of our feature space is now 8,437 features compared to the original 23,537. That is a drop of roughly 64% in the number of features, which also means that any analysis on this document collection has far fewer dimensions to process.

# Doc Similarity in High-Dimensional Spaces

In [13]:
distances = euclidean_distances(docarray)
doc_names = ['doc{}'.format(i) for i in range(docarray.shape[0])]
distances_df = pandas.DataFrame(data=distances,index=doc_names,columns=doc_names)
distances_df

Unnamed: 0,doc0,doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8,doc9,...,doc1048,doc1049,doc1050,doc1051,doc1052,doc1053,doc1054,doc1055,doc1056,doc1057
doc0,0.000000,12.041595,12.727922,13.711309,11.489125,14.142136,15.198684,11.532563,12.449900,11.489125,...,14.662878,11.313708,13.601471,26.570661,12.529964,11.832160,15.842980,12.649111,11.000000,13.711309
doc1,12.041595,0.000000,12.449900,13.747727,11.269428,13.747727,14.491377,10.677078,11.135529,10.816654,...,13.711309,10.344080,12.649111,26.589472,12.000000,11.445523,15.099669,11.789826,10.862780,12.767145
doc2,12.727922,12.449900,0.000000,13.928388,11.575837,14.352700,14.525839,11.532563,12.688578,11.661904,...,14.456832,11.401754,13.379088,26.570661,12.609520,12.000000,15.842980,12.649111,11.618950,13.564660
doc3,13.711309,13.747727,13.928388,0.000000,12.961481,15.099669,15.652476,12.767145,13.601471,13.038405,...,15.588457,12.727922,14.594520,26.944387,13.674794,12.727922,16.462078,13.856406,12.845233,14.764823
doc4,11.489125,11.269428,11.575837,12.961481,0.000000,13.490738,14.106736,10.148892,11.445523,10.488088,...,13.601471,10.099505,12.609520,26.267851,11.357817,10.862780,15.000000,11.401754,10.344080,12.489996
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
doc1053,11.832160,11.445523,12.000000,12.727922,10.862780,13.416408,14.317821,10.246951,11.445523,10.583005,...,13.674794,10.392305,12.529964,26.457513,11.357817,0.000000,14.933185,11.575837,10.630146,12.884099
doc1054,15.842980,15.099669,15.842980,16.462078,15.000000,16.703293,17.435596,14.491377,15.362291,14.730920,...,17.204651,14.456832,16.186414,27.658633,15.427249,14.933185,0.000000,15.716234,14.899664,16.155494
doc1055,12.649111,11.789826,12.649111,13.856406,11.401754,13.784049,14.594520,10.816654,11.874342,11.045361,...,9.848858,10.392305,13.076697,26.683328,12.124356,11.575837,15.716234,0.000000,11.445523,13.114877
doc1056,11.000000,10.862780,11.618950,12.845233,10.344080,13.379088,14.212670,10.000000,11.313708,10.246951,...,13.784049,9.643651,12.083046,26.210685,11.045361,10.630146,14.899664,11.445523,0.000000,12.288206


## Find out which stories are most similar

In [14]:
import sys

# replace every 0.0 (the major diagonal, i.e. each doc compared with itself) with FLOAT_MAX
# so that it is never picked as the minimum
newdist_df = distances_df.apply(lambda c: c.apply(lambda x: sys.float_info.max if x == 0.0 else x))

In [15]:
# find the column with the minimal value
cix = newdist_df.min().idxmin()
print(cix)

doc127


In [16]:
# find the row with the minimal value
rix = newdist_df.loc[:,cix].idxmin()
print(rix)

doc496


In [17]:
# these two news stories are most similar
newdist_df.loc[rix, cix]

2.0
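The same search can be sketched without pandas on the small three-document distance table from earlier. This is an alternative formulation, not the approach used in the cells above: instead of masking the diagonal, it only visits pairs with `i < j`. The variable names are ours.

```python
from itertools import combinations

doc_names = ["doc1", "doc2", "doc3"]

# pairwise distances copied from the three-document example
distances = [
    [0.000000, 2.645751, 2.000000],
    [2.645751, 0.000000, 2.645751],
    [2.000000, 2.645751, 0.000000],
]

# visit each unordered pair (i < j) once; this skips the diagonal entirely
i, j = min(
    combinations(range(len(doc_names)), 2),
    key=lambda pair: distances[pair[0]][pair[1]],
)

print(doc_names[i], doc_names[j], distances[i][j])
```

One difference in behavior: two distinct documents with identical vectors have distance 0.0. This sketch would report them as the closest pair, whereas mapping every 0.0 to FLOAT_MAX hides them.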

In [18]:
print(newsgroups['label'].loc[rix])
print(newsgroups['label'].loc[cix])


politics
politics


In [19]:
print(newsgroups['text'].loc[rix])

From: nsmca@aurora.alaska.edu
Subject: 30826
Article-I.D.: aurora.1993Apr25.151108.1
Organization: University of Alaska Fairbanks
Lines: 14
Nntp-Posting-Host: acad3.alaska.edu

I like option C of the new space station design.. 
It needs some work, but it is simple and elegant..

Its about time someone got into simple construction versus overly complex...

Basically just strap some rockets and a nose cone on the habitat and go for
it..

Might be an idea for a Moon/Mars base to.. 

Where is Captain Eugenia(sp) when you need it (reference to russian heavy
lifter, I think).
==
Michael Adams, nsmca@acad3.alaska.edu -- I'm not high, just jacked



In [20]:
print(newsgroups['text'].loc[cix])

From: nsmca@aurora.alaska.edu
Subject: Space Station Redesign (30826) Option C
Article-I.D.: aurora.1993Apr25.214653.1
Organization: University of Alaska Fairbanks
Lines: 22
Nntp-Posting-Host: acad3.alaska.edu

In article <1993Apr25.151108.1@aurora.alaska.edu>, nsmca@aurora.alaska.edu writes:
> I like option C of the new space station design.. 
> It needs some work, but it is simple and elegant..
> 
> Its about time someone got into simple construction versus overly complex...
> 
> Basically just strap some rockets and a nose cone on the habitat and go for
> it..
> 
> Might be an idea for a Moon/Mars base to.. 
> 
> Where is Captain Eugenia(sp) when you need it (reference to russian heavy
> lifter, I think).
> ==
> Michael Adams, nsmca@acad3.alaska.edu -- I'm not high, just jacked
> 
> 
> 
> 


This is a report, I got the subject messed up..



> It is a reposting of the message!