## Setup

This guide was written in Python 3.6.

### Libraries

We'll be working with nltk and spaCy for natural language processing, pandas for displaying results, and scikit-learn for feature extraction, along with Python's built-in re module for regular expressions. To install the third-party libraries, enter the following commands into your terminal:

``` 
pip3 install nltk
pip3 install spacy
pip3 install pandas
pip3 install scikit-learn
```

### Other

spaCy's sentence boundary detection relies on the dependency parse, which requires model data to be installed, so enter the following command in your terminal.

```
python3 -m spacy download en
```

## Background

### Polarity Flippers

Polarity flippers are words that change positive expressions into negative ones or vice versa. 

#### Negation 

Negations directly change an expression's sentiment by preceding the word they modify. An example would be

```
The cat is not nice.
```
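
As a minimal sketch of how a negation can flip polarity, the snippet below scores a sentence with a tiny hand-made lexicon. `LEXICON`, `NEGATIONS`, and `sentence_polarity` are hypothetical names invented for this illustration; they are not part of nltk or spaCy.

```python
# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"nice": 1, "good": 1, "evil": -1, "bad": -1}
# Hypothetical list of negation words.
NEGATIONS = {"not", "never", "no"}

def sentence_polarity(sentence):
    """Sum word polarities, flipping a word's sign if it directly follows a negation."""
    tokens = sentence.lower().strip(".!?").split()
    score = 0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity  # the negation flips the sentiment
        score += polarity
    return score

print(sentence_polarity("The cat is nice."))      # 1
print(sentence_polarity("The cat is not nice."))  # -1
```

This only looks one word back, so it would miss a negation that is separated from the word it modifies.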

#### Contrastive Discourse Connectives

Contrastive discourse connectives are words, such as "but", that indirectly change an expression's meaning by setting one clause against another. An example would be

``` 
I usually like cats, but this cat is evil.
```

### Multiword Expressions

Multiword expressions are important because, depending on the context, they can be considered positive or negative. For example,

``` 
This song is shit.
```
is definitely considered negative, whereas

``` 
This song is the shit.
```
is actually considered positive, simply because of the addition of 'the' before the word 'shit'.
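
One way to handle this, sketched below under the assumption of a small hand-made lexicon, is to look up the longest matching phrase first and only fall back to single words afterwards. `PHRASE_LEXICON` and `phrase_polarity` are hypothetical names used for illustration only.

```python
# Hypothetical toy lexicon; entries are tuples of tokens so phrases can be stored.
PHRASE_LEXICON = {
    ("the", "shit"): 1,   # idiomatic and positive
    ("shit",): -1,        # single word and negative
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASE_LEXICON)

def phrase_polarity(sentence):
    """Score a sentence, preferring the longest lexicon entry at each position."""
    tokens = sentence.lower().strip(".!?").split()
    score, i = 0, 0
    while i < len(tokens):
        for n in range(MAX_PHRASE_LEN, 0, -1):  # try longer phrases first
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in PHRASE_LEXICON:
                score += PHRASE_LEXICON[candidate]
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no lexicon entry starts here
    return score

print(phrase_polarity("This song is shit."))      # -1
print(phrase_polarity("This song is the shit."))  # 1
```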

### WordNet

WordNet is an English lexical database with an emphasis on synonymy, sort of like a thesaurus. Specifically, nouns, verbs, adjectives, and adverbs are grouped into synonym sets.

#### Synsets

nltk has a built-in WordNet that we can use to find synonyms. We import it as such:

In [54]:
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

If we feed a word to the `synsets()` method, the return value will be the list of synsets to which the word belongs. For example, if we call the method on "good",


In [55]:
print(wn.synsets('good'))

[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]


If we want to take it a step further, we can. We've previously learned what lemmas are; to obtain the lemmas for a given synonym set, you can use the `lemma_names()` method:


In [0]:
print(wn.synset('car.n.01').lemma_names())


['car', 'auto', 'automobile', 'machine', 'motorcar']


You can also get the definition of a synset:


In [0]:
print(wn.synset('car.n.01').definition())


a motor vehicle with four wheels; usually propelled by an internal combustion engine


#### Negation

With WordNet, we can easily detect negations. This approach is fast, requires no training data, and has fairly good predictive accuracy. On the other hand, it cannot handle context well or work with multiword phrases.


### SentiWordNet

Based on WordNet synsets, SentiWordNet is a lexical resource for opinion mining, where each synset is assigned three sentiment scores: positivity, negativity, and objectivity.

In [0]:
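import nltk
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')  # SentiWordNet data; it builds on the WordNet corpus downloaded earlier
cat = swn.senti_synset('cat.n.03')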

In [0]:
cat.pos_score()

0.0

In [0]:
cat.neg_score()

0.125

In [0]:
cat.obj_score()

0.875

### Stop Words

Stop words are extremely common words that would be of little value in our analysis, so they are often excluded from the vocabulary entirely. Some common examples are determiners like "the", "a", "an", and "another", but your list of stop words (or <b>stop list</b>) depends on the context of the problem you're working on.
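
As a rough sketch, filtering against a stop list can be as simple as the following. The `STOP_LIST` here is a hypothetical, deliberately tiny list; real stop lists are much longer and depend on the problem.

```python
# Hypothetical stop list for illustration only.
STOP_LIST = {"the", "a", "an", "another", "is", "of"}

def remove_stop_words(tokens):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_LIST]

tokens = "The cat is an enemy of another cat".split()
print(remove_stop_words(tokens))  # ['cat', 'enemy', 'cat']
```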


## Information Extraction

Information Extraction is the process of acquiring meaning from text in a computational manner. 

### Data Forms

#### Structured Data

Structured data has a regular and predictable organization of entities and relationships.

#### Unstructured Data

Unstructured data, as the name suggests, assumes no organization. This is the case with most written textual data. 

### What is Information Extraction?

With that said, information extraction is the means by which you acquire structured data from a given unstructured dataset. There are a number of ways in which this can be done, but generally, information extraction consists of searching for specific types of entities and relationships between those entities. 

For example, consider the following text:

```
Martin received a 98% on his math exam, whereas Jacob received an 84%. Eli, who also took the same test, received an 89%. Lastly, Ojas received a 72%.
```
This is clearly unstructured. It requires reading for any logical relationships to be extracted. Through the use of information extraction techniques, however, we could output structured data such as the following: 

```
Name     Grade
Martin   98
Jacob    84
Eli      89
Ojas     72
```
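
For this particular text, a regular expression is enough to recover the table. The sketch below assumes every grade is phrased as "Name received a/an N%", optionally with a comma-delimited aside after the name; `GRADE_PATTERN` is a hypothetical pattern written for this example and would not generalize to arbitrary prose.

```python
import re

text = ("Martin received a 98% on his math exam, whereas Jacob received an 84%. "
        "Eli, who also took the same test, received an 89%. "
        "Lastly, Ojas received a 72%.")

# Capitalized name, an optional ", ... ," aside, then "received a/an <digits>%".
GRADE_PATTERN = re.compile(r"([A-Z][a-z]+)(?:, [^.,]*,)? received an? (\d+)%")

# Turn each (name, grade) match into a structured record.
records = [{"Name": name, "Grade": int(grade)}
           for name, grade in GRADE_PATTERN.findall(text)]

print(f"{'Name':<8} Grade")
for record in records:
    print(f"{record['Name']:<8} {record['Grade']}")
```

Note that "Lastly" is capitalized but is not picked up as a name, because it is not followed by "received". Hand-written rules like this are brittle, which is one reason to use the entity recognizers covered next.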

## Named Entity Extraction

Named entities are noun phrases that refer to specific types of individuals, such as organizations, people, dates, etc. Therefore, the purpose of a named entity recognition (NER) system is to identify all textual mentions of the named entities.

### spaCy

In the following exercise, we'll extract named entities with `spaCy`, a Python module commonly used for natural language processing in industry.

In [3]:
!pip3 install -U spacy

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.0.12)


In [0]:
import spacy
import pandas as pd

In [6]:
!python -m spacy download en

Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 59.9MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')



Using spaCy, we'll load the built-in English tokenizer, tagger, parser, NER and word vectors. We indicate this with the parameter `'en'`:

In [0]:
nlp = spacy.load('en')

We need an example to actually process, so below is some text from Columbia's website. With this example in mind, we feed it into the pipeline.

In [0]:
review = "Macbeth is most guilty of his own destruction and evil, but other characters played a significant part in his reasoning behind the crimes he committed. The Three Witches gave Macbeth a path to follow of how to obtain the goal that he had wanted for a long time, to become king. His wife, Lady Macbeth, was a huge incentive to commit the crimes he committed. She manipulated Macbeth in many ways. Even considering all of that, Macbeth is most guilty because he and only he can control his actions.The Three Witches tell Macbeth that he will become Thane of Cawdor and King of Scotland. They also predicted that Banquo's sons will end up being kings, but that Banquo would never become king. Because of their predictions, Macbeth murders many people. They also help Hecate concoct a potion that puts a curse on Macbeth. They definitely helped Macbeth along his evil path. Lady Macbeth is Macbeth's wife. She urges him to kill King Duncan so that he can be King. She later loses her nerve and starts sleepwalking because of the stress of killing Duncan. It gets so bad that she ends up committing suicide, but before that she did everything in her power to convince him to kill Duncan. She accused him of being weak like a woman. She knew that insulting him would motivate him. Macbeth is a nobleman of Scotland. Early on he is known as Thane of Glamis, but later becomes Thane of Cawdor after the original Thane of Cawdor is killed for treason. Macbeth is an extremely ambitious and power hungry man, and is always looking for a newer better title. Macbeth kills Duncan to become king, kills Banquo because his family was destined to become rulers over Scotland, and kills all of Macduff's family. The whole story seems to be about Macbeth and all of his efforts to get and keep the throne. Macbeth is definitely the guiltiest person in that whole ordeal"

In [0]:
doc = nlp(review) # run the full pipeline (tokenizer, tagger, parser, NER) on the text

Going along the process of named entity extraction, we begin by segmenting the text, i.e. splitting it into a list of sentences. 

In [17]:
sentences = [sentence.orth_ for sentence in doc.sents] # list of sentences
print("There were {} sentences found.".format(len(sentences)))

There were 22 sentences found.


Now, we go a step further and extract the noun phrases by taking advantage of spaCy's `noun_chunks` property. For each chunk, we keep its text along with the head word that the chunk's root attaches to.

In [18]:
nounphrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in doc.noun_chunks]
print(nounphrases[0])
#print("There were {} noun phrases found.".format(len(nounphrases)))

['Macbeth', 'is']


Lastly, we achieve our final goal: entity extraction. 

In [19]:
entities = list(doc.ents) # converts entities into a list
print("There were {} entities found".format(len(entities)))

There were 34 entities found


So now, we can keep only the organizations and people and turn them into a DataFrame for better visualization:

In [20]:
orgs_and_people = [entity.orth_ for entity in entities if entity.label_ in ['ORG','PERSON']]
pd.DataFrame(orgs_and_people)

Unnamed: 0,0
0,Macbeth
1,Lady Macbeth
2,Macbeth
3,Macbeth
4,Macbeth
5,Thane of Cawdor
6,Banquo
7,Banquo
8,Macbeth
9,Hecate







In summary, named entity extraction typically follows the process of sentence segmentation, noun phrase chunking, and, finally, entity extraction. 

### nltk

Next, we'll work through a similar example as before, this time using the nltk module to extract the named entities through the use of chunk parsing. As always, we begin by importing our needed modules and example: 

In [43]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('ieer')
import re
content = "Starbucks has not been doing well lately"

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package ieer to /root/nltk_data...
[nltk_data]   Unzipping corpora/ieer.zip.


Then, as always, we tokenize the sentence and follow up with part-of-speech tagging.

In [30]:
tokenized = nltk.word_tokenize(content)
tagged = nltk.pos_tag(tokenized)
print(tagged)

[('Starbucks', 'NNP'), ('has', 'VBZ'), ('not', 'RB'), ('been', 'VBN'), ('doing', 'VBG'), ('well', 'RB'), ('lately', 'RB')]


Great, now we've got something to work with!

The next step is named entity chunking: passing the tagged tokens to `nltk.ne_chunk()` returns a tree in which named entities are grouped into labeled subtrees. If we stored that tree in a variable called `namedEnt`, how do you think you would go about getting just the named entities out of it?

## Chunking

Chunking is used for entity recognition. It segments multi-token sequences and labels them with entity types, such as 'person', 'organization', or 'time'.

### Noun Phrase Chunking

Noun Phrase Chunking, or NP-Chunking, is where we search for chunks corresponding to individual noun phrases.

We can use nltk, as is the case most of the time, to create a chunk parser. We begin by importing nltk and defining a sentence with its parts of speech tagged (which we covered in the previous tutorial).


In [0]:
import nltk 
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

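Next, we define the tag pattern of an NP chunk. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. `<DT>?<JJ>*<NN>`, which reads as an optional determiner, followed by any number of adjectives, followed by a noun. The chunk parser uses this pattern to build the parse tree for a given sentence.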


In [0]:
pattern = "NP: {<DT>?<JJ>*<NN>}" 

Next, we create the chunk parser with the nltk `RegexpParser()` class.

In [0]:
NPChunker = nltk.RegexpParser(pattern) 

And lastly, we actually parse the example sentence and display its parse tree. 


In [41]:
result = NPChunker.parse(sentence) 
print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
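
To see why the tag pattern picks out exactly those two chunks, here is a rough plain-Python sketch of the same idea: write the tags out as one string and run an ordinary regular expression over it. This mirrors the concept only and is not how nltk implements `RegexpParser`; `tag_string`, `NP_PATTERN`, and `chunks` are hypothetical names.

```python
import re

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Encode the tag sequence as "<DT><JJ><JJ><NN><VBD><IN><DT><NN>".
tag_string = "".join("<{}>".format(tag) for _, tag in sentence)

# Optional determiner, any number of adjectives, then a noun.
NP_PATTERN = re.compile(r"(<DT>)?(<JJ>)*<NN>")

chunks = []
for match in NP_PATTERN.finditer(tag_string):
    # Each tag contributes exactly one "<", so counting them converts
    # character offsets back into token indices.
    start = tag_string[:match.start()].count("<")
    end = tag_string[:match.end()].count("<")
    chunks.append(" ".join(word for word, _ in sentence[start:end]))

print(chunks)  # ['the little yellow dog', 'the cat']
```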


## Relation Extraction 

Once we have identified named entities in a text, we then want to analyze the relations that exist between them. This can be done either with rule-based systems, which typically look for specific patterns in the text that connect entities and the intervening words, or with machine learning systems, which typically attempt to learn such patterns automatically from a training corpus.

### Rule-Based Systems

In the rule-based systems approach, we look for all triples of the form (X, a, Y), where X and Y are named entities and a is the string of words that indicates the relationship between X and Y. Using regular expressions, we can pull out those instances of a that express the relation that we are looking for. 

In the following code, we search for strings that contain the word "in". The negative lookahead `(?!\b.+ing)` allows us to disregard strings such as `success in supervising the transition of`, where "in" is followed by a gerund.

In [44]:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']


Note that the X and Y named entity types are consistent across all results: every X is an organization and every Y is a location. Object type matching is an important and required part of this process.
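
To check what the `IN` pattern accepts and rejects, we can try it directly on a few strings of the kind that appear between two entities. The `fillers` list below is a hypothetical sample assembled from the output above plus the gerund example; it is a sketch for experimenting with the pattern, not part of the extraction pipeline.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hypothetical sample of the text found between an ORG and a LOC.
fillers = [
    "in",
    ", based in",
    "firm in",
    "success in supervising the transition of",  # "in" followed by a gerund
]

for filler in fillers:
    # match() anchors at the start of the string; the leading .* lets "in" appear anywhere.
    print(repr(filler), bool(IN.match(filler)))
```

The first three strings match, while the last one is rejected because the lookahead finds a word ending in "ing" after "in".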

Next, we'll assign the test and training data URLs to variables, as well as the filenames for each of those datasets.

### Preparing the Data

To implement our bag-of-words linear classifier, we need our data in a format that allows us to feed it into the classifier. Using `sklearn.feature_extraction.text.CountVectorizer` from the scikit-learn module, we can convert the text documents to a matrix of token counts. So first, we import all the needed modules:

In [0]:
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer        
from nltk.stem.porter import PorterStemmer

We need to remove punctuation, lowercase, remove stop words, and stem words. All these steps can be directly performed by CountVectorizer if we pass the right parameter values. We can do this as follows.

We first create a stemmer, using the Porter Stemmer implementation.

In [0]:
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    # Reduce each token to its stem, e.g. "running" -> "run"
    stemmed = [stemmer.stem(item) for item in tokens]
    return stemmed

Here, we have our tokenizer, which removes non-letters and stems:

In [0]:
def tokenize(text):
    # Replace every non-letter character with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # Split into word tokens, then stem each one
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

Here we initialize the vectorizer with the CountVectorizer class, making sure to pass our tokenizer (which also applies the stemmer) as a parameter, remove stop words, and lowercase all characters.

In [51]:
vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 85
)

print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=85, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenize at 0x7f778dd8de18>, vocabulary=None)


Next, we use the `fit_transform()` method to transform our corpus data into feature vectors. Since the input needed is a list of strings, we concatenate all of our training and test data. 
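
To make "feature vectors" concrete, here is a plain-Python sketch of the kind of token-count matrix that this step produces: one row per document and one column per vocabulary term. It illustrates the idea only and is not scikit-learn's implementation; the two-sentence `corpus` is a hypothetical stand-in for the real training and test data.

```python
from collections import Counter

corpus = ["the cat sat", "the cat saw the dog"]  # hypothetical toy corpus

tokenized = [document.split() for document in corpus]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Unlike this sketch, the vectorizer configured above also lowercases, stems, drops stop words, and keeps only the 85 most frequent terms.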