#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/2_1.png)

# 2.1 NLP Tasks with Python NLTK
* [2.1.1. Exploring NLTK](#2.1.1)
* [2.1.2. Exploring SpaCy](#2.1.2)

---

### Basic Natural Language Processing

- Natural Language:  
Any language used by humans in everyday communication, written or spoken  

- Other Languages:   
Constructed languages or computer languages  

- Natural Language Processing:  
Any computation on or manipulation of natural language to gain insight into what words mean and how sentences are constructed.  


![](images/nltkbook.png)

In [None]:
#pro TIP: explore the functions/properties of an object
def functions(obj):
    return [prop for prop in dir(obj) if not prop.startswith('_')]

---
#### *Check the NLP with Python book [online version](https://www.nltk.org/book/)*

In [None]:
#import the nltk package
import nltk
#call the nltk downloader
#nltk.download()

In [None]:
nltk?

The library includes curated corpora for research purposes

In [None]:
from nltk.book import *

It has many built-in functions for working with text

In [None]:
text7

In [None]:
### Frequency of words
dist = FreqDist(text7)
len(dist)

In [None]:
dist

In [None]:
vocab1 = dist.keys()
# In Python 3 dict.keys() returns an iterable view instead of a list
list(vocab1)[:10]

In [None]:
dist['risk']

In [None]:
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
freqwords
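
`FreqDist` subclasses Python's `collections.Counter`, so the filtering above can be tried on a toy token list without loading a corpus. This is a minimal sketch: the `toy_tokens` list and the thresholds are made up for illustration.

```python
from collections import Counter

# Toy token list standing in for a corpus such as text7 (hypothetical data)
toy_tokens = ["market", "risk", "market", "shares", "risk", "market", "the", "the"]

# Counter offers the same mapping interface that FreqDist builds on
toy_dist = Counter(toy_tokens)

# Keep words longer than 4 characters that occur at least twice
toy_freqwords = sorted(w for w in toy_dist if len(w) > 4 and toy_dist[w] >= 2)

print(toy_dist["risk"])         # count for a single word
print(toy_dist.most_common(2))  # the two most frequent tokens
print(toy_freqwords)
```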

---
### 2.1.1. Exploring NLTK
<a id="2.1.1"></a>

1. Tokenization: convert sentences to words  
2. Removing stop words: frequent words such as "the", "is", etc. that carry little specific meaning  
3. Stemming: words are reduced to a root by stripping inflection, usually by dropping a suffix.  
4. Lemmatization: another approach to removing inflection, which determines the part of speech and uses a detailed database of the language.  

---
#### *Learn more about NLP steps at [towardsdatascience.com](https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958)*

#### Tokenization NLTK
- It is the process of converting text into tokens before transforming it into vectors. 
- Working with tokens also makes it easier to filter out unnecessary ones. 
- It can split a document into paragraphs, or sentences into words.

Some types of tokenization [text-processing.com](https://text-processing.com/demo/tokenize/)
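
To see how the choice of tokenizer changes the result, the sketch below runs three NLTK tokenizers that need no downloaded data on the same sentence. The sample sentence is made up, and these three tokenizers are only an illustrative selection.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer, wordpunct_tokenize

sample = "The stocks of AAL aren't higher."

# Split on whitespace only: punctuation stays attached to the words
whitespace_tokens = WhitespaceTokenizer().tokenize(sample)

# Split into alphanumeric runs and punctuation runs
wordpunct_tokens = wordpunct_tokenize(sample)

# Keep only alphanumeric runs, dropping punctuation entirely
alnum_tokens = RegexpTokenizer(r"\w+").tokenize(sample)

for name, toks in [("whitespace", whitespace_tokens),
                   ("wordpunct", wordpunct_tokens),
                   ("alphanumeric", alnum_tokens)]:
    print(f"{name:12s} {toks}")
```

Note how the contraction "aren't" is kept whole, split into three tokens, or split into two, depending on the tokenizer.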

In [None]:
import pandas as pd

In [None]:
data_clean = pd.read_pickle('pickle/AnnualReports_corpus.pkl')

In [None]:
report = data_clean.report[0]

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
# split the report text into word tokens
tokens = word_tokenize(report)

In [None]:
len(tokens)

Example of sentence tokenization

In [None]:
text_ = "This is the first sentence. The stocks of AAL are higher. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text_)
print(len(sentences), sentences)

#### Stopwords NLTK

The main idea is to control the stop-word list ourselves, combining NLTK's English stop words with the list of financial stop words created, and more.

With the `stopwords` corpus reader from `nltk`, we can load a predefined list of English stop words, for instance:

In [None]:
from nltk.corpus import stopwords

In [None]:
english_stop_words = stopwords.words('english')

Likewise, using `pickle`, we load the previously saved financial stop words

In [None]:
import pickle

In [None]:
stopwordsfile = "pickle/AnnualReports_stopwords.pkl"
# the with-statement closes the file even if loading fails
with open(stopwordsfile, "rb") as file:
    financial_stop_words = pickle.load(file)

In [None]:
# Combine both lists and drop duplicates.
# Note: this rebinds the name `stopwords`, shadowing the nltk.corpus reader imported above.
stopwords = list(set(english_stop_words) | set(financial_stop_words))

In [None]:
def removeStopWords(text, stopwords_list):
    text = text.lower()
    for item in stopwords_list:
        text = text.replace(" " + item.lower() + " "," ")
        text = text.replace(" " + item.lower() + ","," ")
        text = text.replace(" " + item.lower() + "."," ")
        text = text.replace(" " + item.lower() + ";"," ")
    text = text.replace("+","")
    return text
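
The string-replacement approach above only catches stop words that are surrounded by specific characters. An alternative sketch filters at the token level instead. The function name `remove_stop_words_tokens`, the toy stop-word list, the sample sentence, and the simple regex tokenizer are all assumptions made for illustration.

```python
import re

def remove_stop_words_tokens(text, stop_words):
    """Return the lowercased tokens of `text` that are not stop words."""
    stop_set = {word.lower() for word in stop_words}  # set gives fast lookups
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # simple word-like tokens
    return [token for token in tokens if token not in stop_set]

toy_stop_words = ["the", "of", "is", "company"]  # hypothetical list
toy_text = "The profit of the Company is higher; the risk is lower."
filtered = remove_stop_words_tokens(toy_text, toy_stop_words)
print(filtered)
```

Because each token is compared as a whole, stop words at the start of the text or next to any punctuation are removed as well, at the cost of returning a list of tokens rather than a string.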

In [None]:
data_without_stopwords = removeStopWords(report,stopwords)

In [None]:
len(data_without_stopwords)

In [None]:
len(report)

#### Stemming NLTK

NLTK provides several stemmer interfaces, such as the Porter, Lancaster, and Snowball stemmers.

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
porter = PorterStemmer()

In [None]:
plurals = ['List', 'listed', 'lists', 'listing','listings']
singles = [porter.stem(plural).lower() for plural in plurals]

In [None]:
singles
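
The stemmers mentioned above share the same `stem()` method, so they can be compared side by side. A minimal sketch, assuming the English variant of the Snowball stemmer and a made-up word list:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

words = ["listed", "lists", "listing", "listings"]

# Map each stemmer name to the stems it produces for the same words
stems_by_stemmer = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, stems in stems_by_stemmer.items():
    print(f"{name:10s} {stems}")
```

Running the same comparison on domain vocabulary is a quick way to judge which stemmer is too aggressive for a given corpus.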

#### Lemmatization NLTK

In [None]:
from nltk.stem import WordNetLemmatizer 

In [None]:
lemmatizer = WordNetLemmatizer() 

In [None]:
print("assets :", lemmatizer.lemmatize("assets")) 
print("corpora :", lemmatizer.lemmatize("corpora")) 
print("better :", lemmatizer.lemmatize("better", pos ="a")) # a denotes adjective in "pos" 

---
#### *Stem sub-package from NLTK [www.nltk.org](http://www.nltk.org/api/nltk.stem.html)*

---
### 2.1.2. Exploring SpaCy
<a id="2.1.2"></a>

In [None]:
import spacy

In [None]:
# load core elements
nlp = spacy.load("en_core_web_sm")

In [None]:
#report

In [None]:
report_nlp = nlp(report)

In [None]:
type(report_nlp)

In [None]:
#functions(report_nlp)

SpaCy runs its whole pipeline at once, which explains why the `nlp(report)` call above takes so long: in a single pass it tokenizes the text, tags parts of speech, parses dependencies, and recognizes named entities, among other tasks. 

In [None]:
# report length (number of tokens)
len(report_nlp)

In [None]:
# type of the sentences attribute (a generator)
type(report_nlp.sents)

_Side note: generators_

Generators are functions that behave as iterators, i.e., you can iterate over them with a `for` loop, like you would with a list. But you can't index them. So this works: 

```python
for sent in report_nlp.sents:
    print(sent)
```

But not this: 

```python
print(report_nlp.sents[0]) # Doesn't work
```

However, you can force a generator into a list, using `list()`, and then index it. So to get the first sentence of text, one could write:

```python
list(report_nlp.sents)[0]
```

But actually if we just want the first one, we can do this: 

```python
next(report_nlp.sents)
```
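
The same behaviour can be checked with a plain Python generator, independent of SpaCy. In this minimal sketch, `toy_sentences` is a made-up stand-in for a document's `sents` attribute:

```python
from itertools import islice

def toy_sentences():
    """Yield sentences one at a time, as a generator of spans would."""
    for sentence in ["First sentence.", "Second sentence.", "Third sentence."]:
        yield sentence

gen = toy_sentences()
first = next(gen)   # consumes the first item
rest = list(gen)    # only the remaining items are left

# Take the first two items without building the full list
first_two = list(islice(toy_sentences(), 2))

try:
    toy_sentences()[0]  # generators do not support indexing
except TypeError as error:
    indexing_error = str(error)

print(first, rest, first_two, indexing_error, sep="\n")
```

A generator is consumed as it is read, which is why `rest` no longer contains the first sentence.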

In [None]:
len(list(report_nlp.sents))

What's the longest sentence in the report?

In [None]:
sent_lengths = [len(sent) for sent in report_nlp.sents]
max_length = max(sent_lengths)  # compute once instead of once per sentence
[sent for sent in report_nlp.sents if len(sent) == max_length]

#### Exploring properties of the tokens

In [None]:
amro = report_nlp[4]
print(amro, type(amro))

In [None]:
amro.i, amro.idx

In [None]:
amro.prefix_

In [None]:
pd.Series([word.i for word in report_nlp if word.text == 'loss']).hist(figsize=(12,6))

#### Exploring named entities

The following set shows the entity labels found in the report

In [None]:
set([w.label_ for w in report_nlp.ents])

In [None]:
entity_sentences = [ent.sent for ent in report_nlp.ents if ent.label_ == 'MONEY']
entity_sentences

In [None]:
spacy.displacy.render(entity_sentences, style='ent', jupyter=True)

#### POS tagging (Parts of speech)

Get the first 100 nouns in the report

In [None]:
print([w for w in report_nlp if w.pos_ == 'NOUN'][:100])

#### Dependency parsing

In [None]:
#spacy.displacy.render(entity_sentences, jupyter=True, options={'compact': True, 'collapse_punct': True, 'collapse_phrases': True})

---
#### *SpaCy is one of the most popular NLP tools nowadays [https://spacy.io](https://spacy.io)*