# Text Mining
# Linguistic Features



**Objectives**

+ Work with PDF files using PyPDF2.
+ Perform tokenization.
+ Perform lemmatization.
+ Remove stop words.
+ Perform part-of-speech tagging.
+ Visualize the part-of-speech tags.
+ Perform named entity recognition.
+ Visualize the named entities.



When we perform text mining, we typically need to clean and preprocess the text first. spaCy provides a common API for extracting linguistic features from a Doc object. By the end of this week, I hope you'll understand these linguistic features. I will show you how to extract text from PDF files using the PyPDF2 library. Then we will cover the following text preprocessing steps: tokenization, lemmatization, stop word removal, part-of-speech (POS) tagging, and named entity recognition (NER). Last but not least, we will learn how to visualize the POS tags and named entities.


**Readings**

+ Working with PDF files in Python (https://www.geeksforgeeks.org/working-with-pdf-files-in-python/)
+ Tokenization (https://spacy.io/usage/linguistic-features#tokenization)
+ Lemmatization (https://spacy.io/usage/linguistic-features#lemmatization)
+ Part of Speech Tagging (https://spacy.io/usage/linguistic-features#pos-tagging)
+ Named Entity Recognition (https://spacy.io/usage/linguistic-features#named-entities)




# Linguistic Features

We have learned how to clean and impute numerical values, and we have many tools for handling numerical features. Raw text is much harder to clean and preprocess. Different words may mean the same thing; for example, STL, The Gateway City, and The Lou all refer to Saint Louis. On the other hand, the same word can mean different things. For example, in the sentence "There is an apple on the Apple iPad", the word "apple" refers first to a fruit and then to a company.

spaCy provides linguistic features that help us clean and preprocess raw text. We will cover the following topics this week:

+ Tokenization
+ Stemming
+ Lemmatization
+ Stop words
+ Part-of-speech tagging
+ Named entity recognition


## Extract the Text from PDF Files

PDF is one of the most widely used document formats. PDF files can be opened on PCs, tablets, and smartphones running different operating systems, such as Linux, macOS, and Windows. Therefore, we need a tool to handle PDF files in Python. There are many libraries for working with PDF files; we will use the PyPDF2 library (https://pypi.org/project/PyPDF2). It can perform the following tasks:

+ extracting document information (title, author, …)
+ splitting documents page by page
+ merging documents page by page
+ cropping pages
+ merging multiple pages into a single page
+ encrypting and decrypting PDF files

There are two methods to install PyPDF2.

1. Install it with conda by running **conda install -c conda-forge pypdf2** in the Anaconda Prompt.
2. Install it from a Jupyter Notebook by running **pip install PyPDF2** in a cell.

Let's install it in the Jupyter Notebook.

In [8]:
# !pip install PyPDF2

Let's load a paper titled "Dollar-cost averaging just means taking risk later" by Vanguard into memory.

In [9]:
# Load the required libraries
import PyPDF2
import re

# Specify the file
mypdf = '/Volumes/External/DSCI/508_ML/NN/transfer learning w Python.pdf'

# Open the PDF in binary mode; the with block closes the file when done
with open(mypdf, 'rb') as pdfFile:
    # Create a PDF reader object
    # (PyPDF2 >= 2.0 API; older releases used PdfFileReader, numPages, getPage, extractText)
    pdfReader = PyPDF2.PdfReader(pdfFile)

    # Get the number of pages in the PDF file
    pageCount = len(pdfReader.pages)
    print(f' There are {pageCount} pages in the file :{mypdf}')

    # Extract the text from each page and collect it in a list
    output = [page.extract_text() for page in pdfReader.pages]

# Concatenate the items in the list into a single string
alltexts = ' '.join(output)

# Print out the first 300 characters of the raw text
print("----" * 25)
print(alltexts[:300])
print("----" * 25)

# Replace newlines with spaces so that words on adjacent lines are not merged
alltexts = re.sub(r'\n', ' ', alltexts)
# Remove punctuation from the text
alltexts = re.sub(r'[^\w\s]', '', alltexts)

# Print out the first 300 characters of the cleaned text
print("----" * 25)
print(alltexts[:300])
print("----" * 25)


 There are 20 pages in the file :/Volumes/External/DSCI/508_ML/NN/transfer learning w Python.pdf




----------------------------------------------------------------------------------------------------



 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


 



----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
                   
----------------------------------------------------------------------------------------------------
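
The cleaning step can be tried on a small string without a PDF. The sketch below uses a made-up sample (not text from the paper) and replaces each newline with a space rather than an empty string, so that words on adjacent lines are not glued together. The final whitespace-collapsing step is an addition for tidiness.

```python
import re

# Hypothetical sample standing in for text extracted from a PDF page
sample = "Dollar-cost averaging:\njust means taking\nrisk later!"

# Replace newlines with a space so words on adjacent lines stay separate
no_newlines = re.sub(r"\n", " ", sample)

# Remove every character that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", no_newlines)

# Collapse any runs of whitespace left behind
cleaned = re.sub(r"\s+", " ", no_punct).strip()

print(cleaned)
```

Note that stripping punctuation also turns "Dollar-cost" into "Dollarcost", which may or may not be desirable depending on the task.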




## Tokenization

All the text is now stored in a single string, alltexts. We want to split the document into smaller units called tokens, such as individual words and punctuation marks.

To process text with spaCy, we first generate a **Doc** object by calling nlp on the text.

In [10]:
import spacy
# Load the model
nlp = spacy.load("en_core_web_sm")
# Process the texts
doc = nlp(alltexts)
# Print out the first 50 tokens
for token in doc[:50]:
    print(f'Token text = {token.text}; Is the token lowercase? {token.is_lower}; Does the token consist of digits? {token.is_digit} ')



Token text =                    ; Is the token lowercase? False; Does the token consist of digits? False 


A spaCy token supports many attributes, such as doc, sent, and text. Please see the official documentation for details (https://spacy.io/api/token#attributes).
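
To get a feel for what tokenization and token attributes do, here is a simplified sketch in plain Python. The function `simple_tokenize` is a hypothetical stand-in, not spaCy's tokenizer. spaCy documents `token.is_lower` and `token.is_digit` as equivalent to calling `str.islower()` and `str.isdigit()` on the token text, which is what the loop uses.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A simplified stand-in for illustration; spaCy's tokenizer applies
    many more rules (contractions, URLs, abbreviations, and so on).
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Text mining is fun in 2024!")
for tok in tokens:
    # str.islower and str.isdigit play the role of token.is_lower and token.is_digit
    print(f"{tok!r}: lowercase={tok.islower()}, digits={tok.isdigit()}")
```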

## Stemming and Lemmatization

Documents use different forms of a word, such as go, goes, and going. Because a text typically contains a large number of tokens, we often want to reduce derived words to their word stem, such as go for goes and going. According to Wikipedia (https://en.wikipedia.org/wiki/Stemming), "The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root." spaCy does not provide a stemmer; it supports lemmatization instead.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

According to Wikipedia:

" Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on the part of speech. However, stemmers are typically easier to implement and run faster. The reduced "accuracy" may not matter for some applications. In fact, when used within information retrieval systems, stemming improves query recall accuracy, or true positive rate, when compared to lemmatization. Nonetheless, stemming reduces precision, or the proportion of positively-labeled instances that are actually positive, for such systems.

For instance:

+ The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
+ The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatization.
+ The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context."

Source: https://en.wikipedia.org/wiki/Lemmatisation
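
The examples in the quote can be reproduced with a toy comparison. Both functions below are hypothetical illustrations: `toy_stem` strips a few suffixes without any dictionary, and `toy_lemmatize` looks words up in a tiny hand-written table keyed by part of speech. Neither reflects how spaCy works internally.

```python
def toy_stem(word):
    """Strip a few common suffixes; no dictionary or context is used."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table keyed by (word, part of speech)
LEMMAS = {
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("went", "VERB"): "go",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a word in a given part of speech."""
    return LEMMAS.get((word, pos), word)

examples = [("better", "ADJ"), ("walking", "VERB"),
            ("meeting", "NOUN"), ("meeting", "VERB"), ("went", "VERB")]
for word, pos in examples:
    print(f"{word:8} {pos:5} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The stemmer returns "meet" for "meeting" regardless of context and leaves "better" and "went" unchanged, while the lookup distinguishes the noun from the verb and maps the irregular forms to "good" and "go".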

In [11]:
import spacy
# Load the model
nlp = spacy.load("en_core_web_sm")
# Process the texts
doc = nlp(alltexts)
# Print out the first 50 tokens
for token in doc[:50]:
    print(f'Token text = {token.text}; The lemma ={token.lemma_} ')

Token text =                    ; The lemma =                    


## Stop Words

An English sentence usually mixes informative words with very common ones. For example, the sentence "It is a beautiful day!" contains five words, but only two of them, "beautiful" and "day", are keywords. The other three are among the most common words in English and carry little information. Such words are called stop words. Removing them reduces the number of tokens and may improve the performance of text mining.
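
A minimal sketch of the idea, using a small hand-picked stop word set rather than spaCy's list:

```python
# A tiny, hand-picked stop word set for illustration only;
# spaCy's default English list is much longer
stop_words = {"it", "is", "a", "the", "of", "and"}

sentence = "It is a beautiful day"

# Lowercase each word before the lookup so that "It" matches "it"
keywords = [w for w in sentence.split() if w.lower() not in stop_words]

print(keywords)
```

Lowercasing before the lookup matters because stop word lists are usually stored in lowercase, so a capitalized "It" at the start of a sentence would otherwise slip through.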

### Remove Stopwords
We can remove stopwords while performing the following tasks:

+ Text Classification
    + Spam Filtering
    + Language Classification
    + Genre Classification
+ Caption Generation
+ Auto-Tag Generation
 

### Avoid Stopword Removal
+ Machine Translation
+ Language Modeling
+ Text Summarization
+ Question-Answering problems

Source: https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/



Let's look at spaCy's default stop words.

In [12]:
import spacy

nlp = spacy.load('en_core_web_sm')

spacy_stopwords = nlp.Defaults.stop_words

print(spacy_stopwords)


{'now', 'why', 'namely', 'ourselves', 'show', 'whenever', 'due', 'them', 'those', 'take', 'give', 'only', 'well', 'least', 'anywhere', 'else', 'much', "'ve", 'across', 'perhaps', "n't", 'sixty', 'seem', 'anyway', 'us', '’re', 'own', 'been', 'along', 'since', 'somewhere', 'who', 'whole', 'twelve', 'were', 'from', 'anyhow', 'side', 'full', 'my', 'any', 'thereby', 'ours', 'eleven', 'thereafter', 'just', 'whom', 'again', 'hereby', 'be', 'nine', 'being', 'am', 'back', 'often', 'amongst', 'can', 'against', 'afterwards', 'n‘t', "'d", 'top', 'your', 'thereupon', 'together', 'what', '‘ll', 'anything', 'even', 'amount', 'third', 'throughout', 'unless', 'some', 'after', 'whatever', 'almost', 'because', 'bottom', 'how', 'still', 'in', 'except', "'m", 'whereafter', 'they', 'seemed', 'less', 'through', 'would', 'either', 'such', 'could', 'this', 'wherever', 'yours', 'me', '‘d', 'fifteen', 'seems', 'three', 'may', 'ever', 'no', 'everything', 'few', 'name', 'fifty', 'whose', 'itself', 'down', 'most', 

We remove all the default stop words from the texts.

In [13]:

doc = nlp(alltexts)

tokens_without_stopword = [token for token in doc if token.text not in spacy_stopwords]


print(tokens_without_stopword[:100])

[                   ]


We notice that Vanguard appears many times in this report. Because it appears so often, it carries little information, so we may decide to remove it by adding it to the stop word list.
Sometimes we may instead need to take words out of the default stop word list so that they are kept in the text for a specific task. For example, "call" and "put" are included in spaCy's default stop words. This report covers investment strategies, which may involve put and call options. Therefore, we remove "call" and "put" from spaCy's default stop words so that they stay in the text.

### Add Customized Stop Words

In [14]:
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')
# Specify the user defined stop words
customized_stop_words = ['Vanguard', 'market']
# Add the user specified stop words to the Spacy default stop words
for token in customized_stop_words:
    nlp.Defaults.stop_words.add(token)

# Set the tag of the customized stop words as stop word 
for token in customized_stop_words:
    nlp.vocab[token].is_stop = True
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')

There are 326 stop words in Spacy
There are 328 stop words in Spacy


### Remove Words from the Stop Word List

In [15]:
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')
# Specify the words to remove from the default stop words of Spacy
remove_stop_words = ['call', 'put']
# Remove the user specified stop words from the Spacy default stop words
for token in remove_stop_words:
    nlp.Defaults.stop_words.remove(token)

# Set the tag of the removed stop words as non-stop word 
for token in remove_stop_words:
    nlp.vocab[token].is_stop = False
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')

There are 328 stop words in Spacy
There are 326 stop words in Spacy
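
The cells above change `nlp.Defaults.stop_words` in place. As an alternative sketch, the same adjustments can be expressed with set operations that build a new set and leave the defaults untouched. The small `default_stop_words` set and the token list are made up for illustration.

```python
# Small stand-in for a default stop word list (illustrative, not spaCy's full set)
default_stop_words = {"the", "a", "is", "call", "put", "in"}

# Domain-specific adjustments: one word to add, two words to keep in the text
extra_stop_words = {"Vanguard"}
keep_words = {"call", "put"}

# Build a new set instead of mutating the defaults
custom_stop_words = (default_stop_words | extra_stop_words) - keep_words

tokens = ["Vanguard", "sells", "a", "put", "option", "in", "the", "market"]
kept = [t for t in tokens if t not in custom_stop_words]
print(kept)
```

Working on a copy keeps the change local to one analysis, which can help when several notebooks or tasks share the same loaded pipeline.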


Let's clean the text again using the newly defined stop words.

In [16]:
doc = nlp(alltexts)

# Get the new stop words
spacy_stopwords = nlp.Defaults.stop_words

tokens_without_stopword = [token for token in doc if token.text not in spacy_stopwords]

print(tokens_without_stopword[:250])

[                   ]


## Part of Speech Tagging 

Part-of-speech (POS) tagging reads text and assigns a part of speech, such as noun, verb, or adjective, to each word (and to other tokens such as punctuation).
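
Once tokens carry part-of-speech tags, the tags can be counted or used as a filter. The sketch below works on hand-assigned (token, tag) pairs, which are an assumption for illustration and not model output; a real tagger may label some of these words differently.

```python
from collections import Counter

# Hand-assigned (token, coarse POS tag) pairs for illustration;
# these are not model output and a real tagger may disagree
tagged = [("Text", "NOUN"), ("mining", "NOUN"), ("is", "AUX"),
          ("fun", "ADJ"), ("!", "PUNCT")]

# Count how often each tag occurs
tag_counts = Counter(pos for _, pos in tagged)
print(tag_counts.most_common())

# Keep only content words, a common filtering step in text mining
content_words = [tok for tok, pos in tagged if pos in {"NOUN", "VERB", "ADJ"}]
print(content_words)
```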


In [17]:
import spacy
nlp = spacy.load("en_core_web_sm")
# Process the texts
doc = nlp(alltexts)

# Summarize the first 20 tokens
for token in doc[:20]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

                                        SPACE _SP ROOT      False False


### Visualize the Dependency Parse

A picture is worth a thousand words. We can use the displaCy visualizer to show the part-of-speech tags and the dependency parse of a sentence.

In [18]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Text mining is fun!")
# Visualize it by setting style to "dep" and jupyter to True
displacy.render(doc, style="dep", jupyter = True)

## Named Entity Recognition (NER)
When we read text, we naturally recognize named entities such as organizations, monetary amounts, people, and locations. For example, in the news headline "Tesla buys $1.5 billion in bitcoin, plans to accept it as payment", we can find the following named entities:

+  **Tesla** is a company
+  **\$1.5 billion** is an amount of money

In information extraction, a named entity is a real-world object, such as persons, locations, organizations, products, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, or anything else that can be named. Named entities can simply be viewed as entity instances (e.g., New York City is an instance of a city).

Source: https://en.wikipedia.org/wiki/Named_entity


Named entities are available through the ents property of a Doc object in spaCy.

In [19]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tesla buys $1.5 billion in bitcoin, plans to accept it as payment")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Tesla 0 5 ORG
$1.5 billion 11 23 MONEY
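
The start_char and end_char values are character offsets into the original string, with the end offset exclusive. The sketch below copies the offsets from the output above and uses plain string slicing; the masking step at the end is an illustrative use, not part of spaCy.

```python
text = "Tesla buys $1.5 billion in bitcoin, plans to accept it as payment"

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 5, "ORG"), (11, 23, "MONEY")]

for start, end, label in spans:
    # end_char is exclusive, so ordinary string slicing recovers the entity text
    print(f"{text[start:end]!r} -> {label}")

# Example use: replace each entity with its label, working right to left
# so that earlier offsets stay valid
masked = text
for start, end, label in sorted(spans, reverse=True):
    masked = masked[:start] + f"[{label}]" + masked[end:]
print(masked)
```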


## NER Labels
The NER labels are summarized as follows:

<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>

Source: https://notebook.community/rishuatgithub/MLPy/nlp/UPDATED_NLP_COURSE/02-Parts-of-Speech-Tagging/02-NER-Named-Entity-Recognition


Let's count the number of DATE entities mentioned in the paper.

In [20]:
import spacy
nlp = spacy.load("en_core_web_sm")
# Process the texts
doc = nlp(alltexts)
len([ent for ent in doc.ents if ent.label_=='DATE'])

0
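
Counting one label generalizes to counting all labels at once. The sketch below assumes the entities have already been collected as (text, label) pairs, for example with `[(ent.text, ent.label_) for ent in doc.ents]`; the two pairs used here are the entities shown earlier for the Tesla headline.

```python
from collections import Counter

# (text, label) pairs; these two are the entities shown for the Tesla headline
entities = [("Tesla", "ORG"), ("$1.5 billion", "MONEY")]

# Tally how many entities carry each label
label_counts = Counter(label for _, label in entities)
print(label_counts)

# A label that never occurs simply counts as zero
print(label_counts["DATE"])
```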

### Visualize the Named Entity

We can use the entity visualizer to highlight the named entities and their labels in the given texts.

In [21]:
import spacy
from spacy import displacy

text = "Tesla buys $1.5 billion in bitcoin, plans to accept it as payment."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# Visualize it by setting style to "ent" and jupyter to True
displacy.render(doc, style="ent", jupyter = True)

We can customize the entity visualizer by using the options parameter to specify which entities to mark and which colors to use for them. Let's look at an example that highlights only PERSON entities in a specified color.

In [22]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(alltexts)
# set person to mark and color to use
options = {"ents": ['PERSON'], "colors": {'PERSON':'#82E0AA'}}
# Visualize it by setting style to "ent", jupyter to True, and passing the options
displacy.render(doc, style="ent", jupyter = True, options=options)



# Summary


+ We can use the PyPDF2 library to read PDF files.
+ We split the text into tokens by calling nlp to create a Doc object.
+ We understand lemmatization and can extract lemmas from a Doc object.
+ We learned how to remove stop words from a given text.
+ We can add words to or remove words from spaCy's stop word list.
+ We learned how to perform part-of-speech tagging.
+ We visualized the part-of-speech tags and the dependency parse.
+ We learned how to perform named entity recognition.
+ We visualized the named entities.


