# Natural Language Processing (NLP): Part 1

**By: Dr. Reza Mousavi, University of Virginia, mousavi@virginia.edu**

In this notebook, we cover basic concepts in Natural Language Processing (NLP). We start with an introduction to NLP, then we cover pre-processing text, and finally, we will covert dictionary-based approaches in NLP.

## 1. Introduction

Before the emergence and proliferation of automated text mining approaches that we deploy these days, manual text mining approaches surfaced in mid 1980’s. The challenge of exploiting the large proportion of enterprise information that originates in “unstructured” form was first recognized in an IBM article by H.P. Luhn. As BI emerged in the 80s and 90s as a software category, the emphasis was on numerical data stored in relational databases.

By the booming of the World Wide Web and then the proliferation of social media websites, humans created a universal repository of knowledge such that almost any kind of knowledge can be found there. The users pay little to access the web. There is no central editorial board in the web, so anyone can contribute to the web. These properties stimulate the users to retrieve information from the web and add information to the web, making the web grow in an incredible speed.

Along with the availability of text due to the growth in the web and social media, technological advances (hardware and software) have enabled the field to advance during the past decade. Furthermore, the evolution of artificial intelligence urged scholars and practitioners to explore beyond the structured data and apply their models to unstructured data and particularly text data. Therefore the problem of text mining, **i.e. discovering useful knowledge from unstructured or semi-structured text,** has been becoming more and more attractive.

There are many applications for text mining techniques. A few examples include:

* Security applications: Monitoring the text generated online to identify threats.

* Biomedical applications: knowledge based search engine for biomedical texts.

* Marketing applications: analytical customer relationship management, advertisement, and product development. Please read “Sentiment Analysis Can Do More than Prevent Fraud and Turnover” by Michael Schrage (available in module Week 6).

* Academia: To study human behavior in a social environment.

## 2. What Is Text Analytics?

NLP, Text Analytics, and Text Mining (these terms are used interchangeably) is about looking for patterns in text. However, the superficial similarity between the two conceals real differences. Data mining can be more fully characterized as the extraction of implicit, previously unknown, and potentially useful information from data. The information is implicit in the input data: it is hidden, unknown, and could hardly be extracted without recourse to automatic techniques of data mining. With text mining, however, the information to be extracted is clearly and explicitly stated in the text. It’s not hidden at all—most authors go to great pains to make sure that they express themselves clearly and unambiguously—and, from a human point of view, the only sense in which it is “previously unknown” is that human resource restrictions make it infeasible for people to read the text themselves. The problem, of course, is that the information is not couched in a manner that is amenable to automatic processing. Text mining strives to bring it out of the text in a form that is suitable for consumption by computers directly, with no need for a human intermediary.

Though there is a clear difference philosophically, from the computer’s point of view the problems are quite similar. Text is just as opaque as raw data when it comes to extracting information— probably more so.

Another requirement that is common to both data and text mining is that the information extracted should be “potentially useful.” In one sense, this means actionable—capable of providing a basis for actions to be taken automatically. In the case of data mining, this notion can be expressed in a relatively domain-independent way: actionable patterns are ones that allow non-trivial predictions to be made on new data from the same source. Performance can be measured by counting successes and failures, statistical techniques can be applied to compare different data mining methods on the same problem, and so on. However, in many text mining situations it is far harder to characterize what “actionable” means in a way that is independent of the particular domain at hand. This makes it difficult to find fair and objective measures of success.

In many data mining applications, “potentially useful” is given a different interpretation: the key for success is that the information extracted must be comprehensible in that it helps to explain the data. This is necessary whenever the result is intended for human consumption rather than (or as well as) a basis for automatic action. This criterion is less applicable to text mining because, unlike data mining, the input itself is comprehensible. Text mining with comprehensible output is tantamount to summarizing salient features from a large body of text, which is a subfield in its own right: text summarization.

The Text Mining (TM) methods usually involve the following steps:

* Basic pre-processing operations such as tokenization, stemming, and removing stop words.

* Advanced text mining operations, involving identification of complex patterns. For instance, we can identify the main topics that were discussed in the text or we can quantify the sentiment in the text.

TM exploits techniques and methodologies from data mining, machine learning, information retrieval, corpus-based computational linguistics. The main objective in TM is to extract useful information from the text documents. There are many different methods used in TM. In this class, we focus on the following areas: Pre-processing the Text, Sentiment Analysis, and Topic Modeling.

## 3. Pre-processing Text

The main goal of pre-processing the text documents is to prepare the text for TM methods. Depending on the type of TM method that we want to deploy, there are different pre-processing steps that we should take to prepare our text data. The main steps include:

### 3.1. Tokenization:

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing. Electronic text is a linear sequence of symbols (characters or words or phrases). Naturally, before any real text processing is to be done, text needs to be segmented into linguistic units such as words, punctuation, numbers, alpha-numerics, etc. This process is called tokenization.

In English, words are often separated from each other by blanks (white space), but not all white space is equal. Both “Los Angeles” and “rock ‘n’ roll” are individual thoughts despite the fact that they contain multiple words and spaces. We may also need to separate single words like “I’m” into separate words “I” and “am”.

Tokenization is a kind of pre-processing in a sense; an identification of basic units to be processed. It is conventional to concentrate on pure analysis or generation while taking basic units for granted. Yet without these basic units clearly segregated it is impossible to carry out any analysis or generation.

The most common method for tokenization is called “standard (white space) tokenization”. In the standard tokenization, each word that has a white space before or after its location in the text will be considered a token. For instance, there are four tokens in the following text: “This is an example”. Those are “This”, “is”, “an”, and “example”.

### 3.2. True-casing (convert to lower/ upper case):

Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence.

After tokenization, we usually need to convert all of the tokens into lower (or upper) case. This way, the software would not assume any difference between “Good” and “good”.

### 3.3. Stop-word removal:

Another important pre-processing step is to remove stop-words from the text. Each language has its own list of stop-words. A comprehensive list of stop-words for many languages can be found here: http://www.ranks.nl/stopwords. Examples of stop-words include “a”, “an”, and “did”. Stop-words are basically a set of commonly used words in any language, not just English. The reason why stop-words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. For example, in the context of a search engine, if your search query is “how to develop information retrieval applications”, if the search engine tries to find web pages that contained the terms “how”, “to”, “develop”, “information”, “retrieval”, “applications” the search engine is going to find a lot more pages that contain the terms “how” and “to” than pages that contain information about developing information retrieval applications because the terms “how” and “to” are so commonly used in the English language. So, if we disregard these two terms, the search engine can actually focus on retrieving pages that contain the keywords: “develop”, “information”, “retrieval”, “applications” – which would more closely bring up pages that are really of interest. This is just the basic intuition for using stop-words.

### 3.4. Stemming & lemmatization:

For grammatical reasons, documents are going to use different forms of a word, such as “organize”, “organizes”, and “organizing.” Additionally, there are families of derivationally related words with similar meanings, such as “democracy”, “democratic”, and “democratization.” In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

* am, are, is will change to be

* car, cars, car’s, cars’ will change to car

The result of this mapping of text will be something like: “the boy’s cars are different colors” will be changed to “the boy car be differ color”.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source. It is worth noting that most of the software packages focus on stemming but not lemmatization. It is also worth noting that the decision about applying lemmatization to the text data should be based on the case.



## 4. NLP in Python

As mentioned above, there are four main pre-processing steps before mining text. Using a real data set about customer service on Twitter, we learn about pre-processing text using Python package spaCy. This package is one of the best packages for NLP tasks and offers a wide range of capabilities. More information about spaCy can be found here: https://spacy.io  

**For this class, we will work on a data set that was collected from Twitter. This data set contains tweets posted by Twitter users about the telecommunications firm "Sprint". "Sprint" is one of the firms that has a dedicated team of customer service representatives trained to address customer queries on Twitter.** 


Given the proliferation of Web 2.0 technology, customers are able to communicate with firms through new channels of communication such as social media websites. Almost two-thirds of customers have used a company's social media site to receive service. Such digitally provisioned service is attractive to customers due to its fast response, and is being increasingly preferred to more traditional service channels such as phone or email. Firms too have an incentive to embrace service provision through social media; the average cost of a service-related response in Twitter is only **one dollar**, while the average cost of interacting with a customer through a traditional call center can be close to **six dollars**. Furthermore, firms that use Twitter as a social care channel are seeing a large 19\% increase in customer satisfaction. Given that digital customer care has mutual benefits for firms and customers, it is not surprising to observe a 250\% increase in customer service interactions over Twitter during the past few years

Given the size of the actual data, we only work with the first 5K tweets that were posted about sprint. The file name is tweets_about_sprint.csv. We begin with installing/ importing Python packages and reading the CSV file:

### 4.1. Text Pre-processing

In [56]:
!pip install spacy # install spaCy
!pip install tqdm # install tqdm package to display the progress
!pip install textblob # for dictionary-based sentment analysis
!pip install nrclex # for dictionary-based emotions analysis

Collecting nrclex
  Downloading NRCLex-3.0.0.tar.gz (396 kB)
[K     |████████████████████████████████| 396 kB 2.9 MB/s eta 0:00:01
Building wheels for collected packages: nrclex
  Building wheel for nrclex (setup.py) ... [?25ldone
[?25h  Created wheel for nrclex: filename=NRCLex-3.0.0-py3-none-any.whl size=43310 sha256=4609d5690b5839844b51218e8ea24965032bb601f7fac3c15600cfeb4637fb4d
  Stored in directory: /Users/reza/Library/Caches/pip/wheels/83/95/c0/42b43fb15eb48e4f5a67cba8915540cb2783591c59c037a9e5
Successfully built nrclex
Installing collected packages: nrclex
Successfully installed nrclex-3.0.0


In [21]:
# import and load spaCy:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from spacy import displacy
spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm') 

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [43]:
# Import other packages:

import os 
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from textblob import TextBlob
from nrclex import NRCLex

%matplotlib inline

In [47]:
df = pd.read_csv('data/tweets_about_sprint.csv').sample(n = 5000, random_state=2).reset_index(drop = True)

df.head()

Unnamed: 0,user,tweet,date
0,PortAGroom,Retweeted Sprint sprint The on the left has 1 ...,6/27/16 6:54
1,androidNewsIn,T Mobile AT amp T Sprint and Verizon offer fre...,6/29/16 9:21
2,FaithLunden,RT SprintPalmer Can you say almost there Keep ...,7/2/16 18:51
3,jasleicol,RT t marieeeeee sprint how about LifeOfDesiign...,6/20/16 20:42
4,CraigpParrish,RT MW55 Our NASCARONFOX NASCAR XFINITY season ...,6/20/16 16:00


We can use spaCy's powerful tokenizer to parse our text. Let's take a look at some features of spaCy using an example sentence:

In [26]:
nlp_doc = nlp("Apple and Samsung are looking at #buy buying Pixar for 100 billion.")

for token in nlp_doc:
    print(token.text)

Apple
and
Samsung
are
looking
at
#
buy
buying
Pixar
for
100
billion
.


For each token in our text, we can get the lemmatized version of the token, check if it is alpha

In [34]:
for token in nlp_doc:
    print(token.text, # Returns the original form of the token
          token.lemma_, # Returns the lemmatized version of the token
          token.is_alpha, # Checks if the token consists of alphabetic characters.
          token.is_stop # Checks if the token is a stop-word
         )

Apple Apple True False
and and True True
Samsung Samsung True False
are be True True
looking look True False
at at True True
# # False False
buy buy True False
buying buy True False
Pixar Pixar True False
for for True True
100 100 False False
billion billion True False
. . False False


In [28]:
displacy.serve(nlp_doc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


**Important Note:** After you run the cell above, please select the cell again and click on interrupt the kernel icon from the notbook's menu bar. 

In [30]:
for ent in nlp_doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
Samsung 10 17 ORG
Pixar 45 50 ORG
100 billion 55 66 CARDINAL


In [31]:
displacy.serve(nlp_doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


**Important Note:** After you run the cell above, please select the cell again and click on interrupt the kernel icon from the notbook's menu bar. 

Package spaCy has the functionality to remove stop-words from our text. We can even customize the list of stop-words we want to remove from the text. For instance, let's add a couple of words to the list of stop-words to be removed from our data:

In [32]:
# Add new stop words: 
customize_stop_words = [
    'attach','startup'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

Now, we can apply the pre-processing steps to our text data:

In [48]:
tqdm.pandas() # To display the progress
df['pr_tweet'] = df.tweet.progress_apply(lambda text: 
                                          " ".join(token.lemma_ for token in nlp(text) 
                                                   if not token.is_stop and token.is_alpha))

  from pandas import Panel
100%|██████████| 5000/5000 [00:22<00:00, 221.00it/s]


In [49]:
df.head(2)

Unnamed: 0,user,tweet,date,pr_tweet
0,PortAGroom,Retweeted Sprint sprint The on the left has 1 ...,6/27/16 6:54,retweete Sprint sprint left foam mean s worth ...
1,androidNewsIn,T Mobile AT amp T Sprint and Verizon offer fre...,6/29/16 9:21,T Mobile amp T Sprint Verizon offer free call ...


### 4.2. Dictionary-based NLP

Now that the text data is pre-processed, we can analyze them, or we can use them to build machine learning models. There are certain things that we can do with text without even training a model. For instance, we can find the list of positive and negative words in our text data. Imagine that there is a list of positive words such as “happy”, “glad”, “nice”, etc. we can search our text data to find aby of these words. If we find many positive words in our text data, then perhaps our text carries a positive sentiment. On the other hand, if we find lots of negative words in our text document, then that document carries a negative sentiment. The absence of either positive words and negative words in our text could mean that we have a neutral text document. Therefore, to be able to determine the sentiment of a document, we need to have a list of words commonly associated with positive emotions, and a list of words commonly with negative words; something like a dictionary that has words as the keys of the dictionary and positive/ negative as the values for the keys. For example:

**Sentiment_Dictionary = {‘happy’ : ’positive’, ‘glad’ : ’positive’, ‘sad’ : ‘negative’, ‘angry’ : ‘negative’}** 

This dictionary could contain a comprehensive list of words associated with positive/ negative words. We can then use such a dictionary to determine the sentiment in each text document. This is an unsupervised method as we don’t need to train any model to determine the sentiment of each text document. We just simply find positive/ negative words in our text and use a formula to calculate a sentiment score. The presence of adjectives in a text could also indicate that they are subjective (as opposed to objective) text. We can therefore calculate the subjectivity of the text documents as well.  

Python package “TextBlob” offers such functionality. Let’s use it to calculate the sentiment score for each tweet in our example data:  


In [54]:
%%time
from textblob import TextBlob

df['subj'] = df.pr_tweet.progress_apply(lambda x: TextBlob(str(x)).sentiment.subjectivity)
df['sent'] = df.pr_tweet.progress_apply(lambda x: TextBlob(str(x)).sentiment.polarity)

100%|██████████| 5000/5000 [00:00<00:00, 6895.49it/s]
100%|██████████| 5000/5000 [00:00<00:00, 7004.43it/s]

CPU times: user 1.42 s, sys: 22.6 ms, total: 1.44 s
Wall time: 1.44 s





In [55]:
df.head()

Unnamed: 0,user,tweet,date,pr_tweet,subj,sent
0,PortAGroom,Retweeted Sprint sprint The on the left has 1 ...,6/27/16 6:54,retweete Sprint sprint left foam mean s worth ...,0.2625,-0.004167
1,androidNewsIn,T Mobile AT amp T Sprint and Verizon offer fre...,6/29/16 9:21,T Mobile amp T Sprint Verizon offer free call ...,0.8,0.4
2,FaithLunden,RT SprintPalmer Can you say almost there Keep ...,7/2/16 18:51,RT SprintPalmer share free speaker follower sp...,0.8,0.4
3,jasleicol,RT t marieeeeee sprint how about LifeOfDesiign...,6/20/16 20:42,RT t marieeeeee sprint LifeOfDesiigner get cal...,0.0,0.0
4,CraigpParrish,RT MW55 Our NASCARONFOX NASCAR XFINITY season ...,6/20/16 16:00,RT NASCARONFOX NASCAR XFINITY season fun year ...,0.2,0.3


If we are interested in learning to what extent our text documents contain specific emotions such as “fear”, “trust, “anger”, etc., we can use another Python package called “NRCLex”: 

In [57]:
%%time
from nrclex import NRCLex

df['fear'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['fear'])
df['anger'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['anger'])
df['anticip'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['anticip'])
df['trust'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['trust'])
df['surprise'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['surprise'])
df['positive'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['positive'])
df['negative'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['negative'])
df['sadness'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['sadness'])
df['disgust'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['disgust'])
df['joy'] = df.pr_tweet.progress_apply(lambda x: NRCLex(str(x)).affect_frequencies['joy'])

100%|██████████| 5000/5000 [00:01<00:00, 4691.55it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4820.18it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4859.64it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4883.45it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4917.68it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4793.13it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4941.49it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4938.42it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4946.94it/s]
100%|██████████| 5000/5000 [00:01<00:00, 4912.33it/s]

CPU times: user 10.2 s, sys: 115 ms, total: 10.3 s
Wall time: 10.3 s





In [58]:
df.head()

Unnamed: 0,user,tweet,date,pr_tweet,subj,sent,fear,anger,anticip,trust,surprise,positive,negative,sadness,disgust,joy
0,PortAGroom,Retweeted Sprint sprint The on the left has 1 ...,6/27/16 6:54,retweete Sprint sprint left foam mean s worth ...,0.2625,-0.004167,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,androidNewsIn,T Mobile AT amp T Sprint and Verizon offer fre...,6/29/16 9:21,T Mobile amp T Sprint Verizon offer free call ...,0.8,0.4,0.2,0.2,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0
2,FaithLunden,RT SprintPalmer Can you say almost there Keep ...,7/2/16 18:51,RT SprintPalmer share free speaker follower sp...,0.8,0.4,0.0,0.0,0.0,0.4,0.0,0.2,0.0,0.0,0.0,0.2
3,jasleicol,RT t marieeeeee sprint how about LifeOfDesiign...,6/20/16 20:42,RT t marieeeeee sprint LifeOfDesiigner get cal...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CraigpParrish,RT MW55 Our NASCARONFOX NASCAR XFINITY season ...,6/20/16 16:00,RT NASCARONFOX NASCAR XFINITY season fun year ...,0.2,0.3,0.0,0.0,0.0,0.125,0.125,0.25,0.0,0.0,0.0,0.25
