# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Intro to Natural Language Processing
What are our learning objectives for this lesson?
* Gain an overview of the field of NLP
* Install several Python libraries for natural language processing
* Perform common text processing and analyses with the textblob module, including
    * Parts of speech tagging
    * Sentiment analysis
    * Stemming and lemmatization
    * Etc.

Content used in this lesson is based upon information in the following sources:
* Intro to Python for Computer Science and Data Science by Deitel and Deitel

## Warm-up Task(s)
* TBD

## Today
* Announcements
    * Let's go over IQ9
    * IQ10 (last one!!) on U7/DA7 (no NLP) TBD
    * DA7 is due TBD. Questions?
    * Mid project check-ins due TBD (Cleaning, EDA, and at least 1 hypothesis test)
* Today
    * Mid project check-ins
* Next class (last one!!): course evals, project stuff, confusion matrices, NLP Yelp API demo, closing thoughts

## Install NLP Libraries
* There are some NLP libraries we are going to use that don't come with Anaconda, so we will need to install them separately
* Open Anaconda prompt and install the following modules:
    * textblob
        * `pip install textblob`
        * `python -m textblob.download_corpora`
    * wordcloud
        * `pip install wordcloud`
    * textatistic
        * `pip install textatistic`
    * spaCy
        * `pip install spacy`
        * `python -m spacy download en_core_web_sm`
        
Notes:
* On Windows you may need to right click on Anaconda prompt in the start menu and choose "More" -> "Run as administrator"
* On M1 Macs, `spacy` installs but, at the time of writing, does not run on the ARM architecture (for this reason I'll be using Google Colab or my older Intel-based Mac instead)
* If pip cannot find the package you can try conda, e.g.: `conda install -c conda-forge spacy`

## Intro to Natural Language Processing
* Natural language processing (NLP) is the processing of a text collection (AKA a corpus; the plural is corpora), such as
    * Tweets
    * Facebook posts
    * Conversations
    * Product/service reviews
    * Meeting logs
    * Etc. 
* NLP is notoriously difficult because all of the above examples lack mathematical precision. A text's meaning can be influenced by context and perspective.
* Thankfully, there are some really great Python NLP libraries with lots of built-in functionality and trained machine learning models we can use!!
    * TextBlob
    * WordCloud
    * Textatistic
    * spaCy
    * Gensim
    * Google Cloud Natural Language API
    * Microsoft Linguistic Analysis API
    * etc.

## Basic NLP Tasks w/TextBlob
TextBlob is an object-oriented NLP text-processing library built on the NLTK and pattern NLP libraries. It simplifies many of the capabilities of these libraries, including but not limited to:
1. Tokenization: splitting text into pieces called tokens, which are meaningful units, such as words and numbers
1. Parts-of-speech tagging: identifying each word's part of speech, such as noun, verb, adjective, etc.
1. Noun phrase extraction: locating groups of words that represent nouns, such as "red brick factory"
1. Sentiment analysis: determining whether text has positive, neutral, or negative sentiment
1. Inter-language translation and language detection: powered by [Google Translate](https://translate.google.com/) (deprecated in TextBlob since version 0.16.0 in favor of the official Google Translate API)
1. Inflection: pluralizing and singularizing words
1. Spell checking and spelling correction
1. Stemming: reducing words to their stems by removing prefixes or suffixes. For example, the stem of "varieties" is "varieti"
1. Lemmatization: like stemming, but produces real words based on the original words' context. For example, the lemmatization of "varieties" is "variety"
1. Word frequencies: determining how often each word appears in a corpus
1. WordNet integration: WordNet is a database used to find word definitions, synonyms, and antonyms
1. Stop word elimination: removing common words, such as "a", "an", "the", "I", "we", "you", and more to analyze the important words in a corpus
1. n-grams: producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another
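
To make tokenization and stop word elimination from the list above concrete, here is a minimal standard-library sketch. It is not how TextBlob implements these steps: the `tokenize` and `remove_stop_words` helpers and the tiny `STOP_WORDS` set are hypothetical names made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# real NLP libraries ship much longer lists.
STOP_WORDS = {"a", "an", "the", "i", "we", "you", "is", "of", "and", "to"}


def tokenize(text):
    """Split text into lowercase tokens (runs of letters, digits, or apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = tokenize("We love the red brick factory, and you will too!")
print(tokens)
print(remove_stop_words(tokens))
```

Lowercasing before the stop word check means "We" and "we" are treated as the same token, which is usually what you want when counting words.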

In [1]:
from textblob import TextBlob

In [2]:
# grabbing text from https://www.gonzaga.edu/catalogs/current/undergraduate/school-of-engineering-and-applied-science/computer-science
gu_cs_text = "The Computer Science program at Gonzaga prepares students for careers and graduate study in the practice and science of computing. The program is built on a broad and rigorous foundation of science, mathematics, software engineering, and advanced computer science topics."
blob = TextBlob(gu_cs_text)
print(blob.sentences)
print(blob.words)

[Sentence("The Computer Science program at Gonzaga prepares students for careers and graduate study in the practice and science of computing."), Sentence("The program is built on a broad and rigorous foundation of science, mathematics, software engineering, and advanced computer science topics.")]
['The', 'Computer', 'Science', 'program', 'at', 'Gonzaga', 'prepares', 'students', 'for', 'careers', 'and', 'graduate', 'study', 'in', 'the', 'practice', 'and', 'science', 'of', 'computing', 'The', 'program', 'is', 'built', 'on', 'a', 'broad', 'and', 'rigorous', 'foundation', 'of', 'science', 'mathematics', 'software', 'engineering', 'and', 'advanced', 'computer', 'science', 'topics']


### Parts of Speech Tagging

In [3]:
# POS tagging
print(blob.tags)
# tagset list with examples available here: https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b

[('The', 'DT'), ('Computer', 'NNP'), ('Science', 'NNP'), ('program', 'NN'), ('at', 'IN'), ('Gonzaga', 'NNP'), ('prepares', 'VBZ'), ('students', 'NNS'), ('for', 'IN'), ('careers', 'NNS'), ('and', 'CC'), ('graduate', 'NN'), ('study', 'NN'), ('in', 'IN'), ('the', 'DT'), ('practice', 'NN'), ('and', 'CC'), ('science', 'NN'), ('of', 'IN'), ('computing', 'VBG'), ('The', 'DT'), ('program', 'NN'), ('is', 'VBZ'), ('built', 'VBN'), ('on', 'IN'), ('a', 'DT'), ('broad', 'JJ'), ('and', 'CC'), ('rigorous', 'JJ'), ('foundation', 'NN'), ('of', 'IN'), ('science', 'NN'), ('mathematics', 'NNS'), ('software', 'NN'), ('engineering', 'NN'), ('and', 'CC'), ('advanced', 'VBD'), ('computer', 'NN'), ('science', 'NN'), ('topics', 'NNS')]


In [17]:
# task
# display parts of speech tags for "My dog is cute"

### Noun Phrases

In [5]:
# noun phrases
print(blob.noun_phrases)

['computer', 'science program', 'gonzaga', 'graduate study', 'rigorous foundation', 'software engineering', 'computer science topics']


In [11]:
# task
# show the noun phrases for the sentence "The red brick factory is for sale"

### Sentiment Analysis

In [7]:
# sentiment analysis
for sentence in blob.sentences:
    print(sentence)
    print(sentence.sentiment)
    print()
# polarity is the sentiment, in [-1.0 (negative), 1.0 (positive)]; 0.0 is neutral
# subjectivity in [0.0 (objective), 1.0 (subjective)]

The Computer Science program at Gonzaga prepares students for careers and graduate study in the practice and science of computing.
Sentiment(polarity=0.0, subjectivity=0.0)

The program is built on a broad and rigorous foundation of science, mathematics, software engineering, and advanced computer science topics.
Sentiment(polarity=0.23125, subjectivity=0.45625)
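
For intuition about where a polarity score can come from, here is a toy lexicon-based scorer. This is only a sketch and not TextBlob's algorithm: the `LEXICON` scores, the `NEGATIONS` set, and the `polarity` function are all made up for illustration.

```python
# Hypothetical polarity lexicon: word -> score in [-1.0, 1.0]. These values are made up.
LEXICON = {"good": 0.7, "excellent": 1.0, "bad": -0.7, "terrible": -1.0}
NEGATIONS = {"not", "never", "no"}


def polarity(text):
    """Average the scores of known words; a word right after a negation has its sign flipped."""
    scores = []
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            score = LEXICON[word]
            scores.append(-score if negate else score)
        negate = False  # a negation only affects the very next word
    return sum(scores) / len(scores) if scores else 0.0


for sentence in ["The food is not good", "The movie was not bad", "The movie was excellent!"]:
    print(sentence, polarity(sentence))
```

A sentence with no words in the lexicon scores 0.0 (neutral), which mirrors the first sentence in the output above.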



In [12]:
# task
# show sentiment analysis for "The food is not good", "The movie was not bad", "The movie was excellent!"
from textblob import Sentence

### Inter-language Translation and Language Detection

In [None]:
# language detection/translation with textblob is now deprecated:
# https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=detect_language#textblob.blob.TextBlob.detect_language
# Deprecated since version 0.16.0: Use the official Google Translate API instead.
# link to learn more about Google Translate API: https://cloud.google.com/translate

### Inflection

In [12]:
# inflection
# inflections are different forms of the same words, such as singular and plural
# e.g. person and people
# and different verb tenses
# e.g. run and ran

# often want to convert all inflected words into same form for more accurate word frequencies

from textblob import Word

index = Word("index")
print(index.pluralize())
fish = Word("fish")
print(fish.pluralize())
cacti = Word("cacti")
print(cacti.singularize())

wordlist = blob.words
wordlist.pluralize()

indices
fish
cactus


WordList(['Thes', 'Computers', 'Sciences', 'programs', 'ats', 'Gonzagas', 'preparess', 'studentss', 'fors', 'careerss', 'ands', 'graduates', 'studies', 'ins', 'thes', 'practices', 'ands', 'sciences', 'ofs', 'computings', 'Thes', 'programs', 'iss', 'builts', 'ons', 'some', 'broads', 'ands', 'rigorouss', 'foundations', 'ofs', 'sciences', 'mathematics', 'software', 'engineerings', 'ands', 'advanceds', 'computers', 'sciences', 'topicss'])
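
Results such as 'Thes' and 'ats' appear because pluralization rules are applied to every word, regardless of its part of speech. A minimal rule-based sketch shows the idea. TextBlob's actual rules are more extensive; `naive_pluralize` and `IRREGULAR_PLURALS` are hypothetical names used only here.

```python
# Hypothetical table of irregular plurals; a real library has far more entries.
IRREGULAR_PLURALS = {"person": "people", "child": "children", "cactus": "cacti"}


def naive_pluralize(word):
    """Pluralize a word with a few common English spelling rules."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"            # box -> boxes
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"      # study -> studies
    return word + "s"                 # dog -> dogs, but also the -> thes


for word in ["box", "study", "day", "person", "the"]:
    print(word, "->", naive_pluralize(word))
```

In practice, you would typically pluralize only the words tagged as nouns.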

In [13]:
# task
# singularize "children" and pluralize "focus"

### Spell Checking and Spelling Correction

In [14]:
# spell checking
word = Word("theyr")
print(word.spellcheck())

# spell correction
print(word.correct()) # returns correctly spelled word that has the highest confidence
print(TextBlob("Ths sentense has missplled wrds.").correct())

[('they', 0.5713042216741622), ('their', 0.42869577832583783)]
they
The sentence has misspelled words.
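
The confidence values above can be understood with a small sketch in the style of Peter Norvig's well-known spelling corrector: generate every string one edit away from the input, keep the ones that are known words, and rank them by how common they are. The word counts below are hypothetical stand-ins for a large corpus, and the function names are chosen to echo (not reproduce) the TextBlob methods.

```python
from collections import Counter

# Hypothetical word counts standing in for counts learned from a large corpus.
WORD_COUNTS = Counter({"the": 50, "they": 12, "their": 9, "words": 6, "sentence": 4})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string that is one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def spellcheck(word):
    """Return (candidate, confidence) pairs, most likely candidate first."""
    if word in WORD_COUNTS:
        return [(word, 1.0)]
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return [(word, 0.0)]
    total = sum(WORD_COUNTS[w] for w in candidates)
    ranked = sorted(candidates, key=lambda w: (-WORD_COUNTS[w], w))
    return [(w, WORD_COUNTS[w] / total) for w in ranked]


def correct(word):
    """Return the candidate with the highest confidence."""
    return spellcheck(word)[0][0]


print(spellcheck("theyr"))
print(correct("sentense"))
```

With these made-up counts, "theyr" has two known neighbors ("they" and "their"), and each confidence is that word's share of the combined count.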


In [14]:
# task
# correct the spelling in "I canot beleive I misspeled thees werds"

### Stemming and Lemmatization

In [16]:
# normalization: preparing words for analysis
# e.g. convert all words to lowercase, convert to word roots, etc.
# e.g. "program", "programs", "programmer", "programming", "programmed", "programmes" -> "program"

# stemming removes a prefix or suffix from a word leaving only a stem (which may or may not be a real word)
word = Word("varieties")
print(word.stem())

varieti


In [17]:
# lemmatization is like stemming, but it factors in the word's part of speech and meaning, so the result is a real word
print(word.lemmatize())

variety
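
The difference between the two outputs can be sketched in a few lines. The stemmer below applies only the plural-suffix rules from step 1a of the Porter stemming algorithm (a full stemmer has several more steps), and the lemmatizer is just a lookup in a hypothetical `LEMMAS` table standing in for a real dictionary such as WordNet.

```python
def stem_plural(word):
    """Apply the plural-suffix rules from step 1a of the Porter stemmer."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # varieties -> varieti
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


# Hypothetical lookup table; a real lemmatizer consults a dictionary and the part of speech.
LEMMAS = {"varieties": "variety", "ponies": "pony", "people": "person"}


def lemmatize(word):
    """Return the dictionary form of a word if known, else the word unchanged."""
    return LEMMAS.get(word, word)


for word in ["varieties", "ponies", "cats"]:
    print(word, "-> stem:", stem_plural(word), "| lemma:", lemmatize(word))
```

The stemmer chops suffixes without checking whether the result is a word, which is why it yields "varieti", while the lookup returns the real word "variety".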


In [15]:
# task
# stem and lemmatize "strawberries"

### Word Frequencies

In [19]:
# word frequencies 
print(blob.word_counts["science"])

4
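
The same count can be reproduced with only the standard library, which shows what a word-frequency table is doing: lowercase the text, split it into words, and tally them. This sketch uses a simple regular expression as the tokenizer, which is an assumption and not TextBlob's tokenizer.

```python
import re
from collections import Counter

gu_cs_text = (
    "The Computer Science program at Gonzaga prepares students for careers "
    "and graduate study in the practice and science of computing. "
    "The program is built on a broad and rigorous foundation of science, mathematics, "
    "software engineering, and advanced computer science topics."
)

# Lowercase first so that "Science" and "science" are counted together.
words = re.findall(r"[a-z]+", gu_cs_text.lower())
counts = Counter(words)

print(counts["science"])
print(counts.most_common(3))
```

`Counter.most_common(n)` is handy for quickly seeing which words dominate a corpus, and it is usually more informative after stop words are removed.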


### Definitions, Synonyms, and Antonyms w/WordNet

In [20]:
# definitions
# uses WordNet database from Princeton U
# has word definitions, synonyms, and antonyms
word = Word("happy")
print(word.definitions)

['enjoying or showing or marked by joy or pleasure', 'marked by good fortune', 'eagerly disposed to act or to be of service', 'well expressed and to the point']


In [21]:
# synonyms
print(word.synsets)
# each synset name has the form word.pos.nn: the word, its part of speech,
# and the index number of the corresponding meaning in the WordNet database

synonyms = set()
for synset in word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
print("synonyms:", synonyms)

[Synset('happy.a.01'), Synset('felicitous.s.02'), Synset('glad.s.02'), Synset('happy.s.04')]
synonyms: {'happy', 'felicitous', 'glad', 'well-chosen'}


In [22]:
# antonyms
antonyms = set()
for lemma in word.synsets[0].lemmas():
    for antonym in lemma.antonyms():
        antonyms.add(antonym.name())
print("antonyms:", antonyms)

antonyms: {'unhappy'}


### N-Grams

In [23]:
# n-grams
# an n-gram is a sequence of n text items, such as letters in words or words in a sentence
# used to identify letters or words that frequently appear adjacent to one another
# helpful for predicting the next letter or word the user will type
# e.g. tab completion in an IDE or text suggestion in a messaging app
blob.ngrams(5) # default is 3-gram

[WordList(['The', 'Computer', 'Science', 'program', 'at']),
 WordList(['Computer', 'Science', 'program', 'at', 'Gonzaga']),
 WordList(['Science', 'program', 'at', 'Gonzaga', 'prepares']),
 WordList(['program', 'at', 'Gonzaga', 'prepares', 'students']),
 WordList(['at', 'Gonzaga', 'prepares', 'students', 'for']),
 WordList(['Gonzaga', 'prepares', 'students', 'for', 'careers']),
 WordList(['prepares', 'students', 'for', 'careers', 'and']),
 WordList(['students', 'for', 'careers', 'and', 'graduate']),
 WordList(['for', 'careers', 'and', 'graduate', 'study']),
 WordList(['careers', 'and', 'graduate', 'study', 'in']),
 WordList(['and', 'graduate', 'study', 'in', 'the']),
 WordList(['graduate', 'study', 'in', 'the', 'practice']),
 WordList(['study', 'in', 'the', 'practice', 'and']),
 WordList(['in', 'the', 'practice', 'and', 'science']),
 WordList(['the', 'practice', 'and', 'science', 'of']),
 WordList(['practice', 'and', 'science', 'of', 'computing']),
 WordList(['and', 'science', 'of', 'co
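
Producing n-grams amounts to sliding a window of size n across the list of tokens. A minimal sketch, with a hypothetical `ngrams` helper and plain whitespace splitting as the assumed tokenizer:

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, in order."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


tokens = "the practice and science of computing".split()
for gram in ngrams(tokens, 5):
    print(gram)
```

A list of k tokens yields k - n + 1 n-grams, and none at all when the list is shorter than n.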

In [16]:
# task
# produce n-grams of 3 words for "TextBlob is easy to use."