# Intro to Natural Language Processing
In this notebook:
* Gain an overview of the field of NLP
* Survey Python libraries for natural language processing
* Perform common text processing and analysis with the `textblob` module, including
    * Part-of-speech tagging
    * Sentiment analysis
    * Stemming and lemmatization
    * Etc.

Content used in this lesson is based upon information in the following sources:
* Intro to Python for Computer Science and Data Science by Deitel and Deitel

# MAKE A COPY OF THIS NOTEBOOK SO YOU CAN KEEP YOUR EDITS!

## Installing NLP Libraries (locally)
* textblob
    * `pip install textblob`
    * `python -m textblob.download_corpora`
* wordcloud
    * `pip install wordcloud`
* textatistic
    * `pip install textatistic`
* spaCy
    * `pip install spacy`
    * `python -m spacy download en_core_web_sm`

## Intro to Natural Language Processing
* Natural language processing (NLP) is the processing of a **text collection** (also called a corpus; the plural is corpora), such as
    * Tweets
    * Facebook posts
    * Conversations
    * Product/service reviews
    * Meeting logs
    * Etc.
* NLP is notoriously difficult because natural language lacks mathematical precision: a text's meaning can depend on context and on the reader's perspective.
* Fortunately, several Python NLP libraries and services offer built-in functionality and pretrained machine learning models we can use:
    * TextBlob
    * WordCloud
    * Textatistic
    * spaCy
    * Gensim
    * Google Cloud Natural Language API
    * Microsoft Linguistic Analysis API
    * etc.

## Basic NLP Tasks w/TextBlob
TextBlob is a text-processing library **built on** the NLTK (Natural Language Toolkit) and pattern NLP libraries. It simplifies many of their capabilities, including but not limited to:
1. Tokenization: splitting text into pieces called tokens, which are meaningful units, such as words and numbers
1. Parts-of-speech tagging: identifying each word's part of speech, such as noun, verb, adjective, etc.
1. Noun phrase extraction: locating groups of words that represent nouns, such as "red brick factory"
1. Sentiment analysis: determining whether text has positive, neutral, or negative sentiment
1. Inter-language translation and language detection: powered by [Google Translate](https://translate.google.com/) (deprecated in TextBlob since version 0.16.0)
1. Inflection: pluralizing and singularizing words
1. Spell checking and spelling correction
1. Stemming: reducing words to their stems by removing prefixes or suffixes. For example, the stem of "varieties" is "varieti"
1. Lemmatization: like stemming, but produces real words based on the original words' context. For example, the lemmatization of "varieties" is "variety"
1. Word frequencies: determining how often each word appears in a corpus
1. WordNet integration: WordNet is a database used to find word definitions, synonyms, and antonyms
1. Stop word elimination: removing common words, such as "a", "an", "the", "I", "we", "you", and more to analyze the important words in a corpus
1. n-grams: producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another

In [4]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('brown')
nltk.download('wordnet')
!pip install textatistic

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/fadyyoussef/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/fadyyoussef/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package brown to
[nltk_data]     /Users/fadyyoussef/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/fadyyoussef/nltk_data...


Defaulting to user installation because normal site-packages is not writeable


In [None]:
from textblob import TextBlob
# grabbing text from https://www.gonzaga.edu/catalogs/current/undergraduate/school-of-engineering-and-applied-science/
##### gu_cs_text = "The Computer Science program at Gonzaga prepares students for careers and graduate study in the practice and science of computing. The program is built on a broad and rigorous foundation of science, mathematics, software engineering, and advanced computer science topics."
gu_SEAS_text = 'The over-arching goal of the undergraduate programs in the School of Engineering and Applied Science (SEAS) at Gonzaga University is to provide an education that prepares the student with a baccalaureate degree to be a professional engineer or computer scientist. In addition, the programs provide a base both for graduate study and for lifelong learning in support of evolving career objectives, which include being informed, effective, and responsible participants in the profession and society. It is also an education that is designed to challenge the intellect of the student and help him/her learn the value and reward of analytical and logical thinking.'
blob = TextBlob(gu_SEAS_text)
print(blob.sentences)
print(blob.words)

### Parts of Speech Tagging

In [None]:
# Parts of speech tagging
print(blob.tags)

# tagset list with examples available here: https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b

In [None]:
# TASK 1: Display parts of speech tags for "My dog is cute"


### Noun Phrases

In [None]:
# noun phrases
print(blob.noun_phrases)

In [None]:
# TASK 2: Show the noun phrases for the sentence "The red brick factory is for sale"


### Sentiment Analysis
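
Polarity is a single number between -1.0 (negative) and 1.0 (positive). As a rough illustration of how a lexicon-based scorer can arrive at such a number, the sketch below averages per-word scores from a tiny lexicon. The lexicon, its values, and the name `toy_polarity` are all made up for this example, and the code is not meant to reproduce TextBlob's analyzer.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0]. The values are made up.
TOY_LEXICON = {"excellent": 1.0, "good": 0.7, "bad": -0.7, "terrible": -1.0}


def toy_polarity(text):
    """Average the polarity of the known words; return 0.0 (neutral) if none are known."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("The movie was excellent!"))
print(toy_polarity("The movie was not bad"))  # this sketch ignores the word "not"
```

Because the sketch ignores "not", it scores "not bad" exactly like "bad". Negation is one reason simple word averaging is unreliable.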

In [None]:
# sentiment analysis
for sentence in blob.sentences:
    print(sentence)
    print(sentence.sentiment)
    print()
# Polarity indicates sentiment:
#      -1.0 (negative), 1.0 (positive), 0.0 (neutral)
# and subjectivity:
#       0.0 (objective), 1.0 (subjective)

In [None]:
# TASK 3: Show sentiment analysis for "The food is not good", "The movie was not bad", and "The movie was excellent!"


### Inter-language Translation and Language Detection

In [None]:
# language detection/translation with textblob is now deprecated:
# https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=detect_language#textblob.blob.TextBlob.detect_language
# Deprecated since version 0.16.0: Use the official Google Translate API instead.
# link to learn more about Google Translate API: https://cloud.google.com/translate

### Inflection

In [None]:
# Inflections are different forms of the same words, such as singular and plural
# e.g. person and people
# And different verb tenses
# e.g. run and ran

# You may want to convert all inflected words into the same form for more accurate word frequencies

from textblob import Word

index = Word("index")
print(index.pluralize())

In [None]:
fish = Word("fish")
print(fish.pluralize())

cacti = Word("cacti")
print(cacti.singularize())

In [None]:
wordlist = blob.words
wordlist.pluralize()

In [None]:
# TASK 4: Singularize "children" and pluralize "focus"


### Spell Checking and Spelling Correction

In [None]:
# Spell checking
word = Word("theyr")
print(word.spellcheck())  # Returns list of tuples with possible correct words and their confidence level

In [None]:
# Spell correction
print(word.correct()) # Returns correctly spelled word that has the highest confidence

In [None]:
print(TextBlob("Ths sentense has missplled wrds.").correct())

In [None]:
# TASK 5: Correct the spelling in "I canot beleive I misspeled thees werds"


### Stemming and Lemmatization
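
A stemmer strips suffixes by rule, which is why a stem such as "varieti" need not be a real word. The sketch below implements only the plural-handling rules from step 1a of the Porter stemming algorithm, as a way to see where "varieti" can come from. The function name `porter_step_1a` is chosen for this example; a full stemmer applies several more steps, so its results can differ for other words.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ies -> i, e.g. ponies -> poni
    if word.endswith("ss"):
        return word       # ss -> ss, e.g. caress -> caress
    if word.endswith("s"):
        return word[:-1]  # s -> (nothing), e.g. cats -> cat
    return word


for w in ["varieties", "caresses", "cats", "caress"]:
    print(w, "->", porter_step_1a(w))
```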

In [None]:
# Normalization: preparing words for analysis
# e.g. convert all words to lowercase, convert to word roots, etc.
# e.g. "program", "programs", "programmer", "programming", "programmed", "programmes" -> "program"

# Stemming removes a prefix or suffix from a word leaving only a stem (which may or may not be a real word)
word = Word("varieties")
print(word.stem())

In [None]:
# Lemmatization is like stemming but factors in the word's part of speech and meaning, thus resulting in a real word
print(word.lemmatize())

In [None]:
# TASK 6: Stem and lemmatize "strawberries"


### Word Frequencies
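
A word frequency is simply a count of how often each token appears. As a library-free sketch of the idea, the code below lowercases a string, pulls out word tokens with a simple regular expression, and counts them with `collections.Counter`. The helper name `word_counts` and the regular expression are assumptions made for this example; TextBlob's own tokenizer is more sophisticated.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, extract word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = word_counts("One fish, two fish, red fish, blue fish.")
print(counts["fish"])         # number of times "fish" appears
print(counts.most_common(2))  # the two most frequent tokens
```

A `Counter` returns 0 for a word that never appears, so looking up a missing word does not raise an error.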

In [None]:
# Word frequencies
print(blob.word_counts["student"])

### Definitions, Synonyms, and Antonyms w/WordNet

In [None]:
# The WordNet database from Princeton University has word definitions, synonyms, and antonyms

# Definitions
word = Word("happy")
print(word.definitions)

In [None]:
#SKIP
# Synonyms
print(word.synsets)
# <word>.<part of speech>.<index number of corresponding meaning>

synonyms = set()
for synset in word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
print("synonyms:", synonyms)

In [None]:
#SKIP
# Antonyms
antonyms = set()
for lemma in word.synsets[0].lemmas():
    for antonym in lemma.antonyms():
        antonyms.add(antonym.name())
print("antonyms:", antonyms)

### N-Grams
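
An n-gram is a run of n consecutive items, so producing n-grams amounts to sliding a window of width n across a token list. The sketch below does this in plain Python. The function name `ngrams` and the sample sentence are chosen for this example; it mirrors the idea behind `blob.ngrams()`, not its implementation.

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "the quick brown fox jumps".split()
print(ngrams(words, 2))  # bigrams
print(ngrams(words))     # trigrams (n defaults to 3)
```

A list of k tokens yields k - n + 1 n-grams, and none at all when the list is shorter than n.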

In [None]:
# An n-gram is a sequence of n text items, such as letters in words or words in a sentence
# Used to identify letters or words that frequently appear adjacent to one another
# Helpful for predicting the next letter or word the user will type
# e.g. tab completion in an IDE or text suggestion in a messaging app
blob.ngrams(5) # default is 3-gram

In [None]:
# TASK 7: Produce n-grams of 3 words for "TextBlob is easy to use."


## Stop Words
A stop word is a common word (e.g. "a", "the", "you") that is often removed from text before analysis because it typically provides little useful information.
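
Stop word removal is a membership test against a set of common words. The sketch below uses a deliberately tiny stop list so that it runs without any downloads. `SAMPLE_STOPS` and `remove_stop_words` are hypothetical names for this example, and real stop lists are much longer.

```python
# A tiny, hypothetical stop list used only for illustration.
SAMPLE_STOPS = {"a", "an", "the", "is", "for", "of", "and", "to"}


def remove_stop_words(text, stops=SAMPLE_STOPS):
    """Lowercase the text, split on whitespace, and drop tokens found in the stop set."""
    return [w for w in text.lower().split() if w not in stops]


print(remove_stop_words("The red brick factory is for sale"))
```

Lowercasing first matters: "The" would not match "the" in the stop set otherwise.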

In [None]:
from nltk.corpus import stopwords

nltk.download("stopwords")
stops = stopwords.words("english")

In [None]:
# Remove stop words
blob_lower = blob.lower()
no_stops_SEAS_text = []
for word in blob_lower.words:
    if word not in stops:
        no_stops_SEAS_text.append(word)
print(no_stops_SEAS_text)

## Visualizing Word Frequencies
Word frequencies are typically visualized with bar charts and word clouds.

### Bar Charts

In [None]:
import matplotlib.pyplot as plt
from matplotlib import cm  # color map
import pandas as pd

# word frequencies demo
# clean first

seuss = 'One fish, Two fish, Red fish, Blue fish, \
Black fish, Blue fish, Old fish, New fish. \
This one has a little car. \
This one has a little star. \
Say! What a lot of fish there are. \
Yes. Some are red, and some are blue. \
Some are old and some are new. \
Some are sad, and some are glad, \
And some are very, very bad. \
Why are they sad and glad and bad? \
I do not know, go ask your dad. \
Some are thin, and some are fat. \
The fat one has a yellow hat. \
From there to here, \
From here to there, \
Funny things are everywhere. \
Here are some who like to run. \
They run for fun in the hot, hot sun. \
Oh me! Oh my! Oh me! oh my! \
What a lot of funny things go by. \
Some have two feet and some have four. \
Some have six feet and some have more. \
Where do they come from? I can\'t say. \
But I bet they have come a long, long way. \
we see them come, we see them go. \
Some are fast. Some are slow. \
Some are high. Some are low. \
Not one of them is like another. \
Don\'t ask us why, go ask your mother.'

seuss_text = TextBlob(seuss)
cleaned_seuss_text = []
for word in seuss_text.lower().words:
    # remove stop words and contractions
    if word not in stops and "'" not in word:
        # get roots of words
        word = word.singularize()
        word = word.lemmatize()
        cleaned_seuss_text.append(word)
print(cleaned_seuss_text)
cleaned_blob = TextBlob(" ".join(cleaned_seuss_text))
counts_ser = pd.Series(cleaned_blob.word_counts)
counts_ser = counts_ser.sort_values(ascending=False)

# apply figure settings before plotting so they take effect for this figure
plt.rcParams.update({"figure.figsize": (25, 5)})
plt.rcParams.update({"font.size": 14})

# viridis is one of matplotlib's color gradient styles
# plt.get_cmap works across matplotlib versions (cm.get_cmap was removed in 3.9)
viridis = plt.get_cmap("viridis", len(counts_ser))
plt.bar(counts_ser.index, counts_ser, color=viridis.colors[::-1])
plt.xticks(rotation=45, horizontalalignment="right")
plt.show()

### Word Clouds
This solution uses the `wordcloud` module, which is built on top of `matplotlib`. Words that appear more frequently in the text are drawn in a larger font size.
* Note: `wordcloud` removes stop words before generating the word cloud

In [None]:
from wordcloud import WordCloud
plt.figure()
wordcloud = WordCloud(colormap="prism", background_color="white")
wordcloud = wordcloud.generate(seuss)
# write to file
#wordcloud.to_file("wordcloud.png")
# plot with matplotlib
plt.imshow(wordcloud)
plt.show()

## Readability Assessment w/Textatistic
Text readability is affected by the vocabulary used, sentence structure, sentence length, topic, etc. Textatistic reports the following counts and readability scores:
* char_count: number of characters in the text
* word_count: number of words in the text
* sent_count: number of sentences in the text
* sybl_count: number of syllables in the text
* notdalechall_count: count of words that are not on the Dale-Chall list (list of words understood by 80% of 5th graders); the higher this number is relative to the total word count, the less readable the text is considered to be
* polysyblword_count: number of words with three or more syllables
* flesch_score: Flesch Reading Ease score, which can be mapped to a grade level (scores > 90 are readable by 5th graders; scores < 30 require a college degree)
* fleschkincaid_score: Flesch-Kincaid score, which corresponds to a specific grade level
* gunningfog_score: Gunning Fog index value, which corresponds to a specific grade level
* smog_score: Simple Measure of Gobbledygook (SMOG), which corresponds to the years of education required to understand text
* dalechall_score: Dale-Chall score, which can be mapped to grade levels from 4 and below to college graduate (grade 16) and above
    * Note: this score is considered to be most reliable for a broad range of text types
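
Several of these scores are simple functions of the counts listed above. As a sketch, the code below applies the standard published Flesch Reading Ease and Flesch-Kincaid grade-level formulas to a set of counts. The function names and the sample counts are hypothetical. Textatistic may count sentences and syllables differently from a hand count, so values computed this way need not match its output exactly.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: 100 words, 5 sentences, 150 syllables
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both formulas depend only on average sentence length (words per sentence) and average word length (syllables per word), which is why longer sentences and longer words make text score as harder to read.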

In [None]:
from textatistic import Textatistic

readability = Textatistic(gu_SEAS_text)
print('Summary of SEAS text:\n')
for stat, value in readability.dict().items():
    print(stat, ":", value)
print()
# Note, higher Flesch scores mean easier to read

# Calculate the average number of words per sentence, characters per word, and syllables per word
print("Average number of words per sentence:", readability.word_count / readability.sent_count)
print("Average characters per word:", readability.char_count / readability.word_count)
print("Average syllables per word:", readability.sybl_count / readability.word_count)

In [None]:
# TASK 8: Show the stats and readability scores for the Dr. Seuss text
