## Week 8 - Live Session

# Natural Language Processing (NLP)
![Python logo](https://www.python.org/static/img/python-logo@2x.png "Python logo")

## DSS 615 Python Programming
## John Michl


# Topics for Tonight

* Retrieving Text from Static Website
* Beautiful Soup
* Using Newspaper3K to handle text cleanup
* Several Web Examples
* Sentiment Analysis with TextBlob
* Final assignments

# Focused Reading from Chapter 12

* **Required:** 12.1 Introduction
* **Required:** 12.2 `TextBlob` (Sections 12.2.1 through 12.2.5 Sentiment Analysis)
* **Review:** 12.2.6 through 12.2.14
* **Suggested but not required:** 12.3 `WordCloud`
* **Suggested but not required:** 12.4 Readability Assessment with `Textatistic`
* **Optional:** 12.5-12.9 Advanced NLP

Note: Slight changes from Canvas -- less required than what is posted


# Module 8 - High Level Objectives

Be able to:
* Download text from (some) web pages and prep for text analysis.
* Clean up the text with Beautiful Soup, if possible.
* Learn to use a library like `newspaper3k` (its `Article` class) to extract articles from most news sites and blogs, including key meta-data.
* Practice manipulating speech transcript data from Rev.com.
* Perform sentiment analysis and plot sentence level subjectivity and polarity data with `matplotlib` and `plotly express`

# Warning: Wrangling Text from Web-pages is Hard!

* Each web-site stores data differently, so you need to be a sleuth.
* Many modern sites load their text dynamically (for example, with JavaScript) rather than storing it in the page's HTML.
* Static web pages are hard to find.
* You could spend a semester just on retrieving data from web-pages or other APIs.
* Many web-pages have restrictions on what you can retrieve. (See robots.txt before making heavy use of a web-page.)
* Most book examples will use a static, locally stored text file as input.
* Some newer tools (e.g. `Article`) can make it "easier" to retrieve properly formatted pages.

*Suggestion: Read, practice, read, practice, read and practice, some more.*
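
The robots.txt check mentioned above can be scripted with the standard library's `urllib.robotparser`. This is only a minimal sketch: the rules and the `example.com` URLs are made up so the code runs offline. For a real site you would call `set_url(...)` and `read()` instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules (an assumption for illustration, not from a real site)
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)   # for a live site: set_url() with the robots.txt address, then read()

# can_fetch(user_agent, url) reports whether the rules allow that agent to retrieve the URL
print(parser.can_fetch("*", "https://example.com/speeches/moon.htm"))
print(parser.can_fetch("*", "https://example.com/private/notes.htm"))
```

A `False` result is a signal to leave that path alone, or at least to read the site's terms before retrieving it.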

## Install `BeautifulSoup` from Command or Terminal Prompt (continued)

* Reliable text scraping
* **Via pip:** `pip install beautifulsoup4`
* Already installed with Anaconda
* https://pypi.org/project/beautifulsoup4/

https://www.crummy.com/software/BeautifulSoup/

In [1]:
! pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/69/bf/f0f194d3379d3f3347478bd267f754fc68c11cbf2fe302a6ab69447b1417/beautifulsoup4-4.10.0-py3-none-any.whl (97kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Using cached https://files.pythonhosted.org/packages/36/69/d82d04022f02733bf9a72bc3b96332d360c0c5307096d76f6bb7489f7e57/soupsieve-2.2.1-py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.10.0 soupsieve-2.2.1


You are using pip version 10.0.1, however version 21.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


## Install `Newspaper3k` from Command or Terminal Prompt (continued)

* Reliable text scraping
* **Via pip:** `pip3 install newspaper3k`
* For our purposes, shouldn't need to install corpora
* https://github.com/codelucas/newspaper


https://newspaper.readthedocs.io/en/latest/

In [2]:
! pip3 install newspaper3k

Collecting newspaper3k
  Downloading https://files.pythonhosted.org/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl (211kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading https://files.pythonhosted.org/packages/35/82/1251fefec3bb4b03fd966c7e7f7a41c9fc2bb00d823a34c13f847fd61406/feedfinder2-0.0.4.tar.gz
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading https://files.pythonhosted.org/packages/d8/b2/15bf6781a861bbc5dd801d467f26448fb322bfedcd30f2e62b148d104dfb/feedparser-6.0.8-py3-none-any.whl (81kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading https://files.pythonhosted.org/packages/c5/1e/58ad28cd1c6be6d37ec86b3f4790770f1cc49050d479fe773573a4704bc0/tldextract-3.1.2-py2.py3-none-any.

  The script normalizer.exe is installed in 'c:\users\timfs\appdata\local\programs\python\python37\Scripts' which is not on PATH.
  The script tldextract.exe is installed in 'c:\users\timfs\appdata\local\programs\python\python37\Scripts' which is not on PATH.
  The script tqdm.exe is installed in 'c:\users\timfs\appdata\local\programs\python\python37\Scripts' which is not on PATH.
  The script nltk.exe is installed in 'c:\users\timfs\appdata\local\programs\python\python37\Scripts' which is not on PATH.
You are using pip version 10.0.1, however version 21.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


## Install TextBlob Module from Command or Terminal Prompt (not Jupyter Notebook)

**Via pip and python:**

* `pip install -U textblob`
* `python -m textblob.download_corpora lite`

**Via conda:**

* `conda install -c conda-forge textblob`
* `python -m textblob.download_corpora lite`

**Notes:**
* Windows users may need to run as administrator
* This installs the basic corpora, which are sufficient for this class. Remove the `lite` argument to install all of the corpora.

https://textblob.readthedocs.io/en/dev/install.html

In [3]:
! python -m textblob.download_corpora lite

C:\Users\timfs\AppData\Local\Programs\Python\Python37\python.exe: Error while finding module specification for 'textblob.download_corpora' (ModuleNotFoundError: No module named 'textblob')


## Install `requests` Library if not already installed with Anaconda
**What is it?**
* De facto standard third-party library for making HTTP requests in Python (it is not part of Python's standard library)
* Typically the first thing we'll do in order to get something to process

**Via pip:**
* `pip install requests`

**To check if it is installed already:**
* `import requests`  # in notebook or python script
* `pip list`  # from terminal to see installed packages
* `conda list`  # packages installed in conda environment

Tutorial on `requests`: https://realpython.com/python-requests/
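
To check several packages at once from a notebook, one option is `importlib.util.find_spec`, which looks a module up without importing it. This is a minimal sketch; `is_installed` is just an illustrative helper name. Note that the import name can differ from the pip name (`beautifulsoup4` is imported as `bs4`, `newspaper3k` as `newspaper`).

```python
from importlib.util import find_spec

def is_installed(module_name):
    """Return True if a top-level module with this import name can be found."""
    return find_spec(module_name) is not None

# Import names used in this notebook (not the pip package names)
for name in ["requests", "bs4", "newspaper", "textblob"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name:<10} {status}")
```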

# `requests` Library -- The Basics

GET request -- issues a request to retrieve data from the web-site
* `requests.get('https://thewebsite.com')`

The response is the object created by a GET request. To work with it, we assign it to a variable, frequently called `response`. After that, we only need to access the object's attributes and methods.
* `response = requests.get('https://thewebsite.com')`

Attributes of the response object allow us to look at the content or payload.
* `response.content`........ content gives access to the raw bytes
* `response.text`........... converts content to a string, guessing the encoding if none is set
* `response.encoding = 'utf-8'`.... sets the encoding scheme explicitly, if needed, before accessing text
* `response.json()`......... deserializes JSON response content to a Python object (typically a dictionary or list)
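
To see why the encoding matters when going from raw bytes to a string, here is a small sketch that does not touch the network. The byte string is made up and stands in for `response.content`; decoding it plays the role of `response.text`.

```python
# Stand-in for response.content: raw bytes as they might arrive from a server
content = "Café in Zürich".encode("utf-8")

# A wrong encoding guess garbles the accented characters
wrong = content.decode("latin-1")

# The right encoding, comparable to setting response.encoding = 'utf-8' before reading response.text
right = content.decode("utf-8")

print(len(content), "bytes")
print("wrong guess:", wrong)
print("utf-8:      ", right)
```

If the text from a page shows stray accented characters like the "wrong guess" line, setting `response.encoding` explicitly is the first thing to try.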

In [4]:
import requests        # import from web
from bs4 import BeautifulSoup      # clean up text
# from wordcloud import WordCloud    # create word clouds
from textblob import TextBlob      # basic NLP, install first
# from textatistic import Textatistic   # readability, install first
from newspaper import Article

from pathlib import Path    # for quick import of text file for NLP

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly import express as px

# Magics
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

ModuleNotFoundError: No module named 'wordcloud'

## Example 1: Extraction Blocked or Not Allowed
https://www.americanrhetoric.com/speeches/mlkihaveadream.htm

In [None]:
## 403 Forbidden Error, extract blocked / not allowed
url = 'https://www.americanrhetoric.com/speeches/mlkihaveadream.htm'

response = requests.get(url)   # retrieve the webpage
response.content               # show content from the retrieved page
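
The cell above displays whatever body came back, but the block itself shows up in `response.status_code`. Below is a minimal sketch of checking the status before processing the text. `SimpleNamespace` is only a stand-in for a real `requests` response so the example runs without a network call, and `describe_status` is a hypothetical helper name. With a real response, `response.raise_for_status()` is another option; it raises an exception on 4xx and 5xx codes.

```python
from http import HTTPStatus
from types import SimpleNamespace

def describe_status(response):
    """Return a short, readable status line for any object with a status_code attribute."""
    code = response.status_code
    phrase = HTTPStatus(code).phrase          # standard reason phrase, e.g. 'Forbidden'
    outcome = "ok" if 200 <= code < 300 else "blocked or failed"
    return f"{code} {phrase}: {outcome}"

# Stand-ins for requests.get(url); a real response exposes the same attribute
print(describe_status(SimpleNamespace(status_code=200)))
print(describe_status(SimpleNamespace(status_code=403)))
```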

## Example 2: Static, Predominately Text-based Web-Page
https://er.jsc.nasa.gov/seh/ricetalk.htm

In [None]:
## Static web page - JFK speech re: moon
url = 'https://er.jsc.nasa.gov/seh/ricetalk.htm'

response = requests.get(url)
response.content       # Notice moderate amount of HTML code

In [None]:
# response.encoding = 'utf-8'
response.text

In [None]:
# If the response body is JSON, response.json() deserializes it.
# This page is HTML, so expect this call to raise a decoding error.
response.json()

In [None]:
# use BeautifulSoup to clean up the response content
# the 'html5lib' parser is a separate install (pip install html5lib); 'html.parser' is built in

soup = BeautifulSoup(response.content, 'html5lib')
text = soup.get_text(strip=True)  # text without tags
text       # BeautifulSoup has done a decent job of removing the HTML on this page
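
`get_text(strip=True)` joins the pieces of text with nothing between them and keeps the contents of `<script>` and `<style>` tags. The sketch below shows one common way to tidy that up. The HTML string is made up to stand in for `response.content`, and `'html.parser'` is used only because it needs no extra install.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for response.content
html = """
<html><head><title>Moon Speech</title><style>p { color: black; }</style></head>
<body><h1>Address at Rice University</h1>
<p>We choose to go to the moon.</p>
<script>var tracker = 1;</script></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop script and style elements so their contents don't end up in the text
for tag in soup(["script", "style"]):
    tag.decompose()

# separator=" " keeps words from adjacent tags from running together
text = soup.get_text(separator=" ", strip=True)
print(text)
```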

## Example 3: Sometimes it is easier to copy and paste to a file
https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/12/remarks-by-president-biden-on-the-american-rescue-plan-2/

In [None]:
## Somewhat hidden text
url = 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/12/remarks-by-president-biden-on-the-american-rescue-plan-2/'
response = requests.get(url)
response.content            # UGLY! significant amount of code -- Where's the text??

In [None]:
## Soup doesn't help much in this case
soup = BeautifulSoup(response.content, 'html5lib')
text = soup.get_text(strip=True)  
text

## Let's try `Article`

Prerequisite: install `newspaper3k` via `pip` (only do this once per machine).

Steps:

1. Import `Article` from `newspaper` (once per notebook)
2. Create an article object and set it to the URL of the web-page (required once per web-page)
3. Download (required after creating article object)
4. Parse the downloaded object (required once per download, separate data into text, authors, title, date, etc.)

Now you are ready for other tasks (view text, check authors and publication date; perform NLP tasks).

https://newspaper.readthedocs.io/en/latest/

### Repeat previous example with Article: White House Briefing Web-page

In [None]:
from newspaper import Article   # Must be installed first

In [None]:
url = 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/12/remarks-by-president-biden-on-the-american-rescue-plan-2/'

#create the article object
article = Article(url) # required

# download and parse the data  #required
article.download()    # required
article.parse()     # required

# retrieve the text from the parsed article object
text = article.text   # no output here, see next cells

In [None]:
# Accessing the object shows unformatted data, notice the \n sequences
# This is fine for verifying we got the text but a little hard to actually read.
text

In [None]:
# If we want to read the output rather than process it, use a print statement
print(text)

In [None]:
# Not every article or blog populates all of these attributes
print("Title: ", article.title)   # retrieves the title
print("Authors: ", article.authors)   # creates a list of authors; no authors on this page
print("Publication Date: ", article.publish_date)  # no publish date on this web-page
print("First Image:", article.top_image)  # retrieves the top image on the page
print("Video Links:", article.movies)  # creates a list of video links; none on this page

In [None]:
# the article nlp() method processes the text for NLP-type analysis
# It creates the keywords and summary attributes

article.nlp()    # note the parentheses indicating this is a method

print("KeyWords: ", article.keywords)   # list of keywords extracted from the text
print()
print("Summary: ", article.summary)  # short auto-generated summary of the article


# Try it! 
Try the `newspaper` `Article` code on your own link to a site that is likely to have all of the attributes.

In [None]:
# Change the url to your own, comment out all urls but one, note may not show all of article if behind pay wall
url = 'https://hbr.org/2020/04/bringing-an-analytics-mindset-to-the-pandemic'
# url = 'https://www.wsj.com/articles/ceos-increasingly-see-sustainability-as-path-to-profitability-11602535250'
# url = 'https://www.cnn.com/2020/10/13/health/us-coronavirus-tuesday/index.html'

In [None]:
article = Article(url)
article.download()
article.parse()

In [None]:
print("Title: ", article.title)  
print("Authors: ", article.authors)  
print("Publication Date: ", article.publish_date)  
print("First Image:", article.top_image)  
print("Video Links:", article.movies)  

In [None]:
print("Title: ", article.title)
print()
print(article.text)

In [None]:
article.nlp()

print("KeyWords: ", article.keywords)   # list of keywords extracted from the text
print()
print("Summary: ", article.summary)  # short auto-generated summary of the article

# Processing a Transcript with Newspaper3k

* We can leverage `article` to retrieve text of transcribed speeches though we may need to process the data a bit to prepare it for analysis.
* Most transcripts include speaker names, time stamps and other information.
* Each one could require some specific code


# Speech Transcript

* These examples are specifically for the transcript site https://www.rev.com/blog/transcripts
* Modifications likely for other speech sources
* In fact, some more recent transcriptions follow a different format than last semester

In [None]:
# Set the url
url = 'https://www.rev.com/blog/transcripts/ruth-bader-ginsburg-stanford-rathbun-lecture-transcript-2017'
event = '-RBGstandford-2017'   # will use in file names


# From rev.com - same speech as WH example with intro speakers
# url = 'https://www.rev.com/blog/transcripts/biden-harris-schumer-pelosi-rose-garden-speech-on-american-rescue-plan-transcript-march-12'
# event = '-rescueplan031221'   # this will be part of the file name for a text file we create 



In [None]:
# Minimum code needed to get to the text of the speech
article = Article(url)
article.download()
article.parse()
text = article.text   # saves to an object for later use

print(article.text) 

In [None]:
# write the text to a file (explicit UTF-8 so the result is the same on Windows, macOS, and Linux)

with open('speech.txt', 'w', encoding='utf-8') as f:
    f.write(text)      # text is a single string, so write() is enough

with open('speech.txt', 'r', encoding='utf-8') as f:
    for cnt, line in enumerate(f):
        print(f'Line {cnt}: {line}')

In [None]:
# Custom processing for rev site
# line 0 = speaker and time
# line 2 = what speaker said
# lines 1 and 3 = blanks
# create four lists of the components of speech

with open('speech.txt', 'r', encoding='utf-8') as f:
    speech = f.readlines()

tmp = []
speaker = []
time = []
words = []

for cnt, line in enumerate(speech):
    if cnt % 2 == 0:
        tmp.append(line.rstrip())   # temp list of just the text lines 0,2
        
for i in range(0,len(tmp),2):
    speaker.append(tmp[i].split(': ')[0])  #split speaker line into 2 parts
    time.append(tmp[i].split(': ')[1])
    words.append(tmp[i+1])    # words from speaker
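
The index arithmetic above depends on the lines alternating exactly as described. As a hedged alternative, the sketch below finds speaker lines with a regular expression, so extra or missing blank lines do not shift the results. The sample text is made up, the `(MM:SS)` timestamp style is an assumption, and `parse_transcript` is a hypothetical helper; adjust the pattern to whatever the page actually returns.

```python
import re

# Made-up sample in the layout described above: speaker/time line, blank, words, blank
sample = """Speaker One: (00:00)

Thank you all for coming.

Speaker Two: (00:12)

It is a pleasure to be here.
Let me begin with a story.

Speaker One: (01:05)

Please go ahead.
"""

# Speaker line: a name, ': ', then a timestamp such as (00:12) or (1:02:03)
HEADER = re.compile(r"^(?P<speaker>[^:\n]+): \(?(?P<time>\d{1,2}(?::\d{2}){1,2})\)?\s*$")

def parse_transcript(text):
    """Return a list of dicts with keys speaker, time, and words."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank separator lines
        match = HEADER.match(line)
        if match:
            records.append({"speaker": match["speaker"], "time": match["time"], "words": ""})
        elif records:
            # Any other line belongs to the most recent speaker
            current = records[-1]
            current["words"] = (current["words"] + " " + line).strip()
    return records

records = parse_transcript(sample)
print({r["speaker"] for r in records})    # unique speaker names
print(records[1])
```

A list of dicts like this also drops straight into `pd.DataFrame(records)` if you prefer to filter by speaker in pandas.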


In [None]:
speaker

In [None]:
# find unique speaker names for later filter

set(speaker)

In [None]:
spkr = 'Ruth Bader Ginsburg'
spkr_file = spkr.split()[-1] + event + '.txt'   # last name + event
spkr_file


In [None]:
# use write instead of writelines since we don't want entire list
# remember to add new line

with open(spkr_file, 'w', encoding='utf-8') as f:
    for name, said in zip(speaker, words):
        if name == spkr:      # only save lines associated with the wanted speaker
            f.write(said + '\n')

In [None]:
# Confirm good file
spkr_text = Path(spkr_file).read_text(encoding='utf-8')
spkr_text

# Basics of TextBlob

Be sure it is installed and imported. 

## Some Properties of the Blob

* `words`
* `sentences`
* parts-of-speech `tags`
* `noun_phrases`
* `sentiment`
    * `polarity`
    * `subjectivity`
    
**NOTE: Will need to install NLTK Data to analyze properties.** 
* NLTK should be installed with Anaconda http://www.nltk.org/install.html
* NLTK Data instructions are here: https://www.nltk.org/data.html

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

## Most modules of NLTK require specific databases or corpora
![image.png](attachment:image.png)

## Download common corpora 
* `python -m textblob.download_corpora lite`  # sufficient for our purposes
* `python -m textblob.download_corpora`    # for advanced modeling
![image.png](attachment:image.png)

In [None]:
# Create a TextBlob object, call it what you want; often it is called blob
# pass some text to it; typically that's an object called text, but it doesn't need to be

blob = TextBlob(spkr_text)

In [None]:
# return a WordList of all words in the blob (punctuation removed, duplicates kept)

blob.words

In [None]:
# return a list of sentence objects in the blob

blob.sentences

## Parts_of_Speech (a.k.a. `tags`)

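* The textbook describes `TextBlob` tagging with a `PatternTagger`, which uses the **pattern library** for POS tagging
* Recent `TextBlob` releases default to `NLTKTagger` instead, which is why the `averaged_perceptron_tagger` download is needed
* The textbook lists 63 parts-of-speech tags for pattern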
* Sample tags:
    * `NN` - a **singular noun** or **mass noun**
    * `NNS` - a **plural noun**
    * `NNP` - a **proper singular noun**
    * `VB` - a verb, base form
    * `VBZ` - a [**third person singular present verb**](https://www.grammar.cl/Present/Verbs_Third_Person.htm)
    * `DT` - a [**determiner**](https://en.wikipedia.org/wiki/Determiner) (the, an, that, this, my, their, etc.)
    * `JJ` - an **adjective**
    * `IN` - a **subordinating conjunction** or **preposition**
    * `PRP` - a **personal pronoun**
    * `CC` - a **coordinating conjunction**
    
The textbook link (https://www.clips.uantwerpen.be/pages/MBSP-tags) is out of date. Use these instead:
* https://github.com/clips/pattern
* https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

In [None]:
# returns a list of tuples where [0] is the word and [1] is its part-of-speech tag
blob.tags

In [None]:
# returns a WordList object containing a list of noun phrases, not perfect but automatic
blob.noun_phrases

## Basic Sentiment with TextBlob

* The `sentiment` property returns a namedtuple of the form `Sentiment(polarity, subjectivity)`
* **polarity** ranges from -1.0 (negative) to +1.0 (positive)
* **subjectivity** ranges from 0.0 (objective) to 1.0 (subjective)

https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis
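
Because the result is a namedtuple, its fields can be read by name, by position, or by unpacking. The sketch below uses a stand-in namedtuple with made-up values so it runs without TextBlob installed. `label_polarity` and its threshold are hypothetical choices for illustration, not part of TextBlob.

```python
from collections import namedtuple

# Stand-in with the same shape as the sentiment result; the values are made up
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
sentiment = Sentiment(polarity=0.35, subjectivity=0.6)

def label_polarity(polarity, threshold=0.1):
    """Hypothetical helper: bucket a polarity score; the threshold is an arbitrary choice."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Fields can be read by name or by position, and the tuple can be unpacked
polarity, subjectivity = sentiment
print(sentiment.polarity, sentiment[0], polarity)
print(label_polarity(polarity))
```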

In [None]:
# Overall sentiment -- note the output

blob.sentiment

In [None]:
# Access individual sentiment components by name
print('Polarity -1 +1,  Subjectivity 0 1')
print('=================================')
print(f'Polarity: \t{blob.polarity:.3f}')
print(f'Subjectivity:\t{blob.subjectivity:.3f}')

In [None]:
# Access individual sentiment components by index; access by name is easier to read
print('Polarity -1 +1,  Subjectivity 0 1')
print('=================================')
print(f'Polarity: \t{blob.sentiment[0]:.3f}')
print(f'Subjectivity:\t{blob.sentiment[1]:.3f}')

## Sentiment by Sentence

In [None]:
# Remember the Sentence object? Each sentence has sentiment, polarity, and subjectivity

for indx, sentence in enumerate(blob.sentences):
    print(f'{indx}:  {sentence}')    
    print(f'\tSentiment: {sentence.sentiment}')
    print(f'\tPolarity:\t{sentence.polarity:>6.2f}')
    print(f'\tSubjectivity:\t{sentence.subjectivity:>6.2f}')
    print('\n--------------------------------------------------------\n')


In [None]:
# Create sentiment dataframe
pd.set_option('display.max_colwidth', 400)    # set this to not truncate output

# one row per sentence: polarity, subjectivity, and the sentence text
rows = [(s.sentiment.polarity, s.sentiment.subjectivity, str(s)) for s in blob.sentences]
df_sent = pd.DataFrame(rows, columns=['polarity', 'subjectivity', 'text'])

df_sent = df_sent.sort_values('polarity', ascending=True)   # most negative first

df_sent.head(10)

In [None]:
# Some stats

df_sent.describe()

In [None]:
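# compare to the overall sentiment; it generally differs from the mean of the
# sentence scores above, so it is probably best not to use the mean as the overall score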
blob.sentiment

In [None]:
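# is there a correlation between polarity and subjectivity for this speech?
# select the numeric columns; recent pandas versions raise an error if the text column is included
df_sent[['polarity', 'subjectivity']].corr()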

In [None]:
# simple pandas plot
df_sent.plot.scatter('polarity','subjectivity')

In [None]:
# seaborn scatter plot (seaborn is built on matplotlib)

print(blob.sentiment)
sns.scatterplot(x='polarity', y='subjectivity', data=df_sent)

# control x and y limits
plt.ylim(0, 1)
plt.xlim(-1, 1)


In [None]:
# filter the blob by scores  (could also filter in the dataframe)
# in this example most negative (<-.8) AND most subjective (>.8)

for sentence in blob.sentences:
    if sentence.polarity < -.8 and sentence.subjectivity > .8:
        print(f'{sentence} (p:{sentence.polarity:.2f}, s:{sentence.subjectivity:.2f})')
        print()
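
The comment above notes that the same filter could be applied in the dataframe. Here is a minimal sketch of that approach with a boolean mask. `df_demo` holds made-up scores with the same column names as `df_sent`, so it runs on its own; swap in `df_sent` to use the real data.

```python
import pandas as pd

# Made-up scores standing in for df_sent (same column names)
df_demo = pd.DataFrame({
    "polarity":     [-0.9, -0.85, 0.4, -0.95],
    "subjectivity": [0.9, 0.5, 0.95, 1.0],
    "text":         ["sentence A", "sentence B", "sentence C", "sentence D"],
})

# Boolean masks combine with & (and); each condition needs its own parentheses
mask = (df_demo["polarity"] < -0.8) & (df_demo["subjectivity"] > 0.8)
most_negative_subjective = df_demo[mask]
print(most_negative_subjective)
```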

In [None]:
# Plot in Plotly to easily use hover box

import plotly.graph_objects as go
import plotly.express as px

In [None]:
fig = px.scatter(df_sent,
                 x = 'polarity' ,
                 y = 'subjectivity',
                 hover_data = ['text']
                )
fig.show()

# Bonus Material for Summer 2021

Not necessary for final assignments

## Install `Textatistic` from Command or Terminal Prompt (optional)


* Not required for our assignments but good for practice and examples
* **Via pip:** `pip install textatistic`
* Hasn't been updated in years though mentioned in our book
* https://pypi.org/project/textatistic/

*Note: Windows users may need to run as administrator. Some students have reported needing to install VS Code to get Textatistic to work.*

* Professor has been unable to install on newer computer though older install still works. 

In [None]:
! pip install textatistic

## Possible Alternatives to `Textatistic`

* `textstat` https://pypi.org/project/textstat/
* `readability` https://pypi.org/project/readability/
* `py-readability-metrics` https://github.com/cdimascio/py-readability-metrics
* Others on github https://github.com/topics/flesch-kincaid
* Create your own! https://www.geeksforgeeks.org/readability-index-pythonnlp/
* Background article: https://medium.com/analytics-vidhya/visualising-text-complexity-with-readability-formulas-c86474efc730

## Readability

* Flesch_score -- roughly 0 to 100 -- higher is easier to read
    * General target -- 60+
* Flesch Kincaid -- no fixed upper bound, approximates required grade level
    * General target -- 8
    * Technical writing -- 12-16
    * PhD level -- around 20-22
    * Doesn't mean how smart the reader is, rather, how hard they have to work at reading the piece
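
For reference, both scores come from two ratios: words per sentence and syllables per word. The sketch below applies the standard published formulas to counts you supply. The counts are made up, and how a given library counts sentences and syllables is not shown here, so expect small differences from tool to tool.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Made-up counts: 100 words, 5 sentences, 150 syllables
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(f"Flesch score: {ease:.1f}")
print(f"Grade level:  {grade:.1f}")
```

Longer sentences and longer words both push the reading-ease score down and the grade level up, which is why the two measures tend to move in opposite directions.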

In [None]:
# magic to set precision
%precision 3

In [None]:
# JFK Moon Speech -- Text Only as a Local File
from textatistic import Textatistic   # commented out in the import cell at the top; install first

moon = Path('JFK_moon.txt').read_text()
readability = Textatistic(moon)
readability.dict()

## Text of a Letter from the President to the Speaker of the House of Representatives and the President of the Senate

Source: https://www.whitehouse.gov/briefings-statements/text-letter-president-speaker-house-representatives-president-senate-80/

In [None]:
belarus = Path('Letter_on_Belarus.txt').read_text()
readability = Textatistic(belarus)
print(f'Flesch Score:  {readability.flesch_score:.2f}')
print(f'Grade level: {readability.fleschkincaid_score:.2f}')
print()
print(belarus)


## Install WordCloud Module from Command or Terminal Prompt (not Jupyter Notebook)

**Via pip:**
* `pip install wordcloud`

**Via conda:**
* `conda install -c conda-forge wordcloud`

**Notes:**
* Requires `numpy` and `pillow` which should already be installed
* If one method doesn't work, try the other

https://pypi.org/project/wordcloud/

## WordCloud
* Be sure you've installed `WordCloud`
* `imageio` should be installed with Anaconda ([docs](https://imageio.readthedocs.io/en/stable/installation.html))
* Info on Wordcloud https://pypi.org/project/wordcloud/


In [None]:
from wordcloud import WordCloud
import imageio
import numpy as np

text = spkr_text   # from previous work in this notebook, change to your source

mask_image = imageio.imread('mask_star.png')
cloud = WordCloud(width=600,height=600,background_color='white', 
                  mask=mask_image, max_words=50,
                 contour_width=3, contour_color='red')
cloud = cloud.generate(text)   # note this is a text object, not the dataframe
plt.imshow(cloud)