<a href="https://colab.research.google.com/github/Bennetbecker02/BUS280-data/blob/main/BUS280_TextAnalysis_Simplified.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Analysis and Natural Language Processing - Simplified

## Python Demonstration

![Python logo](https://www.python.org/static/img/python-logo@2x.png "Python logo")

## BUS 280 - Applied Analytics
## John Michl


Stripped down code with limited tutorial


## Contents of this notebook
* Install Packages
* Import Dependencies
* Various Methods to Import Text
* Process into DataFrame (no reporting)
* Create a second DataFrame
* Combine DataFrames
* Produce simple charts


# Install Packages (if needed)


Install the packages needed for the processing you intend to do. Colab resets its environment after a period of inactivity, so you may need to rerun these cells if it has been a while since you last used this notebook.

In [1]:
# Install beautifulsoup4    >>for processing html pages
! pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Install newspaper3k    >>for processing blog-like pages
! pip3 install newspaper3k

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl (211 kB)
Collecting feedfinder2>=0.0.4
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1
  Downloading jieba3k-0.35.1.zip (7.4 MB)
  Preparing metadata (setup.py) ... done
Collecting tldextract>=2.0.1
  Downloading tldextract-3.4.0-py3-none-any.whl (93 kB)
Collecting cssselect>=0.9.2
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting feedparser>=5.2.1


In [3]:
# Install textatistic  >>for basic writing analysis
! pip install textatistic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textatistic
  Downloading textatistic-0.0.1.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Collecting pyhyphen>=2.0.5
  Downloading PyHyphen-4.0.3.tar.gz (40 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: textatistic, pyhyphen
  Building wheel for textatistic (setup.py) ... done
  Created wheel for textatistic: filename=textatistic-0.0.1-py3-none-any.whl size=29067 sha256=fb0ad772bd02074faac0dea9a51e0cb58d299158363b5c96edac434085cd7822
  Stored in directory: /root/.cache/pip/wheels/05/4c/0d/38aaa3756ce86b8ebc14b877f64ff81e441327e2bcfc293d61
  Building wheel for pyhyphen (setup.py) ... done
  Created wheel for pyhyphen: filename=PyHyphen-4.0.3-cp37-abi3-linux_x8

In [4]:
# textblob    >> for sentiment analysis
! pip install -U textblob
! python -m textblob.download_corpora lite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Finished.


# Import all dependencies and modules

If an import fails, the package probably hasn't been installed in this Google Colab environment. Read the error message to identify the missing package, then run the matching install cell above.

Not all libraries are required depending on the processing needed.

In [5]:
# web and file handling imports
import requests        # retrieve web pages
from pathlib import Path    # for quick import of text file for NLP

# text cleanup and parsing
from bs4 import BeautifulSoup       # clean up text
from newspaper import Article       # parsing and cleaning when blog format

# natural language processing and readability
from textblob import TextBlob       # basic NLP
from textatistic import Textatistic # readability
import nltk                         # natural language tool kit

# data management
import pandas as pd
import numpy as np                  # needed for pandas and wordcloud

# graphics related imports
import seaborn as sns
import matplotlib.pyplot as plt
from plotly import express as px
import plotly.graph_objects as go
from wordcloud import WordCloud    # create word clouds
import imageio                     # used for mask in word cloud


# Magics
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [6]:
# Data and corpora downloads needed for TextBlob

#! python -m textblob.download_corpora

# nltk additional helper files for some TextBlob attributes
nltk.download('punkt')                        # for tokenizing text into sentences
nltk.download('brown')                        # for identifying nouns
nltk.download('averaged_perceptron_tagger')   # for tagging parts of speech

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

<a name="retrieve-text"></a> 
# Retrieve Text - Various Methods
This section has code for retrieving text from the web (or a file). The method to use depends on the website. We'll start with Example 1 and then skip to the [TextBlob Code section](#textblob-code).

# Retrieve Text - Example 1: Plain Text File Stored on GitHub


GitHub is a great place to store files used for analysis.
* Log in to your GitHub account
* Navigate to the folder that contains your data
* Find the specific file and open it in GitHub
* Click the `raw` button to see the plain text
* Capture the URL 
* Use that URL for the remainder of your text processes
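The `raw` URL follows a predictable pattern, so the steps above can also be sketched in code. This is a minimal illustration, not part of the original notebook: the helper name `github_blob_to_raw` is hypothetical, the page URL is inferred from the raw URL used in this notebook, and the sketch assumes the branch name contains no `/`.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url: str) -> str:
    """Convert a github.com 'blob' page URL to its raw.githubusercontent.com form."""
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <user>/<repo>/blob/<branch>/<path to file...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {url}")
    user, repo, _, branch, *file_path = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch, *file_path])


# Assumed page URL for the file used in this notebook
page_url = "https://github.com/Bennetbecker02/BUS280-data/blob/main/CNN"
print(github_blob_to_raw(page_url))
```

Clicking the `raw` button remains the safest way to get the URL; the helper is only a convenience when you have many files.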

In [38]:

url = 'https://raw.githubusercontent.com/Bennetbecker02/BUS280-data/main/CNN'
subject = "CNN on Trump-Biden debate"


In [39]:
# get the file from the internet by passing the URL
result = requests.get(url)

# extract the text from the result object
text = result.text
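A failed download (a mistyped URL, for example) still returns a result object, and its `.text` will hold the error page rather than your data. The following is a hedged sketch of a slightly more defensive fetch; the function name `fetch_text` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a URL and return its body as text, raising on HTTP errors."""
    result = requests.get(url, timeout=timeout)
    result.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
    return result.text


# Example (requires network access):
# text = fetch_text('https://raw.githubusercontent.com/Bennetbecker02/BUS280-data/main/CNN')
```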

In [40]:
# View object to determine if more processing is needed
print(text)


President Donald Trump turned his first debate with Democratic rival Joe Biden into a chaotic disaster.

Trump bullied, bulldozed and obfuscated his way through the 90-minute showdown, interrupting Biden and moderator Chris Wallace of Fox News at every turn. He ignored substantive questions and Biden’s policy arguments, and instead swung at a straw-man version of Biden, taking aim at both Biden’s son and a distorted description of his record that exists primarily in far-right media.

05 0929 debate SPLIT
Fact checking Biden and Trump at the first presidential debate
Over Trump’s interruptions, Biden responded by mocking the President, calling him a “clown,” a “racist” and “the worst president America has ever had.” He criticized Trump’s handling of the coronavirus pandemic, his failure to produce a health care plan and his response to protests over racial injustice.



Over and over, Wallace tried to regain control of the debate, without success.

When Trump complained that only he wa

In [26]:
# IF NECESSARY, clean with Beautiful Soup
# (in this example the downloaded object is named `result`, not `response`)
soup = BeautifulSoup(result.content, 'html5lib')
text = soup.get_text(strip=True)  # text without tags
print(text)

# Retrieve Text - Example 2 -- Project Gutenberg


Some web pages contain **only** plain text. These are sometimes known as UTF-8 pages or files because they are encoded in the UTF-8 format. Check out [Project Gutenberg](https://www.gutenberg.org) for some examples. Retrieve the URL for a book you'd like to analyze.

In [None]:
## Tale of Two Cities
url = "https://www.gutenberg.org/files/98/98-0.txt"

In [None]:
# get the file from the internet by passing the URL
result = requests.get(url)

# extract the text from the result object
text = result.text

In [None]:
# Confirm that we have the text, then click x to remove the output

print(text)

In [None]:
# Find the first line of the book and the marker that follows its last line;
# everything outside those two positions should not be included.
# Important: both strings must be unique within the text
start_text = "It was the best of times, it was the worst of times, it was the age of"
end_text = "*** END OF THE PROJECT GUTENBERG EBOOK***"

start_pos = text.find(start_text)
end_pos = text.find(end_text)

print(text[int(start_pos):int(end_pos)])

In [None]:
# Replace original text with a trimmed down version

text = text[int(start_pos):int(end_pos)-1]
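One caveat with this approach: `str.find` returns `-1` when the marker is not present, and slicing with `-1` silently produces the wrong text instead of an error. A minimal sketch of a safer trim follows; the helper name `trim_between` and the short sample string are hypothetical and only illustrate the idea.

```python
def trim_between(text: str, start_text: str, end_text: str) -> str:
    """Return the part of text from start_text up to (not including) end_text."""
    start_pos = text.find(start_text)
    if start_pos == -1:
        raise ValueError(f"Start marker not found: {start_text!r}")
    # Search for the end marker only after the start marker
    end_pos = text.find(end_text, start_pos)
    if end_pos == -1:
        raise ValueError(f"End marker not found: {end_text!r}")
    return text[start_pos:end_pos].rstrip()


# Tiny made-up sample standing in for a downloaded book
sample = (
    "Header text\n"
    "It was the best of times.\n"
    "The end.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK\n"
    "License text"
)
print(trim_between(sample, "It was the best", "*** END OF"))
```

If the function raises, print a portion of the downloaded text and copy the marker exactly as it appears there.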

When finished with the retrieve step, click here to go to the [TextBlob Code section](#textblob-code)

# Retrieve Text - Example 3: Static, Predominately Text-based Web-Page

In [None]:
## Static web page - JFK speech re: moon
url = 'https://www.rice.edu/kennedy'
subject = "JFK Speech"     # Enter a title or subject that will be used later in graphs and output

response = requests.get(url)
response.content       # Notice moderate amount of HTML code

In [None]:
response.encoding = 'utf-8'
response.text

In [None]:
# use BeautifulSoup to clean up the response content

soup = BeautifulSoup(response.content, 'html5lib')
text = soup.get_text(strip=True)  # text without tags

In [None]:
# BeautifulSoup has done a decent job on this page removing HTML
print(text)

# Retrieve Text - Example 4: Blog format web-sites


Many sites use a content management system that treats each page like a blog post. We can extract that text and other information with the `Newspaper` module.

In [None]:
# Change the URL to your own (keep only one URL uncommented). Note: the full article may not be returned if it is behind a paywall.
url = 'https://hbr.org/2020/04/bringing-an-analytics-mindset-to-the-pandemic'
subject = "Pandemic Analytics"     # Enter a title or subject that will be used later in graphs and output

In [None]:
article = Article(url)
article.download()
article.parse()
text = article.text

In [None]:
print("Title: ", article.title)  
print("Authors: ", article.authors)  
print("Publication Date: ", article.publish_date)  
print("First Image:", article.top_image)  
print("Video Links:", article.movies)  

In [None]:
print("Title: ", article.title)
print()
print(text)

In [None]:
article.nlp()

print("KeyWords: ", article.keywords)   # list of keywords extracted by article.nlp()
print()
print("Summary: ", article.summary)  # summary text generated by article.nlp()

When finished with the retrieve step, click here to go to the [TextBlob Code section](#textblob-code)

TextBlob automatically performs several NLP (Natural Language Processing) methods. See the [quick start manual here](https://textblob.readthedocs.io/en/dev/quickstart.html).

(This section assumes you learned the properties of the Blob from previous notebooks.)

## Some Properties of the Blob

* `words`
* `sentences`
* parts-of-speech `tags`
* `noun_phrases`
* `sentiment`
    * `polarity`
    * `subjectivity`
    

## Parts of Speech (a.k.a. `tags`)

* `TextBlob` uses a `PatternTagger` to determine parts of speech
* Uses **pattern library** POS tagging
* Pattern defines 63 part-of-speech tags
* Sample tags:
    * `NN` - a **singular noun** or **mass noun**
    * `NNS` - a **plural noun**
    * `NNP` - a **proper singular noun**
    * `VB` - a verb, base form
    * `VBZ` - a [**third person singular present verb**](https://www.grammar.cl/Present/Verbs_Third_Person.htm)
    * `DT` - a [**determiner**](https://en.wikipedia.org/wiki/Determiner) (the, an, that, this, my, their, etc.)
    * `JJ` - an **adjective**
    * `IN` - a **subordinating conjunction** or **preposition**
    * `PRP` - a **personal pronoun**
    * `CC` - a **coordinating conjunction**
    
For more information:
* https://github.com/clips/pattern
* https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
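Because `tags` is a list of `(word, tag)` tuples, ordinary Python tools are enough to summarize it. The sketch below uses a short hand-copied sample of tags from the CNN example in this notebook in place of `blob.tags`; the variable names are illustrative.

```python
from collections import Counter

# First few (word, tag) pairs from the CNN example; in the notebook use blob.tags
sample_tags = [
    ('President', 'NNP'), ('Donald', 'NNP'), ('Trump', 'NNP'),
    ('turned', 'VBD'), ('his', 'PRP$'), ('first', 'JJ'),
    ('debate', 'NN'), ('with', 'IN'),
]

# Count how often each part-of-speech tag occurs
tag_counts = Counter(tag for _, tag in sample_tags)
print(tag_counts.most_common(3))

# Pull out only the proper nouns
proper_nouns = [word for word, tag in sample_tags if tag == 'NNP']
print(proper_nouns)
```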


# Readability with Textatistic

Textatistic provides various readability statistics for a chunk of text. It hasn't been updated in years but is still useful.

For more information about Flesch and Flesch-Kincaid, see this article on [Wikipedia](https://en.wikipedia.org/wiki/Flesch–Kincaid_readability_tests).
* Flesch score (reading ease) -- typically 0 to 100 -- 100 is easiest to read
    * General target -- 60+
* Flesch-Kincaid -- 0 and up with no fixed maximum; approximates the required U.S. school grade level
    * General target -- 8
    * Technical writing -- 12-16
    * PhD level -- around 20-22
    * The score doesn't measure how smart the reader is; it measures how hard they have to work to read the piece
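To make the two scores less of a black box, here is a small sketch of the standard published formulas, which depend only on average sentence length and average syllables per word. The counts in the example (100 words, 5 sentences, 150 syllables) are made up for illustration, and Textatistic's own counting of sentences and syllables may produce somewhat different numbers on real text.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch reading ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: 100 words, 5 sentences, 150 syllables
print(round(flesch_reading_ease(100, 5, 150), 3))   # 59.635
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

Longer sentences and longer words both push the reading-ease score down and the grade level up, which is why shortening sentences is usually the quickest way to improve either score.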

In [None]:
# magic to set precision
%precision 3

# Check basic readability stats
readability = Textatistic(text)
print(subject)
readability.dict()

# WordCloud


* Be sure you've installed `WordCloud`
* `imageio` should already be installed in Colab ([docs](https://imageio.readthedocs.io/en/stable/installation.html))
* Info on WordCloud: https://pypi.org/project/wordcloud/

In [None]:
from wordcloud import WordCloud
import imageio
import numpy as np

cloud = WordCloud(width=600, height=600, background_color='white',
                  max_words=50,
                  contour_width=3, contour_color='red')
cloud = cloud.generate(text)   # note: pass the text string, not the DataFrame
plt.imshow(cloud)


<a name="textblob-code"></a> 
# TextBlob - The Code

In [41]:
# Create a TextBlob object by passing the previously created text object

blob = TextBlob(text) 

## Review various attributes of the blob object


In [42]:
# words - this attribute returns a WordList of every word in the blob, in order (repeated words are included)
blob.words

WordList(['President', 'Donald', 'Trump', 'turned', 'his', 'first', 'debate', 'with', 'Democratic', 'rival', 'Joe', 'Biden', 'into', 'a', 'chaotic', 'disaster', 'Trump', 'bullied', 'bulldozed', 'and', 'obfuscated', 'his', 'way', 'through', 'the', '90-minute', 'showdown', 'interrupting', 'Biden', 'and', 'moderator', 'Chris', 'Wallace', 'of', 'Fox', 'News', 'at', 'every', 'turn', 'He', 'ignored', 'substantive', 'questions', 'and', 'Biden', '’', 's', 'policy', 'arguments', 'and', 'instead', 'swung', 'at', 'a', 'straw-man', 'version', 'of', 'Biden', 'taking', 'aim', 'at', 'both', 'Biden', '’', 's', 'son', 'and', 'a', 'distorted', 'description', 'of', 'his', 'record', 'that', 'exists', 'primarily', 'in', 'far-right', 'media', '05', '0929', 'debate', 'SPLIT', 'Fact', 'checking', 'Biden', 'and', 'Trump', 'at', 'the', 'first', 'presidential', 'debate', 'Over', 'Trump', '’', 's', 'interruptions', 'Biden', 'responded', 'by', 'mocking', 'the', 'President', 'calling', 'him', 'a', '“', 'clown', '”'

In [43]:
# sentences - returns a list of Sentence objects in the blob
blob.sentences

[Sentence("
 President Donald Trump turned his first debate with Democratic rival Joe Biden into a chaotic disaster."),
 Sentence("Trump bullied, bulldozed and obfuscated his way through the 90-minute showdown, interrupting Biden and moderator Chris Wallace of Fox News at every turn."),
 Sentence("He ignored substantive questions and Biden’s policy arguments, and instead swung at a straw-man version of Biden, taking aim at both Biden’s son and a distorted description of his record that exists primarily in far-right media."),
 Sentence("05 0929 debate SPLIT
 Fact checking Biden and Trump at the first presidential debate
 Over Trump’s interruptions, Biden responded by mocking the President, calling him a “clown,” a “racist” and “the worst president America has ever had.” He criticized Trump’s handling of the coronavirus pandemic, his failure to produce a health care plan and his response to protests over racial injustice."),
 Sentence("Over and over, Wallace tried to regain control of th

In [44]:
# noun_phrases - returns a WordList object containing a list of noun phrases, not perfect but automatic
blob.noun_phrases

WordList(['donald trump', 'joe biden', 'chaotic disaster', 'trump', '90-minute showdown', 'biden', 'chris wallace', 'fox', 'substantive questions', 'biden', '’ s policy arguments', 'straw-man version', 'biden', 'biden', '’ s son', 'far-right media', 'split fact', 'biden', 'trump', 'presidential debate', 'trump', '’ s interruptions', 'biden', '“ clown', '“ racist ”', 'america', 'had. ”', 'trump', '’ s', 'coronavirus pandemic', 'health care plan', 'racial injustice', 'wallace', 'regain control', 'trump', 'biden', '’ s answers', 'wallace', 'frankly', 'interrupting. ”', 'trump', 'biden', 'swing-state polls', 'white supremacists', 'multiple times', 'presidential debates', 'trump', 'doesn ’ t condemn', 'white supremacists', 'fox', 'news host', 'trump', 'white supremacists', 'source', 'cnn repeatedly', 'biden', 'trump', 'dog whistle', 'generate racist', 'racist division', 'vice president', 'race relations', 'trump', 'biden', 'destructive elements', 'summer ’ s protests', 'police killings', 'g

In [45]:
# tags - returns a list of tuples where [0] is the word and [1] is its part-of-speech tag
# Tag Examples: VB verbs, NNP proper noun, NNS plural noun , PRP personal pronoun (see more above)
blob.tags

[('President', 'NNP'),
 ('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('turned', 'VBD'),
 ('his', 'PRP$'),
 ('first', 'JJ'),
 ('debate', 'NN'),
 ('with', 'IN'),
 ('Democratic', 'JJ'),
 ('rival', 'JJ'),
 ('Joe', 'NNP'),
 ('Biden', 'NNP'),
 ('into', 'IN'),
 ('a', 'DT'),
 ('chaotic', 'JJ'),
 ('disaster', 'NN'),
 ('Trump', 'NNP'),
 ('bullied', 'VBD'),
 ('bulldozed', 'VBN'),
 ('and', 'CC'),
 ('obfuscated', 'VBN'),
 ('his', 'PRP$'),
 ('way', 'NN'),
 ('through', 'IN'),
 ('the', 'DT'),
 ('90-minute', 'JJ'),
 ('showdown', 'NN'),
 ('interrupting', 'VBG'),
 ('Biden', 'NNP'),
 ('and', 'CC'),
 ('moderator', 'NN'),
 ('Chris', 'NNP'),
 ('Wallace', 'NNP'),
 ('of', 'IN'),
 ('Fox', 'NNP'),
 ('News', 'NNP'),
 ('at', 'IN'),
 ('every', 'DT'),
 ('turn', 'NN'),
 ('He', 'PRP'),
 ('ignored', 'VBD'),
 ('substantive', 'JJ'),
 ('questions', 'NNS'),
 ('and', 'CC'),
 ('Biden', 'NNP'),
 ('’', 'NNP'),
 ('s', 'NN'),
 ('policy', 'NN'),
 ('arguments', 'NNS'),
 ('and', 'CC'),
 ('instead', 'RB'),
 ('swung', 'NN'),
 ('at', 'IN'),

## Sentiment analysis in TextBlob


In [46]:
# sentiment - returns tuple of overall sentiment
blob.sentiment

Sentiment(polarity=-0.008762590187590189, subjectivity=0.4134220057720057)

In [47]:
# Access overall sentiment components by name
print('Polarity -1 +1,  Subjectivity 0 1')
print('=================================')
print(f'Polarity: \t{blob.polarity:.3f}')
print(f'Subjectivity:\t{blob.subjectivity:.3f}')

Polarity -1 +1,  Subjectivity 0 1
Polarity: 	-0.009
Subjectivity:	0.413
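Raw polarity numbers can be hard to read at a glance. One possible way to turn them into labels is sketched below; the function name `label_polarity` and the 0.05 neutral band are assumptions for illustration, not thresholds defined by TextBlob. The sample values are taken from this notebook's CNN output.

```python
def label_polarity(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse label."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


print(label_polarity(-0.009))     # overall article score
print(label_polarity(0.25))       # first sentence
print(label_polarity(-0.355556))  # fourth sentence
```

Widen or narrow the neutral band to suit your material; scores very close to zero usually mean TextBlob found few sentiment-bearing words.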


In [48]:
# Loop through all sentences to display sentiment info

for indx, sentence in enumerate(blob.sentences):
    print(f'{indx}:  {sentence}')    
    print(f'\tSentiment: {sentence.sentiment}')
    print(f'\tPolarity:\t{sentence.polarity:>6.2f}')
    print(f'\tSubjectivity:\t{sentence.subjectivity:>6.2f}')
    print('\n--------------------------------------------------------\n')


0:  
President Donald Trump turned his first debate with Democratic rival Joe Biden into a chaotic disaster.
	Sentiment: Sentiment(polarity=0.25, subjectivity=0.3333333333333333)
	Polarity:	  0.25
	Subjectivity:	  0.33

--------------------------------------------------------

1:  Trump bullied, bulldozed and obfuscated his way through the 90-minute showdown, interrupting Biden and moderator Chris Wallace of Fox News at every turn.
	Sentiment: Sentiment(polarity=0.0, subjectivity=0.0)
	Polarity:	  0.00
	Subjectivity:	  0.00

--------------------------------------------------------

2:  He ignored substantive questions and Biden’s policy arguments, and instead swung at a straw-man version of Biden, taking aim at both Biden’s son and a distorted description of his record that exists primarily in far-right media.
	Sentiment: Sentiment(polarity=0.4, subjectivity=0.5)
	Polarity:	  0.40
	Subjectivity:	  0.50

--------------------------------------------------------

3:  05 0929 debate SPLIT


# Create a **pandas** `DataFrame`
* Think of a DataFrame as a two-dimensional array with rows and columns, like a spreadsheet. We can populate a DataFrame with the sentences and sentiment data for easy analysis.
* Add a column called `group` with a word or short phrase that describes this data source. It can be used in charts and reports when multiple sources are compared.
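The layout described above (one row per sentence, plus a `group` label) can be sketched with a list of dictionaries. This is an illustrative alternative, not the notebook's own cell: the two rows are copied by hand from the CNN output, and in practice the values would come from looping over `blob.sentences`.

```python
import pandas as pd

group = 'CNN'

# (polarity, subjectivity, text) for the first two sentences of the CNN example
rows = [
    (0.25, 0.333333, "President Donald Trump turned his first debate with "
                     "Democratic rival Joe Biden into a chaotic disaster."),
    (0.0, 0.0, "Trump bullied, bulldozed and obfuscated his way through the "
               "90-minute showdown, interrupting Biden and moderator Chris "
               "Wallace of Fox News at every turn."),
]

# One dictionary per sentence; sentpos records the original sentence order
records = [
    {"polarity": p, "subjectivity": s, "text": t, "sentpos": i, "group": group}
    for i, (p, s, t) in enumerate(rows)
]
df_demo = pd.DataFrame(records)
print(df_demo[["polarity", "subjectivity", "sentpos", "group"]])
```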


In [49]:
group = 'CNN'

In [50]:
# Create sentiment dataframe
pd.set_option('max_colwidth', 1000)    # set this to not truncate output

p = []   # create a temporary list of all polarity scores
s = []   # create a temporary list of all subjectivity scores
txt = []    # create a temporary list of all sentences
for sentence in blob.sentences:   # loop through all sentences, append to lists
    p.append(sentence.sentiment.polarity)
    s.append(sentence.sentiment.subjectivity)
    txt.append(str(sentence))

# create sentiment DataFrame
df_sent = pd.DataFrame(p,columns=['polarity'])
df_sent['subjectivity'] = s
df_sent['text'] = txt
df_sent['sentpos']= np.arange(len(df_sent))  #original position of sentence in text
df_sent['group'] = group

# Output the first 10 records (may include some junk depending on original file)
df_sent.head(10)

Unnamed: 0,polarity,subjectivity,text,sentpos,group
0,0.25,0.333333,\nPresident Donald Trump turned his first debate with Democratic rival Joe Biden into a chaotic disaster.,0,CNN
1,0.0,0.0,"Trump bullied, bulldozed and obfuscated his way through the 90-minute showdown, interrupting Biden and moderator Chris Wallace of Fox News at every turn.",1,CNN
2,0.4,0.5,"He ignored substantive questions and Biden’s policy arguments, and instead swung at a straw-man version of Biden, taking aim at both Biden’s son and a distorted description of his record that exists primarily in far-right media.",2,CNN
3,-0.355556,0.544444,"05 0929 debate SPLIT\nFact checking Biden and Trump at the first presidential debate\nOver Trump’s interruptions, Biden responded by mocking the President, calling him a “clown,” a “racist” and “the worst president America has ever had.” He criticized Trump’s handling of the coronavirus pandemic, his failure to produce a health care plan and his response to protests over racial injustice.",3,CNN
4,0.3,0.0,"Over and over, Wallace tried to regain control of the debate, without success.",4,CNN
5,0.002083,0.433333,"When Trump complained that only he was being chastised for talking over questions and Biden’s answers, Wallace shot back: “Frankly, you have been doing more interrupting.”\n\nTrump, who has trailed Biden in national and swing-state polls, made little effort to reach out to voters who do not currently support him.",5,CNN
6,0.0,0.166667,He could have further damaged his standing by refusing to condemn White supremacists after being asked to do so multiple times.,6,CNN
7,0.125,0.270833,Here are six takeaways from the first of three presidential debates:\n\nTrump doesn’t condemn White supremacists\n\nFox News host to Trump: Are you willing to condemn white supremacists?,7,CNN
8,0.1,0.4,"01:29 - Source: CNN\nRepeatedly and directly, Biden called Trump racist.",8,CNN
9,0.0,0.0,"“This is a President who has used everything as a dog whistle to try to generate racist hatred, racist division,” the former vice president said.",9,CNN


## Multiple Sources / Combining DataFrames
If you plan to compare several groups or sources of data, process each source into its own DataFrame and then combine all of the DataFrames. Each should have a different value in the `group` column.

In [51]:
# Save the previous DataFrame under a new name
# Uncomment only the line that matches the source you just processed

#df1 = df_sent
df2 = df_sent
#df3 = df_sent

# Now repeat process above to create additional DataFrame.
# Be sure to give it a new name.

Return to [Retrieve Text section](#retrieve-text) to import and process new text for another DataFrame.

In [52]:
# Combine DataFrames into one DataFrame
# replace original df_sent with this new combined DataFrame

# create a list of DataFrames to combine

df_list = [df1, df2]   # add/remove DataFrame names to list

df_sent=pd.concat(df_list)

# output DataFrame to CSV file for easy import in future
outfile = 'sentiment.csv'
df_sent.to_csv(outfile)

The previous cell combines multiple DataFrames into one and then writes the data to a CSV file. You could upload this CSV file to GitHub, get the link to the raw file, then import it in the future to reuse.

To import a CSV to Pandas, use code similar to:

`df_sent = pd.read_csv(filename)`
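Two details are worth knowing when you combine and save. By default `pd.concat` keeps each DataFrame's original index, so row labels repeat, and `to_csv` writes the index as an extra column that comes back as `Unnamed: 0` on import. The sketch below shows one way to avoid both; the second source (`"Other"`) and its numbers are hypothetical placeholders, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Two small stand-in DataFrames; the "Other" rows are made-up placeholder values
df_a = pd.DataFrame({"polarity": [0.25, 0.0], "sentpos": [0, 1], "group": "CNN"})
df_b = pd.DataFrame({"polarity": [0.1, -0.2], "sentpos": [0, 1], "group": "Other"})

# ignore_index=True renumbers rows 0..n-1 instead of repeating 0, 1, 0, 1
combined = pd.concat([df_a, df_b], ignore_index=True)

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
combined.to_csv(buffer, index=False)

# Read it back the same way you would read a saved file
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

The `sentpos` column already records each sentence's position within its source, so nothing is lost by dropping the index.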

## Examine the extreme values and sentences
There are many ways to pull data from a DataFrame. For this step, we'll keep it simple. We'll sort by the polarity column and then view the first five and last five rows using the `head` and `tail` methods of the DataFrame object. Five rows is the default; put a number between the ( ) to get more or fewer. Then we'll sort by subjectivity and print those results.
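As an alternative to sorting and then taking the head or tail, pandas also offers `nsmallest` and `nlargest`, which return the extreme rows directly. A minimal sketch, using the first five polarity and subjectivity values from the CNN output as stand-in data:

```python
import pandas as pd

# First five sentences of the CNN example (text column omitted for brevity)
df_demo = pd.DataFrame({
    "polarity": [0.25, 0.0, 0.4, -0.355556, 0.3],
    "subjectivity": [0.333333, 0.0, 0.5, 0.544444, 0.0],
    "sentpos": [0, 1, 2, 3, 4],
})

most_negative = df_demo.nsmallest(2, "polarity")   # two lowest polarity rows
most_positive = df_demo.nlargest(2, "polarity")    # two highest polarity rows

print(most_negative)
print(most_positive)
```

Unlike `sort_values(...).tail()`, `nlargest` lists the most extreme row first, which can be easier to read.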

In [None]:
# Show the five most negative
print("Most Negative Items")
df_sent.sort_values(by=['polarity']).head()

In [None]:
# Show the five most positive
print("Most Positive Items")
df_sent.sort_values(by=['polarity']).tail()

In [None]:
# Show the five most subjective
##-- remember subjectivity is on a 0 to 1 scale
print("Most Subjective Items")
df_sent.sort_values(by=['subjectivity']).tail()

In [None]:
# Show the five most objective (or least subjective )
##-- remember subjectivity is on a 0 to 1 scale
print("Most Objective Items")
df_sent.sort_values(by=['subjectivity']).head()

# Create Plots of the dataframe contents

In [58]:
# Single plot using sentence position as color

title = 'Analysis of sentence position of News Stations'

fig = px.scatter(df_sent,
                 x = 'polarity' ,
                 y = 'subjectivity',
                 hover_data = ['text'],
                 title = title,
                 width=900, height=600,
                 template = 'presentation',
                 color='sentpos'
                )
fig.update_xaxes(range=[-1.0, 1.0])
fig.update_yaxes(range=[0, 1])

fig.show()

In [57]:
# Scatter plot using group as color

title = 'Polarity and Subjectivity by News Stations'
## REVISE TITLE TO INCLUDE SOURCE OR SUBJECT

fig = px.scatter(df_sent,
                 x = 'polarity' ,
                 y = 'subjectivity',
                 hover_data = ['text'],
                 title = title,
                 width=900, height=600,
                 template = 'presentation',
                 color='group'
                )
fig.update_xaxes(range=[-1.0, 1.0])
fig.update_yaxes(range=[0, 1])

fig.show()

In [59]:
# Line plot using group as color

title = 'Polarity Over Time'

fig = px.line(df_sent,
                 x = 'sentpos' ,
                 y = 'polarity',
                 hover_data = ['text'],
                 title = title,
                 width=900, height=600,
                 template = 'presentation',
                 color='group'
                )
fig.update_yaxes(range=[-1.0, 1.0])

fig.show()

In [60]:
# Line plot using group as color
## REVISE TITLE TO INCLUDE SOURCE OR SUBJECT

title = 'Subjectivity Over Time'

fig = px.line(df_sent,
                 x = 'sentpos' ,
                 y = 'subjectivity',
                 hover_data = ['text'],
                 title = title,
                 width=900, height=600,
                 template = 'presentation',
                 color='group'
                )
fig.update_yaxes(range=[0, 1.0])

fig.show()