# Text Analysis in Python 2: Word Counts / Words Count

<h1 style="text-align:center;font-size:300%;">The United States is / are ____?</h1> 
  <img src="https://miro.medium.com/max/720/1*pp7HX01jBv2wbVRW9Ml_mA.png" style="width:80%;">
  <!--<img src="http://www.languagetrainers.com/blog/wp-content/uploads/2012/10/us-are-vs-us-is1.png" style="width:%140;">-->

From Benjamin Schmidt and Mitch Fraas, ["The Language of the State of the Union"](https://www.theatlantic.com/politics/archive/2015/01/the-language-of-the-state-of-the-union/384575/), *The Atlantic* (Jan. 15, 2015).  **Can we create our own version of this graph?**

## This Lesson

**Exploring the frequency of words and phrases in texts: what can they tell us about a text?**

In this session, participants will:
+ Apply Python (and the NLTK package) to read individual text files and apply essential pre-processing techniques (i.e. divide each text into a list of words or tokens, lower-case all words, remove punctuation, and lemmatize each word).
+ Create frequency lists identifying the most common words or ngrams (multi-word terms) in a text or corpus
+ Create graphs, charts, and word clouds visually representing word and term frequency patterns 
+ Identify some ways the language of State of the Union speeches has changed over time and discuss how this method could be applied to other texts and questions

**In short, one of our goals today is to recreate the graphic above (from *The Atlantic*) showing changes in the frequency of particular words or terms - as used in the State of the Union address - over time.** 


## Structure of Notebooks

These Jupyter Notebooks are designed to integrate instructions and explanations (in the white "markdown" cells below) with hands-on practice with the code (in the gray "code" cells below). To add, modify, or delete cells, please use the Menu above (especially under the Edit, Insert, and Cell tabs) or press ESC + H to see a list of keyboard shortcuts.

<div class="alert alert-success" role="alert"><h3 style="color:green">Code Together:</h3><p style="color:green">In these cell blocks, we will code together. You can find the completed version in our shared folder (ending with "_completed.ipynb").</p></div>

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercises:</h3><p style="color:blue">are in blue text. These are a chance to practice what you have learned.</p></div>

<div style = "background-color:#f3e5f5"><h3 style="color:purple">Python Basics - Additional Practice</h3><p style="color:purple">are in purple text. Work on these after the lesson if you would like more practice.</p></div>

## Lists of Words, Frequency Lists, N-Grams, and Dispersion Plots

*[intro / explanation / beg instructions / links back to prev notebooks]*

## Part I: Getting Started - Importing Python Packages, SOTU texts, and tokenizing

**1. First, we will need to import the necessary Python packages or libraries for today's lesson.**

In [None]:
import os, nltk, re, collections, pathlib, time
import pandas as pd
import matplotlib.pyplot as plt   #pyplot is the plotting interface conventionally imported as "plt"
import seaborn as sns
from pathlib import Path
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk import ngrams, pos_tag, word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
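nltk.download("punkt")       #tokenizer models used by word_tokenize(); newer NLTK versions may ask for "punkt_tab" instead
nltk.download("stopwords")   #stop word lists used in step 4d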
from collections import Counter

plt.rcParams['figure.figsize'] = [12, 6]  #changes default figure size to make larger plots

%config InteractiveShellApp.matplotlib = 'inline'
#%config InlineBackend.figure_formats = ['svg']               #this command prints out images as svgs rather pngs. However, it also slows down the plotting
                                                              ## so uncomment if you want Jupyter to render clearer images

**2. Navigate to and examine the folder that has your SOTU files.** As in the previous lesson, we use the **pathlib** library and its functions to work with paths to files and folders (directories). For more on the challenges of working with filepaths across operating systems and how pathlib addresses them, see ["Python 3's pathlib module: Taming the File System."](https://realpython.com/python-pathlib/)

In [None]:
p = Path.cwd()
p2 = p.parent
sotudir = Path(p2,"strings-and-files","state-of-the-union-dataset","txt")  # creates a filepath to our dataset or corpus of texts
print(sotudir)
pathlist = sorted(sotudir.glob("*.txt"))    #glob("*.txt") retrieves only filepaths to .txt files; sotudir is the filepath we created above; 
                                            ## sorted() sorts the filepaths in ascending order
pathlist[:10]
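To see why the cell above wraps glob() in sorted(), here is a minimal, self-contained sketch. It builds a throwaway folder with made-up filenames (an assumption for illustration, not the real dataset) and shows that glob() hands back a generator that can be read only once, whereas sorted() stores the matches in an ordered list we can reuse.

```python
import tempfile
from pathlib import Path

# Build a temporary folder holding three made-up files (hypothetical names, not the real corpus)
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["1801_Jefferson.txt", "1794_Washington.txt", "notes.md"]:
        (folder / name).write_text("placeholder text")

    matches = folder.glob("*.txt")             # a generator, not a list
    first_pass = [p.name for p in matches]     # reading it once uses it up
    second_pass = [p.name for p in matches]    # nothing is left the second time

    sorted_names = [p.name for p in sorted(folder.glob("*.txt"))]   # reusable, ordered list

print(len(first_pass), "matches on the first pass;", len(second_pass), "on the second")
print(sorted_names)
```

Note that the .md file is skipped because it does not match the "*.txt" pattern.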

**3. Open George Washington's 1794 SOTU address:**

*Note*: In Python, it is recommended that you always close your files after finishing with them. One way to do this is to place an **open()** command within a **with statement**. This way, the file is closed as soon as we exit the indented block underneath the with statement. Another way is to immediately **close()** the file after extracting the information you need from it. Run either or both options below. See [Why Close Python Files](https://realpython.com/why-close-file-python/). 

In [None]:
with open(Path(sotudir,'1794_Washington.txt')) as f:
    wash94 = f.read()

In [None]:
f = open(Path(sotudir,'1794_Washington.txt'))
wash94 = f.read()
f.close()

<div class="alert alert-success" role="alert">
    <p style='color:green'><b>3b. Code Together:</b> Print the last (instead of the first) 400 characters of this address.</p>
</div>

In [None]:
#insert code here
wash94[0:400]

**4. Tokenize this address and count the number of tokens.**



We can then convert this SOTU text into a list of words or tokens.

There are many different ways we may want to create such a list, depending on our needs.

**4a. For example, we can just use NLTK's standard word tokenizer.**

In [None]:
#if we wanted to keep all tokens, including punctuation, we could just run the following code:
tokens = word_tokenize(wash94)  #this command uses the function word_tokenize() from the package nltk (which we imported at the beginning of the lesson)
print("our tokens list contains",len(tokens),"tokens.")
print(tokens[:40])

**4b. Or we can remove punctuation by using NLTK's RegexpTokenizer.**

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
tokens2=tokenizer.tokenize(wash94)
print("our tokens2 list - with punctuation removed - contains",len(tokens2),"tokens.")
print(tokens2[:40])
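To get a feel for what the pattern `\w+` keeps and discards, here is a small sketch using Python's built-in **re** package on a made-up sentence. It is an illustration of the pattern itself rather than of NLTK's internals, although for this simple pattern the two should pick out the same pieces of text.

```python
import re

sample = "The United States is, or are, a union. Isn't it?"   # made-up example sentence
sample_tokens = re.findall(r"\w+", sample)   # \w+ matches runs of letters, digits, and underscores
print(sample_tokens)
```

Commas, periods, and question marks disappear, but so does the apostrophe, which splits "Isn't" into "Isn" and "t". Keep this side effect in mind when interpreting very short tokens in your frequency lists.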


**4c. We can convert all the tokens to lower-case.**

In [None]:
ltokens2 = [tok.lower() for tok in tokens2]    #note we are lowercasing "tokens2" (our tokens list with punctuation removed) rather than tokens
print("our tokens2 list converted to lower-case (saved as ltokens2) - contains",len(ltokens2),"tokens.")
print(ltokens2[:40])

**4d. We often want to remove stopwords**. **Stop words** are common words that reveal little about the meaning of a text (such as articles like "the", conjunctions like "and", prepositions like "on", pronouns like "our", and helper verbs like "can"). Fortunately, NLTK provides a list of stop words in English (and other languages as well) that we can use to eliminate all such words from our texts.

Let's examine stopwords in English:

In [None]:
print(stopwords.words('english'))

<div class="alert alert-success" role="alert">
    <p style="color:green"><b>Code Together</b>: What if you work with another language? Let's print out the language options for NLTK's stopwords:</p>
</div>

In [None]:
# languages in nltk
print(stopwords.fileids())

<div class="alert alert-success" role="alert">
    <p style="color:green">Now try to print out stopwords from a language of your choice (using the same code we used above to print out English stopwords):</p>
</div>

In [None]:
# 



Next, with our English stopwords list, we can further modify our ltokens2 list by removing stopwords:

In [None]:
stop=stopwords.words('english')
ltokens2ns=[tok for tok in ltokens2 if tok not in stop]        #list comprehension removes all stopwords from ltokens2 (see Part II below for an explanation of list comprehensions)

print("We had",len(ltokens2),"tokens in our ltokens2 list.")
print("beginning with:",ltokens2[:30]," \n")
print("After removing stop words, we now have",len(ltokens2ns),"tokens in our list.")
print("beginning with:",ltokens2ns[:30])
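The same filtering step can be tried on a toy example. The sketch below uses a tiny, made-up stop word collection (an assumption for illustration; NLTK's English list is much longer) and stores it as a Python set rather than a list.

```python
# A tiny, made-up stop word set for illustration only
sample_stop = {"the", "of", "and", "to", "is"}

sample_tokens = ["the", "state", "of", "the", "union", "is", "strong"]
content_tokens = [tok for tok in sample_tokens if tok not in sample_stop]   # keep only non-stop words
print(content_tokens)
```

Checking membership in a set is generally faster than checking membership in a list, so for a large corpus it may be worth writing `stop = set(stopwords.words('english'))`. The result is the same either way.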

<div class="alert alert-info" role="alert">
    <h3 style = "color:blue">Exercises (Part I)</h3>
    <p style = "color:blue"><b>5. Tokenize and lower-case another SOTU address.</b></p>
    <p style = "color:blue">We counted the number of tokens within Washington's 1794 address. Choose another SOTU address and compute and print out the number of words contained within it.</p>
</div>

<div style = "background-color:#f3e5f5">
    <br/>
    <h3 style = "color:#7b1fa2">Part II. Python Basics: For Loops vs. List Comprehensions</h3>
    <p style = "color:#7b1fa2">In step 4c, we converted our tokens to lower-case and saved them to a new list called "ltokens2" using the list comprehension:</p></div>
        
        ltokens2 = [tok.lower() for tok in tokens2]
        
<div style = "background-color:#f3e5f5">
    <p style = "color:#7b1fa2">This task could also be accomplished with a <b>for loop</b>.</p>  
    <p style = "color:#7b1fa2">6. <b>FOR LOOPS to iterate through lists</b>: In the previous lesson we converted all tokens to lowercase using a simple <b>for loop</b>.</p>
    <br/>
</div>

In [None]:
ltokens2 = []                      #assigns an empty list to the variable ltokens2
for tok in tokens2:                #iterates through our tokens2 list we created above, assigning the variable name "tok" to each item in the list as it goes
    ltokens2.append(tok.lower())   #lowercases each item ("tok") from tokens2 list and adds it to ltokens2 list (tokens2 is not permanently changed) using append function

In [None]:
print("tokens2 list: \n",tokens2[:40],"\n\n")
print("ltokens2 list: \n",ltokens2[:40])

<div style = "background-color:#f3e5f5; color:purple">
    <br/>
    <p>7. <b>LIST COMPREHENSIONS</b>: We can also use <b>list comprehensions</b> to iterate through lists more efficiently and with fewer lines of code.</p> 
    <p>We can do the same thing using a <b>list comprehension</b>. The formula for list comprehensions is:</p>
        
        newList = [item (or modified item) for item in oldList]
    
</div>


<div style = "background-color:#f3e5f5; color:purple">
    <br/>
<p>You can also add a conditional:</p>
    
        newList = [item (or modified item) for item in oldList if item meets condition]
</div>

<div style = "background-color:#f3e5f5; color:purple">
    <br/>    
    <p>For more on list comprehensions see: <a href="https://www.w3schools.com/python/python_lists_comprehension.asp">w3schools</a> or <a href="https://realpython.com/list-comprehension-python/">realpython</a>.</p>
    <p style = "color:purple">To create a new list of tokens that have been converted to lower case, we can write the following list comprehension:</p>
    <br/>
</div>

In [None]:
ltokens2 = [tok.lower() for tok in tokens2]              #in plain English, this looks at each item (which we assign the variable "tok" here) in our list of tokens and
                                                             ## lowercases it (with "tok.lower()") and then places this lower case version in a list (indicated by the "[]")
                                                             ## which we call ltokens2.  Note: the only variable previously defined here is tokens2
print("ltokens2 list created using a list comprehension: \n",ltokens2[:40])

<div style = "background-color:#f3e5f5">
    <p style = "color:purple"><b>7b. Run the code below. What did it do?</b></p> 
</div>

In [None]:
#here is an example of a list comprehension with a conditional. 

utokensT = [tok.upper() for tok in tokens2 if tok.startswith("T")]
print(utokensT)

## Part III. Creating Frequency Lists

**8. Counting Unique Items in a List:** As with most tasks in Python or other popular programming languages, there are multiple ways to count unique items in a list and create frequency lists of them. First, we will use the **collections** package.

In [None]:
numsList = [3,2,2]                         #INSERT ADDITIONAL NUMBERS IN THIS LIST, include repeats       
collections.Counter(numsList)   #using the function "Counter" from the package "collections" - creates a 
                                                  ## frequency list from numsList


**9. Counting Unique Tokens:** Let's apply the same method to create a frequency list of our tokens from our list of lowercase tokens (ltokens2).

In [None]:
tokfreqs = collections.Counter(ltokens2)

**9b. To view the most or least common tokens in our list, run:**

In [None]:
print(tokfreqs.most_common(40))
print("\n")
print(tokfreqs.most_common()[:-41:-1]) #there is no function for returning the least common tokens;
    # however, we can do essentially the same thing by slicing from the end of the ordered list of
    # most common tokens: the slice [:-41:-1] steps backward (the final -1) from the last item and
    # stops before position -41, returning the 40 least common tokens, least common first
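Backward slices are easy to get wrong by one, so it can help to test the idea on a list small enough to check by eye. This sketch uses a made-up six-token list; the variable names are chosen only for this example.

```python
from collections import Counter

sample_freqs = Counter(["the", "the", "the", "of", "of", "union"])   # made-up tokens
ranked = sample_freqs.most_common()     # all items, most common first
n = 2
least_common = ranked[:-n-1:-1]         # step backward from the end, stopping before position -(n+1)

print(ranked)
print(least_common)
```

In general, a slice of the form `[:-n-1:-1]` returns the last n items of a list in reverse order.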

**9c. FREQUENCY LIST WITH STOP WORDS REMOVED**: The list of most common words above does not seem to be very revealing. Let's try to remove stopwords to see how that changes the results.

In [None]:
n=40
tokfreqs_ns=collections.Counter(ltokens2ns)
print("\nNow our most common",n,"tokens (after removing stop words) are:\n",tokfreqs_ns.most_common(n))

<div class="alert alert-info" role="alert"><h3 style='color:blue'>Exercises (Parts I - III)</h3></div>

<div class="alert alert-info" role="alert"><p style='color:blue'>10. Open and read in a different SOTU address of your choice. Tokenize it using the tokenizer we used to remove all punctuation.</p></div>

<div class="alert alert-info" role="alert"><p style='color:blue'>11. Now convert all remaining tokens into lower case and remove all stopwords. How many tokens are now found in your token list?</p></div>

<div class="alert alert-info" role="alert"><p style='color:blue'>12. Create a frequency list of the top 30 words in your new list of tokens (lowercase with all punctuation tokens and stopwords removed). Compare the frequency list to the list we created for Washington's 1794 address. In what ways do they appear most different? Similar?</p></div>

## IV. Create a Dataframe of SOTU texts

**14. DATAFRAMES:** To enable easier analysis of the SOTU texts, we can store info about each in a **dataframe**. A dataframe in Python is a common data structure enabled with the **pandas** library. It is a two-dimensional data table that stores data in rows and columns. Run the code below, and then examine what each portion of the code does.

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
#n=50

txtList=[]
pathlist = sorted(sotudir.glob('*.txt'))       # glob() returns a generator that can only be read once; sorted() stores the filepaths in a reusable list (rebuilt here so this cell can run on its own)
for path in pathlist:
    fn=path.stem                       #stem returns the filename minus the ".txt" (file extension). 
    year,pres=fn.split("_")            # fn = "1794_Washington" becomes year = "1794" and pres = "Washington"
    with open(path,'r') as f:  
        sotu = f.read()                #opens each file and reads it in as "sotu"
    tokens=tokenizer.tokenize(sotu)    # tokenizes "sotu"
    numtoks = len(tokens)             # counts the number of tokens in "sotu"
    txtList.append([pres,year,numtoks,tokens,sotu])   #add this info for "sotu" to a running list for all sotu addresses

colnames=['pres','year','numtoks','tokens','fulltext']
sotudf=pd.DataFrame(txtList,columns=colnames)  #places our completed list of SOTU info in a dataframe
sotudf['ltoks'] = sotudf['tokens'].apply(lambda x: [tok.lower() for tok in x]) #creates a new column to store lower case tokens
sotudf.head(10)      #prints out the first 10 rows of this dataframe (the default value for head() is 5 rows)
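The filename-parsing step inside the loop above can be tried on its own. The sketch below uses an example path (the file does not need to exist for this to run) and then a hypothetical filename with a second underscore, which is an assumption for illustration and not a file from the dataset.

```python
from pathlib import Path

path = Path("txt", "1794_Washington.txt")   # example path; nothing is opened or read
fn = path.stem                               # drops the folder and the ".txt" extension
year, pres = fn.split("_")                   # splits on the underscore into two pieces
print(year, pres)

# A hypothetical filename with a second underscore would give three pieces and break the
# two-variable assignment above; limiting the split to the first underscore avoids that
year2, pres2 = "1945_Roosevelt_F".split("_", 1)
print(year2, pres2)
```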

**15. SAVING DATAFRAMES:** It would be useful to reuse this dataframe in the future. Let's save it to a tab-separated values (tsv) file, a close cousin of the csv file. 

In [None]:
sotudf.to_csv("sotudf.tsv",encoding = "utf-8",sep="\t")
#we are saving this with a "tsv" extension to indicate we are using tabs ("\t") as our delimiter between columns, not commas
#csv = comma separated values; tsv = tab separated values
#within JupyterHub you should see this new file appear in the folder directory to the left. 
#you are welcome to download it (right click --> Download on PCs; Ctrl-Click --> Download on Macs)
# you can then open it in Excel by 1. opening a new, blank Excel workbook; 2. going to the Data tab --> GetData/From Text/CSV;
# 3. navigating to the folder you downloaded the tsv file to; 4. changing the option at the bottom right to view "All Files"; 
# 5. selecting and opening the .tsv file; 6. in the dialogue box, making sure the delimiter is set to "Tab" (it is also recommended, but not required, to set File Origin to "Unicode (UTF-8)"); 
# 7. selecting Load
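One thing to watch for when reloading a saved dataframe: columns that held Python lists (such as our token columns) come back as plain text. The sketch below demonstrates this with a one-row, made-up dataframe written to an in-memory buffer instead of a real file; `ast.literal_eval` is one possible way to turn the text back into a list.

```python
import ast
import io
import pandas as pd

# A one-row, made-up dataframe with a list stored in one column
small = pd.DataFrame({"pres": ["Washington"], "tokens": [["fellow", "citizens"]]})

buffer = io.StringIO()                  # stands in for a file on disk
small.to_csv(buffer, sep="\t")
buffer.seek(0)                          # rewind to the start before reading

restored = pd.read_csv(buffer, sep="\t", index_col=0)   # index_col=0 reuses the saved row numbers
as_text = restored.loc[0, "tokens"]
print(type(as_text))                    # the list has come back as a string

restored["tokens"] = restored["tokens"].apply(ast.literal_eval)   # parse the text back into a list
print(restored.loc[0, "tokens"])
```

For a saved file, the equivalent call would be `pd.read_csv("sotudf.tsv", sep="\t", index_col=0)`.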

## V. Searching for Specific Words

In this section, we will create code that searches for specific words across the entire SOTU corpus.

**16. QUICK VIEW OF WORDS IN CONTEXT:** Before searching, however, we can quickly use the **concordance** function from the **nltk** library to view words in context.

In [None]:
text1 = nltk.Text(word_tokenize(wash94)) #to use many of nltk's functions we need to convert a tokens list into an nltk.Text object
    #note: the "tokens" variable was overwritten inside the loop in step 14, so we re-tokenize Washington's 1794 address here
    ## for concordances, it makes sense to work with unmodified tokens (in original case with all words and punctuation
    ## still included)
text1.concordance("government")



17. Like a lot of Python functions, NLTK's **concordance()** function allows us to pass in additional parameters. In this case, we will widen each line of context around our search term to 200 characters.

In [None]:
text1.concordance("government",200)
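To demystify what a concordance is doing, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `kwic` and its `window` parameter are made up for this example, and it measures context in tokens rather than characters.

```python
def kwic(tokens, term, window=3):
    """Return each occurrence of term with `window` tokens of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():                        # case-insensitive match
            left = " ".join(tokens[max(0, i - window):i])      # tokens just before the match
            right = " ".join(tokens[i + 1:i + 1 + window])     # tokens just after the match
            lines.append(f"{left} [{tok}] {right}")
    return lines

# A made-up sentence, split on spaces for simplicity
sample = "the Government of the United States must support the government of each State".split()
for line in kwic(sample, "government", window=2):
    print(line)
```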

<div style = "background-color:#f3e5f5">
<h3 style = "color:purple">Python Basics: Writing Functions</h3>

<p style="color:purple"><b>18. FUNCTIONS:</b> Now, it would be helpful if we placed the above code into a small program or <b>function</b> so that we can easily search for other terms and plot their frequency.</p>

<p style="color:purple">We have already used a variety of core Python functions such as <b>sum()</b>, <b>len()</b>, and <b>print()</b>. We have also called on many functions defined in auxiliary Python libraries or packages: such as the <b>word_tokenize()</b> and <b>concordance</b> functions from the <b>nltk</b> library we imported.</p>

<p style="color:purple">Here, however, we will create our own function. The typical format of a Python function is:</p>

```python
def functionName(argumentsToPassIn):
    function instructions
    return(resultsOfFunction)
```
    
<p style="color:purple"><b>19. A SIMPLE FUNCTION:</b> So, for example, if we had a list of names and we wanted to create a function to retrieve the initials of each, we could use the following function:</p>
</div>

In [None]:
def Initials(fullname):
    caps = re.findall('([A-Z])', fullname) #this uses the findall function from the re package to find all capitalized letters
    inits = ''.join(caps)  #takes our list of capitalized letters stored in "caps" and concatenates it
    return(inits)
    
fullname = "Jeremy M. Mikecz"     #replace w/ your name
Initials(fullname)

<div style = "background-color:#f3e5f5"><p style="color:purple">20. We can now apply this function to quickly return the initials from a long list of names.</p></div>

In [None]:
actorlist = ['Christoph Waltz','Tom Hardy','Doug Walker','Daryl Sabara','J.K. Simmons','Brad Garrett','Chris Hemsworth','Alan Rickman','Henry Cavill','Kevin Spacey','Giancarlo Giannini','Johnny Depp','Johnny Depp','Henry Cavill','Peter Dinklage','Chris Hemsworth','Johnny Depp','Will Smith','Aidan Turner','Emma Stone','Mark Addy','Aidan Turner','Christopher Lee','Naomi Watts','Leonardo DiCaprio','Robert Downey Jr.','Liam Neeson','Bryce Dallas Howard','Albert Finney','J.K. Simmons','Robert Downey Jr.','Johnny Depp','Hugh Jackman','Steve Buscemi','Glenn Morshower','Bingbing Li','Tim Holmes','Emma Stone','Jeff Bridges','Joe Mantegna','Ryan Reynolds','Tom Hanks','Christian Bale','Jason Statham','Peter Capaldi','Jennifer Lawrence','Benedict Cumberbatch','Eddie Marsan','Leonardo DiCaprio','Jake Gyllenhaal','Charlie Hunnam','Glenn Morshower','Harrison Ford','A.J. Buckley','Kelly Macdonald','Sofia Boutella','John Ratzenberger','Tzi Ma','Oliver Platt','Robin Wright','Channing Tatum','Christoph Waltz','Jim Broadbent','Jennifer Lawrence','Christian Bale','John Ratzenberger','Amy Poehler','Robert Downey Jr.','Chloë Grace Moretz','Will Smith','Jet Li','Will Smith','Jimmy Bennett','Tom Cruise','Jeanne Tripplehorn','Joseph Gordon-Levitt','Amy Poehler','Scarlett Johansson','Robert Downey Jr.','Chris Hemsworth','Angelina Jolie Pitt','Gary Oldman','Tamsin Egerton','Keanu Reeves','Scarlett Johansson','Jon Hamm','Judy Greer','Damon Wayans Jr.','Jack McBrayer','Tom Hanks','Vivica A. Fox','Gerard Butler','Nick Stahl','Bradley Cooper','Matthew McConaughey','Leonardo DiCaprio','Mark Chinnery','Aidan Turner','Paul Walker','Brad Pitt','Jennifer Lawrence','Jennifer Lawrence','Nicolas Cage','Jimmy Bennett','Johnny Depp','Justin Timberlake','Dominic Cooper',
'J.K. Simmons','Bruce Spence','Jennifer Garner','Zack Ward','Anthony Hopkins','Robert Pattinson','Robert Pattinson','Will Smith','Will Smith','Johnny Depp','Janeane Garofalo','Christian Bale','Bernie Mac','Robin Williams','Hugh Jackman','Essie Davis','Josh Gad','Steve Bastoni','Chris Hemsworth','Tom Hardy','Tom Hanks','Chris Hemsworth','Chloë Grace Moretz','Kelli Garner','Liam Neeson','Johnny Depp','Tom Cruise','Anthony Hopkins','Christoph Waltz','Matthew Broderick','Angelina Jolie Pitt','Seychelle Gabriel','Philip Seymour Hoffman','Channing Tatum','Elisabeth Harnois','Hugh Jackman','Hugh Jackman','Ty Burrell','Brad Pitt','Jada Pinkett Smith','Toby Stephens','Ed Begley Jr.','Bruce Willis','Will Smith','Robin Wright','J.K. Simmons','Tom Cruise','Hugh Jackman','John Michael Higgins','Tom Cruise','Christian Bale','Chris Hemsworth','J.K. Simmons','Gerard Butler','Gerard Butler','Sam Shepard','Matt Frewer','Jet Li','Kevin Rankin','Channing Tatum','Matthew McConaughey','Steve Buscemi','Chris Evans','Colin Salmon','James DArcy','Robert Pattinson','Robin Williams','Ty Burrell','Don Johnson','Mark Rylance','Leonardo DiCaprio','Ryan Reynolds','Johnny Depp','Benedict Cumberbatch','Matt Damon','Angelina Jolie Pitt','Judy Greer','Jennifer Lawrence','Robert Pattinson','Jim Parsons','Tom Cruise','Will Smith','Salma Hayek','Angelina Jolie Pitt','Anthony Hopkins','Toby Jones','Daniel Radcliffe','Essie Davis','Will Smith','Alfre Woodard','Rupert Grint','Robin Williams',
'J.K. Simmons','Daniel Radcliffe','Ryan Reynolds','Mark Chinnery','Johnny Depp','Rupert Grint','Jennifer Lawrence','Tom Hanks','Miguel Ferrer','Hugh Jackman','Paul Walker','Robert Downey Jr.','Liam Neeson','Ronny Cox','Tony Curran','Jeremy Renner','Michael Gough','Clint Howard','Jake Gyllenhaal','Tom Cruise','Karen Allen','Chris Evans','Suraj Sharma','Nicolas Cage','Matt Damon','Demi Moore','Michael Fassbender','Nathan Lane','Matt Damon','Vin Diesel','Gary Oldman','Scott Porter','Shelley Conn','Tom Cruise','Morgan Freeman','Natalie Portman','Natalie Portman','Steve Buscemi','Hugh Jackman','Natalie Portman','Ryan Reynolds','Alain Delon','Nicolas Cage','Chris Hemsworth','Noel Fisher','Phaldut Sharma','Jamie Renée Smith','Stephen Amell','Tim Blake Nelson','Robin Williams','Dwayne Johnson','Vincent Schiavelli','Heath Ledger','Brad Pitt','Brad Pitt','Kate Winslet','Leonardo DiCaprio','James Corden','Christoph Waltz','George Peppard','Eva Green','Mahadeo Shivraj','Steve Buscemi','Naomi Watts','Hugh Jackman','Jacob Tremblay','Jason Patric','Harrison Ford','Bruce Willis','Christopher Lee','Jim Broadbent','Will Smith','Sean Hayes','Will Smith','Liam Neeson','Chazz Palminteri','Oprah Winfrey','Matt Damon','Mathew Buck','Scarlett Johansson','Del Zamora','Nicolas Cage','Djimon Hounsou','Tom Cruise','Daniel Radcliffe','Eva Green','Cary-Hiroyuki Tagawa','Joe Morton','Johnny Depp','Denzel Washington','Jamie Lee Curtis','Denzel Washington','Robert De Niro','Dwayne Johnson','Vanessa Williams','Leonardo DiCaprio','Demi Moore','Eartha Kitt','Jason Statham','Nicolas Cage','Djimon Hounsou','Catherine OHara','Hugh Jackman','Josh Hutcherson','Johnny Depp','CCH Pounder','Leonardo DiCaprio','Leonardo DiCaprio','Michael Gough','Jake Busey','Tom Hanks','Abbie Cornish','Frances Conroy','Dwayne Johnson','Joseph Gordon-Levitt','Will Ferrell','Jason Statham','Ray Winstone','Jamie Kennedy','Chris Hemsworth','Rosario Dawson','Matt Damon','Francesca Capaldi','Ben Gazzara','Dwayne Johnson',
'Leonardo DiCaprio','Christian Bale','Jeff Bridges','Jon Lovitz','Ioan Gruffudd','Will Ferrell','Milla Jovovich','Chris Noth','Frank Welker','Peter Dinklage','Hayley Atwell','Michael Imperioli','Alexander Gould','Orlando Bloom','Christopher Lee','Jeff Bridges','Angelina Jolie Pitt','Johnny Depp','Michael Jeter','James Franco','Martin Short','Bruce Willis','Dennis Quaid','Holly Hunter','Christopher Masterson','Logan Lerman','Will Smith','Tom Hanks','Denzel Washington','Mei Melançon','Harrison Ford','Will Forte','Denis Leary','Adam Scott','Bill Murray','Leonardo DiCaprio','Ming-Na Wen','Robert Downey Jr.','Robin Wright','Bruce Willis','Robert Downey Jr.','Morgan Freeman','Leonard Nimoy','Bella Thorne','Tom Cruise','Adam Sandler','Peter Dinklage','Haley Joel Osment','Marsha Thomason','Matthew McConaughey','Greg Grunberg','Curtiss Cook','Logan Lerman','Gerard Butler','Daniel Radcliffe','Alun Armstrong','Brad Pitt','Don Cheadle','Anne Hathaway','Robin Williams','Don Cheadle','Harrison Ford','Liam Neeson','Tim Blake Nelson','William Smith','Paddy Considine','Shirley Henderson','Jeff Bridges','Philip Seymour Hoffman','Paul Walker','Tom Hanks','Robin Williams','Matt Damon','Harrison Ford','Brad Pitt','Milla Jovovich','Steve Buscemi','Jeff Bennett','Caroline Dhavernas','Denzel Washington','Ioan Gruffudd','Matthew Broderick','Kate Winslet','Will Smith','Meryl Streep','Al Pacino','Jon Favreau','Kate Winslet','Bob Hoskins','Dwayne Johnson',
'F. Murray Abraham','Li Gong','Amber Stevens West','Jim Broadbent','Anthony Hopkins','Raymond Cruz','Roy Scheider','Julia Roberts','Anna Kendrick','Glenn Morshower','Larry Miller','Sarah Michelle Gellar','Wood Harris','Adam Sandler','Ted Danson','Jack McBrayer','Kristen Stewart','Seth MacFarlane','Robert Downey Jr.','Robert Duvall','Morgan Freeman','Jason Statham','Tom Cruise','Jennifer Lawrence','Bradley Cooper','Michael Gough','Bruce Willis','Tia Carrere','Steve Buscemi','Morgan Freeman','Bruce Willis','Adam Sandler','Amy Poehler','Steve Buscemi','Bill Murray','Keanu Reeves','Leonardo DiCaprio','Jon Favreau','Jim Broadbent','Nicolas Cage','Adam Sandler','Tom Hanks','Adam Sandler','Elden Henson','Steve Buscemi','Rosario Dawson','Philip Seymour Hoffman','Denzel Washington','Robin Williams','Liam Neeson','Bill Murray','Roger Rees','Keanu Reeves','Julia Roberts','Brad Pitt','Harrison Ford','Justin Timberlake','Matt Damon','Rosario Dawson','Gary Oldman','Denzel Washington','Vanessa Redgrave','Steve Buscemi','Elizabeth Montgomery','Quincy Jones','Mark Addy','Charlize Theron','Hugh Jackman','Michael Emerson','Robin Williams','Adam Sandler','Matt Damon','Natalie Portman','Nissim Renard','Anthony Hopkins','Bruce Willis','Bruce Greenwood','Sylvester Stallone','Charlie Rowe','Richard Tyson','Brendan Fraser','Fergie','Paul Walker','Olivia Williams','Adam Goldberg','Vin Diesel','Bob Neill','Mia Farrow','Pedro Armendáriz Jr.','David Oyelowo','Sasha Roiz','Sariann Monaco','Adam Goldberg','Matthew Broderick','Josh Hutcherson','Will Forte','Philip Seymour Hoffman',
'J.K. Simmons','Al Pacino','Paul Walker','Jeff Bridges','Roger Rees','Robert De Niro','Steve Coogan','Jason Flemyng','Steve Carell','Will Smith','Ariana Richards','Jada Pinkett Smith','Charlie Hunnam','Hugh Jackman','Angelina Jolie Pitt','Nicolas Cage','Denis Leary','Adam Sandler','Jerry Stiller','James DArcy','Matthew Broderick','Morgan Freeman','Steve Buscemi','Tom Hanks','Harold Perrineau','Don Cheadle','Nicholas Lea','Philip Seymour Hoffman','Robert De Niro','Loretta Devine','Adam Arkin','Dwayne Johnson','Ayelet Zurer','Bruce Willis','Tom Selleck','Henry Cavill','Adam Sandler','Steve Buscemi','Bruce Willis','Julia Ormond','Bai Ling','Henry Cavill','Jimmy Bennett','Matt Damon','Harrison Ford','Connie Nielsen','Christopher Meloni','Brendan Fraser','Dennis Quaid','Robin Wright','Steve Carell','Jon Hamm','Nicolas Cage','Peter Coyote','Peter Dinklage','Matthew McConaughey','Adam Sandler','Jennifer Garner','Will Ferrell','Raven-Symoné','Mhairi Calvey','Jake Gyllenhaal','Albert Brooks','Martin Landau','Sylvester Stallone','David Gant','Bryce Dallas Howard','Oliver Platt','Rory Culkin','Rupert Everett','John Ratzenberger','Julia Roberts','Vin Diesel','Tim Conway','Lili Taylor','Michael Fassbender','Robin Williams','Dwayne Johnson','Bruce Willis','Jeremy Renner','Nicole Beharie','Tom Cruise','Bryce Dallas Howard','Sanaa Lathan','Amy Poehler','Jon Hamm']

In [None]:
ctr=0
for actor in actorlist:
    if ctr<20:                               #we add this conditional so that it only prints out the first 20 examples to save space below
        print(Initials(actor),"=",actor)
        ctr+=1

##without the if command: 
#for actor in actorlist:
#    print(Initials(actor),"=",actor)

Now, we are going to search across the entire SOTU corpus for a particular search term, count its frequency in each text using a function we create, and then store that count in a dataframe.

21. First, we will apply a function to search for a term within a specific tokenized list. 

In [None]:
def getWordFreq (term,ltoks):
    #ltoks = [tok.lower() for tok in toks]
    tokfreqs=collections.Counter(ltoks)
    wordFreq = tokfreqs[term]
    return(wordFreq)

21b. Let's test that function on one text: with the words stored in our ltokens2 list

In [None]:
searchTerm = "government"
#to apply to one text
print(getWordFreq(searchTerm,ltokens2))
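Two details of this kind of lookup are worth checking on a toy example: a Counter returns 0 (rather than raising an error) for a word it has never seen, and the lookup is case-sensitive. The tokens below are made up for illustration.

```python
from collections import Counter

sample_counts = Counter(["government", "people", "government"])   # made-up lowercase tokens

print(sample_counts["government"])   # a word that is present
print(sample_counts["railroad"])     # a word that is absent: no error, just a count of zero
print(sample_counts["Government"])   # capitalized form does not match the lowercase tokens
```

Because our "ltoks" lists are all lower case, search terms should be typed in lower case too; otherwise the count will silently come back as zero.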

21c. We can then apply this function to the entire SOTU corpus using the tokens list we stored in our dataframe. First, let's review our dataframe:

In [None]:
sotudf.head()

To create a new column in a dataframe, we simply start a line of code with:

```
dfname['newColName'] = [insert instructions for deriving values for column's cells]
```

The following code creates a new column ("wordFreq"), which is calculated by applying (with the .apply() function) our function "getWordFreq" 
to each value ("x") of the column "ltoks". The getWordFreq function reads in not only the value of the ltoks cell (x) but also our searchTerm
(which we set as "government" above).

In [None]:
sotudf['wordFreq'] = sotudf['ltoks'].apply(lambda x: getWordFreq(searchTerm,x))
sotudf.head()

Below we will create a new column to calculate the frequency per million words of our searchTerm.

So, for example, if it appears 10 times in a 10,000 word address, it will have a freq_perMillion score of 1000.

In [None]:
sotudf['freq_perMillion'] = sotudf['wordFreq']/sotudf['numtoks']*1000000
sotudf.head()

In [None]:
#this code just temporarily sorts our dataset by the freq_perMillion column (in descending order)
#if we wanted to make this sorting permanent, we would have to add "sotudf = " before the line of code below
sotudf.sort_values('freq_perMillion',ascending=False).head()

Now, we will create a simple barplot using the Seaborn package/library for which we assigned the initials "sns" when we imported it at the beginning of this lesson.

Notice how simple this code is. We just identify the dataframe we are drawing data from, and the names of the columns for the x and y data.

In [None]:
sns.barplot(data=sotudf, x="year", y="wordFreq")

The previous plot looks similar to the plot we made in the Strings and Files lesson which compared the length of each SOTU address. So, it seems not to be revealing the pattern we want it to reveal.

It is always a good idea in data science to consider the denominator. When should you use absolute values? When should you use percentages or proportions? And, for the latter, what is the correct denominator to choose? In this case, we will use our "freq_perMillion" column which is calculated using the total word (or token) count as our denominator.

In [None]:
sns.barplot(data=sotudf, x="year", y="freq_perMillion")

## Part VI: Group and Plot Data by President

22. It would help to simplify this visualization. Following *The Atlantic*'s graphic, we can aggregate these results by President. Normally, we would just apply the **groupby** function to group this data by the president's name.

Unfortunately, when I first tried this, I realized that presidents with the same last name were being grouped together (think of the Adamses, Roosevelts, and Bushes). So, first we need to identify each unique president. To do this, we can use the **shift()** function to identify each time a new president's name appears in our chronologically-ordered dataframe. Thus, George H.W. Bush's administration (1989-1993) can be distinguished from his son's (2001-2009). *This works because, fortunately for our purposes, there has always been a gap between two presidents sharing the same name. If <s>Hunter</s>Ashley Biden is our next President then I will need to add first names to our dataset.*

In [None]:
sotudf["presnum"] = (sotudf["pres"] != sotudf["pres"].shift()).cumsum()
sotudf.head(20)

To explain the code above: the expression (sotudf["pres"] != sotudf["pres"].shift()) returns True whenever the president in the current row differs from the one in the previous row (including the very first row, which has no previous row to compare against). The function .cumsum() then adds 1 to its running count each time the president changes (that is, each time the expression returns True).

You can see the result in the 'presnum' column.
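
The sketch below replays the same shift-and-cumsum idea on a tiny, made-up dataframe so each intermediate step is visible. The dataframe `toy` and the helper column `changed` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data: the second "Bush" run is a different president
toy = pd.DataFrame({
    "year": ["1991", "1992", "1994", "1995", "2002", "2003"],
    "pres": ["Bush", "Bush", "Clinton", "Clinton", "Bush", "Bush"],
})

# True wherever the name differs from the row above (the first row is
# True because shift() leaves a missing value there)
toy["changed"] = toy["pres"] != toy["pres"].shift()

# A running total of the True values gives one number per consecutive run
toy["presnum"] = toy["changed"].cumsum()
print(toy)
```

The two "Bush" runs end up with different presnum values (1 and 3), which is what lets a later groupby keep them apart.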

22b. Now, we are going to create a new dataframe ("sotudf2") by grouping together the data in "sotudf" by the "presnum" column (thus aggregating the data by President rather than just by year). The .agg() function then establishes which other columns we want to keep and how the data in those columns will be aggregated ("sum", "mean", and "first" being common options). We will then have to re-calculate our proportional variable ("freq_perMillion").

In [None]:
sotudf2 = sotudf.groupby(['presnum']).agg({'pres':'first','wordFreq':'sum','numtoks':'sum','year':'first'})
sotudf2['freq_perMillion'] = sotudf2['wordFreq'] / sotudf2['numtoks'] * 1000000
sotudf2.head(10)
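
One likely reason for recalculating the per-million figure from the summed columns, rather than averaging each year's rate, is that the two approaches can give different answers when addresses differ in length. The sketch below uses made-up counts for a single president to show the difference; all numbers are hypothetical.

```python
# Hypothetical counts for one president's two addresses
word_freq = [2, 40]           # hits for the search term in each address
num_toks = [1_000, 40_000]    # total tokens in each address

# Option 1: average the per-address rates (weights both addresses equally)
rates = [wf * 1_000_000 / nt for wf, nt in zip(word_freq, num_toks)]
mean_of_rates = sum(rates) / len(rates)

# Option 2: sum first, then divide (the approach of summing wordFreq and numtoks)
pooled_rate = sum(word_freq) * 1_000_000 / sum(num_toks)

print(rates)            # per-address rates
print(mean_of_rates)    # average of the two rates
print(pooled_rate)      # rate over all of the president's words
```

Here the very short address pulls the simple average upward, while the pooled rate reflects the share of all words the president actually spoke.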


23. Okay, let's see what our graphic looks like:

Note: we have cleaned up the axis labels and added a title using the .set command in Seaborn

In [None]:
g=sns.barplot(data=sotudf2, x="year", y="freq_perMillion")
g.tick_params(labelrotation=90)
g.set(title = "Frequency of '%s' in State of the Union Addresses"%searchTerm)
g.set(ylabel='per million words', xlabel='President')
g.set(xticklabels = sotudf2.pres); #adding the ";" removes the annoying text that Python sometimes prints out with a graphic

24. Below, I copied the code from **Step 14 - Step 23**, but this time placed into three functions. *Note: all code within a function must be indented. Once the indentation ends, so does the function.* 

The **sotuWordSearch2** function reads in a column filled with lower-case tokenized lists (assigned "ltoks" here) and a search term and then creates a new column calculating the number of appearances of that search term in each ltoks list.

**df_wordFreqCalc** reads in an entire dataframe (but only if it has the five required columns) and then sends one of these columns, "ltoks", to the function sotuWordSearch2 to calculate the frequency of the search term, which is then stored in the column "wordFreq". Then it aggregates the entire dataframe by president (grouping on "pres" and "presnum"), calculates a frequency per million words, and sorts the dataframe by year.

**createWordFreqPlot** reads in a dataframe (with 3 required columns) and a searchTerm, and creates a bar plot of the frequencies of that search term.

In [None]:
def sotuWordSearch2(ltoksCol,searchTerm): #returns a column of frequencies after searching for a term across a column of lower-case tokens
    #searchTerm = searchTerm.lower()
    wordFreq = ltoksCol.apply(lambda x:collections.Counter(x)[searchTerm])
    return(wordFreq)

def df_wordFreqCalc(old_df,searchTerm):   #reads in a dataframe of SOTU addresses by year and a searchTerm 
                                          ## returns a dataframe aggregated by President, with the 'wordFreq' and 'freq_perMillion' calculated for each president
                                          ## which is calculated using the sotuWordSearch2 function
    requiredCols = ['ltoks','numtoks','pres','presnum','year']
    if not set(requiredCols).issubset(old_df.columns):
        print("missing required column from:",requiredCols)
        return None
    old_df['wordFreq'] = sotuWordSearch2(old_df['ltoks'],searchTerm) 
    new_df = old_df.groupby(['pres','presnum']).agg({'wordFreq':'sum','numtoks':'sum','year':'first'})
    print(new_df.head(2))
    new_df['freq_perMillion'] = new_df['wordFreq'] / new_df['numtoks'] * 1000000
    new_df = new_df.sort_values(['year'])
    new_df = new_df.reset_index()
    print("searching for... :",searchTerm)
    return(new_df)
 
def createWordFreqPlot(df,searchTerm):            #reads in our aggregated SOTU dataframe and creates a bar plot of the search term
    #newdf = df_wordFreqCalc(df,searchTerm)
    requiredCols = ['freq_perMillion','pres','presnum']
    if not set(requiredCols).issubset(df.columns):
        print("missing required column from:",requiredCols)
        return None
    g=sns.barplot(data=df, x="presnum",y="freq_perMillion")
    g.tick_params(labelrotation=90)
    g.set(title = "Frequency of '%s' in State of the Union Addresses"%searchTerm)
    g.set(ylabel='per million words', xlabel='President')
    g.set(xticklabels = df.pres)
    return(g)
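
The counting step inside sotuWordSearch2 rests on a Counter lookup. As a minimal sketch (with a made-up token list standing in for one "ltoks" cell), the block below shows two behaviors worth knowing: a term that never appears returns 0 rather than raising an error, and the lookup is case-sensitive, so a capitalized search term will not match lower-cased tokens.

```python
import collections

# Hypothetical lower-cased token list standing in for one "ltoks" cell
ltoks = ["the", "government", "of", "the", "people", "and", "the", "government"]
counts = collections.Counter(ltoks)

print(counts["government"])   # 2
print(counts["freedom"])      # 0: a missing key returns 0, not a KeyError
print(counts["Government"])   # 0: the lookup is case-sensitive

# Lower-casing the search term first makes it match the lower-cased tokens
search_term = "Government"
print(counts[search_term.lower()])   # 2
```

This is the situation the commented-out `searchTerm.lower()` line in sotuWordSearch2 would guard against.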
    

25. Now we can use the functions above to quickly choose a new search term, return a dataframe with the results of this search, and then create a plot.

In [None]:
searchTerm = "freedom"
sotudf3 = df_wordFreqCalc(sotudf,searchTerm)
print(sotudf3.head(15))

In [None]:
createWordFreqPlot(sotudf3,searchTerm)

26. What additions or changes would make this plot more useful, informative, or eye-catching?

<div class="alert alert-info" role="alert"><h3 style='color:blue'>Exercise (Part VI)</h3>
    <p>27. Explore the dataset by creating bar plots showing the frequency of other words of your choosing. (Hint: we made this really simple with the functions created in Step 24. Just re-use and modify the code we used in Step 25 to call these functions).</p>

</div>

## Part VII: NGrams

28. Often, in searching for patterns in texts, single words are not the most useful units of analysis. Instead, at times, the frequency of multiple-word terms may be more instructive. For example, historians mining historical scholarship may be interested in the rise and fall of sub-disciplines in their field, such as "social history", "cultural history", "environmental history", etc. In this case, the historian may want to examine patterns in two-word combinations, called **bigrams**. In other cases, it may be useful to examine three-, four-, or five-word combinations. The general term for such word sequences of any length is **ngrams**.

First, we will extract ngrams from a simple phrase:



In [None]:
text = "to be or not to be"
toks = nltk.word_tokenize(text) 
print(toks)

In [None]:
n = 2
n_grams=list(ngrams(toks,n))
print(n_grams)


28b. We can create a frequency list of these bigrams using:

In [None]:
collections.Counter(n_grams)
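
To show what an ngram list contains without relying on NLTK, here is a plain-Python sketch. The helper `make_ngrams` is hypothetical (it is not the NLTK function), and `.split()` stands in for the tokenizer, which gives the same tokens for this punctuation-free phrase.

```python
import collections

def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = "to be or not to be".split()
bigrams = make_ngrams(toks, 2)
print(bigrams)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]

print(collections.Counter(bigrams).most_common(1))
# [(('to', 'be'), 2)]
```

A list of six tokens yields five bigrams, and only ("to", "be") occurs more than once.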

29. Now let us create a list of ngrams from a SOTU address. We can retrieve a list of tokens from our sotudf dataframe. To do that, first we will need some practice filtering a dataframe. Observe what the following do and see [this link for more](https://www.geeksforgeeks.org/difference-between-loc-and-iloc-in-pandas-dataframe/):

In [None]:
sotudf.loc[sotudf['pres'] == "Adams"]

In [None]:
print()
sotudf.loc[(sotudf['pres'] == "Adams") & (sotudf['year'] > "1820")]

In [None]:
sotudf.dtypes  #here we can print out the data types of each column. 
                ##Notice, the "year" column is not considered a number (either an integer or a float). 
                ## Hence, in the code above and below we placed the year we are searching for in quotes.
                ## Meanwhile, "numtoks" is identified as an integer so we do not place our desired number in quotes (see 2 code cells below)

In [None]:
sotudf.loc[sotudf['year'] == "1849"]

In [None]:
sotudf.loc[(sotudf['numtoks'] > 25000)]
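
Because the "year" values are strings, a comparison such as `> "1820"` is made character by character rather than numerically. That happens to give the expected answer when every year has four digits, but the plain-Python sketch below shows how the two kinds of comparison can diverge. The sample list `years` is made up for illustration.

```python
# String comparison goes character by character
print("1849" > "1820")     # True, as expected
print("999" > "1820")      # True, because "9" sorts after "1"
print(999 > 1820)          # False once both sides are numbers

# Converting to integers first makes the comparison numeric
years = ["1790", "1849", "1956"]            # hypothetical sample of the column
print([y for y in years if int(y) > 1820])  # ['1849', '1956']
```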

In [None]:
sotudf.iloc[0:5]

In [None]:
# pres = "Eisenhower"   ##not necessary to define since only one president gives an address in a given year!
year = "1956"
#the line below extracts the tokens for just one SOTU address (specified with the pres and year variables)
# toks = sotudf.loc[(sotudf['pres'] == pres) & (sotudf['year'] == year)]['tokens'].values[0]
toks = sotudf.loc[sotudf['year'] == year]['tokens'].values[0]
print(toks[:50])

29b. Now, from the list of toks we extracted from the 1956 address, we are going to create a new list of lower-cased tokens, convert both token lists into ngrams (of length 3), create a frequency list for each, and then output the 10 most common ngrams in each. Notice the slight difference between the lower-case and regular-case results. Which method is preferable?

In [None]:
ltoks = [tok.lower() for tok in toks]
n = 3                                   #n sets the number of words in phrase we will search for; experiment by adjusting this number n
n_gramslower = list(ngrams(ltoks,n))
n_grams = list(ngrams(toks,n))
ng_freqslower = collections.Counter(n_gramslower)
ng_freqs = collections.Counter(n_grams)
print(ng_freqslower.most_common(10),"\n\n***\n")
print(ng_freqs.most_common(10))

**30. With more time, we could apply the same analysis we did with individual word frequencies to analyze the frequency of ngrams of various lengths.**

Ngram analysis, however, sometimes poses other questions. 

For example, do we want to keep stopwords? Describing a person as "the leader" rather than "a leader" makes a significant difference. 

Also, would we want to keep capitalized words? Obviously, "united nations" could mean something different from "United Nations." 
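
The capitalization question can be made concrete with a short sketch. The sentence below is invented for illustration, and the helper `bigrams` is a hypothetical plain-Python stand-in for an ngram function with n = 2.

```python
import collections

def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical sentence mixing the organization and the generic phrase
toks = "The United Nations met while united nations of the alliance waited".split()

cased = collections.Counter(bigrams(toks))
lowered = collections.Counter(bigrams([t.lower() for t in toks]))

# With case kept, the two phrases are counted separately
print(cased[("United", "Nations")], cased[("united", "nations")])   # 1 1

# After lower-casing, they merge into a single entry
print(lowered[("united", "nations")])                               # 2
```

Lower-casing gives more robust counts for ordinary words, at the cost of losing distinctions like this one.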

<div class="alert alert-info" role="alert"><h3 style = "color:blue">Exercise (Part VII)</h3>

<p style = "color:blue">Create an ngram frequency list for another SOTU address. Try ngrams of different lengths.</p></div>

## Part VIII: Visualizing Word Patterns - Comparing Multiple Words

### Dispersion Plots, Frequency Graphs, etc.

31. Another way to visualize changing word frequencies over time is through a dispersion plot. One way to create a dispersion plot is to work with a master text that contains all SOTU texts arranged in chronological order. We can access such a "master text" in the "allSotus.txt" file.

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
with open("allSotus.txt",encoding='utf-8') as f:
    allSotus = f.read()
print(len(allSotus))

#fdist = Counter(nltk.word_tokenize(allSotus))
#tokens = word_tokenize(allSotus) #this takes a long time
alltoks=tokenizer.tokenize(allSotus)
print(len(alltoks))
print(sotudf['numtoks'].sum())  #this sums up all the numtoks values for each address in our sotudf to see if the total equals that from our master text ("allSotus")
txt = nltk.Text(alltoks)




In [None]:
print(len(alltoks))
txt.dispersion_plot(["taxes", "democracy", "freedom", "God", "Indians"])   #the dispersion_plot function is from the nltk library
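
A dispersion plot marks, for each target word, the positions in the token list where that word occurs. The sketch below computes those positions in plain Python for a made-up token list. It illustrates the idea only and is not NLTK's implementation; matching here is exact, so "God" and "god" would be treated as different words.

```python
# Hypothetical mini "master text" in chronological order
tokens = "freedom and taxes then taxes and more taxes until freedom".split()
targets = ["taxes", "freedom"]

# For each target word, record every position where it occurs
offsets = {w: [i for i, tok in enumerate(tokens) if tok == w] for w in targets}
print(offsets)
# {'taxes': [2, 4, 7], 'freedom': [0, 9]}
```

Plotting each list of positions along a horizontal axis, one row per word, gives the familiar dispersion picture: clusters of marks show where in the corpus a word is concentrated.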

## Appendix 1: Creating a Graph of the Lexical Diversity of SOTU texts.

### Lexical Diversity



We can quantify the diversity and complexity of a text's vocabulary by calculating its **lexical diversity**. This is calculated simply by dividing the number of unique words in a text by the total number of words. A text with high lexical diversity has a high proportion of unique words, whereas a text with low lexical diversity repeats the same words frequently.

In [None]:
#we will use ltokens2, which you may recall excludes punctuation but not stopwords,
#because we want to keep the stopwords as part of our count
print("# of words = ",len(ltokens2))
print("# of unique words = ",len(set(ltokens2)))
lexdiv=len(set(ltokens2))/len(ltokens2)  #unique words divided by total words
print(lexdiv)
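
As a small, self-contained illustration, the sketch below treats lexical diversity as unique words divided by total words (often called the type-token ratio). The helper name `type_token_ratio` and the sample token lists are hypothetical.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

toks = "to be or not to be".split()
print(len(toks), len(set(toks)))      # 6 4
print(type_token_ratio(toks))         # 4 unique out of 6 total

# Heavy repetition lowers the ratio; all-distinct tokens give 1.0
print(type_token_ratio(["tax"] * 10))
print(type_token_ratio(["a", "b", "c"]))
```

Keep in mind that longer texts tend to score lower on this ratio simply because common words recur, so comparisons are most meaningful between texts of similar length.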

*We will come back to this. For example, after learning how to iterate through a corpus of texts, we will compare the lexical diversity of the SOTU texts to examine how it has changed over time and from president to president.*

In [None]:
start = time.time()
#sotudf['numUniq'] = len(set(sotudf['tokens']))
sotudf['numUniq'] = sotudf['tokens'].apply(lambda x:len(set(x)))
#sotudf['lexdiv'] = sotudf[ 'numtoks']/sotudf['numUniq']
def doDiv (numer,denom):
    if denom>0:           #guard against dividing by zero if an address has no tokens
        ans = numer/denom
    else:
        ans = 0
    return(ans)
sotudf['lexdiv'] = sotudf.apply(lambda x: doDiv(x.numUniq,x.numtoks), axis = 1)
print(time.time() - start)
sotudf.head(20)

In [None]:
g=sns.barplot(data=sotudf, x="year",y="lexdiv")
g.tick_params(labelrotation=90)
g.set(title = "Lexical Diversity of State of the Union Addresses")
g.set(ylabel='Unique Words as Proportion of Total Words', xlabel='Year')
#g.set(xticklabels = sotudf.pres);