# Text Analysis in Python 1: Working with Strings & Files

<h1 style="text-align:center;font-size:300%;">The State of the Union is ____?</h1> 
  <img src="https://miro.medium.com/max/720/1*pp7HX01jBv2wbVRW9Ml_mA.png" style="width:80%;">


This tutorial offers a basic introduction to text analysis in Python. It is designed for researchers of all levels; no prior knowledge is necessary. These **Jupyter Notebooks** are meant to work for both novice and intermediate users. If you are new to Python or Jupyter, we recommend completing the sections marked **Python Basics** after the lesson. The lesson itself focuses on the code and basic skills you need to get started with text analysis.

## Structure of Notebooks

These Jupyter Notebooks are designed to integrate instructions and explanations (in the white "markdown" cells below) with hands-on practice with the code (in the gray "code" cells below).

<h3 style="color:green">Code Together:</h3><p style="color:green">In these cell blocks, we will code together. You can find the completed version in our shared folder (ending with "_completed.ipynb").</p>

<h3 style="color:blue;">Exercises:</h3><p style="color:blue">are in blue text. These are a chance to practice what you have learned.</p>

<h3 style="color:purple">Python Basics - Additional Practice</h3><p style="color:purple">are in purple text. Work on these after the lesson if you would like more practice.</p>

## Can't you just read the books the old-fashioned way? 

### Why code?

<!--+ **accessibility**: learning to work with huge amount of data-->
+ **scale-ability**: scale up from one paragraph to a million books
+ **automate** the tedious; spend more time on the fun stuff
+ **reproducibility**: do it once, do it a thousand times
    + **Reproducible Research**: There are also increasing calls, especially in the sciences, for data to be published alongside research so that other scholars can reproduce and test the results
	<!--+ Reproducing exact results in the humanities is probably both impossible and antithetical to humanities research. Nonetheless, there is a movement for humanities people to publish and preserve their "datasets". As an Indigenous Studies scholar, I think this is especially important as the most accessible texts and sources are often the most problematic and many scholars spend years uncovering alternative accounts or analyzing well-known accounts in more critical ways. Why not allow young scholars to build off your work and then take it further?-->
+ **flexibility**: only limit on your choices is your imagination
    + As opposed to out-of-the-box software that limits you to the imagination and constraints of the software developers
+ **affordability**: free to run
+ **transferability**: convert files from one system to another
    + Many forms of proprietary software push the user to save their data in data formats that only that software can read
+ **longevity**: working with plain text files and .csv files increases the likelihood that your data can still be read and processed 20 or 40 years from now




### Why text analysis?

## Why do Text Analysis with Python?

## This Tutorial

In this tutorial and notebook, you will practice working with a dataset or corpus of a well-known series of texts: the yearly State of the Union addresses given by Presidents of the United States since 1790.


## Part I. Setup

### Downloading and Saving Dataset(s)

1. Find the Class folder of code and data at ????.
2. Save this folder in an easy to find place on your own computer (suggestion: save it 

### Getting Started with Course Jupyter Notebooks

There are two main types of cells in Jupyter. This is a **markdown** cell.

In [None]:
# This is a code cell.
# To comment out a line, add a # at the beginning

print("Hello world my name is XXX!")  #replace "XXX" with your name
print("This is the line seemingly every intro programming lesson begins with.")
#To run the code in this cell, hit CTRL + ENTER or click on the Run/Play button at the top of the screen.

To create new cells below this one, type ESC + B. To add new cells above this one, type ESC + A. To change a coding cell (for the machine to read) to a text / markdown cell (with notes or instructions for humans), type ESC + M. To do the opposite, type ESC + Y. You may also add, delete, or change cells using the menu options above. For more keyboard shortcuts, click on the Help tab above --> Keyboard Shortcuts (or just type ESC + H).

<h3 style="color:blue;">Exercise: Create New Coding Cells / Practice with Basic Python</h3>

<p style="color:blue;">Please create some new coding cells below.</p>
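
For reference, here is one possible way to approach the prompts in the exercise cells. It is only a sketch: every variable name and value in it is made up for illustration, and your own cells can look quite different.

```python
# Assign a number to a variable, then do some math with it.
pages = 12
pages_doubled = pages * 2

# Create two strings and concatenate them with +.
first_sentence = "Text analysis starts with strings."
second_sentence = "Lists hold many items at once."
combined = first_sentence + " " + second_sentence

# Create lists, add an item with append(), and sort them.
names = ["Toni", "Ada", "Louise"]   # made-up example names
numbers = [3, 1, 2]
names.append("Chinua")
names.sort()      # sorts the list in place, alphabetically
numbers.sort()    # sorts the list in place, smallest to largest

# Combine the two lists with + and print the result.
everything = names + numbers
print(pages_doubled)
print(combined)
print(everything)
```

Note that `sort()` changes the list itself rather than returning a new one, which is why the sorted lists are not assigned to new variables.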

In [None]:
#we are going to begin with some basic Python, let's first assign a number to a variable


In [None]:
# then apply some additional math to that variable


In [None]:
#create some simple strings (in this case sentences) and assign them to variables


In [None]:
#concatenate the strings (sentences)


In [None]:
#create a list of people names


In [None]:
#create a list of numbers


In [None]:
# add an item to the list


In [None]:
#sort the lists you have created



In [None]:
#combine and print the lists


## Part II: Importing Python Packages or Libraries

Before beginning, we need to import some packages. Often, we need to install and import customized Python packages (sometimes called "modules") in addition to the core functions (like **print()**, **len()**, **sum()**, and others).
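
The difference between core functions and imported packages can be sketched with a few modules that ship with Python itself, so nothing needs to be installed. The values and the file name here are made up for illustration.

```python
import math                      # import a whole module; use it as math.<name>
from pathlib import Path         # import one name from a module
import statistics as stats       # import a module under a shorter alias

values = [2, 4, 4, 10]

# Core (built-in) functions need no import.
total = sum(values)
count = len(values)

# Functions from imported modules are reached through the module name or alias.
root = math.sqrt(16)
average = stats.mean(values)
suffix = Path("example.txt").suffix   # "example.txt" is a made-up file name

print(total, count, root, average, suffix)
```

The three import styles shown here (whole module, single name, alias) are the same ones used in the import cell of this notebook.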

[ADD COMMENT ABOUT INSTALLING PACKAGES, DEPENDING ON THE SETUP WE HAVE FOR STUDENTS TO USE]


In [None]:
import os, pathlib, glob #the os package allows us to navigate through the files on our own computers
from pathlib import Path #the pathlib package helps us work with file paths
#for more on using pathlib see: https://builtin.com/software-engineering-perspectives/python-pathlib
import nltk,re #we can import multiple packages on one line using commas to separate new package names
#import matplotlib as plt   #matplotlib and seaborn are used here to create graphs, charts, and other visualizations
import matplotlib.pyplot as plt #needed for xticks
import seaborn as sns


plt.rcParams['figure.figsize'] = [16, 10]  #changes default figure size to make larger plots

%config InteractiveShellApp.matplotlib = 'inline'
%config InlineBackend.figure_formats = ['svg']

#Press CTRL+Enter to run this codeblock! 

<h3 style="color:green;"> Code Together:</h3>

<p style="color:green">We will also need to use the "collections" package as well. Let's import that in the code cell below:</p>



In [None]:
#note: when importing packages, Python will only print something out if there is an error. 


#Press CTRL+Enter to run this codeblock!

## Part III: Navigating through your computer's files and folders

1. To work with the State of the Union addresses you downloaded (hereafter: SOTU), we will need to navigate to the folder you placed them in. First, check the "current working directory" that Python is working with:



In [None]:
#note: to navigate through your files we will be using the Python library pathlib, which has become the preferred package 
#   for this task. However, I will also include the code for using the more traditional method (with the package os), 
#   but commented out.
print(Path.cwd())
#print(os.getcwd()) 

#Press CTRL+Enter to run this codeblock! This is the last time this reminder will be provided.

2. Your current working directory, printed out in the previous step, should be the location where you saved this notebook. Before moving on, double-check to make sure that you also saved the "sotu" folder of texts in that same directory.
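
If you would rather have Python confirm this than scan a long directory listing by eye, a small check like the following may help. It is a sketch only: `has_subfolder` is a hypothetical helper written for this example, not part of pathlib.

```python
from pathlib import Path

def has_subfolder(base, name):
    """Return True if a folder called `name` sits directly inside `base`."""
    return (Path(base) / name).is_dir()

# Check the current working directory for the "sotu" folder.
if has_subfolder(Path.cwd(), "sotu"):
    print("Found the sotu folder.")
else:
    print("No sotu folder here; check where you saved it.")
```

The `/` operator joins path parts in pathlib, so `Path.cwd() / "sotu"` works the same way on Windows, macOS, and Linux.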

In [None]:
list(Path.cwd().iterdir())  #lists everything in the current working directory
#os.listdir()

#Do you see the "sotu" folder in the list below?

3. Next, we will look inside the "sotu" folder containing our corpus of State of the Union speeches. We can learn something about this dataset simply by examining the titles of the individual files.

In [None]:
sotudir=Path("sotu")
#print(list(sotudir.iterdir())) #to get fullpath
print({item.suffix for item in sotudir.iterdir()})  #get unique suffixes or file extensions in sotudir
[item.name for item in sotudir.iterdir()] #to get filename only

## Part IV: Reading Files and Examining Their Contents

1. Open one SOTU text.
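
Reading a file inside a `with` block means Python closes the file for you as soon as the block ends. The sketch below illustrates this with a small made-up text written to a temporary folder, so it runs without the SOTU files.

```python
import tempfile
from pathlib import Path

# Write a small made-up text to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = Path(tmpdir) / "sample.txt"
    sample_path.write_text("A naïve résumé.", encoding="utf-8")

    # The file is open only inside the indented block.
    with open(sample_path, encoding="utf-8") as f:
        text = f.read()
        print(f.closed)   # False: still inside the with block

    print(f.closed)       # True: closed automatically when the block ended

print(text)
```

The text read from the file stays available in the variable `text` even after the file itself has been closed.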

In [None]:
with open(Path("sotu","Bush_2002.txt"),encoding='utf-8') as f:
    bush02 = f.read()


## ** specifying the utf-8 encoding may not always be necessary,
## but it is good practice whenever you work with languages other than English (special characters can appear in English too,
## as in loan words like naïve and résumé)


##[DISCUSS WHY IT IS GOOD PRACTICE TO CLOSE FILES IMMEDIATELY AFTER YOU ARE DONE WITH THEM]

In [None]:
## we can view the whole text simply by typing the variable name
bush02  #Jupyter displays the value of the last line of a code cell automatically; anything earlier in the cell needs print() to be shown

2. What do the following blocks of code do? Run them and then share your answer.

In [None]:
print(len(bush02))

In [None]:
bush02[0:20]

In [None]:
bush02[:20]  #this is exactly the same as bush02[0:20] 

In [None]:
bush02[20:40]

In [None]:
bush02[-60:]

<h3 style="color:blue;">Exercises for Part IV</h3>
    
<p style="color:blue;">1. Add a coding cell below and print out the first and last 200 characters in the Bush 02 speech.</p>
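
As a reminder of how slicing works, here is a short sketch on a made-up stand-in string rather than the speech itself. The same patterns apply to any string, whatever its length.

```python
sample = "Four score and seven years ago"   # made-up stand-in for a longer text

print(len(sample))     # number of characters, spaces included
print(sample[0:4])     # characters at positions 0, 1, 2, 3
print(sample[:4])      # same thing: a missing start means "from the beginning"
print(sample[5:10])    # positions 5 through 9
print(sample[-3:])     # the last three characters
```

The end position of a slice is not included in the result, which is why `sample[0:4]` returns four characters rather than five.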

### IVb. Divide a text into tokens

<p style = "color:green">The **split()** method divides a string by a delimiter. Called with no arguments, it splits on any run of whitespace (spaces, tabs, and newlines). Let's split the following two items: a sentence and a series of phone numbers.</p>
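
How much the choice of delimiter matters is easiest to see on a deliberately messy string. The strings in this sketch are made up for illustration.

```python
messy = "one  two\tthree\nfour"   # two spaces, a tab, and a newline between words

default_split = messy.split()      # no argument: split on any run of whitespace
space_split = messy.split(" ")     # explicit single space: every space is a boundary

print(default_split)
print(space_split)

dates = "1790-01-08; 1862-12-01; 2002-01-29"   # illustrative string of dates
print(dates.split("; "))           # the delimiter can be longer than one character
```

With an explicit `" "`, the double space produces an empty string in the result and the tab and newline are left untouched, which is usually not what you want when splitting ordinary prose.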

In [None]:
# sent = "This is a simple sentence; or maybe not, as it contains multiple clauses - and different forms of punctuation."


In [None]:
phonenums = "555-755-8340, 555-831-2911, 555-442-9182"
phonenums.split(", ")

We can "tokenize" this SOTU text using Python's built-in string method "split()". See the results below:


In [None]:
rawtokens=bush02.split()
print(rawtokens[:30])
print(len(rawtokens))

Notice that this only splits the text wherever there is whitespace. It does not separate punctuation from words or split hyphenated words. See, for example:

In [None]:
print(rawtokens[:10])
# notice the punctuation after the 4th and 6th tokens should be placed in separate tokens, while the 
# punctuation after Mr. needs to stay as it identifies it as an abbreviation.

Fortunately, the text analysis package, "NLTK", offers a more sophisticated way to tokenize the words of a text.
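
The kind of difference a dedicated tokenizer makes can be previewed with Python's built-in `re` module alone. The sketch below is not NLTK's `word_tokenize`, and the sentence is made up; the `\w+` pattern is the same one this notebook passes to NLTK's `RegexpTokenizer`, while the second pattern is just one possible way to keep punctuation as separate tokens.

```python
import re

sentence = "Mr. Speaker, the well-known state of our Union is strong."  # made-up example

whitespace_tokens = sentence.split()
word_tokens = re.findall(r"\w+", sentence)              # runs of letters, digits, underscores
word_and_punct = re.findall(r"\w+|[^\w\s]", sentence)   # words, or single punctuation marks

print(whitespace_tokens)
print(word_tokens)
print(word_and_punct)
```

Notice that a simple pattern cannot tell that the period in "Mr." belongs to an abbreviation. Handling cases like that is part of what a purpose-built tokenizer adds.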

In [None]:
tokens = nltk.word_tokenize(bush02)
print(tokens[0:30]) #notice the difference between these tokens and "rawtokens" above
print(len(tokens))

In [None]:
#another way to tokenize
from nltk import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokens2=tokenizer.tokenize(bush02)
print(tokens2[:30])
print(len(tokens2))

<h3 style="color:purple">Python Basics (Additional Practice): Data Types</h3>

<ol style="color:purple">
    Select the links below for more practice with...
    <li>Lists and Tuples (and For Loops)</li>
    <li>Dictionaries</li>
    <li>Data Frames</li>
</ol>

<h3 style="color:purple">Python Basics (Additional Practice): Working with File Names</h3>

<p style="color:purple">A good text corpus usually will include some metadata describing some basic information about the texts included. Sometimes this metadata will be stored in separate files and sometimes at the beginning of a text file. Our SOTU dataset, however, does not include any metadata - with one exception: information about the President and the year in which he gave the address is stored in the filename.</p>

<p style="color:purple">Here, we will use some basic Python commands to retrieve information from these file names.</p>

<p style="color:purple">1. Retrieve the name of the first file in the folder.</p>

In [None]:
pathlist = sotudir.glob('*.txt')
pathnames=[path.name for path in pathlist]
print(pathnames[:10])
firstfilename=pathnames[0]   #Note: in Python the first item of a list is given the index 0 not 1!
print(firstfilename)

<p style="color:purple">2. We can then divide the filename into its three parts: president's name, year of address, and file type/extension.</p>

<p style="color:purple">We can do this in two steps using Python's core split() function, dividing the full file name first by "." and then by "_".</p>
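
The same two-step idea can also be wrapped in a small function. This is a sketch under assumptions: `parse_sotu_filename` is a hypothetical helper, the file name passed to it is just an example, and it uses pathlib's `stem` and `suffix` attributes in place of the first split on ".".

```python
from pathlib import Path

def parse_sotu_filename(filename):
    """Split a name like 'Lincoln_1862.txt' into (president, year, extension)."""
    path = Path(filename)
    pres, year = path.stem.split("_")      # stem = name without the extension
    return pres, int(year), path.suffix    # suffix keeps its leading dot

print(parse_sotu_filename("Lincoln_1862.txt"))
```

This sketch assumes exactly one underscore in the name. A file name with more than one underscore would raise a `ValueError` at the line that unpacks `pres` and `year`.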


In [None]:
print(firstfilename)
filename=firstfilename.split(".") #this line of code separates the filename into a list of parts that precede or follow a "."
print(filename)
filenameparts=filename[0].split("_") #same thing, but using "_" as a separator
print(filenameparts)
ftype=filename[1]
pres=filenameparts[0]
year=filenameparts[1]
print("President",pres,"delivered this State of the Union Address in",year,".","This address is stored as a",ftype,"file.")

## Part V: Iterating through lists using for loops

<h3 style="color:green">Code Together: Working with Lists</h3>

**For loops** provide a simple means to iterate or cycle through items in a list, whether each item be a single value, an entire book or file, or a large directory of files.
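
As a quick preview, here is a minimal sketch of the most common loop patterns, using a made-up list of words.

```python
words = ["union", "state", "congress"]   # made-up example list

# Loop over the items themselves.
for word in words:
    print(word, len(word))

# Loop over positions with range() when the index matters.
for i in range(len(words)):
    print(i, words[i])

# Build a new list inside a loop...
lengths = []
for word in words:
    lengths.append(len(word))

# ...or write the same thing as a list comprehension.
lengths_again = [len(word) for word in words]

print(lengths, lengths_again)
```

The loop and the list comprehension produce identical lists. The loop is usually easier to read when you are starting out.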


<p style="color:green">Below, we will create a list, perform some calculations on it, create a new empty list, and then add items to this new list.</p>

In [None]:
thisIsAnEmptyList = []
print(len(thisIsAnEmptyList))



In [None]:
for i in range(0,1000):
    thisIsAnEmptyList.append(i)
print(len(thisIsAnEmptyList))
print(thisIsAnEmptyList[-5:])
sum(thisIsAnEmptyList)

In [None]:
numlist=[1,2,3,4,5]
print(numlist)
print(len(numlist))
numlist.append(6)
print(numlist)
print(len(numlist))
print(sum(numlist))

<p style='color:green'>We can use **for loops** to iterate through a list and then populate a new, empty list based on calculations performed on the original list. **Run this code and then modify it to take each number to the third power.**</p>
    


In [None]:
sqlist = []

for num in numlist:
    sq = num ** 2  #in Python "*" signifies multiplication and "**" signifies exponents, in this case "num" is taken to the 2nd power
    sqlist.append(sq)
print(sqlist)


In [None]:
primes = []
for i in range(2,100):  #start at 2: by definition, 1 is not a prime number
    isPrime = True
    for lnum in range(2,i):  #try every smaller number as a possible divisor
        if i % lnum == 0:    #a remainder of 0 means i is divisible by lnum
            isPrime = False
            break
    if isPrime:
        primes.append(i)
print(primes)

<p style='color:green'>An example applying some basic string functions to a list of strings:</p>

In [None]:
authors=["George R. R. Martin","Chimamanda Ngozi Adichie","Margaret Atwood","Louise Erdrich"]
for author in authors:
    print("***")
    print(author.lower())  #some examples of some simple string functions, observe what they do!
    print(author.upper())
    print("by "+author)
    print(len(author))
    names=author.split()
    print(names)
    print((names[-1],' '.join(names[:-1])))
    print("\n\n")

## Part VI: How many Roosevelts? For loops to iterate through files

We can also use a **for loop** to iterate through all our SOTU files and count only those that meet certain criteria (e.g., files that are .txt or .csv files, files that start or end with a specific set of characters, or files that contain particular information).
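
One detail worth knowing before looping over files: pathlib's `glob()` hands back a generator, which can be walked through only once. The sketch below illustrates this, along with the counting pattern, using made-up file names written to a temporary folder so that it runs without the SOTU corpus.

```python
import tempfile
from pathlib import Path

# Made-up file names, written to a temporary folder so the example is self-contained.
example_names = ["Adams_1797.txt", "Adams_1825.txt", "Bush_2002.txt", "notes.csv"]

with tempfile.TemporaryDirectory() as tmpdir:
    folder = Path(tmpdir)
    for name in example_names:
        (folder / name).write_text("placeholder", encoding="utf-8")

    paths = folder.glob("*.txt")           # a generator, not a list
    first_pass = [p.name for p in paths]
    second_pass = [p.name for p in paths]  # empty: the generator is already used up

    # Calling glob() again (or storing the results in a list) avoids the problem.
    txt_files = sorted(folder.glob("*.txt"))
    adams_count = 0
    for path in txt_files:
        if path.name.startswith("Adams"):
            adams_count += 1

print(len(first_pass), len(second_pass), adams_count)
```

Wrapping the call in `sorted()` or `list()` stores the paths in an ordinary list that can be reused as often as needed.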

In [None]:
sotudir=Path("sotu") #I already defined sotudir above, but re-inserting here for anyone skipping around
pathlist = sotudir.glob('*.txt') #returns a generator that yields every .txt file in the folder we saved as "sotudir"

for path in pathlist:
    # print(path)        
    print(path.name)

#print([path.name for path in sotudir.glob('*.txt')])  #this is a list comprehension version of the above code
## experienced Python programmers often prefer list comprehensions over for loops, but for loops are nice,
##    very human-readable code that works well for beginners


In [None]:
pathlist = sotudir.glob('*.txt') # .glob returns a generator, which is used up after one pass through it, so you need to call it again!
ctr=0
for path in pathlist:
    filename=path.name
    if filename.startswith("Roosevelt"):
        print(filename)
        ctr+=1
print(ctr,"SOTU addresses by a Roosevelt (FDR or Theodore) are included in this corpus")


    

In [None]:
#Copy and paste the above code, but this time calculate the number of SOTU addresses delivered by a Bush.

## Part VII: Creating a graph of the SOTU speeches.

In [None]:
import pandas as pd
from nltk.tokenize import RegexpTokenizer #repeated here in case you skipped the tokenizing cells above

txtList=[]
sotudir2=Path("sotu2")
pathlist = sotudir2.glob('*.txt') # .glob returns a generator, which is used up after one pass through it, so you need to call it again!
ctr=0
for path in pathlist:
#for item in os.listdir():
    fn=path.stem
    fileType=path.suffix
    #print(fileType)
    #if fileType!=".txt":
    #    print("***will not read: ",fn,"***")
    #    continue
    year,pres=fn.split("_")
    #print(year)
    #print(pres)
    with open(path,'r',encoding='utf-8') as f:  #name the encoding explicitly, as we did when opening Bush_2002.txt
        sotu = f.read()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens=tokenizer.tokenize(sotu)
    numWords=len(tokens)
    #print(txtLen)
    txtList.append([pres,year,numWords,tokens])

colnames=['pres','year','numWords','tokens']
sotudf=pd.DataFrame(txtList,columns=colnames)  ##
sotudf.head(10)

In [None]:
sotudf['year'] = sotudf['year'].astype(int)
sotuSub = sotudf[['pres','year','numWords']]

In [None]:
sotuSub.to_csv("sotuList.csv",encoding='utf-8')

In [None]:
sns.barplot(data=sotuSub, x="year", y="numWords", palette = "colorblind")

In [None]:
g=sns.barplot(data=sotuSub, x="year", y="numWords", hue = "pres",dodge = False, palette = "colorblind")
g.tick_params(labelrotation=90)

#https://github.com/mwaskom/seaborn/issues/970
#Adding dodge=False draws a single full-width bar for each x value
#instead of reserving space for a separate bar for every value of "hue".



One problem: there are too many labels on the x-axis. Run the following code. The line that begins with "plt.xticks" places x axis tick labels only at every ten years.

In [None]:
years = sorted(sotuSub['year'].unique())  #seaborn draws one bar per unique year, in sorted order
decades = [yr for yr in years if yr % 10 == 0]

f, ax = plt.subplots(figsize = (20,15))
sns.barplot(data=sotuSub, x="year", y="numWords", hue = "pres",dodge = False, palette = "colorblind")
ax.tick_params(labelrotation=90)
#ax.set(xticks = decades)
plt.xticks([years.index(dec) for dec in decades],labels=decades)  #tick position = position of that year's bar
plt.show()

Let's modify the graph by labeling and grouping each bar by president.



<h2 style="color:blue">Final Exercise</h2>

<p style="color:blue">2. (advanced). Return a count and the filenames of all SOTU addresses delivered in the nineteenth century. Hint: you will probably need to index into each filename to isolate the century (the first two digits of each SOTU file year). Review Lesson 1, Part 1b for clues on how to do this.</p>

<p style="color:blue">If you are unable to complete this or just want to compare, code is available in the completed version of this notebook (suffix: "_completed.ipynb")</p>
