![DSL_logo](dsl_logo.png)


# Introduction to Text Analysis with Python

Welcome to the Digital Scholarship Lab introduction to Text Analysis with Python class. In this class we'll learn the basics of text analysis:

- parsing text
- analyzing the text

We'll use our own homemade analysis tool first, then we'll use a Python library called `gensim` for some built-in analysis tools.

This workshop assumes you've completed our self-paced Python prep [workshop](https://brockdsl.github.io/Intro_to_Python_Workshop/).

We'll use Zoom's chat feature to interact.

Be sure to enable line numbers by looking for the 'gear' icon and checking the box in the 'Editor' panel.

## EG. Scrabble!

<img src="https://upload.wikimedia.org/wikipedia/commons/5/5d/Scrabble_game_in_progress.jpg" width="500">

Scrabble is a popular game where players try to score points by spelling words and placing them on the game board. We'll use Scrabble scoring as our first attempt at text analysis. This will demonstrate the basics of how Text Analysis works.

The function below gives you the Scrabble score of any word you give it.

In [2]:
# This function will return the Scrabble score of a word

def scrabble_score(word):
    
    #Dictionary of our scrabble scores
    score_lookup = {
        "a": 1,
        "b": 3,
        "c": 3,
        "d": 2,
        "e": 1,
        "f": 4,
        "g": 2,
        "h": 4,
        "i": 1,
        "j": 8,
        "k": 5,
        "l": 1,
        "m": 3,
        "n": 1,
        "o": 1,
        "p": 3,
        "q": 10,
        "r": 1,
        "s": 1,
        "t": 1,
        "u": 1,
        "v": 4,
        "w": 4,
        "x": 8,
        "y": 4,
        "z": 10,
        "\n": 0, #just in case a new line character jumps in here
        " ":0 #normally single words don't have spaces but we'll put this here just in case
        
    }
    
    total_score = 0
    
    #We look up each letter in the scoring dictionary and add it to a running total
    #to make our dictionary shorter we are just using lowercase letters so we need to
    #change all of our input to lowercase with .lower()
    #Characters that aren't in the dictionary (digits, punctuation) score 0 via .get()
    for letter in word:
        total_score = total_score + score_lookup.get(letter.lower(), 0)
    
    return total_score


Text Analysis is a process made up of three basic steps:
1. Identify the text (or corpus) that you'd like to analyze
1. Apply the analysis to your prepared text
1. Analyze the results

In our very basic Scrabble example, we are just interested in finding the points we would get for spelling a specific word.
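
The three steps can be sketched with the Scrabble example. This is only a minimal illustration: the word list is made up, and `letter_score` is a hypothetical, compact stand-in for the scoring function above so that the block runs on its own.

```python
# Step 1: identify the text (a tiny, made-up "corpus" of words)
corpus = ["python", "quiz", "text", "analysis"]

# Standard English Scrabble letter values, grouped by score
LETTER_VALUES = {
    1: "aeilnorstu",
    2: "dg",
    3: "bcmp",
    4: "fhvwy",
    5: "k",
    8: "jx",
    10: "qz",
}
# Flatten the groups into a letter -> score lookup
SCORES = {letter: value for value, letters in LETTER_VALUES.items() for letter in letters}

def letter_score(word):
    """Hypothetical stand-in for scrabble_score: sum the value of each letter."""
    return sum(SCORES.get(letter, 0) for letter in word.lower())

# Step 2: apply the analysis to every item in the corpus
results = {word: letter_score(word) for word in corpus}

# Step 3: analyze the results (here, find the highest-scoring word)
best_word = max(results, key=results.get)
print(results)
print("Highest scoring word:", best_word)
```

The same pattern (collect text, score it, compare the scores) carries over to larger corpora; only the scoring function changes.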

In a more complex example with a larger corpus you can do any of the following types of analysis:
- determine the sentiment (positive / negative nature) of the text
- quantify how complex a piece of writing is based on the vocabulary it uses
- determine what topics are in your corpus, and match your items to these topics (this is what we are going to do)

Of course, there are many other outcomes you can get from performing text analysis.
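
As a rough illustration of the vocabulary-based measure in the list above, the sketch below computes a type-token ratio (unique words divided by total words). This is just one simple measure chosen for illustration; the sample sentences are made up and the function name `type_token_ratio` is hypothetical.

```python
import re

def type_token_ratio(text):
    """Return unique words / total words (0.0 for empty text)."""
    # Lowercase the text and keep runs of letters and apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Made-up sample sentences for illustration
repetitive = "the cat saw the cat and the cat saw the dog"
varied = "a curious kitten quietly watched an enormous sleepy hound"

print("Repetitive text:", round(type_token_ratio(repetitive), 2))
print("Varied text:", round(type_token_ratio(varied), 2))
```

A ratio closer to 1 means fewer repeated words. Longer texts naturally repeat more words, so this measure is most meaningful when comparing texts of similar length.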

Try questions Q1 - Q2 and type "All Done" in the chat box when you are done.

## Q1 Score your name

How many points do you get for your name? Complete the expression below to find out the Scrabble score of your name.

In [3]:
name = "Tim"
print("Score for my name is:", scrabble_score(name))


Score for my name is: 5


## Q2

Score the name of your pet or favorite character from a story. Does your name or the name of your pet score higher in Scrabble?

In [4]:
pet_name = "Domino"
print("Score for my pet's name is:",scrabble_score(pet_name))

#Compare to see which gets more points!
if scrabble_score(pet_name) > scrabble_score(name):
    print("My pet's name scores more points!")
else:
    print("My name scores at least as many points as my pet's name")



Score for my pet's name is: 9
My pet's name scores more points!


# Beyond the basics

We just completed a very basic text analysis where we analyzed two different bits of text to see which one scores higher in Scrabble. Let's expand this idea to a more complex example using the [gensim](https://radimrehurek.com/gensim/) and [TextBlob](https://textblob.readthedocs.io/en/dev/) Python libraries. There are other, more complex libraries that you can use for text analysis; we are using simpler ones so we can spend more time looking at results than setting up our code.

# Installing

This next cell will install and load the required libraries that will do the text analysis.

In [5]:
#Install textblob using magic commands
#Only needed once
%pip install textblob
#%python -m textblob.download_corpora
#%pip install textblob.download_corpora

from textblob import TextBlob

#import gensim
#from gensim import corpora, models, similarities, downloader

import pandas as pd
import nltk
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')

#Let's make sure our previews show more information
pd.set_option('display.max_colwidth', 200)

Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package punkt to /Users/tim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tim/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package brown to /Users/tim/nltk_data...
[nltk_data]   Package brown is already up-to-date!


# Corpus

Corpus is a fancy way of saying the text that we will be looking at. Typically a corpus is already cleaned up and ready for your analysis. For our example we are going to be looking at some entries from the 1900 [diary](https://dr.library.brocku.ca/handle/10464/7282) of Winnie Beam. The next cell will load this corpus and show us a few entries.

In [6]:
winnie_corpus = pd.read_csv('https://raw.githubusercontent.com/BrockDSL/Text_Analysis_with_Python/master/winnie_corpus.txt', header = None, delimiter="\t")
winnie_corpus.columns = ["page","date","entry"]
winnie_corpus['date'] = pd.to_datetime(winnie_corpus['date'])
winnie_corpus['entry'] = winnie_corpus.entry.astype(str)

#preview our top entries
winnie_corpus.head()

|   | page | date | entry |
|---|------|------|-------|
| 0 | 7 | 1900-01-01 | New Year. First day of 1900 Charlie Merritt died at Grand Forks British Columbia yesterday of typhoid fever. To-day is election day and pap went up about 3 o'clock and did not get back until nearl... |
| 1 | 7 | 1900-01-02 | Went uptown in afternoon. Was up to Eckardt's but Miss Macfarlane was not there so I did not get what I wanted. Called at office and Nettie came home with me for tea. Mr Carman came over and borro... |
| 2 | 8 | 1900-01-03 | Mrs Trusty was here washing School started to-day, but I was not going this week. Mamma went to the church and then to Mrs Chatfields Took her the church books. The queen Street Baptist church had... |
| 3 | 8 | 1900-01-04 | Went over to Carman's to have Ella go with me to Dunn's greenhouse. We went about half past three. I brought a primrose Miss Chaplin was in there. Mamma went to Mrs Klotz at home Beatrice helped. ... |
| 4 | 9 | 1900-01-05 | Sweep day. I read "At the Camerons" in the "Harper's Young People" when mamma was sweeping. We had a beggar in afternoon asking for a few cents as he had a long way to go. Rats! Went over to Lee's... |


# Measuring Sentiment

We can analyze the _sentiment_ of text using `TextBlob`. The next cell demonstrates this.

In [None]:

happy_sentence = "Python is the best programming language ever!"
sad_sentence = "Python is difficult to use, and very frustrating"


print("Sentiment of happy sentence ", TextBlob(happy_sentence).sentiment)
print("Sentiment of sad sentence ", TextBlob(sad_sentence).sentiment)

# polarity ranges from -1 to 1.
# subjectivity ranges from 0 to 1.
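
To build intuition for what a polarity score between -1 and 1 means, here is a toy word-list scorer. It is an illustrative sketch only and is not how TextBlob computes its scores; the word lists and the function name `toy_polarity` are made up for this example.

```python
# Tiny, made-up word lists (a real sentiment lexicon is far larger)
POSITIVE = {"best", "great", "good", "love", "happy"}
NEGATIVE = {"difficult", "frustrating", "bad", "hate", "sad"}

def toy_polarity(sentence):
    """Score a sentence from -1 (only negative words) to 1 (only positive words)."""
    # Strip basic punctuation from each word before looking it up
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    matched = positives + negatives
    if matched == 0:
        return 0.0  # no sentiment words found, so call it neutral
    return (positives - negatives) / matched

print(toy_polarity("Python is the best programming language ever!"))
print(toy_polarity("Python is difficult to use, and very frustrating"))
print(toy_polarity("Python is a programming language"))
```

A sentence with only positive words scores 1, one with only negative words scores -1, and a sentence with no listed words scores 0. Real tools weigh words more carefully, so expect their numbers to differ from this toy version.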



# Q3

Try a couple of different sentences in the code cell below. See if you can create something that scores -1 and another that scores 1 for _polarity_. See if you can minimize the _subjectivity_ of your sentence. Share your answers in the chat box.

In [None]:
test_sentence = ""
print("Score of test sentence is ", TextBlob(test_sentence).sentiment)

# Adding Sentiment to our Diary entries

This next cell will score each diary entry and add the scores to the dataframe as two new columns. We loop through each entry and calculate the two scores that represent its sentiment (polarity and subjectivity). After all the scores are computed, we add them to the dataframe.

In [None]:
#Apply sentiment analysis from TextBlob

polarity = []
subjectivity = []


for day in winnie_corpus.entry:
    #print(day,"\n")
    score = TextBlob(day)
    polarity.append(score.sentiment.polarity)
    subjectivity.append(score.sentiment.subjectivity)
    
winnie_corpus['polarity'] = polarity
winnie_corpus['subjectivity'] = subjectivity


#Let's look at our new top entries
winnie_corpus.head()

In [None]:
#Let's graph out the sentiment as it changes day to day.

plt.plot(winnie_corpus["date"],winnie_corpus["polarity"])
plt.xticks(rotation=45)
plt.title("Sentiment of Winnie's Diary Entries")
plt.show()
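
Day-to-day sentiment can be jumpy, so one optional way to see the overall trend is to smooth the values with a moving average. The sketch below uses plain Python with made-up polarity values; `moving_average` is a hypothetical helper, not part of TextBlob or pandas.

```python
def moving_average(values, window=3):
    """Average each value with the values before it, using up to `window` items."""
    smoothed = []
    for i in range(len(values)):
        # Take up to `window` values ending at position i
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Made-up daily polarity scores for illustration
daily_polarity = [0.5, -0.25, 0.0, 0.75, 0.25]
print(moving_average(daily_polarity))
```

pandas offers a comparable built-in through a column's `rolling(...).mean()` method, which may be more convenient when the scores already live in a dataframe.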

# What else can we get from the text?

Let's grab a random entry and see what we can find out about it.

In [None]:
bit_of_corpus = TextBlob(winnie_corpus["entry"][22])
bit_of_corpus

# Sentences

In [None]:
for sentence in bit_of_corpus.sentences:
    print(sentence)
    print(sentence.sentiment,"\n")


# Words in sentences

In [None]:
for sentence in bit_of_corpus.sentences:
    for word in sentence.words:
        print(word)

# Q??

## A closer look at the corpus

Let's look at the January diary entries.

In [None]:
#January Entries
jan_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-01-01') & (winnie_corpus['date'] <= '1900-01-31')]

Let's see what Winnie talks about the most in the month. We can do this by extracting the _noun phrases_ in her entries. We can put them in a dictionary to count how many times each phrase is used.

In [1]:

phrases = {}

for entry in jan_corpus.entry:
    tb = TextBlob(entry)
    for phrase in tb.noun_phrases:
        # check/add to dictionary: start at 0 the first time we see a phrase
        phrases[phrase] = phrases.get(phrase, 0) + 1
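
One convenient way to do this kind of counting is Python's built-in `collections.Counter`, whose `most_common` method lists the top items. The phrases below are made up for illustration and simply stand in for the noun phrases TextBlob would return.

```python
from collections import Counter

# Made-up noun phrases standing in for real extracted phrases
sample_phrases = ["long walk", "tea", "school", "tea", "new hat", "tea", "school"]

# Counter tallies how many times each phrase appears
phrase_counts = Counter(sample_phrases)

# Show the two most frequent phrases and their counts
print(phrase_counts.most_common(2))
# prints [('tea', 3), ('school', 2)]
```

A `Counter` behaves like a dictionary, so `phrase_counts["tea"]` looks up a single count in the usual way.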

In [None]:
#January Entries
#jan_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-01-01') & (winnie_corpus['date'] <= '1900-01-31')]

#February Entries
#feb_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-02-01') & (winnie_corpus['date'] <= '1900-02-28')]

#March Entries
#mar_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-03-01') & (winnie_corpus['date'] <= '1900-03-31')]