<a href="https://colab.research.google.com/github/BrockDSL/BRB_Intro_To_Text_Analysis/blob/main/BRB_Introduction_to_Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![dsl_logo.png](https://raw.githubusercontent.com/BrockDSL/BRB_Intro_To_Text_Analysis/main/dsl_logo.png)

# Introduction to Text Analysis
## Building Better Research Workshop Series

This workshop will introduce you to the basics of the what and how of text analysis.


## How this notebook works

This webpage is a Google Colab notebook made up of different *cells*. Some are code cells that run Python snippets. To work through these cells, simply click on the triangle _run_ button in each cell.

## Save a copy 

To save a copy of this notebook so you can return to it later, please go to **File > Save Copy in Drive**.

In [None]:
# This code cell will load up all the required pieces to run our notebook.
# Once you click into this cell you'll see a triangle 'play' button appear
# Click on that to start your session

import nltk
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')
pd.set_option('display.max_columns', None)

%matplotlib inline
print("Ready to proceed!")



---

## EG. 1: Scrabble!

Let's write some code that does the basics of text analysis in our Scrabble example.

In [None]:
# This function will return the Scrabble score of a word
# Click the play button to load this into memory

def scrabble_score(word):
    
    #Dictionary of our scrabble scores
    score_lookup = {
        "a": 1,
        "b": 3,
        "c": 3,
        "d": 2,
        "e": 1,
        "f": 4,
        "g": 2,
        "h": 4,
        "i": 1,
        "j": 8,
        "k": 5,
        "l": 1,
        "m": 3,
        "n": 1,
        "o": 1,
        "p": 3,
        "q": 10,
        "r": 1,
        "s": 1,
        "t": 1,
        "u": 1,
        "v": 4,
        "w": 4,
        "x": 8,
        "y": 4,
        "z": 10,
        "\n": 0, #just in case a new line character jumps in here
        " ":0 #normally single words don't have spaces but we'll put this here just in case
        
    }
    
    total_score = 0
    
    #We look up each letter in the scoring dictionary and add it to a running total.
    #To keep our dictionary short we only store lowercase letters, so we need to
    #change all of our input to lowercase with .lower()
    #Any character that isn't in the dictionary (punctuation, digits) scores 0
    for letter in word:
        total_score = total_score + score_lookup.get(letter.lower(), 0)
    
    return total_score

## Question: 1. Scrabble Fight ##

Fill in values in the form below to experiment with different Scrabble scores. Remember that you need to click the run button in both cells to update your results.

In [None]:
#@title
#@markdown Let's see whose name scores higher in Scrabble.
your_name = "Tim" #@param {type:"string"}
pets_name = "Shorty" #@param {type:"string"}

#@markdown Feel free to use something else if you don't have a pet.
#@markdown Be sure to hit the run button before moving on!


In [None]:
print("Score for my name is:", scrabble_score(your_name))
print("Score for my pet's name is:",scrabble_score(pets_name))
print("---")

if scrabble_score(pets_name) > scrabble_score(your_name):
    print("My pet's name scores more points!")
else:
    print("My name scores more points than (or the same as) my pet's name")

# Done!

A simple example but it shows exactly what the steps are.
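
The steps above (normalize each letter, look up its value, add it to a running total) can also be written more compactly. This is only a minimal sketch, not a replacement for `scrabble_score`: the names `LETTER_GROUPS`, `SCORES`, and `compact_scrabble_score` are hypothetical, and it assumes that characters without a tile value (spaces, punctuation, digits) should count as 0.

```python
# Hypothetical compact variant of the scoring steps; names are illustrative.
# Same letter values as the dictionary above, grouped by points.
LETTER_GROUPS = {
    1: "aeilnorstu",
    2: "dg",
    3: "bcmp",
    4: "fhvwy",
    5: "k",
    8: "jx",
    10: "qz",
}

# Invert the groups into a letter -> points lookup table.
SCORES = {
    letter: points
    for points, letters in LETTER_GROUPS.items()
    for letter in letters
}


def compact_scrabble_score(word):
    # Assumption: characters with no tile value (spaces, punctuation) score 0.
    return sum(SCORES.get(letter, 0) for letter in word.lower())


print(compact_scrabble_score("Tim"))
print(compact_scrabble_score("Shorty"))
```

Grouping letters by point value keeps the table short, and `.get(letter, 0)` means a name such as "O'Neil" is scored instead of raising a `KeyError`.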



---
## EG. 2: A more interesting example


Let's load up the text of the diary of Winnie.


In [None]:
winnie_corpus = pd.read_csv('https://raw.githubusercontent.com/BrockDSL/Text_Analysis_with_Python/master/winnie_corpus.txt', header = None, delimiter="\t")
winnie_corpus.columns = ["page","date","entry"]
winnie_corpus['date'] = pd.to_datetime(winnie_corpus['date'])
winnie_corpus['entry'] = winnie_corpus.entry.astype(str)

#display the first 5 entries (the default for .head())
winnie_corpus.head()

## Enter sentiment

We can analyze the _sentiment_ of the text (more [details](https://planspace.org/20150607-textblob_sentiment/)). The next cell demonstrates this:

In [None]:
happy_sentence = "Python is the best programming language ever!"
sad_sentence = "Python is difficult to use, and very frustrating"


print("Sentiment of happy sentence ", TextBlob(happy_sentence).sentiment)
print("Sentiment of sad sentence ", TextBlob(sad_sentence).sentiment)

# polarity ranges from -1 to 1.
# subjectivity ranges from 0 to 1.



## Question: 2. Experimenting with Sentiment ##

Try a couple of different sentences in the code cell below. See if you can create something that scores -1 and another that scores 1 for polarity. See if you can minimize the subjectivity of your sentence. Share your answers in the chat box.

In [None]:
test_sentence = """

Just replace this sentence with some text!

"""
print("Score of test sentence is ", TextBlob(test_sentence).sentiment)
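
To build intuition for why some sentences land near -1 or 1, here is a toy lexicon-based scorer. It is only a sketch of the general idea of averaging per-word scores from a word list. The lexicon values and the name `toy_polarity` are invented for illustration, and TextBlob's own analyzer relies on a much larger lexicon and additional rules, so its scores will differ.

```python
# Toy illustration only: these words and values are invented, not TextBlob's.
TOY_LEXICON = {
    "best": 1.0,
    "great": 0.8,
    "difficult": -0.5,
    "frustrating": -0.4,
    "worst": -1.0,
}


def toy_polarity(sentence):
    # Lowercase and strip simple punctuation so "best!" matches "best".
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    # Sentences with no known words are treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("Python is the best programming language ever!"))
print(toy_polarity("Python is difficult to use, and very frustrating"))
```

In this toy version, a sentence scores exactly 1 or -1 only when every scored word sits at that extreme, which hints at why mixed sentences drift toward 0.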

## Adding Sentiment to our Diary entries


This next cell will score each diary entry and add the results to the dataframe as new columns. We loop through each entry and calculate the two scores that represent the sentiment. After all the scores are computed, we add them to the dataframe.

In [None]:
#Apply sentiment analysis from TextBlob

polarity = []
subjectivity = []


for day in winnie_corpus.entry:
    #print(day,"\n")
    score = TextBlob(day)
    polarity.append(score.sentiment.polarity)
    subjectivity.append(score.sentiment.subjectivity)
    
winnie_corpus['polarity'] = polarity
winnie_corpus['subjectivity'] = subjectivity


#Let's look at our new top entries
winnie_corpus.head()

## Graph it out?

Let's graph the changes in sentiment polarity to see what is happening with Winnie.

In [None]:
#Let's graph out the sentiment as it changes day to day.

plt.plot(winnie_corpus["date"],winnie_corpus["polarity"])
plt.xticks(rotation=45)
plt.title("Sentiment of Winnie's Diary Entries")
plt.show()

## Question: 3. Interesting Spikes?

We see some strong negative and positive spikes in the sentiment. Run the next three cells to isolate the _very positive_ and _very negative_ entries in the diary and take a closer look at them.

In [None]:
#instead of looking at just the highest and lowest value we'll reduce that number by a threshold value
#that way we can see numbers that are close to the highest sentiment and the lowest sentiment
#we'll start with 20%.


threshold = 0.2

In [None]:
#Very Negative
bad_sentiment = winnie_corpus["polarity"].min()

#Reduce this number by threshold %
bad_sentiment = bad_sentiment - (bad_sentiment * threshold)

winnie_corpus[winnie_corpus["polarity"] <= bad_sentiment]



In [None]:
#Very Positive
good_sentiment = winnie_corpus["polarity"].max()

#Reduce this number by threshold %
good_sentiment = good_sentiment - (good_sentiment * threshold)

winnie_corpus[winnie_corpus["polarity"] >= good_sentiment]
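
The threshold arithmetic in the cells above can be checked on a small made-up list of scores. This sketch uses plain Python lists instead of the dataframe; the sample values and the name `cutoff` are hypothetical.

```python
def cutoff(extreme, threshold):
    # Move the extreme value toward zero by the given fraction of itself.
    return extreme - (extreme * threshold)


# Made-up polarity scores, for illustration only.
sample_scores = [-0.8, -0.7, -0.1, 0.0, 0.3, 0.9, 1.0]
threshold = 0.2

low_cutoff = cutoff(min(sample_scores), threshold)   # roughly -0.64
high_cutoff = cutoff(max(sample_scores), threshold)  # roughly 0.8

# Keep the scores that fall at or beyond each cutoff.
very_negative = [s for s in sample_scores if s <= low_cutoff]
very_positive = [s for s in sample_scores if s >= high_cutoff]

print(very_negative)
print(very_positive)
```

Note that this only widens the selection when the minimum is negative and the maximum is positive. If every score had the same sign, one of the cutoffs would move away from the data instead of into it.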



What do you think about the results of the _sentiment_ scoring? Do you agree with what constitutes a high score? How about a low score?

## Noun Phrases for Automatic Keyword Generation

We can get a good idea of what a corpus is about by looking at the nouns that show up in it. Nouns that appear often hint at the contents of the text. Let's look at a random diary entry to see this in action.

In [None]:
random_entry = winnie_corpus.sample(1)

print("Total text of entry:")
print(random_entry["entry"].values[0])


print("\nNow the noun phrases:")
entry_text = TextBlob(random_entry["entry"].values[0])

for np in entry_text.noun_phrases:
  print(np)


## Question: 4. ##

Let's see what Winnie talks about the most in the first 6 months of the year. We can do this by extracting the noun phrases in her entries. We can then put them into a frequency list and display them on the screen. Run the next two cells to build and display this information.

In [None]:
#Separate all entries into different months by using the date column and extract noun phrases
#by month

#January Entries
jan_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-01-01') & (winnie_corpus['date'] <= '1900-01-31')]
jan_phrases = dict()

for entry in jan_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in jan_phrases:
            jan_phrases[np] += 1
        else:
            jan_phrases[np] = 1



#February Entries
feb_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-02-01') & (winnie_corpus['date'] <= '1900-02-28')]

feb_phrases = dict()

for entry in feb_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in feb_phrases:
            feb_phrases[np] += 1
        else:
            feb_phrases[np] = 1

#March Entries
mar_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-03-01') & (winnie_corpus['date'] <= '1900-03-31')]


mar_phrases = dict()

for entry in mar_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in mar_phrases:
            mar_phrases[np] += 1
        else:
            mar_phrases[np] = 1



#April Entries
april_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-04-01') & (winnie_corpus['date'] <= '1900-04-30')]

april_phrases = dict()

for entry in april_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in april_phrases:
            april_phrases[np] += 1
        else:
            april_phrases[np] = 1

#May Entries
may_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-05-01') & (winnie_corpus['date'] <= '1900-05-31')]

may_phrases = dict()

for entry in may_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in may_phrases:
            may_phrases[np] += 1
        else:
            may_phrases[np] = 1




#June Entries
june_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-06-01') & (winnie_corpus['date'] <= '1900-06-30')]

june_phrases = dict()

for entry in june_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in june_phrases:
            june_phrases[np] += 1
        else:
            june_phrases[np] = 1

In [None]:
#We'll print the top ten noun phrases from each month below along with how many times they show up


#Print the top 10 things she mentioned in January

print("January")
for np in sorted(jan_phrases, key=jan_phrases.get, reverse=True)[0:10]:
    print(np, ">",jan_phrases[np])

print("\nFebruary")
for np in sorted(feb_phrases, key=feb_phrases.get, reverse=True)[0:10]:
    print(np, ">",feb_phrases[np])

print("\nMarch")
for np in sorted(mar_phrases, key=mar_phrases.get, reverse=True)[0:10]:
    print(np, ">",mar_phrases[np])

print("\nApril")
for np in sorted(april_phrases, key=april_phrases.get, reverse=True)[0:10]:
    print(np, ">",april_phrases[np])

print("\nMay")
for np in sorted(may_phrases, key=may_phrases.get, reverse=True)[0:10]:
    print(np, ">",may_phrases[np])

print("\nJune")
for np in sorted(june_phrases, key=june_phrases.get, reverse=True)[0:10]:
    print(np, ">",june_phrases[np])

What can you say about Winnie's topics over the first half of the year? Share your thoughts in the chat box.
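
The six nearly identical month blocks above could be collapsed into a single loop. The sketch below shows the pattern with `collections.Counter` on made-up data; `top_phrases_by_month`, `toy_extract`, and the sample entries are all hypothetical. In the notebook you would pass something like `lambda text: TextBlob(text).noun_phrases` as the extractor and build the `(date, entry)` pairs from the `date` and `entry` columns of `winnie_corpus`.

```python
from collections import Counter, defaultdict
from datetime import date


def top_phrases_by_month(dated_entries, extract, n=10):
    # Group phrase counts by the (year, month) of each entry's date.
    counts = defaultdict(Counter)
    for day, text in dated_entries:
        counts[(day.year, day.month)].update(extract(text))
    # Keep only the n most frequent phrases for each month.
    return {month: counter.most_common(n) for month, counter in counts.items()}


def toy_extract(text):
    # Stand-in extractor for illustration: treats comma-separated chunks as phrases.
    return [chunk.strip().lower() for chunk in text.split(",") if chunk.strip()]


# Made-up entries, not taken from the diary.
sample = [
    (date(1900, 1, 1), "snow, church, letter"),
    (date(1900, 1, 2), "snow, school"),
    (date(1900, 2, 1), "school, skating, school"),
]

for month, phrases in sorted(top_phrases_by_month(sample, toy_extract).items()):
    print(month)
    for phrase, count in phrases:
        print(phrase, ">", count)
```

Grouping on `(year, month)` avoids hard-coding the last day of each month, and `Counter.most_common(n)` replaces the manual sort and slice.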



---
## Conclusion

Text analysis can go in many directions. The difficult part is usually getting your text ready. Once that is done, there are many different avenues to explore. Today we looked at only a few basic examples of what can be done.

If you're interested in exploring social media data for a research project or class, please contact **dsl @ brocku.ca** or check out the [DSL webpage](https://brocku.ca/library/dsl) for more details on how the Digital Scholarship Lab can help your research.

