<h1>Preparation</h1>
<li>Download the nltk corpus (see below)
<li>Download Class 7 - Data.zip from canvas and unzip it in a local folder in your directory
<li>From the data folder, copy the files 2013-Obama.txt and 2017-Trump.txt to:
<ul>
<li>~/nltk_data/corpora/inaugural/ (~ indicates your home directory)
</ul>

In [None]:
#YOU NEED TO RUN THIS ONLY ONCE!
import nltk
nltk.download()

<h1>Working with text!</h1>


<h2>nltk: Python's natural language toolkit</h2>


<h3>nltk documentation link:</h3> http://www.nltk.org/api/nltk.html
<h3>Commands cheat sheet</h3> https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf
<h3>nltk book</h3>http://www.nltk.org/book/

<h2>Types of analysis</h2>
<li>Sentiment analysis: Deciding whether a document (or concept) is positive or negative
<li>Entity analysis: Identifying entities (Named entities, Parts of speech) and properties of these entities
<li>Topic analysis: Deciding what the major topics associated with a piece of text are
<li>Text summarization: Summarizing a document (Cliff notes version!)

<h2>Sentiment Analysis</h2>
Identify entities and emotions in a sentence and use these to determine if the entity is being viewed positively or negatively

<h3>Easy examples</h3>
<li>I had an <b style="color:green">excellent</b> souffle at the restaurant Cavity Maker</li>
<li>Excellent is a positive word for both the souffle as well as for the restaurant</li>

<h3>Not so easy examples</h3>
<h4>Often, looking at words alone is not enough to figure out the sentiment</h4>
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for a ‘stuck at home’ snow day</i></li> This one is easy since it includes an explicit positive opinion using a positive word
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for using as a liner for your cat’s litter box</i></li> Not so simple! The positive word "excellent" is used with a negative connotation. 
<li><i>The Girl on the Train is <span style="color:green">better</span> than Gone Girl</i></li> The positive word is used as a comparator. Whether the writer likes The Girl on the Train or not depends on what he or she thinks of Gone Girl
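
The sketch below illustrates why looking at words alone falls short. It uses a tiny made-up word list (the sets POSITIVE and NEGATIVE and the function naive_score are assumptions for illustration, not a real lexicon) and shows that the sincere and the sarcastic review get the same score.

```python
# Toy lexicon: these word sets are illustrative assumptions, not a real lexicon
POSITIVE = {"excellent", "better"}
NEGATIVE = {"terrible", "worse"}

def naive_score(sentence):
    """Count positive minus negative words, ignoring context entirely."""
    words = [w.strip(".,!?'\"").lower() for w in sentence.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

snow_day = "The Girl on the Train is an excellent book for a snow day"
litter_box = ("The Girl on the Train is an excellent book for using as a liner "
              "for your cat's litter box")

# Both sentences receive the same score even though the second is sarcastic
print(naive_score(snow_day), naive_score(litter_box))
```

A word-counting approach cannot tell these two apart, which is why it is usually combined with other techniques.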

<h4>Bottom line</h4>
Sentiment analysis is generally a starting point in analyzing a text and is then coupled with other techniques (e.g., topic analysis)

<h2>Sentiment analysis is usually done using a corpus of positive and negative words</h2>
<li>Some sources compile lists of positive and negative words
<li>Others include the polarity - the degree of positivity or negativity - of each word
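
To make the difference between the two kinds of source concrete, here is a minimal sketch. Both lexicons and all the weights in it are hypothetical values chosen for illustration; real lexicons are far larger.

```python
# Hypothetical lexicons for illustration only
binary_lexicon = {"good": "positive", "great": "positive", "bad": "negative"}
polarity_lexicon = {"good": 0.5, "great": 0.9, "bad": -0.6}  # assumed weights

def binary_net(words):
    """Positive word count minus negative word count."""
    pos = sum(binary_lexicon.get(w) == "positive" for w in words)
    neg = sum(binary_lexicon.get(w) == "negative" for w in words)
    return pos - neg

def weighted_net(words):
    """Sum of the polarity weights of the words that appear in the lexicon."""
    return sum(polarity_lexicon.get(w, 0.0) for w in words)

review_a = "the food was good".split()
review_b = "the food was great".split()

# The binary list cannot separate the two reviews; the weighted list can
print(binary_net(review_a), binary_net(review_b))
print(weighted_net(review_a), weighted_net(review_b))
```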

<h2>Sources of sentiment coded words</h2>
<ol>
<li>Hu and Liu's sentiment analysis lexicon: words coded as either positive or negative</li>
<ul>
<li>http://ptrckprry.com/course/ssd/data/positive-words.txt
<li>http://ptrckprry.com/course/ssd/data/negative-words.txt
</ul>
<li>NRC Emotion Lexicon: words coded into emotional categories (many languages)</li>
<ul>
<li>http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm</li>
</ul>
<li>SentiWordNet: Lists of words weighted by positive or negative sentiment. Includes guidance on how to use the words</li>
<ul>
<li>http://sentiwordnet.isti.cnr.it/</li>
</ul>
<li>Vader Sentiment tool: about 7,500 words and symbols with positive or negative polarity</li>
<ul>
<li>Included with python nltk</li>
</ul>
</ol>

<h2>Our examples</h2>
<li>Compiled set of reviews of neighborhood restaurants
<li>Presidential inaugural addresses (from Washington to Trump)
<li>Some data from yelp using the yelp API

<h3>Simple sentiment analysis</h3>
Compute the proportion of positive and negative words in a text

In [None]:
def get_pos_neg_words():
    def get_words(url):
        import requests
        words = requests.get(url).content.decode('latin-1')
        word_list = words.split('\n')
        index = 0
        while index < len(word_list):
            word = word_list[index]
            if ';' in word or not word:
                word_list.pop(index)
            else:
                index+=1
        return word_list

    #Get lists of positive and negative words
    p_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
    n_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
    positive_words = get_words(p_url)
    negative_words = get_words(n_url)
    return positive_words,negative_words

positive_words,negative_words = get_pos_neg_words()

p = set(positive_words)
n = set(negative_words)

print(p.intersection(n))

<h4>Read the text being analyzed and count the proportion of positive and negative words in the text</h4>
<li>We'll look at the reviews of two restaurants in the Morningside Heights neighborhood

In [None]:
with open('./Class 7 - Data/community.txt','r') as f:
    community = f.read()
with open('./Class 7 - Data/le_monde.txt','r') as f:
    le_monde = f.read()

<h4>Compute sentiment by looking at the proportion of positive and negative words in the text</h4>

In [None]:
from nltk import word_tokenize

cpos = cneg = lpos = lneg = 0
for word in word_tokenize(community):
    if word in positive_words:
        cpos+=1
    if word in negative_words:
        cneg+=1
for word in word_tokenize(le_monde):
    if word in positive_words:
        lpos+=1
    if word in negative_words:
        lneg+=1
print("community {0:1.2f}%\t {1:1.2f}%\t {2:1.2f}%".format(cpos/len(word_tokenize(community))*100,
                                                        cneg/len(word_tokenize(community))*100,
                                                        (cpos-cneg)/len(word_tokenize(community))*100))
print("le monde  {0:1.2f}%\t {1:1.2f}%\t {2:1.2f}%".format(lpos/len(word_tokenize(le_monde))*100,
                                                        lneg/len(word_tokenize(le_monde))*100,
                                                        (lpos-lneg)/len(word_tokenize(le_monde))*100))


<h2>Let's functionalize this</h2>

In [None]:
def do_pos_neg_sentiment_analysis(text_list,debug=False):
    from nltk import word_tokenize
    #Convert the word lists to sets so that membership tests are fast
    positive_words,negative_words = (set(words) for words in get_pos_neg_words())
    results = list()
    for text in text_list:
        cpos = cneg = 0
        #Tokenize once and reuse the tokens for counting and for normalizing
        tokens = word_tokenize(text[1])
        for word in tokens:
            if word in positive_words:
                if debug:
                    print("Positive",word)
                cpos+=1
            if word in negative_words:
                if debug:
                    print("Negative",word)
                cneg+=1
        results.append((text[0],cpos/len(tokens),cneg/len(tokens)))
    return results

do_pos_neg_sentiment_analysis([('community',community),('le_monde',le_monde)])

<h2>Simple sentiment analysis using NRC data</h2>
<li>NRC data codifies words with emotions</li>
<li>14,182 words are coded into 2 sentiments and 8 emotions</li>


<h4>For example, the word abandoned is associated with anger, fear, and sadness, and has a negative sentiment</h4>
<li>abandoned	anger	1
<li>abandoned	anticipation	0
<li>abandoned	disgust	0
<li>abandoned	fear	1
<li>abandoned	joy	0
<li>abandoned	negative	1
<li>abandoned	positive	0
<li>abandoned	sadness	1
<li>abandoned	surprise	0
<li>abandoned	trust	0
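
The rows above are tab-separated: word, emotion, and a 0/1 flag. As a small sketch of how such rows can be turned into a lookup table, the block below parses a few of them from an in-memory string rather than the real file (the name emotions_by_word is just an illustrative choice).

```python
import io
from collections import defaultdict

# A few rows in the word<TAB>emotion<TAB>flag layout shown above
sample = io.StringIO(
    "abandoned\tanger\t1\n"
    "abandoned\tanticipation\t0\n"
    "abandoned\tfear\t1\n"
    "abandoned\tjoy\t0\n"
    "abandoned\tnegative\t1\n"
    "abandoned\tsadness\t1\n"
)

emotions_by_word = defaultdict(list)
for line in sample:
    word, emotion, flag = line.strip().split("\t")
    if flag == "1":  # keep only the emotions the word is associated with
        emotions_by_word[word].append(emotion)

print(dict(emotions_by_word))
```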

<h4>Read the NRC sentiment data</h4>

In [None]:
nrc = "./Class 7 - Data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
count=0
emotion_dict=dict()
with open(nrc,'r') as f:
    all_lines = list()
    for line in f:
        if count < 46:
            count+=1
            continue
        line = line.strip().split('\t')
        if int(line[2]) == 1:
            if emotion_dict.get(line[0]):
                emotion_dict[line[0]].append(line[1])
            else:
                emotion_dict[line[0]] = [line[1]]
        

<h4>Functionalize this</h4>

In [None]:
def get_nrc_data():
    nrc = "./Class 7 - Data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
    count=0
    emotion_dict=dict()
    with open(nrc,'r') as f:
        all_lines = list()
        for line in f:
            if count < 46:
                count+=1
                continue
            line = line.strip().split('\t')
            if int(line[2]) == 1:
                if emotion_dict.get(line[0]):
                    emotion_dict[line[0]].append(line[1])
                else:
                    emotion_dict[line[0]] = [line[1]]
    return emotion_dict

In [None]:
emotion_dict = get_nrc_data()
emotion_dict['abandoned']

<h1>Yelp API</h1>
<li>https://www.yelp.com/developers/documentation/v3
<li>log into yelp (top right hand corner of the page)
<li>Click <span style="color:blue">Create App</span> on the left hand menu bar
<li>Enter app info (leave optional stuff blank)
<li>Copy the client ID and API key to a secure place (for example, a text file such as YelpAPIKeys.txt, with the client ID on the first line and the API key on the second)

In [None]:
with open('./YelpAPIKeys.txt','r') as f:
    count = 0
    for line in f:
        if count == 0:
            CLIENT_ID = line.strip()
        if count == 1:
            API_KEY = line.strip()
        count+=1


In [None]:
print(CLIENT_ID,API_KEY)

In [None]:
# API constants, you shouldn't have to change these.
API_HOST = 'https://api.yelp.com' #The API url header
SEARCH_PATH = '/v3/businesses/search' #The path for an API request to find businesses
BUSINESS_PATH = '/v3/businesses/'  # The path to get data for a single business

<h3>Now we can get reviews</h3>
<li>get_reviews(location,number=15) returns the reviews of "number" (default=15) restaurants in the vicinity of "location"
<li>First, we'll write a function that gets restaurants in the vicinity of "location"



In [None]:
def get_restaurants(api_key,location,number=15):
    import requests
    
    #The API key is sent as a Bearer token, so no separate access token request is needed
    #Set up the search data dictionary
    search_data = {
    'term': "restaurant",
    'location': location, #requests URL-encodes the parameter values for us
    'limit': number
    }
    url = API_HOST + SEARCH_PATH
    headers = {
        'Authorization': 'Bearer %s' % api_key,
    }
    response = requests.request('GET', url, headers=headers, params=search_data).json()
    businesses = response.get('businesses')
    return businesses

In [None]:
get_restaurants(API_KEY,"Columbia University, New York, NY")

<h4>Then a function that, given a business id, returns a string containing the reviews</h4>


In [None]:
def get_business_review(api_key,business_id):
    import json
    import requests
    business_path = BUSINESS_PATH + business_id+"/reviews"
    url = API_HOST + business_path

    headers = {
        'Authorization': 'Bearer %s' % api_key,
    }


    response = requests.request('GET', url, headers=headers).json()
   
    review_text = ''
    for review in response['reviews']:
        review_text += review['text']
    return review_text

In [None]:
get_business_review(API_KEY,'flat-top-new-york')

<h4>Finally, put all this together to get review data for the set of restaurants</h4>


In [None]:
def get_reviews(location,number=15):

    restaurants = get_restaurants(API_KEY,location,number)

    if not restaurants:
        return None
    review_list = list()
    for restaurant in restaurants:
        restaurant_name = restaurant['name']
        restaurant_id = restaurant['id']
        review_text = get_business_review(API_KEY,restaurant_id)
        
        review_list.append((restaurant_name,review_text))
    return review_list
        


In [None]:
all_snippets = get_reviews("Columbia University, New York, NY")

In [None]:
all_snippets

<h2>A function that analyzes emotions</h2>

In [None]:
def emotion_analyzer(text,emotion_dict=emotion_dict):
    #Set up the result dictionary
    emotions = {x for y in emotion_dict.values() for x in y} 
    #set comprehension: collect every emotion that appears in any word's emotion list
    emotion_count = dict()
    for emotion in emotions:
        emotion_count[emotion] = 0

    #Analyze the text and normalize by total number of words
    total_words = len(text.split())
    for word in text.split():
        if emotion_dict.get(word):
            for emotion in emotion_dict.get(word):
                emotion_count[emotion] += 1/total_words
    return emotion_count
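
One thing to watch for: splitting on whitespace leaves case and punctuation attached to the words, so "awful." will not match a dictionary entry for "awful". The sketch below shows one possible way to normalize words before the lookup. The two-word dictionary toy_emotions and the helper normalize_words are hypothetical stand-ins, not part of the NRC data or of nltk.

```python
import string

# Hypothetical two-word emotion dictionary, standing in for the NRC data
toy_emotions = {"delicious": ["joy", "positive"], "awful": ["disgust", "negative"]}

def normalize_words(text):
    """Lowercase each whitespace-separated word and strip surrounding punctuation."""
    cleaned = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in cleaned if w]

review = "Delicious! The service, however, was awful."
raw_hits = [w for w in review.split() if w in toy_emotions]
clean_hits = [w for w in normalize_words(review) if w in toy_emotions]

print(raw_hits)    # 'Delicious!' and 'awful.' still carry case and punctuation
print(clean_hits)
```

Using nltk's word_tokenize together with lowercasing would be another reasonable choice here.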

<h4>Now we can analyze the emotional content of the review snippets</h4>

In [None]:
print("%-12s %1s\t%1s %1s %1s %1s   %1s %1s %1s %1s"%(
        "restaurant","fear","trust","negative","positive","joy","disgust","anticip",
        "sadness","surprise"))
        
for snippet in all_snippets:
    text = snippet[1]
    result = emotion_analyzer(text)
    print("%-12s %1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f"%(
        snippet[0][0:10],result['fear'],result['trust'],
          result['negative'],result['positive'],result['joy'],result['disgust'],
          result['anticipation'],result['sadness'],result['surprise']))


<h4>Let's functionalize this</h4>

<h3>For ease of analysis, we'll do the following:</h3>
<li>Generalize it so that we can analyze any document type, not just restaurant reviews
<li>Output a dataframe containing the results. This will make analysis of the results easier
<li>Add an argument that controls whether or not the function prints its output

In [None]:
def comparative_emotion_analyzer(text_tuples,object_name="Restaurant",print_output=False):
    if print_output:
        print("%-20s %1s\t%1s %1s %1s %1s   %1s %1s %1s %1s"%(object_name,
                                                              "fear","trust","negative","positive",
                                                              "joy","disgust","anticip", "sadness",
                                                              "surprise"))
    import pandas as pd
    df = pd.DataFrame(columns=[object_name,'Fear','Trust','Negative',
                           'Positive','Joy','Disgust','Anticipation',
                           'Sadness','Surprise'],)
    df.set_index(object_name,inplace=True)
    
    output = df    
    for text_tuple in text_tuples:
        text = text_tuple[1] 
        result = emotion_analyzer(text)
        if print_output:
            print("%-20s %1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f"%(
                text_tuple[0][0:20],result['fear'],result['trust'],
                  result['negative'],result['positive'],result['joy'],result['disgust'],
                  result['anticipation'],result['sadness'],result['surprise']))
        df.loc[text_tuple[0]] = [result['fear'],result['trust'],
                  result['negative'],result['positive'],result['joy'],result['disgust'],
                  result['anticipation'],result['sadness'],result['surprise']]
    return output
#And test it        
comparative_emotion_analyzer(all_snippets)

<h2>Package the emotion analyzer with the yelp API to get a yelp data analyzer</h2>

In [None]:
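def analyze_nearby_restaurants(address,number=15):
    snippets = get_reviews(address,number)
    if not snippets: #get_reviews returns None when no restaurants are found
        return None
    #Return the dataframe so that the notebook displays it
    return comparative_emotion_analyzer(snippets)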

#And test it    
analyze_nearby_restaurants("Columbia University",15)

In [None]:
#Test it on some other place
analyze_nearby_restaurants("221 Baker Street, London, UK",15)

<h1>Working with organized bodies of texts</h1>

<h2>Text corpora</h2>
<li>Corpus: An organized set of text documents</li>
<li>Examples:
<ol>
<li>The collection of inaugural speeches
<li>An entire book
<li>Collection of all books by Graham Greene
<li>Collection of tweets by Donald Trump
<li>Collection of tweets that reference AAPL
</ol>
<li>As the examples illustrate, the texts in a corpus are related, and what a corpus contains depends on what sort of analysis you want to do


<h1>Building text corpora</h1>
<li>Read a collection of documents from a directory
<li>Use APIs to get text documents or fragments
<ul>Examples:
<li>Tweets
<li>Yelp reviews
</ul>
<li>Use existing corpora
<ul>
<li>nltk.download() downloads sample corpora
</ul>

<h2>Creating a corpus from text files</h2>

<h2>Let's do a detailed comparison of local restaurants</h2>
<h4>I've saved a few reviews for each restaurant in four directories</h4>
<h4>We'll use the PlaintextCorpusReader to read these directories</h4>
<li>PlaintextCorpusReader reads all matching files in a directory and saves them by file-ids

In [None]:
### Data Structure example ###
# For community:
# - Root folder - "Class 7 - Data/community"
# - file names - "community.*"

import nltk
from nltk.corpus import PlaintextCorpusReader
restaurants = ['community', 'le_monde', 'shakeshack', 'fiveguys']
restaurants_data = {}
for restaurant in restaurants:
    restaurants_data[restaurant] = PlaintextCorpusReader('Class 7 - Data/%s' % restaurant, '%s.*' % restaurant)

In [None]:
restaurants_data['shakeshack'].fileids()

In [None]:
restaurants_data['shakeshack'].raw()

<h4>We need to construct text tuples that match the format of the argument of comparative_emotion_analyzer</h4>

In [None]:
restaurant_tuples = []
for key in restaurants_data.keys():
    restaurant_tuples.append((key, restaurants_data[key].raw()))

comparative_emotion_analyzer(restaurant_tuples)

<h2>Using nltk sample corpora</h2>

<h2>nltk includes a large collection of pre-tokenized text corpora</h2>
Load it using the command:<p>
<b>You should have already done this!</b><br>
nltk.download()

    

<h4>Import the corpora</h4>
<li>Look for texts under nltk_data in your home directory</li>

In [None]:
from nltk.book import *

<h1>Often, a comparative analysis helps us understand text better</h1>
<h2>Let's look at US Presidential Inaugural speeches</h2>
<h4>Copy the files 2013-Obama.txt and 2017-Trump.txt to the nltk_data/corpora/inaugural directory. nltk_data should be under your home directory</h4>

In [None]:
inaugural.fileids()

In [None]:
inaugural.raw('1861-Lincoln.txt')

In [None]:
all_addresses = list()
for file in inaugural.fileids():
    all_addresses.append((file,inaugural.raw(file)))
all_addresses

<h2>Let's compare the speeches by emotion using our function</h2>

In [None]:
all_speeches = comparative_emotion_analyzer(all_addresses,print_output=False,object_name="President")

In [None]:
all_speeches

<h2>Try the following</h2>
<li>The most positive presidential speeches
<li>The most negative presidential speeches
<li>The most "surprise" oriented presidential speeches
<li>The most Net Positive speeches:
<ul>
<li>Positive = Trust + Positive + Joy + Anticipation
<li>Negative = Fear + Negative + Disgust + Sadness
</ul>
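
As a quick worked example of the Net Positive calculation, the sketch below uses plain dictionaries with made-up scores for two imaginary speeches (the numbers are hypothetical, not results from the inaugural corpus).

```python
# Hypothetical emotion scores for two made-up speeches (not real results)
scores = {
    "speech_a": {"Trust": 0.05, "Positive": 0.08, "Joy": 0.03, "Anticipation": 0.03,
                 "Fear": 0.02, "Negative": 0.03, "Disgust": 0.01, "Sadness": 0.01},
    "speech_b": {"Trust": 0.04, "Positive": 0.05, "Joy": 0.02, "Anticipation": 0.02,
                 "Fear": 0.04, "Negative": 0.05, "Disgust": 0.02, "Sadness": 0.03},
}
POS_KEYS = ("Trust", "Positive", "Joy", "Anticipation")
NEG_KEYS = ("Fear", "Negative", "Disgust", "Sadness")

def net_positive(row):
    """Sum of the positive emotion scores minus the sum of the negative ones."""
    return sum(row[k] for k in POS_KEYS) - sum(row[k] for k in NEG_KEYS)

# Rank the speeches from most to least net positive
ranked = sorted(scores, key=lambda name: net_positive(scores[name]), reverse=True)
for name in ranked:
    print(name, round(net_positive(scores[name]), 2))
```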


In [None]:
all_speeches.sort_values(by="Surprise",ascending=False)
all_speeches["All_Pos"]=(all_speeches['Trust']+all_speeches['Positive']+ all_speeches['Joy']+ all_speeches['Anticipation'])
all_speeches["All_Neg"]=(all_speeches['Fear']+all_speeches['Negative']+ all_speeches['Disgust']+ all_speeches['Sadness'])
all_speeches['Net']=all_speeches["All_Pos"]-all_speeches["All_Neg"]
all_speeches.sort_values(by="Net",ascending=False)['Net']

In [None]:
all_speeches

<h2>Naive sentiment analysis on inaugural speeches</h2>

In [None]:
sents = do_pos_neg_sentiment_analysis([(x[0],x[1]) for x in all_addresses])
sents

In [None]:
sorted(sents,key=lambda x: x[1]-x[2],reverse=True)

<h3>Twitter APIs</h3>
<li>Sign in to the developer site using your twitter account
<li>Go to https://apps.twitter.com/app/new 
<li>Give your app a name and a description
<li>Enter anything for website (e.g., http://www.columbia.edu)
<li>Leave callback url blank
<li>Accept the terms and conditions and create an account
<li>Then, click on the "Keys and Access Tokens" tab
<li>Copy the two consumer keys to a text file
<li>Scroll down and click "Generate Access key and Secret"
<li>Copy the two access keys to the text file


In [None]:
with open("./TwitterAPIKeys.txt",'r') as token_file:
    contents = token_file.read().split('\n')
    consumer_key = contents[0]
    consumer_secret = contents[1]
    access_token = contents[2]
    access_token_secret = contents[3]
    
print(consumer_key,consumer_secret,access_token,access_token_secret,sep='\n')

<h2>tweepy</h2>
<li>A python library that interfaces with the twitter API
<li>Returns a lot of useful stuff but we'll only look at the tweet text

<h4>Set up an authentication object using OAuth and send a search request</h4>
<li>Tweepy constructs a "list like" SearchResults object
<li>Each item in SearchResults is a Status object
<li>The status object has a _json attribute that contains a json tweet
<li>http://docs.tweepy.org/en/v3.5.0/getting_started.html#

In [None]:
!pip install tweepy

In [None]:
import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
search_term = 'AAPL'
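## fill in your search query and store your results in a variable
## the standard search endpoint returns at most 100 tweets per request
## (in tweepy 4.x this method is named api.search_tweets)
results = api.search(q = search_term, lang = "en", result_type = "recent", count = 100)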

In [None]:
print(type(results))
print(len(results))
print(type(results[0]))

In [None]:
print(results[0]._json.keys())

In [None]:
print(results[0]._json['text'])

<li>We can save this to files that can then be read by a PlaintextCorpusReader
<li>The order is important
<li>We could use datetime for ordering
<li>But we'll just number them for now
<li>Saving to a file also helps us build a tweet corpus
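
Here is a small sketch of the numbering idea using only the standard library. It assumes the results arrive newest first, uses placeholder strings instead of real tweets, and writes to a temporary folder rather than the course data directory.

```python
import tempfile
from pathlib import Path

# Stand-in texts; in the notebook these would come from the search results
texts = ["newest tweet", "middle tweet", "oldest tweet"]

with tempfile.TemporaryDirectory() as folder:
    # Number the files so that the oldest item gets the lowest number
    for i, body in enumerate(texts):
        fname = "AAPL.{}".format(len(texts) - i)
        Path(folder, fname).write_text(body + "\n", encoding="utf-8")

    # Sort on the numeric suffix; a plain string sort would put AAPL.10 before AAPL.2
    paths = sorted(Path(folder).glob("AAPL.*"), key=lambda p: int(p.suffix[1:]))
    ordered = [p.read_text(encoding="utf-8").strip() for p in paths]

print(ordered)
```

Sorting on the numeric suffix matters once there are ten or more files, since file names are otherwise compared as strings.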


In [None]:
search_term = 'AAPL'
for i in range(len(results)):
    fname = search_term+'.'+str(len(results)-i)
    with open('./Class 7 - Data/tweets/'+fname,'w') as f:
        f.write(results[i]._json['text']+'\n')

<h4>Now we can do sentiment analysis on these tweets</h4>

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
tweets_root = "./Class 7 - Data/tweets"
apple_files = "AAPL.*"
apple_data = PlaintextCorpusReader(tweets_root,apple_files)

apple_data.raw()

In [None]:
do_pos_neg_sentiment_analysis([['apple',apple_data.raw()]])

In [None]:
comparative_emotion_analyzer([['apple',apple_data.raw()]],object_name='Apple Tweets',print_output=False)

<h2>Let's package this</h2>

In [None]:
def get_tweets(search_term,consumer_key=consumer_key,consumer_secret=consumer_secret,
               access_token=access_token,access_token_secret=access_token_secret):
    import tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    #The standard search endpoint returns at most 100 tweets per request
    results = api.search(q = search_term, lang = "en", result_type = "recent", count = 100)
    for i in range(len(results)):
        fname = search_term+'.'+str(len(results)-i)
        with open('./Class 7 - Data/tweets/'+fname,'w') as f:
            f.write(results[i]._json['text']+'\n')

get_tweets(search_term="GOOG")
            

<h2>And do a comparative analysis</h2>

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
tweets_root = "./Class 7 - Data/tweets"
apple_files = "AAPL.*"
google_files = "GOOG.*"
apple_data = PlaintextCorpusReader(tweets_root,apple_files)
google_data = PlaintextCorpusReader(tweets_root,google_files)
do_pos_neg_sentiment_analysis([['apple',apple_data.raw()],['google',google_data.raw()]])
comparative_emotion_analyzer([
    ['apple',apple_data.raw()],
    ['google',google_data.raw()]],
    object_name='Equity',print_output=False)

<h1>Simple analysis: Word Clouds</h1>

<h4>Let's see what sort of words the snippets use</h4>
<li>First we'll combine all snippets into one string
<li>Then we'll generate a word cloud using the words in the string
<li>You may need to install wordcloud using pip
<li>pip install wordcloud

In [None]:
!pip install wordcloud

In [None]:
all_snippets

In [None]:
text=''
for snippet in all_snippets:
    text+=snippet[1]
text

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
%matplotlib inline

wordcloud = WordCloud(stopwords=STOPWORDS,background_color='white').generate(text)

plt.imshow(wordcloud)
plt.axis('off')
plt.show()

<h2>word cloud comparison</h2>
<li>We'll remove short words and look only at words that are at least 5 letters long
<li>And then do a side by side comparison of the word clouds for our four restaurants

In [None]:
texts = restaurant_tuples
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
%matplotlib inline
#Remove unwanted words
#As we look at the cloud, we can get rid of words that don't make sense by adding them to this variable
DELETE_WORDS = []
def remove_words(text_string,DELETE_WORDS=DELETE_WORDS):
    for word in DELETE_WORDS:
        text_string = text_string.replace(word,' ')
    return text_string

#Remove short words
MIN_LENGTH = 5
def remove_short_words(text_string,min_length = MIN_LENGTH):
    #Keep only the words that have at least min_length characters
    word_list = text_string.split()
    return ' '.join(word for word in word_list if len(word) >= min_length)


#Set up side by side clouds
COL_NUM = 2
ROW_NUM = 2
fig, axes = plt.subplots(ROW_NUM, COL_NUM, figsize=(12,12))

for i in range(0,len(texts)):
    text_string = remove_words(texts[i][1])
    text_string = remove_short_words(text_string)
    ax = axes[i//2, i%2] 
    ax.set_title(texts[i][0])
    wordcloud = WordCloud(stopwords=STOPWORDS,background_color='white',width=1200,height=1000,max_words=20).generate(text_string)
    ax.imshow(wordcloud)
    ax.axis('off')
plt.show()

In [None]:
texts = [('trump',inaugural.raw('2017-Trump.txt')),('Obama',inaugural.raw('2013-Obama.txt')),
         ('Bush',inaugural.raw('2001-Bush.txt')),('Clinton',inaugural.raw('1997-Clinton.txt'))]
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
%matplotlib inline
#Remove unwanted words
#As we look at the cloud, we can get rid of words that don't make sense by adding them to this variable
DELETE_WORDS = []
def remove_words(text_string,DELETE_WORDS=DELETE_WORDS):
    for word in DELETE_WORDS:
        text_string = text_string.replace(word,' ')
    return text_string

#Remove short words
MIN_LENGTH = 5
def remove_short_words(text_string,min_length = MIN_LENGTH):
    #Keep only the words that have at least min_length characters
    word_list = text_string.split()
    return ' '.join(word for word in word_list if len(word) >= min_length)


#Set up side by side clouds
COL_NUM = 2
ROW_NUM = 2
fig, axes = plt.subplots(ROW_NUM, COL_NUM, figsize=(12,12))

for i in range(0,len(texts)):
    text_string = remove_words(texts[i][1])
    text_string = remove_short_words(text_string)
    ax = axes[i//2, i%2] 
    ax.set_title(texts[i][0])
    wordcloud = WordCloud(stopwords=STOPWORDS,background_color='white',width=1200,height=1000,max_words=20).generate(text_string)
    ax.imshow(wordcloud)
    ax.axis('off')
plt.show()

<h1>Simple Analysis: Complexity</h1>
<h4>We'll look at four complexity factors</h4>
<li>vocabulary size: the number of unique words used
<li>average word length: longer words add to complexity
<li>average sentence length: longer sentences are more complex (unless the text is rambling!)
<li>vocabulary ratio: the ratio of unique words used to the total number of words (more variety, more complexity)
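
Before bringing in nltk, here is a rough standard-library sketch of the two averages and the vocabulary ratio described above. The regex-based splitting and the function name simple_complexity are assumptions made for illustration; nltk's tokenizers handle abbreviations and punctuation far more carefully.

```python
import re

def simple_complexity(text):
    """Return (average word length, average sentence length, vocabulary ratio)."""
    # Crude regex tokenization; an assumption standing in for nltk's tokenizers
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    vocab_ratio = len({w.lower() for w in words}) / len(words)
    return avg_word_len, avg_sent_len, vocab_ratio

sample = "The soup was good. The soup was hot. We left happy!"
print(simple_complexity(sample))
```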

<b>token:</b> A sequence (or group) of characters of interest. For example, in the analysis below, a token = a word
<li>Generally: A token is the base unit of analysis</li>
<li>So, the first step is to convert the text into tokens and an nltk Text object</li>

In [None]:
#Construct tokens (words/sentences) from the text
text = restaurants_data['le_monde'].raw()
import nltk
from nltk import sent_tokenize,word_tokenize 
sentences = nltk.Text(sent_tokenize(text))
print(len(sentences))
words = nltk.Text(word_tokenize(text))
print(len(words))

In [None]:
num_chars=len(text)
num_words=len(word_tokenize(text))
num_sentences=len(sent_tokenize(text))
vocab = {x.lower() for x in word_tokenize(text)}
print(num_chars,int(num_chars/num_words),int(num_words/num_sentences),(len(vocab)/num_words))


<h4>Functionalize this</h4>

In [None]:
def get_complexity(text):
    num_chars=len(text)
    num_words=len(word_tokenize(text))
    num_sentences=len(sent_tokenize(text))
    vocab = {x.lower() for x in word_tokenize(text)}
    return len(vocab),int(num_chars/num_words),int(num_words/num_sentences),len(vocab)/num_words

In [None]:
get_complexity(restaurants_data['le_monde'].raw())

In [None]:
for text in restaurant_tuples:
    (vocab,word_size,sent_size,vocab_to_text) = get_complexity(text[1])
    print("{0:15s}\t{1:1.2f}\t{2:1.2f}\t{3:1.2f}\t{4:1.2f}".format(text[0],vocab,word_size,sent_size,vocab_to_text))

<h3>Comparing complexity of restaurant reviews won't get us anything useful</h3>
<h3>Let's look at something more useful</h3>

<h4>Let's look at the complexity of the speeches by four presidents</h4>

In [None]:
inaugural_texts = [('trump',inaugural.raw('2017-Trump.txt')),
         ('obama',inaugural.raw('2013-Obama.txt')),
         ('jackson',inaugural.raw('1829-Jackson.txt')),
         ('washington',inaugural.raw('1789-Washington.txt'))]
for text in inaugural_texts:
    (vocab,word_size,sent_size,vocab_to_text) = get_complexity(text[1])
    print("{0:15s}\t{1:1.2f}\t{2:1.2f}\t{3:1.2f}\t{4:1.2f}".format(text[0],vocab,word_size,sent_size,vocab_to_text))

<h2>Analysis over time</h2>


<h3>The files are arranged over time so we can analyze how complexity has changed between Washington and Trump</h3>

In [None]:
from nltk.corpus import inaugural
sentence_lengths = list()
for fileid in inaugural.fileids():
    sentence_lengths.append(get_complexity(' '.join(inaugural.words(fileid)))[2])
plt.plot(sentence_lengths)

<h1>dispersion plots</h1>
<h2>Dispersion plots show where each word occurs across the text, so clusters of marks show how its frequency changes</h2>
<h3>Let's see how the frequency of some words has changed over the course of the republic</h3>
<h3>That should give us some idea of how the focus of the nation has changed</h3>
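
The data behind a dispersion plot is simply the position (word offset) of every occurrence of each target word. A minimal sketch, with a made-up sentence and a hypothetical helper named word_offsets:

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {t: [] for t in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Made-up token list for illustration
tokens = "we the people of america trust the people and freedom".split()
print(word_offsets(tokens, ["people", "freedom"]))
```

Plotting these offsets on a horizontal axis, one row per word, gives the picture that dispersion_plot draws.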

In [None]:
text4.dispersion_plot(["government", "citizen", "freedom", "duties", "America",'independence','God','patriotism'])

<h4>You can use dispersion plots to identify important characters in a book</h4>
<li>The main characters in Sense and Sensibility are Elinor, Marianne, Edward, and Willoughby
<li>Who are the more important, the men or the women?

In [None]:
text2.dispersion_plot(['Elinor','Marianne','Edward','Willoughby'])

<h4>Of the characters in the book, which are likely major and which minor?</h4>

In [None]:
text2.dispersion_plot(['Elinor','Marianne','Edward','Willoughby','Brandon','Fanny'])

<h2>Stemming</h2>

<h4>We may want to use word stems rather than the part-of-speech form of each word</h4>
<li>For example: patriot, patriotic, patriotism all express roughly the same idea
<li>nltk has a stemmer that implements the "Porter Stemming Algorithm" (https://tartarus.org/martin/PorterStemmer/)
<li>We'll push everything to lowercase as well
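
To get a feel for what stemming does, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies a much richer set of ordered rules; the suffix list and the function crude_stem are assumptions chosen only so that the patriot example works.

```python
# Illustrative suffix list; the real Porter algorithm uses a much richer rule set
SUFFIXES = ("ism", "ic", "s")

def crude_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["Patriot", "patriotic", "patriotism"]])
```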

In [None]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
text = inaugural.raw()
striptext = text.replace('\n\n', ' ')
striptext = striptext.replace('\n', ' ')
sentences = sent_tokenize(striptext)
words = word_tokenize(striptext)
text = nltk.Text([p_stemmer.stem(i).lower() for i in words])
text.dispersion_plot(["govern", "citizen", "free", "america",'independ','god','patriot'])

<h2>Weighted sentiment analysis using Vader</h2>
<h4>Vader contains a list of 7500 features weighted by how positive or negative they are</h4>
<h4>It uses these features to calculate stats on how positive, negative and neutral a passage is</h4>
<h4>And combines these results to give a compound sentiment (higher = more positive) for the passage</h4>
<h4>Human trained on twitter data and generally considered good for informal communication</h4>
<h4>10 humans rated each feature in each tweet in context from -4 to +4</h4>
<h4>Calculates the sentiment in a sentence using word order analysis</h4>
<li>"marginally good" will get a lower positive score than "extremely good"
<h4>Computes a "compound" score based on heuristics (between -1 and +1)</h4>
<h4>Includes sentiment of emoticons, punctuation, and other 'social media' lexicon elements</h4>
    <h4>For more see http://datameetsmedia.com/vader-sentiment-analysis-explained/</h4> 
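
One way to keep a compound score between -1 and +1 is to squash the summed word valences with x / sqrt(x^2 + alpha). To my understanding this is the approach the vaderSentiment package takes, with alpha = 15, but treat both the formula and the constant as assumptions here. The summed valences below are hypothetical.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Hypothetical summed valences for four sentences
for total in (-4.0, 0.0, 1.5, 6.0):
    print(total, round(normalize(total), 3))
```

Larger sums move the score toward +1 or -1 without ever reaching them, so a passage with many strongly positive words still gets a bounded compound score.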


In [None]:
!pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
headers = ['pos','neg','neu','compound']
texts = restaurant_tuples
analyzer = SentimentIntensityAnalyzer()
for i in range(len(texts)):
    name = texts[i][0]
    sentences = sent_tokenize(texts[i][1])
    pos=compound=neu=neg=0
    for sentence in sentences:
        vs = analyzer.polarity_scores(sentence)
        
        pos+=vs['pos']/(len(sentences))
        neu+=vs['neu']/(len(sentences))
        neg+=vs['neg']/(len(sentences))
        compound+=vs['compound']/(len(sentences))
    print(name,pos,neg,neu,compound)

<h4>And functionalize this as well</h4>

In [None]:
def vader_comparison(texts):
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    headers = ['pos','neg','neu','compound']
    print("Name\t",'  pos\t','neg\t','neu\t','compound')
    analyzer = SentimentIntensityAnalyzer()
    for i in range(len(texts)):
        name = texts[i][0]
        sentences = sent_tokenize(texts[i][1])
        pos=compound=neu=neg=0
        for sentence in sentences:
            vs = analyzer.polarity_scores(sentence)
            
            pos+=vs['pos']/(len(sentences))
            neu+=vs['neu']/(len(sentences))
            neg+=vs['neg']/(len(sentences))
            compound+=vs['compound']/(len(sentences))
        print('%-10s'%name,'%1.2f\t'%pos,'%1.2f\t'%neg,'%1.2f\t'%neu,'%1.2f\t'%compound)

In [None]:
vader_comparison(restaurant_tuples)

In [None]:
vader_comparison(inaugural_texts)

<h1>Named Entity Detection</h1>
<h4>People, places, organizations</h4>
Named entities are often the subject of sentiments so identifying them can be very useful

<h4>Named entity detection is based on Part-of-speech tagging of words and chunks (groups of words)</h4>
<li>Start with sentences (using a sentence tokenizer)
<li>tokenize words in each sentence
<li>chunk them. ne_chunk identifies likely chunked candidates (ne = named entity)
<li>Finally build chunks using nltk's guess on what members of chunk represent (people, place, organization)
<li>The English pickle in the 'punkt' package is a pre-trained sentence tokenizer for English
<li>We can load it and then use it to split a text into its constituent sentences

In [None]:
import nltk
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sample_text = """
I was walking along thinking of many things. For e.g., I walked with my friend Bilkees Bijou through the campus of Columbia University. I 
thought of birds, of bees, of sealing wax. I thought of cabbages and kings.
"""
sent_detector.tokenize(sample_text)

<h4>word_tokenize collects the words from a sentence</h4>

In [None]:
word_list = nltk.word_tokenize(sent_detector.tokenize(sample_text)[1])
word_list

<h4>pos_tag tags the word with nltk's best guess as to the part of speech</h4>
<li>nltk uses Penn Treebank tagging</li>
<li>https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
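
For reference, a short lookup table for a few common Penn Treebank tags. This is only a sketch: the table is a small subset, and the (word, tag) pairs are hand-written in the shape pos_tag returns, not actual tagger output.

```python
# A few common Penn Treebank tags and what they stand for
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "VBD": "verb, past tense",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "JJ": "adjective",
}

# Hand-written (word, tag) pairs; illustrative only
example = [("I", "PRP"), ("walked", "VBD"), ("with", "IN"), ("my", "PRP$"), ("friend", "NN")]
for word, tag in example:
    # Tags missing from the short table fall back to a placeholder
    print(word, tag, PENN_TAGS.get(tag, "(not in this short table)"))
```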

In [None]:
nltk.pos_tag(word_list)

<h4>ne_chunk creates a "Sentence Tree" from a list of part-of-speech tagged words</h4>
<li>chunks that are candidate entities are subtrees with a label() method; ordinary words stay as plain (word, tag) tuples
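
To see the shape of such a tree without running the tagger, here is a hand-built stand-in. The tags and entity labels are written by hand for illustration and are not model output.

```python
from nltk.tree import Tree

# Hand-built stand-in for ne_chunk output (illustrative labels, not model output)
sentence_tree = Tree("S", [
    ("my", "PRP$"),
    ("friend", "NN"),
    Tree("PERSON", [("Bilkees", "NNP"), ("Bijou", "NNP")]),
    ("through", "IN"),
    Tree("ORGANIZATION", [("Columbia", "NNP"), ("University", "NNP")]),
])

entities = []
for node in sentence_tree:
    # Plain (word, tag) tuples have no label(); entity subtrees do
    if hasattr(node, "label"):
        name = " ".join(word for word, tag in node.leaves())
        entities.append((name, node.label()))

print(entities)
```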

In [None]:
tagged = nltk.pos_tag(word_list)
chunked = nltk.ne_chunk(tagged)
chunked
#chunked[-2]

<h4>We're going to use hasattr() to select items from the Tree. </h4>
hasattr() is a Python built-in function that checks whether an object has an attribute with a given name (passed as a string)

In [None]:
class my_class(object):
    def __init__(self,x):
        self.name = x
    def check(self):
        return self.name

y = my_class('Jack')
hasattr(y,'check')
# dir(y)

In [None]:
for j in chunked:
    try:
        print(j,j.label())
    except AttributeError:  # plain (word, tag) tuples have no label()
        continue

In [None]:
tagged = nltk.pos_tag(word_list)
chunked = nltk.ne_chunk(tagged)
hasattr(chunked[-2],'label')

In [None]:
en={}
try:
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = sent_detector.tokenize(sample_text)
    for sentence in sentences:
            tokenized = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(tokenized)
            chunked = nltk.ne_chunk(tagged)
            for tree in chunked:
                if hasattr(tree, 'label'):
                    ne = ' '.join(c[0] for c in tree.leaves())
                    # print(ne)
                    en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
except Exception as e:
    print(str(e))
#print(en)
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(en)

<h4>We can now do this on our actual text. Let's try with community_data</h4>

In [None]:
en={}
try:
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = sent_detector.tokenize(restaurants_data['community'].raw().strip())
    for sentence in sentences:
            tokenized = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(tokenized)
            chunked = nltk.ne_chunk(tagged)
            for tree in chunked:
                if hasattr(tree, 'label'):
                    ne = ' '.join(c[0] for c in tree.leaves())
                    en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
except Exception as e:
    print(str(e))
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(en)

<h3>And functionalize this</h3>

In [None]:
def get_labeled_text(text,label_type='ALL'):
    en={}
    try:
        sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = sent_detector.tokenize(text.strip())
        for sentence in sentences:
                tokenized = nltk.word_tokenize(sentence)
                tagged = nltk.pos_tag(tokenized)
                chunked = nltk.ne_chunk(tagged)
                for tree in chunked:
                    if hasattr(tree, 'label'):
                        if not label_type == "ALL":
                            if not tree.label() == label_type:
                                continue
                        ne = ' '.join(c[0] for c in tree.leaves())
                        en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
    except Exception as e:
        return str(e)
    return en
get_labeled_text(restaurants_data['community'].raw(),'ORGANIZATION')

<h4>Assuming we've done a good job of identifying named entities, we can get an affect score on entities</h4>

In [None]:
# For example for the word service
meaningful_sents = list()
i=0
for sentence in sentences:
    if 'service' in sentence:
        i+=1
        meaningful_sents.append((i,sentence))

vader_comparison(meaningful_sents)       

<h4>We could also develop an affect calculator for common terms in our domain (e.g., food items)</h4>

In [None]:
def get_affect(text,word,lower=True):
    import nltk
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = sent_detector.tokenize(text.strip())
    sentence_count = 0
    running_total = 0
    for sentence in sentences:
        if lower: 
            sentence = sentence.lower()
            word = word.lower()
        if word in sentence:
            vs = analyzer.polarity_scores(sentence) 
            running_total += vs['compound']
            sentence_count += 1
    if sentence_count == 0: return 0
    return running_total/sentence_count
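
One thing to keep in mind: `word in sentence` is a substring test, so 'art' also matches 'party' and 'service' also matches 'services'. If whole-word matches are wanted, a regular expression with word boundaries is one option. The helper name contains_word below is made up for this sketch and is not part of get_affect.

```python
import re

def contains_word(sentence, word, lower=True):
    """Return True if word occurs in sentence as a whole word."""
    flags = re.IGNORECASE if lower else 0
    # \b marks a word boundary; re.escape guards against special characters in word
    return re.search(r"\b" + re.escape(word) + r"\b", sentence, flags) is not None

print("art" in "the party was loud")                      # substring test
print(contains_word("the party was loud", "art"))         # whole-word test
print(contains_word("The Service was slow.", "service"))  # case-insensitive by default
```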

<h4>And compare different texts on the affect score of the terms</h4>

In [None]:
print(get_affect(restaurants_data['community'].raw(),'service',True))
print(get_affect(restaurants_data['le_monde'].raw(),'service',True))
print(get_affect(restaurants_data['shakeshack'].raw(),'service',True))
print(get_affect(restaurants_data['fiveguys'].raw(),'service',True))

<h4>Or look for the "good" and "bad" characters in a piece of text</h4>

In [None]:
print(get_affect(gutenberg.raw('shakespeare-hamlet.txt'),'Gertrude',False))
print(get_affect(gutenberg.raw('shakespeare-hamlet.txt'),'Hamlet',False))
print(get_affect(gutenberg.raw('shakespeare-hamlet.txt'),'Horatio',False))

<h3>We can apply this to any text</h3>

In [None]:
get_labeled_text(inaugural.raw('2017-Trump.txt'))

In [None]:
for key in get_labeled_text(inaugural.raw('2009-Obama.txt'),'PERSON'):
    print(key,get_affect(inaugural.raw('2009-Obama.txt'),key))

In [None]:
for key in get_labeled_text(inaugural.raw('2017-Trump.txt'),'PERSON'):
    print(key,get_affect(inaugural.raw('2017-Trump.txt'),key))

<h4>The nltk function concordance prints text fragments around a word</h4>
<li>Useful for a quick look, but it prints its results rather than returning them

In [None]:
nltk.Text(restaurants_data['community'].words()).concordance('Columbia',100)

In [None]:
nltk.Text(nltk.word_tokenize(inaugural.raw('2009-Obama.txt'))).concordance('Sahn',100)
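
Because concordance prints rather than returns, the matches cannot be stored or filtered. Recent NLTK versions also provide concordance_list, which returns the matches as objects. A minimal sketch, with a tiny hand-written token list standing in for a real corpus:

```python
import nltk

# A tiny hand-written token list stands in for a real corpus here
tokens = "We walked through Columbia at noon . Later we left Columbia by train .".split()
text = nltk.Text(tokens)

# concordance_list returns ConcordanceLine objects instead of printing them
hits = text.concordance_list("Columbia", width=40)
for hit in hits:
    # offset is the token position of the match; line is the formatted context
    print(hit.offset, hit.line)
```

Each hit can then be passed on to other code, for example to score the surrounding context with Vader.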