# AnTeDe Lab 3: Sentiment Analysis - Part A

## Session goal
The goal of this session is to become familiar with Pointwise Mutual Information (PMI) and Semantic Orientation (SO). The function **get_hits** returns the approximate number of Google hits for a given **query**, parsed from the result statistics on the search results page.

In [1]:
import requests, re, logging

def get_hits (query):
    params = (
        ('hl', 'en'),
        ('q', query),
        
    )
    
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
    # The query is passed through params, so the URL itself carries no query string
    URL = "https://google.com/search"

    headers = {"user-agent": USER_AGENT}
    response = requests.get(URL, headers=headers, params=params)

    # Regular expression to retrieve the approximate number of hits from Google's response
    groups=re.findall(r'result-stats">About (.*?) results', response.text)
    
    # in case no hits were found
    if len(groups)==0:
        logging.warning('No hits found for query '+query)
        return 0
    
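    try:
        # strip thousands separators (',', '.' or '’', depending on the locale)
        result = float(groups[0].replace('.', '').replace(',', '').replace('’', ''))
    except ValueError:
        # the matched text is not a plain number
        logging.warning('Could not parse the hit count for query '+query)
        return 0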
        
    return result    
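
The parsing step can be tried out without sending any request. The following is a minimal sketch, not part of the lab code: `parse_hit_count` is a hypothetical helper, and the HTML snippets are made-up strings that only imitate the `result-stats` markup that **get_hits** searches for. It assumes that dropping every non-digit character is an acceptable way to remove separators.

```python
import re

def parse_hit_count(html):
    """Extract the approximate hit count from a results page, or return 0.0."""
    match = re.search(r'result-stats">About (.*?) results', html)
    if match is None:
        return 0.0
    # Keep digits only, which drops ',', '.', '’' and spaces alike
    digits = re.sub(r'\D', '', match.group(1))
    return float(digits) if digits else 0.0

# Hypothetical snippets imitating the markup that get_hits looks for
print(parse_hit_count('<div id="result-stats">About 6,040,000,000 results</div>'))
print(parse_hit_count('<div id="result-stats">About 6’040’000’000 results</div>'))
print(parse_hit_count('<div>no statistics here</div>'))
```

Separating the parsing from the request makes it possible to check the separator handling for several locales before relying on live results.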

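Next, we estimate the total number of pages indexed by Google by querying for three very common English words. This estimate serves as the normalising constant when hit counts are turned into probabilities.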

In [2]:
all_hits = get_hits('the AND a AND of')
print (all_hits)

6040000000.0


We define a **get_PMI** function based on **get_hits**. It returns the PMI in bits (base-2 logarithm) together with the raw hit counts used to compute it.

In [3]:
import math

def get_PMI (w1, w2, verbose=True):  
    # requests URL-encodes the query, so write AND with plain spaces (as in the all_hits query)
    joint = get_hits(w1+' AND '+w2)
    pw1 = get_hits(w1)
    pw2 = get_hits(w2)

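    # observed co-occurrence relative to what independence would predict, in bits
    # (raises an error if any of the three hit counts is 0)
    PMI = math.log(joint*all_hits/(pw1*pw2), 2)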
    
    if verbose:
        print ('PMI('+w1+','+w2+')='+str(round(PMI, 2)))
    
    return PMI, (joint, pw1, pw2, all_hits)
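
For reference, the quantity computed by **get_PMI** can be written as follows. Here $\mathrm{hits}(\cdot)$ stands for the value returned by **get_hits** and $N$ for **all_hits**. Probabilities are estimated as relative hit counts, which is only an approximation because the hit counts are themselves estimates.

```latex
\mathrm{PMI}(w_1, w_2)
  = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}
  \approx \log_2 \frac{\mathrm{hits}(w_1 \text{ AND } w_2) \cdot N}
                      {\mathrm{hits}(w_1) \cdot \mathrm{hits}(w_2)}
```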

Now you can experiment with PMI. Here is an example; try other word pairs of your own.

In [4]:
PMI, metrics = get_PMI ('richest', 'engineer')
PMI, metrics = get_PMI ('richest', 'data scientist')
PMI, metrics = get_PMI ('richest', 'producer')
PMI, metrics = get_PMI ('richest', 'venture capitalist')
PMI, metrics = get_PMI ('richest', 'zurich banker')

PMI(richest,engineer)=4.88
PMI(richest,data scientist)=11.11
PMI(richest,producer)=5.01
PMI(richest,venture capitalist)=10.16
PMI(richest,zurich banker)=9.83
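
To help read these values: a PMI of 0 means that the two expressions co-occur exactly as often as independence would predict, and each additional unit doubles the ratio of observed to expected co-occurrence. The sketch below recomputes PMI from raw counts without querying Google. `pmi_from_counts` and its counts are hypothetical and chosen only to illustrate the scale; returning negative infinity for a zero joint count is one possible convention, not something prescribed by the lab.

```python
import math

def pmi_from_counts(joint, count1, count2, total):
    """PMI in bits from raw counts; -inf when the two items never co-occur."""
    if count1 <= 0 or count2 <= 0 or total <= 0:
        raise ValueError("marginal counts and total must be positive")
    if joint <= 0:
        return float("-inf")
    return math.log2(joint * total / (count1 * count2))

# Hypothetical counts out of 1000 documents
print(pmi_from_counts(5, 100, 50, 1000))   # co-occurrence at chance level
print(pmi_from_counts(20, 100, 50, 1000))  # four times more often than chance
```

The same function can be applied to the `metrics` tuple returned by **get_PMI** to check a result by hand.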


Write a function that computes the semantic orientation (SO) of a phrase, following the definition in the (Turney, 2002) paper we discussed in class.

In [5]:
def get_SO (phrase, positive_word='excellent', negative_word='poor', verbose=True):
    SO="INCOMPLETE"
    # BEGIN_REMOVE
    PMI_positive, metrics = get_PMI(phrase, positive_word, False)
    PMI_negative, metrics = get_PMI(phrase, negative_word, False)
    SO=PMI_positive-PMI_negative
    if verbose:
        print ('SO('+phrase+')='+str(round(SO, 2)))
    # END_REMOVE
    
    return SO
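
Written out, the difference of the two PMI values simplifies, because the hit count of the phrase and the total number of hits $N$ appear in both terms and cancel. In the notation below, $w_{+}$ and $w_{-}$ stand for **positive_word** and **negative_word**. Note that Turney's definition uses the NEAR operator, whereas this notebook uses AND, so the values are not directly comparable with those in the paper.

```latex
\mathrm{SO}(\mathit{phrase})
  = \mathrm{PMI}(\mathit{phrase}, w_{+}) - \mathrm{PMI}(\mathit{phrase}, w_{-})
  = \log_2 \frac{\mathrm{hits}(\mathit{phrase} \text{ AND } w_{+}) \cdot \mathrm{hits}(w_{-})}
                {\mathrm{hits}(\mathit{phrase} \text{ AND } w_{-}) \cdot \mathrm{hits}(w_{+})}
```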

Try it out on the examples from the paper and see whether you can modify it to get better results.

In [6]:
get_SO('local branch')
get_SO('online experience')
get_SO('inconveniently located')
get_SO('unethical practices')

SO(local branch)=1.47
SO(online experience)=2.31
SO(inconveniently located)=-0.47
SO(unethical practices)=-0.54


-0.5411561580449034

In [7]:
get_SO('unethical practices', negative_word='terrible bank')

SO(unethical practices)=-3.94


-3.9383374946642764
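
One possible modification, sketched here under stated assumptions, is to compute SO directly from four hit counts. In the difference of the two PMI values, the hit count of the phrase and the total number of hits cancel, so fewer queries are needed and the estimate of **all_hits** no longer matters. `so_from_counts` is a hypothetical helper and the counts are made up; the default `smoothing=0.01` is an assumption modelled on the small constant that Turney adds to the hit counts to avoid division by zero.

```python
import math

def so_from_counts(joint_pos, joint_neg, pos_hits, neg_hits, smoothing=0.01):
    """Semantic orientation in bits from four hit counts.

    joint_pos / joint_neg: hits for the phrase together with the positive / negative word
    pos_hits / neg_hits:   hits for the positive / negative word alone
    """
    if pos_hits <= 0 or neg_hits <= 0:
        raise ValueError("reference words must have a positive hit count")
    numerator = (joint_pos + smoothing) * neg_hits
    denominator = (joint_neg + smoothing) * pos_hits
    return math.log2(numerator / denominator)

# Hypothetical counts: the phrase appears four times as often with the positive word
print(round(so_from_counts(40, 10, 1000, 1000), 2))
# With no co-occurrences at all, smoothing yields a neutral orientation
print(so_from_counts(0, 0, 1000, 1000))
```

In practice the four counts would come from **get_hits**, for example `get_hits(phrase+' AND '+positive_word)` for the first argument.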