# OAuth Exercise

In this exercise we will scrape Twitter data and run a tf-idf analysis on it (src-uwes twitter analysis). This requires OAuth authentication, and we will follow an approach similar to the one detailed in the Yelp analysis notebook.

In [1]:
import oauth2 as oauth
import urllib2 as urllib
import json, operator
import numpy as np
import pandas as pd

We now need Twitter API access. The following steps, as available online, will help you set up your Twitter account and access the live 1% stream.

1. Create a Twitter account if you do not already have one.
2. Go to https://dev.twitter.com/apps and log in with your Twitter credentials.
3. Click "Create New App".
4. Fill out the form and agree to the terms. Put in a dummy website if you don't have one you want to use.
5. On the next page, click the "API Keys" tab along the top, then scroll all the way down until you see the section "Your Access Token".
6. Click the button "Create My Access Token". You can read more about OAuth authorization online.

Save the values of api_key, api_secret, access_token_key, and access_token_secret in your vault directory and load them in the notebook as shown in the yelpSample notebook.
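One possible way to keep the keys out of the notebook is sketched below. It is only an illustration: the helper `load_twitter_keys` and the JSON file name in the commented usage are hypothetical and not part of the original exercise, which uses a small Python module instead.

```python
import json
import os


def load_twitter_keys(path):
    """Read the four OAuth credentials from a JSON file kept outside the notebook."""
    with open(os.path.expanduser(path)) as handle:
        keys = json.load(handle)
    names = ("api_key", "api_secret", "access_token_key", "access_token_secret")
    # Fail early with a clear message if any credential is absent or empty
    missing = [name for name in names if not keys.get(name)]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return tuple(keys[name] for name in names)


# Hypothetical usage; the file location is an assumption:
# api_key, api_secret, access_token_key, access_token_secret = \
#     load_twitter_keys("~/VaultDSE/twitterkeys.json")
```

Using `os.path.expanduser` avoids hard-coding an absolute home directory in the notebook.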

In [2]:
# We need to define the following variables
#api_key = <get api key>
#api_secret = <get api secret>
#access_token_key = <get your access token key here>
#access_token_secret = <get your access token secret here>

# Defining them right here is not safe. Instead, create a separate directory
# (I use ~/VaultDSE) and in it put a file called, say, twitterKeys.py whose
# content is:
#api_key = <get api key>
#api_secret = <get api secret>
#access_token_key = <get your access token key here>
#access_token_secret = <get your access token secret here>
#
#def getkeys():
#    return api_key,api_secret,access_token_key,access_token_secret

# Then use the following commands

import sys
sys.path.append('/home/sadat/Documents/DSE/VaultDSE')
import twitterKeys
api_key,api_secret,access_token_key,access_token_secret=twitterKeys.getkeys()

_debug = 0

oauth_token    = oauth.Token(key=access_token_key, secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=api_key, secret=api_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"

http_handler  = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

Below is a Twitter request function that uses the credentials above to sign and open a Twitter stream request.

In [3]:
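def getTwitterStream(url, method, parameters):
  # Build an OAuth request for the given URL, HTTP method, and parameters
  req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                             token=oauth_token,
                                             http_method=method,
                                             http_url=url, 
                                             parameters=parameters)

  # Sign the request with HMAC-SHA1 using the consumer and token secrets
  req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)

  headers = req.to_header()

  # Use the method argument rather than the global http_method
  if method == "POST":
    encoded_post_data = req.to_postdata()
  else:
    # For GET, the signed OAuth parameters are carried in the URL itself
    encoded_post_data = None
    url = req.to_url()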

  opener = urllib.OpenerDirector()
  opener.add_handler(http_handler)
  opener.add_handler(https_handler)

  response = opener.open(url, encoded_post_data)

  return response
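To make the signing step less opaque, here is a minimal sketch of how an OAuth 1.0a HMAC-SHA1 signature is derived. It is not the `oauth2` library's code, the helper names are made up for illustration, and it uses the Python 3 standard library rather than the Python 2 modules of this notebook.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode everything except letters, digits, '-', '.', '_' and '~'."""
    return quote(str(value), safe="~")


def signature_base_string(method, base_url, params):
    """Join the method, the URL and the sorted, encoded parameters with '&'."""
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(k + "=" + v for k, v in pairs)
    return "&".join([method.upper(), percent_encode(base_url), percent_encode(normalized)])


def hmac_sha1_signature(base_string, consumer_secret, token_secret):
    """Sign the base string with a key built from the two secrets."""
    key = percent_encode(consumer_secret) + "&" + percent_encode(token_secret)
    digest = hmac.new(key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Illustrative values only; real requests also carry oauth_timestamp, oauth_consumer_key, etc.
base = signature_base_string(
    "GET",
    "https://api.twitter.com/1.1/search/tweets.json",
    {"q": "syria", "oauth_nonce": "abc"},
)
signature = hmac_sha1_signature(base, "fake-consumer-secret", "fake-token-secret")
```

The server recomputes the same signature from the request it receives, so any change to the URL, the parameters, or the secrets produces a mismatch and the request is rejected.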





We can use the above function to request a response as follows

In [4]:
# Now we test the function above on the sample stream data provided by Twitter at the URL below
url = "https://stream.twitter.com/1/statuses/sample.json"

In [5]:
parameters = []
response = getTwitterStream(url, "GET", parameters)

Write a function that takes a URL and outputs the top 10 lines returned by the Twitter stream.

**Note** The response needs to be parsed carefully to extract the text data that corresponds to actual tweets. This can be done in a number of ways, and you are encouraged to try different approaches to parsing the response data.
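As one possible parsing approach, the sketch below assumes the stream delivers one JSON message per line and that messages which are not tweets (for example delete notices) have no `text` field. The function name `tweet_texts` and the sample lines are illustrative, not taken from a real response.

```python
import json


def tweet_texts(lines, limit=10):
    """Return the text of the first `limit` tweets found in newline-delimited JSON."""
    texts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        try:
            message = json.loads(raw)
        except ValueError:
            continue  # skip partial or malformed lines
        # Only messages that carry a "text" field are treated as tweets
        if isinstance(message, dict) and "text" in message:
            texts.append(message["text"])
            if len(texts) == limit:
                break
    return texts


# Hand-written sample standing in for lines read from the response
sample = [
    '{"delete": {"status": {"id": 1}}}',
    '',
    '{"text": "first tweet", "id": 2}',
    'not json',
    '{"text": "second tweet", "id": 3}',
]
print(tweet_texts(sample))
```

Because the function takes any iterable of lines, it can be tried on a saved file or a list before pointing it at a live response object.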

In [6]:
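def fetchData(url):
    response = getTwitterStream(url, "GET", [])
    lines = response.read()
    j = json.loads(lines)
    h = j['statuses']   # list of tweet objects in the search response
    # "tweets.json?q=" is 14 characters long, so [14:] leaves just the query
    print 'Stream: ',url.split('/')[-1][14:], '\n'
    # Print up to 10 tweets; slicing avoids indexing past the end of the list
    for i, status in enumerate(h[:10]):
        print i+1
        print status['text'],'\n'
    print '\n\n\n'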

In [7]:
list_query = ['UCSD', 'Donald Trump', 'Syria']

for search_query in list_query:
    #We can also request twitter stream data for specific search parameters as follows
    url= "https://api.twitter.com/1.1/search/tweets.json?q="+search_query
    fetchData(url)

Stream:  UCSD 

1
RT @Keysight: Keysight collaborates with #UCSD to demonstrate world’s first #5G, 100-200 meter communication link up to 2 Gbps: https://t.c… 

2
World's First 5G, 100 To 200 Meter Comms Link Up To 2 Gbps Demo'd By Keysight Technologies + UCSD https://t.co/n3aow4n1Iq #science 

3
RT @augmentl: HTC Vive Tour Hits NC State and UCSD Universities this Week https://t.co/ppEZMnwE1H #virtualreality #vr 

4
RT @FamilyoftheYear: San Diego 🤑 see ya tonight @ the loft #ucsd 

5
Climate models underestimate the observed deoxygenation of oxygen minimum zones - Corinne Le Quere #OceanforClimate @Scripps_Ocean #ucsd 

6
RT @Keysight: Keysight collaborates with #UCSD to demonstrate world’s first #5G, 100-200 meter communication link up to 2 Gbps: https://t.c… 

7
With organizers and speakers of the Tara Expeditions ocean science event @Scripps_Ocean #OceanforClimate #ucsd https://t.co/fUC1aslJT1 

8
RT @ArchivosEst: They still draw pictures. Drawings made by Spanish children during th

Call the fetchData function to fetch the latest live stream data for the following search queries and output the first 5 lines:

1. "UCSD"
2. "Donald Trump"
3. "Syria"
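Since one of the queries contains a space, it can help to percent-encode the query before appending it to the URL. The sketch below shows one way to do that with the Python 3 standard library. The helper `build_search_url` is hypothetical, and whether unencoded spaces cause a failure depends on the HTTP and OAuth libraries in use.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, base_url=SEARCH_URL):
    """Append the query as a percent-encoded `q` parameter."""
    # quote_via=quote encodes spaces as %20 instead of '+'
    return base_url + "?" + urlencode({"q": query}, quote_via=quote)


for search_query in ["UCSD", "Donald Trump", "Syria"]:
    print(build_search_url(search_query))
```

Characters such as `#` are also encoded this way, which matters when searching for hashtags.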

### TF-IDF

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is among the most commonly used statistical tools for word cloud analysis. You can read more about it online (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

We base our analysis on the following

1. The weight of a term that occurs in a document is simply proportional to the term frequency
2. The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs
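
These two points can be written as a formula. The version below is one common weighting, stated with raw counts and a base-10 logarithm to match the code cells in this notebook. The symbols are notation introduced here: tf(t) is the number of times term t occurs, N is the number of documents (tweets), and df(t) is the number of documents that contain t.

```latex
\mathrm{tfidf}(t) = \mathrm{tf}(t) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log_{10} \frac{N}{\mathrm{df}(t)}
```

For instance, a term that occurs twice and appears in 2 of N = 18 documents scores 2 · log10(18/2) = 2 · log10(9) ≈ 1.908.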

For this question we will perform tf-idf analysis on the stream data we retrieve for a given search parameter. Perform the steps below:

1. Use the getTwitterStream function to search for the query "syria" and save the top 200 lines in the file twitterStream.txt.
2. Load the saved file and output the count of occurrences for each term. This will be your term frequency.
3. Calculate the inverse document frequency for each term in the output above.
4. Multiply the term frequency of each term by the corresponding inverse document frequency.
5. Sort the terms in descending order of their tf-idf scores.
6. Print the top 10 terms.
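
Before working on live data, steps 2 to 4 can be tried on a few hand-written documents. The following is a minimal sketch under stated assumptions: the function names and the toy documents are invented for illustration, and tokenization is simply lowercasing, splitting on whitespace, and stripping trailing punctuation.

```python
import math
from collections import Counter


def tokenize(text, strip_chars='.,?"'):
    """Lowercase, split on whitespace, and strip trailing punctuation."""
    words = (w.rstrip(strip_chars) for w in text.lower().split())
    return [w for w in words if w]


def tf_idf(documents):
    """Score each term as (total count) * log10(N / number of documents containing it)."""
    tokenized = [tokenize(doc) for doc in documents]
    # Step 2: total occurrences of each term across all documents
    term_counts = Counter(word for words in tokenized for word in words)
    # Step 3: number of documents in which each term appears at least once
    doc_counts = Counter(word for words in tokenized for word in set(words))
    n_docs = float(len(documents))
    # Step 4: multiply term frequency by inverse document frequency
    return {term: count * math.log10(n_docs / doc_counts[term])
            for term, count in term_counts.items()}


docs = [
    "Syria vote today.",
    "Vote on Syria mission",
    "mission approved, mission ongoing",
]
scores = tf_idf(docs)
```

A term that appears in every document gets a score of zero, because log10(N/N) = 0. This is how tf-idf discounts words that do not distinguish one document from another.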

In [52]:
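#1. use the getTwitterStream function to search for the query "syria" and save the top 200 lines in the file twitterStream.txt
#   (the loop below keeps at most 100 tweets per request)
url= "https://api.twitter.com/1.1/search/tweets.json?q="+"syria"
response = getTwitterStream(url, "GET", [])
lines = response.read()
j = json.loads(lines)
h = j['statuses']
# 'a' appends, so re-running this cell adds to any tweets already in the file
with open('twitterStream.txt', 'a') as writer:
    # One tweet per line, each followed by a blank separator line
    for status in h[:100]:
        # Encode explicitly so tweets with non-ASCII characters are not silently skipped
        writer.write(status['text'].replace('\n',' ').encode('utf-8')+'\n\n')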

print 'Twitter Stream file generated'

Twitter Stream file generated


In [53]:
#2. load the saved file and output the count of occurrences for each term. This will be your term frequency

def term_frequency(name):
    '''Term Frequency: count of occurrences of each term in the file'''
    char = '.,?"'
    text = open(name, 'r')
    line = text.read()
    text.close()
    word_list=line.lower().split()
    count_dict = {}
    for word in word_list:
        # Strip one trailing punctuation character, if present
        if word[-1] in char:
            word = word[:-1]
        count_dict[word] = count_dict.get(word, 0) + 1
    return count_dict

name = 'twitterStream.txt'
# Keep the function name distinct from the result so the cell can be re-run
tf = term_frequency(name)

print 'Term Frequency:\n\n'
print tf

Term Frequency:


{'all': 1, 'bomb': 1, '@capx': 1, 'mission': 2, 'kill': 1, 'cameron': 1, 'soldiers': 1, 'bombing': 1, 'paris': 1, 'fled': 1, 'https://t.co/mlto0bdo5t': 1, 'terror': 1, 'has': 1, 'do': 1, 'branch': 1, "syria's": 1, '#iraq': 1, 'mps': 1, 'assad': 1, 'shadow': 1, 'now': 1, 'https://t.co/riqacyp2if': 1, 'https://t.co/5f03ytebbb': 1, '@iainmartin1:': 1, 'https://t.co/ul541zpbwk': 1, 'some': 1, 'related': 1, 'captive': 1, 'rt': 3, 'for': 1, 'state': 1, 'suspect': 1, 'bosses': 1, 'https://t.co/00xwx4ejyn': 1, 'vows': 1, 'free': 1, 'pressure': 1, '-abc': 1, 'https://t.co/2x64a59xmc': 1, 'on': 3, 'of': 1, 'british': 1, 'david': 1, 'thing': 1, 'syria': 7, 'https://t.co/p49r9mexj5': 1, 'think': 1, 'first': 1, 'via': 1, 'vote': 1, 'https://t.co/oakfnfqee0': 1, 'al-qaida': 1, 'troops': 1, 'to': 7, 'been': 1, 'their': 1, 'way': 1, 'releases': 1, 'exactly': 1, 'cabinet': 1, 'lebanese': 1, 'kind': 1, '#syria': 1, 'isis': 1, 'sheep-like': 1, "bonds'": 1, 'this': 1, 'https://t.co/f1d4c

In [54]:
#3. Calculate the inverse document frequency for each of the term in the output above.

def inverse_document_frequency(name):
    '''Inverse Document Frequency: log10(total lines / lines containing the term)'''
    char = '.,?"'
    docs = open(name, 'r')
    lines = docs.readlines()
    docs.close()
    # Note: this count includes the blank separator lines written in step 1
    tot_docs = len(lines)
    count_dict = {}

    #Count the number of lines (documents) in which each term occurs
    for line in lines:
        terms = set()
        for word in line.lower().split():
            # Strip one trailing punctuation character, as in the term frequency step
            if word[-1] in char:
                word = word[:-1]
            terms.add(word)
        for term in terms:
            count_dict[term] = count_dict.get(term, 0) + 1

    #IDF calculation
    for key in count_dict:
        count_dict[key] = np.log10(float(tot_docs) / float(count_dict[key]))

    return count_dict


name = 'twitterStream.txt'
# Keep the function name distinct from the result so the cell can be re-run
idf = inverse_document_frequency(name)
print 'Inverse Document Frequency:\n\n'
print idf

Inverse Document Frequency:


{'all': 1.255272505103306, 'bomb': 1.255272505103306, '@capx': 1.255272505103306, 'mission': 0.95424250943932487, 'kill': 1.255272505103306, 'cameron': 1.255272505103306, 'soldiers': 1.255272505103306, 'bombing': 1.255272505103306, 'paris': 1.255272505103306, 'fled': 1.255272505103306, 'https://t.co/mlto0bdo5t': 1.255272505103306, 'terror': 1.255272505103306, 'has': 1.255272505103306, 'do': 1.255272505103306, 'branch': 1.255272505103306, "syria's": 1.255272505103306, '#iraq': 1.255272505103306, 'mps': 1.255272505103306, 'assad': 1.255272505103306, 'shadow': 1.255272505103306, 'now': 1.255272505103306, 'https://t.co/riqacyp2if': 1.255272505103306, 'https://t.co/5f03ytebbb': 1.255272505103306, '@iainmartin1:': 1.255272505103306, 'https://t.co/ul541zpbwk': 1.255272505103306, 'some': 1.255272505103306, 'related': 1.255272505103306, 'captive': 1.255272505103306, 'rt': 0.77815125038364363, 'for': 1.255272505103306, 'state': 1.255272505103306, 'suspect': 1.255272

In [55]:
#4. Multiply the term frequency for each of the term by corresponding inverse document frequency.

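def compute_tfidf(tf_dict, idf_dict):
    '''TF-IDF: term frequency multiplied by inverse document frequency'''
    tfidf_dict = {}
    for term in tf_dict.keys():
        tfidf_dict[term] = tf_dict[term] * idf_dict[term]
    return tfidf_dict

# Distinct function and result names let the cell be re-run
tfidf = compute_tfidf(tf, idf)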
print 'Term Frequency - Inverse Document Frequency:\n\n'
tfidf

Term Frequency - Inverse Document Frequency:




{'#clinton': 1.255272505103306,
 '#iraq': 1.255272505103306,
 '#syria': 1.255272505103306,
 "'james": 1.255272505103306,
 '-abc': 1.255272505103306,
 '@capx': 1.255272505103306,
 '@guardiannews:': 1.9084850188786497,
 '@iainmartin1:': 1.255272505103306,
 'ages': 1.255272505103306,
 'al-qaida': 1.255272505103306,
 'all': 1.255272505103306,
 'anti-isis': 1.9084850188786497,
 'approves': 1.9084850188786497,
 'assad': 1.255272505103306,
 'been': 1.255272505103306,
 'bomb': 1.255272505103306,
 'bombing': 1.255272505103306,
 "bonds'": 1.255272505103306,
 'bosses': 1.255272505103306,
 'bows': 1.255272505103306,
 'branch': 1.255272505103306,
 'british': 1.255272505103306,
 'cabinet': 1.255272505103306,
 'cameron': 1.255272505103306,
 'captive': 1.255272505103306,
 'chief': 1.255272505103306,
 'corbyn': 1.255272505103306,
 'david': 1.255272505103306,
 'determined': 1.255272505103306,
 'do': 1.255272505103306,
 'exactly': 1.255272505103306,
 'first': 1.255272505103306,
 'fled': 1.255272505103306

In [56]:
#5. Sort the terms in descending order of their tf-idf scores

# sort_values replaces the older DataFrame.sort, which recent pandas versions no longer provide
df = pd.DataFrame(tfidf.items(),columns=['Term','TF-IDF']).sort_values(by='TF-IDF',ascending=False)

In [60]:
# Print the top 10 terms
df.head(10)

|    | Term | TF-IDF |
|----|------|--------|
| 10 | to | 4.572488 |
| 43 | syria | 2.871221 |
| 27 | rt | 2.334454 |
| 38 | on | 2.334454 |
| 76 | in | 2.334454 |
| 85 | approves | 1.908485 |
| 3 | mission | 1.908485 |
| 67 | anti-isis | 1.908485 |
| 66 | https://t.co/f1d4cl9ajo | 1.908485 |
| 82 | @guardiannews: | 1.908485 |
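
The same ranking can also be produced without pandas. The sketch below uses `operator.itemgetter` to sort a dictionary of scores by value. The helper `top_terms` is hypothetical, and the example scores are rounded values from the table above.

```python
import operator


def top_terms(scores, k=10):
    """Return the k (term, score) pairs with the highest scores, best first."""
    ranked = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[:k]


# Rounded scores used purely as example input
example = {"to": 4.57, "syria": 2.87, "rt": 2.33, "mission": 1.91}
for term, score in top_terms(example, k=3):
    print(term, score)
```

Common words such as "to" and "on" still rank highly here, which suggests that filtering stop words before scoring would give a more informative list.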
