# Connecting to the Twitter stream and harvesting ambient geospatial data

## By: David Leifer

Following the steps in this documentation will set up the development environment needed to do two things. First, you will be able to scrape Twitter data and save it as JSON in text files. Second, you will be able to send the locations found in those files to Nominatim, OpenStreetMap’s free and open source geocoding application. These steps were tested on a 2015 Apple MacBook Pro Retina running macOS Mojave Version 10.14.3.

### PART 1: THE STREAMING SCRIPT

1.) Download Anaconda and Jupyter Notebook

The streaming script needs Python 3.5 or later. The Anaconda installer below bundles Python 3.7, so you do not need to install Python separately.


![anaconda](anaconda.png "Title")

Go to https://www.anaconda.com/distribution/. Click on macOS (your operating system) and download the Python 3.7 version 64-Bit Graphical Installer. You need at least 652.7 MB of space to do this. Double click the .pkg that you just downloaded and click “Continue” three times and then click “Agree” to the terms. Click “Install”. This takes a few minutes.

![ana install](ana_install.png "Title")

Once this has been installed, you will be asked to install Microsoft Visual Studio Code. This is an IDE for development, but I never use it. Skip it for now by pressing “Continue”. Press “Close” and then “Move to Trash” for the Anaconda installer package. Verify that it has been correctly installed by typing “Anaconda-Navigator” in the Spotlight search bar and clicking the icon (see the figure below).

![ana nav](ana_nav.png "Title")

Click “Launch” on the Jupyter Notebook icon. This should launch Jupyter Notebook in your web browser.

![ana nav2](ana_nav2.png "Title")
![ana ju_note](ju_note.png "Title")

2.) Installing the Python 3 Libraries

Now on to installing the Python libraries. First, open up Spotlight again and enter “Terminal”. Open up the “Terminal” application and type in “pip install 'tweepy<4'”. This will download and install the Python library “tweepy” for Python 3. The version limit matters: the streaming script imports “StreamListener”, which was removed in Tweepy 4.0.

![cl](cl.png "Title")

To make sure this has installed correctly, open Jupyter Notebook and create a new Python 3 notebook. Open the notebook, copy the following into the first cell, and execute the cell:

from tweepy.streaming import StreamListener

from tweepy import OAuthHandler

from tweepy import Stream


If the cell executes, the libraries were correctly installed.
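
If the import fails even though pip reported success, the installed Tweepy may be too new for this script. The sketch below is one rough way to check; the helper name `has_stream_listener` is made up for this example, and it assumes only that `StreamListener` is absent from Tweepy 4.0 and later.

```python
def has_stream_listener(version):
    """Return True if this Tweepy version string is older than 4.0."""
    major = int(version.split(".")[0])
    return major < 4


try:
    import tweepy
except ImportError:
    print("tweepy is not installed; run: pip install 'tweepy<4'")
else:
    if has_stream_listener(tweepy.__version__):
        print("tweepy " + tweepy.__version__ + " provides StreamListener")
    else:
        print("tweepy " + tweepy.__version__ + " is too new for this script")
```

If the last message appears, reinstalling with the version limit should bring back the older interface.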

Open up the “energyhashtag1w.py” file with a text editor such as “Sublime Text” or a similar program (if you don’t have “Sublime Text”, installation instructions are found here: https://www.sublimetext.com/3). Find the variable named “python_program” and change its value to the full path of “energyhashtag1w.py” on your machine. To copy the path, select the file in Finder and press Option+Command+C, then paste it into the variable. Do the same for the variable named “the_other_python_script”, using the path of the second file, “energyhashtag2w.py”. Then open “energyhashtag2w.py” and update its path variables in the same way. These steps are essential to keep the script from crashing.

![energy.py](energy.png "Title")
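
As a possible alternative to pasting absolute paths, a script can work out its own location at run time. This is only a sketch of that idea, not what the original scripts do; the helper `sibling_script` and the `/tmp/code2` path are hypothetical.

```python
import os


def sibling_script(current_file, other_name):
    """Return the absolute path of other_name in the same folder as current_file."""
    folder = os.path.dirname(os.path.abspath(current_file))
    return os.path.join(folder, other_name)


# Inside energyhashtag1w.py, the two hard-coded paths could then become:
#   python_program = os.path.abspath(__file__)
#   the_other_python_script = sibling_script(__file__, 'energyhashtag2w.py')
print(sibling_script('/tmp/code2/energyhashtag1w.py', 'energyhashtag2w.py'))
```

With this approach the scripts keep working when the folder is moved, as long as both files stay together.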

Test to see if this works by opening up the application “Terminal”. Change directories to the folder that contains your “energyhashtag1w.py” by entering:

cd /Volumes/Untitled/Desktop/grant_work/jessica_folder/code2

The part after “cd” is the path to the folder that holds your file (not the file itself), so make sure to change it.

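Before running the script, you need to register an application on the Twitter developer site to obtain a consumer key, consumer secret, access token, and access token secret. Paste these four values into the matching variables near the top of both scripts. Here is a guide to registering: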

https://docs.inboundnow.com/guide/create-twitter-application/

Then, run “python energyhashtag1w.py” to see if the script runs correctly. It should create two files in the current folder: one ending with .txt and the other with .log. The .txt file holds the tweets as JSON, one message per line, and the .log file records warnings and errors. Congratulations, the streamer is now collecting data!
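
To confirm that the streamer is really writing usable data, you can count the JSON lines in one of the .txt files. The following is a minimal sketch; the function name `count_messages` is made up, and it assumes that a line is a tweet when it carries a `user` key, while anything else (for example a rate-limit notice) is counted separately.

```python
import json


def count_messages(path):
    """Count parsable JSON lines in a streamer output file.

    Returns (tweets, other): lines that look like tweets (they carry a
    'user' key) and other valid JSON messages such as limit notices.
    """
    tweets = 0
    other = 0
    with open(path, 'r') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                message = json.loads(line)
            except ValueError:
                continue  # skip a truncated or corrupt line
            if isinstance(message, dict) and 'user' in message:
                tweets += 1
            else:
                other += 1
    return tweets, other
```

Calling `count_messages` with the name of one of your own .txt files returns the two counts as a tuple.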



In [None]:
# this script is the same as the other one
#   except for the access_ and consumer_ keys and tokens
# Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from datetime import datetime as dt
import os
import sys
import logging
import time
import subprocess

python_program = '/Volumes/Untitled/Desktop/grant_work/jessica_folder/code2/energyhashtag1w.py'
the_other_python_script = '/Volumes/Untitled/Desktop/grant_work/jessica_folder/code2/energyhashtag2w.py'

logfilename = 'energy' + dt.now().strftime("%Y%m%d%H%M%S") + '.log'
logging.basicConfig(filename=logfilename, format='%(asctime)s %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p', level=logging.INFO)

# Variables that contains the user credentials to access Twitter API 
access_token = "XXXX"
access_token_secret = "XXX"
consumer_key = "XXX"
consumer_secret = "XXX"


# This is a basic listener that writes received tweets to files.
# Each file contains max 10000 tweets
class MyListener(StreamListener):
    MAINFILENAME = 'energy'
    MAXTWEETSINFILE = 10000

    def __init__(self, api=None):
        # Let the parent StreamListener store the api object
        super(MyListener, self).__init__(api)
        self.tweets_count = 0
        self.current_file = self.get_file()
        self.previous_file = self.current_file
        self.rate_limit_exceeded = False
        
    def on_data(self, data):
        self.tweets_count += 1
        self.current_file.write(str(data))
        if self.tweets_count >= MyListener.MAXTWEETSINFILE:
            # Close the file that is now full before opening the next one
            self.current_file.close()
            self.current_file = self.get_file()
            self.tweets_count = 0
        return True

    def on_exception(self, exception):
        print("on_exception in energyhashtag1w.py on " + dt.now().strftime("%Y%m%d%H%M%S"))
        logging.warning('on_exception in energyhashtag1w.py')
        self.running = False
        subprocess.call([sys.executable, python_program, the_other_python_script])

    def on_error(self, status):
        print("on_error in energyhashtag1w.py on " + dt.now().strftime("%Y%m%d%H%M%S"))
        self.current_file.close()
        if status == 420:
            # HTTP 420 is Twitter's signal that the client is being rate limited
            logging.warning('rate limit exceeded.')
            self.rate_limit_exceeded = True
        logging.warning('on_error in energyhashtag1w.py with status code ' + str(status))
        self.running = False
        subprocess.call([sys.executable, python_program, the_other_python_script])

    def get_file(self):
        name = MyListener.MAINFILENAME + dt.now().strftime("%Y%m%dT%H%M%S")
        name += '.txt'
        return open(name, 'a')

def stop_stream(stream):
    # disconnect() ends the stream loop; the listener itself has no stop() method
    stream.disconnect()
    stream.listener.current_file.close()

if __name__ == '__main__':

    stream = None
    try:
        # This handles Twitter authentication and the connection to
        # the Twitter Streaming API
        l = MyListener()
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)

        stream = Stream(auth, l)

        # This line filters the Twitter stream to capture tweets by keyword
        stream.filter(track=['#gasoline','#Crudeoil', '#crude', '#oil',
'#OOTT', '#OPEC', '#algae', '#biodiesel', '#bioethanol', '#Biogas', '#biofuel',
'#biofuels', '#ethanol', '#biomass', '#AdvancedBiofuels', '#RFS', '#Fuelcell',
'#hydrogen', '#FossilEnergy', '#GreenEnergy', '#thermal', '#coal', '#coalmine',
'#geothermal', '#hydroenergy', '#MethaneHydrates', '#biomethane', '#biopower',
'#natgas', '#naturalgas', '#SynthesisGas', '#nuclearenergy', '#nuclear',
'#OTEC', '#renewable', '#RenewableEnergy', '#TidalEnergy', '#WaveEnergy',
'#oceanenergy', '#shalegas', '#solarfarm', '#solar', '#SolarPower',
'#SolarEnergy', '#SolarPanels', '#WindEnergy', '#hydropower', '#hydroenergy',
'#windfarm', '#WindTurbine', '#RenewableNaturalGas', '#wastetofuel',
'#envirofuel', '#wastetoenergy', '#bioenergy', '#syngas', '#cellulosic',
'#gasoil', '#graphene', '#LPG', '#cleanenergy', '#Fracking', '#Gasification', '#syngas', '#THE'],stall_warnings=True)
    except:
        # when exceptions occur, the program did not go here
        logging.warning("main try except MyException")
        if stream is not None:
            stop_stream(stream)
        # hard-coded the path of the other script
        # os.execv('/home/dxiong/socialmedia/energyhashtag2.py', sys.argv)
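
The listener above starts a new output file after a fixed number of tweets. The standalone sketch below shows the same rotation idea without Tweepy so that it can be tried on its own. The class name `RotatingWriter` is made up, and it numbers the files with a counter instead of the timestamp used in the script.

```python
class RotatingWriter(object):
    """Write records to numbered files, starting a new file every max_records."""

    def __init__(self, prefix, max_records):
        self.prefix = prefix
        self.max_records = max_records
        self.file_index = 0
        self.count = 0
        self.current_file = self._open_next()

    def _open_next(self):
        # A counter in the name keeps file names from colliding
        name = '%s%04d.txt' % (self.prefix, self.file_index)
        self.file_index += 1
        return open(name, 'a')

    def write(self, record):
        if self.count >= self.max_records:
            # The current file is full: close it and move on to the next one
            self.current_file.close()
            self.current_file = self._open_next()
            self.count = 0
        self.current_file.write(record)
        self.count += 1

    def close(self):
        self.current_file.close()
```

Rotating just before a write, rather than just after one, means an empty file is never left behind when the stream stops exactly at the limit.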



### PART 2: THE GEOCODING SCRIPT

3.) Installing the Python 2 Libraries

Up until now, we have been using Python 3 through Anaconda to install libraries and execute the streaming script. Since our geocoding script only works with Python 2.7, we will need to learn how to switch back and forth.
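
Because both interpreters live on the same machine, it is easy to start a script with the wrong one. The following is a small sketch of a guard that could sit at the top of a script and warn when it is run under the wrong major version. The helper names are made up for this example, and the code itself runs under both Python 2 and Python 3.

```python
import sys


def interpreter_label(version_info):
    """Return a short label such as 'Python 2.7' for a version tuple."""
    return 'Python %d.%d' % (version_info[0], version_info[1])


def is_python2(version_info):
    """Return True when the major version is 2."""
    return version_info[0] == 2


if not is_python2(sys.version_info):
    print('Warning: running under ' + interpreter_label(sys.version_info)
          + '; the geocoding script expects Python 2.7.')
```

Running “python2 --version” and “python3 --version” in “Terminal” is another quick way to see which interpreters are available.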

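Open up the application “Terminal” and enter “sudo pip2 install pandas”. You will be prompted to enter your password. This should install pandas onto your machine. You will also need to enter the following into “Terminal”:

pip2 install nltk

pip2 install geocoder

pip2 install numpy

The modules json, glob, collections, and unicodedata ship with Python itself, so they do not need to be installed with pip. The script also relies on two NLTK data sets, the English stopword list and the VADER lexicon, which you can download with:

python2 -m nltk.downloader stopwords vader_lexicon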

If you receive an error saying that you do not have a module installed, you did not install the libraries correctly (or I missed one that needs to be installed).
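
Rather than discovering missing modules one error at a time, you can test all of them in one go. This is a minimal sketch, with a made-up helper name `missing_modules`; the list of names is taken from the import statements of the geocoding script.

```python
def missing_modules(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except Exception:
            # Any failure to import (absent or broken install) counts as missing
            missing.append(name)
    return missing


# Third-party modules imported by the geocoding script
required = ['pandas', 'numpy', 'nltk', 'geocoder']
print(missing_modules(required))
```

An empty list means every module was found; any name that is printed still needs a “pip2 install”.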

Now, to run the script, return to “Terminal” and change directories with “cd” to the place where you store both the “streamlined.py” script and the .txt data you want to geocode. I have included some example .txt files so you can test your script. Run “python2 streamlined.py” and the script will run on the three files. For each input file, such as “X.txt”, it creates “user_loc_X.txt.json” (one entry per user with a location) and “user_loc_X.txt.json_geo_sent.json” (the same entries with coordinates and sentiment scores added). Congratulations! You can now collect Twitter data and geocode the ambient geospatial information in it.

Let me know if you have any questions!

In [None]:
###########################################################################################
#S  				T					A 						R						 T#
###########################################################################################

import json, os
import pandas as pd
from glob import glob
import geocoder

import time
import datetime

from collections import Counter
import numpy as np
import unicodedata

from nltk.corpus import stopwords
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#for Anaconda python2.7:
#source activate py27
#data has to be in same dir as the file

#path to the current dir
#all_files to the dir of txt files
# Tweets are stored in the file "fname". In the files used for this script,
# each tweet is stored on one line
#fname = 'data2/energy20180322T083452.txt'
path = ''
all_files = glob(os.path.join(path, "*.txt"))

#loop through the all_files dir
for fname in all_files:
    with open(fname, 'r') as f:
        #http://www.mikaelbrunila.fi/2017/03/27/scraping-extracting-mapping-geodata-twitter/
        #https://opensas.wordpress.com/2013/06/30/using-openrefine-to-geocode-your-data-using-google-and-openstreetmap-api/
        #Create dictionary to later be stored as JSON. All data will be included
        # in the list 'data'
        users_with_geodata = {
            "data": []
        }
        all_users = []
        total_tweets = 0
        geo_tweets  = 0
        for line in f:
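            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            #Skip stream messages that are not tweets (limit notices, stall warnings)
            if 'user' not in tweet:
                continue
            if tweet['user']['id']: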
                total_tweets += 1 
                user_id = tweet['user']['id']
                if user_id not in all_users:
                    all_users.append(user_id)
                
                    #Give users some data to find them by. User_id listed separately 
                    # to make iterating this data later easier
                    user_data = {
                        "user_id" : tweet['user']['id'],
                        "features" : {
                            "name" : tweet['user']['name'],
                            "id": tweet['user']['id'],
                            "screen_name": tweet['user']['screen_name'],
                            "tweets" : 1,
                            "location": tweet['user']['location'],
                            "text": tweet['text'],
                            "created_at": tweet['created_at'],
                        }
                    }
                    #Iterate through different types of geodata to get the variable primary_geo
                    if tweet['coordinates']:
                        #GeoJSON order is [longitude, latitude]; store it as "lat, lon"
                        lon, lat = tweet['coordinates']['coordinates'][:2]
                        user_data["features"]["primary_geo"] = str(lat) + ", " + str(lon)
                        user_data["features"]["geo_type"] = "Tweet coordinates"
                    elif tweet['place']:
                        user_data["features"]["primary_geo"] = tweet['place']['full_name'] + ", " + tweet['place']['country']
                        user_data["features"]["geo_type"] = "Tweet place"
                    else:
                        user_data["features"]["primary_geo"] = tweet['user']['location']
                        user_data["features"]["geo_type"] = "User location"
                    #Add only tweets with some geo data to .json. Comment this if you want to include all tweets.
                    if user_data["features"]["primary_geo"]:
                        users_with_geodata['data'].append(user_data)
                        geo_tweets += 1
            
                #If user already listed, increase their tweet count
                elif user_id in all_users:
                    for user in users_with_geodata["data"]:
                        if user_id == user["user_id"]:
                            user["features"]["tweets"] += 1
    
        #Count the total amount of tweets for those users that had geodata
        # (recomputed from scratch so that first tweets are not counted twice)
        geo_tweets = sum(user["features"]["tweets"] for user in users_with_geodata["data"])
        #Get some aggregated numbers on the data
        print fname + " included " + str(len(all_users)) + " unique users who tweeted with or without geo data"
        print fname + " included " + str(len(users_with_geodata['data'])) + " unique users who tweeted with geo data, including 'location'"
        print fname + " users with geo data tweeted " + str(geo_tweets) + " out of the total " + str(total_tweets) + " of tweets."
    # Save data to JSON file
    with open('user_loc_' + fname + '.json', 'w') as fout:
        fout.write(json.dumps(users_with_geodata, indent=4))
#create a glob list of the json files in our dir
path = ''
all_files = glob(os.path.join(path, "*.json"))
#loop through the glob list of json files
for data in all_files:
    df = pd.read_json(data)
    tweets = pd.read_json((df['data']).to_json(), orient='index')
    tweets1 = pd.read_json((tweets['features']).to_json(), orient='index')
    tweets1['coord'] = 'coord'

    #create a list to append the geo data to, geocode based on location column
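    for index, row in tweets1.iterrows():
        try:
            print(row['location'])
            #Nominatim's usage policy allows at most one request per second
            time.sleep(1.01)
            g = geocoder.osm(row['location'])
            geo = g.latlng
            print(geo)
            tweets1.at[index, 'coord'] = geo
        except Exception:
            #leave the placeholder in place when a row cannot be geocoded
            pass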

    #split the coord column in y and x columns
    #tweets1['coord'] = pd.Series(coord)
    tweets1['coord'] = tweets1['coord'].astype(str)
    tweets1['coord'] = tweets1['coord'].str.strip('[]')
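    #split once on the first comma: latitude goes to y, longitude to x
    parts = tweets1['coord'].str.split(',', n=1, expand=True)
    tweets1['y'] = parts[0]
    tweets1['x'] = parts[1]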
    
    #save to a json file
    #tweets1.to_json(data + '_geo.json')
    print "Geocoded " + data
    
    #sidestep an error reading a string
    tweets1 = tweets1[tweets1['text'].notnull()]
    
    #remove stopwords
    stop = stopwords.words('english')
    tweets1['tweet_without_stopwords'] = tweets1['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
    stop =  ['The','RT','&amp;', '-', 'A', 'https:', '.', '2']
    tweets1['tweet_without_stopwords'] = tweets1['tweet_without_stopwords'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
    #remove periods
    tweets1['tweet_without_stopwords'] = tweets1['tweet_without_stopwords'].str.replace('[\.]','')
    #remove commas
    tweets1['tweet_without_stopwords'] = tweets1['tweet_without_stopwords'].str.replace('[\,]','')
    #remove -
    tweets1['tweet_without_stopwords'] = tweets1['tweet_without_stopwords'].str.replace('[-]','')
    #remove @
    tweets1['tweet_without_stopwords'] = tweets1['tweet_without_stopwords'].str.replace('[@]','')
    
    #sentiment analysis using VADER
    tweets1["compound"] = ''
    tweets1["neg"] = ''
    tweets1["neu"] = ''
    tweets1["pos"] = ''
    sid = SentimentIntensityAnalyzer()
    
    for user, row in tweets1.iterrows():
        try:
            sentence = unicodedata.normalize('NFKD', tweets1.loc[user, 'tweet_without_stopwords'])
            ss = sid.polarity_scores(sentence)
            #.at replaces DataFrame.set_value, which is deprecated in pandas
            tweets1.at[user, 'compound'] = ss['compound']
            tweets1.at[user, 'neg'] = ss['neg']
            tweets1.at[user, 'neu'] = ss['neu']
            tweets1.at[user, 'pos'] = ss['pos']
        except TypeError:
            print(tweets1.loc[user, 'tweet_without_stopwords'])
    
    #print a positive message and save the file
    print "Sentiment analyzed " + data
    tweets1.to_json(data + '_geo_sent.json')

###########################################################################################
#F  									I 												 N#
###########################################################################################
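
The geocoding script above picks a location string for each tweet in a fixed order of preference: exact tweet coordinates first, then the tagged place, then the free-text location from the user profile. The function below is a standalone sketch of that choice. Wrapping the logic in a function called `primary_geo` and the sample tweet are both made up for illustration, and the sketch is written to run under either Python 2 or Python 3.

```python
def primary_geo(tweet):
    """Return (location_text, geo_type) for one decoded tweet dictionary."""
    coordinates = tweet.get('coordinates')
    if coordinates:
        # GeoJSON points are ordered [longitude, latitude]
        lon, lat = coordinates['coordinates'][:2]
        return str(lat) + ', ' + str(lon), 'Tweet coordinates'
    place = tweet.get('place')
    if place:
        return place['full_name'] + ', ' + place['country'], 'Tweet place'
    return tweet['user']['location'], 'User location'


# A hypothetical tweet with a tagged place but no exact coordinates
example = {
    'coordinates': None,
    'place': {'full_name': 'Austin, TX', 'country': 'United States'},
    'user': {'location': 'Texas'},
}
print(primary_geo(example))
```

Keeping this choice in one small function makes it easy to test the three cases separately before running the full script on real data.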

