# Backend script to gather tweets mentioning impostor syndrome
  
# By Adriana



In [3]:
import tweepy

Tweepy is a Python package for accessing the Twitter API. It supports authenticating to Twitter via OAuth.

In [4]:
# Define ckey and csecret in keys.py with your application's key and secret.
from keys import *  


When creating a Twitter developer account, we get two keys: an API key and an API secret key. They are defined in keys.py
as ckey and csecret.
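
For illustration, keys.py could look like the sketch below. The values are placeholders, not real credentials. Reading them from environment variables is only one way to keep secrets out of the file, and the variable names TWITTER_API_KEY and TWITTER_API_SECRET are assumptions, not part of the original setup.

```python
# keys.py (sketch): placeholder credentials, replace with your own.
import os

# Hypothetical environment variable names; fall back to obvious placeholders.
ckey = os.environ.get("TWITTER_API_KEY", "YOUR-API-KEY")
csecret = os.environ.get("TWITTER_API_SECRET", "YOUR-API-SECRET-KEY")
```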

In [5]:
import numpy as np
import pandas as pd


Pandas is a package that provides DataFrames, tables we use to hold the fields of the tweets (which Tweepy returns as status objects), manipulate them, and save them as CSV files. Pandas is built on NumPy.

In [6]:
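import sys  # needed for sys.exit below

auth = tweepy.AppAuthHandler(ckey, csecret)
# Wait and retry automatically when Twitter's rate limit is reached.
# (wait_on_rate_limit_notify is a Tweepy 3.x argument.)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

if (not api):
    print("Can't Authenticate")
    sys.exit(-1)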


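Authentication takes place here. AppAuthHandler uses application-only authentication, so only the API key and secret are needed.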

In [7]:
searchQuery = '#impostersyndrome OR #impostorsyndrome OR imposter syndrome OR impostor syndrome'  # this is what we're searching for
maxTweets = 10000000 # Some arbitrary large number
tweetsPerQry = 100  # this is the max the API permits
fName = 'tweets.csv' # We'll store the tweets in a CSV file.

We are searching for hashtags as well as plain-text strings, covering both spellings: imposter syndrome and impostor syndrome.
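
As a sketch of how such a query can be assembled, the hypothetical helper below joins terms with OR. It wraps multi-word terms in double quotes, which asks Twitter's standard search for an exact phrase. That quoting is an assumption about the intended match and differs from the unquoted query used above.

```python
def build_query(terms):
    """Join search terms with OR, quoting multi-word terms as exact phrases."""
    parts = []
    for term in terms:
        # A term containing a space is wrapped in double quotes.
        parts.append('"{0}"'.format(term) if ' ' in term else term)
    return ' OR '.join(parts)


query = build_query(['#impostersyndrome', '#impostorsyndrome',
                     'imposter syndrome', 'impostor syndrome'])
print(query)
```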

In [10]:
import sys
import os
import csv        

We want to keep the collection of tweets in tweets.csv.
The first time we run the search, the file will not exist.

In order to avoid reloading all tweets every time, we need to keep track of which was the latest tweet retrieved (sinceId). If the file does not exist, sinceId is set to None. Otherwise we need to search for the most recent tweet ID. 

When files are updated, information is appended at the end of the file. That means that the most recent tweet is somewhere towards the end of the file, but not quite the last row of the file, since every time a group of tweets is added to tweets.csv, they are listed in decreasing order. 

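tweets.csv = [+....-][+...-]...[+...-]

Here each bracket is one group of tweets written together, + marks the newest tweet (largest ID) in the group, and - marks the oldest.

This code finds the first tweet of the last set of tweets written to tweets.csv.
An edge case is when the file holds the result of a single search, in which case the first tweet in the file is the most recent.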
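
A simpler alternative, shown here only as a sketch, is to scan every row and keep the largest numeric ID, which does not depend on how the groups are ordered. The helper name latest_tweet_id is hypothetical, the sample rows are made up, and the code assumes the ID sits in column 1, as in the files written by this script.

```python
import csv
import os
import tempfile


def latest_tweet_id(path, id_column=1):
    """Return the largest tweet ID in the CSV file, or None if there is none."""
    newest = None
    with open(path, 'rt', encoding='utf-8', newline='') as f:
        for row in csv.reader(f):
            # Skip short rows and repeated header rows (non-numeric cells).
            if len(row) <= id_column or not row[id_column].isdigit():
                continue
            tweet_id = int(row[id_column])
            if newest is None or tweet_id > newest:
                newest = tweet_id
    return newest


# Made-up sample: two groups, each with its own header row.
sample = [
    ['', 't_id', 's_name'],
    ['0', '300', 'a'],
    ['1', '200', 'b'],
    ['', 't_id', 's_name'],
    ['0', '500', 'c'],
    ['1', '400', 'd'],
]
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    csv.writer(tmp).writerows(sample)

newest = latest_tweet_id(tmp.name)
os.remove(tmp.name)
print(newest)
```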

In [11]:
exists = os.path.isfile(fName)
if exists:
    with open(fName, 'rt', encoding='utf-8') as f1:
        mycsv = list(csv.reader(f1))
    # Column 1 holds t_id (column 0 is the DataFrame index). Each group of
    # tweets starts with a header row and lists IDs in decreasing order.
    lastRowId = mycsv[-1][1]
    firstId = mycsv[1][1]      # row 0 is the header, row 1 the first tweet
    # Compare IDs as integers; comparing strings is only lexicographic.
    if int(firstId) > int(lastRowId):
        # There was only one read of tweets, so the first ID is the largest.
        sinceId = int(firstId)
    else:
        # Walk up from the end of the file until we reach the header row
        # ("t_id") of the last group. The row just below it holds that
        # group's largest ID.
        row = -2
        botId = lastRowId
        topId = mycsv[row][1]
        while topId != "t_id":
            row = row - 1
            botId = topId
            topId = mycsv[row][1]
        sinceId = int(botId)
else:
    sinceId = None

The rest of the code uses the search API to retrieve 100 tweets at a time, a limit set by Twitter.
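
The paging scheme can be sketched without touching the network. In the sketch below, fake_search is a hypothetical stand-in for the search call that mimics the two ID parameters as Twitter documents them: max_id is inclusive and since_id is exclusive. The short ID list is made up for the example.

```python
def fake_search(all_ids, count, max_id=None, since_id=None):
    """Stand-in for the search call: newest first, at most `count` IDs."""
    matches = [i for i in sorted(all_ids, reverse=True)
               if (max_id is None or i <= max_id)
               and (since_id is None or i > since_id)]
    return matches[:count]


def fetch_all(all_ids, per_query, since_id=None):
    """Page backwards through the results until nothing is left."""
    collected = []
    max_id = None
    while True:
        kwargs = {'count': per_query}
        if max_id is not None:
            # max_id is inclusive, so step one below the oldest ID seen.
            kwargs['max_id'] = max_id - 1
        if since_id is not None:
            kwargs['since_id'] = since_id
        page = fake_search(all_ids, **kwargs)
        if not page:
            break
        collected.extend(page)
        max_id = page[-1]  # oldest ID retrieved so far
    return collected


ids = [101, 102, 103, 104, 105, 106, 107]
result = fetch_all(ids, per_query=3, since_id=102)
print(result)
```

Building the keyword arguments in a dictionary is one way to avoid writing a separate call for each combination of max_id and since_id.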

The new_tweets come back as objects, but we want to store them in a CSV file with a header naming the relevant fields, so that every row corresponds to one tweet. The following code parses the tweet objects into an in-memory table called a DataFrame. The DataFrame has columns for tweet ID, user name, tweet text, location, image, and date, which are the fields needed to reconstruct the URL of a tweet and preview it on the map.

The parsed object data is then written to tweets.csv with data.to_csv.
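
To make the parsing step concrete, here is a minimal sketch that builds the same six columns from made-up stand-ins for the status objects. The SimpleNamespace objects and their values are illustrative only and carry just the attributes read by this script.

```python
from types import SimpleNamespace

import pandas as pd

# Made-up stand-ins for Tweepy status objects (only the fields used here).
fake_tweets = [
    SimpleNamespace(id=2, text='second tweet', created_at='2019-01-02',
                    user=SimpleNamespace(
                        screen_name='bob', location='Paris',
                        profile_image_url='http://example.com/b.png')),
    SimpleNamespace(id=1, text='first tweet', created_at='2019-01-01',
                    user=SimpleNamespace(
                        screen_name='alice', location='Lima',
                        profile_image_url='http://example.com/a.png')),
]

# One column per field, one row per tweet.
data = pd.DataFrame({
    't_id': [t.id for t in fake_tweets],
    's_name': [t.user.screen_name for t in fake_tweets],
    't_text': [t.text for t in fake_tweets],
    'u_location': [t.user.location for t in fake_tweets],
    'image_url': [t.user.profile_image_url for t in fake_tweets],
    't_date': [t.created_at for t in fake_tweets],
})

# With no path, to_csv returns the CSV text instead of writing a file.
csv_text = data.to_csv()
print(csv_text)
```

By default to_csv also writes the DataFrame index as an unnamed first column, which is why the tweet ID ends up in column 1 of the file rather than column 0.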

In [13]:
# If only results below a specific ID are wanted, set max_id to that ID.
# Otherwise default to no upper limit and start from the most recent tweet matching the search query.
max_id = -1

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'a') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                break
            data = pd.DataFrame(data=[tweet.id for tweet in new_tweets], columns=['t_id'])
            data['s_name'] = np.array([tweet.user.screen_name for tweet in new_tweets])
            data['t_text'] = np.array([tweet.text for tweet in new_tweets])
            data['u_location'] = np.array([tweet.user.location for tweet in new_tweets])
            data['image_url'] = np.array([tweet.user.profile_image_url for tweet in new_tweets])
            data['t_date'] = np.array([tweet.created_at for tweet in new_tweets])
            data.to_csv(fName, encoding='utf-8-sig',  mode='a')
            tweetCount += len(new_tweets)
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break


Downloading max 10000000 tweets


Finally, a message with the total number of tweets retrieved is printed.

In [15]:
print ("Downloaded {0} tweets. Saved to {1}".format(tweetCount, fName))

Downloaded 1906 tweets. Saved to tweets.csv
