# Extracting Tweets from Tweet IDs

In [1]:
import numpy as np 
import pandas as pd 

from twython import Twython

from tqdm import tqdm

In [2]:
tweet_ids = pd.read_csv('tweet_ids/2015_Nepal_Earthquake_en/2015_nepal_eq_cf_labels.csv')

In [3]:
tweet_ids.head()

Unnamed: 0,label,tweet_id
0,other_useful_information,'591902695562170368'
1,infrastructure_and_utilities_damage,'591902695822331904'
2,injured_or_dead_people,'591902695943843840'
3,missing_trapped_or_found_people,'591902696371724288'
4,sympathy_and_emotional_support,'591902696375877632'


I want an additional column holding the actual tweet text next to each tweet_id.

Using Twython

In [4]:
CONSUMER_KEY = 'I deleted these keys'
CONSUMER_SECRET = 'If you want to generate some'

OAUTH_TOKEN = 'register an app at https://apps.twitter.com/ '
OAUTH_SECRET = 'its super easy!'

In [5]:
twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_SECRET)

In [6]:
test_id = tweet_ids.tweet_id[0][1:-1]
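The IDs in this file are stored as strings wrapped in single quotes, which is why the cell above slices with `[1:-1]`. As a rough sketch, the same cleanup could be applied to the whole column at once with pandas string methods. The two-row frame and the `clean_id` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the quoted IDs in the labels file
sample = pd.DataFrame({
    'label': ['other_useful_information', 'injured_or_dead_people'],
    'tweet_id': ["'591902695562170368'", "'591902695943843840'"],
})

# Strip the surrounding single quotes from every ID in one step
sample['clean_id'] = sample['tweet_id'].str.strip("'")

print(sample['clean_id'].tolist())
```

Keeping the IDs as strings also avoids any precision loss that could come from parsing 18-digit numbers as floats.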

A variety of errors can occur when using this API, for example when the tweet's author has been suspended:

In [7]:
tweet = twitter.show_status(id=test_id)

TwythonError: Twitter API returned a 403 (Forbidden), User has been suspended.

I still want the loop over all IDs to continue when an error is raised, so I'll use Python's try/except to catch any errors.

In [8]:
try:
    tweet = twitter.show_status(id=test_id)
except Exception:
    print("oh rats")

oh rats
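Catching every exception also hides unrelated bugs, so it may be worth narrowing what gets caught. Below is a minimal sketch of a helper that takes the lookup function and the exception classes to swallow as arguments. The names `safe_fetch_text` and `fake_show_status` are hypothetical, and the fake lookup only stands in for the real call so the example runs without credentials. With a real client, one would presumably pass `twitter.show_status` as `fetch` and `TwythonError` (importable from `twython`) as `errors`.

```python
def safe_fetch_text(fetch, tweet_id, errors=(Exception,)):
    """Return the tweet text, or None if the lookup raises one of `errors`."""
    try:
        return fetch(id=tweet_id)['text']
    except errors as exc:
        # Report the reason so failures are visible rather than silently dropped
        print('could not fetch {}: {}'.format(tweet_id, exc))
        return None


# Stand-in for twitter.show_status (assumption: same keyword argument `id`)
def fake_show_status(id):
    if id == '1':
        raise RuntimeError('User has been suspended.')
    return {'text': 'hello from tweet ' + id}


print(safe_fetch_text(fake_show_status, '1'))  # lookup fails, gives None
print(safe_fetch_text(fake_show_status, '2'))  # lookup succeeds
```

Any exception type not listed in `errors` still propagates, which makes genuine programming mistakes easier to spot.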


In [9]:
test_id = tweet_ids.tweet_id[1][1:-1]
try:
    tweet = twitter.show_status(id=test_id)
    print(tweet['text'])
except Exception:
    print("oh rats")

RT @DailySabah: #LATEST #Nepal's Kantipur TV shows at least 21 bodies lined up on ground after 7.9 earthquake
http://t.co/opoQLUkYAN http:/…


Excellent. Let's make a new CSV file with the tweet text filled in for every ID, and None wherever an error was raised.

In [6]:
tweet_ids['tweet_texts'] = u''

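Note: because of the limited number of calls I could make to Twitter's API, I had to stagger this across several runs. The loop below starts at row 2400 because the earlier rows were filled in by previous runs.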

In [37]:
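for i in tqdm(range(2400, len(tweet_ids))):
    # Strip the surrounding quote characters from the stored ID
    individual_id = tweet_ids.tweet_id.iloc[i][1:-1]
    try:
        tweet = twitter.show_status(id=individual_id)['text']
    except Exception:
        # Deleted tweets, suspended users, rate limits, etc.
        tweet = None
    # DataFrame.set_value was removed in pandas 1.0; .at is its replacement
    tweet_ids.at[i, 'tweet_texts'] = tweet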

100%|██████████| 618/618 [01:11<00:00,  8.59it/s]
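Rather than re-running the cell by hand with a new start index each time, the staggering could be automated. The following is only a sketch: `chunked`, `fetch_in_batches`, `batch_size` and `pause_seconds` are hypothetical names, and suitable values depend on the rate limit that applies to your credentials, which is not stated here. The demo uses a dummy fetcher and a zero-second pause so that it runs instantly.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(ids, fetch_one, batch_size, pause_seconds):
    """Call `fetch_one` on every ID, sleeping between batches."""
    results = []
    batches = list(chunked(ids, batch_size))
    for n, batch in enumerate(batches):
        results.extend(fetch_one(tweet_id) for tweet_id in batch)
        if n < len(batches) - 1:
            time.sleep(pause_seconds)  # wait before starting the next batch
    return results


# Demo with a dummy fetcher and no real waiting
texts = fetch_in_batches(['a', 'b', 'c', 'd', 'e'], str.upper,
                         batch_size=2, pause_seconds=0)
print(texts)
```

In practice `fetch_one` would wrap the API call and return None on failure, as in the loop above.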


In [52]:
tweet_ids.to_csv('tweet_ids/2015_Nepal_Earthquake_en/string_filled_tweets.csv', encoding = 'utf-8')

Now I want to remove all rows for which no tweet text could be extracted.

In [59]:
stripped_tweets = tweet_ids[pd.notnull(tweet_ids.tweet_texts)]
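For reference, the boolean mask above does the same job as `DataFrame.dropna` restricted to one column. A small sketch on a made-up three-row frame, which also shows how the retained fraction can be computed:

```python
import pandas as pd

# Hypothetical frame: the second lookup failed and left None behind
sample = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'tweet_texts': ['first tweet', None, 'third tweet'],
})

by_mask = sample[pd.notnull(sample.tweet_texts)]
by_dropna = sample.dropna(subset=['tweet_texts'])

print(by_mask.equals(by_dropna))       # both approaches keep the same rows
print(len(by_dropna) / len(sample))    # fraction of rows with text
```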

In [72]:
stripped_tweets.to_csv('tweet_ids/2015_Nepal_Earthquake_en/stripped_filled_tweets.csv', encoding = 'utf-8')

I load the data back up to make sure everything is okay: 

In [74]:
read_tweets = pd.read_csv('tweet_ids/2015_Nepal_Earthquake_en/stripped_filled_tweets.csv', encoding = 'utf-8')

In [75]:
len(read_tweets)

2339
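The row count survives the round trip whichever codec is used, but the text itself may not. Since the file was written as UTF-8, it should be read back as UTF-8. The sketch below writes a throwaway file to a temporary directory (the sample text and path are invented for illustration) and compares the two decodings. It also passes `index=False`, which stops pandas from writing the row index that otherwise comes back as an `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame containing non-ASCII characters
sample = pd.DataFrame({'tweet_texts': ['Séisme au Népal']})

path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')
sample.to_csv(path, encoding='utf-8', index=False)

as_utf8 = pd.read_csv(path, encoding='utf-8')
as_latin1 = pd.read_csv(path, encoding='ISO-8859-1')

print(as_utf8.tweet_texts[0])    # matches the original text
print(as_latin1.tweet_texts[0])  # accented characters come back garbled
```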

So I was able to extract $\frac{2339}{3019} \approx 77\%$ of the tweets. This should be enough to give it a shot!