In [40]:
import pandas as pd
import tweepy
import requests
import json
import sys

## Gather
The following code gathers the data and stores it in pandas DataFrames.

The enhanced Twitter archive for the WeRateDogs tweet collection is provided to us as a CSV file. We load it into our environment using the `pandas.read_csv` function.

In [62]:
df_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [63]:
df_archive.shape

(2356, 17)

The dog breed predictions, along with the links to the images they were made from, are provided in another dataset at [this location](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv). We download the file using the `requests` library and load it with `pandas.read_csv`, this time with a tab separator because the file is a TSV.

In [3]:
image_predictions_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

In [6]:
req = requests.get(image_predictions_url)
with open(image_predictions_url.split('/')[-1], 'wb') as file:  # closes the file when done
    n_bytes = file.write(req.content)
n_bytes  # number of bytes written

335079
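The local file name above is taken from the last path segment of the URL. As a minimal sketch of that step, the helpers below derive the name with `urllib.parse` (so a query string would not end up in the file name) and write the bytes to disk. The names `filename_from_url` and `save_bytes` are hypothetical and not part of the original notebook.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    name = PurePosixPath(urlsplit(url).path).name
    if not name:
        raise ValueError(f"URL has no file name: {url!r}")
    return name


def save_bytes(content: bytes, url: str, directory: str = ".") -> Path:
    """Write content to a file named after the URL and return its path."""
    target = Path(directory) / filename_from_url(url)
    target.write_bytes(content)  # opens, writes, and closes the file
    return target
```

With the response from the cell above, this could be called as `save_bytes(req.content, image_predictions_url)`.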

In [10]:
df_preds = pd.read_csv('image-predictions.tsv', sep='\t')
df_preds.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


Next, we use each `tweet_id` in the archive to obtain more information about the tweet from Twitter through the `tweepy` library.

First, we set up `tweepy` and create an API object.

In [35]:
# Setup for tweepy
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
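The credentials above are shown as placeholder strings. One common way to keep real keys out of a notebook is to read them from environment variables. The sketch below assumes four variable names that are purely illustrative; the original notebook does not say where its keys are stored.

```python
import os
from typing import Mapping

# Hypothetical variable names; use whatever names your environment defines.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Return the four credentials, raising if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the hard-coded strings.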

We now fetch the status of each tweet from the API by its `tweet_id` and write the JSON of each status to `tweet_json.txt`, one tweet per line. For every tweet that cannot be retrieved, we print its ID and the error type.

In [41]:
with open('tweet_json.txt', 'w') as file:
    for tweet_id in df_archive['tweet_id']:
        try:
            tweet_status = api.get_status(tweet_id, tweet_mode='extended')
            tweet_json = tweet_status._json
            # One JSON object per line
            json.dump(tweet_json, file)
            file.write('\n')
        except Exception as e:
            # Report the tweet that could not be fetched and the error class
            print(tweet_id)
            print("Error: " + str(type(e)))

888202515573088257
Error: <class 'tweepy.error.TweepError'>
873697596434513921
Error: <class 'tweepy.error.TweepError'>
872668790621863937
Error: <class 'tweepy.error.TweepError'>
872261713294495745
Error: <class 'tweepy.error.TweepError'>
869988702071779329
Error: <class 'tweepy.error.TweepError'>
866816280283807744
Error: <class 'tweepy.error.TweepError'>
861769973181624320
Error: <class 'tweepy.error.TweepError'>
856602993587888130
Error: <class 'tweepy.error.TweepError'>
851953902622658560
Error: <class 'tweepy.error.TweepError'>
845459076796616705
Error: <class 'tweepy.error.TweepError'>
844704788403113984
Error: <class 'tweepy.error.TweepError'>
842892208864923648
Error: <class 'tweepy.error.TweepError'>
837366284874571778
Error: <class 'tweepy.error.TweepError'>
837012587749474308
Error: <class 'tweepy.error.TweepError'>
829374341691346946
Error: <class 'tweepy.error.TweepError'>
827228250799742977
Error: <class 'tweepy.error.TweepError'>
812747805718642688
Error: <class 'tweepy
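The loop above prints each failed `tweet_id` as it goes. If the failed IDs are needed later, for example to check which archive rows have no API data, they can be collected in a list instead. The following is a minimal sketch of that idea, not the notebook's own code: `dump_statuses` is a hypothetical helper, and `fake_fetch` is a stand-in for the real API call so that the example runs without Twitter credentials.

```python
import io
import json


def dump_statuses(tweet_ids, fetch, out_file):
    """Write one JSON object per line; return the IDs that could not be fetched."""
    failed = []
    for tweet_id in tweet_ids:
        try:
            status = fetch(tweet_id)
        except Exception as error:  # narrow this to the API's own error class if known
            failed.append((tweet_id, type(error).__name__))
            continue
        json.dump(status, out_file)
        out_file.write("\n")
    return failed


# Stand-in for the real API call, for illustration only.
def fake_fetch(tweet_id):
    if tweet_id == 2:
        raise LookupError("tweet not available")
    return {"id_str": str(tweet_id), "retweet_count": 0, "favorite_count": 0}


buffer = io.StringIO()
failed_ids = dump_statuses([1, 2, 3], fake_fetch, buffer)
print(failed_ids)  # [(2, 'LookupError')]
```

With the API object from this notebook, `fetch` could be `lambda i: api.get_status(i, tweet_mode='extended')._json`, and `out_file` the open `tweet_json.txt` handle.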

We then read `tweet_json.txt` line by line and store the `tweet_id`, `retweet_count`, and `favorite_count` of each tweet in a DataFrame.

In [61]:
tweet_info_list = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        tweet_json = json.loads(line)
        tweet_id = tweet_json['id_str']
        retweet_count = tweet_json['retweet_count']
        favorite_count = tweet_json['favorite_count']
        tweet_info_dict = {'tweet_id': tweet_id,
                           'retweet_count': retweet_count,
                           'favorite_count': favorite_count}
        tweet_info_list.append(tweet_info_dict)
df_tweet_info = pd.DataFrame(tweet_info_list)
df_tweet_info.head()

Unnamed: 0,favorite_count,retweet_count,tweet_id
0,36274,7725,892420643555336193
1,31279,5709,892177421306343426
2,23546,3783,891815181378084864
3,39566,7878,891689557279858688
4,37789,8495,891327558926688256
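The parsing step can also be written as a small generator that is easy to test on its own. This is a sketch under the assumption that the file holds one JSON object per line. `read_tweet_info` is a hypothetical name, and the sample line is an abbreviated record built from the first row shown above, not a full API response.

```python
import io
import json


def read_tweet_info(lines):
    """Yield one dict per non-empty JSON line, keeping only the fields of interest."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        tweet = json.loads(line)
        yield {
            "tweet_id": tweet["id_str"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        }


# Abbreviated sample record followed by a blank line.
sample = (
    '{"id_str": "892420643555336193", "retweet_count": 7725, '
    '"favorite_count": 36274, "lang": "en"}\n\n'
)
rows = list(read_tweet_info(io.StringIO(sample)))
print(rows[0]["tweet_id"])  # 892420643555336193
```

The result can be turned into a DataFrame with `pd.DataFrame(list(read_tweet_info(file)))`, where `file` is the open `tweet_json.txt` handle.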


We now have three DataFrames: `df_archive` for tweet and rating information, `df_preds` for dog breed predictions, and `df_tweet_info` for additional tweet information. This concludes the gathering step.

## Assess

We now assess the above data to find quality and tidiness issues.

### Quality


### Tidiness
