# Analysis Of The WeRateDogs Twitter Archive

The WeRateDogs Twitter archive collects tweets that rate dogs and add a humorous comment on each dog rated. Many dog lovers have found this archive useful and fun. This project analyzes the dogs in the archive and provides insights into the data.

## Introduction

TBC

## Importing Useful Libraries

In [1]:
import pandas as pd 
import numpy as np
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer


## Gathering Data

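The data will be collected from three sources: the WeRateDogs Twitter archive, the tweet image predictions, and additional tweet data from the Twitter API. The resulting datasets will then be merged for consistency and data quality.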

### The Twitter archive of the WeRateDogs page

The Twitter archive was provided by the admins of the WeRateDogs Twitter page.

In [2]:
twitter_archive = pd.read_csv("twitter-archive-enhanced.csv")
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


### The tweet image predictions

A neural network was used to classify the image in each tweet, and the three most probable predictions were recorded for each image. These predictions have already been made and can be retrieved online using the requests library.

In [3]:
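import requests

url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
# Stop here if the download failed, rather than saving an error page to disk
response.raise_for_status()
# Write the raw bytes of the response to a local TSV file
with open("image_predictions.tsv", "wb") as file_handler:
    file_handler.write(response.content)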

Then I open the tab-separated file as a dataframe.

In [4]:
image_predictions = pd.read_csv("image_predictions.tsv", sep="\t")
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
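
The p1 to p3 columns hold the three predictions, each with a confidence (p*_conf) and a flag saying whether the label is a dog breed (p*_dog). As a rough sketch of how these columns could be used later, the snippet below picks the first prediction flagged as a dog for each row. It assumes p1 is the most confident prediction and p3 the least; the toy rows, the `top_dog_prediction` helper, and the `breed` column are illustrative and not part of the original data.

```python
import pandas as pd

# Toy rows shaped like image_predictions (values are illustrative only)
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["redbone", "seat_belt", "bath_towel"],
    "p1_conf": [0.51, 0.40, 0.60],
    "p1_dog": [True, False, False],
    "p2": ["collie", "pug", "paper_towel"],
    "p2_conf": [0.16, 0.30, 0.20],
    "p2_dog": [True, True, False],
    "p3": ["malinois", "teddy", "envelope"],
    "p3_conf": [0.06, 0.10, 0.05],
    "p3_dog": [True, False, False],
})


def top_dog_prediction(row):
    """Return the highest-ranked label flagged as a dog, or None if there is none."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"]
    return None


# Hypothetical summary column: one breed guess per tweet
sample["breed"] = sample.apply(top_dog_prediction, axis=1)
print(sample[["tweet_id", "breed"]])
```

Rows where none of the three predictions is a dog end up with a missing value, which may be worth keeping in mind when assessing the data.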


### Additional Twitter Data

I need to collect the retweet counts, like counts, and other data for each tweet from the Twitter account, using the tweet IDs. I will use the tweepy library to get this data.

In [5]:
import tweepy
# You need to insert your keys for the Twitter developer account here, 
# otherwise use the tweet_json.txt file I provided.

# consumer_key = ''
# consumer_secret = ''
# access_token = ''
# access_secret = ''

# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)

# api = tweepy.API(auth, wait_on_rate_limit=True)

In [6]:
# This cell only needs to run once, so I commented it out.

# # Tweet IDs for which to gather additional data via Twitter's API
# tweet_ids = twitter_archive.tweet_id.values
# print("The length of the tweet ids:", len(tweet_ids))

# # Query Twitter's API for JSON data for each tweet ID in the Twitter archive
# count = 0
# fails_dict = {} #collects all the failed tweet ids
# start = timer()
# # Save each tweet's returned JSON as a new line in a .txt file
# with open('tweet_json.txt', 'w') as outfile:
#     # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
#     for tweet_id in tweet_ids:
#         count += 1
#         print(str(count) + ": " + str(tweet_id))
#         try:
#             tweet = api.get_status(tweet_id, tweet_mode='extended')
#             print("Success")
#             json.dump(tweet._json, outfile)
#             outfile.write('\n')
#         except Exception as e:
#             print("Fail")
#             fails_dict[tweet_id] = e
#             pass
# end = timer()
# print("Gathering the data took", end - start, "seconds")
# print(fails_dict)
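
The gathering loop above follows a simple pattern: fetch one record per ID, write it as one JSON line, and remember the IDs that failed. A minimal sketch of that pattern as a reusable helper is shown below. The `gather_json_lines` function and its `fetch` parameter are hypothetical names introduced here; `fetch` stands in for a call such as `api.get_status(tweet_id, tweet_mode='extended')._json`, and the demo uses a fake fetcher so it runs without Twitter credentials.

```python
import json
import os
import tempfile


def gather_json_lines(ids, fetch, path):
    """Write one JSON object per line for each id and return a dict of failures."""
    failures = {}
    with open(path, "w") as outfile:
        for item_id in ids:
            try:
                record = fetch(item_id)
            except Exception as error:
                # Keep going, but remember which id failed and why
                failures[item_id] = error
                continue
            json.dump(record, outfile)
            outfile.write("\n")
    return failures


def fake_fetch(item_id):
    """Stand-in for the API call: id 2 simulates a deleted tweet."""
    if item_id == 2:
        raise ValueError("not found")
    return {"id": item_id}


demo_path = os.path.join(tempfile.mkdtemp(), "demo_json.txt")
demo_failures = gather_json_lines([1, 2, 3], fake_fetch, demo_path)
print("Failed ids:", list(demo_failures))
```

Passing the fetcher in as an argument keeps the file-writing logic testable on its own, independent of the rate-limited API.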

Let's read the text file to see what it contains.

In [7]:
tweets_list = []
with open("tweet_json.txt", "r") as file:
    for line in file:
        data = json.loads(line)
        tweets_list.append(data)

We'll now look at the keys available in each JSON record in the list by inspecting the first one.

In [8]:
tweets_list[0]

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'large': {'w': 

I will now put this into a dataframe, keeping only the keys I need: the tweet ID (id_str), retweet_count, and favorite_count, so I can harmonize them with the data I already have.

In [9]:
counts_tweet = pd.DataFrame(tweets_list, columns=['id_str', 'retweet_count', 'favorite_count'])
counts_tweet.head()

Unnamed: 0,id_str,retweet_count,favorite_count
0,892420643555336193,6877,32876
1,892177421306343426,5178,28402
2,891815181378084864,3421,21352
3,891689557279858688,7085,35849
4,891327558926688256,7593,34292


I now rename the id_str column to tweet_id to harmonize it with the other dataframes.

In [10]:
counts_tweet = counts_tweet.rename(columns={"id_str": "tweet_id"})
counts_tweet.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6877,32876
1,892177421306343426,5178,28402
2,891815181378084864,3421,21352
3,891689557279858688,7085,35849
4,891327558926688256,7593,34292
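
One detail to watch when harmonizing: id_str comes from the JSON as a string, while pandas normally reads a purely numeric CSV column such as tweet_id as a 64-bit integer. The sketch below shows one way the two keys could be brought to the same type before any join. The `archive_demo` and `counts_demo` frames are toy stand-ins built from the first two rows shown above, not the real dataframes, and converting to integers (rather than converting everything to strings) is just one possible choice.

```python
import pandas as pd

# Toy stand-ins: archive key parsed as integers, API key kept as strings
archive_demo = pd.DataFrame(
    {"tweet_id": [892420643555336193, 892177421306343426]}
)
counts_demo = pd.DataFrame({
    "tweet_id": ["892420643555336193", "892177421306343426"],
    "retweet_count": [6877, 5178],
    "favorite_count": [32876, 28402],
})

# Convert the string IDs to 64-bit integers so both keys compare equal
counts_demo["tweet_id"] = counts_demo["tweet_id"].astype("int64")

same_dtype = archive_demo["tweet_id"].dtype == counts_demo["tweet_id"].dtype
print("Key dtypes match:", same_dtype)
```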


The data gathering is finished. These are the dataframes produced in the data gathering phase:
1. twitter_archive: the data on rated dogs and their stages
2. image_predictions: the three image predictions for each tweet
3. counts_tweet: the retweet and favorite counts of each tweet
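
As a rough sketch of how three dataframes like these could eventually be combined on their shared tweet_id key, the snippet below chains two merges on toy data. The `*_toy` frames, the choice of an inner join, and the `validate="one_to_one"` check are assumptions made for illustration; the actual merge may need different settings once the data has been assessed and cleaned.

```python
import pandas as pd

# Toy frames sharing a tweet_id key (values are illustrative only)
archive_toy = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "rating_numerator": [13, 13, 12]}
)
predictions_toy = pd.DataFrame(
    {"tweet_id": [1, 2], "p1": ["redbone", "collie"]}
)
counts_toy = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "retweet_count": [10, 20, 30]}
)

# Inner joins keep only tweets present in every frame;
# validate raises an error if a key is duplicated on either side
merged = (
    archive_toy
    .merge(predictions_toy, on="tweet_id", how="inner", validate="one_to_one")
    .merge(counts_toy, on="tweet_id", how="inner", validate="one_to_one")
)
print(merged)
```

Here tweet 3 drops out because it has no image prediction, which shows how an inner join silently narrows the dataset to tweets found in all three sources.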


I will now assess these datasets and check for quality issues. 