### Introduction

This project examines the WeRateDogs tweet archive to draw insights about dog ratings. It follows the Gather, Assess and Clean process with the aim of answering the following questions:

- Which types of dog have the highest ratings?
- Which dog stage has the highest ratings?
- Which types of dog are the most popular in terms of retweet count and favorite count?

## Gather

In [1]:
import json
import numpy as np
import os
import pandas as pd
import requests
import tweepy
from timeit import default_timer as timer
from tweepy import OAuthHandler

In [2]:
# Load twitter archive file
twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# Download image-predictions file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [4]:
# Import image predictions file
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')

In [5]:
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive_enhanced.tweet_id.values
len(tweet_ids)

2356

In [7]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
if not os.path.exists('tweet_json.txt'):
    count = 0
    fails_dict = {}
    start = timer()
    # Save each tweet's returned JSON as a new line in a .txt file
    with open('tweet_json.txt', 'w') as outfile:
        # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
        for tweet_id in tweet_ids:
            count += 1
            print(str(count) + ": " + str(tweet_id))
            try:
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                print("Success")
                json.dump(tweet._json, outfile)
                outfile.write('\n')
            except tweepy.TweepError as e:
                # Record the failure and carry on with the next tweet ID
                print("Fail")
                fails_dict[tweet_id] = e
    end = timer()
    print(end - start)
    print(fails_dict)


In [8]:
# DELETE
# Temp - save fails_dict to file for working with
'''
import csv
fails_list = []
for key in fails_dict.keys():
    fails_list.append(key)
fails_list
with open('fails_dict.csv', 'w') as outfile:
    wr = csv.writer(outfile, dialect='excel')
    wr.writerow(fails_list)
'''


In [9]:
# DELETE
# Working - uncomment out if necessary
'''
failures = []
with open('fails_dict.csv', 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        failures.append(line)
'''


In [10]:
# Load extended tweet data
tweets = []
with open('tweet_json.txt') as json_file:
    for line in json_file:
        tweets.append(json.loads(line))

# Place extended tweet data into a dataframe
extended_data = pd.DataFrame(tweets)

## Assess

In [11]:
# View column types and missing data in twitter_archive_enhanced
twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [12]:
# DETERMINE IF TO KEEP
# Visually assess data in archive
twitter_archive_enhanced.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1075,739623569819336705,,,2016-06-06 01:02:55 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here's a doggo that don't need no human. 12/10...,,,,https://vine.co/v/iY9Fr1I31U6,12,10,,doggo,,,
78,877611172832227328,,,2017-06-21 19:36:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @rachel2195: @dog_rates the boyfriend and h...,8.768508e+17,512804507.0,2017-06-19 17:14:49 +0000,https://twitter.com/rachel2195/status/87685077...,14,10,,,,pupper,
2045,671528761649688577,,,2015-12-01 03:18:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He's in the middle of a serious conv...,,,,https://twitter.com/dog_rates/status/671528761...,10,10,Jax,,,,
1434,697270446429966336,,,2016-02-10 04:06:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bentley. He got stuck on his 3rd homew...,,,,https://twitter.com/dog_rates/status/697270446...,10,10,Bentley,,,,
1251,710997087345876993,,,2016-03-19 01:11:29 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Milo and Amos. They are the best of pals....,,,,https://twitter.com/dog_rates/status/710997087...,12,10,Milo,,,,


In [13]:
# Check for duplicated tweets
len(twitter_archive_enhanced[twitter_archive_enhanced.tweet_id.duplicated()])

0

In [14]:
# Check for inclusion of retweets
sum(~twitter_archive_enhanced.retweeted_status_id.isnull())

181

There are 181 tweets that are retweets and need to be removed.

In [15]:
# Check values in rating_denominator - should be 10
twitter_archive_enhanced.rating_denominator.value_counts().sort_index()

0         1
2         1
7         1
10     2333
11        3
15        1
16        1
20        2
40        1
50        3
70        1
80        2
90        1
110       1
120       1
130       1
150       1
170       1
Name: rating_denominator, dtype: int64

In [16]:
# Check for incorrect rating_denominator values
incorrect_denominators = twitter_archive_enhanced[twitter_archive_enhanced['rating_denominator'] != 10]
list(incorrect_denominators['tweet_id'])

[835246439529840640,
 832088576586297345,
 820690176645140481,
 810984652412424192,
 775096608509886464,
 758467244762497024,
 740373189193256964,
 731156023742988288,
 722974582966214656,
 716439118184652801,
 713900603437621249,
 710658690886586372,
 709198395643068416,
 704054845121142784,
 697463031882764288,
 686035780142297088,
 684225744407494656,
 684222868335505415,
 682962037429899265,
 682808988178739200,
 677716515794329600,
 675853064436391936,
 666287406224695296]

Some incorrect denominators were found:

- 835246439529840640 should be 13 numerator and 10 denominator
- 832088576586297345 does not contain a rating and should be dropped
- 810984652412424192 does not contain a rating and should be dropped
- 775096608509886464 is a retweet and will be deleted
- 740373189193256964 should be 14 numerator and 10 denominator
- 722974582966214656 should be 13 numerator and 10 denominator
- 716439118184652801 should be 11 numerator and 10 denominator
- 686035780142297088 does not contain a rating and should be dropped
- 682962037429899265 should be 10 numerator and 10 denominator
- 682808988178739200 does not contain a rating and should be dropped
- 666287406224695296 should be 9 numerator and 10 denominator

The remaining tweets in the list have accurate ratings even though their denominators are not 10.

In [17]:
# Examine data types and missing data in image_predictions
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [18]:
# DETERMINE IF TO KEEP
image_predictions.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1845,838921590096166913,https://pbs.twimg.com/media/C6Ryuf7UoAAFX4a.jpg,1,Border_terrier,0.664538,True,Brabancon_griffon,0.170451,True,Yorkshire_terrier,0.087824,True
1016,709901256215666688,https://pbs.twimg.com/media/CdoTbL_XIAAitq2.jpg,2,bib,0.998814,False,handkerchief,0.000512,False,umbrella,0.000224,False
827,693486665285931008,https://pbs.twimg.com/ext_tw_video_thumb/69348...,1,sea_lion,0.519811,False,Siamese_cat,0.290971,False,black-footed_ferret,0.039967,False
199,669661792646373376,https://pbs.twimg.com/media/CUsd2TfWwAAmdjb.jpg,1,weasel,0.262802,False,Siamese_cat,0.148263,False,hamster,0.116374,False
79,667453023279554560,https://pbs.twimg.com/media/CUNE_OSUwAAdHhX.jpg,1,Labrador_retriever,0.82567,True,French_bulldog,0.056639,True,Staffordshire_bullterrier,0.054018,True
283,671138694582165504,https://pbs.twimg.com/media/CVBdFahXAAAIe5Y.jpg,1,Samoyed,0.587342,True,Great_Pyrenees,0.268952,True,Pekinese,0.090527,True
1469,779056095788752897,https://pbs.twimg.com/media/Cs_DYr1XEAA54Pu.jpg,1,Chihuahua,0.721188,True,toy_terrier,0.112943,True,kelpie,0.053365,True
2029,882762694511734784,https://pbs.twimg.com/media/DEAz_HHXsAA-p_z.jpg,1,Labrador_retriever,0.85005,True,Chesapeake_Bay_retriever,0.074257,True,flat-coated_retriever,0.015579,True
1783,829011960981237760,https://pbs.twimg.com/media/C4E99ygWcAAQpPs.jpg,2,boxer,0.312221,True,dalmatian,0.24404,True,conch,0.130273,False
992,708109389455101952,https://pbs.twimg.com/media/CdO1u9vWAAApj2V.jpg,1,Staffordshire_bullterrier,0.516106,True,American_Staffordshire_terrier,0.236075,True,kelpie,0.06975,True


In [19]:
# Check column types and missing values in extended_data
extended_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 32 columns):
contributors                     0 non-null object
coordinates                      0 non-null object
created_at                       2339 non-null object
display_text_range               2339 non-null object
entities                         2339 non-null object
extended_entities                2065 non-null object
favorite_count                   2339 non-null int64
favorited                        2339 non-null bool
full_text                        2339 non-null object
geo                              0 non-null object
id                               2339 non-null int64
id_str                           2339 non-null object
in_reply_to_screen_name          77 non-null object
in_reply_to_status_id            77 non-null float64
in_reply_to_status_id_str        77 non-null object
in_reply_to_user_id              77 non-null float64
in_reply_to_user_id_str          77 non-null obj

In [20]:
# DETERMINE IF TO KEEP
extended_data.head(5)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,Tue Aug 01 16:23:56 +0000 2017,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37783,False,This is Phineas. He's a mystical boy. Only eve...,,...,,,,,8233,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,Tue Aug 01 00:17:27 +0000 2017,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32453,False,This is Tilly. She's just checking pup on you....,,...,,,,,6084,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,Mon Jul 31 00:18:03 +0000 2017,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24437,False,This is Archie. He is a rare Norwegian Pouncin...,,...,,,,,4027,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,Sun Jul 30 15:58:51 +0000 2017,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",41113,False,This is Darla. She commenced a snooze mid meal...,,...,,,,,8384,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,Sat Jul 29 16:00:24 +0000 2017,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",39326,False,This is Franklin. He would like you to stop ca...,,...,,,,,9089,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [21]:
# Find number of retweets in extended_data
len(extended_data[~extended_data['retweeted_status'].isnull()])

167

There are 167 tweets that are retweets and will need to be removed from the data.

In [22]:
# Check for duplication between twitter_archive_enhanced and image_predictions
all_columns = pd.Series(list(twitter_archive_enhanced) + list(image_predictions))
all_columns[all_columns.duplicated()]

17    tweet_id
dtype: object

Only the tweet_id column was duplicated, which we can use as the link between the two tables.

#### Quality
##### `twitter_archive_enhanced` table
- retweet_count and favorite_count are missing (they are available in `extended_data`)
- tweet_id is an int (should be str)
- presence of retweets (retweeted_status_id is non-null)
- rating_numerator and rating_denominator columns are ints (should be floats)
- several incorrect ratings to be corrected

##### `image_predictions` table
- tweet_id is an int (should be str)
- dog names in p1, p2, p3 columns inconsistent - upper and lowercase names, inclusion of underscores etc.
- predictions include non-dog objects which should be removed prior to analysis

##### `extended_data` table
- presence of retweets (retweeted_status is non-null)

#### Tidiness
- `twitter_archive_enhanced` contains multiple columns for the dog stage which should be one column with the dog stage as the variable
- `image_predictions` has multiple predictions per row

## Clean

In [23]:
# Make copies of each dataframe for cleaning
archive_clean = twitter_archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
extended_data_clean = extended_data.copy()

### Missing Data

#### `extended_data`: favorite_count and retweet_count need to be joined to `twitter_archive_enhanced`.

##### Define
- Drop the unwanted columns from `extended_data_clean` so that only id, favorite_count and retweet_count remain, and rename id to tweet_id
- Merge `extended_data_clean` with `archive_clean` on the tweet_id column, keeping only the rows (tweet_id) that are present in both dataframes

##### Code
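One way the steps defined above could look is sketched below. The two dataframes are small made-up stand-ins for `archive_clean` and `extended_data_clean`, and the sketch assumes the API data stores the tweet ID in a column named `id`, as the `extended_data.info()` output suggests.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (illustrative values only)
archive_clean = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'text': ['a 12/10', 'b 13/10', 'c 10/10'],
})
extended_data_clean = pd.DataFrame({
    'id': [1, 2, 4],
    'favorite_count': [100, 200, 400],
    'retweet_count': [10, 20, 40],
    'full_text': ['a 12/10', 'b 13/10', 'd 11/10'],
})

# Keep only the key and the two count columns, renaming the key to match
counts = (extended_data_clean[['id', 'favorite_count', 'retweet_count']]
          .rename(columns={'id': 'tweet_id'}))

# An inner join keeps only tweet IDs that are present in both tables
archive_clean = archive_clean.merge(counts, on='tweet_id', how='inner')
```

The inner join drops tweets that the API could not return, which matches the "present in both dataframes" requirement. Both key columns must share a dtype for the merge to match rows.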

##### Test

### Tidiness

#### `twitter_archive_enhanced` contains multiple columns for the dog stage.

##### Define
- Create a column in `twitter_archive_enhanced` called dog_stage that holds the concatenation of each value in the doggo, floofer, pupper and puppo columns.
- Extract the correct dog stage from the dog_stage column and update
- For entries with more than one dog stage, record these as 'mixed'

##### Code
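A minimal sketch of the steps above, run on made-up rows. It assumes that an absent stage is stored as the string 'None' (the `info()` output shows no nulls in these columns), and it counts the filled columns rather than parsing the concatenated string, which is a small simplification of the second step.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows: one doggo, one pupper, one with two stages, one with none
archive_clean = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

# Blank out the placeholder so concatenation leaves only real stage names
stages = archive_clean[stage_cols].replace('None', '').fillna('')
# Number of stage columns that are filled for each tweet
n_stages = (stages != '').sum(axis=1)

# Concatenate the four columns; with exactly one stage this is that stage
archive_clean['dog_stage'] = stages.apply(''.join, axis=1)
archive_clean.loc[n_stages > 1, 'dog_stage'] = 'mixed'
archive_clean.loc[n_stages == 0, 'dog_stage'] = np.nan

# The four original columns are now redundant
archive_clean = archive_clean.drop(columns=stage_cols)
```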

##### Test

#### `image_predictions` has multiple predictions per row.

##### Define
- Create separate dataframes from `image_predictions_clean` for p1, p2 and p3 data and keep only the columns for that probability (`p1_df`, `p2_df`, `p3_df`)
- For each new dataframe, create a column that holds the p_order value (e.g. 1 for p1)
- Rename p1 to p_type
- Repeat for p2 and p3
- Append `p2_df` to `p1_df`
- Append `p3_df` to `p1_df`
- Store `p1_df` as `image_predictions_clean`

##### Code
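The reshaping described above might be sketched as follows, using two made-up rows. The sketch stacks the three per-prediction frames with `pd.concat` instead of joining them one at a time, and the column name `p_conf` is an assumption chosen to sit alongside `p_type`, `p_dog` and `p_order`.

```python
import pandas as pd

# Toy stand-in for image_predictions_clean (illustrative values only)
image_predictions_clean = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'jpg_url': ['u1', 'u2'],
    'img_num': [1, 1],
    'p1': ['Samoyed', 'bib'],
    'p1_conf': [0.59, 0.99],
    'p1_dog': [True, False],
    'p2': ['Great_Pyrenees', 'handkerchief'],
    'p2_conf': [0.27, 0.0005],
    'p2_dog': [True, False],
    'p3': ['Pekinese', 'umbrella'],
    'p3_conf': [0.09, 0.0002],
    'p3_dog': [True, False],
})

id_cols = ['tweet_id', 'jpg_url', 'img_num']
frames = []
for order in (1, 2, 3):
    p = f'p{order}'
    # Keep the shared columns plus the three columns for this prediction
    df = image_predictions_clean[id_cols + [p, f'{p}_conf', f'{p}_dog']].copy()
    df = df.rename(columns={p: 'p_type',
                            f'{p}_conf': 'p_conf',
                            f'{p}_dog': 'p_dog'})
    df['p_order'] = order
    frames.append(df)

# Stack the three frames so each row holds exactly one prediction
image_predictions_clean = (pd.concat(frames, ignore_index=True)
                           .sort_values(['tweet_id', 'p_order'])
                           .reset_index(drop=True))
```

After this step each tweet contributes three rows, one per prediction, so later filters on `p_dog` or `p_order` work row by row.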

##### Test

### Quality

#### `twitter_archive_enhanced`: tweet_id is an int

##### Define
- Convert tweet_id column to str using .astype()

##### Code
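A short sketch of the conversion, using two tweet IDs taken from the sample output above. Keeping IDs as strings avoids accidental arithmetic on them and the precision loss seen when an ID is stored as a float (for example `8.768508e+17` in retweeted_status_id).

```python
import pandas as pd

# Two tweet IDs from the archive sample, held as int64
archive_clean = pd.DataFrame(
    {'tweet_id': [739623569819336705, 671528761649688577]})

# Convert the identifier column to strings
archive_clean['tweet_id'] = archive_clean['tweet_id'].astype(str)
```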

##### Test

#### `image_predictions`: tweet_id is an int

##### Define
- Convert tweet_id column to str using .astype()

##### Code

##### Test

#### `twitter_archive_enhanced`: presence of retweets

##### Define
- Remove rows where retweeted_status_id is not null

##### Code
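A minimal sketch of the filter on made-up rows: keep only the tweets whose retweeted_status_id is null.

```python
import numpy as np
import pandas as pd

# Toy rows: the second one is a retweet
archive_clean = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'retweeted_status_id': [np.nan, 8.768508e+17, np.nan],
})

# Keep original tweets only, i.e. rows with no retweeted_status_id
is_original = archive_clean['retweeted_status_id'].isnull()
archive_clean = archive_clean[is_original].reset_index(drop=True)
```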

##### Test

#### `extended_data`: presence of retweets

##### Define
- No separate step is needed: these retweets are removed from the merged `archive_clean` table when its retweet rows are dropped above.

##### Code

##### Test

#### rating_numerator and rating_denominator columns are ints

##### Define
- Convert rating_numerator and rating_denominator columns to float using .astype()

##### Code

##### Test

#### Incorrect ratings (numerators and denominators)

##### Define
- Write a regex to find the numerator and denominator from the text column
- Save as floats to numerator and denominator columns
- Test against some of the known incorrect values to see whether the new extraction improves on the original one

##### Code
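The sketch below is one possible extraction, not a tested solution for the full archive. The tweet texts are made up, `extract_rating` is a hypothetical helper, and the rule "prefer the last fraction out of 10, otherwise take the first fraction" is an assumption meant to skip dates and similar fractions that appear before the real rating.

```python
import re
import pandas as pd

# A fraction with an optional decimal numerator, e.g. 13/10 or 9.75/10
RATING_RE = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) as floats, or (nan, nan) if absent."""
    matches = RATING_RE.findall(text)
    if not matches:
        return (float('nan'), float('nan'))
    # Heuristic (assumption): prefer the last fraction whose denominator is 10
    out_of_ten = [m for m in matches if m[1] == '10']
    num, den = out_of_ten[-1] if out_of_ten else matches[0]
    return (float(num), float(den))

# Made-up texts covering the cases listed in the assessment
archive_clean = pd.DataFrame({'text': [
    'Happy 4/20 from the squad! 13/10 for all',
    'This pup is 9.75/10',
    'No rating here',
    'Group of 7 pups. 84/70',
]})

ratings = archive_clean['text'].apply(extract_rating)
archive_clean['rating_numerator'] = [r[0] for r in ratings]
archive_clean['rating_denominator'] = [r[1] for r in ratings]
```

Rows that come back as NaN are candidates for dropping, which corresponds to the tweets flagged earlier as not containing a rating.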

##### Test

#### Inconsistent format for dog names in p columns of `image_predictions`.

##### Define
- Replace '_' with ' ' in p_type column of `image_predictions_clean`
- Capitalise each word in p_type column of `image_predictions_clean`

##### Code
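A small sketch of the two string operations, applied to prediction names taken from the sample output above. It assumes the predictions have already been reshaped so that the names sit in a single p_type column.

```python
import pandas as pd

image_predictions_clean = pd.DataFrame({'p_type': [
    'Labrador_retriever', 'toy_terrier', 'Chesapeake_Bay_retriever']})

# Replace underscores with spaces, then capitalise each word
image_predictions_clean['p_type'] = (
    image_predictions_clean['p_type']
    .str.replace('_', ' ', regex=False)
    .str.title())
```

Note that `str.title()` also capitalises the letter after a hyphen, so a name such as `flat-coated_retriever` becomes `Flat-Coated Retriever`.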

##### Test

#### Predictions include non-dog objects in `image_predictions`.

##### Define
- Drop rows where p_dog == False in `image_predictions_clean`

##### Code
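A minimal sketch of the filter on made-up rows, assuming the long format with one prediction per row and a boolean p_dog column.

```python
import pandas as pd

# Toy long-format predictions: only the first row is a dog
image_predictions_clean = pd.DataFrame({
    'tweet_id': ['1', '1', '2', '2'],
    'p_order': [1, 2, 1, 2],
    'p_type': ['Boxer', 'Conch', 'Bib', 'Handkerchief'],
    'p_dog': [True, False, False, False],
})

# Boolean indexing keeps only the rows where p_dog is True
image_predictions_clean = (
    image_predictions_clean[image_predictions_clean['p_dog']]
    .reset_index(drop=True))
```

Tweets whose predictions are all non-dog objects (tweet 2 here) disappear from the table entirely, which is worth keeping in mind when merging with the archive later.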

##### Test

## Storage

(store each table as a csv file with main file called twitter_archive_master.csv)
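A sketch of how the master file could be written and read back. The one-row dataframe and the temporary directory are illustrative only; in the notebook the file would be written next to the other data files.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the cleaned archive
archive_clean = pd.DataFrame({
    'tweet_id': ['739623569819336705'],
    'rating_numerator': [12.0],
    'rating_denominator': [10.0],
})

# Temporary directory used here only to keep the sketch self-contained
out_dir = tempfile.mkdtemp()
master_path = os.path.join(out_dir, 'twitter_archive_master.csv')

# index=False avoids writing the row index as an extra unnamed column
archive_clean.to_csv(master_path, index=False)

# Read tweet_id back as str so it does not revert to an integer
restored = pd.read_csv(master_path, dtype={'tweet_id': str})
```

CSV does not store dtypes, so the `dtype` argument is needed on every reload to keep tweet_id as a string.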

## Analysis

(make at least 3 insights)