# Data Wrangling Project - WeRateDogs Twitter Archive

## Introduction
WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

![WeRateDogs Twitter](https://video.udacity-data.com/topher/2017/October/59dd378f_dog-rates-social/dog-rates-social.jpg)

### Project Steps Overview
With the main focus of this project being on data wrangling, it will be divided into the following steps:
1. Gathering data
2. Assessing data
3. Cleaning data
4. Storing data
5. Analyzing and visualizing data
6. Reporting
      - My data wrangling efforts
      - My data analyses and visualizations.

### Aim
The goal is to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

### The Data
In this project, I will work on the following three datasets.

##### Enhanced Twitter Archive
This data was provided by the Udacity team. The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which was used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, they filtered for tweets with ratings only (there are 2356).
![image.png](https://video.udacity-data.com/topher/2017/October/59dd4791_screenshot-2017-10-10-18.19.36/screenshot-2017-10-10-18.19.36.png)

They extracted this data programmatically, but they didn't do a very good job. The ratings probably aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. I'll need to assess and clean these columns if I want to use them for analysis and visualization.

##### Additional Data via the Twitter API
Back to the basic-ness of Twitter archives: ***retweet count*** and ***favorite count*** are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Since I have access to Twitter's API and the tweet IDs of the tweets in the Enhanced archive, I can gather the needed data for all 5000+ by querying Twitter's API.

##### Image Predictions File
Every image in the WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).
![imgpredict](https://video.udacity-data.com/topher/2017/October/59dd4d2c_screenshot-2017-10-10-18.43.41/screenshot-2017-10-10-18.43.41.png)

So for the last row in that table:
* `tweet_id` is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
* `p1` is the algorithm's #1 prediction for the image in the tweet → **golden retriever**
* `p1_conf` is how confident the algorithm is in its #1 prediction → **95%**
* `p1_dog` is whether or not the #1 prediction is a breed of dog → **TRUE**
* `p2` is the algorithm's second most likely prediction → **Labrador retriever**
* `p2_conf` is how confident the algorithm is in its #2 prediction → **1%**
* `p2_dog` is whether or not the #2 prediction is a breed of dog → **TRUE**
* etc.

## Gathering Data

In [2]:
#%pip install tweepy

In [1]:
import pandas as pd
import numpy as np
import requests
import json
import os

### Data downloaded manually
The WeRateDogs Twitter archive was given to us by hand, so let's read it into a pandas dataframe.

In [3]:
twitter_enhanced_df = pd.read_csv('twitter-archive-enhanced.csv')

In [4]:
twitter_enhanced_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


### Data from Web
Downloading the **tweet image predictions** programmatically using the `requests` library.

In [5]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)

In [15]:
with open("image_predictions.tsv",mode='wb') as file:
      file.write(response.content)

In [16]:
os.listdir()

['image_predictions.tsv',
 'twitter-archive-enhanced.csv',
 'WeRateDogs Twitter_Data Wrangling Project.ipynb']

In [18]:
img_predictions_df = pd.read_csv('image_predictions.tsv', sep='\t')

In [19]:
img_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### Data from API
Use Tweepy to query Twitter's API for the retweet count and favorite count for each tweet.

- Create API object
- Create empty list for containing dictionary with keys 'tweet_id' and 'json_string' or 'tweet_data'
- Create an empty list for storing errors
- Iterate through the tweet_id in the twitter_enhanced_df
- Use the try-except statement to catch errors
- Monitor each iterations time.
- Get status of tweet
- Convert json status to string so that it can be stored.
- Store the tweet_id and json_string in a dictionary
- Store dictionary in empty list
- Store errors in list.
- Work on the errors
- Store each json_string in a new line in `tweet_json.txt` file

In [71]:
import tweepy
import timeit

My API keys, secrets and tokens won't be included this report. You can get yours by signing up for the [Twitter Developer Account](https://developer.twitter.com)

In [27]:
api_key = "your api key"
api_key_secret = "your api key secret"
access_token = "your access token"
access_token_secret = "your access token secret"

Creating the API object that I'll use to gather Twitter data.

In [36]:
#auth = tweepy.OAuthHandler(api_key, api_key_secret)
#auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth=auth, wait_on_rate_limit=True)

Defining relevant functions

In [74]:
def tweet_status(tweetID):
      '''
      To get the status of a particular tweetID. Returns a dictioary of tweet ID and
      the json data in string format.
      '''
      # get status of the tweet
      status = api.get_status(tweetID, tweet_mode='extended')
      # get the json data in string
      tweet_data = json.dumps(status._json)
      
      return {'tweet_id': tweetID, 'tweet_data':tweet_data}

In [83]:
'''
tweets = [] # creating empty list for containing dictionary returned from tweet_status()
errorList = [] # list for containing the failed tweet IDs

i = 1 # counter
for id in twitter_enhanced_df.tweet_id:
      
      print("{}: {}".format(i, id))
      
      start = timeit.timeit() # get start time
      try:
            status = tweet_status(id) # get status for tweet ID
            tweets.append(status) # append return dictionary to tweet_json
      except Exception as e:
            # get the tweet ID and its particular error
            errorList.append({
                              'tweet_id': id,
                              'error': str(e)
                              })
      end = timeit.timeit() # get end time
      print("\truntime:",end - start)
      i += 1
'''

1: 892420643555336193
	runtime: 0.03284029999849736
2: 892177421306343426
	runtime: 0.016495300002134172
3: 891815181378084864
	runtime: 0.00876839999909862
4: 891689557279858688
	runtime: 0.007181000004493399
5: 891327558926688256
	runtime: 0.011067400002502836
6: 891087950875897856
	runtime: 0.005447499999718275
7: 890971913173991426
	runtime: 0.01020059999791556
8: 890729181411237888
	runtime: 0.014640300003520679
9: 890609185150312448
	runtime: 0.007319699998333817
10: 890240255349198849
	runtime: 0.009109599999646889
11: 890006608113172480
	runtime: -0.0013445999975374434
12: 889880896479866881
	runtime: 0.00751259999742615
13: 889665388333682689
	runtime: 0.012810500000341563
14: 889638837579907072
	runtime: 0.010224100002233172
15: 889531135344209921
	runtime: 0.009715500000311295
16: 889278841981685760
	runtime: 0.009460100005526328
17: 888917238123831296
	runtime: 0.016081899997516302
18: 888804989199671297
	runtime: 0.0034466999968572054
19: 888554962724278272
	runtime: 0.012

Rate limit reached. Sleeping for: 178


	runtime: 0.012303399998927489
891: 759566828574212096
	runtime: -0.0005553000009967946
892: 759557299618865152
	runtime: 0.004020400003355462
893: 759447681597108224
	runtime: 0.009658200000558281
894: 759446261539934208
	runtime: 0.015201199999864912
895: 759197388317847553
	runtime: -0.012521699998615077
896: 759159934323924993
	runtime: 0.0023095000033208635
897: 759099523532779520
	runtime: 0.0030828999988443684
898: 759047813560868866
	runtime: 0.008664800003316486
899: 758854675097526272
	runtime: 0.008317700001498451
900: 758828659922702336
	runtime: 0.005359300001146039
901: 758740312047005698
	runtime: 0.012296600001718616
902: 758474966123810816
	runtime: -0.005912199998419965
903: 758467244762497024
	runtime: 0.014805300001171418
904: 758405701903519748
	runtime: -0.005016799997974886
905: 758355060040593408
	runtime: 0.05482090000077733
906: 758099635764359168
	runtime: -0.002992199999425793
907: 758041019896193024
	runtime: 0.01646710000204621
908: 757741869644341248
	run

Rate limit reached. Sleeping for: 220


	runtime: 0.009938099999999395
1791: 677530072887205888
	runtime: 0.006335800000670133
1792: 677335745548390400
	runtime: 0.005997300002491102
1793: 677334615166730240
	runtime: 0.007152399997721659
1794: 677331501395156992
	runtime: 0.006478399998741224
1795: 677328882937298944
	runtime: 0.0038512000028276816
1796: 677314812125323265
	runtime: 0.009676499998022337
1797: 677301033169788928
	runtime: 0.005042700006015366
1798: 677269281705472000
	runtime: 0.007750399996439228
1799: 677228873407442944
	runtime: 0.013334999999642605
1800: 677187300187611136
	runtime: -0.0013901000020268839
1801: 676975532580409345
	runtime: 0.0058262999991711695
1802: 676957860086095872
	runtime: -0.0006466000013460871
1803: 676949632774234114
	runtime: 8.789999992586672e-05
1804: 676948236477857792
	runtime: 0.009919500000250991
1805: 676946864479084545
	runtime: -0.0027098999998997897
1806: 676942428000112642
	runtime: 0.00357570000414853
1807: 676936541936185344
	runtime: 0.020049400001880713
1808: 676

In [115]:
errorList.__len__()

29

In [119]:
pd.DataFrame(errorList)

Unnamed: 0,tweet_id,error
0,888202515573088257,404 Not Found\n144 - No status found with that...
1,873697596434513921,404 Not Found\n144 - No status found with that...
2,872668790621863937,404 Not Found\n144 - No status found with that...
3,872261713294495745,404 Not Found\n144 - No status found with that...
4,869988702071779329,404 Not Found\n144 - No status found with that...
5,866816280283807744,404 Not Found\n144 - No status found with that...
6,861769973181624320,404 Not Found\n144 - No status found with that...
7,856602993587888130,404 Not Found\n144 - No status found with that...
8,856330835276025856,404 Not Found\n144 - No status found with that...
9,851953902622658560,404 Not Found\n144 - No status found with that...


Looks like we couldn't get the tweet data for 29 out of the 2356 tweet IDs that we have. Further investigation suggests that these tweets were deleted (`404` error) or the user's account is private (`403` error).

Let's write the successful json tweet data into a file `tweet_json.txt`.

In [123]:
tweets[0]

{'tweet_id': 892420643555336193,
 'tweet_data': '{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": 892420643555336193, "id_str": "892420643555336193", "full_text": "This is Phineas. He\'s a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", "truncated": false, "display_text_range": [0, 85], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 892420639486877696, "id_str": "892420639486877696", "indices": [86, 109], "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "url": "https://t.co/MgUWQ76dJU", "display_url": "pic.twitter.com/MgUWQ76dJU", "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1", "type": "photo", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 540, "h": 528, "resize": "fit"}, "small": {"w": 540, "h": 528, "resize": "fit"}, "large": {"w": 540, "h": 528, 

In [140]:
# write only 'tweet_data' of each dictionary in `tweets` to a txt file
c = 0
for status in tweets:
      # creating file and storing data
      if c == 0:
            with open('tweet_json.txt', 'w') as file:
                  file.write(status['tweet_data'])
      else: # appending each json data as a new line
            with open('tweet_json.txt', 'a') as file:
                  file.write("\n"+status['tweet_data'])
      c = 1

Read `tweet_json.txt` line by line into a pandas DataFrame with `tweet ID`, `retweet count` and `favorite count`.

In [152]:
tweet_json = [] # creating empty list to store dictionary

with open('tweet_json.txt', 'r') as file:
      # iterating through each line in 'file'
      for line in file.readlines():
            # converting 'line' from string to dictionary type
            data = json.loads(line)
            # appending dictionary contain our values of interest to 'tweet_json'
            tweet_json.append({
                              'tweet_id': data['id'],
                              'retweet_count': data['retweet_count'],
                              'favorite_count': data['favorite_count']
                               })

In [154]:
tweet_json_df = pd.DataFrame(tweet_json)

In [155]:
tweet_json_df

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6969,33693
1,892177421306343426,5272,29218
2,891815181378084864,3464,21974
3,891689557279858688,7191,36785
4,891327558926688256,7715,35178
...,...,...,...
2322,666049248165822465,36,88
2323,666044226329800704,115,246
2324,666033412701032449,36,100
2325,666029285002620928,39,112


## Assessing Data
### Visual Assessment
* **Quality Issues**

`twitter_enhanced_df`
1.  Some names in `name` column don't have the correct name
2.  Some rows in  have a 'None' value for `doggo`, `floofer`, `pupper` and `puppo` columns.
3. `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` have NaN values.

`img_predictions_df`
1. For some Twitter IDs, `p1_dog`, `p2_dog` and `p3_dog` are all False indicating that there's no correct dog breed prediction for those IDs.
2. Names of dog breeds in `p1`, `p2` and `p3` are in lowercases.

`tweet_json_df`
1. 29 tweet IDs from `twitter_enhanced_df` don't have a record here.

In [161]:
twitter_enhanced_df.head(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [165]:
twitter_enhanced_df.tail()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


In [168]:
twitter_enhanced_df.sample(15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1855,675531475945709568,,,2015-12-12 04:23:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Ellie AKA Queen Slayer of the Orbs. Ve...,,,,https://twitter.com/dog_rates/status/675531475...,10,10,Ellie,,,,
2241,667915453470232577,,,2015-11-21 04:00:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Otis. He is a Peruvian Quartzite. Pic spo...,,,,https://twitter.com/dog_rates/status/667915453...,10,10,Otis,,,,
1287,708356463048204288,,,2016-03-11 18:18:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oliver. That is his castle. He protect...,,,,https://twitter.com/dog_rates/status/708356463...,10,10,Oliver,,,,
1791,677335745548390400,,,2015-12-17 03:53:20 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Downright inspiring 12/10 https://t.co/vSLtYBWHcQ,,,,https://vine.co/v/hbLbH77Ar67,12,10,,,,,
1715,680221482581123072,,,2015-12-25 03:00:14 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is CeCe. She's patiently waiting for Sant...,,,,https://twitter.com/dog_rates/status/680221482...,10,10,CeCe,,,,
819,770655142660169732,,,2016-08-30 16:11:18 +0000,"<a href=""http://twitter.com/download/iphone"" r...",We only rate dogs. Pls stop sending in non-can...,,,,https://twitter.com/dog_rates/status/770655142...,11,10,very,,,,
1895,674742531037511680,6.7474e+17,4196984000.0,2015-12-10 00:08:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Some clarification is required. The dog is sin...,,,,,11,10,,,,,
2198,668815180734689280,,,2015-11-23 15:35:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a wild Toblerone from Papua New Guinea...,,,,https://twitter.com/dog_rates/status/668815180...,7,10,a,,,,
632,793962221541933056,,,2016-11-02 23:45:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Maximus. His face is stuck like that. ...,,,,https://twitter.com/dog_rates/status/793962221...,12,10,Maximus,,,,
1604,685906723014619143,,,2016-01-09 19:31:20 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Olive. He's stuck in a sleeve. 9/10 da...,,,,https://twitter.com/dog_rates/status/685906723...,9,10,Olive,,,,


In [166]:
img_predictions_df.head(15)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [169]:
img_predictions_df.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


In [189]:
img_predictions_df.sample(15)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
274,670833812859932673,https://pbs.twimg.com/media/CU9HyzSWIAAVcte.jpg,1,Pekinese,0.609853,True,Persian_cat,0.265442,False,Japanese_spaniel,0.02746,True
521,676575501977128964,https://pbs.twimg.com/media/CWOt07EUsAAnOYW.jpg,1,feather_boa,0.424106,False,Yorkshire_terrier,0.073144,True,Shetland_sheepdog,0.057598,True
677,683481228088049664,https://pbs.twimg.com/media/CXw2jSpWMAAad6V.jpg,1,keeshond,0.508951,True,chow,0.442016,True,German_shepherd,0.013206,True
1777,828381636999917570,https://pbs.twimg.com/media/C38Asz1WEAAvzj3.jpg,1,Bedlington_terrier,0.392535,True,Labrador_retriever,0.089022,True,clumber,0.0818,True
1304,753294487569522689,https://pbs.twimg.com/media/CnQ9Vq1WEAEYP01.jpg,1,chow,0.194773,True,monitor,0.102305,False,Siberian_husky,0.086855,True
827,693486665285931008,https://pbs.twimg.com/ext_tw_video_thumb/69348...,1,sea_lion,0.519811,False,Siamese_cat,0.290971,False,black-footed_ferret,0.039967,False
945,704480331685040129,https://pbs.twimg.com/media/CcbRIAgXIAQaKHQ.jpg,1,Samoyed,0.979206,True,Pomeranian,0.007185,True,Arctic_fox,0.006438,False
289,671163268581498880,https://pbs.twimg.com/media/CVBzbWsWsAEyNMA.jpg,1,African_hunting_dog,0.733025,False,plow,0.119377,False,Scottish_deerhound,0.026983,True
304,671518598289059840,https://pbs.twimg.com/media/CVG2l9jUYAAwg-w.jpg,1,Lakeland_terrier,0.428275,True,wire-haired_fox_terrier,0.111472,True,toy_poodle,0.105016,True
1247,747600769478692864,https://pbs.twimg.com/media/CmAC7ehXEAAqSuW.jpg,1,Chesapeake_Bay_retriever,0.804363,True,Weimaraner,0.054431,True,Labrador_retriever,0.043268,True


In [190]:
tweet_json_df.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6969,33693
1,892177421306343426,5272,29218
2,891815181378084864,3464,21974
3,891689557279858688,7191,36785
4,891327558926688256,7715,35178


In [191]:
tweet_json_df.tail()

Unnamed: 0,tweet_id,retweet_count,favorite_count
2322,666049248165822465,36,88
2323,666044226329800704,115,246
2324,666033412701032449,36,100
2325,666029285002620928,39,112
2326,666020888022790149,419,2282


### Programmatic Assessment
**Quality Issues**

`twitter_enhanced_df`
1. 'None' values in `name`,`doggo`, `floofer`, `pupper` and `puppo` should be represents as `NaN`
2. 78 records are replies and are not of interest to us. We need only dog ratings
3. 181 records are retweets which are basically a repeat of an initial Twitter ID.
4. `timestamp` is an object.

`img_predictions_df`
1. 324 records in `img_predictions_df` are wrong. The actual pictures either show a different animal or the neural network couldn't detect the dog due to the dogs not properly represented in the photo.

**Tidiness Issues**
1. Join `tweet_json_df` to `twitter_enhanced_df` on `tweet_id`
2. Drop unneccesary (not useful in our viz) columns in `twitter_enhanced_df`: [1,2,6,7,8] 
3. Dog stages are separated into 4 columns (`doggo`, `floofer`, `pupper`, `puppo`) instead of just one [`twitter_enhanced_df`]

In [193]:
twitter_enhanced_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [205]:
twitter_enhanced_df[twitter_enhanced_df.in_reply_to_status_id.notnull()].head(15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,
149,863079547188785154,6.671522e+17,4196984000.0,2017-05-12 17:12:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Ladies and gentlemen... I found Pipsy. He may ...,,,,https://twitter.com/dog_rates/status/863079547...,14,10,,,,,
179,857214891891077121,8.571567e+17,180671000.0,2017-04-26 12:48:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Marc_IRL pixelated af 12/10,,,,,12,10,,,,,
184,856526610513747968,8.558181e+17,4196984000.0,2017-04-24 15:13:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...","THIS IS CHARLIE, MARK. HE DID JUST WANT TO SAY...",,,,https://twitter.com/dog_rates/status/856526610...,14,10,,,,,
186,856288084350160898,8.56286e+17,279281000.0,2017-04-23 23:26:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@xianmcguire @Jenna_Marbles Kardashians wouldn...,,,,,14,10,,,,,
188,855862651834028034,8.558616e+17,194351800.0,2017-04-22 19:15:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@dhmontgomery We also gave snoop dogg a 420/10...,,,,,420,10,,,,,


In [217]:
twitter_enhanced_df[twitter_enhanced_df.retweeted_status_id.notnull()].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4196984000.0,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4196984000.0,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,


In [218]:
twitter_enhanced_df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [220]:
twitter_enhanced_df[twitter_enhanced_df.rating_numerator > 100][['tweet_id','text','rating_numerator','rating_denominator','name']]

Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator,name
188,855862651834028034,@dhmontgomery We also gave snoop dogg a 420/10...,420,10,
189,855860136149123072,@s8n You tried very hard to portray this good ...,666,10,
290,838150277551247360,@markhoppus 182/10,182,10,
313,835246439529840640,@jonnysun @Lin_Manuel ok jomny I know you're e...,960,0,
902,758467244762497024,Why does this never happen at my front door......,165,150,
979,749981277374128128,This is Atticus. He's quite simply America af....,1776,10,Atticus
1120,731156023742988288,Say hello to this unbelievably well behaved sq...,204,170,this
1634,684225744407494656,"Two sneaky puppers were not initially seen, mo...",143,130,
1635,684222868335505415,Someone help the girl is being mugged. Several...,121,110,
1779,677716515794329600,IT'S PUPPERGEDDON. Total of 144/120 ...I think...,144,120,


Looks like the unusually high numerators and denominators are actually valid. Those are some good doggos 🤣

In [221]:
img_predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [222]:
img_predictions_df.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [249]:
img_predictions_df[(img_predictions_df.p1_dog + img_predictions_df.p2_dog + img_predictions_df.p3_dog) == 0]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,1,hen,0.965932,False,cock,0.033919,False,partridge,0.000052,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,1,desktop_computer,0.086502,False,desk,0.085547,False,bookcase,0.079480,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,1,three-toed_sloth,0.914671,False,otter,0.015250,False,great_grey_owl,0.013207,False
25,666362758909284353,https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg,1,guinea_pig,0.996496,False,skunk,0.002402,False,hamster,0.000461,False
...,...,...,...,...,...,...,...,...,...,...,...,...
2021,880935762899988482,https://pbs.twimg.com/media/DDm2Z5aXUAEDS2u.jpg,1,street_sign,0.251801,False,umbrella,0.115123,False,traffic_light,0.069534,False
2022,881268444196462592,https://pbs.twimg.com/media/DDrk-f9WAAI-WQv.jpg,1,tusker,0.473303,False,Indian_elephant,0.245646,False,ibex,0.055661,False
2046,886680336477933568,https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg,1,convertible,0.738995,False,sports_car,0.139952,False,car_wheel,0.044173,False
2052,887517139158093824,https://pbs.twimg.com/ext_tw_video_thumb/88751...,1,limousine,0.130432,False,tow_truck,0.029175,False,shopping_cart,0.026321,False


In [223]:
tweet_json_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2327 non-null   int64
 1   retweet_count   2327 non-null   int64
 2   favorite_count  2327 non-null   int64
dtypes: int64(3)
memory usage: 54.7 KB


In [224]:
tweet_json_df.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2327.0,2327.0,2327.0
mean,7.41793e+17,2458.667383,7026.269875
std,6.820795e+16,4163.681864,10919.212757
min,6.660209e+17,1.0,0.0
25%,6.781394e+17,492.5,1220.5
50%,7.178418e+17,1144.0,3037.0
75%,7.986547e+17,2844.5,8565.0
max,8.924206e+17,70330.0,144246.0


### Issues Found

#### Quality Issues

**`twitter_enhanced_df`**
1.  Some names in `name` column don't have the correct name
2.  Some rows in  have a 'None' value for `doggo`, `floofer`, `pupper` and `puppo` columns.
1. 'None' values in `name`,`doggo`, `floofer`, `pupper` and `puppo` should be represents as `NaN`
3. `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` have NaN values.
2. 78 records are replies and are not of interest to us. We need only dog ratings
3. 181 records are retweets which are basically a repeat of an initial Twitter ID.
4. `timestamp` is an object type.

**`img_predictions_df`**
1. For some Twitter IDs, `p1_dog`, `p2_dog` and `p3_dog` are all False indicating that there's no correct dog breed prediction for those IDs.
1. 324 records in `img_predictions_df` are wrong. The actual pictures either show a different animal or the neural network couldn't detect the dog due to the dogs not properly represented in the photo.
2. Names of dog breeds in `p1`, `p2` and `p3` are in lowercases.

**`tweet_json_df`**
1. 29 tweet IDs from `twitter_enhanced_df` don't have a record here.


#### Tidiness Issues
1. Join **`tweet_json_df`** to **`twitter_enhanced_df`** on `tweet_id`
1. **`img_predictions_df`** have 3 dog breeds. Only one with the %confidence is needed.
1. Join **`img_predictions_df`** to **`twitter_enhanced_df`** on `tweet_id`.
2. Drop unneccesary (not useful in our viz) columns in **`twitter_enhanced_df`**: [1,2,6,7,8] 
3. Dog stages are separated into 4 columns (`doggo`, `floofer`, `pupper`, `puppo`) instead of just one [**`twitter_enhanced_df`**]

## Cleaning Data