# Introduction
Real-world data rarely comes clean. Using Python and its libraries, we will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. This process is called data wrangling. All wrangling efforts are documented in a Jupyter Notebook and then showcased through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that we will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

# Data Wrangling

## Data Gathering
First we need to gather our data. For this project we will be using:
* *The WeRateDogs Twitter archive*, a CSV file (not included in the repository)
* *The tweet image predictions*, produced by an image prediction algorithm and hosted on Udacity's servers (`image-predictions.tsv`)
* Each tweet's retweet count and favorite ("like") count, retrieved with Python's **Tweepy** library. This data will be stored in a file called `tweet_json.txt` (this file is also not included in the repository)

Load the data from the CSV file `twitter-archive-enhanced.csv`.

In [1]:
# import all necessary libraries
import numpy as np
import pandas as pd
import requests
import tweepy
import os
import json
from timeit import default_timer as timer
import time
import re

In [2]:
# read the csv file into a pandas DataFrame 
df_main = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# test if the csv file was loaded correctly
df_main.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


Download the image predictions file from the Udacity server using the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [10]:
# since we are saving everything in the same directory, there is no need to create a new folder

# store the url in a variable
url_algo = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# ask the Udacity server to send back the file
response = requests.get(url_algo)
# stop here if the download failed instead of saving an error page
response.raise_for_status()

# save the file
with open(url_algo.split('/')[-1], mode='wb') as file:
    file.write(response.content)
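
The file name used above is the last segment of the URL. As a small sketch of the same idea, the standard library can split the URL first, so that a query string or fragment would not end up in the file name. The helper name `filename_from_url` is hypothetical and not part of the notebook.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    # urlsplit separates the path from '?query' and '#fragment' parts
    return posixpath.basename(urlsplit(url).path)


url_algo = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'
            '599fd2ad_image-predictions/image-predictions.tsv')
print(filename_from_url(url_algo))
```

For the Udacity URL this gives the same name as `url_algo.split('/')[-1]`; the two only differ when the URL carries extra parts after the path.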

In [4]:
# test that the file was downloaded
os.listdir()

['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'api_keys.txt',
 'image-predictions.tsv',
 'README.md',
 'tweet_json.txt',
 'twitter-archive-enhanced.csv',
 'wrangle_act.ipynb']

In [5]:
# create a dataframe out of the tsv file
df_pred = pd.read_csv('image-predictions.tsv', sep='\t')

In [6]:
# check the dataframe was created correctly
df_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


Please remember that none of the data files are uploaded to the GitHub repository. Next we will **create the Tweepy API object used to query Twitter**.

In [14]:
consumer_key = '****************'
consumer_secret = '*************'
access_token = '****************'
access_secret = '***************'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [18]:
tweet1 = api.get_status(id='892420643555336193')

In [22]:
print(tweet1.favorite_count)

35866


I have already created a text file in this directory. We will store each tweet's JSON data in it as one line of text.

Next we will query the Twitter API for the JSON data of each tweet ID.

In [7]:
# Create a list of all IDs that we will be querying
tweet_ids = df_main.tweet_id.tolist()

# check that the list was created properly
len(tweet_ids)

2356

Since the Twitter API enforces a rate limit, we add a `time.sleep` call to the loop so that we stay within the allowed number of requests.

In [48]:
# define parameters for the loop
count = 0
fails_dict = {}
start = timer()

# save each tweet's JSON data as a new line in the text file
with open('tweet_json.txt', 'w') as outfile:
    for tweet_id in tweet_ids:
        count += 1
        # pause for 15 minutes before every 850th request to respect the rate limit
        if count % 850 == 0:
            time.sleep(900)
        print('current tweet_id "{}" is the number: {}'.format(tweet_id, count))
        try:
            tweet = api.get_status(id=tweet_id, tweet_mode='extended')
            print('Success!')
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        # TweepError is the Tweepy 3.x exception class (renamed TweepyException in Tweepy 4)
        except tweepy.TweepError as e:
            print('Fail...')
            # remember which tweet failed and why, then continue with the next ID
            fails_dict[tweet_id] = e

# at the end of the loop show the parameters
end = timer()
print(end - start)
print(fails_dict)

current tweet_id "892420643555336193" is the number: 1
Success!
current tweet_id "892177421306343426" is the number: 2
Success!
current tweet_id "891815181378084864" is the number: 3
Success!
current tweet_id "891689557279858688" is the number: 4
Success!
current tweet_id "891327558926688256" is the number: 5
Success!
current tweet_id "891087950875897856" is the number: 6
Success!
current tweet_id "890971913173991426" is the number: 7
Success!
current tweet_id "890729181411237888" is the number: 8
Success!
current tweet_id "890609185150312448" is the number: 9
Success!
current tweet_id "890240255349198849" is the number: 10
Success!
current tweet_id "890006608113172480" is the number: 11
Success!
current tweet_id "889880896479866881" is the number: 12
Success!
current tweet_id "889665388333682689" is the number: 13
Success!
current tweet_id "889638837579907072" is the number: 14
Success!
current tweet_id "889531135344209921" is the number: 15
Success!
...

Success!
current tweet_id "868552278524837888" is the number: 127
Success!
current tweet_id "867900495410671616" is the number: 128
Success!
current tweet_id "867774946302451713" is the number: 129
Success!
current tweet_id "867421006826221569" is the number: 130
Success!
current tweet_id "867072653475098625" is the number: 131
Success!
current tweet_id "867051520902168576" is the number: 132
Success!
current tweet_id "866816280283807744" is the number: 133
Fail...
current tweet_id "866720684873056260" is the number: 134
Success!
current tweet_id "866686824827068416" is the number: 135
Success!
current tweet_id "866450705531457537" is the number: 136
Success!
current tweet_id "866334964761202691" is the number: 137
Success!
current tweet_id "866094527597207552" is the number: 138
Success!
current tweet_id "865718153858494464" is the number: 139
Success!
current tweet_id "865359393868664832" is the number: 140
Success!
current tweet_id "865006731092295680" is the number: 141
Success!
...

Success!
current tweet_id "844979544864018432" is the number: 252
Success!
current tweet_id "844973813909606400" is the number: 253
Success!
current tweet_id "844704788403113984" is the number: 254
Fail...
current tweet_id "844580511645339650" is the number: 255
Success!
current tweet_id "844223788422217728" is the number: 256
Success!
current tweet_id "843981021012017153" is the number: 257
Success!
current tweet_id "843856843873095681" is the number: 258
Success!
current tweet_id "843604394117681152" is the number: 259
Success!
current tweet_id "843235543001513987" is the number: 260
Success!
current tweet_id "842892208864923648" is the number: 261
Fail...
current tweet_id "842846295480000512" is the number: 262
Success!
current tweet_id "842765311967449089" is the number: 263
Success!
current tweet_id "842535590457499648" is the number: 264
Success!
current tweet_id "842163532590374912" is the number: 265
Success!
current tweet_id "842115215311396866" is the number: 266
Success!
...

Success!
current tweet_id "828046555563323392" is the number: 377
Success!
current tweet_id "828011680017821696" is the number: 378
Success!
current tweet_id "827933404142436356" is the number: 379
Success!
current tweet_id "827653905312006145" is the number: 380
Success!
current tweet_id "827600520311402496" is the number: 381
Success!
current tweet_id "827324948884643840" is the number: 382
Success!
current tweet_id "827228250799742977" is the number: 383
Fail...
current tweet_id "827199976799354881" is the number: 384
Success!
current tweet_id "826958653328592898" is the number: 385
Success!
current tweet_id "826848821049180160" is the number: 386
Success!
current tweet_id "826615380357632002" is the number: 387
Success!
current tweet_id "826598799820865537" is the number: 388
Success!
current tweet_id "826598365270007810" is the number: 389
Success!
current tweet_id "826476773533745153" is the number: 390
Success!
current tweet_id "826240494070030336" is the number: 391
Success!
...

Success!
current tweet_id "813096984823349248" is the number: 502
Success!
current tweet_id "813081950185472002" is the number: 503
Success!
current tweet_id "813066809284972545" is the number: 504
Success!
current tweet_id "813051746834595840" is the number: 505
Success!
current tweet_id "812781120811126785" is the number: 506
Success!
current tweet_id "812747805718642688" is the number: 507
Fail...
current tweet_id "812709060537683968" is the number: 508
Success!
current tweet_id "812503143955202048" is the number: 509
Success!
current tweet_id "812466873996607488" is the number: 510
Success!
current tweet_id "812372279581671427" is the number: 511
Success!
current tweet_id "811985624773361665" is the number: 512
Success!
current tweet_id "811744202451197953" is the number: 513
Success!
current tweet_id "811647686436880384" is the number: 514
Success!
current tweet_id "811627233043480576" is the number: 515
Success!
current tweet_id "811386762094317568" is the number: 516
Success!
...

Success!
current tweet_id "795076730285391872" is the number: 627
Success!
current tweet_id "794983741416415232" is the number: 628
Success!
current tweet_id "794926597468000259" is the number: 629
Success!
current tweet_id "794355576146903043" is the number: 630
Success!
current tweet_id "794332329137291264" is the number: 631
Success!
current tweet_id "794205286408003585" is the number: 632
Success!
current tweet_id "793962221541933056" is the number: 633
Success!
current tweet_id "793845145112371200" is the number: 634
Success!
current tweet_id "793614319594401792" is the number: 635
Success!
current tweet_id "793601777308463104" is the number: 636
Success!
current tweet_id "793500921481273345" is the number: 637
Success!
current tweet_id "793286476301799424" is the number: 638
Success!
current tweet_id "793271401113350145" is the number: 639
Success!
current tweet_id "793256262322548741" is the number: 640
Success!
current tweet_id "793241302385262592" is the number: 641
Success!
...

Fail...
current tweet_id "779056095788752897" is the number: 752
Success!
current tweet_id "778990705243029504" is the number: 753
Success!
current tweet_id "778774459159379968" is the number: 754
Success!
current tweet_id "778764940568104960" is the number: 755
Success!
current tweet_id "778748913645780993" is the number: 756
Success!
current tweet_id "778650543019483137" is the number: 757
Success!
current tweet_id "778624900596654080" is the number: 758
Success!
current tweet_id "778408200802557953" is the number: 759
Success!
current tweet_id "778396591732486144" is the number: 760
Success!
current tweet_id "778383385161035776" is the number: 761
Success!
current tweet_id "778286810187399168" is the number: 762
Success!
current tweet_id "778039087836069888" is the number: 763
Success!
current tweet_id "778027034220126208" is the number: 764
Success!
current tweet_id "777953400541634568" is the number: 765
Success!
current tweet_id "777885040357281792" is the number: 766
Success!
...

Success!
current tweet_id "761004547850530816" is the number: 877
Success!
current tweet_id "760893934457552897" is the number: 878
Success!
current tweet_id "760656994973933572" is the number: 879
Success!
current tweet_id "760641137271070720" is the number: 880
Success!
current tweet_id "760539183865880579" is the number: 881
Success!
current tweet_id "760521673607086080" is the number: 882
Success!
current tweet_id "760290219849637889" is the number: 883
Success!
current tweet_id "760252756032651264" is the number: 884
Success!
current tweet_id "760190180481531904" is the number: 885
Success!
current tweet_id "760153949710192640" is the number: 886
Success!
current tweet_id "759943073749200896" is the number: 887
Success!
current tweet_id "759923798737051648" is the number: 888
Success!
current tweet_id "759846353224826880" is the number: 889
Success!
current tweet_id "759793422261743616" is the number: 890
Success!
current tweet_id "759566828574212096" is the number: 891
Fail...
...

Success!
current tweet_id "747933425676525569" is the number: 1002
Success!
current tweet_id "747885874273214464" is the number: 1003
Success!
current tweet_id "747844099428986880" is the number: 1004
Success!
current tweet_id "747816857231626240" is the number: 1005
Success!
current tweet_id "747651430853525504" is the number: 1006
Success!
current tweet_id "747648653817413632" is the number: 1007
Success!
current tweet_id "747600769478692864" is the number: 1008
Success!
current tweet_id "747594051852075008" is the number: 1009
Success!
current tweet_id "747512671126323200" is the number: 1010
Success!
current tweet_id "747461612269887489" is the number: 1011
Success!
current tweet_id "747439450712596480" is the number: 1012
Success!
current tweet_id "747242308580548608" is the number: 1013
Success!
current tweet_id "747219827526344708" is the number: 1014
Success!
current tweet_id "747204161125646336" is the number: 1015
Success!
current tweet_id "747103485104099331" is the number: ...

Success!
current tweet_id "730211855403241472" is the number: 1125
Success!
current tweet_id "730196704625098752" is the number: 1126
Success!
current tweet_id "729854734790754305" is the number: 1127
Success!
current tweet_id "729838605770891264" is the number: 1128
Success!
current tweet_id "729823566028484608" is the number: 1129
Success!
current tweet_id "729463711119904772" is the number: 1130
Success!
current tweet_id "729113531270991872" is the number: 1131
Success!
current tweet_id "728986383096946689" is the number: 1132
Success!
current tweet_id "728760639972315136" is the number: 1133
Success!
current tweet_id "728751179681943552" is the number: 1134
Success!
current tweet_id "728653952833728512" is the number: 1135
Success!
current tweet_id "728409960103686147" is the number: 1136
Success!
current tweet_id "728387165835677696" is the number: 1137
Success!
current tweet_id "728046963732717569" is the number: 1138
Success!
current tweet_id "728035342121635841" is the number: ...

Success!
current tweet_id "711652651650457602" is the number: 1248
Success!
current tweet_id "711363825979756544" is the number: 1249
Success!
current tweet_id "711306686208872448" is the number: 1250
Success!
current tweet_id "711008018775851008" is the number: 1251
Success!
current tweet_id "710997087345876993" is the number: 1252
Success!
current tweet_id "710844581445812225" is the number: 1253
Success!
current tweet_id "710833117892898816" is the number: 1254
Success!
current tweet_id "710658690886586372" is the number: 1255
Success!
current tweet_id "710609963652087808" is the number: 1256
Success!
current tweet_id "710588934686908417" is the number: 1257
Success!
current tweet_id "710296729921429505" is the number: 1258
Success!
current tweet_id "710283270106132480" is the number: 1259
Success!
current tweet_id "710272297844797440" is the number: 1260
Success!
current tweet_id "710269109699739648" is the number: 1261
Success!
current tweet_id "710153181850935296" is the number: ...

Success!
current tweet_id "702321140488925184" is the number: 1371
Success!
current tweet_id "702276748847800320" is the number: 1372
Success!
current tweet_id "702217446468493312" is the number: 1373
Success!
current tweet_id "701981390485725185" is the number: 1374
Success!
current tweet_id "701952816642965504" is the number: 1375
Success!
current tweet_id "701889187134500865" is the number: 1376
Success!
current tweet_id "701805642395348998" is the number: 1377
Success!
current tweet_id "701601587219795968" is the number: 1378
Success!
current tweet_id "701570477911896070" is the number: 1379
Success!
current tweet_id "701545186879471618" is the number: 1380
Success!
current tweet_id "701214700881756160" is the number: 1381
Success!
current tweet_id "700890391244103680" is the number: 1382
Success!
current tweet_id "700864154249383937" is the number: 1383
Success!
current tweet_id "700847567345688576" is the number: 1384
Success!
current tweet_id "700796979434098688" is the number: ...

Success!
current tweet_id "692752401762250755" is the number: 1494
Success!
current tweet_id "692568918515392513" is the number: 1495
Success!
current tweet_id "692535307825213440" is the number: 1496
Success!
current tweet_id "692530551048294401" is the number: 1497
Success!
current tweet_id "692423280028966913" is the number: 1498
Success!
current tweet_id "692417313023332352" is the number: 1499
Success!
current tweet_id "692187005137076224" is the number: 1500
Success!
current tweet_id "692158366030913536" is the number: 1501
Success!
current tweet_id "692142790915014657" is the number: 1502
Success!
current tweet_id "692041934689402880" is the number: 1503
Success!
current tweet_id "692017291282812928" is the number: 1504
Success!
current tweet_id "691820333922455552" is the number: 1505
Success!
current tweet_id "691793053716221953" is the number: 1506
Success!
current tweet_id "691756958957883396" is the number: 1507
Success!
current tweet_id "691675652215414786" is the number: ...

Success!
current tweet_id "685198997565345792" is the number: 1617
Success!
current tweet_id "685169283572338688" is the number: 1618
Success!
current tweet_id "684969860808454144" is the number: 1619
Success!
current tweet_id "684959798585110529" is the number: 1620
Success!
current tweet_id "684940049151070208" is the number: 1621
Success!
current tweet_id "684926975086034944" is the number: 1622
Success!
current tweet_id "684914660081053696" is the number: 1623
Success!
current tweet_id "684902183876321280" is the number: 1624
Success!
current tweet_id "684880619965411328" is the number: 1625
Success!
current tweet_id "684830982659280897" is the number: 1626
Success!
current tweet_id "684800227459624960" is the number: 1627
Success!
current tweet_id "684594889858887680" is the number: 1628
Success!
current tweet_id "684588130326986752" is the number: 1629
Success!
current tweet_id "684567543613382656" is the number: 1630
Success!
current tweet_id "684538444857667585" is the number: ...

Success!
current tweet_id "679511351870550016" is the number: 1740
Success!
current tweet_id "679503373272485890" is the number: 1741
Success!
current tweet_id "679475951516934144" is the number: 1742
Success!
current tweet_id "679462823135686656" is the number: 1743
Success!
current tweet_id "679405845277462528" is the number: 1744
Success!
current tweet_id "679158373988876288" is the number: 1745
Success!
current tweet_id "679148763231985668" is the number: 1746
Success!
current tweet_id "679132435750195208" is the number: 1747
Success!
current tweet_id "679111216690831360" is the number: 1748
Success!
current tweet_id "679062614270468097" is the number: 1749
Success!
current tweet_id "679047485189439488" is the number: 1750
Success!
current tweet_id "679001094530465792" is the number: 1751
Success!
current tweet_id "678991772295516161" is the number: 1752
Success!
current tweet_id "678969228704284672" is the number: 1753
Success!
current tweet_id "678800283649069056" is the number: ...

Success!
current tweet_id "675432746517426176" is the number: 1863
Success!
current tweet_id "675372240448454658" is the number: 1864
Success!
current tweet_id "675362609739206656" is the number: 1865
Success!
current tweet_id "675354435921575936" is the number: 1866
Success!
current tweet_id "675349384339542016" is the number: 1867
Success!
current tweet_id "675334060156301312" is the number: 1868
Success!
current tweet_id "675166823650848770" is the number: 1869
Success!
current tweet_id "675153376133427200" is the number: 1870
Success!
current tweet_id "675149409102012420" is the number: 1871
Success!
current tweet_id "675147105808306176" is the number: 1872
Success!
current tweet_id "675146535592706048" is the number: 1873
Success!
current tweet_id "675145476954566656" is the number: 1874
Success!
current tweet_id "675135153782571009" is the number: 1875
Success!
current tweet_id "675113801096802304" is the number: 1876
Success!
current tweet_id "675111688094527488" is the number: ...

Success!
current tweet_id "672898206762672129" is the number: 1986
Success!
current tweet_id "672884426393653248" is the number: 1987
Success!
current tweet_id "672877615439593473" is the number: 1988
Success!
current tweet_id "672834301050937345" is the number: 1989
Success!
current tweet_id "672828477930868736" is the number: 1990
Success!
current tweet_id "672640509974827008" is the number: 1991
Success!
current tweet_id "672622327801233409" is the number: 1992
Success!
current tweet_id "672614745925664768" is the number: 1993
Success!
current tweet_id "672609152938721280" is the number: 1994
Success!
current tweet_id "672604026190569472" is the number: 1995
Success!
current tweet_id "672594978741354496" is the number: 1996
Success!
current tweet_id "672591762242805761" is the number: 1997
Success!
current tweet_id "672591271085670400" is the number: 1998
Success!
current tweet_id "672538107540070400" is the number: 1999
Success!
current tweet_id "672523490734551040" is the number: ...

Success!
current tweet_id "670452855871037440" is the number: 2109
Success!
current tweet_id "670449342516494336" is the number: 2110
Success!
current tweet_id "670444955656130560" is the number: 2111
Success!
current tweet_id "670442337873600512" is the number: 2112
Success!
current tweet_id "670435821946826752" is the number: 2113
Success!
current tweet_id "670434127938719744" is the number: 2114
Success!
current tweet_id "670433248821026816" is the number: 2115
Success!
current tweet_id "670428280563085312" is the number: 2116
Success!
current tweet_id "670427002554466305" is the number: 2117
Success!
current tweet_id "670421925039075328" is the number: 2118
Success!
current tweet_id "670420569653809152" is the number: 2119
Success!
current tweet_id "670417414769758208" is the number: 2120
Success!
current tweet_id "670411370698022913" is the number: 2121
Success!
current tweet_id "670408998013820928" is the number: 2122
Success!
current tweet_id "670403879788544000" is the number: ...

Success!
current tweet_id "668226093875376128" is the number: 2232
Success!
current tweet_id "668221241640230912" is the number: 2233
Success!
current tweet_id "668204964695683073" is the number: 2234
Success!
current tweet_id "668190681446379520" is the number: 2235
Success!
current tweet_id "668171859951755264" is the number: 2236
Success!
current tweet_id "668154635664932864" is the number: 2237
Success!
current tweet_id "668142349051129856" is the number: 2238
Success!
current tweet_id "668113020489474048" is the number: 2239
Success!
current tweet_id "667937095915278337" is the number: 2240
Success!
current tweet_id "667924896115245057" is the number: 2241
Success!
current tweet_id "667915453470232577" is the number: 2242
Success!
current tweet_id "667911425562669056" is the number: 2243
Success!
current tweet_id "667902449697558528" is the number: 2244
Success!
current tweet_id "667886921285246976" is the number: 2245
Success!
current tweet_id "667885044254572545" is the number: ...

Success!
current tweet_id "666029285002620928" is the number: 2355
Success!
current tweet_id "666020888022790149" is the number: 2356
Success!
2357.9549000999996
{888202515573088257: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 873697596434513921: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 872668790621863937: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 872261713294495745: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 869988702071779329: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 866816280283807744: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 861769973181624320: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 856602993587888130: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 851953902622658560: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), ...

In [53]:
# check how many tweets could not be downloaded
len(fails_dict.keys())

25
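
The pacing used in the download loop (a pause before every 850th request, with failures collected in a dictionary) can also be sketched as a reusable helper. This is only an illustration under assumptions: `fetch_with_pauses`, `fake_fetch` and their parameters are hypothetical names, and the fake fetch function stands in for `api.get_status` so that the example runs without Twitter credentials.

```python
import time


def fetch_with_pauses(ids, fetch, batch_size=850, pause_seconds=900, sleep=time.sleep):
    """Call fetch(item_id) for every ID, pausing before every batch_size-th request.

    Returns (results, failures): results maps ID -> fetched value,
    failures maps ID -> the exception raised for that ID.
    """
    results, failures = {}, {}
    for count, item_id in enumerate(ids, start=1):
        # pause before request number batch_size, 2 * batch_size, ...
        if count % batch_size == 0:
            sleep(pause_seconds)
        try:
            results[item_id] = fetch(item_id)
        except Exception as error:  # the notebook catches tweepy.TweepError here
            failures[item_id] = error
    return results, failures


def fake_fetch(item_id):
    """Hypothetical stand-in for api.get_status: every fourth ID is 'deleted'."""
    if item_id % 4 == 0:
        raise LookupError('No status found with that ID.')
    return {'id': item_id}


# record the requested pauses in a list instead of really sleeping
pauses = []
results, failures = fetch_with_pauses(range(1, 11), fake_fetch,
                                      batch_size=3, sleep=pauses.append)
print(len(results), sorted(failures), pauses)
```

Passing the sleep function as a parameter keeps the helper testable without waiting. Tweepy can also do the waiting itself: creating the client with `tweepy.API(auth, wait_on_rate_limit=True)` makes it sleep until the rate-limit window resets.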

We have created the text file with all the JSON data (except for 25 tweets). However, for now the data is just raw text in a file. We have to extract the relevant information into a DataFrame for further analysis.

In [8]:
# Extract the relevant information: number of retweets and number of likes
# compile the pattern once; the groups are the tweet ID, the retweet count and the favorite count
pattern = re.compile(r',\s"id":\s(\d+).+"retweet_count":\s(\d+).+"favorite_count":\s(\d+)')

tweet_id_list = []
retweet_count_list = []
favorite_count_list = []

# read the file line by line (one tweet per line) and search each line only once
with open('tweet_json.txt', encoding='utf-8') as filetest:
    for line in filetest:
        tweet_id, retweet_count, favorite_count = pattern.search(line).groups()
        tweet_id_list.append(tweet_id)
        retweet_count_list.append(retweet_count)
        favorite_count_list.append(favorite_count)
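
Because each line of `tweet_json.txt` was written with `json.dump`, the same three fields can also be read with `json.loads` instead of a regular expression. The following is a minimal sketch, not the method used in this notebook: the function name `parse_counts` is hypothetical, and the two sample lines are heavily trimmed stand-ins for real tweet JSON that reuse the first and last tweet's values from this notebook's output.

```python
import io
import json


def parse_counts(lines):
    """Turn an iterable of JSON lines into a list of per-tweet count records."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines, e.g. a trailing newline at the end of the file
        tweet = json.loads(line)
        rows.append({
            'tweet_id': str(tweet['id']),
            'retweets_count': tweet['retweet_count'],
            'favorites_count': tweet['favorite_count'],
        })
    return rows


# trimmed sample lines; a real file would be opened with open('tweet_json.txt')
sample_file = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 7598, "favorite_count": 35866}\n'
    '{"id": 666020888022790149, "retweet_count": 459, "favorite_count": 2388}\n'
)
rows = parse_counts(sample_file)
print(rows[0])
print(rows[-1])
```

Parsing the JSON reads the top-level keys directly, so it does not depend on the order in which the keys appear in the line. A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`.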

In [9]:
# do a visual assessment that the result is correct
print(tweet_id_list[0], retweet_count_list[0], favorite_count_list[0])

892420643555336193 7598 35866


In [10]:
# confirm again the result is correct
print(tweet_id_list[-1], retweet_count_list[-1], favorite_count_list[-1])

666020888022790149 459 2388


In [11]:
# create the dataframe out of the three lists
df_counts = pd.DataFrame(list(zip(tweet_id_list,
                                  retweet_count_list,
                                  favorite_count_list)),
                         columns=['tweet_id', 'retweets_count', 'favorites_count'])

In [12]:
# visual assessment that the dataframe was created correctly
df_counts.head()

Unnamed: 0,tweet_id,retweets_count,favorites_count
0,892420643555336193,7598,35866
1,892177421306343426,5627,30940
2,891815181378084864,3724,23285
3,891689557279858688,7778,39137
4,891327558926688256,8374,37385


With this last DataFrame we finish **gathering** the data. So far we have created three DataFrames, which we are going to assess, clean and analyse. These DataFrames are:
* `df_main`: Contains the tweet IDs and some basic information about all WeRateDogs tweets up to 2017
* `df_pred`: Contains predictions of the dog breed shown in the pictures of the WeRateDogs tweets up to 2017
* `df_counts`: Contains further information about the tweets, namely the number of retweets and the number of "likes" (also known as "favorites")

## Data Assessment

We will assess the data visually and programmatically. Our main goals are to identify at least 2 tidiness issues and at least 8 quality issues. **All the observations will be written at the end of this chapter.** Things we need to keep in mind:
* We only want original ratings (no retweets) that have images. The archive holds 2,356 tweets, but not all of them are dog ratings and some are retweets.
* A rating numerator above 10 is not a quality issue; it is part of what makes WeRateDogs popular.
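
As a rough sketch of how the first point can be checked programmatically, the example below counts retweets and keeps only original tweets that have an image prediction. It runs on two tiny made-up tables rather than on `df_main` and `df_pred`; the variable names and values are illustrative assumptions, and only the column names `tweet_id`, `retweeted_status_id`, `rating_numerator` and `jpg_url` come from the real data.

```python
import pandas as pd

# made-up stand-in for df_main: a retweet has a non-null retweeted_status_id
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [None, 99.0, None, None],
    'rating_numerator': [13, 12, 11, 14],
})

# made-up stand-in for df_pred: only tweets with an image appear here
predictions = pd.DataFrame({
    'tweet_id': [1, 2, 4],
    'jpg_url': ['a.jpg', 'b.jpg', 'd.jpg'],
})

# how many rows are retweets?
n_retweets = int(archive['retweeted_status_id'].notna().sum())

# keep original tweets only, then keep those that also have an image
originals = archive[archive['retweeted_status_id'].isna()]
with_images = originals.merge(predictions, on='tweet_id', how='inner')

print(n_retweets)
print(with_images['tweet_id'].tolist())
```

An inner merge on `tweet_id` drops every tweet that is missing from the predictions table, which is one way to express "has an image" with the data gathered here.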

### Visual Assessment of the three DataFrames

In [13]:
# take a look at the main DataFrame
df_main

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [14]:
df_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [15]:
df_counts

Unnamed: 0,tweet_id,retweets_count,favorites_count
0,892420643555336193,7598,35866
1,892177421306343426,5627,30940
2,891815181378084864,3724,23285
3,891689557279858688,7778,39137
4,891327558926688256,8374,37385
...,...,...,...
2326,666049248165822465,40,96
2327,666044226329800704,130,268
2328,666033412701032449,41,111
2329,666029285002620928,42,120


### Programmatic Assessment of the Data

In [16]:
# get the basic information of the DataFrames:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [17]:
df_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [18]:
df_counts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   tweet_id         2331 non-null   object
 1   retweets_count   2331 non-null   object
 2   favorites_count  2331 non-null   object
dtypes: object(3)
memory usage: 54.8+ KB


In [35]:
df_main.puppo.value_counts()

None     2326
puppo      30
Name: puppo, dtype: int64
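The counts above show that a missing stage is stored as the literal string "None" rather than as a missing value. A minimal sketch of turning such a column into a 0/1 indicator, using a small made-up frame (`demo` and its rows are assumptions for illustration, not rows from the archive):

```python
import pandas as pd

# toy stand-in for df_main; a missing stage is the literal string "None"
demo = pd.DataFrame({
    "doggo": ["None", "doggo", "None"],
    "puppo": ["puppo", "None", "None"],
})

stage_columns = ["doggo", "puppo"]

# a cell counts as a hit only when it holds the stage name itself
for column in stage_columns:
    demo[column] = (demo[column] == column).astype(int)

print(demo)
```

Comparing each cell against the column name avoids treating "None" as a value.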

In [19]:
df_main.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [20]:
df_main.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
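Some of these unusual numerators may come from how the rating was pulled out of the tweet text rather than from the tweet itself. A minimal sketch of re-extracting the rating with a regular expression so that it can be compared with the stored columns; the pattern, the `demo` frame and its texts are assumptions for illustration, not the method used to build the archive:

```python
import pandas as pd

# made-up tweet texts; the second one is a hypothetical decimal rating
demo = pd.DataFrame({
    "text": [
        "This is Reese. He likes holding hands. 12/10",
        "Hypothetical decimal rating 9.75/10 for illustration",
        "No rating in this tweet",
    ]
})

# numerator (optionally with decimals), a slash, then the denominator
rating_pattern = r"(\d+(?:\.\d+)?)/(\d+)"

extracted = demo["text"].str.extract(rating_pattern)
extracted.columns = ["numerator", "denominator"]

# convert the captured strings to numbers; rows without a match stay NaN
extracted = extracted.apply(pd.to_numeric)

print(extracted)
```

Rows where the re-extracted values differ from `rating_numerator` or `rating_denominator` would be candidates for a manual look.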

In [21]:
# check for duplicates
df_main.text.duplicated().sum()

0

In [22]:
df_pred.duplicated().sum()

0

In [23]:
# get a list of the duplicated columns - only the tweet_id should be duplicated
list_columns = pd.Series(list(df_main) + list(df_pred) + list(df_counts))
list_columns[list_columns.duplicated()]

17    tweet_id
29    tweet_id
dtype: object

In [24]:
df_main.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [25]:
df_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [26]:
df_counts.describe()

Unnamed: 0,tweet_id,retweets_count,favorites_count
count,2331,2331,2331
unique,2331,1699,1985
top,754120377874386944,451,0
freq,1,6,163


### Assessment Summary:

#### Tidiness Issues
* `in_reply_to` columns convey the same information: original tweet or reply
* `retweeted_status` columns convey the same information: original tweet or retweet - *leave only the timestamp*
* Untidy data: each type of observational unit does not form a table - *Goal is to get four tables: One for the text and information in the text, one for the tweet metrics, one for the predictions, and one for the metadata*

#### Quality Issues
* Missing data while querying the Twitter API
* Unsure about the quality of the data extracted from the tweet text
    * `rating` columns
    * `dog stages` columns
* Column names may contain unwanted characters - *`df_main` column `tweet_id` is not recognized as equal to the `tweet_id` column of the other DataFrames*
* Column `rating_denominator` does not always have the same denominator - *for further analysis we should focus on the base 10 ratings*
* Column `rating_numerator` contains unexpectedly high values in a few cases - *by definition, the numerator being greater than the denominator is correct; however, extremely high ratings would bias any analysis*
* Columns `doggo` - `puppo`: "None" values are counted as values (or strings) - *they should be categorical, preferably 0 or 1*
* Columns containing IDs are not strings - *they should be strings*
* Columns containing timestamps are saved as objects - *they should be of data type datetime*
* Columns ending in `_count` are saved as objects - *they should be integers*
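The three data type issues at the end of this list can be sketched on a small frame. This is only an illustration under assumptions: `demo` is a toy frame built from a few values shown in the outputs above, and the real cleaning may need extra handling for missing values.

```python
import pandas as pd

# toy frame mirroring the dtypes seen in the assessment
demo = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
    "retweets_count": ["7598", "5627"],
})

# IDs are labels, not quantities, so store them as strings
demo["tweet_id"] = demo["tweet_id"].astype(str)

# parse the timestamp strings into timezone-aware datetimes
demo["timestamp"] = pd.to_datetime(demo["timestamp"])

# convert the counts from object to a numeric dtype
demo["retweets_count"] = pd.to_numeric(demo["retweets_count"])

print(demo.dtypes)
```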

## Data Cleaning

We will start with the "messy data" (tidiness problems) and then continue with the "dirty data" (quality issues).

First we need to create copies of the datasets.

In [59]:
# copy the datasets
df_main_clean = df_main.copy()
df_pred_clean = df_pred.copy()
df_counts_clean = df_counts.copy()

### Tidiness

> * `in_reply_to` columns convey the same information: original tweet or reply
> * `retweeted_status` columns convey the same information: original tweet or retweet - *leave only the timestamp*

***Define***

* **Drop the unnecessary variables from `df_main`:** `in_reply_to` and `retweeted_status` so that each variable has a column
* **Give the columns a better name**

***Code***

In [60]:
# check the name of the columns
df_main_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [61]:
# drop the unnecessary columns
df_main_clean.drop(columns=['in_reply_to_user_id', 
                            'retweeted_status_id', 
                            'retweeted_status_user_id',],
                  inplace=True)

In [62]:
# rename the columns
df_main_clean = df_main_clean.rename(columns = {'in_reply_to_status_id': 'is_reply', 
                                               'retweeted_status_timestamp': 'is_retweet'})

***Test***

In [63]:
# test the column names:
df_main_clean.columns

Index(['tweet_id', 'is_reply', 'timestamp', 'source', 'text', 'is_retweet',
       'expanded_urls', 'rating_numerator', 'rating_denominator', 'name',
       'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')
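After the rename, `is_reply` and `is_retweet` still hold the original status IDs and timestamps, with missing values for original tweets. If a true yes/no flag is wanted, one possible follow-up is to test for missing values. A minimal sketch on a made-up frame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# toy frame: a missing value means "original tweet"
demo = pd.DataFrame({
    "is_reply": [None, 6.7e17],
    "is_retweet": ["2017-01-01 00:00:00 +0000", None],
})

# True where a reply ID or retweet timestamp is present
flags = demo.notna()

print(flags)
```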

>* Untidy data: each type of observational unit does not form a table - *Goal is to get four tables: One for the text and information in the text, one for the tweet metrics, one for the predictions, and one for the metadata*

***Define***

Create four tables such that:
* `t_text`: this table should contain only the text and the information extracted from it, such as the rating and other keywords (e.g. doggo)
* `t_pred`: this table should contain only the predicted breed for each picture
* `t_metrics`: this table should contain only the tweet metrics, such as whether the tweet is a reply or a retweet, the number of retweets and so on
* `t_metadata`: this table should contain only the metadata of the tweets, such as the timestamp, the source and the image URLs

***Code***

In [64]:
# merge all tables into one big DataFrame
# before merging we need to set keys to the same data type
df_main_clean.tweet_id = df_main_clean.tweet_id.astype(str).str.strip()
df_pred_clean.tweet_id = df_pred_clean.tweet_id.astype(str).str.strip()
df_counts_clean.tweet_id = df_counts_clean.tweet_id.astype(str).str.strip()

In [65]:
# merge df_main_clean and df_counts_clean
df_main_clean = pd.merge(df_main_clean, df_counts_clean, how='left', on='tweet_id')

In [66]:
# merge df_main_clean and df_pred_clean
df_main_clean = pd.merge(df_main_clean, df_pred_clean, how='left', on='tweet_id')
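A left merge keeps every row of the left table, so tweets without a match simply get missing values. A minimal sketch of how `pd.merge` can report this directly, using its `validate` and `indicator` parameters on two made-up frames (`main`, `counts` and their values are assumptions for illustration):

```python
import pandas as pd

# toy frames with string keys, as prepared before the merge
main = pd.DataFrame({"tweet_id": ["1", "2", "3"], "text": ["a", "b", "c"]})
counts = pd.DataFrame({"tweet_id": ["1", "3"], "retweets_count": [10, 30]})

merged = pd.merge(
    main,
    counts,
    how="left",
    on="tweet_id",
    validate="one_to_one",  # raises if a key is repeated on either side
    indicator=True,         # adds a _merge column: both / left_only
)

# tweets that found no partner in the right table
missing = merged.loc[merged["_merge"] == "left_only", "tweet_id"]

print(missing.tolist())
```

The `_merge` column can be dropped again once the check is done.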

In [69]:
# note: selecting a list of columns returns a new DataFrame, so cleaning applied
# to these tables later will not propagate back to df_main_clean

t_text = df_main_clean[['tweet_id', 
                        'text', 
                        'rating_numerator', 
                        'rating_denominator', 
                        'name', 
                        'doggo', 
                        'floofer', 
                        'pupper', 
                        'puppo']]

In [70]:
t_pred = df_main_clean[['tweet_id', 
                        'p1', 
                        'p1_conf', 
                        'p1_dog', 
                        'p2', 
                        'p2_conf', 
                        'p2_dog', 
                        'p3', 
                        'p3_conf', 
                        'p3_dog']]

In [71]:
t_metrics = df_main_clean[['tweet_id', 
                           'is_reply', 
                           'is_retweet', 
                           'retweets_count', 
                           'favorites_count']]

In [72]:
t_metadata = df_main_clean[['tweet_id', 
                            'timestamp', 
                            'source', 
                            'expanded_urls', 
                            'jpg_url', 
                            'img_num']]
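Whether later edits to these tables reach `df_main_clean` is worth checking explicitly. In pandas, selecting a list of columns gives a new DataFrame, and adding `.copy()` makes that intent explicit. A minimal sketch on a made-up frame (`parent`, `subset` and the replacement value are assumptions for illustration):

```python
import pandas as pd

# toy stand-in for the merged DataFrame
parent = pd.DataFrame({"tweet_id": ["1", "2"], "name": ["Phineas", "a"]})

# explicit copy: the new table is independent of its parent
subset = parent[["tweet_id", "name"]].copy()

# clean the subset only
subset.loc[subset["name"] == "a", "name"] = "unknown"

print(parent["name"].tolist())
print(subset["name"].tolist())
```

The parent frame keeps its original values, so any cleaning has to be applied to the table that will actually be analyzed.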

***Test***

In [55]:
# check the merge worked correctly
df_main_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 27 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   object 
 1   in_reply_to_status_id       78 non-null     float64
 2   timestamp                   2356 non-null   object 
 3   source                      2356 non-null   object 
 4   text                        2356 non-null   object 
 5   retweeted_status_timestamp  181 non-null    object 
 6   expanded_urls               2297 non-null   object 
 7   rating_numerator            2356 non-null   int64  
 8   rating_denominator          2356 non-null   int64  
 9   name                        2356 non-null   object 
 10  doggo                       2356 non-null   object 
 11  floofer                     2356 non-null   object 
 12  pupper                      2356 non-null   object 
 13  puppo                       2356 

In [74]:
# check that no column was lost
list_columns = pd.Series(list(t_text) + list(t_pred) + list(t_metrics) + list(t_metadata))
len(list_columns)

30

In [75]:
# only the tweet_id column should be repeated
list_columns[list_columns.duplicated()]

9     tweet_id
19    tweet_id
24    tweet_id
dtype: object

In [76]:
# check one of the newly formed DataFrames
t_text.sample(5)

Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1775,678021115718029313,This is Reese. He likes holding hands. 12/10 h...,12,10,Reese,,,,
2342,666082916733198337,Here we have a well-established sunblockerspan...,6,10,,,,,
1509,691459709405118465,Say hello to Leo. He's a Fallopian Puffalope. ...,12,10,Leo,,,,
808,771770456517009408,This is Davey. He'll have your daughter home b...,11,10,Davey,,,,
539,806576416489959424,Hooman catch successful. Massive hit by dog. F...,13,10,,,,,
