# Gathering Data

I import the libraries at the start of each task, so it is easy to see which libraries that task depends on.

#### 1. Get Twitter Archive Data

Todo:
1. Import library needed
2. Read <b>twitter-archive-enhanced.csv</b> from the same folder
3. Make sure that data has been read correctly
    - print head

In [1]:
import pandas as pd

In [2]:
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive_df = twitter_archive_df.sort_values('timestamp')
twitter_archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,


#### 2. Get Tweet Image Prediction Data

Todo:
1. Import library needed
2. Read <b>image-predictions.tsv</b> from Udacity's server, where it can be accessed at <i> https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv </i>
3. Make sure that data has been read correctly
    - print head
    - describe domain knowledge about the data

In [3]:
import requests

In [4]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the raw bytes to a local file
with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [5]:
# Read the tab-separated file
image_prediction_df = pd.read_csv('image-predictions.tsv', sep='\t')
image_prediction_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### The description:
- tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
- p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
- p1_conf is how confident the algorithm is in its #1 prediction → 95%
- p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
- p2 is the algorithm's second most likely prediction → Labrador retriever
- p2_conf is how confident the algorithm is in its #2 prediction → 1%
- p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
- etc.
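
The column meanings above suggest one way to reduce the three predictions to a single breed per image: take the highest-ranked prediction that is flagged as a dog. The following is a minimal sketch under that assumption. The helper name `best_dog_prediction` is hypothetical, and the third sample row is made up to show the fallback to `p2`.

```python
import pandas as pd

# Two rows copied from the head() output above, plus one made-up row
# (tweet_id 1) whose top prediction is not a dog.
sample = pd.DataFrame({
    'tweet_id': [666020888022790149, 666029285002620928, 1],
    'p1': ['Welsh_springer_spaniel', 'redbone', 'paper_towel'],
    'p1_conf': [0.465074, 0.506826, 0.60],
    'p1_dog': [True, True, False],
    'p2': ['collie', 'miniature_pinscher', 'pug'],
    'p2_conf': [0.156665, 0.074192, 0.20],
    'p2_dog': [True, True, True],
    'p3': ['Shetland_sheepdog', 'Rhodesian_ridgeback', 'mop'],
    'p3_conf': [0.061428, 0.07201, 0.05],
    'p3_dog': [True, True, False],
})

def best_dog_prediction(row):
    """Return (breed, confidence) of the highest-ranked dog prediction, or (None, None)."""
    for rank in (1, 2, 3):
        if row['p{}_dog'.format(rank)]:
            return row['p{}'.format(rank)], row['p{}_conf'.format(rank)]
    return None, None

# Expand the returned tuples into two columns.
best = sample.apply(best_dog_prediction, axis=1, result_type='expand')
best.columns = ['breed', 'breed_conf']
print(pd.concat([sample[['tweet_id']], best], axis=1))
```

Whether to keep images with no dog prediction at all is a separate cleaning decision that this sketch does not make.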

#### 3. Configure Twitter Account

Todo:
1. Import library needed
2. Declare the Twitter configuration with consumer_key, consumer_secret, access_token, and access_secret
3. Create the authenticated API object

In [6]:
import tweepy

In [7]:
# For security reasons, the credentials are kept in a separate CSV file instead of in the notebook
twitter_configuration = pd.read_csv("twitter_configuration.csv")

In [18]:
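try:
    # Credentials must be passed as plain strings, hence the [0] lookups
    auth = tweepy.OAuthHandler(twitter_configuration.consumer_key[0], twitter_configuration.consumer_secret[0])
    auth.set_access_token(twitter_configuration.access_token[0], twitter_configuration.access_secret[0])
except tweepy.TweepError as t:
    # Exceptions have no .message attribute in Python 3, so print the exception itself
    print(t)

# Let tweepy wait, and say so, whenever the rate limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)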

#### 4. Get Twitter Data with API & JSON

Todo:
1. Import the libraries needed (if not already imported)
2. Get the tweet data as JSON, by ID, for the tweets in the file from step 1
    - append each tweet's JSON data to a list
    - record the IDs that the API cannot find
    - count the number of IDs we want to look up
    - count the successful and failed lookups
    - save the tweet data in a txt file so we can access it many times
3. Read the tweet data and store it in a dataframe so we can use it in the notebook
4. Make sure that data has been read correctly
    - print head

In [9]:
import json
from timeit import default_timer as timer

In [13]:
tweets = []
ids_missing_tweet = []
num_tweet_id = len(twitter_archive_df.tweet_id)
num_succes_get_data = 0
num_fail_get_data = 0
for tweet_id in twitter_archive_df.tweet_id:
    try:
        temp = api.get_status(tweet_id)._json
        tweets.append({'tweet_id':temp['id'],
                       'created_at':temp['created_at'],
                       'favorite_count':temp['favorite_count'],
                       'favorited':temp['favorited'],
                       'retweet_count':temp['retweet_count'],
                       'retweeted':temp['retweeted']})
        num_succes_get_data += 1
        print('{} : done, {}/{}'.format(tweet_id, num_succes_get_data, num_tweet_id))
    except tweepy.TweepError as t:
        num_fail_get_data += 1
        ids_missing_tweet.append(tweet_id)
        print('{} : {}, total fail= {}'.format(tweet_id, t, num_fail_get_data))

666020888022790149 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1
666029285002620928 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 2
666033412701032449 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 3
666044226329800704 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 4
666049248165822465 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 5
666050758794694657 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 6
666051853826850816 : Failed to send request: Only unicode 

670782429121134593 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 264
670783437142401025 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 265
670786190031921152 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 266
670789397210615808 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 267
670792680469889025 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 268
670797304698376195 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 269
670803562457407488 : Failed to send request: O

676121918416756736 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 522
676146341966438401 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 523
676191832485810177 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 524
676215927814406144 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 525
676219687039057920 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 526
676237365392908289 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 527
676263575653122048 : Failed to send request: O

690360449368465409 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 829
690374419777196032 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 830
690400367696297985 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 831
690597161306841088 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 832
690607260360429569 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 833
690649993829576704 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 834
690690673629138944 : Failed to send request: O

704847917308362754 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1015
704859558691414016 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1016
704871453724954624 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1017
705066031337840642 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1018
705102439679201280 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1019
705223444686888960 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1020
705239209544720384 : Failed to send requ

728387165835677696 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1220
728409960103686147 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1221
728653952833728512 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1222
728751179681943552 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1223
728760639972315136 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1224
728986383096946689 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1225
729113531270991872 : Failed to send requ

766793450729734144 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1514
766864461642756096 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1515
767122157629476866 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1516
767191397493538821 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1517
767500508068192258 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1518
767754930266464257 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1519
767884188863397888 : Failed to send requ

800751577355128832 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1778
800855607700029440 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1779
800859414831898624 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1780
801115127852503040 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1781
801127390143516673 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1782
801167903437357056 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1783
801285448605831168 : Failed to send requ

826476773533745153 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1967
826598365270007810 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1968
826598799820865537 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1969
826615380357632002 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1970
826848821049180160 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1971
826958653328592898 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 1972
827199976799354881 : Failed to send requ

866450705531457537 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 2221
866686824827068416 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 2222
866720684873056260 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 2223
866816280283807744 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 2224
867051520902168576 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 2225
867072653475098625 : Failed to send request: Only unicode objects are escapable. Got {0: 'Q7ebdh68Qk1giixyXWq2EqEir'} of type <class 'dict'>., total fail= 2226
867421006826221569 : Failed to send requ
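
The repeated message above says that a `dict` of the form `{0: '...'}` reached the request signer where a string was expected. One plausible cause, and this is an assumption because the cell that built the failing `auth` object is not shown, is that a whole pandas column (or its `to_dict()` form) was passed instead of the scalar in row 0. A minimal sketch with placeholder credentials shows the difference:

```python
import pandas as pd

# Placeholder values only; the real ones live in twitter_configuration.csv.
config = pd.DataFrame({'consumer_key': ['PLACEHOLDER_KEY'],
                       'consumer_secret': ['PLACEHOLDER_SECRET']})

as_dict = config.to_dict()['consumer_key']   # a dict keyed by row index
as_scalar = config.consumer_key[0]           # a plain string

print(type(as_dict).__name__, as_dict)
print(type(as_scalar).__name__, as_scalar)

# Converting the first row to plain Python strings up front avoids the mix-up.
credentials = {key: str(value) for key, value in config.iloc[0].items()}
print(credentials)
```

Passing `credentials['consumer_key']` and the other entries to the auth handler guarantees that only strings are sent.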

In [None]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
num_tweet_id = len(twitter_archive_df.tweet_id)
num_succes_get_data = 0
num_fail_get_data = 0

tweet_ids = twitter_archive_df.tweet_id
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in twitter_archive_df.tweet_id:
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, outfile)
            print(outfile)
            outfile.write('\n')
            
            num_succes_get_data += 1
            print('{} : done, {}/{}'.format(tweet_id, num_succes_get_data, num_tweet_id))
        except tweepy.TweepError as e:
            num_fail_get_data += 1
            print('{} : {}, total fail= {}'.format(tweet_id, e, num_fail_get_data))
            fails_dict[tweet_id] = e
            pass
        
end = timer()
print(end - start)
print(fails_dict)

<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666020888022790149 : done, 1/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666029285002620928 : done, 2/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666033412701032449 : done, 3/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666044226329800704 : done, 4/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666049248165822465 : done, 5/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666050758794694657 : done, 6/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666051853826850816 : done, 7/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666055525042405380 : done, 8/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
666057090499244032 : done, 9/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
66605860052415

<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667453023279554560 : done, 81/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667455448082227200 : done, 82/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667470559035432960 : done, 83/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667491009379606528 : done, 84/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667495797102141441 : done, 85/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667502640335572993 : done, 86/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667509364010450944 : done, 87/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667517642048163840 : done, 88/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
667524857454854144 : done, 89/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
66753

<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668852170888998912 : done, 160/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668872652652679168 : done, 161/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668892474547511297 : done, 162/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668902994700836864 : done, 163/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668932921458302977 : done, 164/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668955713004314625 : done, 165/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668960084974809088 : done, 166/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668967877119254528 : done, 167/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
668975677807423489 : done, 168/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-

<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670421925039075328 : done, 239/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670427002554466305 : done, 240/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670428280563085312 : done, 241/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670433248821026816 : done, 242/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670434127938719744 : done, 243/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670435821946826752 : done, 244/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670442337873600512 : done, 245/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670444955656130560 : done, 246/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
670449342516494336 : done, 247/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-

<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671550332464455680 : done, 318/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671561002136281088 : done, 319/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671729906628341761 : done, 320/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671735591348891648 : done, 321/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671743150407421952 : done, 322/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671744970634719232 : done, 323/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671763349865160704 : done, 324/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671768281401958400 : done, 325/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
671789708968640512 : done, 326/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-

<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673576835670777856 : done, 397/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673580926094458881 : done, 398/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673583129559498752 : done, 399/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673612854080196609 : done, 400/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673636718965334016 : done, 401/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673656262056419329 : done, 402/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673662677122719744 : done, 403/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673680198160809984 : done, 404/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
673686845050527744 : done, 405/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-

<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675006312288268288 : done, 476/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675015141583413248 : done, 477/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675047298674663426 : done, 478/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675109292475830276 : done, 479/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675111688094527488 : done, 480/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675113801096802304 : done, 481/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675135153782571009 : done, 482/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675145476954566656 : done, 483/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-8'>
675146535592706048 : done, 484/2356
<_io.TextIOWrapper name='tweet_json.txt' mode='w' encoding='UTF-

In [None]:
print("Success to get {} data, and fail to get {} data, from total {} data."\
      .format(num_succes_get_data, num_fail_get_data, num_tweet_id))

##### Why can't we find 22 of the tweets, and what should we do about them?
The tweets may have been deleted.
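
To see exactly which tweets are missing, the requested IDs can be compared with the IDs that came back. This is a small sketch on toy IDs, not the real data; in the notebook the two inputs would be `twitter_archive_df.tweet_id` and the ID column of the fetched tweets.

```python
import pandas as pd

# Toy stand-ins: five requested IDs, of which three came back from the API.
requested = pd.Series([101, 102, 103, 104, 105], name='tweet_id')
fetched = pd.Series([101, 103, 104], name='tweet_id')

# Keep the requested IDs that do not appear among the fetched ones.
missing_ids = requested[~requested.isin(fetched)].tolist()
print('{} of {} tweets missing: {}'.format(len(missing_ids), len(requested), missing_ids))
```

The error stored for each ID in `fails_dict` can then be checked to confirm whether the tweet was deleted or failed for another reason.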

In [None]:
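# Read the JSON file into a dataframe.
# tweet_json.txt holds one JSON object per line, so each line is parsed separately;
# json.load(f) on the whole file would fail after the first line.
with open('tweet_json.txt', 'r') as f:
    data = [json.loads(line) for line in f if line.strip()]

scrapped_tweet_df = pd.DataFrame(data)
scrapped_tweet_df.head()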
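
The file written by the loop above holds one JSON object per line, which matters when reading it back. The sketch below uses an in-memory buffer and two made-up tweets to show that `json.load` rejects such content, while parsing line by line works.

```python
import io
import json
import pandas as pd

# Two made-up tweets written the same way as in the loop: one JSON object per line.
buffer = io.StringIO()
for tweet in ({'id': 1, 'retweet_count': 10, 'favorite_count': 25},
              {'id': 2, 'retweet_count': 3, 'favorite_count': 7}):
    json.dump(tweet, buffer)
    buffer.write('\n')

# json.load expects a single JSON document, so the second line makes it fail.
buffer.seek(0)
try:
    json.load(buffer)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

# Parsing each line separately works.
buffer.seek(0)
records = [json.loads(line) for line in buffer if line.strip()]
tweets_df = pd.DataFrame(records)
print(whole_file_ok, len(tweets_df))
```

pandas can also read this layout directly with `pd.read_json(path, lines=True)`.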

# Assessing Data

We now have three datasets: twitter_archive_df, image_prediction_df, and scrapped_tweet_df.

Todo:
1. Get the missing-value percentage for each dataset

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from quilt.data.ResidentMario import missingno_data
import numpy as np
import missingno as msno

%matplotlib inline

##### 1. Check length of data

In [None]:
def print_length(name, data_frame):
    print("The length of {} is {}".format(name, len(data_frame)))

In [None]:
print_length('twitter_archive_df', twitter_archive_df)
print_length('image_prediction_df', image_prediction_df)
print_length('scrapped_tweet_df', scrapped_tweet_df)

This shows that twitter_archive_df and scrapped_tweet_df have different lengths, because we failed to get 22 tweets from Twitter. We can delete the unmatched rows so that every table has the same length.
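
One way to bring the tables to the same length is to keep only the tweet IDs they share. The sketch below uses toy tables and shows two options: filtering one table by the IDs of the other, or joining them with an inner merge.

```python
import pandas as pd

# Toy tables: the archive has four tweets, the API returned three of them.
archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4], 'rating_numerator': [8, 7, 9, 6]})
scraped = pd.DataFrame({'tweet_id': [1, 3, 4], 'retweet_count': [10, 3, 5]})

# Option 1: keep only the archive rows whose tweet_id was also fetched.
archive_kept = archive[archive.tweet_id.isin(scraped.tweet_id)]

# Option 2: join the tables; an inner merge drops unmatched rows on both sides.
merged = archive.merge(scraped, on='tweet_id', how='inner')
print(len(archive_kept), len(merged))
```

The merge also answers the question of joining the columns, at the cost of producing one wide table.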

##### 2. Check Data Type

In [None]:
twitter_archive_df.dtypes

In [None]:
image_prediction_df.dtypes

In [None]:
scrapped_tweet_df.dtypes

The object data type means string. That is fine for every column except timestamp, which should be a date type.
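
The conversion can be sketched with `pd.to_datetime`. The two timestamps are copied from the `head()` output of twitter_archive_df; passing `utc=True` is a choice made here so that the `+0000` offset yields a timezone-aware column.

```python
import pandas as pd

# Timestamps copied from the head() output of twitter_archive_df.
df = pd.DataFrame({'timestamp': ['2015-11-15 22:32:08 +0000',
                                 '2015-11-16 00:04:52 +0000']})
print(df.timestamp.dtype)      # still strings, not dates

df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
print(df.timestamp.dtype)      # now a timezone-aware datetime type
print(df.timestamp.dt.date.tolist())
```

Once converted, the `.dt` accessor gives direct access to the year, month, hour, and so on.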

##### 3. Check the Values

In [None]:
twitter_archive_df.name.value_counts().head()

These are the five most frequent dog names. "None" clearly stands for missing data, and I assume that "a" does too, so we must find all such placeholder values in each column and make them uniform.
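
Making the placeholders uniform can be sketched as follows, on a toy name column. The list `placeholders` is an assumption based on the two values noted above and would need extending as other non-names turn up.

```python
import pandas as pd

# Toy name column; 'None' and 'a' are the placeholder values noted above.
names = pd.Series(['Charlie', 'None', 'a', 'Lucy', 'None'], name='name')

placeholders = ['None', 'a']

# mask() replaces every entry where the condition is True with NaN.
clean_names = names.mask(names.isin(placeholders))

print(clean_names.isna().sum())
print(clean_names.value_counts())
```

With real NaN values in place, `isna()` and the missing-value percentages report these rows correctly.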

In [None]:
scrapped_tweet_df.retweeted.value_counts()

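Every row has the same value here, which fits our goal of keeping only original ratings (no retweets) that have images. Note that in the Twitter API this flag records whether the authenticating account has retweeted the tweet; whether a row is itself a retweet is indicated by retweeted_status_id in twitter_archive_df.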

In [None]:
twitter_archive_df.duplicated(['tweet_id']).sum()

In [None]:
twitter_archive_df.duplicated(['expanded_urls']).sum()

In [None]:
twitter_archive_df[twitter_archive_df.duplicated(['expanded_urls'])]

In [None]:
twitter_archive_df[twitter_archive_df.duplicated(['expanded_urls'])].expanded_urls.value_counts()

In [None]:
twitter_archive_df.query("expanded_urls == 'https://twitter.com/dog_rates/status/767754930266464257/photo/1'")

In [None]:
image_prediction_df.duplicated(['jpg_url']).sum()

Some images are duplicated. We must re-check whether the duplicated rows have the same values in every column (except the ID, because there are no duplicate tweet IDs).
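
That re-check can be sketched by grouping the duplicated rows by URL and counting the distinct values in each remaining column. The table below is a toy example with made-up values.

```python
import pandas as pd

# Toy prediction table: the first two rows share a jpg_url (made-up values).
preds = pd.DataFrame({
    'tweet_id': [11, 12, 13],
    'jpg_url': ['https://example.com/a.jpg', 'https://example.com/a.jpg',
                'https://example.com/b.jpg'],
    'p1': ['pug', 'pug', 'collie'],
    'p1_conf': [0.9, 0.9, 0.5],
})

# keep=False marks every member of a duplicate group, not just the later ones.
dupes = preds[preds.duplicated('jpg_url', keep=False)]

# For each shared URL, count distinct values per column (tweet_id excluded).
# A count of 1 everywhere means the rows agree apart from the ID.
distinct = dupes.drop(columns='tweet_id').groupby('jpg_url').nunique()
rows_agree = bool((distinct == 1).all().all())
print(distinct)
print(rows_agree)
```

If the rows agree, keeping one row per URL loses no prediction data.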

In [None]:
scrapped_tweet_df.favorited.value_counts()

The retweeted and favorited columns each hold only one value, so they carry no information and should be dropped.
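
Single-valued columns can also be found programmatically instead of one `value_counts()` call at a time. This is a sketch on a toy frame, assuming every column holds scalar values.

```python
import pandas as pd

# Toy frame: 'retweeted' and 'favorited' hold a single value in every row.
tweets = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweet_count': [10, 3, 5],
    'retweeted': [False, False, False],
    'favorited': [False, False, False],
})

# A column with at most one distinct value cannot distinguish any rows.
constant_cols = [col for col in tweets.columns
                 if tweets[col].nunique(dropna=False) <= 1]
tweets = tweets.drop(columns=constant_cols)
print(constant_cols, list(tweets.columns))
```

On the full tweet JSON, columns that hold nested dicts or lists are not hashable, so the check would have to be restricted to the scalar columns.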

##### 4. Check Missing Values

In [None]:
def get_missing_value_percentage(data_frame):
    data_missing = data_frame.isna()
    num_data_missing = data_missing.sum()
    num_data = len(data_frame)
    return (num_data_missing * 100)/num_data

In [None]:
get_missing_value_percentage(twitter_archive_df)

In [None]:
get_missing_value_percentage(image_prediction_df)

In [None]:
get_missing_value_percentage(scrapped_tweet_df)

twitter_archive_df has missing values in in_reply_to_status_id (96.69%), in_reply_to_user_id (96.69%), retweeted_status_id (92.32%), retweeted_status_user_id (92.32%), retweeted_status_timestamp (92.32%), and expanded_urls (2.50%). Because more than 90% of their values are missing, the first five columns should be dropped. expanded_urls must be checked again after joining with the other tables. Neither image_prediction_df nor scrapped_tweet_df has any missing values.
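
Dropping the sparse columns can be sketched with a threshold on the missing-value percentage. The frame below is a toy example of 20 rows, and the 90% cut-off follows the rule stated above.

```python
import numpy as np
import pandas as pd

# Toy frame of 20 rows: 'in_reply_to_status_id' is 95% missing, 'expanded_urls' 5%.
n = 20
df = pd.DataFrame({
    'tweet_id': range(n),
    'in_reply_to_status_id': [1.0] + [np.nan] * (n - 1),
    'expanded_urls': [np.nan] + ['https://example.com'] * (n - 1),
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_pct = df.isna().mean() * 100
sparse_cols = missing_pct[missing_pct > 90].index.tolist()
df = df.drop(columns=sparse_cols)
print(missing_pct.round(2).to_dict())
print(sparse_cols)
```

`df.isna().mean() * 100` gives the same percentages as `get_missing_value_percentage`.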

### Problems Found:

Need to Join All Columns???

1. Remove rows so that every table has the same length
2. Change timestamp to a date format
3. Make missing values uniform
4. Re-check duplicated tweets by expanded_urls and jpg_url
5. Drop retweeted and favorited because they have the same value in every row
6. Drop columns that have more than 90% missing values

Tidy:
1. The type of dog must be one column instead of four
2. The prediction rank and its confidence must become their own columns, replacing p1, p2, p3, p1_conf, and the others, so that values are not stored in the header names
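
Both tidiness fixes can be sketched on toy data. The new column names (`dog_stage`, `prediction`, `confidence`, `is_dog`, `prediction_rank`) are hypothetical, and the sketch assumes that an empty stage cell is stored as NaN.

```python
import numpy as np
import pandas as pd

# Toy archive rows; NaN means the stage was not mentioned in the tweet.
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', np.nan, 'doggo'],
    'floofer': [np.nan, np.nan, np.nan],
    'pupper': [np.nan, np.nan, 'pupper'],
    'puppo': [np.nan, np.nan, np.nan],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Join whatever stages are present in a row; rows with none get NaN.
archive['dog_stage'] = archive[stage_cols].apply(
    lambda row: ','.join(row.dropna()) or np.nan, axis=1)
archive = archive.drop(columns=stage_cols)

# Toy prediction rows in the wide p1/p2 layout (values made up).
preds = pd.DataFrame({
    'tweet_id': [1, 2],
    'p1': ['pug', 'collie'], 'p1_conf': [0.9, 0.5], 'p1_dog': [True, True],
    'p2': ['mop', 'redbone'], 'p2_conf': [0.05, 0.3], 'p2_dog': [False, True],
})

# Stack each rank's three columns under common names, keeping the rank as a value.
frames = []
for rank in (1, 2):
    prefix = 'p{}'.format(rank)
    part = preds[['tweet_id', prefix, prefix + '_conf', prefix + '_dog']].rename(
        columns={prefix: 'prediction',
                 prefix + '_conf': 'confidence',
                 prefix + '_dog': 'is_dog'})
    part['prediction_rank'] = rank
    frames.append(part)
long_preds = pd.concat(frames, ignore_index=True)

print(archive)
print(long_preds)
```

Joining multiple stages with a comma keeps tweets that mention more than one stage instead of silently picking one.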
