# Data Wrangling for WeRateDogs Twitter archive

## Table of Contents

<ul>
<li><a href="#intro">1 Introduction</a></li>
<li><a href="#wrangling">2 Data Wrangling</a></li>
<li><a href="#eda">3 Exploratory Data Analysis</a></li>
<li><a href="#conclusions">4 Conclusion and limitations</a></li>
<li><a href="#Appendix">5 Appendix</a></li>
</ul>



<a id='intro'></a>
## 1 Introduction
> This notebook covers the data wrangling part of the 'Wrangle and Analyze Data' project. As usual, the wrangling process has three components: gathering, assessing, and cleaning data. At the very end of this notebook, I store the cleaned data in .csv files for later analysis and visualization.

In [2]:
import numpy as np
import pandas as pd
import requests
import io
import tweepy
from tweepy import OAuthHandler
import json
import timeit
import config # Twitter API keys and secrets, stored in config.py

## 2 Gathering Data
There are three data sources:
* Downloaded manually: `twitter_archive_enhanced.csv`
* Downloaded from Udacity's servers: `image_predictions.tsv`
* Queried from the Twitter API with Tweepy: `tweet_json.txt`

>`twitter_archive_enhanced.csv`: This file was downloaded manually and is stored in the same directory as this notebook for easy access.

>`image_predictions.tsv`: This file is obtained with the requests library in section 2.1.

>`tweet_json.txt`: This file is obtained with the Tweepy library in section 2.2.

### 2.1 Extract `image_predictions.tsv` from Udacity's servers 

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
urlData = requests.get(url).content
img_pred = pd.read_csv(io.StringIO(urlData.decode('utf-8')),sep='\t')
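
The download step can be made a little more defensive. The following is a minimal sketch, not the notebook's own code: `parse_tsv_bytes` and `download_tsv` are hypothetical helper names, the 30-second timeout is an assumed value, and the sample bytes are made up so the parser can be tried without network access.

```python
import io

import pandas as pd
import requests


def parse_tsv_bytes(raw):
    """Decode raw UTF-8 bytes and parse them as tab-separated values."""
    return pd.read_csv(io.StringIO(raw.decode("utf-8")), sep="\t")


def download_tsv(url, timeout=30):
    """Fetch a TSV file and fail loudly on HTTP errors (timeout is an assumed value)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return parse_tsv_bytes(response.content)


# Offline check with made-up rows in the same column style as the real file.
sample = b"tweet_id\tp1\tp1_conf\n1\tpug\t0.9\n2\tcollie\t0.5\n"
sample_df = parse_tsv_bytes(sample)
print(sample_df.shape)
```

With network access, `img_pred = download_tsv(url)` would stand in for the two lines that fetch and parse the file.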

In [4]:
img_pred.head()

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |


### 2.2 Extract `tweet_json.txt` using the Twitter API

In [5]:
consumer_key = config.consumer_key
consumer_secret = config.consumer_secret
access_token = config.access_token
access_secret = config.access_secret
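
The notebook does not show `config.py`. One possible layout, purely as an assumption, reads the four values from environment variables so that no secret is saved alongside the notebook. The `TWITTER_*` variable names and the `load_twitter_credentials` helper are hypothetical.

```python
# config.py (hypothetical layout; the original file is not shown)
import os


def load_twitter_credentials(env=os.environ):
    """Return the four credentials, raising KeyError if any is missing."""
    return {
        "consumer_key": env["TWITTER_CONSUMER_KEY"],
        "consumer_secret": env["TWITTER_CONSUMER_SECRET"],
        "access_token": env["TWITTER_ACCESS_TOKEN"],
        "access_secret": env["TWITTER_ACCESS_SECRET"],
    }


# Demo with placeholder values instead of real secrets.
demo_env = {
    "TWITTER_CONSUMER_KEY": "ck",
    "TWITTER_CONSUMER_SECRET": "cs",
    "TWITTER_ACCESS_TOKEN": "at",
    "TWITTER_ACCESS_SECRET": "as",
}
credentials = load_twitter_credentials(demo_env)
print(sorted(credentials))
```

A missing variable raises `KeyError` right away, which is easier to diagnose than an authentication failure in the middle of a long download.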

In [6]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [8]:
# start = timeit.default_timer() # wall-clock timer; timeit.timeit() would benchmark an empty statement instead
# fails_dict = {} # collect ids that could not be fetched (e.g. deleted tweets)
# count = 0 # number of ids processed so far
# with open('tweet_json.txt', 'w') as outfile:
#     for twt_id in img_pred['tweet_id']:
#         try:
#             tweet = api.get_status(twt_id,tweet_mode='extended',wait_on_rate_limit=True,wait_on_rate_limit_notify=True)
#             print('{} record success'.format(count),end="\r")
#             json.dump(tweet._json, outfile) # one JSON object per line
#             outfile.write('\n')
#         except tweepy.TweepError as e:
#             print('Fail',end="\r")
#             fails_dict[twt_id] = e
#         count += 1
# end = timeit.default_timer()

571 record success

Rate limit reached. Sleeping for: 444


1471 record success

Rate limit reached. Sleeping for: 576


2074 record success
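
The download loop can also be written in a form that runs without network access. This is a minimal sketch under stated assumptions: `dump_tweets` is a hypothetical helper, `fetch` is any callable that returns a JSON-serializable dict for a tweet id, `fake_fetch` is a made-up stand-in for the API, and elapsed time is measured with `time.perf_counter()`.

```python
import json
import os
import tempfile
import time


def dump_tweets(tweet_ids, fetch, path):
    """Write one JSON object per line; return (failures, elapsed_seconds)."""
    failures = {}
    start = time.perf_counter()
    with open(path, "w") as outfile:
        for tweet_id in tweet_ids:
            try:
                payload = fetch(tweet_id)
            except Exception as error:  # the notebook catches tweepy.TweepError
                failures[tweet_id] = error
                continue
            json.dump(payload, outfile)
            outfile.write("\n")
    elapsed = time.perf_counter() - start
    return failures, elapsed


def fake_fetch(tweet_id):
    """Stand-in for the API call: id 2 plays the role of a deleted tweet."""
    if tweet_id == 2:
        raise LookupError("No status found with that ID.")
    return {"id": tweet_id}


demo_path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
failures, elapsed = dump_tweets([1, 2, 3], fake_fetch, demo_path)
with open(demo_path) as infile:
    written = [json.loads(line) for line in infile]
print(written, sorted(failures))
```

With Tweepy, `fetch` could be `lambda i: api.get_status(i, tweet_mode='extended')._json`, which mirrors the call used in this notebook.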

In [14]:
print(end - start)

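0.0036373999998744466

> This figure is not the download time. It was recorded with `timeit.timeit()`, which benchmarks an empty statement instead of reading a clock, so the difference between the two calls is meaningless. The two rate-limit pauses logged above (444 s and 576 s) alone add up to 1020 s, so the actual run took more than 17 minutes.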


In [None]:
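# tweet_json.txt holds one JSON object per line, so parse it line by line;
# json.load() on the whole file would fail after the first object.
tweets = []
with open('tweet_json.txt') as json_file:
    for line in json_file:
        tweets.append(json.loads(line))
print('{} tweets loaded'.format(len(tweets)))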
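
Because each tweet was written with `json.dump` followed by a newline, `tweet_json.txt` can be read back one record at a time. The following is a minimal sketch, assuming `id`, `retweet_count` and `favorite_count` are the fields of interest (the notebook does not say which fields it keeps). `read_tweet_records` is a hypothetical helper and the sample values are made up.

```python
import io
import json


def read_tweet_records(lines, fields=("id", "retweet_count", "favorite_count")):
    """Parse newline-delimited tweet JSON and keep only the chosen fields."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        tweet = json.loads(line)
        records.append({field: tweet.get(field) for field in fields})
    return records


# Tiny in-memory stand-in for tweet_json.txt (made-up values).
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}\n'
)
records = read_tweet_records(sample)
print(records)
```

With the real file, passing the open file handle to `read_tweet_records` and wrapping the result in `pd.DataFrame(records)` would give a table that can be joined to the other two sources on the tweet id.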