## Project: Wrangle and Analyze Data (WeRateDogs)

This project is based on tackling with data wrangling efforts.

## Table of Contents
- [Introduction](#intro)
- [Part I - Gathering Data](#gathering)
- [Part II - Assessing Data](#assessing)
- [Part III - Cleaning Data](#cleaning)
- [Part IV - Analyzing and visualizing Wrangled data](#analyzing_visualizing)

<a id='intro'></a>
### Introduction

Wrangling and analyzing data is a demanding process for data analyst and its work. In this project, I will use the data wrangling skills to dather data from Twitter,clean them, and do some analysis. Then,I will get the original Twitter data from Twitter user @dog_rates, along with a image prediction dataset.I will document my wrangling efforts in a Jupyter Notebook, and showcase them through analyses and visualizations using Python (and its libraries).

As WeRateDogs is a popular Twitter hash tag,people rate dogs with a denominator of 10 and the numerator is usually higher than 10 to show how lovely the dog is. Furthermore, WeRateDogs has over 4 million followers and has received international media coverage.


<a id='gathering'></a>
### Part I - Gathering Data

Data will be gathered from three resources:

1. The WeRateDogs Twitter archive. The twitter_archive_enhanced.csv file was provided to me.

2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file was provided to me.

3. Twitter API and Python's Tweepy library to gather each tweet's retweet count and favorite ("like") count at minimum, and any additional data I find interesting.

In [22]:
# Because tweepy module cannot be encompassed, it should be install as Anaconda Command.
# !pip install tweepy

In [27]:
# import libraries I use to build my project
import numpy as np
import pandas as pd
import requests
import json
import datetime
import tweepy
import sys
import time
import matplotlib.pyplot as plt
%matplotlib inline

1. Gathering the WeRateDogs Twitter archive Whose filename is twitter_archive_enhanced.csv

In [32]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

2. Downloading the image predictions file from the Udacity server and writing all content to image_predictions.tsv file

In [25]:
url="https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
with open('image_predictions.tsv', 'wb') as file:
    file.write(response.content)
    
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')

3. Querying Twitter's API for JSON data - match to each Tweet ID from the archive

In [41]:
consumer_key = "UnbSXP1KKBRB1kkx5Ry4Yfezi" # "MY_CUSTOMER_KEY"
consumer_secret = "k9JoR24jMc09Ikl9IPurim8tCsKQNSvSvsw3SgfhO14h3vQwZ9" # "MY_CONSUMER_SECRET"
access_token = "411318653-hFPRaoKqZ0IlCN1nGI4KED7oalEVbak1wKKflBBW" # "ACCESS_TOKEN"
access_secret = "zhyKTfCcFHIH4Gh3C2d2n1svZe6zf9OsXeSQJ704a5lD2" # "ACCESS_SECRET"


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

if(api):
    print("Connected Twitter via API")


# As, it has a long time process fo loop part, I would like to calculate how much time is it done.
start = time.time()

errors = []  
tweet_ids = twitter_archive["tweet_id"] 
with open('tweet_json.txt', 'w') as tweet_json_file:
    for tweet_id in tweet_ids:
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended',wait_on_rate_limit = True,
                               wait_on_rate_limit_notify = True)
            json.dump(tweet._json, tweet_json_file)
            tweet_json_file.write('\n')
            print("Tweet ID : {}".format(tweet_id))
        except tweepy.TweepError as e:
            errors.append(tweet_id)
            print("Error : {}".format(e))
            pass

        
# this stops the timer            
end = time.time()

minutes_time = int( (end - start) / 60)
seconds_time = int( (end - start) % 60)
print("Time the process completed : {} minutes {} seconds ".format(minutes_time, seconds_time))

Connected Twitter via API
Tweet ID : 892420643555336193
Tweet ID : 892177421306343426
Tweet ID : 891815181378084864
Tweet ID : 891689557279858688
Tweet ID : 891327558926688256
Tweet ID : 891087950875897856
Tweet ID : 890971913173991426
Tweet ID : 890729181411237888
Tweet ID : 890609185150312448
Tweet ID : 890240255349198849
Tweet ID : 890006608113172480
Tweet ID : 889880896479866881
Tweet ID : 889665388333682689
Tweet ID : 889638837579907072
Tweet ID : 889531135344209921
Tweet ID : 889278841981685760
Tweet ID : 888917238123831296
Tweet ID : 888804989199671297
Tweet ID : 888554962724278272
Error : [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID : 888078434458587136
Tweet ID : 887705289381826560
Tweet ID : 887517139158093824
Tweet ID : 887473957103951883
Tweet ID : 887343217045368832
Tweet ID : 887101392804085760
Tweet ID : 886983233522544640
Tweet ID : 886736880519319552
Tweet ID : 886680336477933568
Tweet ID : 886366144734445568
Tweet ID : 886267009285017600
Tweet 

Tweet ID : 843235543001513987
Error : [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID : 842846295480000512
Tweet ID : 842765311967449089
Tweet ID : 842535590457499648
Tweet ID : 842163532590374912
Tweet ID : 842115215311396866
Tweet ID : 841833993020538882
Tweet ID : 841680585030541313
Tweet ID : 841439858740625411
Tweet ID : 841320156043304961
Tweet ID : 841314665196081154
Tweet ID : 841077006473256960
Tweet ID : 840761248237133825
Tweet ID : 840728873075638272
Tweet ID : 840698636975636481
Tweet ID : 840696689258311684
Tweet ID : 840632337062862849
Tweet ID : 840370681858686976
Tweet ID : 840268004936019968
Tweet ID : 839990271299457024
Tweet ID : 839549326359670784
Tweet ID : 839290600511926273
Tweet ID : 839239871831150596
Tweet ID : 838952994649550848
Tweet ID : 838921590096166913
Tweet ID : 838916489579200512
Tweet ID : 838831947270979586
Tweet ID : 838561493054533637
Tweet ID : 838476387338051585
Tweet ID : 838201503651401729
Tweet ID : 838150277551247360
Tw

Tweet ID : 809084759137812480
Tweet ID : 808838249661788160
Tweet ID : 808733504066486276
Tweet ID : 808501579447930884
Tweet ID : 808344865868283904
Tweet ID : 808134635716833280
Tweet ID : 808106460588765185
Tweet ID : 808001312164028416
Tweet ID : 807621403335917568
Tweet ID : 807106840509214720
Tweet ID : 807059379405148160
Tweet ID : 807010152071229440
Tweet ID : 806629075125202948
Tweet ID : 806620845233815552
Tweet ID : 806576416489959424
Tweet ID : 806542213899489280
Tweet ID : 806242860592926720
Tweet ID : 806219024703037440
Tweet ID : 805958939288408065
Tweet ID : 805932879469572096
Tweet ID : 805826884734976000
Tweet ID : 805823200554876929
Tweet ID : 805520635690676224
Tweet ID : 805487436403003392
Tweet ID : 805207613751304193
Tweet ID : 804738756058218496
Tweet ID : 804475857670639616
Tweet ID : 804413760345620481
Tweet ID : 804026241225523202
Tweet ID : 803773340896923648
Tweet ID : 803692223237865472
Tweet ID : 803638050916102144
Tweet ID : 803380650405482500
Tweet ID :

Rate limit reached. Sleeping for: 414


Tweet ID : 786729988674449408
Tweet ID : 786709082849828864
Tweet ID : 786664955043049472
Tweet ID : 786595970293370880
Tweet ID : 786363235746385920
Tweet ID : 786286427768250368
Tweet ID : 786233965241827333
Tweet ID : 786051337297522688
Tweet ID : 786036967502913536
Tweet ID : 785927819176054784
Tweet ID : 785872687017132033
Tweet ID : 785639753186217984
Tweet ID : 785533386513321988
Tweet ID : 785515384317313025
Tweet ID : 785264754247995392
Tweet ID : 785170936622350336
Tweet ID : 784826020293709826
Tweet ID : 784517518371221505
Tweet ID : 784431430411685888
Tweet ID : 784183165795655680
Tweet ID : 784057939640352768
Tweet ID : 783839966405230592
Tweet ID : 783821107061198850
Tweet ID : 783695101801398276
Tweet ID : 783466772167098368
Tweet ID : 783391753726550016
Tweet ID : 783347506784731136
Tweet ID : 783334639985389568
Tweet ID : 783085703974514689
Tweet ID : 782969140009107456
Tweet ID : 782747134529531904
Tweet ID : 782722598790725632
Tweet ID : 782598640137187329
Tweet ID :

Tweet ID : 751251247299190784
Tweet ID : 751205363882532864
Tweet ID : 751132876104687617
Tweet ID : 750868782890057730
Tweet ID : 750719632563142656
Tweet ID : 750506206503038976
Tweet ID : 750429297815552001
Tweet ID : 750383411068534784
Tweet ID : 750381685133418496
Tweet ID : 750147208377409536
Tweet ID : 750132105863102464
Tweet ID : 750117059602808832
Tweet ID : 750101899009982464
Tweet ID : 750086836815486976
Tweet ID : 750071704093859840
Tweet ID : 750056684286914561
Tweet ID : 750041628174217216
Tweet ID : 750026558547456000
Tweet ID : 750011400160841729
Tweet ID : 749996283729883136
Tweet ID : 749981277374128128
Tweet ID : 749774190421639168
Tweet ID : 749417653287129088
Tweet ID : 749403093750648834
Tweet ID : 749395845976588288
Tweet ID : 749317047558017024
Tweet ID : 749075273010798592
Tweet ID : 749064354620928000
Tweet ID : 749036806121881602
Tweet ID : 748977405889503236
Tweet ID : 748932637671223296
Tweet ID : 748705597323898880
Tweet ID : 748699167502000129
Tweet ID :

Tweet ID : 712809025985978368
Tweet ID : 712717840512598017
Tweet ID : 712668654853337088
Tweet ID : 712438159032893441
Tweet ID : 712309440758808576
Tweet ID : 712097430750289920
Tweet ID : 712092745624633345
Tweet ID : 712085617388212225
Tweet ID : 712065007010385924
Tweet ID : 711998809858043904
Tweet ID : 711968124745228288
Tweet ID : 711743778164514816
Tweet ID : 711732680602345472
Tweet ID : 711694788429553666
Tweet ID : 711652651650457602
Tweet ID : 711363825979756544
Tweet ID : 711306686208872448
Tweet ID : 711008018775851008
Tweet ID : 710997087345876993
Tweet ID : 710844581445812225
Tweet ID : 710833117892898816
Tweet ID : 710658690886586372
Tweet ID : 710609963652087808
Tweet ID : 710588934686908417
Tweet ID : 710296729921429505
Tweet ID : 710283270106132480
Tweet ID : 710272297844797440
Tweet ID : 710269109699739648
Tweet ID : 710153181850935296
Tweet ID : 710140971284037632
Tweet ID : 710117014656950272
Tweet ID : 709918798883774466
Tweet ID : 709901256215666688
Tweet ID :

Tweet ID : 691675652215414786
Tweet ID : 691483041324204033
Tweet ID : 691459709405118465
Tweet ID : 691444869282295808
Tweet ID : 691416866452082688
Tweet ID : 691321916024623104
Tweet ID : 691096613310316544
Tweet ID : 691090071332753408
Tweet ID : 690989312272396288
Tweet ID : 690959652130045952
Tweet ID : 690938899477221376
Tweet ID : 690932576555528194
Tweet ID : 690735892932222976
Tweet ID : 690728923253055490
Tweet ID : 690690673629138944
Tweet ID : 690649993829576704
Tweet ID : 690607260360429569
Tweet ID : 690597161306841088
Tweet ID : 690400367696297985
Tweet ID : 690374419777196032
Tweet ID : 690360449368465409
Tweet ID : 690348396616552449
Tweet ID : 690248561355657216
Tweet ID : 690021994562220032
Tweet ID : 690015576308211712
Tweet ID : 690005060500217858
Tweet ID : 689999384604450816
Tweet ID : 689993469801164801
Tweet ID : 689977555533848577
Tweet ID : 689905486972461056
Tweet ID : 689877686181715968
Tweet ID : 689835978131935233
Tweet ID : 689661964914655233
Tweet ID :

Rate limit reached. Sleeping for: 509


Tweet ID : 686377065986265092
Tweet ID : 686358356425093120
Tweet ID : 686286779679375361
Tweet ID : 686050296934563840
Tweet ID : 686035780142297088
Tweet ID : 686034024800862208
Tweet ID : 686007916130873345
Tweet ID : 686003207160610816
Tweet ID : 685973236358713344
Tweet ID : 685943807276412928
Tweet ID : 685906723014619143
Tweet ID : 685681090388975616
Tweet ID : 685667379192414208
Tweet ID : 685663452032069632
Tweet ID : 685641971164143616
Tweet ID : 685547936038666240
Tweet ID : 685532292383666176
Tweet ID : 685325112850124800
Tweet ID : 685321586178670592
Tweet ID : 685315239903100929
Tweet ID : 685307451701334016
Tweet ID : 685268753634967552
Tweet ID : 685198997565345792
Tweet ID : 685169283572338688
Tweet ID : 684969860808454144
Tweet ID : 684959798585110529
Tweet ID : 684940049151070208
Tweet ID : 684926975086034944
Tweet ID : 684914660081053696
Tweet ID : 684902183876321280
Tweet ID : 684880619965411328
Tweet ID : 684830982659280897
Tweet ID : 684800227459624960
Tweet ID :

Tweet ID : 675349384339542016
Tweet ID : 675334060156301312
Tweet ID : 675166823650848770
Tweet ID : 675153376133427200
Tweet ID : 675149409102012420
Tweet ID : 675147105808306176
Tweet ID : 675146535592706048
Tweet ID : 675145476954566656
Tweet ID : 675135153782571009
Tweet ID : 675113801096802304
Tweet ID : 675111688094527488
Tweet ID : 675109292475830276
Tweet ID : 675047298674663426
Tweet ID : 675015141583413248
Tweet ID : 675006312288268288
Tweet ID : 675003128568291329
Tweet ID : 674999807681908736
Tweet ID : 674805413498527744
Tweet ID : 674800520222154752
Tweet ID : 674793399141146624
Tweet ID : 674790488185167872
Tweet ID : 674788554665512960
Tweet ID : 674781762103414784
Tweet ID : 674774481756377088
Tweet ID : 674767892831932416
Tweet ID : 674764817387900928
Tweet ID : 674754018082705410
Tweet ID : 674752233200820224
Tweet ID : 674743008475090944
Tweet ID : 674742531037511680
Tweet ID : 674739953134403584
Tweet ID : 674737130913071104
Tweet ID : 674690135443775488
Tweet ID :

Tweet ID : 670003130994700288
Tweet ID : 669993076832759809
Tweet ID : 669972011175813120
Tweet ID : 669970042633789440
Tweet ID : 669942763794931712
Tweet ID : 669926384437997569
Tweet ID : 669923323644657664
Tweet ID : 669753178989142016
Tweet ID : 669749430875258880
Tweet ID : 669684865554620416
Tweet ID : 669683899023405056
Tweet ID : 669682095984410625
Tweet ID : 669680153564442624
Tweet ID : 669661792646373376
Tweet ID : 669625907762618368
Tweet ID : 669603084620980224
Tweet ID : 669597912108789760
Tweet ID : 669583744538451968
Tweet ID : 669573570759163904
Tweet ID : 669571471778410496
Tweet ID : 669567591774625800
Tweet ID : 669564461267722241
Tweet ID : 669393256313184256
Tweet ID : 669375718304980992
Tweet ID : 669371483794317312
Tweet ID : 669367896104181761
Tweet ID : 669363888236994561
Tweet ID : 669359674819481600
Tweet ID : 669354382627049472
Tweet ID : 669353438988365824
Tweet ID : 669351434509529089
Tweet ID : 669328503091937280
Tweet ID : 669327207240699904
Tweet ID :

In [42]:
print(errors)

[888202515573088257, 873697596434513921, 872668790621863937, 872261713294495745, 869988702071779329, 866816280283807744, 861769973181624320, 856602993587888130, 851953902622658560, 845459076796616705, 844704788403113984, 842892208864923648, 837366284874571778, 837012587749474308, 829374341691346946, 827228250799742977, 812747805718642688, 802247111496568832, 779123168116150273, 775096608509886464, 771004394259247104, 770743923962707968, 759566828574212096, 754011816964026368, 680055455951884288]


4. Using the Python Tweepy library , Gathering retweet count and like of tweet count thanks to the usage of the Twitter API

Reference Link :https://stackoverflow.com/questions/47925828/how-to-create-a-pandas-dataframe-using-tweepy

In [46]:
extracted_tweet_data = [] 
with open('tweet_json.txt') as tweet_json_file:  # tweet-json.txt
    for line in tweet_json_file: 
        json_data = json.loads(line)
        tweet_id = json_data['id']
        retweet_count = json_data['retweet_count']
        favorite_count = json_data['favorite_count']
        extracted_tweet_data.append(
                                        {
                                         'tweet_id': tweet_id,
                                         'retweet_count': int(retweet_count),
                                         'favorite_count': int(favorite_count),
                                        }
                                   )
        

extracted_data = pd.DataFrame(extracted_tweet_data, columns = ['tweet_id', 'retweet_count', 'favorite_count']) 

<a id='assessing'></a>
### Part II - Assessing Data


<a id='cleaning'></a>
### Part III - Cleaning Data


<a id='analyzing_visualizing'></a>
### Part IV - Analyzing and visualizing Wrangled data
