# Project: Wrangling and Analyze Data

In [1]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import seaborn as sns
import tweepy
%matplotlib inline

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [2]:
archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
archive.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [4]:
img_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
img_data = requests.get(img_url)
open('img_data.tsv','wb').write(img_data.content)
im_data = pd.read_csv('img_data.tsv',sep ='\t')

In [5]:
im_data.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [6]:
import json

In [7]:
x = {"sec_key":"sjscbZ8MqpCy9FcSPhOXDgoVg","secret":"gdEQxtOSTaBkIQdj1SlRDwM9iAvBdZFZqgGzndAkBMd0H08pTr",
"token":"AAAAAAAAAAAAAAAAAAAAAAhTeAEAAAAA23c0LL54qgRDuNM6VaY64RIqBa4%3DA17AWAxSM3qV8zjGwLq32MLFt6IqHhh61mrSO0bLoV3bGuq1Dz"}

In [8]:
cred = json.dumps(x)

In [9]:
with open('twits_cred.json','w') as outfile:
    outfile.write(cred)

In [10]:
cred = pd.read_json('twits_cred.json',lines=True)

In [11]:
cred.keys()

Index(['sec_key', 'secret', 'token'], dtype='object')

In [12]:
cred.sec_key.values[0]

'sjscbZ8MqpCy9FcSPhOXDgoVg'

In [13]:
#passing credentials
consumer_key = cred.sec_key.values[0]
consumer_secret = cred.secret.values[0]

In [14]:
consumer_secret

'gdEQxtOSTaBkIQdj1SlRDwM9iAvBdZFZqgGzndAkBMd0H08pTr'

In [15]:
auth = tweepy.OAuth2BearerHandler(cred.token.values[0])
api = tweepy.API(auth,wait_on_rate_limit=True)

In [16]:
archive.keys()

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [17]:
archive.tweet_id[1].dtype

dtype('int64')

In [18]:
tweet_id = archive.tweet_id

In [19]:
tweet = api.get_status(tweet_id[20])
print(tweet.retweet_count)
print(tweet.favorite_count)

2893
19149


In [20]:
np.sum(tweet_id.isna())

0

In [21]:
tweet_like = pd.DataFrame()

In [22]:
tweet_like['tweet_id'] = archive.tweet_id

In [23]:
tweet_id[19]

888202515573088257

for sid in range(len(tweet_id)):
                tweet=api.get_status(tweet_id[sid])
                retweets= tweet.retweet_count
                likes = tweet.favorite_count
                geo = tweet.geo
                print(sid,'>>>>>',retweets)
                

In [24]:
#df_iter = pd.read_csv("yellow.csv",iterator=True,chunksize=100000)


In [25]:
len(tweet_id)

2356

In [33]:
fails

{}

In [34]:
fails = {}

In [35]:
for sid in list(tweet_id):
        try:
            tweet = api.get_status(sid, tweet_mode='extended')
            
        except tweepy.NotFound as e:
            if getattr(e, 'api_code', None) == 404:
                fails[sid] = e
                continue
        except tweepy.Forbidden as f:
            if getattr(f,'api_code',None)==403:
                print(f.reason)
                break
        
        except StopIteration:
                break

Rate limit reached. Sleeping for: 15


In [31]:
fails

{}

In [26]:
from time import time

In [27]:
a =[]

In [30]:
for sid in list(tweet_id):
    try:
        t_start = time()
        tweet=api.get_status(sid)
        retweets= tweet.retweet_count
        a.append(retweets)
        t_end = time()
    except tweepy.NotFound as e:
        if getattr(e, 'api_code', None) == 404:
            print(e)
            continue
    except tweepy.Forbidden as f:
        if getattr(f,'api_code',None)==403:
            print(f.reason)
            break
        
    except StopIteration:
            break

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

Rate limit reached. Sleeping for: 350


successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

Rate limit reached. Sleeping for: 384


successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

Rate limit reached. Sleeping for: 364


successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

Rate limit reached. Sleeping for: 38


successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the retweet counts
successfuly fetched  all the ret

In [31]:
import json 

In [32]:
w = pd.DataFrame()
w['retweets'] = a

In [33]:
w.to_csv('tweet_data.csv')

In [21]:
file = pd.read_csv('tweet_data.csv')

In [23]:
len(file)

2346

In [37]:
def twit_count(x_id):
    ids = list(x_id)
    rtweet = []
    location = []
    fav = []
    for sid in list(tweet_id:
        try:
                tweet=api.get_status(sid)
                retweets= tweet.retweet_count
                likes = tweet.favorite_count
                rtweet.append(retweets)
        except tweepy.NotFound as e:
            if getattr(e, 'api_code', None) == 404:
                print(e)
                continue
        except tweepy.Forbidden as f:
            if getattr(f,'api_code',None)==403:
                print(f.reason)
                break
        
        except StopIteration:
            break
     
    return  rtweet,location,fav

In [None]:
_twit = twit_count(tweet_id)

In [54]:
tweepy.

SyntaxError: invalid syntax (2163768001.py, line 1)

In [46]:
_twit[2]

[33810,
 29336,
 22065,
 36947,
 35317,
 17813,
 10366,
 56877,
 24528,
 27967,
 27044,
 24566,
 42069,
 23683,
 13341,
 22128,
 25647,
 22481,
 17318]

In [96]:
list(archive.tweet_id)[for i in range(10):
                       i]

SyntaxError: invalid syntax (3485571196.py, line 1)

In [88]:
tweet_like['retweet_count']=[api.get_status(list(archive.tweet_id)[i for i in range(len(archive.tweet_id))]).retweet_count]

SyntaxError: invalid syntax. Perhaps you forgot a comma? (2695652412.py, line 1)

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issue**. You must use **both** visual assessment
programmatic assessement to assess the data.

**Note:** pay attention to the following key points when you access the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### Quality issues
1.

2.

3.

4.

5.

6.

7.

8.

### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define:

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization