# Wrangle and Analyze Data

## Task Overview:

Gather - Collect the various datasets.

Assess - Assess the quality of the datasets and the issues in them.

Clean - Tidy the datasets, remove bad data, and combine them into a single set.

Store - Store the cleaned and combined dataset.

Analyze and Visualize - Analyze the data and support the analysis with good visualizations.

Report in two parts:

* Document the data wrangling efforts
* Document and present the data analyses and visualizations

#### The Data
This project includes three datasets:

* Enhanced Twitter Archive
* Additional Data via the Twitter API
* Image Predictions File


In [43]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import os
import time
import datetime
import random
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.stats import pearsonr
%matplotlib inline
pd.set_option('display.max_colwidth', None)

## Gather the Data

### Enhanced Twitter Archive
This is the Twitter archive that Udacity received from WeRateDogs. When Udacity received it, the archive contained basic tweet data for more than 5,000 tweets. It lacks some fields that every tweet has, such as the retweet and favorite counts; retrieving them is left to the student as an exercise in using an API. One field the archive does include is the text of each tweet. From that text Udacity extracted the dog's name, the rating (as a numerator and a denominator), and the standardized silly dog descriptors used by WeRateDogs, referred to as the "stage": doggo, floofer, pupper, and puppo. This modified archive is provided to the student as the "enhanced" Twitter archive. Tweets without photos or ratings were removed, leaving only 2356 tweets in the archive.
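As a rough illustration of how a stage could be pulled out of tweet text, the sketch below searches for the four stage words with a case-insensitive regular expression. This is not Udacity's actual extraction code: the function name `find_stages` is made up for the example, and the second sample text is a hypothetical tweet.

```python
import re

# the four WeRateDogs stage words named in the description
STAGE_PATTERN = re.compile(r'\b(doggo|floofer|pupper|puppo)\b', flags=re.IGNORECASE)


def find_stages(text):
    """Return the distinct stage words in a tweet, lowercased, in order of appearance."""
    stages = []
    for match in STAGE_PATTERN.finditer(text):
        word = match.group(1).lower()
        if word not in stages:
            stages.append(word)
    return stages


# text taken from the archive: no stage word is present
print(find_stages("This is Phineas. He's a mystical boy. 13/10"))

# hypothetical tweet that mentions two stages
print(find_stages("Here's a Doggo teaching a pupper to swim. 12/10 for both"))
```

Because of the word boundaries, plural forms such as "puppers" are not matched; whether they should count is a judgment call for the cleaning step.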

In [5]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU | | | | https://twitter.com/dog_rates/status/892420643555336193/photo/1 | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV | | | | https://twitter.com/dog_rates/status/892177421306343426/photo/1 | 13 | 10 | Tilly | | | | |
| 2 | 891815181378084864 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB | | | | https://twitter.com/dog_rates/status/891815181378084864/photo/1 | 12 | 10 | Archie | | | | |
| 3 | 891689557279858688 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ | | | | https://twitter.com/dog_rates/status/891689557279858688/photo/1 | 13 | 10 | Darla | | | | |
| 4 | 891327558926688256 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f | | | | https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 | 12 | 10 | Franklin | | | | |


### Image Predictions File
The second data source for the project is a file created by Udacity including additional information about the dogs in the tweets:
> Udacity ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

Columns in that table are:

* tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
* p1 is the algorithm's #1 prediction for the image in the tweet
* p1_conf is how confident the algorithm is in its #1 prediction
* p1_dog is whether or not the #1 prediction is a breed of dog
* p2 is the algorithm's second most likely prediction
* p2_conf is how confident the algorithm is in its #2 prediction
* p2_dog is whether or not the #2 prediction is a breed of dog
* etc.
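One way to use these columns is to take the highest-ranked prediction that is actually a dog breed. The sketch below does this for a single row held in a plain dictionary. The helper name `best_dog_prediction` is hypothetical; the first sample row uses values from the first row of the image predictions file, and the second row is invented to show the fallback to a lower-ranked prediction.

```python
def best_dog_prediction(row):
    """Return (breed, confidence) for the top-ranked prediction flagged as a dog, or None."""
    for rank in (1, 2, 3):
        if row[f'p{rank}_dog']:
            return row[f'p{rank}'], row[f'p{rank}_conf']
    return None


# values from the first row of image-predictions.tsv
real_row = {
    'p1': 'Welsh_springer_spaniel', 'p1_conf': 0.465074, 'p1_dog': True,
    'p2': 'collie', 'p2_conf': 0.156665, 'p2_dog': True,
    'p3': 'Shetland_sheepdog', 'p3_conf': 0.061428, 'p3_dog': True,
}

# invented row: the top prediction is not a dog, so the second one is used
made_up_row = {
    'p1': 'paper_towel', 'p1_conf': 0.6, 'p1_dog': False,
    'p2': 'golden_retriever', 'p2_conf': 0.2, 'p2_dog': True,
    'p3': 'bath_towel', 'p3_conf': 0.1, 'p3_dog': False,
}

print(best_dog_prediction(real_row))
print(best_dog_prediction(made_up_row))
```

Taking the first dog-flagged prediction is only one possible rule; a confidence threshold could be added if low-confidence breeds turn out to be unreliable.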

In [6]:
ip_URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(ip_URL)
response.raise_for_status()  # stop early if the download failed

# the with block closes the file, so no explicit close() is needed
with open(ip_URL.split('/')[-1], mode='wb') as file:
    file.write(response.content)

In [7]:
image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
image_predictions.head()

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |


### Additional Data via the Twitter API

> Gather each tweet's **retweet count** and **favorite ("like") count** at the minimum and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt.

Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) **tweet ID, retweet count, and favorite count.**
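The one-JSON-object-per-line layout described here can be sketched with the standard library alone. The example below writes two small records to an in-memory buffer rather than to tweet_json.txt and then reads them back line by line. Real tweet JSON has many more fields, and the helper names `write_json_lines` and `read_counts` are made up for illustration.

```python
import io
import json


def write_json_lines(records, handle):
    """Write each record as one JSON document on its own line."""
    for record in records:
        handle.write(json.dumps(record) + '\n')


def read_counts(handle):
    """Read the file line by line, keeping only the ID and the two counts."""
    rows = []
    for line in handle:
        data = json.loads(line)
        rows.append({'tweet_id': data['id'],
                     'retweet_count': data['retweet_count'],
                     'favorite_count': data['favorite_count']})
    return rows


# two cut-down sample records standing in for full tweet JSON
sample = [
    {'id': 892420643555336193, 'retweet_count': 8853, 'favorite_count': 39467},
    {'id': 892177421306343426, 'retweet_count': 6514, 'favorite_count': 33819},
]

buffer = io.StringIO()  # stands in for the text file on disk
write_json_lines(sample, buffer)
buffer.seek(0)

rows = read_counts(buffer)
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame` to build the table of counts.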

**Note: do not include your Twitter API keys, secrets, and tokens in your project submission.**

In [10]:
# build the Twitter access object
# credentials are read from environment variables so they never appear in the notebook;
# the variable names below are placeholders, use whatever names you export locally
api_key = os.environ['TWITTER_API_KEY']
api_key_secret = os.environ['TWITTER_API_KEY_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

authenticator = tweepy.OAuthHandler(api_key, api_key_secret)
authenticator.set_access_token(access_token, access_token_secret)
api = tweepy.API(authenticator, wait_on_rate_limit=True)

In [30]:
# keep track of tweets provided by Udacity that have been deleted
failed_tweet = []
with open('tweet_json.txt', mode='w', encoding='utf-8') as file:
    for tweet_id in twitter_archive['tweet_id']:
#        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            tweet_json = json.dumps(tweet._json)
            # write tweet JSON data line by line
            file.write(tweet_json + '\n')
#        except tweepy.TweepError as err:
#            failed_tweet.append(tweet_id)
#            if len(failed_tweet) > 10:
#                break

Forbidden: 403 Forbidden
453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product

In [23]:
len(failed_tweet)

11

This API is failing on every request. I'll use the alternative data source.
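For reference, a loop that records failures and gives up after too many of them could look like the sketch below. It deliberately never calls Twitter: `fetch` is a stand-in for a call such as `api.get_status`, and `always_forbidden` is a fake that mimics an API rejecting every request. With Tweepy 4 the exception to catch would be `tweepy.TweepyException`; the generic `Exception` used here is only a placeholder so the sketch runs without Tweepy.

```python
def gather(tweet_ids, fetch, max_failures=10):
    """Fetch each tweet, stopping once more than max_failures requests have failed."""
    results, failed = [], []
    for tweet_id in tweet_ids:
        try:
            results.append(fetch(tweet_id))
        except Exception:  # placeholder for the library's own exception class
            failed.append(tweet_id)
            if len(failed) > max_failures:
                break
    return results, failed


def always_forbidden(tweet_id):
    """Fake fetch function that rejects every request, like the 403 seen here."""
    raise RuntimeError('403 Forbidden')


results, failed = gather(range(100), always_forbidden)
print(len(results), len(failed))
```

Stopping after a bounded number of failures avoids working through the whole archive when the problem is the access level rather than individual deleted tweets.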

In [37]:
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

api = tweepy.API(authenticator, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
#            tweet = api.get_status(tweet_id, tweet_mode='compat')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

1: 892420643555336193


AttributeError: module 'tweepy' has no attribute 'TweepError'

Even the sample code fails. The `AttributeError` is a side effect: Tweepy 4 replaced `TweepError` with `TweepyException`, so the `except` clause itself breaks while handling the real failure. The underlying API error is the same as before:

> Forbidden: 403 Forbidden
> 453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product

I'll load the downloaded file.

In [44]:
tweet_json_data = []

with open('tweet-json.txt', 'r', encoding='utf-8') as json_file:
    # each line holds the complete JSON for one tweet
    for line in json_file:
        data = json.loads(line)

        # keep only the fields needed for the analysis
        tweet_json_data.append({'tweet_id': data['id'],
                                'retweet_count': data['retweet_count'],
                                'favorite_count': data['favorite_count']})

# convert the list of dictionaries to a DataFrame
tweet_data_extra = pd.DataFrame(tweet_json_data,
                                columns=['tweet_id',
                                         'retweet_count',
                                         'favorite_count'])

tweet_data_extra.head(30)

In [40]:
# display the frame itself to see the first and last rows along with its shape
tweet_data_extra

                tweet_id  retweet_count  favorite_count
0     892420643555336193           8853           39467
1     892177421306343426           6514           33819
2     891815181378084864           4328           25461
3     891689557279858688           8964           42908
4     891327558926688256           9774           41048
...                  ...            ...             ...
2349  666049248165822465             41             111
2350  666044226329800704            147             311
2351  666033412701032449             47             128
2352  666029285002620928             48             132
2353  666020888022790149            532            2535

[2354 rows x 3 columns]

## Assess the Data

### Twitter Archive

In [81]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [103]:
# find all tweets whose text mentions "doggo", disregarding case
doggo_mask = twitter_archive['text'].str.contains('doggo', case=False)
twitter_archive.loc[doggo_mask, 'text']

In [96]:
# allow for digits with decimals in the regex
rating_regex = r'(\d+\.?\d*)/(\d+\.?\d*)'

# str.extract returns the captured groups as strings, so the columns can only be
# converted to float after the extraction, not before
twitter_archive[['rating_numerator', 'rating_denominator']] = twitter_archive['text'].str.extract(rating_regex)
twitter_archive['rating_numerator'] = twitter_archive['rating_numerator'].astype(float)
twitter_archive['rating_denominator'] = twitter_archive['rating_denominator'].astype(float)

twitter_archive[['rating_numerator', 'rating_denominator']].describe()

| | rating_numerator | rating_denominator |
|---|---|---|
| count | 2356.0 | 2356.0 |
| mean | 13.06368 | 10.455433 |
| std | 45.839085 | 6.745237 |
| min | 0.0 | 0.0 |
| 25% | 10.0 | 10.0 |
| 50% | 11.0 | 10.0 |
| 75% | 12.0 | 10.0 |
| max | 1776.0 | 170.0 |
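To see what the decimal-aware pattern captures, the sketch below applies the same regular expression to a few strings with Python's `re` module. The sample texts are invented for illustration and the helper name `extract_rating` is hypothetical; only the pattern comes from the cell above.

```python
import re

rating_regex = r'(\d+\.?\d*)/(\d+\.?\d*)'


def extract_rating(text):
    """Return the first numerator/denominator pair in the text as floats, or None."""
    match = re.search(rating_regex, text)
    if match is None:
        return None
    # the groups are strings, so they are converted only after matching
    return float(match.group(1)), float(match.group(2))


# invented sample texts
print(extract_rating("This is Bella. 13.5/10 would pet"))
print(extract_rating("Happy 4th of July! 1776/10"))
print(extract_rating("Open 24/7 for pats. 11/10 good boy"))
print(extract_rating("No rating here"))
```

The third sample shows a limitation worth checking during assessment: only the first fraction in the text is captured, so a phrase like "24/7" ahead of the real rating would be read as the rating. That is one possible source of unusual denominators in the summary statistics.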
