# Data analysis of Twitter API data

## Table of Contents
- [Introduction](#intro)
- [Part I - Gathering Data](#gathering)
- [Part II - Assessing Data](#assessing)
- [Part III - Cleaning Data](#cleaning)
- [Part IV - Analyzing Data](#analysis)

In [3]:
import os
import glob
import pandas as pd
import numpy as np
import requests
import tweepy
import json
from timeit import default_timer as timer

%matplotlib inline

<a id='intro'></a>
# Introduction
The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user *@dog_rates*, also known as *WeRateDogs*. *WeRateDogs* is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." *WeRateDogs* has over 4 million followers and has received international media coverage.

## Project Motivation
### Context
Goal: wrangle *WeRateDogs* Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.
### Data
**Enhanced Twitter Archive**

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).
I extracted this data programmatically, but I didn't do a very good job. The ratings probably aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. You'll need to assess and clean these columns if you want to use them for analysis and visualization.

## Key points
Key points to keep in mind when data wrangling for this project:

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* Cleaning includes merging individual pieces of data according to the rules of tidy data.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

<a id='gathering'></a>
# Part I - Gathering Data
The relevant data is retrieved by getting each of the three pieces of data as described below:

1. The *WeRateDogs* Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: [twitter-archive-enhanced.csv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)

2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file `image_predictions.tsv` is hosted on Udacity's servers.

3. Additionally, each tweet's retweet count and favorite ("like") count, at minimum, are gathered. Using the tweet IDs in the *WeRateDogs* Twitter archive, we query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called `tweet_json.txt`. Each tweet's JSON data is written to its own line. This .txt file is then read line by line into a pandas DataFrame.

In [2]:
#The WeRateDogs Twitter archive file
df_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_archive.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


In [None]:
df_archive.info()

In [58]:
# The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet
# according to a neural network. The file is downloaded from Udacity's servers.
URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
file_name = URL.split('/')[-1]

In [None]:
r = requests.get(URL)
if r.ok:    
    with open(file_name, mode='wb') as file:
        file.write(r.content)

In [59]:
# Read flat file
df_predictions = pd.read_csv(file_name, sep='\t')
df_predictions.head(2)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True


In [None]:
df_predictions.info()

In [None]:
# Dataframe shape
df_predictions.shape, df_archive.shape

In [None]:
# Check for duplicate tweets
df_predictions.tweet_id.duplicated().sum(), df_archive.tweet_id.duplicated().sum()

## Connecting to Twitter API
In this step, with the help of the [Tweepy](http://www.tweepy.org/query) Python library, we will query Twitter's API for data beyond what is already included in the WeRateDogs Twitter archive file. This additional data includes the retweet count and favorite count.

[Tweepy API Documentation](http://docs.tweepy.org/en/v3.2.0/api.html#API)

In [5]:
# Set up the connection to Twitter API (requires Twitter account)
consumer_key = os.getenv('TW_CONSUMER_KEY')
consumer_secret = os.getenv('TW_CONSUMER_SECRET')

access_token = os.getenv('TW_ACCESS_TOKEN')
access_secret = os.getenv('TW_ACCESS_SECRET')

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Note the handling of Twitter rate limit may extend the tweet query time
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [11]:
#api.rate_limit_status()

### Extract tweet object data to a text file

In [8]:
# Tweet IDs for which to gather additional data
# Only the first 5 IDs are used here; remove the [:5] slice to query the whole archive
tweet_ids = df_archive.tweet_id.values[:5]
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()

# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as file:
    # For the full archive this loop will likely take 20-30 minutes because of Twitter's rate limit
    # Rate limits are divided into 15-minute intervals
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, file)
            file.write('\n')
        except tweepy.TweepError as e:
            # Record the failure and move on to the next tweet ID
            print(f'\nTweet id - {tweet_id} - does not exist anymore.\n')
            fails_dict[tweet_id] = e
end = timer()
print(f'Total execution time: {end - start}\n')
print(fails_dict)

1: 892420643555336193
Success
2: 892177421306343426
Success
3: 891815181378084864
Success
4: 891689557279858688
Success
5: 891327558926688256
Success
1.5414323999998487
{}


### Store Retweets and Favorite data

In [93]:
dict_list = []

# read .txt file as JSON file
with open('tweet_json.txt', 'r') as file:    
    for line in file:       
        # Convert to Python dictionary
        data = json.loads(line)
        # populate tweet dictionary
        dict_list.append({'tweet_id': data['id'],
                        'retweet_count': data['retweet_count'],
                        'favorite_count': data['favorite_count']
                        })
# Create a DataFrame with the new parameters
df_new = pd.DataFrame(dict_list, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
df_new

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7732,36334
1,892177421306343426,5725,31313
2,891815181378084864,3787,23588
3,891689557279858688,7896,39643
4,891327558926688256,8516,37843


In [None]:
#api.get_user('dog_rates')

<a id='assessing'></a>
# Part II - Assessing Data

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your `wrangle_act.ipynb` Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key points section above) must be assessed.
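
A few programmatic checks of the kind this step calls for might look like the sketch below. It runs on a small made-up sample rather than the real archive: the column names follow `df_archive`, while the sample rows and the lowercase-name heuristic are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample rows that mimic df_archive's columns (not real data)
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3, 3],
    'retweeted_status_id': [None, 99.0, None, None],
    'rating_numerator': [13, 12, 24, 24],
    'rating_denominator': [10, 10, 7, 7],
    'name': ['Phineas', 'Tilly', 'a', 'a'],
})

# Quality check: ratings whose denominator is not the usual 10
odd_denominators = sample[sample['rating_denominator'] != 10]

# Quality check: retweets, identified by a non-null retweeted_status_id
n_retweets = sample['retweeted_status_id'].notna().sum()

# Quality check: duplicated tweet IDs
n_duplicates = sample['tweet_id'].duplicated().sum()

# Assumed heuristic: an all-lowercase "name" is probably an ordinary word,
# not a real dog name
suspect_names = sample[sample['name'].str.islower()]

print(len(odd_denominators), n_retweets, n_duplicates, len(suspect_names))
```

Each check only flags candidate rows; whether a flagged row is a genuine issue still needs a visual look at the tweet text.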

<a id='cleaning'></a>
# Part III - Cleaning Data

Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.
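
As one example of a cleaning step, the key points ask for original ratings (no retweets) that have images, merged into one table. The following is a minimal sketch, assuming that a row in the image predictions table is what marks a tweet as having an image. The three small frames are hypothetical stand-ins for `df_archive`, `df_predictions`, and `df_new`.

```python
import pandas as pd

# Hypothetical stand-ins for df_archive, df_predictions and df_new (not real data)
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [None, 99.0, None],
    'rating_numerator': [13, 12, 11],
})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['redbone', 'collie']})
counts = pd.DataFrame({'tweet_id': [1, 2, 3],
                       'retweet_count': [10, 20, 30],
                       'favorite_count': [100, 200, 300]})

# Work on a copy so the raw data stays available for comparison
clean = archive.copy()

# Keep original tweets only: retweets carry a non-null retweeted_status_id
clean = clean[clean['retweeted_status_id'].isna()]
clean = clean.drop(columns='retweeted_status_id')

# Inner joins keep only tweets that have both an image prediction and API counts
master = (clean.merge(predictions, on='tweet_id', how='inner')
               .merge(counts, on='tweet_id', how='inner'))
print(master)
```

An inner join is used here so that tweets missing from either table are dropped; a left join would keep them with empty values instead.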

<a id='analysis'></a>
# Part IV - Analyzing Data
Storing, Analyzing, and Visualizing Data

Reporting of the project:
  1. A summary of the data wrangling efforts is reported in `wrangle_report.html`.
  2. A summary of the data analyses and visualizations is reported in `act_report.html`.

## Storing Data

Store the clean DataFrame(s) in a CSV file with the main one named `twitter_archive_master.csv`. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
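
Storing could be done along these lines. This is a sketch on a hypothetical two-row frame; the SQLite file name `twitter_archive_master.db` and the table name `master` are assumptions, not project requirements, and a temporary directory is used so the example leaves nothing on disk.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for the real master DataFrame
master = pd.DataFrame({'tweet_id': [1, 2], 'favorite_count': [100, 200]})

# The temporary directory and its files are removed when the block exits
with tempfile.TemporaryDirectory() as out_dir:
    csv_path = os.path.join(out_dir, 'twitter_archive_master.csv')
    db_path = os.path.join(out_dir, 'twitter_archive_master.db')  # assumed name

    # index=False keeps the DataFrame index out of the CSV
    master.to_csv(csv_path, index=False)

    # Optional: store the same table in SQLite ("master" is an assumed table name)
    conn = sqlite3.connect(db_path)
    master.to_sql('master', conn, index=False, if_exists='replace')
    conn.close()

    # Read both files back to confirm the round trip
    from_csv = pd.read_csv(csv_path)
    conn = sqlite3.connect(db_path)
    from_db = pd.read_sql('SELECT * FROM master', conn)
    conn.close()

print(from_csv.equals(from_db))
```

In the real notebook the files would be written next to the notebook rather than to a temporary directory.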

## Visualisations