# Project: Wrangle and Analyze Data

# Gathering

We gather data from three sources, each in a different format:
- The WeRateDogs Twitter archive (csv)
- The tweet image predictions (tsv)
- Each tweet's retweet and favorite counts from the Twitter API (txt)


In [2]:
# Importing our needed libraries
import pandas as pd
import numpy as np
import requests

## 1. We Rate Dogs Archive
Udacity provides this file, named `twitter-archive-enhanced.csv`.
I downloaded it from the **'project description'**.

In [3]:
# loading in the file
df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')


## 2. Tweet Image Predictions
Udacity also provides this file, but we must download it from a given URL using the `Requests` library.

In [4]:
# downloading files using requests
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)

open('image-predictions.tsv','wb').write(r.content)


335079

In [5]:
# load the tsv into a dataframe (making sure to separate by tab)
df_predicted_images = pd.read_csv('image-predictions.tsv', sep = '\t')

## 3. Querying the Twitter API
* Grabbing every tweet's **Retweet** and **Favorite** count using `Tweepy`

    Steps:
    
    * Use tweet IDs from the archive
    * Query the Twitter API for each tweet's JSON
    * Store the data in a file called `tweet_json.txt`
    * Each JSON should be written to its own line
    * Read the .txt file line by line into a pandas DataFrame, storing:
        * Tweet ID
        * Retweet Count
        * Favorite Count
        
*Note: I ran into trouble with phone verification, so I used the file provided by Udacity. I will double-check the verification in the future.*
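
The last step in the list above can be sketched as follows. This is a minimal sketch, not this project's actual loading code: the helper name `read_tweet_counts` is hypothetical, and it assumes each line of the file holds one tweet's JSON with the `id`, `retweet_count` and `favorite_count` keys.

```python
import json

def read_tweet_counts(path):
    """Return one dict per line of a JSON-lines file (hypothetical helper)."""
    records = []
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            tweet = json.loads(line)
            # keep only the three fields named in the steps above
            records.append({
                'tweet_id': tweet['id'],
                'retweet_count': tweet['retweet_count'],
                'favorite_count': tweet['favorite_count'],
            })
    return records
```

Passing the result to pandas, for example `pd.DataFrame(read_tweet_counts('tweet-json.txt'))`, would give a three-column DataFrame.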

In [6]:
# # BIG NOTE: Oddly, this code also prints "Fail" instead of returning the real values, so I have uploaded the tweet-json.txt file
# # which was included next to the code.

# import tweepy
# from tweepy import OAuthHandler
# import json
# from timeit import default_timer as timer

# # Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# # These are hidden to comply with Twitter's API terms and conditions
# consumer_key = 'HIDDEN'
# consumer_secret = 'HIDDEN'
# access_token = 'HIDDEN'
# access_secret = 'HIDDEN'

# auth = OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)

# api = tweepy.API(auth, wait_on_rate_limit=True)

# # NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# # df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# # change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# # NOTE TO REVIEWER: this student had mobile verification issues so the following
# # Twitter API code was sent to this student from a Udacity instructor
# # Tweet IDs for which to gather additional data via Twitter's API
# tweet_ids = df_twitter_archive.tweet_id.values
# len(tweet_ids)

# # Query Twitter's API for JSON data for each tweet ID in the Twitter archive
# count = 0
# fails_dict = {}
# start = timer()
# # Save each tweet's returned JSON as a new line in a .txt file
# with open('tweet_json.txt', 'w') as outfile:
#     # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
#     for tweet_id in tweet_ids:
#         count += 1
#         print(str(count) + ": " + str(tweet_id))
#         try:
#             tweet = api.get_status(tweet_id, tweet_mode='extended')
#             print("Success")
#             json.dump(tweet._json, outfile)
#             outfile.write('\n')
#         except tweepy.TweepError as e:
#             print("Fail")
#             fails_dict[tweet_id] = e
#             pass
# end = timer()
# print(end - start)
# print(fails_dict)

In [7]:
# link for this useful code: https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas

df_tweets = pd.read_json('tweet-json.txt',lines=True)

# Assessing

We will assess each dataset gathered above, in the same order:
1. We Rate Dogs Archive
2. Tweet Image Predictions
3. Querying the Twitter API

Keys of assessment:
* Completeness
* Validity
* Accuracy
* Consistency 

At the end, we will group the issues under Quality and Tidiness.

## Assessing We Rate Dogs Archive

### Visual Assessment
Using pandas `.head()` and `.sample()`.

In [8]:
# It's best to simply look at the head
df_twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [9]:
# It's also beneficial to look at a sample
df_twitter_archive.sample(25)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2002,672481316919734272,,,2015-12-03 18:23:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Jazz. She should be on the cover ...,,,,https://twitter.com/dog_rates/status/672481316...,12,10,Jazz,,,pupper,
2107,670465786746662913,,,2015-11-28 04:54:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Silly dog here. Wearing bunny ears. Nice long ...,,,,https://twitter.com/dog_rates/status/670465786...,7,10,,,,,
1327,705975130514706432,,,2016-03-05 04:36:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Adele. Her tongue flies out of her mou...,,,,https://twitter.com/dog_rates/status/705975130...,10,10,Adele,,,pupper,
1205,715928423106027520,,,2016-04-01 15:46:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bubbles. He's a Yorkshire Piccolope. 1...,,,,https://twitter.com/dog_rates/status/715928423...,11,10,Bubbles,,,,
211,851953902622658560,,,2017-04-12 00:23:33 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Astrid. She's a guide d...,8.293743e+17,4196984000.0,2017-02-08 17:00:26 +0000,https://twitter.com/dog_rates/status/829374341...,13,10,Astrid,doggo,,,
1655,683391852557561860,,,2016-01-02 20:58:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...","Say hello to Jack (pronounced ""Kevin""). He's a...",,,,https://twitter.com/dog_rates/status/683391852...,11,10,Jack,,,,
766,777684233540206592,,,2016-09-19 01:42:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""Yep... just as I suspected. You're not flossi...",,,,https://twitter.com/dog_rates/status/777684233...,12,10,,,,,
1586,686760001961103360,,,2016-01-12 04:01:58 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This pupper forgot how to walk. 12/10 happens ...,,,,https://vine.co/v/iMvubwT260D,12,10,,,,pupper,
1291,708119489313951744,,,2016-03-11 02:36:57 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cooper. He basks in the glory of rebel...,,,,https://twitter.com/dog_rates/status/708119489...,9,10,Cooper,,,,
1839,675891555769696257,,,2015-12-13 04:14:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Donny. He's summoning the demon monste...,,,,https://twitter.com/dog_rates/status/675891555...,6,10,Donny,,,,


### Programmatic assessment

In [10]:
# Checking .info
df_twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)

In [11]:
# checking for duplicates on the tweet id
df_twitter_archive.tweet_id.duplicated().sum()

0

In [12]:
# checking for denominators that aren't 10
np.sort(df_twitter_archive.rating_denominator.unique())

array([  0,   2,   7,  10,  11,  15,  16,  20,  40,  50,  70,  80,  90,
       110, 120, 130, 150, 170])

In [13]:
# checking the number of occurrences
df_twitter_archive.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

**The denominators other than 10 occur very rarely. Based on the information given in the project details, these ratings may have been gathered incorrectly, so we have to check them manually.**

In [17]:
# show the full text of each cell instead of truncating it
# (None means no limit; the older value -1 is deprecated in recent pandas)
pd.set_option("display.max_colwidth", None)

# check the rows where the denominator is not 10:
df_twitter_archive.query('rating_denominator != 10')[['rating_numerator','rating_denominator']]

Unnamed: 0,rating_numerator,rating_denominator
313,960,0
342,11,15
433,84,70
516,24,7
784,9,11
902,165,150
1068,9,11
1120,204,170
1165,4,20
1202,50,50


In [18]:
# getting the text for the rows printed above: 
df_twitter_archive.query('rating_denominator != 10').text

313     @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho                                                                       
342     @docmisterio account started on 11/15/15                                                                                                                                 
433     The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd                                                                      
516     Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
784     RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…                             
902     Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE                        
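
To speed up the manual check, one option is to list every fraction-like pattern in a tweet's text, so that tweets with more than one candidate stand out. The following is a minimal sketch, assuming ratings are written as `number/number`. The helper name `find_rating_candidates` is hypothetical, and the sample strings are taken from the output above (the text for index 516 is shortened to its first sentence).

```python
import re

# digits, a slash, digits; dates such as 11/15/15 and phrases such as 24/7
# also match, which is why the results still need a manual look
RATING_PATTERN = re.compile(r'(\d+)/(\d+)')

def find_rating_candidates(text):
    """Return every (numerator, denominator) pair found in text, as integers."""
    return [(int(n), int(d)) for n, d in RATING_PATTERN.findall(text)]

samples = {
    313: "@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 "
         "isn't a valid rating, 13/10 is tho",
    342: "@docmisterio account started on 11/15/15",
    516: "Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer.",
}

for index, text in samples.items():
    print(index, find_rating_candidates(text))
```

With pandas, `df_twitter_archive.text.apply(find_rating_candidates)` would run the same check on every row.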

In [None]:
df_twitter_archive.query('rating_denominator != 10').expanded_urls

## Quality Issues

### We Rate Dogs Archive

* Some ratings (numerator and denominator) were gathered incorrectly. The affected indices and their corrections:

| Index | Incorrect | Correct |
| --- | --- | --- |
| 313 | 960/0 | 13/10 |
| 342 | 11/15 | NO RATING |
| 516 | 24/7 | NO RATING |

* 342 is not a rating tweet.

* Some tweets are missing `expanded_urls` (only 2297 of 2356 rows are non-null).
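
The rating corrections listed in the table could later be applied along these lines. This is only a sketch on a tiny stand-in DataFrame, not this project's cleaning code. It assumes that the row labels 313, 342 and 516 still match the archive's index and that tweets without a rating should simply be dropped.

```python
import pandas as pd

# stand-in for df_twitter_archive with only the affected rows and columns
df = pd.DataFrame(
    {'rating_numerator': [960, 11, 24], 'rating_denominator': [0, 15, 7]},
    index=[313, 342, 516],
)

# work on a copy so the original data stays available for comparison
df_clean = df.copy()

# index 313: the text contains the real rating, 13/10
df_clean.loc[313, ['rating_numerator', 'rating_denominator']] = [13, 10]

# indices 342 and 516 hold a date and the phrase "24/7", not ratings
df_clean = df_clean.drop(index=[342, 516])

print(df_clean)
```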

### Tweet Image Predictions

*
*

### Querying the Twitter API

*
*



## Tidiness Issues

### We Rate Dogs Archive

* `doggo`, `floofer`, `pupper`, and `puppo` are one variable spread across four columns.

*
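
A minimal sketch of how the four columns could be collapsed into one variable. It assumes a stage is present when a cell holds the column's own name (as in the `pupper` cells of the sample above) and treats anything else as missing. The column name `dog_stage`, the helper `combine_stages`, and the `'None'` placeholder in the example data are all assumptions made for illustration.

```python
import pandas as pd

STAGE_COLUMNS = ['doggo', 'floofer', 'pupper', 'puppo']

def combine_stages(df, columns=STAGE_COLUMNS):
    """Replace the stage columns with a single dog_stage column."""
    def join_row(row):
        # a stage counts as present only when the cell repeats the column name
        found = [col for col in columns if row[col] == col]
        # rows with several stages are joined with a hyphen; no stage gives a missing value
        return '-'.join(found) if found else None

    out = df.copy()
    out['dog_stage'] = out[columns].apply(join_row, axis=1)
    return out.drop(columns=columns)

# made-up example rows; 'None' as the placeholder is an assumption
example = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['pupper', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

print(combine_stages(example))
```

Joining multiple stages with a hyphen keeps tweets that mention two stages instead of silently picking one; whether that is the right choice depends on the later analysis.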

### Tweet Image Predictions

*
*

### Querying the Twitter API

*
*

# Cleaning

# Storing Wrangled Data