# Project: Wrangle and Analyze Data

In [1]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [2]:
# Reading csv file to dataframe
archive = pd.read_csv('twitter-archive-enhanced.csv')

archive.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


2. Use the Requests library to download the tweet image predictions (image_predictions.tsv)

In [None]:
# Requesting tsv file
image_predict_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(image_predict_url)
response.raise_for_status()  # stop here if the download failed

# Saving local copy of tsv
with open('image_predictions.tsv', 'wb') as f1:
    f1.write(response.content)

In [3]:
# Reading tsv file to dataframe    
image_predict = pd.read_csv('image_predictions.tsv', sep='\t')

image_predict.head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [None]:
# Actual keys removed and replaced with dummies before submission
consumer_key = 'NOPE'
consumer_secret = 'NOPE'
access_token = 'NOPE'
access_secret = 'NOPE'

# Authorizing and initializing API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [None]:
# Querying tweets by id and adding them to json formatted txt file
with open('tweet_json.txt', 'a', encoding='utf8') as f2:
    for tweet_id in archive['tweet_id']:
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, f2)
            f2.write('\n')
        except Exception:
            # Skip tweets that can no longer be retrieved (e.g. deleted)
            continue
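The loop above silently skips any tweet that fails to download. One way to keep a record of the failures is sketched here. It assumes a hypothetical `fetch` callable standing in for the API call (in the notebook it could wrap `api.get_status(...)._json`). The helper name `dump_tweets` and its parameters are illustrative and not part of Tweepy.

```python
import json

def dump_tweets(tweet_ids, fetch, path):
    """Write one JSON object per line; return the ids that could not be fetched.

    `fetch` is a hypothetical callable that takes a tweet id and returns a dict.
    """
    failed = []
    # 'w' rather than 'a' so that re-running the cell does not duplicate lines
    with open(path, 'w', encoding='utf8') as f:
        for tweet_id in tweet_ids:
            try:
                record = fetch(tweet_id)
            except Exception as err:  # broad on purpose: the exception type depends on the client
                failed.append((tweet_id, str(err)))
                continue
            json.dump(record, f)
            f.write('\n')
    return failed
```

Returning the failed ids makes it possible to compare how many tweets were requested with how many were actually stored before moving on to assessment.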

In [4]:
# Reading json formatted txt file into dataframe line-by-line
tweet_data = pd.read_json('tweet_json.txt', lines=True)
# Reducing dataframe to only attributes selected for analysis
tweet_data = tweet_data[['id', 'favorite_count', 'retweet_count']]

tweet_data.head(1)

Unnamed: 0,id,favorite_count,retweet_count
0,892420643555336193,34095,7066
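`pd.read_json(..., lines=True)` treats every line of the file as a separate JSON object. The same idea can be sketched with the standard library alone, which may help when checking individual records by hand. The function name `read_selected_fields` and the sample records are made up for illustration.

```python
import io
import json

def read_selected_fields(lines, fields=('id', 'favorite_count', 'retweet_count')):
    """Parse JSON Lines input and keep only the requested keys from each record."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        rows.append({key: record.get(key) for key in fields})
    return rows

# Made-up sample standing in for tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1, "lang": "en"}\n'
)
rows = read_selected_fields(sample)
```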


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and
programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



In [None]:
archive

In [None]:
archive.info()

In [None]:
archive.doggo.value_counts()

In [None]:
archive.floofer.value_counts()

In [None]:
archive.pupper.value_counts()

In [None]:
archive.puppo.value_counts()

In [None]:
archive.query('rating_denominator != 10')

In [None]:
non_10_denom = archive.query('rating_denominator != 10')
pd.set_option('display.max_colwidth', None)  # None shows full text; -1 is deprecated in recent pandas
print(non_10_denom['text'])

In [None]:
image_predict

In [None]:
image_predict.info()

In [None]:
image_predict.p1.value_counts()

In [None]:
tweet_data

In [None]:
tweet_data.info()

### Quality issues
1. Dog type missing from majority of records (only 388 have values other than "None")

2. Missing and incorrect values present for name

3. Some records' rating_denominator values are inaccurate (e.g. values other than 10 for posts not of groups of animals)

4. A column should exist listing number of animals being rated in a tweet for the following reasons:
    
    a. Groups are assigned a cumulative rating resulting in non-standard numerators and denominators
    
    b. The number of animals in a photo may impact the accuracy of the image prediction program
    

5. Dataset contains retweets

6. Dataset contains tweets that aren't ratings

7. Values in the p1, p2, and p3 columns are not consistently capitalized, so one value could be entered multiple ways

8. Column timestamp should be datetime, not object
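As a rough illustration of how issues 7 and 8 might be addressed, the sketch below works on toy frames rather than the real data. The names `toy_archive` and `toy_predict` and their values are illustrative, and lower-casing is only one possible normalisation choice.

```python
import pandas as pd

# Toy data standing in for the real frames (values are illustrative)
toy_archive = pd.DataFrame({'timestamp': ['2017-08-01 16:23:56 +0000',
                                          '2017-08-01 00:17:27 +0000']})
toy_predict = pd.DataFrame({'p1': ['Welsh_springer_spaniel', 'collie', 'Collie']})

# Issue 8: parse the timestamp strings into timezone-aware datetimes
toy_archive['timestamp'] = pd.to_datetime(toy_archive['timestamp'], utc=True)

# Issue 7: normalise capitalisation so 'collie' and 'Collie' count as one value
toy_predict['p1'] = toy_predict['p1'].str.lower()
```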

### Tidiness issues
1. The dataframes should be combined into a single table

2. The dog type columns in archive should be a single column with dog type as a categorical value
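Both tidiness issues can be sketched on toy data. The example assumes missing dog types are stored as the string "None", as noted in the quality issues. The column name `dog_type`, the helper `combine_stages`, and all values are hypothetical. A row with more than one label would end up with a combined value such as `doggo,pupper`.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy stand-in for archive_clean
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join whichever stage labels are present; return None when there are none."""
    present = [row[col] for col in stage_cols if row[col] != 'None']
    return ','.join(present) if present else None

# Tidiness issue 2: one categorical column instead of four
toy['dog_type'] = toy.apply(combine_stages, axis=1).astype('category')
toy = toy.drop(columns=stage_cols)

# Tidiness issue 1: combine with the tweet counts on the shared tweet id
toy_counts = pd.DataFrame({'id': [1, 2], 'favorite_count': [10, 5], 'retweet_count': [2, 1]})
merged = toy.merge(toy_counts, left_on='tweet_id', right_on='id', how='inner').drop(columns='id')
```

An inner join keeps only tweets present in both frames. Whether that or a left join is more appropriate depends on how tweets without API data should be treated.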

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [57]:
# Make copies of original pieces of data
archive_clean = archive.copy()
image_predict_clean = image_predict.copy()
tweet_data_clean = tweet_data.copy()

### Issue #1: Dataset contains retweets

#### Define

Remove from archive_clean all records for which retweeted_status_id is not null using the isnull method.

#### Code

In [58]:
# Selecting only rows where retweeted_status_id is null
archive_clean = archive_clean[archive_clean['retweeted_status_id'].isnull()]

#### Test

In [59]:
# Row count should be lower than in the original archive now that retweets are removed
archive_clean.shape

(2175, 17)
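The shape alone shows that rows were dropped, but not that every retweet is gone. A more direct check is to count the remaining non-null retweet ids, sketched here on a toy frame (the frame `toy` and its values are illustrative).

```python
import pandas as pd

# Toy stand-in for archive_clean before filtering
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [None, 8.87e17, None],
})
toy_clean = toy[toy['retweeted_status_id'].isnull()]

# A direct check: no remaining row should carry a retweet id
remaining_retweets = toy_clean['retweeted_status_id'].notnull().sum()
assert remaining_retweets == 0
```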

### Issue #2: Dataset contains tweets that aren't ratings

#### Define
Use a regex pattern to filter dataset for tweets containing a rating.

#### Code

In [70]:
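# Creating regex pattern to identify ratings
rating_pattern = re.compile(r'(\d*/\d*)')

# Using pattern to find tweets with ratings. Note: str.match anchors at the start of the
# text, so this returns only tweets that begin with a rating, and the result is displayed
# rather than assigned back to archive_clean.
archive_clean[archive_clean.text.str.match(rating_pattern)]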

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
701,786051337297522688,7.72743e+17,7.30505e+17,2016-10-12 03:50:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",13/10 for breakdancing puppo @shibbnbot,,,,,13,10,,,,,puppo
967,750381685133418496,7.501805e+17,4717297000.0,2016-07-05 17:31:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",13/10 such a good doggo\n@spaghemily,,,,,13,10,,doggo,,,
1345,704491224099647488,7.044857e+17,28785490.0,2016-03-01 02:19:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",13/10 hero af\n@ABC,,,,,13,10,,,,,
1447,696488710901260288,,,2016-02-08 00:20:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",12/10 revolutionary af https://t.co/zKzq4nIY86,,,,https://twitter.com/dog_rates/status/696488710...,12,10,,,,,
1523,690607260360429569,6.903413e+17,467036700.0,2016-01-22 18:49:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",12/10 @LightningHoltt,,,,,12,10,,,,,
1566,687841446767013888,,,2016-01-15 03:39:15 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",13/10 I can't stop watching this (vid by @k8ly...,,,,https://vine.co/v/iOWwUPH1hrw,13,10,,,,,
1801,676957860086095872,,,2015-12-16 02:51:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",10/10 I'd follow this dog into battle no quest...,,,,https://twitter.com/dog_rates/status/676957860...,10,10,,,,,
1857,675517828909424640,,,2015-12-12 03:29:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",12/10 stay woke https://t.co/XDiQw4Akiw,,,,https://twitter.com/dog_rates/status/675517828...,12,10,,,,,
1914,674330906434379776,6.658147e+17,16374680.0,2015-12-08 20:53:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",13/10\n@ABC7,,,,,13,10,,,,,
2010,672248013293752320,,,2015-12-03 02:56:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",10/10 for dog. 7/10 for cat. 12/10 for human. ...,,,,https://twitter.com/dog_rates/status/672248013...,10,10,,,,,
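The output above contains only tweets whose text begins with a rating, because `str.match` tests the pattern at the start of each string. The sketch below contrasts it with `str.contains`, which searches anywhere in the text. It uses `\d+` so that at least one digit is required on each side of the slash. The sample strings are made up.

```python
import pandas as pd

# Made-up tweet texts
texts = pd.Series([
    '13/10 for a breakdancing dog',
    'This is a very good dog. 12/10',
    'Please only send in dogs, thank you',
])

# \d+ requires at least one digit on each side of the slash
rating_pattern = r'\d+/\d+'

starts_with_rating = texts.str.match(rating_pattern)          # anchored at the start
has_rating = texts.str.contains(rating_pattern, regex=True)   # anywhere in the text
```

To apply such a filter to a frame, the boolean mask would need to be assigned back, for example `df = df[mask]`, otherwise the frame is left unchanged.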


#### Test

In [62]:
archive_clean.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
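A minimal sketch of the storing step, using a toy frame in place of the real master dataset and a temporary directory so nothing in the working folder is overwritten. The frame `toy_master` and its values are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned master dataset
toy_master = pd.DataFrame({'tweet_id': [892420643555336193], 'rating_numerator': [13]})

out_path = os.path.join(tempfile.mkdtemp(), 'twitter_archive_master.csv')
# index=False keeps the row index out of the file, so no extra unnamed column appears on reload
toy_master.to_csv(out_path, index=False)

# Reading the file back is a quick way to confirm what was written
reloaded = pd.read_csv(out_path)
```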

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization