# Project: DATA WRANGLING - WeRateDogs

## Data Gathering

The first step of the wrangling process is data gathering.

In this step I gather the three pieces of data needed for this project:
- Manually reading in the `twitter-archive-enhanced.csv` file already downloaded to my workstation
- Programmatically downloading the `image-predictions.tsv` file from this [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)
- Querying the Twitter API with the tweet IDs in `twitter-archive-enhanced.csv`, using Python's Tweepy library, and storing each tweet's entire set of JSON data in a file called `tweet_json.txt`

**First, we import all the packages needed for this project**

In [1]:
#import packages (tweepy was imported twice; once is enough)
import pandas as pd
import requests
import numpy as np
import tweepy
import os
import json

As noted earlier, we read the first dataset, which is already downloaded, into a dataframe.

In [2]:
df = pd.read_csv('twitter-archive-enhanced.csv')

Using the Requests library, we programmatically download the `image-predictions.tsv` data and save its content to a TSV file, which will be read into a dataframe later on.

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [4]:
#Save response.content into a file named after the last URL segment (image-predictions.tsv)
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)
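
Reading the saved file back into a dataframe could look like the following. This is a minimal sketch, assuming the file is tab-separated as its `.tsv` extension suggests. The helper name `read_predictions` and the two sample columns are made up for illustration and are not the real file's schema.

```python
import io
import pandas as pd

def read_predictions(path_or_buffer):
    """Read a tab-separated file (or file-like object) into a DataFrame."""
    return pd.read_csv(path_or_buffer, sep='\t')

# Tiny in-memory stand-in for image-predictions.tsv (hypothetical columns)
sample = io.StringIO("tweet_id\tlabel\n1\tpug\n2\tcorgi\n")
df_sample = read_predictions(sample)
```

With the real file, the call would be `read_predictions('image-predictions.tsv')`.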

Finally, using the Tweepy library, I query the Twitter API for each tweet's JSON data and store the results in the file `tweet_json.txt`.

In [5]:
#Set consumer key, secret, and access_token and secret
#They will be hidden to comply with Twitter API rules
consumer_key = 'XXXXXXXXXXXXXXXXXXXXX'
consumer_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
access_token = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
access_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

#Set Authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit= True)

In [6]:
#Query Twitter's API for JSON data for each tweet id in the dataframe
'''
id_of_tweet = df.tweet_id
failed = {}
#Write each tweet's JSON data on its own line of a text file
with open('tweet_json.txt', mode= 'w') as outputfile:
    for idx in id_of_tweet:
        try:
            tweet = api.get_status(idx, tweet_mode= 'extended')
            json.dump(tweet._json, outputfile)
            outputfile.write('\n')
        except tweepy.errors.TweepyException as e:
            #Record the error for this tweet id and move on to the next one
            print('No Data found')
            failed[idx] = e
print(failed)
'''

No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found


Rate limit reached. Sleeping for: 95


No Data found


Rate limit reached. Sleeping for: 176


{888202515573088257: NotFound('404 Not Found\n144 - No status found with that ID.',), 873697596434513921: NotFound('404 Not Found\n144 - No status found with that ID.',), 872668790621863937: NotFound('404 Not Found\n144 - No status found with that ID.',), 872261713294495745: NotFound('404 Not Found\n144 - No status found with that ID.',), 869988702071779329: NotFound('404 Not Found\n144 - No status found with that ID.',), 866816280283807744: NotFound('404 Not Found\n144 - No status found with that ID.',), 861769973181624320: NotFound('404 Not Found\n144 - No status found with that ID.',), 856602993587888130: NotFound('404 Not Found\n144 - No status found with that ID.',), 856330835276025856: NotFound('404 Not Found\n144 - No status found with that ID.',), 851953902622658560: NotFound('404 Not Found\n144 - No status found with that ID.',), 851861385021730816: NotFound('404 Not Found\n144 - No status found with that ID.',), 845459076796616705: NotFound('404 Not Found\n144 - No status fou
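
The gathering loop can also be written as a small function, which makes the "one JSON object per line, collect failures" logic checkable without network access. This is a minimal sketch under stated assumptions: `fetch` is a hypothetical callable standing in for `api.get_status(...)._json`, and `dump_tweets` and `fake_fetch` are illustrative names that are not part of Tweepy.

```python
import io
import json

def dump_tweets(tweet_ids, fetch, outputfile):
    """Write one JSON object per line; return {tweet_id: error} for failures.

    `fetch` is a hypothetical callable that takes a tweet id and returns a dict.
    """
    failed = {}
    for tweet_id in tweet_ids:
        try:
            data = fetch(tweet_id)
        except Exception as e:  # broad on purpose here; the notebook catches TweepyException
            failed[tweet_id] = e
            continue
        json.dump(data, outputfile)
        outputfile.write('\n')
    return failed

def fake_fetch(tweet_id):
    """Stand-in for the API call: id 2 simulates a deleted tweet."""
    if tweet_id == 2:
        raise LookupError('No status found with that ID.')
    return {'id': tweet_id, 'retweet_count': 0, 'favorite_count': 0}

buffer = io.StringIO()
failed = dump_tweets([1, 2, 3], fake_fetch, buffer)
lines = buffer.getvalue().splitlines()
```

Here `lines` holds the JSON for the tweets that were found, and `failed` maps each missing id to the exception raised for it.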

Now read the `tweet_json.txt` file line by line into a pandas dataframe, keeping only the variables of interest.

In [11]:
json_list = []
with open('tweet_json.txt', mode= 'r') as json_file:
    for text in json_file:
        texts = json.loads(text)
        tweet_id = texts['id']
        retweet_count = texts['retweet_count']
        favorite_count = texts['favorite_count']
        tweet_date = texts['created_at']
        tweet_source = texts['source']
        json_list.append({'tweet_id' : tweet_id,
                       'retweet_count' : retweet_count,
                       'favorite_count' : favorite_count,
                       'tweet_date' : tweet_date,
                       'tweet_source' : tweet_source})

df_json = pd.DataFrame(json_list, columns = ['tweet_id', 'retweet_count','favorite_count','tweet_date','tweet_source'])

In [12]:
df_json

| | tweet_id | retweet_count | favorite_count | tweet_date | tweet_source |
|---|---|---|---|---|---|
| 0 | 892420643555336193 | 6981 | 33737 | Tue Aug 01 16:23:56 +0000 2017 | `<a href="http://twitter.com/download/iphone" r...` |
| 1 | 892177421306343426 | 5284 | 29265 | Tue Aug 01 00:17:27 +0000 2017 | `<a href="http://twitter.com/download/iphone" r...` |
| 2 | 891815181378084864 | 3468 | 22000 | Mon Jul 31 00:18:03 +0000 2017 | `<a href="http://twitter.com/download/iphone" r...` |
| 3 | 891689557279858688 | 7203 | 36844 | Sun Jul 30 15:58:51 +0000 2017 | `<a href="http://twitter.com/download/iphone" r...` |
| 4 | 891327558926688256 | 7727 | 35231 | Sat Jul 29 16:00:24 +0000 2017 | `<a href="http://twitter.com/download/iphone" r...` |
| ... | ... | ... | ... | ... | ... |
| 2322 | 666049248165822465 | 36 | 88 | Mon Nov 16 00:24:50 +0000 2015 | `<a href="http://twitter.com/download/iphone" r...` |
| 2323 | 666044226329800704 | 115 | 246 | Mon Nov 16 00:04:52 +0000 2015 | `<a href="http://twitter.com/download/iphone" r...` |
| 2324 | 666033412701032449 | 36 | 100 | Sun Nov 15 23:21:54 +0000 2015 | `<a href="http://twitter.com/download/iphone" r...` |
| 2325 | 666029285002620928 | 39 | 112 | Sun Nov 15 23:05:30 +0000 2015 | `<a href="http://twitter.com/download/iphone" r...` |


In [14]:
df_json.tweet_source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)
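
Each `tweet_source` value is an HTML anchor tag, so the readable client name sits between `>` and `</a>`. One possible way to pull it out is sketched below. The helper name `extract_source_name` is illustrative, and the regular expression assumes every value follows the single-anchor pattern shown in the output above.

```python
import re
import pandas as pd

def extract_source_name(html):
    """Return the text between the opening and closing <a> tags, or None."""
    match = re.search(r'>([^<]+)</a>', html)
    return match.group(1) if match else None

# Two of the unique values shown in the notebook output
sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])
names = sources.apply(extract_source_name)
```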

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
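
A programmatic assessment along these lines might start with a few checks like the ones below. This is a sketch on a toy dataframe, not the real archive. The column name `retweeted_status_id` is an assumption about how retweets are marked, so verify it against the actual columns before relying on it.

```python
import pandas as pd

# Toy stand-in for the archive; `retweeted_status_id` is an assumed column name
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 3],
    'retweeted_status_id': [float('nan'), 10.0, float('nan'), float('nan')],
})

# Quality check: repeated tweet ids
n_duplicates = archive.tweet_id.duplicated().sum()

# Quality check: missing values per column
missing_per_column = archive.isnull().sum()

# Keep only original tweets (rows with no retweet reference)
originals = archive[archive.retweeted_status_id.isnull()]
```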


### Quality issues
1.

2.

3.

4.

5.

6.

7.

8.

### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
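
Copying before cleaning and then merging on the shared key could be sketched as follows. The two toy dataframes and the `text` column are made up for illustration. The choice of an inner join on `tweet_id` is an assumption about how the pieces should be combined, not a requirement stated in the project.

```python
import pandas as pd

# Toy stand-ins for two of the gathered pieces
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
counts = pd.DataFrame({'tweet_id': [1, 3], 'retweet_count': [5, 7]})

# Work on copies so the originals stay untouched
archive_clean = archive.copy()
counts_clean = counts.copy()

# Inner join keeps only tweets present in both pieces
master = archive_clean.merge(counts_clean, on='tweet_id', how='inner')
```

An inner join drops tweets that could not be retrieved from the API. A left join would keep them, with missing counts.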

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define:

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
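
Writing the master dataframe to disk might look like this. It is a minimal sketch: the one-row dataframe reuses values from the `df_json` output shown earlier, and the temporary directory is used only so the example does not overwrite a real file.

```python
import os
import tempfile
import pandas as pd

# One-row stand-in for the cleaned master dataframe
master = pd.DataFrame({'tweet_id': [892420643555336193], 'retweet_count': [6981]})

# index=False avoids writing the row index as an extra unnamed column
out_path = os.path.join(tempfile.mkdtemp(), 'twitter_archive_master.csv')
master.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips
reloaded = pd.read_csv(out_path)
```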

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**
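
As a starting point for an insight, the relationship between retweets and favorites can be measured with a correlation. The sketch below uses only the first five rows shown in the `df_json` output, so it illustrates the mechanics rather than a finding about the full dataset. The date format string is an assumption based on the `tweet_date` values displayed.

```python
import pandas as pd

# First five rows from the df_json output above
sample = pd.DataFrame({
    'retweet_count': [6981, 5284, 3468, 7203, 7727],
    'favorite_count': [33737, 29265, 22000, 36844, 35231],
    'tweet_date': [
        'Tue Aug 01 16:23:56 +0000 2017',
        'Tue Aug 01 00:17:27 +0000 2017',
        'Mon Jul 31 00:18:03 +0000 2017',
        'Sun Jul 30 15:58:51 +0000 2017',
        'Sat Jul 29 16:00:24 +0000 2017',
    ],
})

# Parse Twitter's timestamp strings into timezone-aware datetimes
sample['tweet_date'] = pd.to_datetime(sample.tweet_date,
                                      format='%a %b %d %H:%M:%S %z %Y')

# Pearson correlation between the two engagement counts
correlation = sample.retweet_count.corr(sample.favorite_count)
```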

### Insights:
1.

2.

3.

### Visualization