# WeRateDogs Twitter - Data Wrangling

*Greg Clunies*<br>
*06/16/2018*

## Project Key Points

**Key points to keep in mind when data wrangling for this project:**

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least **8 quality issues** and at least **2 tidiness issues** in this dataset.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
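
The first and third key points above (keep only original ratings that have images, and merge the pieces of data) can be sketched with pandas as follows. This is a minimal illustration, not the project's actual cleaning code: the two small frames are made up, and the column names `retweeted_status_id`, `tweet_id`, and `jpg_url` are assumptions about how the archive and image-prediction files are laid out.

```python
import pandas as pd

# Tiny made-up frames standing in for the real files.
# Column names are assumptions, not confirmed by the instructions above.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    # NaN means "not a retweet"; a value means the row is a retweet.
    "retweeted_status_id": [float("nan"), 99.0, float("nan")],
    "rating_numerator": [13, 12, 11],
})
predictions = pd.DataFrame({
    "tweet_id": [1, 2],
    "jpg_url": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})


def original_ratings_with_images(archive, predictions):
    """Drop retweets, then keep only tweets that have an image prediction."""
    originals = archive[archive["retweeted_status_id"].isna()]
    # An inner merge keeps only tweet IDs present in both tables.
    return originals.merge(predictions, on="tweet_id", how="inner")


result = original_ratings_with_images(archive, predictions)
```

In this toy data, tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image prediction.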

## Gather

### Gathering Data for this Project
Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:

1. The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv

2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
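
Step 2 above (downloading the image predictions programmatically) might look roughly like the sketch below. The URL is the one given in step 2; the helper names `filename_from_url`, `save_bytes`, and `download_file` are hypothetical and chosen only for illustration.

```python
import os

import requests

# URL taken from step 2 above.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def filename_from_url(url):
    """Return the last path segment of the URL, used as a default file name."""
    return url.rstrip("/").split("/")[-1]


def save_bytes(content, path):
    """Write raw bytes to path, creating the parent folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)
    return path


def download_file(url, folder=".", filename=None):
    """Download url with Requests and save it under folder."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    target = os.path.join(folder, filename or filename_from_url(url))
    return save_bytes(response.content, target)


# Example call (performs a network request, so it is left commented out):
# download_file(IMAGE_PREDICTIONS_URL, filename="image_predictions.tsv")
```

Note that the file name in the URL uses a hyphen, while the submission checklist names the file with an underscore, which is why the sketch accepts an explicit `filename`. The saved file is tab-separated, so it can then be loaded with `pd.read_csv(path, sep='\t')`.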

In [1]:
import numpy as np
import pandas as pd
import requests
import os
import json
import tweepy

In [2]:
# The file name uses underscores, matching the name given in the gathering instructions.
df_dogs = pd.read_csv('twitter_archive_enhanced.csv')
df_dogs.head()
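
The third gathering step (one tweet's JSON per line in tweet_json.txt, then reading the file line by line) can be sketched with the standard `json` module. This is a sketch under assumptions: the function names are hypothetical, and the keys `id`, `retweet_count`, and `favorite_count` are assumed to be present in each tweet's JSON. How the tweet dictionaries are obtained from Tweepy is deliberately left out.

```python
import json


def write_tweet_json(tweets, path="tweet_json.txt"):
    """Write each tweet dict as one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def parse_tweet_lines(lines):
    """Turn an iterable of JSON strings (one tweet per line) into row dicts."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return rows


def read_tweet_json(path="tweet_json.txt"):
    """Read the line-delimited JSON file and return a list of row dicts."""
    with open(path, encoding="utf-8") as f:
        return parse_tweet_lines(f)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(...)` to get a frame with one row per tweet.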

## Assess

## Clean

#### Define

#### Code

#### Test

## Project Submission
In this project, you'll gather, assess, and clean data, then act on it through analysis, visualization, and/or modeling.

**Before you submit:**

1. Ensure you meet specifications for all items in the [Project Rubric](https://review.udacity.com/#!/rubrics/1136/view). Your project "meets specifications" only if it meets specifications for all of the criteria.
2. Ensure you have not included your API keys, secrets, and tokens in your project files.
3. If you completed your project in the Project Workspace, ensure the following files are present in your workspace, then click "Submit Project" in the bottom right-hand corner of the Project Workspace page:
    - wrangle_act.ipynb: code for gathering, assessing, cleaning, analyzing, and visualizing data
    - wrangle_report.pdf or wrangle_report.html: documentation for data wrangling steps: gather, assess, and clean
    - act_report.pdf or act_report.html: documentation of analysis and insights into final data
    - twitter_archive_enhanced.csv: file as given
    - image_predictions.tsv: file downloaded programmatically
    - tweet_json.txt: file constructed via API
    - twitter_archive_master.csv: combined and cleaned data
    - any additional files (e.g. files for additional pieces of gathered data or a database file for your stored clean data)

4. If you completed your project outside of the Udacity Classroom, package the files listed above into a zip archive or push them to a GitHub repo, then click the "Submit Project" button on this page.

As stated in point 4 above, you can submit your files as a zip archive or you can link to a GitHub repository containing your project files. If you go with GitHub, note that your submission will be a snapshot of the linked repository at the time of submission. It is recommended that you keep each project in a separate repository: if a reviewer receives multiple folders representing multiple projects, it may be unclear which project is to be evaluated.

It can take us up to a week to grade the project, but in most cases it is much faster. You will get an email once your submission has been reviewed. If you are having any problems submitting your project or wish to check on the status of your submission, please email us at dataanalyst-project@udacity.com. In the meantime, you should feel free to proceed with your learning journey by continuing on to the next module in the program.