# Wrangling Report: WeRateDogs Twitter Data

## Introduction – project overview

**Data wrangling** is the process of gathering your data, assessing its quality and structure, and cleaning it before analysis, visualization, or building predictive models with machine learning.

Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, and then cleaned it. The full wrangling process is documented in the `wrangle_act.ipynb` notebook.


## Gather

I worked with the following three datasets in this project:

_**Enhanced Twitter Archive**_

This CSV file was downloaded directly from Udacity’s servers. I uploaded the downloaded file and read it into a pandas DataFrame (`archive_df`) in the `wrangle_act.ipynb` notebook.


_**Additional Data via the Twitter API**_

I used Python's _Tweepy_ library to query Twitter’s API for each tweet's JSON data, using the tweet IDs in the WeRateDogs Twitter archive (`archive_df`). I stored each tweet's entire set of JSON data in a file called `tweet_json.txt`.
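The querying step can be sketched as follows. This is a minimal sketch rather than the notebook's exact code: `save_tweet_json` is a hypothetical helper name, and `api` is assumed to be an already authenticated `tweepy.API` object whose `get_status` call returns a status that exposes its raw JSON through the `_json` attribute. Tweets that can no longer be fetched (for example, deleted ones) are recorded and skipped.

```python
import json


def save_tweet_json(api, tweet_ids, out_path="tweet_json.txt"):
    """Write each tweet's JSON to its own line; return the IDs that failed.

    `api` is assumed to be an authenticated tweepy.API instance.
    """
    failed = []
    with open(out_path, "w", encoding="utf-8") as out_file:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception as error:
                # Tweepy's exception class name differs between major
                # versions, so this sketch catches Exception broadly.
                failed.append((tweet_id, str(error)))
                continue
            json.dump(status._json, out_file)
            out_file.write("\n")
    return failed
```

Keeping the list of failed IDs makes it easy to check later how many tweets from the archive could not be retrieved.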

Each tweet's JSON data was written to its own line. I then read the `.txt` file line by line into a pandas DataFrame containing the tweet ID, retweet count, and favorite ("like") count.
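Reading the file back line by line might look like the sketch below. The helper name `load_tweet_counts` is hypothetical, and the JSON keys `id`, `retweet_count`, and `favorite_count` are assumed to be the field names in each tweet's JSON; the notebook may use different column names.

```python
import json

import pandas as pd


def load_tweet_counts(path="tweet_json.txt"):
    """Build a DataFrame with one row per line of tweet JSON."""
    rows = []
    with open(path, encoding="utf-8") as json_file:
        for line in json_file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    columns = ["tweet_id", "retweet_count", "favorite_count"]
    return pd.DataFrame(rows, columns=columns)
```

Parsing one line at a time keeps only the three needed fields in memory instead of every tweet's full JSON.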


_**Image Predictions File**_

The file `image_predictions.tsv` contains predictions of what is present in each tweet's image according to a neural network. The file is hosted on Udacity's servers. I downloaded it programmatically using the Requests library from the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
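A programmatic download with Requests can be sketched as below. The helper name `download_file` and the 30-second timeout are assumptions made for illustration, not details taken from the notebook.

```python
import requests


def download_file(url, out_path):
    """Download `url` and save the response body to `out_path`."""
    response = requests.get(url, timeout=30)
    # Raise an error for 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    with open(out_path, "wb") as out_file:
        out_file.write(response.content)
    return out_path
```

Because the saved file is tab-separated, it can then be loaded with `pd.read_csv(out_path, sep="\t")`.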


## Assess

After gathering all three pieces of data, I assessed them _visually_ and _programmatically_ to document the **quality** and **tidiness** issues. I detected and documented nine (9) quality issues and two (2) tidiness issues in the **"Assessing Data"** section of the `wrangle_act.ipynb` notebook.

_**Visual assessment**_: 

I assessed the data in an external application, Excel. I also displayed the gathered datasets in the `wrangle_act.ipynb` notebook for visual assessment.

_**Programmatic assessment**_: 

I used pandas functions including `info()`, `describe()`, and `head()`, among other pandas and Python methods, to assess the data.
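As an illustration of this kind of programmatic assessment, the sketch below runs the same calls on a small toy DataFrame. The data and column names are invented for the example and are not taken from the real archive.

```python
import pandas as pd

# Toy data (not the real archive) with a duplicate ID and a missing name.
toy_df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 4],
    "rating_numerator": [13, 12, 12, 11],
    "rating_denominator": [10, 10, 10, 10],
    "name": ["Charlie", "Bella", "Bella", None],
})

toy_df.info()              # column dtypes and non-null counts
print(toy_df.head())       # first rows, for a quick look at the values
print(toy_df.describe())   # summary statistics for numeric columns

# Targeted checks for specific quality issues.
duplicate_ids = int(toy_df["tweet_id"].duplicated().sum())
missing_names = int(toy_df["name"].isna().sum())
print(f"duplicate IDs: {duplicate_ids}, missing names: {missing_names}")
```

The summary calls give a broad overview, while checks such as `duplicated()` and `isna()` turn a suspicion into a count that can be documented.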


**NOTE:**

The Twitter Archive data (`archive_df`) contains dog ratings. Rating numerators are often greater than their denominators, but this does not need to be cleaned because it is part of WeRateDogs' unique rating system. Some of the rows are retweets, and we want only original ratings (no retweets) that have images.
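Keeping only original tweets that have images could be done along the lines of the sketch below. It uses toy data, and it assumes that retweets are marked by a non-null `retweeted_status_id` column and that a tweet has an image when its `tweet_id` appears in the image predictions table; both column names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the archive and image prediction tables.
toy_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 99.0, None],  # non-null marks a retweet
})
toy_images = pd.DataFrame({"tweet_id": [1, 2]})

# Step 1: drop retweets.
originals = toy_archive[toy_archive["retweeted_status_id"].isna()]

# Step 2: keep only tweets that have an image prediction.
originals_with_images = originals[
    originals["tweet_id"].isin(toy_images["tweet_id"])
]
print(originals_with_images)
```

In this toy data only tweet 1 survives: tweet 2 is a retweet and tweet 3 has no image.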


## Clean

I cleaned all of the issues documented during assessment in the "Cleaning Data" section of the `wrangle_act.ipynb` notebook.

Before cleaning, I made copies of the original datasets. I used the _**define-code-test**_ framework, in which each cleaning step is defined in words, implemented in code, and then tested, to clearly document and clean the datasets.
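One define-code-test cycle might look like the sketch below. The issue shown (a `timestamp` column stored as text) and the toy data are assumptions chosen for illustration and are not necessarily among the issues documented in the notebook.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56", "2017-07-31 00:18:03"],
})

# Work on a copy so the original data stays untouched.
clean_df = toy_df.copy()

# Define: `timestamp` is stored as text; convert it to a datetime type.

# Code
clean_df["timestamp"] = pd.to_datetime(clean_df["timestamp"])

# Test
assert pd.api.types.is_datetime64_any_dtype(clean_df["timestamp"])
assert not pd.api.types.is_datetime64_any_dtype(toy_df["timestamp"])
```

The second assertion confirms that the cleaning step changed only the copy, which is the reason for copying before cleaning.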

## Storing the Data

In the **"Storing Data"** section of the `wrangle_act.ipynb` notebook, I stored the _gathered, assessed_ and _cleaned_ data, combining the Enhanced Twitter archive (`archive_df`), the image predictions (`image_df`), and the Twitter API data (`tweet_df`), in a master CSV file named `twitter_archive_master.csv`.
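Combining the three tables and writing the master file can be sketched as below. The helper name `store_master`, the shared `tweet_id` key, and the choice of inner joins are assumptions for illustration; the notebook may combine the tables differently.

```python
import pandas as pd


def store_master(archive_df, image_df, tweet_df,
                 path="twitter_archive_master.csv"):
    """Merge the three cleaned tables on `tweet_id` and save them as one CSV."""
    master_df = (
        archive_df
        .merge(image_df, on="tweet_id", how="inner")
        .merge(tweet_df, on="tweet_id", how="inner")
    )
    # index=False keeps the DataFrame index out of the saved file.
    master_df.to_csv(path, index=False)
    return master_df
```

With inner joins, the master file contains only tweets present in all three tables, which fits the goal of keeping ratings that have images.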