# Wrangling WeRateDogs Tweets

George J.J. Wu

The current investigation aims to analyze the original dog ratings posted by Twitter user WeRateDogs ([@dog_rates](https://twitter.com/dog_rates)). This paper documents the wrangling efforts made toward that goal. The wrangling process took place in three stages: gathering, assessment, and cleaning.

## Gathering

The current investigation gathered data from three sources, as follows:

- File on hand: *twitter_archive_enhanced.csv*, which contains basic information for 2,356 tweets from the WeRateDogs Twitter archive. This file was loaded into a table named `tweets_archive`.
- File downloaded programmatically from URL: *image_predictions.tsv*, which contains breed information for 2,075 animal images according to a neural network. This file was loaded into a table named `breeds_image`.
- File obtained by querying the Twitter API: *tweet_json.txt*, which contains retweet count and favorite count information for 2,345 tweets. This file was loaded into a table named `tweets_info`.
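
As a rough illustration of the third source, the sketch below parses a file that stores one tweet's JSON per line into a table of counts. The one-object-per-line layout, the key names `id`, `retweet_count`, and `favorite_count`, the helper name `read_tweet_json`, and the sample values are all assumptions made for this example; they are not taken from the investigation's actual code.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Build a counts table from an iterable of JSON strings (one tweet per line)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],                    # assumed key name
            "retweet_count": tweet["retweet_count"],    # assumed key name
            "favorite_count": tweet["favorite_count"],  # assumed key name
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# In-memory stand-in for tweet_json.txt; the values are invented for illustration.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweets_info_demo = read_tweet_json(sample)

# The two flat files could be loaded along these lines:
# tweets_archive = pd.read_csv("twitter_archive_enhanced.csv")
# breeds_image = pd.read_csv("image_predictions.tsv", sep="\t")
```

With a real file, the same helper could be called as `read_tweet_json(open("tweet_json.txt"))`, provided the file follows the assumed layout.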

## Assessment

The current investigation then assessed the gathered data, with exclusive interest in tweets that meet the following criteria:

- tweets that contained original ratings, and were not retweets or replies.
- tweets that contained an image.
- tweets that were about dogs, and were not about hens, turtles, goats, or bears.
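
The first two criteria can be expressed as boolean filters. The sketch below is a minimal example on invented rows: it assumes a `tweet_id` key column and treats a tweet as original when both *in_reply_to_status_id* and *retweeted_status_id* are empty. The set `ids_with_image` is a hypothetical stand-in for the tweet IDs present in the image predictions table.

```python
import numpy as np
import pandas as pd

# Invented rows: tweet 2 is a reply, tweet 3 is a retweet, tweet 4 has no image.
demo = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [np.nan, 111.0, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 222.0, np.nan],
})
ids_with_image = {1, 2, 3}  # hypothetical IDs found in the image table

# Original rating: neither a reply nor a retweet.
is_original = demo["in_reply_to_status_id"].isna() & demo["retweeted_status_id"].isna()

# Has an image: the tweet ID appears in the image table.
has_image = demo["tweet_id"].isin(ids_with_image)

of_interest = demo[is_original & has_image]
```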

A combination of visual and programmatic assessments revealed the following issues:

### Quality/Content Issues
#### `tweets_archive` table
- Missing information for *in_reply_to_status_id*, *in_reply_to_user_id*, *retweeted_status_id*, *retweeted_status_user_id*, and *retweeted_status_timestamp* columns.
- Missing information in the *doggo*, *floofer*, *pupper*, and *puppo* columns is denoted as "None" instead of "NaN".
- Erroneous datatypes for *timestamp*, *retweeted_status_id*, *retweeted_status_user_id*, and *retweeted_status_timestamp* columns.
- A number of tweets were not original ratings, as they were either replies or retweets.
- Some tweets had decimals in their ratings (e.g., 9.75/10), and these ratings were not extracted properly.
- Some tweets had two occurrences of "#/#" numbers in their text, and the wrong occurrence was extracted as the dog rating.
- Two tweets had ratings of 0 (they were about plagiarism).
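
The two rating-extraction issues can be reproduced on a small scale. The sketch below uses invented tweet texts and assumes an integer-only regular expression as one plausible way such errors could arise; it is not the pattern actually used to build the archive. A decimal-aware pattern recovers the full numerator, and counting matches flags tweets with more than one "#/#" occurrence for manual review.

```python
import pandas as pd

# Invented texts for illustration only.
texts = pd.Series([
    "Meet Sam. 9.75/10 would pet",    # decimal rating
    "Born 7/4, still a solid 12/10",  # two "#/#" occurrences
])

# Integer-only pattern: matches "75/10" inside "9.75/10".
int_only = texts.str.extract(r"(\d+)/(\d+)")

# Decimal-aware pattern: optional fractional part on the numerator.
decimal_aware = texts.str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
rating_numerator = decimal_aware[0].astype(float)

# str.extract keeps only the first match, so tweets with several
# "#/#" occurrences are flagged for a manual check.
needs_review = texts.str.count(r"\d+(?:\.\d+)?/\d+") > 1
```

Here the first text yields a numerator of 75 under the integer-only pattern and 9.75 under the decimal-aware one, while the second text is flagged because its first match ("7/4") is not the rating.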

#### `breeds_image` table
- Images are available for only 2,075 tweets (images are missing for some tweets in the `tweets_archive` table).
- Some tweets were not actually about dogs, but about hens, turtles, goats, piglets, etc.

#### `tweets_info` table
- Retweet count and favorite count information is available for only 2,345 tweets (information is missing for 11 tweets in the `tweets_archive` table).

### Tidiness/Structural Issues
- The `tweets_info` table belongs with the `tweets_archive` table.
- The *doggo*, *floofer*, *pupper*, and *puppo* columns in the `tweets_archive` table are values of a single variable (the dog's stage) and should be combined into one new column.
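
Both tidiness issues can be addressed in a few lines of pandas. The sketch below works on invented rows; the column name `dog_stage`, the helper `combine_stages`, the `tweet_id` join key, and the choice of a left join are assumptions for illustration rather than the investigation's actual code.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Invented rows: tweet 2 has no stage, tweet 3 mentions two stages.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
info = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweet_count": [10, 3],
    "favorite_count": [25, 7],
})


def combine_stages(row):
    """Join every stage named in the row; return NaN when none is present."""
    found = [value for value in row if value != "None"]
    return ", ".join(found) if found else np.nan


# Collapse the four stage columns into one variable.
dog_stage = archive[stage_cols].apply(combine_stages, axis=1)

# Drop the old columns, then attach the counts; a left join keeps
# archive tweets that have no matching row in the counts table.
tidy = (
    archive.drop(columns=stage_cols)
    .assign(dog_stage=dog_stage)
    .merge(info, on="tweet_id", how="left")
)
```

Keeping multiple stages as a joined string (for example "doggo, pupper") is one possible design; another would be to treat such rows as needing manual review.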


## Cleaning

Copies of the data were made specifically for the cleaning process. The current investigation addressed missing-information issues first, followed by tidiness issues, and finally quality issues. Some issues were bundled together so they could be addressed efficiently using similar code blocks. For each issue or bundle of issues, the cleaning operation was defined and put into code, and the result was then tested.
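
The define, code, and test cycle might look like the following for two of the quality issues listed above. This is a minimal sketch on invented rows: the timestamp strings follow an assumed format, and the name `archive_clean` is hypothetical.

```python
import numpy as np
import pandas as pd

# Work on a copy so the gathered data stays untouched.
archive = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "pupper": ["None", "pupper"],
})
archive_clean = archive.copy()

# Define: convert `timestamp` to datetime; replace the string "None" with NaN.

# Code
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"], utc=True)
archive_clean["pupper"] = archive_clean["pupper"].replace("None", np.nan)

# Test
assert pd.api.types.is_datetime64_any_dtype(archive_clean["timestamp"])
assert archive_clean["pupper"].isna().sum() == 1
assert archive["pupper"].tolist() == ["None", "pupper"]  # original is unchanged
```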

## Reflection

The current investigation identified and addressed only a small portion of the quality and tidiness issues in the gathered data. Real-world data such as Twitter data is messy and complex, and wrangling it is time-consuming and sometimes technically challenging. The investigation gave the author a much deeper appreciation of the effort that data wrangling requires before quality data analysis can take place.