## Wrangling Twitter Data for WeRateDogs - Wrangle Report
###### By Kaspar Lee

### Introduction

In this report, I will describe my efforts to wrangle the Twitter data of the WeRateDogs account.

My work consisted of three stages:

- Gathering Data
- Assessing Data
- Cleaning Data

### Gathering Data

The data for this project was gathered from the following sources:

- **WeRateDogs Twitter Archive**: Downloaded from Udacity manually.
- **Retweet and Favourite Counts**: Gathered for each tweet, along with some additional data, using the Twitter API. The JSON data for each tweet was stored in the file `tweet_json.txt`.
- **Image Predictions**: This file was hosted by Udacity and downloaded programmatically using the `requests` library from the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
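
The programmatic parts of this stage can be sketched roughly as follows. This is a minimal illustration rather than the exact code used: it assumes `tweet_json.txt` holds one JSON object per line, and that each object carries the `id`, `retweet_count` and `favorite_count` fields. The helper names `download_file` and `parse_tweet_lines` are hypothetical.

```python
import json

import requests

# URL taken from the list above.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, path):
    """Download `url` and save the response body to `path` (hypothetical helper)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(path, "wb") as f:
        f.write(response.content)


def parse_tweet_lines(lines):
    """Extract counts from an iterable of JSON strings.

    Assumes one tweet object per line (an assumption about the file layout).
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": str(tweet["id"]),  # keep IDs as strings
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return rows
```

Calling `download_file(IMAGE_PREDICTIONS_URL, "image-predictions.tsv")` would save the predictions file locally, and passing an open handle on `tweet_json.txt` to `parse_tweet_lines` would give a list of rows that can be handed to `pandas.DataFrame`.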

### Assessing Data

Once all of the above data was gathered, I assessed it both visually and programmatically for issues with both the quality and the tidiness of the data. I came across the following issues.

#### Quality Issues:

- 181 entries are retweets, and multiple unnecessary columns exist just for retweets
- IDs are stored as integers or floats rather than as strings. They should not be stored as numeric types, as they will never be used in any sort of mathematical calculation.
- Some tweets have no images (number of non-null `expanded_urls` is less than total number of entries)
- Timestamps are stored as strings instead of more appropriate *datetime64* objects
- Values in the `source` column are stored as HTML code rather than as the actual link
- Rating numerator max value is 1776 (incorrectly extracted)
- Rating denominator min value is 0 and max value is 170 (incorrectly extracted)
- Name is incorrectly "a" for 55 entries and "None" for 745 entries; both should be null values
- Dog stage columns use the string "None" for empty values, which is non-null, where a true null value such as `NaN` should be used
- `contributors`, `coordinates` and `geo` columns are empty
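
Several of these quality issues can be surfaced with a handful of programmatic checks. The sketch below runs them on a tiny made-up sample, not the real archive. The column names `retweeted_status_id`, `rating_numerator`, `rating_denominator`, `timestamp` and `name` are assumptions about the archive's layout, and `assess` is a hypothetical helper.

```python
import pandas as pd


def assess(df):
    """Collect a few summary checks that mirror the issues listed above."""
    return {
        # Rows with a retweeted status ID are retweets.
        "retweets": int(df["retweeted_status_id"].notna().sum()),
        # Tweets without an expanded URL have no image link.
        "missing_expanded_urls": int(df["expanded_urls"].isna().sum()),
        # False while timestamps are still plain strings.
        "timestamp_is_datetime": pd.api.types.is_datetime64_any_dtype(df["timestamp"]),
        "numerator_max": int(df["rating_numerator"].max()),
        "denominator_min": int(df["rating_denominator"].min()),
        # Placeholder names that should really be null.
        "placeholder_names": int(df["name"].isin(["a", "None"]).sum()),
    }


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 3,
    "expanded_urls": ["https://example.com/1", None, "https://example.com/3"],
    "retweeted_status_id": [None, None, 8.87e17],
    "rating_numerator": [13, 1776, 12],
    "rating_denominator": [10, 10, 0],
    "name": ["Phineas", "a", "None"],
})

report = assess(sample)
```

On the real archive, the same checks would be complemented by `df.info()` and `df.describe()` for a broader overview.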

#### Tidiness Issues:

- The `df1` dog stage columns (i.e. doggo, floofer, pupper and puppo) should be combined into a single column, as they are all values of one variable, the type of dog.
- Duplicate columns across dataframes
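
One possible way to address the first tidiness issue is to collapse the four stage columns into a single column. The sketch below is an illustration on made-up rows, not necessarily the approach taken in the project. The column name `dog_stage` and the helper `combine_stages` are hypothetical, and joining multiple stages with a comma is a design assumption.

```python
import numpy as np
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(df):
    """Collapse the four stage columns into one `dog_stage` column."""
    df = df.copy()
    # Turn the "None" placeholder into real missing values.
    stages = df[STAGES].replace("None", np.nan)
    # Join whichever stage names are present in each row.
    combined = stages.apply(lambda row: ", ".join(row.dropna()), axis=1)
    # Rows with no stage at all end up as "", which becomes NaN.
    df["dog_stage"] = combined.replace("", np.nan)
    return df.drop(columns=STAGES)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

tidy = combine_stages(sample)
```

Here the first row keeps a single stage, the second row has no stage and gets a null value, and the third row, which has two stages set, keeps both names in one string.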

### Cleaning Data

To clean the data, I had to resolve every single one of the issues found when assessing. I did not find any one-off issues that required manual cleaning.

I utilised Pandas to programmatically clean and merge the datasets to produce one master dataset that is both clean and tidy.
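
As a rough sketch of the merge step, the three tables can be joined on the tweet ID. The frames below are made-up stand-ins, the column `p1` is only an illustrative prediction column, and the use of inner joins is an assumption; `merge_datasets` is a hypothetical helper.

```python
import pandas as pd


def merge_datasets(archive, counts, predictions):
    """Inner-join the three tables on `tweet_id`.

    The key column must have the same dtype in every table, which is one
    practical reason to store IDs consistently as strings.
    """
    master = archive.merge(counts, on="tweet_id", how="inner")
    master = master.merge(predictions, on="tweet_id", how="inner")
    return master


# Made-up rows for illustration only.
archive = pd.DataFrame({"tweet_id": ["1", "2", "3"], "rating_numerator": [13, 12, 11]})
counts = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "retweet_count": [100, 50],
    "favorite_count": [400, 200],
})
predictions = pd.DataFrame({"tweet_id": ["1", "3"], "p1": ["golden_retriever", "pug"]})

master = merge_datasets(archive, counts, predictions)
```

With inner joins, only tweets present in all three tables survive, so the toy result contains a single row for tweet `"1"`.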

Each issue went through three cleaning stages: defining what I needed to do to clean the data, writing the code to clean the data, and testing that the data was clean after executing the code. Once the test showed that the issue had been resolved, I moved on to the next issue, until all were resolved.
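
The define, code and test cycle might look like the following for a few of the quality issues. This is a minimal sketch on made-up rows, not the project's actual code. The column names `retweeted_status_id`, `timestamp` and `name` are assumptions about the archive's layout.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "source": ['<a href="http://twitter.com/download/iphone" '
               'rel="nofollow">Twitter for iPhone</a>'] * 4,
    "name": ["Phineas", "a", "None", "Rex"],
    "retweeted_status_id": [np.nan, np.nan, np.nan, 8.87e17],
})

# Define: drop retweets and the retweet-only column, store IDs as strings,
# parse timestamps, keep only the link from `source`, and replace the
# placeholder names "None" and "a" with NaN.

# Code
clean = archive.copy()
clean = clean[clean["retweeted_status_id"].isna()].drop(columns=["retweeted_status_id"])
clean["tweet_id"] = clean["tweet_id"].astype(str)
clean["timestamp"] = pd.to_datetime(clean["timestamp"])
clean["source"] = clean["source"].str.extract(r'href="([^"]+)"', expand=False)
clean["name"] = clean["name"].replace(["None", "a"], np.nan)

# Test
assert "retweeted_status_id" not in clean.columns
assert clean["tweet_id"].tolist() == ["1", "2", "3"]
assert pd.api.types.is_datetime64_any_dtype(clean["timestamp"])
assert clean["source"].iloc[0] == "http://twitter.com/download/iphone"
assert clean["name"].isna().sum() == 2
```

Keeping the tests as plain assertions means a failed fix stops execution immediately, which fits the idea of only moving on once an issue is resolved.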

This data was then used to create the analysis report.

### Conclusion

The data wrangling phase is essential for converting raw, dirty, messy data into usable data that can be used to produce useful visualisations. If the data were not cleaned, it would be very difficult to produce visualisations, and much of what was produced would be inaccurate or misleading. Data wrangling allows us to avoid this and produce useful visuals.

Through these stages, I successfully wrangled the data to create a master dataset that I then analysed to produce useful insights.