# Data Wrangling Project - report
- A summary of the work performed to clean and tidy data supplied as part of the Udacity Data Analyst Nanodegree

## The Context
Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "[they're good dogs Brent](http://knowyourmeme.com/memes/theyre-good-dogs-brent)". WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs [downloaded their Twitter archive](https://support.twitter.com/articles/20170160) and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

**The goal:** wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

## The Data
### Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

![data_image](images/data_image.png)
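
The text-based extraction described above can be illustrated with a small sketch. The regular expression, the function names, and the example tweet are assumptions made for illustration; they are not the method actually used to build the enhanced archive.

```python
import re

# Assumed pattern: "<numerator>/<denominator>", allowing a decimal numerator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")
STAGES = ("doggo", "floofer", "pupper", "puppo")


def extract_rating(text):
    """Return (numerator, denominator) for the first rating-like match, or None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def extract_stages(text):
    """Return the dog stages mentioned in the text, in STAGES order."""
    lowered = text.lower()
    # Allow an optional plural "s" so "puppers" still counts as "pupper".
    return [stage for stage in STAGES if re.search(rf"\b{stage}s?\b", lowered)]


# Made-up tweet text, used only to exercise the two helpers.
example = "This is Bella. She's a 13.5/10 pupper who loves naps."
print(extract_rating(example))
print(extract_stages(example))
```

Taking the first match is a simplification: text such as "24/7" would be read as a rating, which is one reason extracted scores can need manual correction.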

### Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: **retweet count** and **favorite count** are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.
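
One way to organize that gathering step is sketched below, assuming each tweet is stored as one JSON object per line in a file such as `tweet_json.txt`. The function names and the `fetch_tweet` callable are hypothetical; `fetch_tweet` stands in for whatever API client call returns a tweet as a dictionary.

```python
import json


def save_tweet_json(tweet_ids, fetch_tweet, path):
    """Write one JSON object per line for every tweet that can be fetched.

    fetch_tweet is a hypothetical callable: it takes a tweet ID and returns
    the tweet as a dict. IDs that fail to fetch are returned in a list.
    """
    failed = []
    with open(path, "w", encoding="utf-8") as handle:
        for tweet_id in tweet_ids:
            try:
                tweet = fetch_tweet(tweet_id)
            except Exception:
                # Broad on purpose for the sketch; a real client would catch
                # its own error type (e.g. for deleted tweets).
                failed.append(tweet_id)
                continue
            handle.write(json.dumps(tweet) + "\n")
    return failed


def load_counts(path):
    """Read the file back as (id, retweet_count, favorite_count) tuples."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tweet = json.loads(line)
            rows.append((tweet["id"], tweet["retweet_count"], tweet["favorite_count"]))
    return rows
```

With Tweepy, `fetch_tweet` could plausibly be `lambda tweet_id: api.get_status(tweet_id, tweet_mode="extended")._json`, assuming `api` is an authenticated `tweepy.API` instance. Keeping the client behind a callable lets the file handling be tried out without network access.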

### Image Predictions File

One more cool thing: I ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).
![pictures_data_image](images/pictures_image.png)

## The deliverables
Tasks in the project are as follows:

- Data wrangling, which consists of:
    - Gathering data (downloadable file in the Resources tab in the leftmost panel of your classroom and linked in step 1 below).
    - Assessing data
    - Cleaning data
- Storing, analyzing, and visualizing your wrangled data
- Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations

### Expected outputs
- `wrangle_act.ipynb`: code for gathering, assessing, cleaning, analyzing, and visualizing data
- `wrangle_report.pdf` or `wrangle_report.html`: documentation for data wrangling steps: gather, assess, and clean
- `act_report.pdf` or `act_report.html`: documentation of analysis and insights into final data
- `twitter_archive_enhanced.csv`: file as given
- `image_predictions.tsv`: file downloaded programmatically
- `tweet_json.txt`: file constructed via API
- `twitter_archive_master.csv`: combined and cleaned data
- any additional files (e.g. files for additional pieces of gathered data or a database file for your stored clean data)

----
## Summary of outputs
- All wrangling, tidying, and visualization was performed, according to requirements, in the accompanying notebook `wrangle_act.ipynb`
- This is also available as `wrangle_act.html`
- This report is the required `wrangle_report.html`
- A separate `act_report.html` is also provided
- The original `twitter_archive_enhanced.csv` is retained
- Cleaned data is available as `twitter_archive_master.csv`
- Additionally, tidied data is supplied as `dog_types.csv`, `dog_names.csv`, and `dog_breed_predictions.csv` 

## Summary of Data Cleaning performed
- Kept only original ratings that have images
    1. Removed retweets
    2. Removed records without URLs
        - Only a URL can lead to an image
        - This was chosen in preference to keeping only records with a linked image prediction
    3. Removed some records that didn't actually have dog images
        - Specifically, those with a score of zero
        - However, kept some that didn't lead to dogs (e.g. a chicken, a person, a fan) as they were compatible with the site's humor
        - This was a proof of concept rather than an exhaustive fix
- Fixed incorrect data
    4. Removed duplicates from the `expanded_urls` column
    5. Replaced the string "None" in the `name` column and the dog type columns (`doggo`, `floofer`, `pupper`, `puppo`) with nulls (Python `None`)
    6. Removed invalid `name` values (words such as 'a', 'actually', 'all', 'an', 'by', 'getting', etc.)
    7. Added some missing `name` values
        - This was a token effort
        - A robust approach, but beyond the scope of this exercise, would be to use named entity recognition with tools like [NLTK and spaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
    8. Corrected dog scores
    9. Corrected invalid dog types where the type was not mentioned in the tweet or had been incorrectly extracted
    10. Supplied missing dog types
- Added missing columns
    11. Fetched missing data - `retweet_count` and `favorite_count` - from the Twitter API using Tweepy
        - See the Python script `get_tweets_by_id.py`
    12. Added `retweet_count` and `favorite_count` to the clean data
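
As an illustration of steps 1, 2, 5, and 6 above, here is a minimal pandas sketch. It assumes the archive marks retweets with a non-null `retweeted_status_id` column and that invalid names can be spotted because they are all lowercase; both are assumptions, and the function name is hypothetical rather than taken from the notebook.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def clean_archive(archive):
    """Apply a few of the cleaning steps to a copy of the archive."""
    # Steps 1 and 2: keep rows that are not retweets and that have a URL.
    # "retweeted_status_id" is an assumed column name.
    is_original = archive["retweeted_status_id"].isna()
    has_url = archive["expanded_urls"].notna()
    df = archive[is_original & has_url].copy()

    # Step 5: turn the literal string "None" into a real missing value.
    # mask() writes NaN, the marker pandas uses for nulls.
    text_cols = ["name"] + STAGE_COLS
    df[text_cols] = df[text_cols].mask(df[text_cols] == "None")

    # Step 6: treat all-lowercase "names" such as 'a' or 'an' as invalid.
    # This is a heuristic, not an exhaustive rule.
    looks_invalid = df["name"].map(lambda n: isinstance(n, str) and n.islower())
    df["name"] = df["name"].mask(looks_invalid)

    return df.reset_index(drop=True)
```

Filtering first and copying once keeps the later column assignments from touching the original dataframe.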

## Summary of Data Tidying performed
1. Extracted the dog type columns `doggo`, `floofer`, `pupper`, and `puppo` into a separate `df_dog_type` dataframe and exported it to `dog_types.csv`
    - Consists of `ID`, `tweet_id`, and `dog_type` columns
    - Kept the original dog type columns, but converted their data type to boolean
<br>    
2. Extracted the `name` column into a separate `df_name` dataframe and exported it to `dog_names.csv`
    - Consists of `ID`, `tweet_id`, and `name` columns
    - Kept the original `name` column, which after correction may contain more than one name
<br>   
3. Created a long-form version of the images table and exported it to `dog_breed_predictions.csv`
    - Consists of `tweet_id`, `score`, `isdog`, and `breed` columns
    - Did not alter the original images table; the long-form table can link to it as a parent table on `tweet_id` to get the image file name
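
The reshaping in steps 1 and 3 can be sketched as follows. This is an illustrative version under stated assumptions: the dog type columns are already boolean, `ID` is a simple running number, and the images table uses prediction columns named `p1`, `p1_conf`, `p1_dog` through `p3`, `p3_conf`, `p3_dog`. Those column names and both function names are assumptions, not taken from the notebook.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def melt_dog_types(df):
    """Return one row per (tweet, dog type) from boolean dog type columns."""
    long = df.melt(id_vars="tweet_id", value_vars=STAGE_COLS,
                   var_name="dog_type", value_name="present")
    # Keep only the flags that are set; tweets with no type drop out.
    long = long[long["present"].astype(bool)]
    out = (long[["tweet_id", "dog_type"]]
           .sort_values(["tweet_id", "dog_type"])
           .reset_index(drop=True))
    # Assumed: ID is a running number starting at 1.
    out.insert(0, "ID", range(1, len(out) + 1))
    return out


def predictions_to_long(images):
    """Stack the three predictions per image into one row each."""
    parts = []
    for rank in (1, 2, 3):
        # Assumed column names: p1, p1_conf, p1_dog, and so on.
        part = images[["tweet_id", f"p{rank}", f"p{rank}_conf", f"p{rank}_dog"]].copy()
        part.columns = ["tweet_id", "breed", "score", "isdog"]
        parts.append(part)
    long = pd.concat(parts, ignore_index=True)
    return long[["tweet_id", "score", "isdog", "breed"]]
```

A call such as `melt_dog_types(df).to_csv("dog_types.csv", index=False)` would then produce the exported file. A tweet that mentions two dog types yields two rows, which is the main benefit of the long form over four separate flag columns.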