# Wrangle Report

This document briefly describes the wrangling efforts for the WeRateDogs Twitter dataset in the `wrangle_act.ipynb` notebook.

## Gathering Data

The dataset was gathered through the following methods:
1. File on hand - `twitter_archive_enhanced.csv`
2. File hosted on Udacity's servers - `image_predictions.tsv`
3. Query of Twitter API - `tweet_json.txt`

### `twitter_archive_enhanced.csv`

Using the pandas library, `.csv` files were read directly into a DataFrame.
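
As a minimal sketch of this step, the snippet below reads CSV text into a DataFrame. The inline sample and its column names are illustrative assumptions standing in for the real file; with the file on disk, the call would take the path `twitter_archive_enhanced.csv` instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up sample standing in for twitter_archive_enhanced.csv (assumed columns).
csv_text = (
    "tweet_id,text,name\n"
    "1,This is Bella. 12/10,Bella\n"
    "2,This is a good boy. 13/10,a\n"
)

# With the real file: pd.read_csv("twitter_archive_enhanced.csv")
df_ae = pd.read_csv(io.StringIO(csv_text))
print(df_ae.head())
```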

### `image_predictions.tsv`

Using the requests library, the file hosted on the internet was downloaded programmatically.  Once downloaded, the pandas library was used to read the `.tsv` file into a DataFrame.
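
The download-then-read flow might look roughly like the sketch below. The helper names `download_file` and `read_predictions` are hypothetical, the URL is left as a parameter because it is not stated here, and the sample rows are made up so the parsing step can be tried offline.

```python
import io

import pandas as pd


def download_file(url, path):
    """Download `url` and write the response body to `path` (hypothetical helper)."""
    import requests  # third-party; imported here so only this helper needs it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an exception on HTTP error statuses
    with open(path, "wb") as f:
        f.write(response.content)


def read_predictions(path_or_buffer):
    """Read a tab-separated file into a DataFrame."""
    return pd.read_csv(path_or_buffer, sep="\t")


# Offline demonstration with made-up rows (column names are assumptions).
sample = "tweet_id\tjpg_url\tp1\n1\thttps://example.com/a.jpg\tgolden_retriever\n"
df_ip = read_predictions(io.StringIO(sample))
```

With network access, `download_file(url, "image_predictions.tsv")` followed by `read_predictions("image_predictions.tsv")` would cover both steps.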

### `tweet_json.txt`

Using the tweepy and json libraries, the tweets were dumped into a `.txt` file.  The text file was then read line by line to append each tweet's `tweet_id`, `favorite_count`, and `retweet_count` to a DataFrame.
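
The line-by-line read can be sketched as follows, assuming the text file holds one JSON object per line and that each tweet exposes the `id`, `favorite_count`, and `retweet_count` keys. The function name `tweets_to_frame` and the sample values are hypothetical.

```python
import io
import json

import pandas as pd


def tweets_to_frame(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append(
            {
                "tweet_id": tweet["id"],
                "favorite_count": tweet["favorite_count"],
                "retweet_count": tweet["retweet_count"],
            }
        )
    return pd.DataFrame(rows, columns=["tweet_id", "favorite_count", "retweet_count"])


# Made-up stand-in for tweet_json.txt.
sample = io.StringIO(
    '{"id": 1, "favorite_count": 100, "retweet_count": 10, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 250, "retweet_count": 40, "lang": "en"}\n'
)
df_tj = tweets_to_frame(sample)
```

With the real file, `tweets_to_frame(open("tweet_json.txt"))` would take the place of the in-memory sample.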

## Assessing Data
Once the data had been gathered into individual DataFrames, it was assessed both visually and programmatically for quality and tidiness issues.

Programmatic Methods:
- `.head()`
- `.describe()`
- `.info()`
- `.duplicated()`
- `.value_counts()`
- `.query()`
- `.sum()`
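
To illustrate how a few of these methods surface issues, here is a small sketch on a made-up DataFrame. The column names, values, and the outlier threshold of 20 are assumptions chosen for the example, not figures from the notebook.

```python
import pandas as pd

# Toy data with one duplicated row, a suspicious name, and an outlier rating.
df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 2, 3],
        "name": ["Bella", "a", "a", "Charlie"],
        "rating_numerator": [12, 13, 13, 1776],
    }
)

n_duplicates = df.duplicated().sum()          # count fully duplicated rows
name_counts = df["name"].value_counts()       # frequency of each name
outliers = df.query("rating_numerator > 20")  # rows above an assumed threshold

print(n_duplicates)
print(name_counts)
print(outliers)
```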

The quality issues were categorized by completeness, validity, accuracy, and consistency.  The tidiness issues were categorized by tidy data principles.

### Quality Issues
#### Completeness
1. `df_ae`: Missing and incorrect dog names
2. `df_ae`: Benebop Cumberfloof not identified as floofer

#### Validity
4. `df_ae`: Retweets may capture the same dog twice with a different tweet_id
5. `df_ae`: Replies do not have images
6. `df_ip`: 324 predictions where none of the top 3 predictions is a dog breed.  Sampling the data reveals turtles, fish, sloths, etc.
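
A check like the one behind issue 6 could be sketched as below. It assumes the predictions table carries one boolean flag per prediction, here named `p1_dog`, `p2_dog`, and `p3_dog`; those names and the sample rows are assumptions.

```python
import pandas as pd

# Toy predictions table; the *_dog flags mark whether a prediction is a dog breed.
df_ip = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "p1": ["golden_retriever", "box_turtle", "desk"],
        "p1_dog": [True, False, False],
        "p2_dog": [True, False, True],
        "p3_dog": [False, False, False],
    }
)

# Keep rows where none of the top three predictions is a dog breed.
flag_columns = ["p1_dog", "p2_dog", "p3_dog"]
no_dog = df_ip[~df_ip[flag_columns].any(axis=1)]
print(len(no_dog))
```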

#### Accuracy
7. `df_ae`: Rating numerator and denominator have many outliers

#### Consistency
9. `df_ae`: Timestamp column is a string
10. `df_ae`: Source displays url
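
Both consistency issues can be addressed with a couple of pandas calls, sketched below. The sample values are made up, and the sketch assumes the `source` column holds an HTML anchor tag whose visible text is the client name.

```python
import pandas as pd

# Toy rows; the formats of both columns are assumptions about the archive.
df_ae = pd.DataFrame(
    {
        "timestamp": ["2017-08-01 16:23:56 +0000"],
        "source": [
            '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
        ],
    }
)

# Convert the string timestamps to timezone-aware datetimes.
df_ae["timestamp"] = pd.to_datetime(df_ae["timestamp"], utc=True)

# Keep only the text between the opening and closing anchor tags.
df_ae["source"] = df_ae["source"].str.extract(r">([^<]+)<", expand=False)
```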

### Tidiness Issues

#### Each variable forms a column
11. `df_ae`: Four columns for stages of dog (doggo, pupper, puppo, floofer) should be one category column
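
One way to collapse the four stage columns into a single column is sketched below on a toy DataFrame. It assumes each stage column holds either its own stage name or the string "None", and it joins multiple stages with a comma; both choices are assumptions rather than the notebook's confirmed approach.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows: one doggo, one with no stage, one tagged as both doggo and pupper.
df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "doggo": ["doggo", "None", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)


def combine_stages(row):
    """Return the stage names present in a row, comma-joined, or None."""
    found = [stage for stage in stages if row[stage] == stage]
    return ",".join(found) if found else None


df["stage"] = df[stages].apply(combine_stages, axis=1).astype("category")
df = df.drop(columns=stages)
```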

#### Each observation forms a row
- N/A

#### Each type of observational unit forms a table
12. `df_ip`: Observational unit is the image prediction; `jpg_url` should be part of the `df_ae` table.
13. `df_tj`: Retweet and favorite counts should be merged into the `df_ae` table.
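
Combining the tables might look like the following sketch. The left join on `tweet_id` is an assumption (it keeps every archive row even when no image or count exists), and `df_master` is a hypothetical name for the result.

```python
import pandas as pd

# Toy versions of the three tables.
df_ae = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Bella", "Charlie", "Lucy"]})
df_ip = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "jpg_url": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
        "p1": ["pug", "beagle"],
    }
)
df_tj = pd.DataFrame(
    {"tweet_id": [1, 3], "favorite_count": [100, 300], "retweet_count": [10, 30]}
)

# Bring jpg_url and the engagement counts onto the archive table.
df_master = df_ae.merge(
    df_ip[["tweet_id", "jpg_url"]], on="tweet_id", how="left"
).merge(df_tj, on="tweet_id", how="left")
```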

## Cleaning

This section discusses some of the more involved cleaning efforts, their shortcomings, and possible improvements.

#### Issue #1: Missing and incorrect dog names.

Most of the tweets introduce the dog's name at the beginning of the tweet with "This is ...".

It appears the previous gathering efforts took note of this pattern and were able to capture most of the dogs' names by extracting the word after "This is ...".

However, if the tweet did not begin with "This is ...", the name defaulted to "None".  This explains the 745 records where the dog's name is "None".

This method also explains why the second most common dog name is "a".  For example, if the tweet began with "This is a good boy..." then the method assigned the letter "a" as the dog's name.

On further inspection, if the dog's name was lowercase, it was likely labeled incorrectly.

The cleaning effort tried to correct the dog's name by filtering for the incorrectly labeled tweets and searching the body of the text for the actual name.

In the interest of time and practicality, the notebook only includes corrections for dog names labeled as "a".

Much more can be done to correctly extract the dog names from the tweets.
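
As a rough illustration of correcting names labeled "a", the sketch below searches the tweet text for a capitalized word following "named" or "name is". That pattern, the helper `recover_name`, and the sample tweets are assumptions made for the example; the notebook's actual matching rule may differ.

```python
import re

import pandas as pd

# Hypothetical pattern: a capitalized word after "named" or "name is".
NAMED_PATTERN = re.compile(r"\b(?:named|name is)\s+([A-Z][a-z]+)")


def recover_name(text):
    """Return a name found in the tweet body, or None if no match."""
    match = NAMED_PATTERN.search(text)
    return match.group(1) if match else None


df = pd.DataFrame(
    {
        "text": [
            "This is a Labrador named Rufus. 12/10",
            "This is a good boy. 13/10",
            "This is Bella. 11/10",
        ],
        "name": ["a", "a", "Bella"],
    }
)

# Only touch rows where the name was mislabeled as "a".
mask = df["name"] == "a"
df.loc[mask, "name"] = df.loc[mask, "text"].apply(recover_name)
```

Rows with no recoverable name end up missing rather than keeping the misleading "a".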

#### Issue #3:  Extracting nested dictionaries/lists from JSON creates messy data.

The JSON files from Twitter are complex and include nested dictionaries/lists.  While trying to convert these complex JSON files into a DataFrame, issues arose as some nested dictionaries have the same key.

Attempts to flatten or normalize the JSON produced many empty columns and Series of lists that proved difficult to work with.

To get around this issue, only the columns of interest were extracted.

Additional insights may be derived from appropriate handling of the Twitter API JSON files.
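
The contrast between flattening everything and extracting only the columns of interest can be seen in this small sketch. The tweet dictionary is a made-up, heavily trimmed stand-in for a real API response, and `pd.json_normalize` is used here only to show the kind of output described above.

```python
import pandas as pd

# Made-up tweet with nested dictionaries and lists; note the repeated "id" key.
tweet = {
    "id": 1,
    "favorite_count": 5,
    "retweet_count": 2,
    "user": {"id": 99, "name": "example_user"},
    "entities": {"hashtags": [], "urls": [{"url": "https://example.com"}]},
}

# Flattening keeps nested keys as dotted columns and leaves lists as cell values.
flat = pd.json_normalize([tweet])
print(list(flat.columns))

# Extracting only the keys of interest sidesteps the nesting entirely.
wanted = ["id", "favorite_count", "retweet_count"]
slim = pd.DataFrame([{key: tweet[key] for key in wanted}])
```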

#### Issue #6: Top 3 predictions are not dog breeds.

In most cases where none of the top three predictions is a dog breed, the image did not contain a dog.

However, there are some instances where a dog is in a busy photo and a dog breed is not predicted.

For example, in one photo the dog was photographed from behind, with its face visible only in the reflection of a computer monitor.  The top three predictions were all for items on the desk.

Retraining the model may provide more accurate breed predictions.