# Data Wrangling Report - WeRateDogs

In this project we completed the three core tasks of the data wrangling process:
* Gathering
* Assessment
* Cleaning

## Gathering

Data was obtained from multiple sources:
* The tweets archive, from a csv file provided by the Udacity course.
* Dog photos labeled by a machine learning algorithm, available as a csv file uploaded to an accessible URL.
* Additional information for each tweet (retweets, likes, comments, etc.), from the Twitter API.

To retrieve all the data, several libraries were used: pandas, requests and tweepy.
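
The gathering step could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual code: the helper names, the one-JSON-object-per-line layout for the stored Tweepy responses, and the toy sample are all hypothetical. The tweepy call itself is left out because it needs credentials.

```python
import io
import json

import pandas as pd
import requests


def download_text(url: str, timeout: float = 30.0) -> str:
    """Fetch a text file over HTTP and return its body (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def read_remote_csv(url: str) -> pd.DataFrame:
    """Download a csv file and load it into a data frame."""
    return pd.read_csv(io.StringIO(download_text(url)))


def tweets_from_json_lines(lines) -> pd.DataFrame:
    """Build a data frame from tweet JSON objects stored one per line.

    Assumes each tweet was saved as returned by the API in extended mode,
    so the text lives under 'full_text'.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
            "full_text": tweet.get("full_text", ""),
        })
    columns = ["tweet_id", "retweet_count", "favorite_count", "full_text"]
    return pd.DataFrame(rows, columns=columns)


# Toy stand-in for the stored API responses (made-up values).
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "full_text": "13/10 good dog"}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 8, "full_text": "12/10"}',
]
tweepy_tweets = tweets_from_json_lines(sample_lines)
```

Keeping the download and the parsing in separate functions means the parsing can be checked on a small sample without touching the network.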

## Assessment


Once all the data was collected, a visual and programmatic assessment was performed on each dataset in order to identify **quality** and **tidiness** issues.

### Quality Issues

**Tweets archive**
* Some tweets couldn't be retrieved with Tweepy because they were not found, so we will have to drop them.
* Tweets posted after August 1st, 2017 have no image predictions, so we will drop them because we want to perform a full exploratory analysis using both datasets.
* Timestamp in 'tweets_archive' is of type string and should be DateTime so that later operations, such as computing the time elapsed between timestamps, are possible.
* Doggo, Floofer, Pupper and Puppo are strings and should be either Categorical or Boolean, since their values come from a fixed set of possible values and are not free-form text.
* The Doggo, Floofer, Pupper and Puppo columns contain values that differ from their own dog stage; for instance, under the Doggo column we could find the value Floofer.
* Some tweets have non-null values in retweeted_status_id, which means they are retweets, so we shouldn't take them into account in order to avoid multiple ratings for the same dog.
* Some rating_denominator values differ from 10, which suggests that the ratings were not extracted correctly from the tweet text; we should check them against the tweet text obtained from Tweepy in order to fix this.
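
Several of the issues above can be detected programmatically. The sketch below runs on a toy frame with made-up values; it assumes that an unset dog stage is stored as the string "None", which is an assumption and not something stated in this report.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy stand-in for the tweets archive (made-up values).
tweets_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-07-01 10:00:00 +0000"] * 4,
    "retweeted_status_id": [None, 99.0, None, None],
    "rating_denominator": [10, 10, 50, 10],
    "doggo": ["doggo", "None", "None", "floofer"],  # last value is misplaced
    "floofer": ["None"] * 4,
    "pupper": ["None", "pupper", "None", "None"],
    "puppo": ["None"] * 4,
})

# Is the timestamp column already a datetime type?
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(
    tweets_archive["timestamp"]
)

# How many rows are retweets (non-null retweeted_status_id)?
retweet_count = int(tweets_archive["retweeted_status_id"].notna().sum())

# Which rows have a denominator other than 10?
odd_denominators = tweets_archive[tweets_archive["rating_denominator"] != 10]

# Which values in each stage column are neither the column's own stage nor "None"?
misplaced = {
    col: sorted(set(tweets_archive[col]) - {col, "None"})
    for col in stage_columns
}
```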

**Image predictions**
* The P1, P2 and P3 prediction labels should be Categorical type instead of string type.
* Some dog image predictions don't contain a breed type.
* Some dog image predictions come from retweets, so multiple predictions are found for the same image_url.
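
A minimal sketch of how the last two issues might be checked, on a toy frame with made-up values. It assumes the image URL is stored in a column called 'jpg_url' and that 'p1_dog', 'p2_dog' and 'p3_dog' are Boolean flags, as the cleaning steps later in this report suggest.

```python
import pandas as pd

# Toy stand-in for the image predictions table (made-up values).
image_predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
    "p1_dog": [True, False, True],
    "p2_dog": [True, False, False],
    "p3_dog": [False, False, True],
})

dog_flags = ["p1_dog", "p2_dog", "p3_dog"]

# Rows where none of the three predictions is a dog breed.
no_breed = ~image_predictions[dog_flags].any(axis=1)

# Rows whose image URL appears more than once (keep=False marks every copy).
duplicated_urls = image_predictions["jpg_url"].duplicated(keep=False)
```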

**All tables**
* Tweets are lost at several stages: some are missing from Tweepy because of retrieval errors, some image predictions are discarded because the image was not predicted to be a dog, and some tweets are dropped because their denominator values are incongruent. We therefore have to ensure that every tweet ID is present in all the final data frames.

### Tidiness Issues
* Dog stages are spread across several columns; a single column named "dog_stage" of Categorical type would be more suitable.
* We should use only two tables: one for the tweet itself (id, created_at, likes and retweets) and another for the data related to each dog (image prediction label, dog_stage, dog_name and rating).

## Cleaning
After identifying the data issues, a list of fixes was written before proceeding to the cleaning process:

### Quality
* Drop tweets from 'tweet_archive' that were not successfully retrieved from the Twitter API (tweepy_tweets).
* Change the data type of the 'timestamp' column of 'tweet_archive' to a datetime type.
* Drop tweets beyond August 1st, 2017.
* Set 'doggo', 'floofer', 'pupper' and 'puppo' columns from 'tweet_archive' as Categorical Type.
* Change dog stage value to None in those cases where dog stage is set in the wrong column.
* Drop retweets from 'tweet_archive'.
* Set 'p1', 'p2' and 'p3' columns from 'image_predictions' to Categorical Type.
* Drop non-breed predictions from 'image_predictions', that is, rows where 'p1_dog', 'p2_dog' **and** 'p3_dog' are all **False**.
* Drop from 'image_predictions' the predictions that come from a retweet or from tweets that couldn't be retrieved with tweepy.
* Reevaluate 'rating_numerator' and 'rating_denominator' by parsing the 'full_text' column of 'tweepy_tweets', and drop the tweets with a denominator above 10.
* Drop tweets that are not included in all datasets.
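
Some of the quality steps above can be sketched as follows. This is an illustration on made-up rows, not the project's actual code. The regular expression is an assumed pattern for ratings of the form "13/10", and the cutoff keeps everything up to the end of August 1st, 2017, which is one reading of "beyond August 1st".

```python
import pandas as pd

# Toy rows standing in for the archive joined with the Tweepy text (made-up values).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "timestamp": [
        "2017-07-30 15:58:51 +0000",
        "2017-08-05 00:00:00 +0000",  # after the cutoff
        "2017-06-01 09:00:00 +0000",
        "2017-05-01 09:00:00 +0000",
        "2017-08-01 12:00:00 +0000",
    ],
    "retweeted_status_id": [None, None, 7.0, None, None],  # row 3 is a retweet
    "full_text": [
        "This is Phineas. 13/10",
        "12/10 pupper",
        "RT 11/10",
        "84/70 for the whole pack",  # denominator above 10
        "Meet Bella. She's 12/10",
    ],
})

clean = archive.copy()

# Convert the timestamp strings to timezone-aware datetimes.
clean["timestamp"] = pd.to_datetime(clean["timestamp"], utc=True)

# Drop tweets after August 1st, 2017.
cutoff = pd.Timestamp("2017-08-02", tz="UTC")
clean = clean[clean["timestamp"] < cutoff].copy()

# Drop retweets: rows with a non-null retweeted_status_id.
clean = clean[clean["retweeted_status_id"].isna()].copy()

# Re-parse the rating from the tweet text (first "number/number" match).
ratings = clean["full_text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
clean["rating_numerator"] = pd.to_numeric(ratings[0])
clean["rating_denominator"] = pd.to_numeric(ratings[1])

# Drop tweets whose denominator is above 10.
clean = clean[clean["rating_denominator"] <= 10].copy()
```

On this toy data only tweets 1 and 5 survive all the filters.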

### Tidiness
* Melt dog stage ('doggo', 'floofer', 'pupper', 'puppo' columns) into a single column named 'dog_stage'.
* Build a single data frame named tweets_master containing the 'tweet_id', 'text', 'created_at', 'favorite_count' and 'retweet_count' columns.
* Build a single data frame named dog_metrics_master containing the 'tweet_id', 'dog_name', 'dog_stage', 'rating_numerator', 'rating_denominator', 'p1', 'p1_conf', 'p2', 'p2_conf', 'p3', 'p3_conf' and 'jpg_url' columns.
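
The melt step can be sketched as below, on made-up rows. It assumes an unset stage is stored as the string "None", and it keeps only the first stage when a tweet has more than one, which is a simplification chosen for this example. Keeping only values that match their own column also discards stages recorded under the wrong column.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy stand-in for the archive's stage columns (made-up values).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

# Unpivot the four stage columns into (tweet_id, stage_column, dog_stage) rows.
melted = archive.melt(
    id_vars="tweet_id",
    value_vars=stage_columns,
    var_name="stage_column",
    value_name="dog_stage",
)

# Keep only rows where the value matches the column it came from.
stages = melted.loc[
    melted["dog_stage"] == melted["stage_column"], ["tweet_id", "dog_stage"]
].drop_duplicates("tweet_id")

# Re-attach to every tweet; tweets without a stage get a missing value.
dog_stage = archive[["tweet_id"]].merge(stages, on="tweet_id", how="left")
dog_stage["dog_stage"] = pd.Categorical(
    dog_stage["dog_stage"], categories=stage_columns
)
```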