## Wrangle Report

### Introduction

In this project we study data from **WeRateDogs**, a Twitter account that rates people's dogs and adds a funny comment about each one. We use data wrangling to prepare this data for analysis.
The wrangling steps are:
- Gathering Data
- Assessing Data
- Cleaning Data

### Gathering

In this step we gather the three datasets needed for the project. Each one is gathered in a different way.
- The **WeRateDogs** Twitter archive is downloaded manually by following the link to the <a href="https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv" target="_blank">twitter_archive_enhanced.csv</a> file.
- The tweet image predictions file (**image_predictions.tsv**) records what is present in each tweet's image according to a neural network. The file is hosted on the Udacity servers and is downloaded programmatically using the Requests library from the following URL: <a href="https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv" target="_blank">https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv</a>
- Each tweet's retweet count and favorite ("like") count, at a minimum, plus any additional data of interest, come from the **Twitter API**. Using the tweet IDs in the **WeRateDogs Twitter archive**, we query the API for each tweet's **JSON** data with Python's Tweepy library and store each tweet's entire set of JSON data in a file called **tweet_json.txt**.
- Finally, we import the data into our programming environment (**Jupyter Notebook**).
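
The last two gathering steps can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes that **tweet_json.txt** holds one JSON object per line (the report does not state the layout), and the function name `parse_tweet_json` and the sample lines are made up for the example.

```python
import json


def parse_tweet_json(lines):
    """Extract the tweet ID, retweet count, and favorite count from JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Made-up sample standing in for the contents of tweet_json.txt
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}',
    '',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}',
]
tweet_records = parse_tweet_json(sample_lines)
```

With the real file, the same function could be given an open file handle (`with open("tweet_json.txt") as f: parse_tweet_json(f)`), and the resulting list of dictionaries passed to `pd.DataFrame`.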

### Assessing

After gathering all three pieces of data, we assess them visually and programmatically, then detect and document quality and tidiness issues.

Two types of assessment are used:

**Visual assessment:** Display the data in the Jupyter Notebook and inspect it by eye.

**Programmatic assessment:** Use pandas functions and methods to assess the data.
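
As an illustration of programmatic assessment, the sketch below runs a few common pandas checks on a tiny made-up DataFrame. The rows and the helper name `assess` are hypothetical; only the column names are taken from the archive dataset.

```python
import pandas as pd

# Made-up rows imitating a few columns of the Twitter archive
archive = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "name": ["Phineas", "None", "None", "a"],
    "rating_denominator": [10, 10, 10, 0],
})


def assess(df):
    """Collect a few summary checks into a dictionary."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "name_counts": df["name"].value_counts().to_dict(),
        "zero_denominators": int((df["rating_denominator"] == 0).sum()),
    }


summary = assess(archive)
```

In a notebook, calls such as `archive.info()`, `archive.describe()`, and `archive.sample(5)` serve the same purpose interactively.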

### Quality issues (Completeness, Validity, Accuracy, Consistency)
#### Twitter Archive Dataset ####
- The columns *in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp* relate to replies and retweets. These columns will eventually be removed, since the analysis should contain no retweet information.

- Data type for *timestamp* should be a datetime and not a string object.

- The *doggo, floofer, pupper, puppo* columns contain the invalid value "None". These values need to be converted to NaN.

- The columns *doggo, floofer, pupper, puppo* are values of a single variable, the dog stage. They need to be stored as values in a new "Dog Stage" column, and the four original columns removed.

- The *name* column contains the invalid value "None"; replace it with NaN.

- The *rating_numerator* is equal to zero for the following records:
 - tweet_id: 835152434251116546
 - tweet_id: 746906459439529985

- The *rating_denominator* is equal to zero for the following record:
 - tweet_id: 835246439529840640. Note: this is a retweet, so it will be removed when retweets are dropped.
 
- Create a new rating column from the numerator and denominator so that ratings are on a common scale.

- Change the data type of *rating_numerator* and *rating_denominator* from int to float so that decimal values are shown.

- Remove all values in the *name* column that are entirely lowercase (e.g. a, an, actually, the), since these are ordinary words rather than names.
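
Several of the archive issues listed above could be addressed along the following lines. This is a sketch on made-up rows, not the notebook's actual code. Two choices are assumptions: joining multiple stages with a comma when a tweet has more than one, and naming the new columns `dog_stage` and `rating`.

```python
import numpy as np
import pandas as pd

# Made-up rows using the archive's column names
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000", "2017-07-29 16:00:24 +0000"],
    "retweeted_status_id": [np.nan, np.nan, 99.0, np.nan],
    "name": ["Phineas", "a", "Tilly", "None"],
    "doggo": ["None", "doggo", "None", "None"],
    "floofer": ["None"] * 4,
    "pupper": ["None", "None", "None", "pupper"],
    "puppo": ["None"] * 4,
    "rating_numerator": [13, 12, 11, 10],
    "rating_denominator": [10, 10, 10, 10],
})

# Keep original tweets only, then drop the retweet column
clean = archive[archive["retweeted_status_id"].isnull()].copy()
clean = clean.drop(columns=["retweeted_status_id"])

# Convert timestamp strings to datetimes
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Turn "None" into NaN and collapse the four stage columns into one
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
stages = clean[stage_cols].mask(clean[stage_cols] == "None")
clean["dog_stage"] = stages.apply(
    lambda row: ", ".join(row.dropna()) or np.nan, axis=1
)
clean = clean.drop(columns=stage_cols)

# Replace "None" and all-lowercase words in the name column with NaN
invalid_name = clean["name"].str.islower() | (clean["name"] == "None")
clean["name"] = clean["name"].mask(invalid_name)

# Store ratings as floats and add a combined rating column
clean["rating_numerator"] = clean["rating_numerator"].astype(float)
clean["rating_denominator"] = clean["rating_denominator"].astype(float)
clean["rating"] = clean["rating_numerator"] / clean["rating_denominator"]
```

Rows with a zero denominator would produce an infinite rating, so they should be removed or corrected before the division.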


#### Image Prediction Dataset ####
- The column names *p1, p2, p3* are not descriptive. They need clearer, capitalized names.

- Some *jpg_url* values are duplicated across tweet_ids. The duplicates need to be removed.

- Remove the unneeded *img_num* column.

- Values in the prediction columns mix lowercase and uppercase letters inconsistently.

- Data type for *tweet_id* should be a string object and not an int (number).
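
A possible way to handle the image prediction issues is sketched below on made-up rows. The new column names (`Prediction_1` and so on) are hypothetical, and converting the prediction values to lowercase is an assumption; the report only says the casing is inconsistent.

```python
import pandas as pd

# Made-up rows using the image prediction column names
predictions = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "jpg_url": ["https://example.com/a.jpg",
                "https://example.com/a.jpg",
                "https://example.com/b.jpg"],
    "img_num": [1, 1, 2],
    "p1": ["golden_retriever", "golden_retriever", "Chihuahua"],
    "p2": ["Labrador_retriever", "Labrador_retriever", "toy_terrier"],
    "p3": ["kuvasz", "kuvasz", "Pomeranian"],
})

clean = predictions.copy()

# Keep the first tweet for each image URL
clean = clean.drop_duplicates(subset="jpg_url", keep="first")

# Drop the unneeded column
clean = clean.drop(columns=["img_num"])

# Give the prediction columns descriptive names (hypothetical names)
new_names = {"p1": "Prediction_1", "p2": "Prediction_2", "p3": "Prediction_3"}
clean = clean.rename(columns=new_names)

# Make the casing of the predicted labels consistent
for col in new_names.values():
    clean[col] = clean[col].str.lower()

# Treat tweet IDs as identifiers, not numbers
clean["tweet_id"] = clean["tweet_id"].astype(str)
```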


#### Tweet (JSON) Dataset ####
- Rename the *id* column to *tweet_id* to match the other datasets.

- Change the data type of the *tweet_id* column from int to string object.

- Most of the columns aren't needed for the analysis and will be removed.
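
The three tweet JSON fixes are short enough to show together. The sketch below uses made-up rows, and keeping only the retweet and favorite counts is an assumption about which columns the analysis needs.

```python
import pandas as pd

# Made-up rows standing in for the parsed tweet JSON data
tweets = pd.DataFrame({
    "id": [1, 2],
    "retweet_count": [10, 3],
    "favorite_count": [25, 7],
    "lang": ["en", "en"],
    "truncated": [False, False],
})

clean = tweets.copy()

# Match the key column name used by the other datasets
clean = clean.rename(columns={"id": "tweet_id"})

# Treat tweet IDs as identifiers, not numbers
clean["tweet_id"] = clean["tweet_id"].astype(str)

# Keep only the columns needed for the analysis
clean = clean[["tweet_id", "retweet_count", "favorite_count"]]
```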

### Cleaning

As part of the cleaning process we address all the issues documented during assessment. We create and keep the columns needed for the analysis and drop the rest. The result is a cleaned dataset suitable for analysis.

The following items are completed in this step:

- Before cleaning, make a copy of the original data.
- During cleaning, use the define-code-test framework and document it clearly.
- Merge the individual pieces of data according to the rules of tidy data.
- The result should be a high-quality master DataFrame.
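
The define-code-test pattern applied to the merge step might look like the sketch below. The three small DataFrames are made up, and the use of inner joins (keeping only tweets present in all three datasets) is an assumption rather than something the report specifies.

```python
import pandas as pd

# Made-up cleaned datasets sharing a string tweet_id key
archive_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating": [1.3, 1.2, 1.1],
})
predictions_clean = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "Prediction_1": ["golden_retriever", "chihuahua"],
})
tweets_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweet_count": [10, 3, 5],
    "favorite_count": [25, 7, 9],
})

# Define: combine the three datasets into one master DataFrame on tweet_id,
# keeping only tweets that appear in all three.

# Code
master = (
    archive_clean
    .merge(tweets_clean, on="tweet_id", how="inner")
    .merge(predictions_clean, on="tweet_id", how="inner")
)

# Test
assert master["tweet_id"].is_unique
assert master.isnull().sum().sum() == 0
```

Because the originals were copied before cleaning, a failed test can be investigated by comparing `master` against the untouched source DataFrames.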

### Conclusion

Data wrangling is a core skill for everyone who works with data, since so much of the world's data isn't clean. In this project we covered every stage of the wrangling process: gathering, assessing, and cleaning the data. These steps are essential for turning dirty and messy data into a basis for high-quality analysis.