# Wrangle Report

## Introduction

The purpose of this project is to put into practice what I learned about data wrangling.

The dataset for this project comes from WeRateDogs, a Twitter account that rates people's dogs and adds a humorous comment about each dog. These ratings almost always have a denominator of 10.

My goal in this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations through the following steps:
- Gathering Data
- Assessing Data
- Cleaning Data

## Gathering Data

The data for this project was provided in three different ways:

- The WeRateDogs Twitter archive, `twitter_archive_enhanced.csv`, which Udacity provided for manual download.
- The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. The file `image_predictions.tsv` is hosted on Udacity's servers at https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv.
- Additional information collected directly from Twitter via its API, to provide more data for analysis.


The `twitter_archive_enhanced.csv` file had to be downloaded manually from the Udacity platform, saved to the local project folder, and imported into the Jupyter Notebook using the Python pandas library.

The `image_predictions.tsv` file was downloaded programmatically from within the Jupyter Notebook using the Python requests library, saved locally, and then imported with pandas.
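The programmatic download described above can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: the helper names `download_file` and `load_tsv` and the 30-second timeout are assumptions made for the example.

```python
from pathlib import Path

import pandas as pd
import requests

IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url: str, dest: Path) -> Path:
    """Download `url` and save the raw bytes to `dest` (hypothetical helper)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors instead of saving an error page
    dest.write_bytes(response.content)
    return dest


def load_tsv(path: Path) -> pd.DataFrame:
    """Read a tab-separated file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path, sep="\t")


# Typical use inside the notebook (performs a network request):
# path = download_file(IMAGE_PREDICTIONS_URL, Path("image_predictions.tsv"))
# df_image_prediction = load_tsv(path)
```

Keeping the download and the parsing in separate functions means the file only has to be fetched once; later notebook runs can call `load_tsv` on the local copy.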

Additional information about the tweets was acquired through the **Twitter API**, using Python's **Tweepy** library. For this I had to register at https://developer.twitter.com/ to obtain access tokens.

The data collection via the API was performed inside the Jupyter Notebook, and the results were stored in a file in **JSON** format. The run took 1943.86 seconds (about 32 minutes); 2331 tweets were retrieved successfully and 25 failed.
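A collection loop of this kind, which records successes, failures and elapsed time, might look like the sketch below. It is an illustration under assumptions, not the notebook's code: `collect_tweets`, `load_tweets`, the one-JSON-object-per-line layout and the `fetch` callable are all hypothetical. With Tweepy, `fetch` could wrap something like `api.get_status(tweet_id, tweet_mode="extended")._json`; the exception is caught broadly here because the error class differs between Tweepy versions.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterable

import pandas as pd


def collect_tweets(
    tweet_ids: Iterable[int],
    fetch: Callable[[int], dict],
    out_path: Path,
) -> tuple[int, list[int], float]:
    """Fetch each tweet, write one JSON object per line, and track failures.

    `fetch` is a hypothetical callable that returns the tweet as a dict
    or raises if the tweet is unavailable (deleted, protected, ...).
    """
    start = time.perf_counter()
    failed: list[int] = []
    saved = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for tweet_id in tweet_ids:
            try:
                tweet = fetch(tweet_id)
            except Exception:  # deliberately broad: error class depends on the client library
                failed.append(tweet_id)
                continue
            fh.write(json.dumps(tweet) + "\n")
            saved += 1
    elapsed = time.perf_counter() - start
    return saved, failed, elapsed


def load_tweets(path: Path) -> pd.DataFrame:
    """Load the line-delimited JSON file back into a DataFrame."""
    return pd.read_json(path, lines=True)
```

Writing each tweet to disk as soon as it arrives means a long run that is interrupted still leaves a usable partial file, and the list of failed ids makes the failures easy to inspect afterwards.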

## Assessing Data

After gathering the data and storing it in DataFrames in the Jupyter Notebook, I started assessing the data to identify possible problems with it.

I first performed a visual assessment, but it was difficult to draw solid conclusions this way because the datasets were large.

I then assessed the data programmatically, using functions from the pandas library to identify possible problems.

From this assessment I identified the problems below, which need to be addressed:
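As a rough illustration of what such programmatic checks can look like, the sketch below bundles a few pandas calls into one helper. The `assess` function and the four-row `sample` frame are assumptions made for the example (the sample is invented, not taken from the real archive); in the notebook, methods such as `df.info()`, `df.describe()` and `Series.value_counts()` serve the same purpose interactively.

```python
import pandas as pd


def assess(df: pd.DataFrame) -> dict:
    """Collect a few programmatic checks into one dictionary (hypothetical helper)."""
    names = df["name"]
    return {
        # Missing values per column.
        "null_counts": df.isnull().sum(),
        # Placeholder names: the literal string "None" or all-lowercase words such as "a".
        "suspicious_names": names[(names == "None") | names.str.islower()].tolist(),
        # Tweets whose rating denominator is not the usual 10.
        "odd_denominators": df.loc[df["rating_denominator"] != 10, "tweet_id"].tolist(),
        # True only if the column already holds datetimes rather than text.
        "timestamp_is_datetime": pd.api.types.is_datetime64_any_dtype(df["timestamp"]),
        # Number of fully duplicated rows.
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny invented sample that mimics the shape of the archive.
sample = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-08-01 00:17:27 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
        ],
        "name": ["Phineas", "None", "a", "Tilly"],
        "rating_numerator": [13, 12, 84, 10],
        "rating_denominator": [10, 10, 70, 10],
        "retweeted_status_id": [float("nan"), float("nan"), float("nan"), 8.9e17],
    }
)

report = assess(sample)
print(report["suspicious_names"])
print(report["odd_denominators"])
```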

### Quality Issues
- df_twitter_archive:
    - Column `name`: some dogs have 'None' or 'a' as a name.
    - Erroneous datatype (`timestamp`).
    - Null values in columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` and `expanded_urls`.
    - Inconsistent values in columns `rating_denominator` and `rating_numerator`.
    - Nulls represented as the string `None` (`doggo`, `floofer`, `pupper` and `puppo`).
    - Non-null values in column `retweeted_status_id` indicate that the row is a retweet.
    - Null values in columns `retweet_count` and `favorite_count`.
- df_image_prediction:
    - Duplicated values in column `jpg_url` indicate retweets.
    - Some images were classified as something other than a dog.

### Tidiness Issues
- df_twitter_archive:
    - The last four columns all relate to the same variable (`doggo`, `floofer`, `pupper`, `puppo`).
- df_image_prediction:
    - This data set is part of the same observational unit as the data in the df_twitter_archive.
- df_tweets_info_additional:
    - The id column has a different name from the one used in the other tables.
    - This data set is part of the same observational unit as the data in the df_twitter_archive.
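The tidiness issues above point toward a single table keyed by tweet id. One way this could be done is sketched below. The column names `tweet_id` and `id`, the helper `combine_tables`, and the choice of left joins are assumptions for illustration; the report does not state which join type was used.

```python
import pandas as pd


def combine_tables(
    archive: pd.DataFrame,
    predictions: pd.DataFrame,
    api_info: pd.DataFrame,
) -> pd.DataFrame:
    """Join the three tables on the tweet id (hypothetical helper).

    Assumes the archive and predictions use `tweet_id` while the API
    data uses `id`, so the latter is renamed before merging.
    """
    api_info = api_info.rename(columns={"id": "tweet_id"})
    # Left joins keep every archive tweet, even if it has no prediction or API record.
    merged = archive.merge(predictions, on="tweet_id", how="left")
    merged = merged.merge(api_info, on="tweet_id", how="left")
    return merged
```

An inner join would instead keep only tweets present in all three sources, which may be preferable when every analysis needs an image prediction.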

## Cleaning Data

After identifying the problems that needed to be addressed, I moved on to the data cleaning phase to make the necessary corrections.

In this step I used the process below for each problem:
- Define: Describe what needs to be done to resolve the problem.
- Code: Write and run the code to correct the problem.
- Test: Verify that the proposed solution worked.
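To make the Define, Code, Test cycle concrete, here is a minimal sketch for one of the tidiness issues: the four dog stage columns. It is an illustration only. The function `collapse_stages`, the new column name `dog_stage`, and the invented three-row sample are assumptions, and the actual notebook may have solved the problem differently.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def collapse_stages(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the four stage columns with a single `dog_stage` column."""
    clean = df.copy()
    # Join whichever stage names are present in a row; the string "None" means missing.
    joined = clean[STAGE_COLUMNS].apply(
        lambda row: ",".join(value for value in row if value != "None"), axis=1
    )
    # Rows with no stage at all become missing values instead of empty strings.
    clean["dog_stage"] = joined.where(joined != "")
    return clean.drop(columns=STAGE_COLUMNS)


# Define: the stage columns hold one variable, so collapse them into `dog_stage`.
# Invented sample data, not taken from the real archive.
sample = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "doggo": ["doggo", "None", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)

# Code: apply the fix.
cleaned = collapse_stages(sample)

# Test: the old columns are gone and the new column holds the expected values.
assert not set(STAGE_COLUMNS) & set(cleaned.columns)
assert cleaned["dog_stage"].iloc[0] == "doggo"
assert pd.isna(cleaned["dog_stage"].iloc[1])
assert cleaned["dog_stage"].iloc[2] == "doggo,pupper"
```

Rows that carry more than one stage are kept as a comma-separated value here so that no information is lost; deciding how to treat them is a separate cleaning decision.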