## Reporting: wrangle_report


## Introduction

The dataset I wrangled (and analyzed and visualized) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog.


## Processes

I imported the libraries needed for the wrangling process, such as *pandas, numpy, requests, json and matplotlib*, and then wrangled the data with this workflow:

- Data gathering
- Data assessment
- Data cleaning
- Data storage
- Data analysis and visualization

### Data Gathering

The data was gathered from multiple sources:
- A .csv file (read using pd.read_csv)
- A file downloaded from a URL (using the requests library)
- A tsv file, which was read line by line as JSON because I couldn't access the Twitter API
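
The line-by-line JSON read is the least standard of the three steps, so here is a minimal sketch of it. It assumes each line holds one tweet as a JSON object with the keys `id`, `retweet_count` and `favorite_count`. The function name `read_tweet_json` and the sample values are hypothetical and only illustrate the idea.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Parse one JSON object per line into a DataFrame of tweet counts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        # Assumed keys: id, retweet_count, favorite_count
        records.append(
            {
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    columns = ["tweet_id", "retweet_count", "favorite_count"]
    return pd.DataFrame(records, columns=columns)


# Made-up sample standing in for the real file contents.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweets = read_tweet_json(sample)
```

With a real file, the same function can be given the open file handle, since iterating over a file yields one line at a time.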

### Data Assessment

The data was assessed both visually and programmatically and the following issues were identified.

#### Quality issues
Archive table
1. in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp are mostly null, and I don't need them because they only describe replies and retweets.


2. Several values in the name column are not actually dog names


3. The name, doggo, floofer, pupper and puppo columns have missing values that pandas does not read as null because they hold the filler string 'None'


4. ID fields should be strings


5. timestamp fields should be datetime


6. The dog breeds can be standardized


7. Ratings with decimals were incorrectly extracted


8. Incorrect ratings


#### Tidiness issues
1. The dog stages are in separate columns

2. Retweet count, favorite count and dog breed are not part of the archive table
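
Programmatic checks that surface issues like these can be sketched as follows. The small `archive` frame is made up for illustration. Treating all-lowercase values in the name column as non-names is an assumed heuristic, not necessarily the check used in this project.

```python
import pandas as pd

# Tiny made-up frame that mimics some of the archive table's problems.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "name": ["Phineas", "a", "None"],
        "doggo": ["None", "doggo", "None"],
        "retweeted_status_id": [None, None, 9.0],
    }
)

# Column dtypes and non-null counts (shows IDs stored as numbers).
archive.info()

# Real nulls per column (shows the mostly-null retweet column).
null_counts = archive.isnull().sum()

# The filler string 'None' is not counted as null, so count it separately.
filler_counts = (archive[["name", "doggo"]] == "None").sum()

# Assumed heuristic: all-lowercase "names" such as 'a' are not dog names.
lowercase_names = archive.loc[archive["name"].str.islower(), "name"]
```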


### Data Cleaning

All the issues documented above were cleaned programmatically:

1. I first dropped the rows that are replies and retweets, then removed in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp from the archive table using the .drop method


2. I changed non-dog names to null


3. I replaced 'None' in name, doggo, floofer, pupper and puppo with NaN


4. I changed tweet_id to str using the .astype method


5. I changed timestamp to datetime using the .astype method


6. I standardized the dog breed column using .str.lower() to turn dog_breed into lowercase letters


7. I correctly extracted ratings with decimals from the text using .str.extract with a regex pattern and assigned them to the rating numerator and denominator


8. I updated incorrect ratings
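
Several of these steps can be sketched on a made-up frame, as below. The sample rows, the column names `rating_numerator` and `rating_denominator`, and the regex pattern are assumptions for illustration. The sketch also uses `pd.to_datetime` for the timestamp, which is a common alternative to the .astype call named in step 5.

```python
import numpy as np
import pandas as pd

# Made-up rows; the third one stands in for a retweet.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "timestamp": [
            "2017-08-01 16:23:56",
            "2017-07-30 15:58:51",
            "2017-07-29 00:08:17",
        ],
        "text": [
            "This is Bella. 13.5/10 would pet",
            "Meet Rex. 12/10",
            "RT old tweet 11/10",
        ],
        "retweeted_status_id": [np.nan, np.nan, 99.0],
        "name": ["Bella", "Rex", "None"],
        "doggo": ["None", "doggo", "None"],
    }
)

# Step 1: keep only original tweets, then drop the now-empty column.
clean = archive[archive["retweeted_status_id"].isna()].copy()
clean = clean.drop(columns=["retweeted_status_id"])

# Step 3: turn the filler string 'None' into a real missing value.
clean[["name", "doggo"]] = clean[["name", "doggo"]].replace("None", np.nan)

# Steps 4 and 5: fix the data types.
clean["tweet_id"] = clean["tweet_id"].astype(str)
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Step 7: capture an optional decimal part in the numerator.
ratings = clean["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
clean["rating_numerator"] = ratings[0].astype(float)
clean["rating_denominator"] = ratings[1].astype(float)
```

A pattern without the optional decimal group would match only the `5/10` part of `13.5/10`, which is the kind of extraction error step 7 corrects.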

#### Tidiness issues
1. I put the dog stages in one column using the .join method

2. I merged the archive, tweets and images tables
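
A minimal sketch of both tidiness fixes, under stated assumptions: the combined column name `dog_stage`, the helper `join_stages`, the inner joins on `tweet_id` and all sample values are hypothetical. Here "join" is read as joining the stage strings of each row, which may differ from the exact call used in the project.

```python
import numpy as np
import pandas as pd

archive = pd.DataFrame(
    {
        "tweet_id": ["1", "2", "3"],
        "doggo": [np.nan, "doggo", "doggo"],
        "floofer": [np.nan, np.nan, np.nan],
        "pupper": [np.nan, np.nan, "pupper"],
        "puppo": [np.nan, np.nan, np.nan],
    }
)
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def join_stages(row):
    """Join the non-missing stages of one row; NaN if there are none."""
    stages = [stage for stage in row if pd.notna(stage)]
    return ", ".join(stages) if stages else np.nan


# One dog_stage column replaces the four separate stage columns.
archive["dog_stage"] = archive[stage_cols].apply(join_stages, axis=1)
archive = archive.drop(columns=stage_cols)

# Made-up stand-ins for the tweets and images tables.
tweets = pd.DataFrame(
    {
        "tweet_id": ["1", "2", "3"],
        "retweet_count": [10, 3, 8],
        "favorite_count": [25, 7, 30],
    }
)
images = pd.DataFrame(
    {"tweet_id": ["1", "2"], "dog_breed": ["pug", "golden_retriever"]}
)

# Assumed join type: inner, so only tweets present in all tables remain.
master = archive.merge(tweets, on="tweet_id", how="inner").merge(
    images, on="tweet_id", how="inner"
)
```

Joining the stages keeps tweets that mention more than one stage, such as "doggo, pupper", instead of silently picking one.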


### Data Storage

I stored the data using .to_csv
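
A minimal sketch of the storage step, assuming a merged frame called `master` and a hypothetical file name. Passing `index=False` keeps the row index out of the file, and reading `tweet_id` back as a string preserves the cleaned type, since a CSV file does not store data types.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the cleaned, merged data.
master = pd.DataFrame(
    {
        "tweet_id": ["1000000000000000001", "1000000000000000002"],
        "rating_numerator": [13.0, 12.0],
    }
)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "master_example.csv")
master.to_csv(path, index=False)

# Read tweet_id back as a string instead of letting pandas infer an integer.
restored = pd.read_csv(path, dtype={"tweet_id": str})
```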

### Data Analysis and Visualization

I analyzed and visualized the wrangled data, producing insights (using .describe, .corr and the groupby method) and visualizations (using matplotlib).
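
The three summary calls can be sketched as follows. The frame and its values are made up, the column name `dog_stage` is an assumption, and the plotting step is left out.

```python
import pandas as pd

# Made-up data for illustration only.
master = pd.DataFrame(
    {
        "dog_stage": ["pupper", "pupper", "doggo", "doggo"],
        "retweet_count": [10, 20, 30, 40],
        "favorite_count": [30, 60, 90, 120],
    }
)

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
summary = master[["retweet_count", "favorite_count"]].describe()

# Pearson correlation between retweets and favorites.
correlation = master["retweet_count"].corr(master["favorite_count"])

# Average favorites per dog stage.
mean_by_stage = master.groupby("dog_stage")["favorite_count"].mean()
```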
