## Data Wrangling Report - WeRateDogs Data Analysis

### By Moyinoluwa Sobowale



## Goal
**To wrangle, analyze and visualize the tweet archive of Twitter user "@dog_rates", also known as WeRateDogs.**  

WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.


## Data Wrangling

I carried out the following wrangling steps to prepare the data for analysis:
- Data Gathering
- Data Assessing
- Data Cleaning
- Storing The Data

## Data Gathering

I gathered the three pieces of data required for this project and loaded them into a Jupyter notebook.

1. I directly downloaded the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv) and read the file into a pandas dataframe called "twitter_archive".
2. I used the Requests library to download the tweet image prediction file (image_predictions.tsv) and loaded it into a pandas dataframe called "image_prediction".
3. I used the Tweepy library to query additional data via the Twitter API (tweet-json.txt), then read the tweet-json.txt file line by line into a pandas dataframe called "twitter_plus" with the columns tweet_id, retweet_count, favorite_count and date_created.
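
The line-by-line read in step 3 can be sketched as follows. This is a minimal illustration rather than the notebook's actual code. It assumes each line of tweet-json.txt holds one JSON object with the Twitter API field names `id`, `retweet_count`, `favorite_count` and `created_at`, and it uses a two-line in-memory sample with made-up values in place of the real file.

```python
import io
import json

import pandas as pd

# Two made-up lines standing in for tweet-json.txt (one JSON object per line).
sample_file = io.StringIO(
    '{"id": 1001, "retweet_count": 5, "favorite_count": 20, '
    '"created_at": "Tue Aug 01 16:23:56 +0000 2017"}\n'
    '{"id": 1002, "retweet_count": 7, "favorite_count": 31, '
    '"created_at": "Tue Aug 01 00:17:27 +0000 2017"}\n'
)

rows = []
for line in sample_file:
    tweet = json.loads(line)  # parse one tweet per line
    rows.append(
        {
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
            "date_created": tweet["created_at"],
        }
    )

# Build the dataframe once, after all lines have been parsed.
twitter_plus = pd.DataFrame(rows)
```

Collecting the rows in a list and building the dataframe at the end avoids growing a dataframe one row at a time.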


## Data Assessing 

I assessed the 3 datasets gathered both visually and programmatically.

**Visual Assessment**

I printed the three pandas dataframes in separate cells in my Jupyter notebook and visually assessed the outputs.

**Programmatic Assessment**

I used the following pandas attributes and methods to assess each dataset programmatically:

- .shape: To get the number of rows and columns in each dataframe
- .shape[0]: To get the number of rows (entries) in each dataframe
- .head(): To view the first 5 rows of each dataframe
- .tail(): To view the last 5 rows of each dataframe
- .sample(): To view random rows of each dataframe
- .info(): To view a concise summary of each dataframe, including the number of non-null values in each column
- .describe(): To get descriptive statistics for the columns of each dataframe
- .value_counts(): To view the distribution of values in a column
- .duplicated(): To check for duplicate entries in a column or duplicate rows in a dataframe
- .isnull() or .isna(): To check for missing values in a column
- .nunique(): To get the number of unique values in each column of each dataframe
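
As a rough illustration of how these calls fit together, the sketch below runs several of them on a small toy dataframe. The column names and values are hypothetical and only stand in for the real datasets.

```python
import pandas as pd

# Hypothetical toy data; the real dataframes are far larger.
df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "name": ["Bella", "a", "Bella", None],
        "rating_numerator": [12, 13, 12, 11],
    }
)

n_rows, n_cols = df.shape                     # number of rows and columns
df.info()                                     # dtypes and non-null counts
summary = df.describe()                       # statistics for numeric columns
name_counts = df["name"].value_counts()       # distribution of one column
n_dup_names = df["name"].duplicated().sum()   # repeated values in a column
n_missing = df["name"].isnull().sum()         # missing values in a column
n_unique = df.nunique()                       # unique values per column
```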


**Below are the data quality and tidiness issues which I found and resolved in the 3 datasets gathered.**

### Quality issues

### twitter_archive dataframe
1. **Missing values:** The "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp", "in_reply_to_status_id" and "in_reply_to_user_id" columns contained 2175, 2175, 2175, 2278 and 2278 missing values, respectively.


2. **Duplicate values:** There were 137 duplicate values in the "expanded_urls" column.


3. **Missing values:** "expanded_urls" column had 59 missing values.


4. **Wrong datatype/format:** "timestamp" and "retweeted_status_timestamp" columns were in object/string datatype instead of datetime format.


5. **Wrong datatype/format:** "tweet_id" column was in numerical (integer) format instead of object/string format.


6. **Wrong datatype/format:** "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id" were in numerical (float) format instead of object/string format.


7. **Invalid data:** Some entries in the "name" column started with a lowercase letter and were therefore considered invalid, unlike the other dog names, which started with an uppercase letter. Here are the invalid entries found in the "name" column: ['such', 'a', 'quite', 'not', 'one', 'incredibly', 'mad', 'an', 'very', 'just', 'my', 'his', 'actually', 'getting', 'this', 'unacceptable', 'all', 'old', 'infuriating', 'the', 'by', 'officially', 'life', 'light', 'space']
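
Issues 4, 5 and 7 above could be addressed along the following lines. This is a hedged sketch on hypothetical rows, not the notebook's code. In particular, replacing lowercase "names" with a missing value is only one possible treatment, since the report does not state how those entries were handled.

```python
import pandas as pd

# Hypothetical rows mimicking the twitter_archive columns discussed above.
archive = pd.DataFrame(
    {
        "tweet_id": [1001, 1002, 1003],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-08-01 00:17:27 +0000",
            "2017-07-31 00:18:03 +0000",
        ],
        "name": ["Phineas", "a", "quite"],
    }
)
archive_clean = archive.copy()  # clean a copy, keep the original intact

# Issues 4 and 5: identifiers become strings, timestamps become datetimes.
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Issue 7 (assumed treatment): names starting with a lowercase letter
# are replaced with a missing value.
invalid = archive_clean["name"].str.match(r"[a-z]")
archive_clean.loc[invalid, "name"] = None
```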




### image_prediction dataframe
8. **Duplicate values:** There were 66 duplicate values/rows in the "jpg_url" column.


9. **Wrong datatype/format:** "tweet_id" column was in numerical (integer) format instead of object/string format.


10. **Inconsistent Format:** The values in columns "p1", "p2", and "p3" did not use a consistent case: some values started with an uppercase letter while others were entirely lowercase.
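
A possible way to handle issues 8 and 10 is sketched below on made-up rows. Keeping the first row per image URL and lowercasing the labels are both assumptions; the report does not say which convention was chosen.

```python
import pandas as pd

# Hypothetical rows; the URLs and labels are made up for illustration.
image_prediction = pd.DataFrame(
    {
        "tweet_id": [1001, 1002, 1003],
        "jpg_url": [
            "https://example.com/a.jpg",
            "https://example.com/a.jpg",
            "https://example.com/b.jpg",
        ],
        "p1": ["Welsh_springer_spaniel", "redbone", "German_shepherd"],
    }
)

# Issue 8 (assumed treatment): keep the first row for each image URL.
image_clean = image_prediction.drop_duplicates(subset="jpg_url", keep="first")

# Issue 10 (assumed convention): lowercase the prediction labels.
image_clean["p1"] = image_clean["p1"].str.lower()
```

The same `.str.lower()` call would apply to "p2" and "p3".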




### twitter_plus dataframe
11. **Wrong datatype/format:** "tweet_id" column was in numerical (integer) format instead of object/string format.


12. **Wrong datatype/format:** "date_created" column was in object/string datatype instead of datetime format.




### General Observation
I noticed that the number of tweet_id values differed across the 3 datasets:
- twitter_archive = 2356 tweet_ids
- image_prediction = 2075 tweet_ids
- twitter_plus = 2354 tweet_ids

This discrepancy was later resolved by merging the 3 datasets into a single master dataset.




### Tidiness issues

### twitter_archive dataframe

1. The "doggo", "floofer", "pupper" and "puppo" columns all represented dog stages, so they needed to be combined into a single column called "dog_stage".



2. Since only original ratings (no retweets) were wanted for the analysis, I had to drop:
- the tweet_ids that had data stored in the "retweeted_status_id", "retweeted_status_user_id", and "retweeted_status_timestamp" columns, because these tweet_ids were retweets and were not required for the analysis.
- the tweet_ids that had data stored in the "in_reply_to_status_id" and "in_reply_to_user_id" columns, because these tweet_ids were replies rather than original ratings and were also not required for the analysis.



3. The rating numerator and denominator values were in separate columns. They were later combined into a single column.
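
The three tidiness fixes above can be sketched roughly as follows. The rows are hypothetical, and several details are assumptions: the column names "rating_numerator" and "rating_denominator", the string "None" as the empty marker in the stage columns, and a numeric ratio as the combined rating (the report does not say which form was used).

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "None" as the empty stage marker is an assumption.
archive = pd.DataFrame(
    {
        "tweet_id": ["1001", "1002", "1003", "1004"],
        "retweeted_status_id": [np.nan, 9.0, np.nan, np.nan],
        "in_reply_to_status_id": [np.nan, np.nan, 8.0, np.nan],
        "rating_numerator": [13, 12, 11, 12],
        "rating_denominator": [10, 10, 10, 10],
        "doggo": ["doggo", "None", "None", "None"],
        "floofer": ["None"] * 4,
        "pupper": ["None", "None", "pupper", "None"],
        "puppo": ["None"] * 4,
    }
)

# Issue 2: keep original tweets only (no retweet or reply metadata).
is_original = (
    archive["retweeted_status_id"].isna()
    & archive["in_reply_to_status_id"].isna()
)
tidy = archive[is_original].copy()

# Issue 1: collapse the four stage columns into a single "dog_stage" column.
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
stages = tidy[stage_cols].replace("None", "")
tidy["dog_stage"] = stages.apply(lambda row: ",".join(s for s in row if s), axis=1)
tidy["dog_stage"] = tidy["dog_stage"].replace("", np.nan)

# Issue 3 (assumed form): store the rating as numerator / denominator.
tidy["rating"] = tidy["rating_numerator"] / tidy["rating_denominator"]
tidy = tidy.drop(columns=stage_cols + ["rating_numerator", "rating_denominator"])
```

Joining with a comma keeps both labels when a tweet mentions more than one stage.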




### twitter_plus dataframe

4. The information in the "date_created" column was already available in the twitter_archive dataframe, so the column was redundant.



### General Observation
5. The three datasets for this project had to be combined into a single master dataset. This was done using the tweet_id column, which is present in all three datasets.



## Data Cleaning
- I made copies of each of the original datasets before cleaning.

- All of the issues documented above while assessing the original datasets gathered were resolved in the copies using the 3 important data cleaning steps: Define, Code, Test.

- The 3 cleaned copies of the datasets were merged into one master dataset called master_data using the tweet_id column.
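
The merge step might look like the sketch below. The three small dataframes are hypothetical stand-ins for the cleaned copies, and the inner join (`how="inner"`) is an assumption; the report does not state which join type was used.

```python
import pandas as pd

# Hypothetical cleaned copies, each keyed by a string tweet_id.
archive_clean = pd.DataFrame(
    {"tweet_id": ["1001", "1002", "1003"], "rating": [1.3, 1.2, 1.1]}
)
image_clean = pd.DataFrame(
    {"tweet_id": ["1001", "1003"], "p1": ["redbone", "pug"]}
)
plus_clean = pd.DataFrame(
    {"tweet_id": ["1001", "1002", "1003"], "retweet_count": [5, 7, 2]}
)

# An inner join keeps only tweets present in all three dataframes.
master_data = archive_clean.merge(image_clean, on="tweet_id", how="inner")
master_data = master_data.merge(plus_clean, on="tweet_id", how="inner")

# Test step: every tweet appears exactly once after the merge.
assert master_data["tweet_id"].is_unique
```

With an inner join, the differing tweet_id counts noted during assessment reduce to the set of tweets that all three datasets share.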


## Storing The Data

- The cleaned master dataset was saved to a CSV file named "twitter_archive_master.csv". The saved file was then loaded into a pandas dataframe called dog_data, and I wrangled the data further to obtain useful insights.
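
A minimal sketch of the save-and-reload round trip is shown below, using a made-up two-row dataframe and a temporary directory so nothing is written to the working directory. Passing `dtype` on reload is a suggestion rather than something the report describes: without it, pandas would infer tweet_id as an integer column again.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned master dataset.
master_data = pd.DataFrame({"tweet_id": ["1001", "1003"], "rating": [1.3, 1.1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "twitter_archive_master.csv")
    # index=False avoids writing the row index as an extra column.
    master_data.to_csv(path, index=False)
    # Read tweet_id back as text so the earlier dtype fix is preserved.
    dog_data = pd.read_csv(path, dtype={"tweet_id": str})
```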



## Conclusion
I thoroughly enjoyed working on this project, and I am glad that I was able to improve my Python skills while exploring new concepts.