# <font size = "8" color=lightblue>  <div align="center"> Data Wrangling of the WeRateDogs Tweet Archive: General insights and Visualization</div></font> 

# What is Data Wrangling?

Real-world data rarely comes clean. Hence we define data wrangling as the process of gathering, assessing and cleaning data for "Wow!"-worthy analyses and visualizations.


# When would you need to wrangle data?

Data wrangling is the first step to take when you have a data analysis project in hand. With real-world data, even a quick visual assessment can reveal issues.

Common issues with real-world data are:

* Quality issues:
 * missing entries
 * duplicates
 * incorrect data
 * validity
* Structural issues:
 * inappropriate data entry.
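
A quick programmatic assessment can back up the visual one. Here is a minimal sketch with Pandas; the toy rows, the column names (`tweet_id`, `name`, `rating_numerator`) and the threshold of 20 are made up for illustration and are not taken from the real archive.

```python
import pandas as pd

# Toy data invented for illustration only.
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Bella", "Max", "Max", None],
    "rating_numerator": [13, 12, 12, 1776],
})

# Missing entries: count of missing values in each column.
missing_per_column = df.isna().sum()

# Duplicates: rows repeated in full, and ids that appear more than once.
n_duplicate_rows = int(df.duplicated().sum())
n_duplicate_ids = int(df["tweet_id"].duplicated().sum())

# Incorrect or invalid data: values outside a plausible range.
# The cut-off of 20 is an arbitrary assumption for this sketch.
suspicious = df[df["rating_numerator"] > 20]

print(missing_per_column)
print(n_duplicate_rows, n_duplicate_ids)
print(suspicious)
```

Each check only flags candidates; whether a flagged row is really wrong still needs a human look.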
 


# Let's look at an example

# <font color=blue>WeRateDogs Enhanced Twitter Archive</font>

[WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs) is a Twitter account of Twitter user [@dog_rates](https://twitter.com/dog_rates) that rates people's dogs with a humorous comment about the dog. 

WeRateDogs asks people to send photos of their dogs, then tweets selected photos with a rating and a humorous comment. Dogs are rated on a scale of one to ten, but are invariably given ratings in excess of the maximum, such as "13/10". Popular posts are re-posted on Instagram and Facebook.

Some of the issues with the tweet archive data include:

* Missing images of dogs in some tweets
* Duplicate tweet ids
* Inaccurate entries for dogs' names, breeds and stages.
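
As a rough sketch of how issues like these could be cleaned with Pandas, consider the snippet below. The column names (`tweet_id`, `jpg_url`, `name`), the toy rows and the list of placeholder names are assumptions for illustration, not the exact steps in the notebook.

```python
import pandas as pd

# Toy archive invented for illustration only.
archive = pd.DataFrame({
    "tweet_id": [10, 11, 11, 12],
    "jpg_url": ["a.jpg", "b.jpg", "b.jpg", None],
    "name": ["Charlie", "a", "a", "None"],
})

clean = (
    archive
    .drop_duplicates(subset="tweet_id")  # keep the first row per tweet id
    .dropna(subset=["jpg_url"])          # drop tweets without an image
    .copy()
)

# Treat placeholder-looking names as missing.
# This list is hypothetical; a real one would come from assessing the data.
placeholders = ["a", "None"]
clean["name"] = clean["name"].mask(clean["name"].isin(placeholders))

print(clean)
```

Working on a copy keeps the original dataframe intact, so the cleaning can be re-run or checked against the raw data.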


# What is our data analysis problem?

Our goal is to wrangle the WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. I’m going to use Python to wrangle the data. We’ll use the Pandas and Matplotlib libraries to help build up our dataframe and create visualizations.

I’m using a Jupyter Notebook to code in, and you can get the finished product from my GitHub repo.


### Let's see some useful libraries

In [6]:
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import requests
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
import re

# <font color=blue>Insights and Visualization </font>

One of the charts we created using these libraries is:

<img src="rf_correlation.png" width=700/>

### _What does the chart tell us?_

The chart is a scatter plot of favorites against retweets, drawn from a Pandas dataframe. It tells us that retweets and likes (favorites) have a strong, positive correlation. This is expected: the more a post is retweeted, the more people get to see it and like it.
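
To put a number on "strong and positive", one option is the Pearson correlation coefficient. The sketch below uses made-up counts and assumed column names (`retweet_count`, `favorite_count`); the value it prints says nothing about the real dataset.

```python
import pandas as pd

# Toy counts invented for illustration only.
tweets = pd.DataFrame({
    "retweet_count": [100, 250, 400, 800, 1500],
    "favorite_count": [300, 700, 1200, 2500, 4000],
})

# Series.corr uses the Pearson coefficient by default (range -1 to 1).
r = tweets["retweet_count"].corr(tweets["favorite_count"])
print(f"Pearson r = {r:.3f}")

# A scatter plot like the one above could then be drawn with, for example:
# tweets.plot.scatter(x="retweet_count", y="favorite_count")
```

A value close to 1 means the two counts rise together almost linearly.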


### Has there been a change in the favorite and retweet counts over time?

<img src="rf_timestamp.pdf" width = 900/>

Indeed yes! The numbers of favorites and retweets have both increased over time, as the chart above shows. The number of favorites is higher than the number of retweets.
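
One way to look at a trend like this is to average the counts per month. This is a minimal sketch, assuming a `timestamp` column of datetimes plus `retweet_count` and `favorite_count` columns; the rows are invented.

```python
import pandas as pd

# Toy data invented for illustration only.
tweets = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-01-05", "2016-01-20", "2016-02-10", "2016-03-01"]
    ),
    "retweet_count": [100, 200, 400, 900],
    "favorite_count": [300, 500, 1200, 3000],
})

# Group tweets by calendar month and take the mean of each count.
month = tweets["timestamp"].dt.to_period("M")
monthly = tweets.groupby(month)[["retweet_count", "favorite_count"]].mean()

print(monthly)
```

Averaging per month smooths out single viral tweets, which makes the overall direction easier to read than the raw per-tweet counts.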

Next, let's look at the relationship between rating, favorites and retweets.

### Do higher-rated dogs get more retweets and hence more favorites?

<img src="rf_rating.pdf" width = 900/>

Yes yes! Observe that the maximum rating in the dataset is 14. The chart above shows that the dogs with the highest numbers of retweets and favorites have ratings of 13 and 12 respectively. However, on average, dogs within this rating category have a lower number of likes and retweets, as we will see later.
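
The gap between "the single highest count" and "the highest count on average" is easy to see with a grouped aggregate. The sketch below assumes `rating_numerator` and `retweet_count` columns and uses invented numbers, so the ratings it picks out are only an illustration.

```python
import pandas as pd

# Toy data invented for illustration only.
tweets = pd.DataFrame({
    "rating_numerator": [12, 12, 13, 13, 13, 14],
    "retweet_count": [500, 700, 300, 9000, 600, 4000],
})

# For each rating, compare the best single tweet with the average tweet.
by_rating = tweets.groupby("rating_numerator")["retweet_count"].agg(["max", "mean"])

top_by_max = by_rating["max"].idxmax()    # rating of the most retweeted tweet
top_by_mean = by_rating["mean"].idxmax()  # rating with the highest average

print(by_rating)
print(top_by_max, top_by_mean)
```

In this toy data one outlier puts a rating on top by maximum while a different rating leads on average, which is why both views are worth checking.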


### Let us now turn our attention to the dogs: which breeds are popular? Which get the highest ratings, retweets and favorites?

### What are the top 10 most popular dog breeds?

We selected the 10 breeds with the highest counts. Below is a bar chart showing the count for each of these dog breeds.

<img src="bar_dogbreed.png" width = 900/>

Golden Retriever is the most popular dog breed in the dataset, appearing over 80 times.
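
Counting breeds is a one-liner in Pandas. Here is a small sketch, assuming the predicted breed lives in a column called `breed` (a hypothetical name); the values are made up and `top_n` is set to 2 only because the toy data is tiny.

```python
import pandas as pd

# Toy breed column invented for illustration only.
breeds = pd.Series(
    ["golden_retriever"] * 3 + ["pug"] * 2 + ["chow"],
    name="breed",
)

top_n = 2  # the report uses 10
# value_counts sorts from most to least frequent, so head() gives the top breeds.
top_breeds = breeds.value_counts().head(top_n)

print(top_breeds)
```

The resulting series can be passed straight to a bar chart, for example with `top_breeds.plot.bar()`.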

## Is the popularity of a dog breed due to high rating, number of retweets or number of likes (favorites)?

To answer this question, we calculated the mean rating for each dog breed and selected the top 10. The bar chart below shows the mean rating for these dog breeds.

<img src="rate_dogbreed.png" />

Bedlington Terrier, which is not among the top 10 most popular dog breeds, has the highest rating here. Interestingly, none of the most popular dog breeds appear in the list of highest average ratings. This suggests that the popularity of a dog breed is not due to its rating.
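
The comparison between "most popular" and "best rated" can be checked directly by intersecting the two lists. This is a sketch under assumed column names (`breed`, `rating_numerator`) with invented rows; `n` is 1 here only to suit the toy data.

```python
import pandas as pd

# Toy data invented for illustration only.
df = pd.DataFrame({
    "breed": [
        "golden_retriever", "golden_retriever", "golden_retriever",
        "pug", "pug", "bedlington_terrier",
    ],
    "rating_numerator": [11, 12, 10, 10, 11, 14],
})

n = 1  # the report uses 10

# Breeds that appear most often.
most_common = set(df["breed"].value_counts().head(n).index)

# Breeds with the highest mean rating.
mean_rating = df.groupby("breed")["rating_numerator"].mean()
best_rated = set(mean_rating.nlargest(n).index)

# Breeds that are both popular and highly rated.
overlap = most_common & best_rated

print(most_common, best_rated, overlap)
```

An empty `overlap` is the programmatic version of the observation above.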

## Which dog breed has the highest number of retweets and likes?

<img src="breed_fr.png" />

As we discussed earlier, the popularity of a dog breed implies increased retweet counts and favorite counts. Looking at the chart above, English Springer has the highest mean retweets and favorites. Bedlington Terrier, which has the highest rating, ranks 4th among dog breeds for retweets and favorites.


# Dog Stage

## What can we say about the different dog stages: Pupper, Puppo, Doggo, Floofer?

<img src="bar_dogstage.png" />

The most popular dog stage is Pupper; however, it has the lowest mean rating of all dog stages. Floofer, on the other hand, which is the least popular dog stage, has the highest rating.

<img src="dog_stage1.png" />

We see that Floofer, the least popular dog stage in the dataset, has the highest rating and number of favorites. Doggo has the highest number of retweets.
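
A per-stage summary like this can be built with one grouped aggregation. The sketch below assumes the stages have already been gathered into a single `stage` column (a hypothetical name), and the numbers are invented so that they mirror the pattern described above rather than the real values.

```python
import pandas as pd

# Toy data invented for illustration only.
df = pd.DataFrame({
    "stage": ["pupper", "pupper", "pupper", "doggo", "doggo", "floofer"],
    "rating_numerator": [10, 11, 10, 12, 11, 13],
    "retweet_count": [200, 300, 250, 3000, 2000, 1500],
    "favorite_count": [600, 800, 700, 5000, 4000, 9000],
})

# Named aggregation: one output column per (input column, function) pair.
summary = df.groupby("stage").agg(
    tweets=("rating_numerator", "count"),
    mean_rating=("rating_numerator", "mean"),
    mean_retweets=("retweet_count", "mean"),
    mean_favorites=("favorite_count", "mean"),
)

print(summary)
```

Keeping the tweet count next to the means is a useful reminder of how few tweets stand behind the rarer stages.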

## Do you want to get a new dog by investigating the WeRateDogs Twitter account?

Consider the Bedlington Terrier dog in the Floofer stage. You probably do not want to base your judgment by looking at the retee

## Want to get complete information about the data wrangling process?

Find more information in wrangle_act.ipynb.


In [10]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'act_report.ipynb'])


0