# Project: Wrangling and Analyze Data


## Introduction

1. In this notebook I will wrangle, analyze, and visualize the tweet archive of <a href='https://twitter.com/dog_rates'>@dog_rates</a>, also known as <a href='https://en.wikipedia.org/wiki/WeRateDogs'>WeRateDogs</a>.

2. I will follow this workflow:

    Step 1: Gathering data - Using the Python <a href='https://requests.readthedocs.io/en/latest/'>requests</a> library and the <a href='https://developer.twitter.com/en/docs/twitter-api'>Twitter API</a>, and downloading a TSV file from this <a href='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'>link</a>.

    Step 2: Assessing data - Using both visual and programmatic methods.

    Step 3: Cleaning data - Programmatically using the <a href='https://pandas.pydata.org/'>Pandas</a> Library.

    Step 4: Storing data - In a csv file.

    Step 5: Analyzing and visualizing data - Programmatically using <a href='https://matplotlib.org./'>Matplotlib</a>.

    Step 6: Reporting - data wrangling efforts in <a href='wrangle_report.html'>wrangle_report</a>,
      data analyses and visualizations in <a href='act_report.html'>act_report</a>.


In [11]:
# Import libraries
import os
import re
import json
import tweepy
import config
import requests
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from IPython.display import Image, Video
%matplotlib inline

sn.set()

## Data Gathering
In the cells below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [6]:
# Load twitter-archive-enhanced.csv into a pandas dataframe
archive = pd.read_csv('twitter-archive-enhanced.csv')
archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [7]:
# Download image-predictions.tsv to the local folder as image_predictions.tsv (skipped if already present)
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
file = 'image_predictions.tsv'
try:
    if not os.path.exists(file):
        response = requests.get(url)
        # Raise on a failed download instead of saving an error page to disk
        response.raise_for_status()
        with open(file, mode='wb') as f:
            f.write(response.content)
except Exception as e:
    print(e)
# The file is tab-separated, so pass the delimiter explicitly
predictions = pd.read_csv(file, delimiter='\t')
predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [4]:
# Instantiate client object and query tweet information
client = tweepy.Client(bearer_token=config.bearer_token)

In [None]:
# Get tweets in batches of 100 ids and write one JSON object per line
file = 'tweet_json.txt'
try:
    if not os.path.exists(file):
        with open(file, 'w') as f:
            for start in range(0, len(archive), 100):
                ids = [str(tweet_id) for tweet_id in archive.tweet_id.iloc[start:start+100]]
                tweets = client.get_tweets(ids=ids,
                                           tweet_fields=['created_at','public_metrics'])
                # tweets.data is None when no tweet in the batch could be retrieved
                for tweet in tweets.data or []:
                    tweet_json = {
                        'tweet_id': tweet.id,
                        'timestamp': str(tweet.created_at),
                        'retweet_count': tweet.public_metrics['retweet_count'],
                        'reply_count': tweet.public_metrics['reply_count'],
                        'like_count': tweet.public_metrics['like_count']
                    }
                    # No indent, so every tweet stays on a single line
                    f.write(json.dumps(tweet_json) + '\n')
except Exception as e:
    print(e)

In [9]:
# See tweet_json dataframe (the file holds one JSON object per line)
tweet_json = pd.read_json('tweet_json.txt', lines=True)
tweet_json.head()
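By default `pd.read_json` expects the file to hold a single JSON document, so a file with one object per tweet needs the `lines=True` option. The sketch below illustrates the round trip on a temporary file; the records and the file name `tweet_json_demo.txt` are made up for illustration and are not real tweet data.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for API responses (not real tweet data)
records = [
    {'tweet_id': 1, 'retweet_count': 10, 'reply_count': 2, 'like_count': 30},
    {'tweet_id': 2, 'retweet_count': 5, 'reply_count': 1, 'like_count': 12},
]

# Write one compact JSON object per line (no indent keeps each record on one line)
path = os.path.join(tempfile.mkdtemp(), 'tweet_json_demo.txt')
with open(path, 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# lines=True tells pandas to parse every line as a separate JSON object
demo = pd.read_json(path, lines=True)
print(demo.shape)
```

Each line becomes one row, so the demo dataframe has as many rows as there were records written.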

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and
programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
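The first key point (original ratings only, with images) could be sketched as follows. This is a minimal illustration on made-up rows, assuming that a non-null `retweeted_status_id` marks a retweet and that a tweet has an image when its `tweet_id` appears in the predictions table; the frames `demo_archive` and `demo_predictions` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: tweet 2 is a retweet, tweet 3 has no image prediction
demo_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 99.0, np.nan],
})
demo_predictions = pd.DataFrame({'tweet_id': [1, 2]})

# Keep original tweets only (retweets carry a retweeted_status_id)
originals = demo_archive[demo_archive.retweeted_status_id.isna()]

# Keep only tweets that also have an image prediction
with_images = originals[originals.tweet_id.isin(demo_predictions.tweet_id)]
print(with_images.tweet_id.tolist())
```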



In [None]:
# See archive observations
archive.shape

In [None]:
# See archive summary info
archive.info()

In [None]:
# See predictions summary info
predictions.info()

In [None]:
# See tweet_json summary info
tweet_json.info()

In [None]:
# Check archive random sample
archive.sample(20)

In [None]:
# Check predictions random sample
predictions.sample(20)

In [None]:
# Check for duplicates in archive
archive.duplicated().sum()

In [None]:
# Check for duplicates in predictions
predictions.duplicated().sum()

In [None]:
# Check for duplicates in tweet_json
tweet_json.duplicated().sum()

In [None]:
# See Summary stats for meaningful numeric columns
archive.drop(columns=[i for i in list(archive) if '_id' in i]).describe()

In [None]:
# See summary stats for tweet_json
tweet_json.drop('tweet_id', axis=1).describe()

### Quality issues

1. `timestamp` column datatype is `object` in `archive`.

2. Both `in_reply` columns have only 78 non-null values in `archive`.

3. All `retweeted` columns have only 181 non-null values in `archive`.

4. The dog name in row `275` of `archive` is `10`, and other names are random letters or words (visual assessment).

5. Some dog types are missing even though they appear in the `text`, and others are wrong.

6. The source column values contain `HTML` tags and other irrelevant information.

7. `expanded_urls` is not human friendly.

8. `tweet_id` column values are of type `int` in all dataframes.

9. `image_predictions` dataframe has only 2075 observations, fewer than `archive`.

10. `tweet_json` dataframe has only 2327 observations, fewer than `archive`.

11. `rating_numerator` has a maximum value of 1776.
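Issues 4 and 11 were spotted visually and from summary statistics, but they can also be surfaced programmatically. The sketch below uses a small made-up frame; the rule that a real name starts with an uppercase letter is my assumption, not something the dataset guarantees.

```python
import pandas as pd

# Made-up rows imitating the kinds of values seen in the archive
demo = pd.DataFrame({
    'name': ['Phineas', 'a', 'such', 'None', '10'],
    'rating_numerator': [13, 12, 1776, 11, 10],
})

# Names that do not start with an uppercase letter are unlikely to be real dog names
suspect_names = demo.name[~demo.name.str.match(r'[A-Z]')]
print(suspect_names.tolist())

# The largest numerators are the first candidates to inspect by hand
largest = demo.rating_numerator.nlargest(2)
print(largest.tolist())
```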

### Tidiness issues
1. `doggo`, `floofer`, `pupper`, `puppo` should be values of a single variable and not separate columns (visual assessment).

2. `tweet_json` should be part of `archive` (visual assessment).

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original dataframes
archive_clean = archive.copy()
predictions_clean = predictions.copy()
tweet_json_clean = tweet_json.copy()

### Issue #1:

#### Define

- Change the `timestamp` column datatype in `archive` to `datetime`.

#### Code

In [None]:
# Changing timestamp col datatype
archive_clean['timestamp'] = pd.to_datetime(archive_clean.timestamp)

#### Test

In [None]:
# See changes 
archive_clean.info()

### Issue #2:

#### Define

- Drop columns with a lot of missing values.

#### Code

In [None]:
# Drop cols with over 90% null values
archive_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id',
                            'retweeted_status_id', 'retweeted_status_user_id',
                            'retweeted_status_timestamp'], inplace=True)

#### Test

In [None]:
# See changes
archive_clean.head(1)

### Issue #3:

#### Define
- Change invalid dog names in the `archive` `name` column to None.



#### Code

In [None]:
# Replace invalid dog names with None
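# Match the whole name, so all-digit names and names starting with a lowercase letter are replaced entirely
archive_clean['name'] = archive_clean.name.str.replace(r'^(?:\d+|[a-z].*)$', 'None', regex=True)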

#### Test

In [None]:
# See Changes
archive_clean.name.value_counts()

In [None]:
archive_clean.name[275]
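A pattern that matches only the first lowercase letter replaces just that letter, which leaves fragments behind; anchoring the pattern to the whole string replaces the full name. The sketch below compares both on a few made-up sample names.

```python
import pandas as pd

# Made-up sample names covering the problem cases
names = pd.Series(['Phineas', 'a', 'such', 'quite', '10', 'None'])

# Matching only the leading lowercase letter replaces that single letter
partial = names.str.replace(r'\d+|^[a-z]', 'None', regex=True)
print(partial.tolist())

# Anchoring with ^...$ replaces the whole invalid name
cleaned = names.str.replace(r'^(?:\d+|[a-z].*)$', 'None', regex=True)
print(cleaned.tolist())
```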

### Issue #4:



#### Define

- Extract `doggo`,`floofer`,`pupper`,`puppo` from `text` col.


#### Code

In [None]:
# Extract the first dog type mentioned in the text col (expand=False returns a Series) and drop type cols
archive_clean['dog_type'] = archive_clean.text.str.extract('(floofer|doggo|pupper|puppo)', expand=False)
archive_clean.drop(columns=['doggo', 'floofer', 'pupper', 'puppo'], inplace=True)

#### Test

In [None]:
# See Changes
archive_clean.head(2)

In [None]:
archive_clean.dog_type.value_counts()
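`str.extract` keeps only the first match, so a tweet that mentions two dog types keeps one of them. If every mention should be preserved, one possible alternative is `str.findall`, sketched here on made-up text; joining the matches with a comma is my own choice, not part of the original cleaning.

```python
import pandas as pd

# Made-up tweet texts: two types, one type, no type
text = pd.Series([
    'This is a doggo and a pupper.',
    'Such a floofer.',
    'No stage mentioned here.',
])

# Collect every dog type mentioned in each text
stages = text.str.findall(r'\b(doggo|floofer|pupper|puppo)\b')

# Join multiple types with a comma and leave texts without any type as missing
dog_type = stages.str.join(',')
dog_type = dog_type.where(dog_type != '')
print(dog_type.tolist())
```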

### Issue #5:


#### Define

- Extract only the text containing the source of the tweet and the source url.


#### Code

In [None]:
# Create source_text and source_url from source col (raw strings keep the regex escapes intact)
archive_clean['source_text'] = archive_clean.source.str.extract(r'>([A-Za-z].+)<')[0]
archive_clean['source_url'] = archive_clean.source.str.extract(r'([a-z]+://.+)("\s)')[0]
# Drop source col
archive_clean.drop('source', axis=1, inplace=True)

#### Test

In [None]:
# See Changes
archive_clean.head()
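Both pieces can also be pulled out in one pass with named groups. This is only a sketch: the sample value below is an assumed full form of a `source` entry (the notebook output truncates it), and the pattern relies on the `href` attribute coming first.

```python
import pandas as pd

# Assumed full form of a source value; the displayed output above is truncated
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
])

# Named groups become the column names of the extracted frame
pattern = r'href="(?P<source_url>[^"]+)"[^>]*>(?P<source_text>[^<]+)<'
parts = source.str.extract(pattern)
print(parts.iloc[0].tolist())
```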

### Issue #6:


#### Define

- Rename the expanded_urls column to tweet_url since it points to the specific tweet.


#### Code

In [None]:
# Rename expanded_urls col
archive_clean.rename(columns={'expanded_urls': 'tweet_url'}, inplace=True)

#### Test

In [None]:
# See Changes
archive_clean.head(1)

### Issue #7:

#### Define

- Change all `tweet_id` column datatypes from `int` type to `str` type.



#### Code

In [None]:
# Change all tweet_id datatypes to str type
archive_clean['tweet_id'] = archive_clean.tweet_id.astype(str)
predictions_clean['tweet_id'] = predictions_clean.tweet_id.astype(str)
tweet_json_clean['tweet_id'] = tweet_json_clean.tweet_id.astype(str)

#### Test

In [None]:
# See Changes
archive_clean.info()

In [None]:
# See Changes
predictions_clean.info()

In [None]:
# See Changes
tweet_json_clean.info()
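One thing to keep in mind: a string `tweet_id` does not survive a round trip through CSV unless the type is requested on read, because `pd.read_csv` infers integers again. A minimal sketch with an in-memory CSV (the single row is a sample for illustration):

```python
import io

import pandas as pd

# In-memory CSV with a single sample row
csv_text = 'tweet_id,rating_numerator\n892420643555336193,13\n'

# Default type inference turns the id back into an integer
as_int = pd.read_csv(io.StringIO(csv_text))

# Passing dtype keeps the id as a string
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'tweet_id': str})
print(as_int.tweet_id.dtype, type(as_str.tweet_id.iloc[0]))
```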

### Issue #8:

#### Define

- Merge archive and tweet_json dataframes.



#### Code

In [None]:
# Merge archive and tweet_json dataframes
df_merge = pd.merge(archive_clean,
                    tweet_json_clean.drop('timestamp',axis=1), on='tweet_id')

#### Test

In [None]:
# See Changes
df_merge.head(2)
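`pd.merge` defaults to an inner join, so tweets missing from either dataframe are dropped silently. A sketch of how the dropped rows could be audited, using made-up frames and the `how`, `validate`, and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

# Made-up frames: tweet '2' has no metrics, tweet '4' is not in the archive
left = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'tweet_id': ['1', '3', '4'], 'like_count': [10, 30, 40]})

# Default inner join keeps only ids present in both frames
inner = pd.merge(left, right, on='tweet_id')

# An outer join with an indicator column shows which rows an inner join drops;
# validate raises an error if tweet_id is duplicated on either side
audit = pd.merge(left, right, on='tweet_id', how='outer',
                 validate='one_to_one', indicator=True)
dropped = audit[audit['_merge'] != 'both']
print(len(inner), sorted(dropped.tweet_id))
```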

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
# Merge the image predictions into the merged dataframe to build the master dataframe
master_df = pd.merge(df_merge, predictions_clean, on='tweet_id')

In [None]:
# See number of observations
master_df.shape

In [None]:
# Save master dataframe to a csv file.
master_df.to_csv('twitter_archive_master.csv', index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
# Read the master csv file into a pandas dataframe
df = pd.read_csv('twitter_archive_master.csv')
df.head(2)

In [None]:
# See df summary info
df.info()

In [None]:
# Get dogs classified as dogs in all probability levels
dog_class = df.query('p1_dog == True & p2_dog == True & p3_dog == True')

In [None]:
# See dog_class sample size
dog_class.shape

In [None]:
# Get the 5 dogs with the highest first-prediction confidence (p1_conf)
top_5 = dog_class.groupby(['tweet_id', 'p1']).p1_conf.nlargest(1).sort_values(ascending=False)[:5]
top_5

In [None]:
# Get the 5 dogs with the lowest first-prediction confidence (p1_conf)
bottom_5 = dog_class.groupby(['tweet_id', 'p1']).p1_conf.nlargest(1).sort_values(ascending=False)[-5:]
bottom_5

In [None]:
# Get the highest rated dog
top_rated = df.groupby(['tweet_id','p1']).rating_numerator.nlargest(1).sort_values(ascending=False)[:1]
top_rated

In [None]:
# Get the most retweeted tweet
retweet = df.groupby(['tweet_id']).retweet_count.nlargest(1).sort_values(ascending=False)[:1]
retweet

In [None]:
# Get the tweet with the most replies
engagement = df.groupby(['tweet_id']).reply_count.nlargest(1).sort_values(ascending=False)[:1]
engagement

In [None]:
# Get the most liked tweet
favourite = df.groupby(['tweet_id']).like_count.nlargest(1).sort_values(ascending=False)[:1]
favourite
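Since every row is already one tweet, the same lookups can be written without a `groupby`. A minimal sketch on a made-up frame, using `DataFrame.nlargest` and `Series.idxmax`:

```python
import pandas as pd

# Made-up engagement numbers for three tweets
demo = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweet_count': [5, 50, 20],
    'like_count': [9, 70, 80],
})

# Row with the highest retweet count
most_retweeted = demo.nlargest(1, 'retweet_count')

# tweet_id of the row with the highest like count
most_liked_id = demo.loc[demo.like_count.idxmax(), 'tweet_id']
print(most_retweeted.tweet_id.iloc[0], most_liked_id)
```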

In [None]:
# Picture URL for the dog classified with the highest confidence
url_top = df.query('tweet_id == 697463031882764288').jpg_url
url_top

In [None]:
# Picture URL for the dog classified with the lowest confidence
url_bottom = df.query('tweet_id == 666644823164719104').jpg_url
url_bottom

In [None]:
# Url for the dog with the highest dog rating
url_rated = df.query('tweet_id == 749981277374128128').jpg_url
url_rated

In [None]:
# Get all dogs not classified as dogs in all probability levels
not_dogs = df.query('p1_dog == False & p2_dog == False & p3_dog == False')

In [None]:
# Get the dog type most wrongly classified and what it was classified as
not_dogs.groupby(['p1', 'p2', 'p3']).dog_type.value_counts().nlargest(1)

In [None]:
wrong = not_dogs.dog_type.value_counts()
wrong

In [None]:
# Get picture url for the poorly classified dog
df.query('p1=="mousetrap" & p2=="black_widow" & p3=="paddlewheel"').jpg_url

### Insights:

1. The dog breed classified with the highest probability was the `Labrador_retriever`, which in my judgement was spot on; see the picture in the `Visualization` section. You can read more about the breed <a href='https://en.wikipedia.org/wiki/Labrador_Retriever'>here</a>.

2. The algorithm works better on pictures that are clear and well focused. The picture of the most poorly classified breed was a bit blurry and the dog was not clearly visible, which might be the reason for the poor classification. Similarly, the dog that was classified as objects had a not so great picture. That does not take away from the fact that the picture was adorable; see it below in the `Visualization` section.

3. The top rated dog really deserved the rating: the dog and the owner went out of their way to put on a presentation, and sitting through the time it took to get the costume on proves the dog lived up to the group it belongs to. A REALLY GOOD DOG, BRENT!!

4. One dog in particular is associated with the highest retweet count, reply count, and like count. It is likely that the more retweets a tweet gets, the more people engage with it and the more people are interested in it.

5. When the classification algorithm gets it wrong, it really GETS IT WRONG! The proof is the dog classified as a mousetrap, a black widow, and a paddle wheel.
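The relationship suggested in insight 4 could be checked with a correlation matrix rather than a single tweet. The numbers below are made up purely to show the call; they are not results from the master dataset.

```python
import pandas as pd

# Made-up engagement counts, only to illustrate the computation
demo = pd.DataFrame({
    'retweet_count': [10, 20, 30, 40],
    'reply_count': [1, 3, 2, 5],
    'like_count': [30, 55, 95, 120],
})

# Pairwise Pearson correlation between the engagement columns
corr = demo.corr()
print(corr.round(2))
```

A value close to 1 between `retweet_count` and `like_count` on the real data would support the insight; a value near 0 would not.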

### Visualization

In [None]:
# Creating the data and plotting
data_top = top_5.reset_index().drop(columns=['tweet_id', 'level_2']).set_index('p1')
data_top.plot(kind='barh')
plt.title('Top 5 Classification Probability')
plt.xlabel('Probability')
plt.ylabel('Breed Prediction')
plt.savefig('top.png', bbox_inches='tight');

In [None]:
# Creating the data and plotting
data_bottom = bottom_5.reset_index().drop(columns=['tweet_id', 'level_2']).set_index('p1')
data_bottom.plot(kind='barh')
plt.title('Bottom 5 Classification Probability')
plt.xlabel('Probability')
plt.ylabel('Breed Prediction')
plt.savefig('bottom.png', bbox_inches='tight');

#### Correctly classified dog breed

In [None]:
# Picture of the dog classified with the highest probability
Image(url='https://pbs.twimg.com/media/Ca3i7CzXIAMLhg8.jpg')

In [None]:
# Picture of the dog classified with the lowest first-prediction probability
Image(url='https://pbs.twimg.com/media/CUBl6IwVAAA9_zT.jpg')

In [None]:
# Picture of the dog associated with the highest rating
Image(url='https://pbs.twimg.com/media/CmgBZ7kWcAAlzFD.jpg')

In [None]:
# Picture of the dog classified as an object at all three prediction levels
Image(url='https://pbs.twimg.com/media/CsVO7ljW8AAckRD.jpg')