### WRANGLE AND ANALYZE DATA

#### INTRODUCTION

Aim is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualisations.  
the data wrangling process will consist of three phases 
    - Gathering data
    - Assessing data
    - Cleaning data
- The cleaned data will be analyse and used to create visuals to give possible interpretation for the data.  

#### Gathering Data
Three pieces of data is required for this project:

1. The WeRateDogs Twitter archive data, provided by Udacity and downloaded manually from Udacity resource center
2. The tweet image predictions, hosted on Udacity's server and downloaded programmatically using `Get Requests` 
3. Retweet and favorite counts for the tweet_id's in the archived data from `1` above. This data is accessed y qerying the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file.

#### Assessing Data 
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues.
target is to  Detect and document at least eight (8) quality issues and two (2) tidiness issues.

#### Cleaning Data
Issues documented while assessing wil be cleaned to give an output of high quality and tidy master pandas DataFrame.

#### Storing, Analyzing, and Visualizing Data 
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. Wrangled data will be analysed and visualised using Jupyter notebook. 
A written report will prepared to describe the wrangling efforts and second report to communicate the insights and displays will also be prepared. 

#### TABLE OF CONTENTS

- Data Gathering
- Assessing 
- Cleaning

In [1]:
#Import Necessary modules
import pandas as pd
import numpy as np
import os
#to make requests
import requests
#to display tables
from IPython.display import display
#to access twitter APi
import tweepy as tw

#to write json to pandas dataframe
from pandas import DataFrame

#for json file
import json


In [2]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px  black solid !important;
  color: black !important;
}
</style> 

### DATA GATHERING

Three pieces of data will be gathered for this project.

**Data One;** The WeRateDogs Twitter archive. Provided and made available by Udacity. Downloaded from the resource centre and loaded into notebook as **.csv**

In [3]:
#import data to have an overview
df1 = pd.read_csv('twitter_archive_enhanced.csv')

In [4]:
df1.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [5]:
df1.shape

(2356, 17)

**Data Two;** The tweet image predictions, this information tells what breed of dog (or other object, animal, etc.) is present in each tweet. This was already provided by Udacity according to a neural network that can classify breeds of dogs. 

- This file (image_predictions.tsv) is hosted on Udacity's servers and downloaded using **Requests** library

In [6]:
#Getting the url
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [7]:
#to view raw data
#response.content

In [53]:
#savingg file to computer
with open('C:/Users/Frances-Anthony/Documents/Udacity/data_wrangle_analyze_project/image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [9]:
df2 = pd.read_csv('image_predictions.tsv', delimiter="\t")
df2.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


**Data three;** Extract each tweet's retweet count and favorite ("like") count for the `tweet_id` in the archived data downloaded. 

**HOW:** 
1. Using the tweet IDs in the `WeRateDogs` Twitter archive, 
2. query the Twitter API for each tweet's JSON data using Python's `Tweepy library` and 
3. store each tweet's entire set of JSON data in a file called `tweet_json.txt file.`
4. Read the `.txt` file line by line into a pandas frame with these columns (tweet_id, retweet_count, favorite_count).

In [None]:
#import necessary modules
#from the twitter archved data provided by Udacity call tweet_id to query twitter API for retweets and "likes" counts
#to access twitter API, input consumer and secret key gotten from twitter
#for privacy, i will leave them blank

#define keys
consumer_key= '.......'
consumer_secret = '..........'
access_token = '..........'
access_token_secret = '........'

auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

#get tweets from twitter API 
retweet_favorite_count = []

#save missing tweets in this list
not_found = []

with open('tweet_json.txt', mode = "w") as file:
    for i in list(df1.tweet_id):
        try:
            tweet = api.get_status(str(i))
            file.write(json.dumps(tweet._json))
            retweet_favorite_count.append({"tweet_id":str(i),
                                          "retweet_count": tweet._json['retweet_count'],
                                          "favourite_count": tweet._json['favorite_count']})
        except:
                not_found.append(i)

In [90]:
#Read the .txt file line by line into a pandas frame with these columns 
#(tweet_id, retweet_count, favorite count).
#df3 = pd.DataFrame(retweet_favorite_count, columns=['tweet_id', 'retweet_count', 'favourite_count'])

In [66]:
#write to a csv file
#df3.to_csv('retweet_and_favorites_counts.csv')

In [162]:
#read retweet_favorite_counts as csv and assign to dataframe
df3 = pd.read_csv('retweet_and_favorite_counts.csv')
df3.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,retweet_count,favourite_count
0,0,892420643555336193,7340,34978
1,1,892177421306343426,5475,30279
2,2,891815181378084864,3621,22784
3,3,891689557279858688,7529,38245
4,4,891327558926688256,8108,36522


In [94]:
#df=pd.read_csv('retweet_and_favorite_counts.csv')

In [163]:
df3.shape

(2331, 4)

## ASSESSING


The next phase of the project is assessing the gathered data. I will be assessing the data for quality and tidiness programmatically using pandas methods.


#### ASSESSING TWITTER ARCHIVE DATA

In [13]:
#i renamed the df for easy identification
archived_data = df1
archived_data.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### COLUMN DESCRIPTION
- tweet_id: 
- n_reply_to_status_id	
- in_reply_to_user_id	
- timestamp	
- source	
- text	
- retweeted_status_id	
- retweeted_status_user_id	
- retweeted_status_timestamp	
- expanded_urls	
- rating_numerator	
- rating_denominator	
- name	
- doggo	
- floofer
- pupper
- puppo

### QUALITY ISSUES
- missing data: n_reply_to_status_id and in_reply_to_user_id has only 78 rows available out of the 2356. Possibly drop these columns
- not original tweet, columns retweeted_status_timestamp, retweeted status_id and user_id means that rows are retweet and not original tweet. (drop rows)	
- dog name is none for 745 rows - check if the same rows with retweet
- dog names recorded as `a` or `an` should be `None`
- does not contain retweet and favorite counts 
- inconsistent rating_denominator -value should be 10, values    greater than 10 should be removed
- very high rating_numerator as much 1776
- time stamp has object data type, change to datetime
- 

### TIDINESS ISSUES
- create one column for dog stage, collapse multiple colunmns and rows. convert the rows (name, doggo, floofer,pupper,puppo into two columns, one with dog name and one with dog level specifying either from the list - use melt)
- expanded url has multiple url on one row
- the three datasets can be one,all have tweet_id

In [14]:
archived_data.shape

(2356, 17)

In [15]:
archived_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [49]:
archived_data.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [16]:
archived_data.puppo.unique()

array(['None', 'puppo'], dtype=object)

In [17]:
archived_data.puppo.value_counts()

None     2326
puppo      30
Name: puppo, dtype: int64

In [18]:
archived_data.retweeted_status_id.nunique() 

181

In [65]:
#archived_data.rating_denominator.isnull()
archived_data.rating_numerator.isnull().sum().any()

False

In [27]:
archived_data.name.nunique(),archived_data.name.value_counts()

(957,
 None       745
 a           55
 Charlie     12
 Cooper      11
 Oliver      11
           ... 
 Dylan        1
 Trigger      1
 Harlso       1
 Mark         1
 Tuco         1
 Name: name, Length: 957, dtype: int64)

In [138]:
#check for duplicates
archived_data.tweet_id.duplicated().sum()

0

In [32]:
archived_data.tweet_id.nunique() #no duplicates

2356

In [111]:
#archived_data.name.isnull().any()
archived_data.query('name == "None"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
12,889665388333682689,,,2017-07-25 01:55:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a puppo that seems to be on the fence a...,,,,https://twitter.com/dog_rates/status/889665388...,13,10,,,,,puppo
24,887343217045368832,,,2017-07-18 16:08:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",You may not have known you needed to see this ...,,,,https://twitter.com/dog_rates/status/887343217...,13,10,,,,,
25,887101392804085760,,,2017-07-18 00:07:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This... is a Jubilant Antarctic House Bear. We...,,,,https://twitter.com/dog_rates/status/887101392...,12,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2342,666082916733198337,,,2015-11-16 02:38:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a well-established sunblockerspan...,,,,https://twitter.com/dog_rates/status/666082916...,6,10,,,,,
2343,666073100786774016,,,2015-11-16 01:59:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Let's hope this flight isn't Malaysian (lol). ...,,,,https://twitter.com/dog_rates/status/666073100...,10,10,,,,,
2344,666071193221509120,,,2015-11-16 01:52:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a northern speckled Rhododendron....,,,,https://twitter.com/dog_rates/status/666071193...,9,10,,,,,
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,


In [171]:
#dog names recorded as a or an
#archived_data.query('name == "an"')
archived_data.query('name == "a"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
56,881536004380872706,,,2017-07-02 15:32:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a pupper approaching maximum borkdrive...,,,,https://twitter.com/dog_rates/status/881536004...,14,10,a,,,pupper,
649,792913359805018113,,,2016-10-31 02:17:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a perfect example of someone who has t...,,,,https://twitter.com/dog_rates/status/792913359...,13,10,a,,,,
801,772581559778025472,,,2016-09-04 23:46:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Guys this is getting so out of hand. We only r...,,,,https://twitter.com/dog_rates/status/772581559...,10,10,a,,,,
1002,747885874273214464,,,2016-06-28 20:14:22 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a mighty rare blue-tailed hammer sherk...,,,,https://twitter.com/dog_rates/status/747885874...,8,10,a,,,,
1004,747816857231626240,,,2016-06-28 15:40:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Viewer discretion is advised. This is a terrib...,,,,https://twitter.com/dog_rates/status/747816857...,4,10,a,,,,
1017,746872823977771008,,,2016-06-26 01:08:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a carrot. We only rate dogs. Please on...,,,,https://twitter.com/dog_rates/status/746872823...,11,10,a,,,,
1049,743222593470234624,,,2016-06-15 23:24:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a very rare Great Alaskan Bush Pupper....,,,,https://twitter.com/dog_rates/status/743222593...,12,10,a,,,pupper,
1193,717537687239008257,,,2016-04-06 02:21:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",People please. This is a Deadly Mediterranean ...,,,,https://twitter.com/dog_rates/status/717537687...,11,10,a,,,,
1207,715733265223708672,,,2016-04-01 02:51:22 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a taco. We only rate dogs. Please only...,,,,https://twitter.com/dog_rates/status/715733265...,10,10,a,,,,
1340,704859558691414016,,,2016-03-02 02:43:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a heartbreaking scene of an incredible...,,,,https://twitter.com/dog_rates/status/704859558...,10,10,a,,,pupper,


In [68]:
archived_data.retweeted_status_id.notnull().sum()

181

In [110]:
#confirm if rows possible to be retweets contains dog names
archived_data[archived_data['retweeted_status_id'].notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.874740e+17,4.196984e+09,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,1.960740e+07,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4.196984e+09,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4.196984e+09,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4.196984e+09,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Shaggy. He knows exactl...,6.678667e+17,4.196984e+09,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724...,10,10,Shaggy,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Extremely intelligent dog here....,6.671383e+17,4.196984e+09,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269...,10,10,,,,,
1242,711998809858043904,,,2016-03-21 19:31:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @twitter: @dog_rates Awesome Tweet! 12/10. ...,7.119983e+17,7.832140e+05,2016-03-21 19:29:52 +0000,https://twitter.com/twitter/status/71199827977...,12,10,,,,,
2259,667550904950915073,,,2015-11-20 03:51:52 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @dogratingrating: Exceptional talent. Origi...,6.675487e+17,4.296832e+09,2015-11-20 03:43:06 +0000,https://twitter.com/dogratingrating/status/667...,12,10,,,,,


In [117]:
archived_data.rating_denominator.nunique(),archived_data.rating_denominator.duplicated().sum()

(18, 2338)

In [146]:
#archived_data.rating_denominator.value_counts()
archived_data.rating_denominator.isnull().sum().any()

False

In [132]:
archived_data.rating_denominator.unique()

array([ 10,   0,  15,  70,   7,  11, 150, 170,  20,  50,  90,  80,  40,
       130, 110,  16, 120,   2], dtype=int64)

In [136]:
archived_data.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [130]:
#archived_data.rating_numerator.notnull().sum()
archived_data.rating_numerator.isnull().sum()

0

In [131]:
archived_data.rating_numerator.unique()

array([  13,   12,   14,    5,   17,   11,   10,  420,  666,    6,   15,
        182,  960,    0,   75,    7,   84,    9,   24,    8,    1,   27,
          3,    4,  165, 1776,  204,   50,   99,   80,   45,   60,   44,
        143,  121,   20,   26,    2,  144,   88], dtype=int64)

In [159]:
archived_data.rating_numerator.value_counts();

In [152]:
archived_data.query('rating_numerator == 1776')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
979,749981277374128128,,,2016-07-04 15:00:45 +0000,"<a href=""https://about.twitter.com/products/tw...",This is Atticus. He's quite simply America af....,,,,https://twitter.com/dog_rates/status/749981277...,1776,10,Atticus,,,,


In [156]:
archived_data.rating_numerator.isnull().sum()

0

In [158]:
archived_data.expanded_urls.isnull()

0       False
1       False
2       False
3       False
4       False
        ...  
2351    False
2352    False
2353    False
2354    False
2355    False
Name: expanded_urls, Length: 2356, dtype: bool

In [124]:
archived_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

#### ASSESSING IMAGE PREDICTIONS DATA

In [46]:
#Reassign dataframe
image_pred = df2
image_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [47]:
image_pred.shape

(2075, 12)

#### Column Description

- `tweet_id` is the last part of the tweet URL after "status/" 
- `p1` is the algorithm's #1 prediction for the image in the tweet
- `p1_conf` is how confident the algorithm is in its #1 prediction in `%`
- `p1_dog` is whether or not the #1 prediction is a breed of dog
- `p2` is the algorithm's second most likely prediction
- `p2_conf` is how confident the algorithm is in its #2 prediction 
- `p2_dog` is whether or not the #2 prediction is a breed of dog
- `p3` is the algorithm's 3rd most likely prediction
- `p3_conf` is how confident the algorithm is in its #3 prediction 
- `p3_dog` is whether or not the #3 prediction is a breed of dog

In [48]:
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


#### Quality issues
- None

#### Tidiness issues
- None

In [51]:
sum(image_pred.tweet_id.duplicated())

0

#### ASSESSING RETWEET AND FAVORITE COUNT DATA

In [164]:
rtwt_fav_count = df3
rtwt_fav_count.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,retweet_count,favourite_count
0,0,892420643555336193,7340,34978
1,1,892177421306343426,5475,30279
2,2,891815181378084864,3621,22784
3,3,891689557279858688,7529,38245
4,4,891327558926688256,8108,36522


#### Column Description
- `tweet_id`
- `retweet_count`
- `favourite_count`

In [165]:
rtwt_fav_count.shape

(2331, 4)

In [166]:
rtwt_fav_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Unnamed: 0       2331 non-null   int64
 1   tweet_id         2331 non-null   int64
 2   retweet_count    2331 non-null   int64
 3   favourite_count  2331 non-null   int64
dtypes: int64(4)
memory usage: 73.0 KB


#### Quality issues
- first index column not needed - drop

#### Tidiness issues
- None

In [59]:
rtwt_fav_count.isnull().sum().any()

False

### CLEANING

In [215]:
#Make copies of dataframe
archived_data_clean = archived_data
image_pred_clean = image_pred
rtwt_fav_count_clean = rtwt_fav_count

In [216]:
archived_data_clean.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


**Define**

- Name and dog stages row do not obey [Tidy Data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) rule.
- Use pandas method `melt()`: Melt the name, doggo, floofer	pupper, puppo to obey the tidy data rule. 

**Clean**

In [217]:
pd.melt(archived_data_clean, id_vars=['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id','timestamp','source','text','retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp','expanded_urls','rating_numerator','rating_denominator','name','doggo','floofer','pupper','puppo'], 
                                      value_vars = ['doggo','floofer', 'pupper', 'puppo'],
                                      var_name = 'dog_stage')
archived_data_clean 

ValueError: arrays must all be same length

In [198]:
archived_data_clean = pd.melt(archived_data_clean, id_vars=['tweet_id', 'name'], 
                              value_vars = ['doggo','floofer', 'pupper', 'puppo'],
                              var_name = 'dog_stage')
archived_data_clean = archived_data_clean .drop()

In [202]:
#archived_data_clean.sample(20)
archived_data_clean .shape

(2356, 17)

**Test**