<h2 id="tocheading">Table of Contents</h2>
<div id="toc"></div>

In [6]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

### WRANGLE AND ANALYZE DATA PROJECT

#### Introduction

The aim of this project is to wrangle WeRateDogs Twitter data in order to create interesting and trustworthy analyses and visualisations.  
The data wrangling process consists of three phases:
- Gathering data
- Assessing data
- Cleaning data

The cleaned data will then be analysed and visualised to support possible interpretations of the data. A report on each of the two processes (wrangling and analysis) will also be submitted at the end.

#### Gathering Data
Three pieces of data are required for this project:

1. The WeRateDogs Twitter archive data, provided by Udacity and downloaded manually from the Udacity resource center.
2. The tweet image predictions, hosted on Udacity's servers and downloaded programmatically with an HTTP GET request using the `requests` library.
3. Retweet and favorite counts for the tweet IDs in the archive data from item 1. This data is obtained by querying the Twitter API for each tweet's JSON data using Python's Tweepy library and storing each tweet's entire set of JSON data in a file called `tweet_json.txt`.

#### Assessing Data 
- After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues.
- The target is to detect and document at least eight (8) quality issues and two (2) tidiness issues.

#### Cleaning Data
- Issues documented during assessment will be cleaned to produce a high-quality, tidy master pandas DataFrame.

#### Storing, Analyzing, and Visualizing Data 
- Store the clean DataFrame(s) in CSV files, with the main one named `twitter_archive_master.csv`. The wrangled data will be analysed and visualised in a Jupyter notebook, producing at least three (3) insights and one (1) visualisation.
- One written report will describe the wrangling efforts, and a second report will communicate the insights and visualisations.

In [126]:
#Import the necessary modules
import pandas as pd
import numpy as np
import os
#to make HTTP requests
import requests
#to display tables
from IPython.display import display
#to access the Twitter API
import tweepy as tw

#to build pandas dataframes from JSON records
from pandas import DataFrame

#to parse and write JSON
import json


In [127]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px  black solid !important;
  color: black !important;
}
</style> 

### DATA GATHERING

Three pieces of data will be gathered for this project.

**Data One:** The WeRateDogs Twitter archive, provided by Udacity. It was downloaded from the resource centre and loaded into the notebook as a **.csv** file.

In [330]:
#import data to have an overview
df1 = pd.read_csv('twitter_archive_enhanced.csv')

In [323]:
df1.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


In [117]:
df1.shape

(2356, 17)

**Data Two:** The tweet image predictions, which tell what breed of dog (or other object, animal, etc.) is present in each tweet. Udacity provided these predictions, which were produced by a neural network that can classify breeds of dogs.

- This file (image_predictions.tsv) is hosted on Udacity's servers and was downloaded using the **Requests** library.

In [6]:
#Getting the url
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [7]:
#to view raw data
#response.content

In [53]:
#saving file to computer
with open('C:/Users/Frances-Anthony/Documents/Udacity/data_wrangle_analyze_project/image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [331]:
#Load image data
df2 = pd.read_csv('image_predictions.tsv', delimiter="\t")
df2.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


**Data Three:** Each tweet's retweet count and favorite ("like") count, for every `tweet_id` in the downloaded archive data.

**HOW:** 
1. Use the tweet IDs in the `WeRateDogs` Twitter archive to
2. query the Twitter API for each tweet's JSON data using Python's `Tweepy` library, and
3. store each tweet's entire set of JSON data in a file called `tweet_json.txt`.
4. Read the `.txt` file line by line into a pandas DataFrame with these columns: tweet_id, retweet_count, favorite_count.
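
Step 4 is not shown as code in this notebook, so here is a minimal sketch of how it could look. It assumes the file holds one JSON object per line and that each object carries the `id`, `retweet_count` and `favorite_count` keys. The helper name `read_tweet_counts` and the sample values are made up for illustration.

```python
import json
import os
import tempfile


def read_tweet_counts(path):
    """Read one JSON object per line and keep only the three fields of interest."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return rows


# Tiny made-up sample standing in for tweet_json.txt (values are illustrative only)
sample = [
    {"id": 1, "retweet_count": 5, "favorite_count": 10},
    {"id": 2, "retweet_count": 0, "favorite_count": 3},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "tweet_json_sample.txt")
with open(tmp_path, "w", encoding="utf-8") as f:
    for tweet in sample:
        f.write(json.dumps(tweet) + "\n")

rows = read_tweet_counts(tmp_path)
print(rows)
```

Passing the resulting list to `pd.DataFrame(rows)` would give a dataframe with the three columns named in step 4.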

In [None]:
#Tweepy has already been imported above, so no new imports are needed here
#use the tweet_id values from the Udacity-provided archive to query the Twitter API for retweet and "like" counts
#accessing the Twitter API requires the consumer and access keys issued by Twitter
#for privacy, they are left blank here

#define keys
consumer_key= '.......'
consumer_secret = '..........'
access_token = '..........'
access_token_secret = '........'

auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

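#collect retweet and favorite counts for each tweet returned by the Twitter API
retweet_favorite_count = []

#save the IDs of tweets that could not be retrieved in this list
not_found = []

with open('tweet_json.txt', mode = "w") as file:
    for i in list(df1.tweet_id):
        try:
            tweet = api.get_status(str(i))
            #write one JSON object per line so the file can later be read line by line
            file.write(json.dumps(tweet._json) + '\n')
            retweet_favorite_count.append({"tweet_id":str(i),
                                          "retweet_count": tweet._json['retweet_count'],
                                          "favourite_count": tweet._json['favorite_count']})
        except Exception:
            #the tweet may have been deleted or be otherwise unavailable
            not_found.append(i)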

In [90]:
#Build a pandas dataframe with these columns (tweet_id, retweet_count, favourite_count)
#from the records collected during the API query above.
#df3 = pd.DataFrame(retweet_favorite_count, columns=['tweet_id', 'retweet_count', 'favourite_count'])

In [66]:
#write to a csv file (same file name as the one read in the next cell)
#df3.to_csv('retweet_and_favorite_counts.csv')

In [13]:
#read retweet_favorite_counts as csv and assign to dataframe
df3 = pd.read_csv('retweet_and_favorite_counts.csv')
df3.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,retweet_count,favourite_count
0,0,892420643555336193,7340,34978
1,1,892177421306343426,5475,30279
2,2,891815181378084864,3621,22784
3,3,891689557279858688,7529,38245
4,4,891327558926688256,8108,36522


In [14]:
df3.shape

(2331, 4)

## ASSESSING


The next phase of the project is assessing the gathered data. I will be assessing the data for quality and tidiness both manually and programmatically using pandas methods.


#### 1. ASSESSING TWITTER ARCHIVE DATA

In [179]:
#give the dataframe a more descriptive name (this is an alias, not a copy)
archived_data = df1
archived_data.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### COLUMN DESCRIPTION
- tweet_id
- in_reply_to_status_id
- in_reply_to_user_id
- timestamp
- source
- text
- retweeted_status_id
- retweeted_status_user_id
- retweeted_status_timestamp
- expanded_urls
- rating_numerator
- rating_denominator
- name
- doggo
- floofer
- pupper
- puppo

### QUALITY ISSUES
- Missing data: `in_reply_to_status_id` and `in_reply_to_user_id` have values in only 78 of the 2356 rows.
- Not original tweets: non-null values in `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` mean that a row is a retweet and not an original tweet (drop rows).
- Dog name is `None` for 745 rows; check whether these are the same rows as the retweets.
- Dog names recorded as `a` or `an` should be `None`.
- Rows with the name `a` also have a dog stage.
- The archive does not contain retweet and favorite counts.
- Inconsistent `rating_denominator`: the value should be 10, and rows with any other value should be removed.
- Very high `rating_numerator` values, as high as 1776; not an issue, but keep in mind.
- `timestamp` has the object data type; change it to datetime.
- Some rows have several identical values in the `expanded_urls` column, concatenated by commas.

### TIDINESS ISSUES
- Create one column for dog stage: collapse the four stage columns (doggo, floofer, pupper, puppo) into a single `dog_stage` column, so that each row has one column for the dog name and one for the dog stage.
- `expanded_urls` holds multiple URLs in a single row.
- The three datasets can be combined into one, since all of them have `tweet_id`.
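
As a rough illustration of the last point, the three tables could be joined on `tweet_id` with `DataFrame.merge`. The frames and values below are made-up stand-ins, and the inner join is an assumption; a left join would instead keep archive rows that lack predictions or counts.

```python
import pandas as pd

# Toy stand-ins for the three datasets (values are illustrative only)
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
images = pd.DataFrame({"tweet_id": [1, 2], "p1": ["collie", "redbone"]})
counts = pd.DataFrame({"tweet_id": [1, 3], "retweet_count": [7, 9]})

# Inner joins keep only the tweets that appear in all three tables
master = (
    archive
    .merge(images, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)
print(master)
```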

In [16]:
archived_data.shape

(2356, 17)

In [294]:
archived_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [332]:
archived_data.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [18]:
archived_data.puppo.unique()

array(['None', 'puppo'], dtype=object)

In [19]:
archived_data.puppo.value_counts()

None     2326
puppo      30
Name: puppo, dtype: int64

In [20]:
archived_data.retweeted_status_id.nunique() 

181

In [21]:
#archived_data.rating_denominator.isnull()
archived_data.rating_numerator.isnull().sum().any()

False

In [22]:
archived_data.name.nunique(),archived_data.name.value_counts();

In [23]:
#check for duplicates
archived_data.tweet_id.duplicated().sum()

0

In [24]:
archived_data.tweet_id.nunique() #no duplicates

2356

In [25]:
#archived_data.name.isnull().any()
archived_data.query('name == "None"');

In [26]:
#dog names recorded as a or an
#archived_data.query('name == "an"')
archived_data.query('name == "a"');

In [27]:
archived_data.retweeted_status_id.notnull().sum()

181

In [29]:
#check whether rows that may be retweets contain dog names
#archived_data[archived_data['retweeted_status_id'].notnull()]

In [348]:
archived_data.rating_denominator.nunique(),archived_data.rating_denominator.duplicated().sum()

(18, 2338)

In [349]:
#archived_data.rating_denominator.value_counts()
archived_data.rating_denominator.isnull().sum().any()

False

In [350]:
archived_data.rating_denominator.unique()

array([ 10,   0,  15,  70,   7,  11, 150, 170,  20,  50,  90,  80,  40,
       130, 110,  16, 120,   2], dtype=int64)

In [32]:
archived_data.rating_denominator.value_counts();

In [352]:
#archived_data.rating_numerator.notnull().sum()
archived_data.rating_numerator.isnull().sum()

0

In [33]:
archived_data.rating_numerator.unique();

In [35]:
archived_data.rating_numerator.value_counts();

In [36]:
archived_data.query('rating_numerator == 1776')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
979,749981277374128128,,,2016-07-04 15:00:45 +0000,"<a href=""https://about.twitter.com/products/tw...",This is Atticus. He's quite simply America af....,,,,https://twitter.com/dog_rates/status/749981277...,1776,10,Atticus,,,,


In [37]:
archived_data.rating_numerator.isnull().sum()

0

In [38]:
archived_data.expanded_urls.isnull();

#### 2. ASSESSING IMAGE PREDICTIONS DATA

In [332]:
#Reassign dataframe
image_pred = df2
image_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [40]:
image_pred.shape

(2075, 12)

#### Column Description

- `tweet_id` is the last part of the tweet URL after "status/" 
- `p1` is the algorithm's #1 prediction for the image in the tweet
- `p1_conf` is how confident the algorithm is in its #1 prediction, expressed as a proportion between 0 and 1
- `p1_dog` is whether or not the #1 prediction is a breed of dog
- `p2` is the algorithm's second most likely prediction
- `p2_conf` is how confident the algorithm is in its #2 prediction 
- `p2_dog` is whether or not the #2 prediction is a breed of dog
- `p3` is the algorithm's 3rd most likely prediction
- `p3_conf` is how confident the algorithm is in its #3 prediction 
- `p3_dog` is whether or not the #3 prediction is a breed of dog

In [326]:
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


#### Quality issues
- Inconsistent naming of dog breeds in `p1`, `p2` and `p3`: replace underscores with spaces.
- Convert the prediction number to type `int` by removing the letter `p` (e.g. `p1` becomes `1`).
- Inconsistent capitalisation of breed names: convert all breed names to lower case.
- Some predictions are not dog breeds (`p1_dog`, `p2_dog` or `p3_dog` is `False`).

#### Tidiness issues
- Predictions are spread across three columns.
- Confidence values are spread across three columns.
- Dog tests are spread across three columns.
- Reshape (melt) these so that breed and confidence each occupy a single column.
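
One possible way to do the reshaping described above is sketched here on a toy frame. The output column names (`prediction_rank`, `prediction`, `confidence`, `is_dog`) are hypothetical choices for illustration, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with the same wide layout as the image predictions (values are illustrative only)
wide = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["collie", "redbone"], "p1_conf": [0.5, 0.6], "p1_dog": [True, True],
    "p2": ["malinois", "seat_belt"], "p2_conf": [0.3, 0.2], "p2_dog": [True, False],
    "p3": ["bloodhound", "pug"], "p3_conf": [0.1, 0.1], "p3_dog": [True, True],
})

parts = []
for rank in (1, 2, 3):
    # Take the three columns for this rank and give them rank-independent names
    part = wide[["tweet_id", f"p{rank}", f"p{rank}_conf", f"p{rank}_dog"]].copy()
    part.columns = ["tweet_id", "prediction", "confidence", "is_dog"]
    part.insert(1, "prediction_rank", rank)  # integer rank, no letter "p"
    parts.append(part)

# Stack the three pieces: one row per (tweet, prediction rank)
long_df = (
    pd.concat(parts, ignore_index=True)
    .sort_values(["tweet_id", "prediction_rank"])
    .reset_index(drop=True)
)
print(long_df)
```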

In [50]:
sum(image_pred.tweet_id.duplicated())

0

In [275]:
image_pred.p1_dog.unique(), image_pred.p2_dog.unique(), image_pred.p3_dog.unique()

(array([ True, False]), array([ True, False]), array([ True, False]))

In [306]:
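#select rows where p1 is missing (comparing against the string "NaN" does not detect missing values)
image_pred[image_pred.p1.isnull()]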

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [400]:
image_pred.p1.unique();

#### 3. ASSESSING RETWEET AND FAVORITE COUNT DATA

In [43]:
rtwt_fav_count = df3
rtwt_fav_count.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,retweet_count,favourite_count
0,0,892420643555336193,7340,34978
1,1,892177421306343426,5475,30279
2,2,891815181378084864,3621,22784
3,3,891689557279858688,7529,38245
4,4,891327558926688256,8108,36522


#### Column Description
- `tweet_id`: the unique identifier of the tweet
- `retweet_count`: number of retweets for each tweet ID
- `favourite_count`: number of likes for each tweet ID

In [44]:
rtwt_fav_count.shape

(2331, 4)

In [53]:
rtwt_fav_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Unnamed: 0       2331 non-null   int64
 1   tweet_id         2331 non-null   int64
 2   retweet_count    2331 non-null   int64
 3   favourite_count  2331 non-null   int64
dtypes: int64(4)
memory usage: 73.0 KB


#### Quality issues
- The first column (`Unnamed: 0`) is not needed; drop it.

#### Tidiness issues
- None
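
The unneeded first column noted under the quality issues is typically a saved index. Here is a small sketch of two ways to handle it, using made-up CSV text in place of the real file: drop the column after loading, or read it back as the index with `index_col=0`.

```python
import io
import pandas as pd

# Made-up CSV text with an unnamed leading index column (values are illustrative only)
csv_text = (
    ",tweet_id,retweet_count,favourite_count\n"
    "0,11,7340,34978\n"
    "1,12,5475,30279\n"
)

# Option 1: load as-is, then drop the leftover index column
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.drop(columns=["Unnamed: 0"])

# Option 2: tell read_csv that the first column is the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(cleaned.columns.tolist())
print(indexed.columns.tolist())
```

Writing with `to_csv(..., index=False)` in the first place would avoid the extra column altogether.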

In [54]:
rtwt_fav_count.isnull().sum().any()

False

In [288]:
#check for missing or NaN values (isnull detects real missing values; comparing to the string "NaN" does not)
rtwt_fav_count[rtwt_fav_count.favourite_count.isnull() | rtwt_fav_count.retweet_count.isnull()]

Unnamed: 0,tweet_id,retweet_count,favourite_count


### CLEANING

**CLEANING FOR ARCHIVED DATA**

In [259]:
df1 = pd.read_csv('twitter_archive_enhanced.csv')

In [260]:
archived_data = df1

In [333]:
#Make copies of the dataframes so the originals stay untouched during cleaning
archived_data_clean = archived_data.copy()
image_pred_clean = image_pred.copy()
rtwt_fav_count_clean = rtwt_fav_count.copy()

In [185]:
archived_data_clean.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


<font color='red'> **TIDINESS ISSUE - 1** </font>

**Define**

- The dog stage columns do not obey the [Tidy Data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) rules: one variable (dog stage) is spread across four columns (**doggo, floofer, pupper, puppo**).
- Use the pandas method `assign` to combine these four columns into a single `dog_stage` column so that the data obeys the tidy data rules.

**Clean**

In [194]:
#create a copy of the dataframe
archived_data_clean_melt = archived_data_clean.copy()

In [169]:
#concatenate the four stage columns and count each resulting combination
archived_data_clean_melt['dog_stage'] = archived_data_clean_melt.doggo + archived_data_clean_melt.floofer + archived_data_clean_melt.pupper + archived_data_clean_melt.puppo
archived_data_clean_melt['dog_stage'].value_counts()

NoneNoneNoneNone        1976
NoneNonepupperNone       245
doggoNoneNoneNone         83
NoneNoneNonepuppo         29
doggoNonepupperNone       12
NoneflooferNoneNone        9
doggoNoneNonepuppo         1
doggoflooferNoneNone       1
Name: dog_stage, dtype: int64

In [201]:
#use the assign method to build the new dog_stage column from the four stage columns
archived_data_clean_melt = archived_data_clean_melt.assign(dog_stage = archived_data_clean_melt.doggo.astype(str) + archived_data_clean_melt.floofer.astype(str) + 
archived_data_clean_melt.pupper.astype(str) + archived_data_clean_melt.puppo.astype(str))                                             

In [202]:
archived_data_clean_melt.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,dog_stage
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,NoneNoneNoneNone
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,NoneNoneNoneNone
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,NoneNoneNoneNone
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,NoneNoneNoneNone
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,NoneNoneNoneNone


In [203]:
#replace every 'None' substring with an empty string
archived_data_clean_melt['dog_stage'] = archived_data_clean_melt['dog_stage'].map(lambda x: x.replace("None",""))

In [204]:
archived_data_clean_melt.sample(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,dog_stage
522,809808892968534016,,,2016-12-16 17:14:20 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Maximus. His face is st...,7.939622e+17,4196984000.0,2016-11-02 23:45:19 +0000,https://twitter.com/dog_rates/status/793962221...,12,10,Maximus,,,,,
655,791784077045166082,,,2016-10-27 23:30:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: I'm not sure what this dog is d...,6.820881e+17,4196984000.0,2015-12-30 06:37:25 +0000,"https://vine.co/v/iqMjlxULzbn,https://vine.co/...",12,10,,,,,,
939,753039830821511168,,,2016-07-13 01:34:21 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",So this just changed my life. 13/10 please enj...,,,,https://vine.co/v/5W2Dg3XPX7a,13,10,,,,,,


In [205]:
#drop other stage columns
archived_data_clean_melt.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis =1, inplace=True)
archived_data_clean_melt.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_stage
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,


In [206]:
#view unique values in column
archived_data_clean_melt.dog_stage.unique()

array(['', 'doggo', 'puppo', 'pupper', 'floofer', 'doggopuppo',
       'doggofloofer', 'doggopupper'], dtype=object)

In [207]:
#delete rows with empty dog stage
archived_data_clean_melt = archived_data_clean_melt[(archived_data_clean_melt.dog_stage != "")] 

In [212]:
archived_data_clean_melt.dog_stage.unique()

array(['doggo', 'puppo', 'pupper', 'floofer', 'doggopuppo',
       'doggofloofer', 'doggopupper'], dtype=object)

In [208]:
archived_data_clean_melt.shape

(380, 14)

**Test**

In [213]:
archived_data_clean_melt['dog_stage'].value_counts()

pupper          245
doggo            83
puppo            29
doggopupper      12
floofer           9
doggopuppo        1
doggofloofer      1
Name: dog_stage, dtype: int64
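
Plain concatenation yields fused labels such as `doggopupper`. As an alternative sketch (not the approach used above), the stages could be joined with a separator so that multi-stage rows stay readable. The toy frame and the helper `join_stages` are illustrative assumptions.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy frame using the same "None" string placeholder as the archive (values are illustrative only)
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})


def join_stages(row):
    """Join the stages present in a row with ', '; return None when no stage is set."""
    stages = [value for value in row if value != "None"]
    return ", ".join(stages) if stages else None


df["dog_stage"] = df[stage_cols].apply(join_stages, axis=1)
df = df.drop(columns=stage_cols)
print(df)
```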

<font color='blue'> **QUALITY ISSUE - 1** </font>

**Define**

Some dog names are recorded as `a` or `an`, probably an extraction error where the real name is missing. Rename them to `None`.

**Code**

In [241]:
archived_data_clean.head();

In [239]:
archived_data_clean.query('name == "a"');

In [240]:
archived_data_clean.query('name == "an"');

In [219]:
#rename rows with name as 'a' and 'an' to None
archived_data_clean.loc[archived_data_clean['name'] == "a", 'name'] = 'None'
archived_data_clean.loc[archived_data_clean['name'] == "an", 'name'] = 'None'
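
If other non-name words were captured in the same way (an assumption that has not been verified in this notebook), one option is to flag every all-lowercase value instead of listing `a` and `an` one by one. This is a sketch on made-up names.

```python
import pandas as pd

# Made-up names; "None" is the placeholder string already used in the archive
names = pd.DataFrame({"name": ["Phineas", "a", "an", "None", "Tilly"]})

# str.islower() is True only for values whose cased characters are all lowercase
is_lower = names["name"].str.islower()
names.loc[is_lower, "name"] = "None"
print(names["name"].tolist())
```

This treats any lowercase entry as a non-name, which is a heuristic and could misfire on a genuinely lowercase name.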

**Test**

In [224]:
archived_data_clean.query('name == "an"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [223]:
archived_data_clean.query('name == "a"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


<font color='blue'> **QUALITY ISSUE - 2** </font>

**Define**

Inconsistent `rating_denominator`. The denominator should always be **10**. Select the rows where it is not 10 and drop them.

**Code**

In [234]:
not_ten = list(archived_data_clean.query('rating_denominator !=10').index)
archived_data_clean.drop(index=not_ten, inplace = True)
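
The same filter can also be written as a boolean mask that returns a new dataframe instead of modifying one in place. Here is a sketch on a toy frame; the values are made up.

```python
import pandas as pd

# Toy ratings (values are illustrative only)
ratings = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [13, 84, 12, 1],
    "rating_denominator": [10, 70, 10, 2],
})

# Keep only rows whose denominator is exactly 10; the original frame is left untouched
kept = ratings[ratings["rating_denominator"] == 10].reset_index(drop=True)
dropped = len(ratings) - len(kept)
print(kept)
print(dropped)
```

Comparing the lengths before and after gives a quick check on how many rows the filter removed.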

In [235]:
archived_data_clean.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


In [236]:
archived_data_clean.rating_denominator.unique()

array([10], dtype=int64)

In [233]:
#archived_data_clean.drop(archived_data_clean.index[archived_data_clean['rating_denominator'] > 10])

**TEST**

In [237]:
archived_data_clean.rating_denominator.unique()

array([10], dtype=int64)

In [238]:
archived_data_clean.rating_denominator.value_counts()

10    2333
Name: rating_denominator, dtype: int64

<font color='blue'> **QUALITY ISSUE - 3** </font>

**DEFINE**

- Missing data: `in_reply_to_status_id` and `in_reply_to_user_id` have values in only 78 of the 2356 rows. Drop these columns.

In [262]:
archived_data_clean.in_reply_to_user_id.count()

78

In [263]:
archived_data_clean.in_reply_to_user_id.isnull().sum(), archived_data_clean.in_reply_to_status_id.isnull().sum()

(2278, 2278)

**CODE**

In [264]:
#drop columns
archived_data_clean = archived_data_clean.drop(['in_reply_to_status_id','in_reply_to_user_id'], axis=1)

**TEST**

In [265]:
archived_data_clean.head(2)

Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


<font color='blue'> **QUALITY ISSUE - 4** </font>

**DEFINE**

- Not original tweets: rows with values in `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` are retweets rather than original posts, and retweets are not of interest in this analysis. Drop these columns.

**CODE**

In [266]:
archived_data_clean.retweeted_status_id.isnull().sum(), archived_data_clean.retweeted_status_user_id.isnull().sum()

(2175, 2175)

In [267]:
#drop columns
archived_data_clean = archived_data_clean.drop(['retweeted_status_id','retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1)
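
Dropping the columns alone leaves the retweet rows in the table. If the retweets themselves should be excluded, one possible sketch is to filter on `retweeted_status_id` first and drop the column afterwards. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy tweets: the second row is a retweet (values are illustrative only)
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": ["This is Rex. 12/10", "RT @dog_rates: This is Max. 13/10", "This is Bo. 11/10"],
    "retweeted_status_id": [np.nan, 7.9e17, np.nan],
})

# Keep only original tweets, i.e. rows with no retweeted_status_id
originals = tweets[tweets["retweeted_status_id"].isnull()]

# The column now holds only missing values, so it can be dropped
originals = originals.drop(columns=["retweeted_status_id"])
print(originals)
```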

**TEST**

In [268]:
archived_data_clean.head(2)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


<font color='blue'> **QUALITY ISSUE - 5** </font>

**DEFINE**
- The `timestamp` column is stored as object; convert it to datetime.

**CODE**

In [269]:
archived_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int64 
 6   rating_denominator  2356 non-null   int64 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
dtypes: int64(3), object(9)
memory usage: 221.0+ KB


In [271]:
archived_data_clean.timestamp = archived_data_clean.timestamp.astype('datetime64')

**TEST**

In [272]:
archived_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   tweet_id            2356 non-null   int64         
 1   timestamp           2356 non-null   datetime64[ns]
 2   source              2356 non-null   object        
 3   text                2356 non-null   object        
 4   expanded_urls       2297 non-null   object        
 5   rating_numerator    2356 non-null   int64         
 6   rating_denominator  2356 non-null   int64         
 7   name                2356 non-null   object        
 8   doggo               2356 non-null   object        
 9   floofer             2356 non-null   object        
 10  pupper              2356 non-null   object        
 11  puppo               2356 non-null   object        
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 221.0+ KB
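
Casting with `astype('datetime64')` and no unit may be rejected by newer pandas releases. A hedged alternative is `pd.to_datetime`, sketched here on two timestamp strings taken from the rows shown earlier. Note that the `+0000` offset makes the parsed column timezone-aware (UTC), so the dtype differs from the output above unless the timezone is removed.

```python
import pandas as pd

# Two timestamp strings in the same format as the archive's timestamp column
timestamps = pd.Series([
    "2017-08-01 16:23:56 +0000",
    "2017-08-01 00:17:27 +0000",
])

# utc=True parses the +0000 offset into a timezone-aware UTC column
parsed = pd.to_datetime(timestamps, utc=True)

# Optional: drop the timezone to get a naive datetime column
naive = parsed.dt.tz_localize(None)
```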


**CLEANING FOR IMAGE PREDICTION DATA**

<font color='red'> **TIDINESS ISSUE** </font>

In [527]:
image_pred_clean = image_pred.copy()
image_pred_clean.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


**DEFINE**
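- Predicted breeds, prediction confidences and the dog/not-dog flags are each stored in three columns (`p1` to `p3`). Melt each group so that every row holds one prediction, with columns for prediction number, predicted breed, confidence and dog flag.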

**CODE**

In [528]:
#melt the prediction columns p1, p2, p3 into prediction number and predicted breed
#note: 'p2_conf' is not listed in id_vars, so it is melted together with p1, p2 and p3
image_pred_clean = pd.melt(image_pred_clean, id_vars=['tweet_id', 'jpg_url', 'img_num', 'p1_conf', 'p1_dog', 'p2_dog', 'p3_conf', 'p3_dog'], var_name = 'pred_number',
                          value_name = 'predicted_breed')
#melt the remaining confidence columns (p1_conf, p3_conf) into the confidence of each prediction
image_pred_clean = pd.melt(image_pred_clean, id_vars=['tweet_id', 'jpg_url', 'img_num', 'p1_dog', 'p2_dog', 'p3_dog', 'pred_number', 'predicted_breed'], var_name = 'conf',
                          value_name = 'conf_interval')
#keep only rows where the prediction number matches the confidence column
#(p2 rows have no matching confidence column and are dropped here)
image_pred_clean = image_pred_clean[image_pred_clean['pred_number'] == image_pred_clean['conf'].str[:2]]

#melt p1_dog to p3_dog into a flag showing whether the prediction is a dog or not
image_pred_clean = pd.melt(image_pred_clean, id_vars=['tweet_id', 'jpg_url', 'img_num', 'pred_number', 'predicted_breed','conf', 'conf_interval'], var_name = 'dog_pred_num',
                          value_name = 'dog_prediction')
#keep only rows where the prediction number matches the dog flag column
image_pred_clean = image_pred_clean[image_pred_clean['pred_number'] == image_pred_clean['dog_pred_num'].str[:2]]

image_pred_clean

Unnamed: 0,tweet_id,jpg_url,img_num,pred_number,predicted_breed,conf,conf_interval,dog_pred_num,dog_prediction
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,p1,Welsh_springer_spaniel,p1_conf,0.465074,p1_dog,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,p1,redbone,p1_conf,0.506826,p1_dog,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,p1,German_shepherd,p1_conf,0.596461,p1_dog,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,p1,Rhodesian_ridgeback,p1_conf,0.408143,p1_dog,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,p1,miniature_pinscher,p1_conf,0.560311,p1_dog,True
...,...,...,...,...,...,...,...,...,...
12445,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,p3,German_short-haired_pointer,p3_conf,0.175219,p3_dog,True
12446,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,p3,spatula,p3_conf,0.040836,p3_dog,False
12447,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,p3,kelpie,p3_conf,0.031379,p3_dog,True
12448,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,p3,papillon,p3_conf,0.068957,p3_dog,True


In [529]:
image_pred.shape, image_pred_clean.shape

((2075, 12), (4150, 9))
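
The cleaned frame has twice 2075 rows rather than three times: `p2_conf` is absent from `id_vars` in the first melt, so the `p2` rows never meet a matching confidence column and are filtered out (the displayed rows contain only `p1` and `p3`). A minimal sketch of one way to keep all three predictions, assuming that is the intent: build one slice per prediction number and concatenate them. The helper `stack_predictions` and the toy values are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Toy frame with the same column layout as image_pred (values are illustrative)
toy = pd.DataFrame({
    "tweet_id": [1, 2],
    "jpg_url": ["a.jpg", "b.jpg"],
    "img_num": [1, 1],
    "p1": ["redbone", "collie"],
    "p1_conf": [0.5, 0.6],
    "p1_dog": [True, True],
    "p2": ["malinois", "spatula"],
    "p2_conf": [0.3, 0.2],
    "p2_dog": [True, False],
    "p3": ["kelpie", "papillon"],
    "p3_conf": [0.1, 0.1],
    "p3_dog": [True, True],
})


def stack_predictions(df):
    """Return one row per (tweet, prediction number)."""
    id_cols = ["tweet_id", "jpg_url", "img_num"]
    parts = []
    for n in (1, 2, 3):
        # Select the three columns of prediction n and give them common names
        part = df[id_cols + [f"p{n}", f"p{n}_conf", f"p{n}_dog"]].rename(
            columns={
                f"p{n}": "predicted_breed",
                f"p{n}_conf": "conf_interval",
                f"p{n}_dog": "dog_prediction",
            }
        )
        parts.append(part.assign(pred_number=n))
    stacked = pd.concat(parts, ignore_index=True)
    ordered = id_cols + ["pred_number", "predicted_breed",
                         "conf_interval", "dog_prediction"]
    return stacked[ordered]


tidy = stack_predictions(toy)
```

With this approach `pred_number` is already an integer, and the row count is three times the number of input rows.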

In [530]:
image_pred_clean = image_pred_clean.drop(['conf','dog_pred_num'], axis=1)

In [531]:
image_pred_clean

Unnamed: 0,tweet_id,jpg_url,img_num,pred_number,predicted_breed,conf_interval,dog_prediction
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,p1,Welsh_springer_spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,p1,redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,p1,German_shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,p1,Rhodesian_ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,p1,miniature_pinscher,0.560311,True
...,...,...,...,...,...,...,...
12445,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,p3,German_short-haired_pointer,0.175219,True
12446,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,p3,spatula,0.040836,False
12447,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,p3,kelpie,0.031379,True
12448,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,p3,papillon,0.068957,True


In [532]:
image_pred_clean.duplicated().any()

False

In [533]:
image_pred_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4150 entries, 0 to 12449
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tweet_id         4150 non-null   int64  
 1   jpg_url          4150 non-null   object 
 2   img_num          4150 non-null   int64  
 3   pred_number      4150 non-null   object 
 4   predicted_breed  4150 non-null   object 
 5   conf_interval    4150 non-null   float64
 6   dog_prediction   4150 non-null   bool   
dtypes: bool(1), float64(1), int64(2), object(3)
memory usage: 231.0+ KB


<font color='blue'> **QUALITY ISSUE - 1** </font>

**DEFINE**

- Prediction number should be an integer. Remove the `p` prefix from the `pred_number` column and convert the column to an integer type.

**CODE**

In [534]:
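#strip every non-digit character; regex=True is required because newer pandas treats the pattern as a literal by default
image_pred_clean['pred_number'] = image_pred_clean['pred_number'].str.replace(r'\D', '', regex=True).astype(int)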

In [535]:
image_pred_clean

Unnamed: 0,tweet_id,jpg_url,img_num,pred_number,predicted_breed,conf_interval,dog_prediction
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,1,Welsh_springer_spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,1,redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,1,German_shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,1,Rhodesian_ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,1,miniature_pinscher,0.560311,True
...,...,...,...,...,...,...,...
12445,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,3,German_short-haired_pointer,0.175219,True
12446,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,3,spatula,0.040836,False
12447,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,3,kelpie,0.031379,True
12448,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,3,papillon,0.068957,True


In [536]:
image_pred_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4150 entries, 0 to 12449
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tweet_id         4150 non-null   int64  
 1   jpg_url          4150 non-null   object 
 2   img_num          4150 non-null   int64  
 3   pred_number      4150 non-null   int32  
 4   predicted_breed  4150 non-null   object 
 5   conf_interval    4150 non-null   float64
 6   dog_prediction   4150 non-null   bool   
dtypes: bool(1), float64(1), int32(1), int64(2), object(2)
memory usage: 214.8+ KB


<font color='blue'> **QUALITY ISSUE - 2** </font>

**DEFINE**
- Breed names use `_` as a word separator. Replace `_` with a space in the `predicted_breed` column.

**CODE**

In [537]:
image_pred_clean['predicted_breed'] = image_pred_clean['predicted_breed'].str.replace('_', ' ', regex=False)

In [538]:
image_pred_clean

Unnamed: 0,tweet_id,jpg_url,img_num,pred_number,predicted_breed,conf_interval,dog_prediction
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,1,Welsh springer spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,1,redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,1,German shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,1,Rhodesian ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,1,miniature pinscher,0.560311,True
...,...,...,...,...,...,...,...
12445,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,3,German short-haired pointer,0.175219,True
12446,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,3,spatula,0.040836,False
12447,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,3,kelpie,0.031379,True
12448,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,3,papillon,0.068957,True


<font color='blue'> **QUALITY ISSUE - 3** </font>

**DEFINE**
- Breed names are capitalised inconsistently. Convert all breed names to lower case.

**CODE**

In [539]:
image_pred_clean['predicted_breed'] = image_pred_clean['predicted_breed'].str.lower()

In [540]:
image_pred_clean

Unnamed: 0,tweet_id,jpg_url,img_num,pred_number,predicted_breed,conf_interval,dog_prediction
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,1,welsh springer spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,1,redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,1,german shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,1,rhodesian ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,1,miniature pinscher,0.560311,True
...,...,...,...,...,...,...,...
12445,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,3,german short-haired pointer,0.175219,True
12446,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,3,spatula,0.040836,False
12447,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,3,kelpie,0.031379,True
12448,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,3,papillon,0.068957,True
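
The two breed-name fixes above are checked by eye in the displayed table. A minimal sketch of a programmatic test, run on three breed names taken from the output above: assert that no underscore remains and that every name is lower case.

```python
import pandas as pd

# Three breed names as they appear before cleaning
breeds = pd.Series([
    "Welsh_springer_spaniel",
    "redbone",
    "German_short-haired_pointer",
])

# Same two cleaning steps as above: underscores to spaces, then lower case
cleaned = breeds.str.replace("_", " ", regex=False).str.lower()

# Programmatic checks instead of visual inspection
assert not cleaned.str.contains("_", regex=False).any()
assert (cleaned == cleaned.str.lower()).all()
```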


<font color='blue'> **QUALITY ISSUE - 4** </font>

**DEFINE**
- Some predictions are not dogs (`dog_prediction` is False). Filter out those rows.

In [541]:
image_pred_clean.dog_prediction.value_counts()

True     3031
False    1119
Name: dog_prediction, dtype: int64

**CODE**

In [542]:
#keep only rows where the prediction is a dog
image_pred_clean = image_pred_clean[image_pred_clean['dog_prediction']]

**TEST**

In [544]:
image_pred_clean

Unnamed: 0,tweet_id,jpg_url,img_num,pred_number,predicted_breed,conf_interval,dog_prediction
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,1,welsh springer spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,1,redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,1,german shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,1,rhodesian ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,1,miniature pinscher,0.560311,True
...,...,...,...,...,...,...,...
12441,890609185150312448,https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg,1,3,chesapeake bay retriever,0.118184,True
12442,890729181411237888,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,2,3,pembroke,0.076507,True
12445,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,3,german short-haired pointer,0.175219,True
12447,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,3,kelpie,0.031379,True


**CLEANING RETWEET AND FAVORITE COUNT DATA**

**DEFINE**

- Drop the first column (`Unnamed: 0`); it is not needed.

**CODE**

In [276]:
rtwt_fav_count.head(2)

Unnamed: 0.1,Unnamed: 0,tweet_id,retweet_count,favourite_count
0,0,892420643555336193,7340,34978
1,1,892177421306343426,5475,30279


In [280]:
#drop unnamed column
rtwt_fav_count.drop('Unnamed: 0', axis=1, inplace=True)

**TEST**

In [281]:
rtwt_fav_count.head()

Unnamed: 0,tweet_id,retweet_count,favourite_count
0,892420643555336193,7340,34978
1,892177421306343426,5475,30279
2,891815181378084864,3621,22784
3,891689557279858688,7529,38245
4,891327558926688256,8108,36522
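
An `Unnamed: 0` column usually appears when a DataFrame is written to CSV together with its index and then read back. Assuming that is how it arose here (the notebook does not show how the file was saved), the sketch below reproduces the effect on a hypothetical toy frame and shows that writing with `index=False` avoids it.

```python
import io

import pandas as pd

# Hypothetical toy frame with the same columns as rtwt_fav_count
counts = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweet_count": [10, 20],
    "favourite_count": [30, 40],
})

# Default to_csv writes the index, which comes back as an unnamed column
buffer = io.StringIO()
counts.to_csv(buffer)
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# index=False leaves the index out of the file
buffer = io.StringIO()
counts.to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)
```

Reading an existing file with `pd.read_csv(..., index_col=0)` is another option when the file cannot be rewritten.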
