# Data Wrangling for WeRateDogs Twitter archive

## Table of Contents

<ul>
<li><a href="#intro">1 Introduction</a></li>
<li><a href="#wrangling">2 Data Wrangling</a></li>
<li><a href="#eda">3 Exploratory Data Analysis</a></li>
<li><a href="#conclusions">4 Conclusion and limitations</a></li>
<li><a href="#Appendix">5 Appendix</a></li>
</ul>



<a id='intro'></a>
## 1 Introduction
> This notebook covers the data wrangling part of the 'Wrangle and Analyze Data' project. The wrangling process has three components: gathering, assessing, and cleaning data. At the very end of this notebook, the cleaned data is stored in .csv files for later analysis and visualization.

In [1]:
import numpy as np
import pandas as pd
import requests
import io
import tweepy
from tweepy import OAuthHandler
import json
import timeit
import config # info of twitter API secrets and keys
import re
import datetime
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns

## 2 Gathering Data
There are three data sources:
* Downloaded manually: `twitter-archive-enhanced.csv`
* Downloaded from Udacity's servers: `image-predictions.tsv`
* Retrieved with Tweepy: `tweet_json.txt`

>`twitter-archive-enhanced.csv`: This file was downloaded manually and is stored in the same directory as this notebook.

>`image-predictions.tsv`: This file is obtained with the requests library in section 2.1.

>`tweet_json.txt`: This file is obtained with Tweepy through the Twitter API in section 2.2.

### 2.1 Read `twitter-archive-enhanced.csv` from the local file

In [2]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

### 2.1 Extract `image-predictions.tsv` from Udacity's servers

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
urlData = response.content
img_pred = pd.read_csv(io.StringIO(urlData.decode('utf-8')), sep='\t')

In [4]:
img_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### 2.2 Extract data using twitter API

In [5]:
consumer_key = config.consumer_key
consumer_secret = config.consumer_secret
access_token = config.access_token
access_secret = config.access_secret

In [6]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [7]:
# start = timeit.default_timer() # start a timer (timeit.timeit() would time a dummy statement instead)
# fails_dict = {} # collect the ids that could not be retrieved (e.g. deleted tweets)
# count = 0 # track the processing status
# with open('tweet_json.txt', 'w') as outfile:
#     for twt_id in img_pred['tweet_id']:
#         try:
#             # rate-limit waiting is configured on the API object above
#             tweet = api.get_status(twt_id, tweet_mode='extended')
#             print('{} record success'.format(count), end="\r")
#             json.dump(tweet._json, outfile)
#             outfile.write('\n')
#         except tweepy.TweepError as e:
#             print('Fail', end="\r")
#             fails_dict[twt_id] = e
#         count += 1
# end = timeit.default_timer()

In [8]:
df_api = pd.DataFrame(columns=['id','display_text_range','retweet_count','favorite_count'])
with open('tweet_json.txt') as json_file:
    for line in json_file:
        data_str = json.loads(line)
        data_parse = pd.DataFrame.from_dict(data_str,orient="index")
        data_interested = data_parse[0][['id','display_text_range','retweet_count','favorite_count']]
        df_api = df_api.append(data_interested,ignore_index=True)
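
`DataFrame.append` was removed in pandas 2.0, and growing a frame one row at a time copies it on every iteration. The following is a minimal sketch of an alternative, assuming the same one-JSON-object-per-line layout as `tweet_json.txt`. The helper name `load_tweet_fields` and the two sample tweets are made up for illustration. With this approach pandas infers the dtypes, so the counts come out as integers rather than `object`.

```python
import io
import json

import pandas as pd

FIELDS = ['id', 'display_text_range', 'retweet_count', 'favorite_count']


def load_tweet_fields(lines, fields=FIELDS):
    """Build a DataFrame from an iterable of JSON lines, keeping only `fields`."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        records.append({field: tweet.get(field) for field in fields})
    return pd.DataFrame(records, columns=fields)


# Two made-up tweets stand in for the contents of tweet_json.txt.
sample = io.StringIO(
    '{"id": 1, "display_text_range": [0, 131], "retweet_count": 466, "favorite_count": 2434, "lang": "en"}\n'
    '{"id": 2, "display_text_range": [0, 139], "retweet_count": 42, "favorite_count": 121, "lang": "en"}\n'
)
df_sketch = load_tweet_fields(sample)
print(df_sketch)
```

With the real file, the call would be `load_tweet_fields(open('tweet_json.txt'))`, ideally inside a `with` block.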

In [9]:
df_api.head()

Unnamed: 0,id,display_text_range,retweet_count,favorite_count
0,666020888022790149,"[0, 131]",466,2434
1,666029285002620928,"[0, 139]",42,121
2,666033412701032449,"[0, 130]",41,113
3,666044226329800704,"[0, 137]",133,274
4,666049248165822465,"[0, 120]",41,99


## 3 Data Wrangling
In the previous section, three tables were generated from different sources. In this section, each table is carefully assessed and cleaned. The three dataframes are listed below:

* `twitter_archive`: read from a .csv file
* `img_pred`: retrieved from Udacity's servers
* `df_api`: retrieved from the Twitter API

### 3.1 Data Assessing

#### 3.1.1 Data Assessing: `twitter_archive` table
**Quality issues**
* more than 90% of the values are NaN in the `in_reply_to_*` and `retweeted_status_*` columns (only 78 and 181 of 2356 entries are non-null)
* redundant information (HTML tags) in the `source` column
* the string 'None' instead of missing values in columns `['doggo','floofer','pupper','puppo']`
* incorrect ratings
* incorrect values in `['doggo','floofer','pupper','puppo']`
* erroneous datatypes (`timestamp`, `source`, `doggo`, `floofer`, `pupper`, `puppo`)
* incorrect dog names: for entries named 'a' or 'None', some dogs have a name in the text and some do not
* contains retweeted tweets (without image)

**Tidiness issues**
* the `text` column contains multiple variables: text, rating and url
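
Several of the issues listed above can be quantified programmatically. The snippet below is only a sketch on a made-up toy frame that mimics the archive's layout; the rows are not real tweets. The lowercase-name heuristic is an assumption for illustration, not a rule taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the archive's layout; not real tweets.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [np.nan, np.nan, 10.0, np.nan],
    'retweeted_status_id': [np.nan, np.nan, np.nan, 20.0],
    'name': ['Phineas', 'a', 'None', 'Tilly'],
    'doggo': ['None', 'doggo', 'None', 'None'],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# 'None' stored as text is not counted by isna(), so count it separately.
none_strings = toy[['name', 'doggo']].eq('None').sum()

# Assumption: a name starting with a lowercase letter is an ordinary word, not a name.
suspect_names = toy.loc[toy['name'].str.match(r'^[a-z]'), 'name']

print(missing_share)
print(none_strings)
print(suspect_names.tolist())
```

On the real table, the same three checks would run on `twitter_archive` with the four stage columns in place of `doggo`.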


In [10]:
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [11]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [12]:
twitter_archive.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [13]:
for i in range(0,100):
    print('record index: '+ str(i) + '\n'+ twitter_archive.text[i] + '\nstage: ' + twitter_archive.puppo[i])

record index: 0
This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
stage: None
record index: 1
This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
stage: None
record index: 2
This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
stage: None
record index: 3
This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
stage: None
record index: 4
This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
stage: None
record index: 5
Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fD

In [14]:
twitter_archive['doggo'].value_counts()

None     2259
doggo      97
Name: doggo, dtype: int64

In [15]:
twitter_archive['name'].value_counts()

None        745
a            55
Charlie      12
Lucy         11
Oliver       11
           ... 
Marlee        1
Timofy        1
Kingsley      1
Beemo         1
Vixen         1
Name: name, Length: 957, dtype: int64

In [16]:
pd.set_option('display.max_colwidth', None)  # None shows the full text; -1 is deprecated
twitter_archive[twitter_archive.name == 'a'].text


56      Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF
649     Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq                           
801     Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn               
1002    This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys. 8/10 https://t.co/TGenMeXreW               
1004    Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R                 
1017    This is a carrot. We only rate dogs. Please only send in dogs. You all really should know this by now ...11/10 https://t.co/9e48aPrBm2                     
1049    This is 

####  3.1.2 Data Assessing: `img_pred` table

In [17]:
img_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [18]:
img_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [19]:
img_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


####  3.1.3 Data Assessing: `df_api` table
**Quality issues**
* `display_text_range` is stored as a bracketed list, and the starting point of the range (0 in the rows shown) is redundant
* erroneous datatypes for `display_text_range`, `retweet_count` and `favorite_count`
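
One possible way to address both issues is sketched below on a made-up frame with the same shape as `df_api`. The column name `text_length` is a hypothetical choice, and the sketch assumes that every range is a two-element list.

```python
import pandas as pd

# Made-up rows in the same shape as df_api, with every column stored as object.
toy_api = pd.DataFrame({
    'id': [666020888022790149, 666029285002620928],
    'display_text_range': [[0, 131], [0, 139]],
    'retweet_count': [466, 42],
    'favorite_count': [2434, 121],
}).astype(object)

cleaned = toy_api.copy()

# Keep only the upper bound of the range as the length of the displayed text.
cleaned['text_length'] = cleaned['display_text_range'].str[1].astype('int64')
cleaned = cleaned.drop(columns='display_text_range')

# Cast the id and the counters from object to integers.
for col in ['id', 'retweet_count', 'favorite_count']:
    cleaned[col] = cleaned[col].astype('int64')

print(cleaned.dtypes)
```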


In [20]:
df_api

Unnamed: 0,id,display_text_range,retweet_count,favorite_count
0,666020888022790149,"[0, 131]",466,2434
1,666029285002620928,"[0, 139]",42,121
2,666033412701032449,"[0, 130]",41,113
3,666044226329800704,"[0, 137]",133,274
4,666049248165822465,"[0, 120]",41,99
...,...,...,...,...
2054,891327558926688256,"[0, 138]",8555,38021
2055,891689557279858688,"[0, 79]",7926,39825
2056,891815181378084864,"[0, 121]",3808,23699
2057,892177421306343426,"[0, 138]",5752,31449


In [21]:
df_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2059 entries, 0 to 2058
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  2059 non-null   object
 1   display_text_range  2059 non-null   object
 2   retweet_count       2059 non-null   object
 3   favorite_count      2059 non-null   object
dtypes: object(4)
memory usage: 64.5+ KB


### 3.2 Data Cleaning

In [22]:
twitter_archive_clean = twitter_archive.copy()
img_pred_clean = img_pred.copy()
df_api_clean = df_api.copy()

**Issue:** table `twitter_archive_clean`: more than 90% of the values are NaN in the `in_reply_to_*` and `retweeted_status_*` columns.

**Define**: since this information is not needed for the later analysis, these columns are dropped.

In [23]:
labels = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp']
twitter_archive_clean = twitter_archive_clean.drop(columns=labels)

In [24]:
# test
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int64 
 6   rating_denominator  2356 non-null   int64 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
dtypes: int64(3), object(9)
memory usage: 221.0+ KB


**Issue:** table `twitter_archive_clean`: redundant information (HTML tags) in the `source` column

**Define:** remove the HTML tags using a regular expression

In [25]:
twitter_archive_clean = twitter_archive_clean.replace({'source': r'<[^>]*>'}, {'source': ''}, regex=True)
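
To see what the pattern `<[^>]*>` does in isolation, here is a small sketch that applies it to one of the `source` values shown earlier, using the standard `re` module instead of pandas.

```python
import re

# An opening '<', any run of characters that are not '>', then a closing '>'.
TAG_PATTERN = re.compile(r'<[^>]*>')

raw_source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
clean_source = TAG_PATTERN.sub('', raw_source)
print(clean_source)  # prints: Twitter for iPhone
```

Both the opening `<a ...>` tag and the closing `</a>` tag match the pattern, so only the link text remains.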

In [26]:
# test
twitter_archive_clean.source.value_counts()

Twitter for iPhone     2221
Vine - Make a Scene    91  
Twitter Web Client     33  
TweetDeck              11  
Name: source, dtype: int64

**Issue:** table `twitter_archive_clean`: the `text` column contains multiple variables (text, rating, url)

**Define:** extract the url and the rating into new columns `url` and `rate` using regular expressions, and store the remaining text in a new column `Full_text`

In [27]:
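twitter_archive_clean['url'] = twitter_archive_clean.text.str.extract(r'(http.*)', expand=False)
twitter_archive_clean['rate'] = twitter_archive_clean.text.str.extract(r'(\d+/\d+)', expand=False)

In [28]:
# regex=True is spelled out because pandas 2.0 and later treat the pattern as a literal string by default
twitter_archive_clean['Full_text'] = twitter_archive_clean.text.str.replace(r'http.*', '', regex=True).str.replace(r'\d+/\d+', '', regex=True)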

In [29]:
# test I
twitter_archive_clean[['text','url','rate','Full_text']].sample(10)

Unnamed: 0,text,url,rate,Full_text
1188,"This pic is old but I hadn't seen it until today and had to share. Creative af. 13/10 very good boy, would pet well https://t.co/4kD16wMA1Z",https://t.co/4kD16wMA1Z,13/10,"This pic is old but I hadn't seen it until today and had to share. Creative af. very good boy, would pet well"
2327,This is a southern Vesuvius bumblegruff. Can drive a truck (wow). Made friends with 5 other nifty dogs (neat). 7/10 https://t.co/LopTBkKa8h,https://t.co/LopTBkKa8h,7/10,This is a southern Vesuvius bumblegruff. Can drive a truck (wow). Made friends with 5 other nifty dogs (neat).
958,Here's a pupper that's very hungry but too lazy to get up and eat. 12/10 (vid by @RealDavidCortes) https://t.co/lsVAMBq6ex,https://t.co/lsVAMBq6ex,12/10,Here's a pupper that's very hungry but too lazy to get up and eat. (vid by @RealDavidCortes)
1473,What kind of person sends in a pic without a dog in it? So churlish. Neat rug tho 7/10 https://t.co/LSTAwTdTaw,https://t.co/LSTAwTdTaw,7/10,What kind of person sends in a pic without a dog in it? So churlish. Neat rug tho
1613,I would like everyone to appreciate this pup's face as much as I do. 11/10 https://t.co/QIe7oxkSNo,https://t.co/QIe7oxkSNo,11/10,I would like everyone to appreciate this pup's face as much as I do.
619,This is Ruby. She just turned on the news. Officially terrified. 11/10 deep breaths Ruby https://t.co/y5KarNXWXt,https://t.co/y5KarNXWXt,11/10,This is Ruby. She just turned on the news. Officially terrified. deep breaths Ruby
581,RT @dog_rates: This is Sampson. He's about to get hit with a vicious draw 2. Has no idea. 11/10 poor pupper https://t.co/FYT9QBEnKG,https://t.co/FYT9QBEnKG,11/10,RT @dog_rates: This is Sampson. He's about to get hit with a vicious draw 2. Has no idea. poor pupper
879,This is Theo. He can walk on water. Still coming to terms with it. 12/10 magical af https://t.co/8Kmuj6SFbC,https://t.co/8Kmuj6SFbC,12/10,This is Theo. He can walk on water. Still coming to terms with it. magical af
160,RT @tallylott: h*ckin adorable promposal. 13/10 @dog_rates https://t.co/6n8hzNihJ9,https://t.co/6n8hzNihJ9,13/10,RT @tallylott: h*ckin adorable promposal. @dog_rates
1352,"""YOU CAN'T HANDLE THE TRUTH"" both 10/10 https://t.co/ZvxdB4i9AG",https://t.co/ZvxdB4i9AG,10/10,"""YOU CAN'T HANDLE THE TRUTH"" both"


In [30]:
# test II
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int64 
 6   rating_denominator  2356 non-null   int64 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
 12  url                 2286 non-null   object
 13  rate                2356 non-null   object
 14  Full_text           2356 non-null   object
dtypes: int64(3), object(12)
memory usage: 276.2+ KB


**Issue:** table `twitter_archive_clean` has incorrect ratings in `rating_numerator` and `rating_denominator`

**Define:** replace these two columns by splitting the values in `rate`.

In [31]:
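# expand=True returns one column per part, which avoids the deprecated unpacking of the .str accessor
rating_parts = twitter_archive_clean.rate.str.split('/', expand=True)
twitter_archive_clean['rating_numerator'] = rating_parts[0]
twitter_archive_clean['rating_denominator'] = rating_parts[1]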


In [32]:
twitter_archive_clean['rating_numerator'] = twitter_archive_clean['rating_numerator'].astype(int)
twitter_archive_clean['rating_denominator'] = twitter_archive_clean['rating_denominator'].astype(int)
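
One caveat of an integer-only rating pattern is that a decimal numerator would be cut at the decimal point. The sketch below shows a pattern that allows an optional decimal part. The tweet texts are hypothetical and written for this example, and the named groups `numerator` and `denominator` are illustrative choices.

```python
import pandas as pd

# Hypothetical tweet texts, written for this example.
texts = pd.Series([
    'This is Rex. 13/10 would pet',
    'Meet Bo. Easily a 9.75/10 good boy',
    'No rating in this one',
])

# Optional decimal part in the numerator; integer denominator.
rating_pattern = r'(?P<numerator>\d+(?:\.\d+)?)/(?P<denominator>\d+)'
ratings = texts.str.extract(rating_pattern).astype(float)

# For comparison: the integer-only pattern keeps just the digits after the decimal point.
integer_only = texts.str.extract(r'(\d+/\d+)', expand=False)

print(ratings)
print(integer_only.tolist())
```

`str.extract` returns only the first match per text, so a text with several fractions would still need a manual check. Texts without a rating yield NaN, which is why the sketch casts to `float` rather than `int`.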

In [33]:
# test
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int32 
 6   rating_denominator  2356 non-null   int32 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
 12  url                 2286 non-null   object
 13  rate                2356 non-null   object
 14  Full_text           2356 non-null   object
dtypes: int32(2), int64(1), object(12)
memory usage: 257.8+ KB


In [34]:
# test
twitter_archive_clean[['rating_numerator','rating_denominator']].describe()

Unnamed: 0,rating_numerator,rating_denominator
count,2356.0,2356.0
mean,13.126486,10.455433
std,45.876648,6.745237
min,0.0,0.0
25%,10.0,10.0
50%,11.0,10.0
75%,12.0,10.0
max,1776.0,170.0


**Issue:** table `twitter_archive_clean`, `timestamp` column has the incorrect data type.

**Define:** convert `timestamp` to datetime datatype

In [35]:
twitter_archive_clean.timestamp = pd.to_datetime(twitter_archive_clean.timestamp)

In [36]:
# test
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2356 non-null   int64              
 1   timestamp           2356 non-null   datetime64[ns, UTC]
 2   source              2356 non-null   object             
 3   text                2356 non-null   object             
 4   expanded_urls       2297 non-null   object             
 5   rating_numerator    2356 non-null   int32              
 6   rating_denominator  2356 non-null   int32              
 7   name                2356 non-null   object             
 8   doggo               2356 non-null   object             
 9   floofer             2356 non-null   object             
 10  pupper              2356 non-null   object             
 11  puppo               2356 non-null   object             
 12  url                 2286 non-null 

**Issue:** table `twitter_archive_clean`: columns `['doggo','floofer','pupper','puppo']` contain incorrect values.

**Define:** use `str.contains` to find the entries whose `Full_text` contains each key word, matched case-insensitively. Then replace the values in these columns with booleans (`True` and `False`).

In [37]:
# case=False matches the key word in any mix of upper and lower case
for stage in ['doggo', 'floofer', 'pupper', 'puppo']:
    twitter_archive_clean[stage] = twitter_archive_clean.Full_text.str.contains(stage, case=False, regex=True)

In [38]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2356 non-null   int64              
 1   timestamp           2356 non-null   datetime64[ns, UTC]
 2   source              2356 non-null   object             
 3   text                2356 non-null   object             
 4   expanded_urls       2297 non-null   object             
 5   rating_numerator    2356 non-null   int32              
 6   rating_denominator  2356 non-null   int32              
 7   name                2356 non-null   object             
 8   doggo               2356 non-null   bool               
 9   floofer             2356 non-null   bool               
 10  pupper              2356 non-null   bool               
 11  puppo               2356 non-null   bool               
 12  url                 2286 non-null 

In [39]:
# test
twitter_archive_clean[twitter_archive_clean.doggo == True].Full_text.sample(10)

746     Here's a doggo questioning his entire existence.  someone tell him he's a good boy                           
1117    This is Kyle (pronounced 'Mitch'). He strives to be the best doggo he can be.  would pat on head approvingly 
1141    Here's a doggo struggling to cope with the winds.                                                            
731     This is Combo. The daily struggles of being a doggo have finally caught up with him.                         
318     Here's a doggo fully pupared for a shower. H*ckin exquisite balance. Sneaky tongue slip too.                 
448     This is Sunny. She was also a very good First Doggo.  would also be an absolute honor to pet                 
807     Doggo will persevere. \n                                                                                     
440     Here we have a doggo who has messed up. He was hoping you wouldn't notice.  someone help him                 
624     Elder doggo does a splash. Both  incredible stuf

In [40]:
twitter_archive_clean[twitter_archive_clean.floofer == True].Full_text.sample(10)

1614    Say hello to Petrick. He's an Altostratus Floofer. Just had a run in with a trash bag. Groovy checkered floor.                          
200     At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two.  only send dogs 
46      Meet Grizzwald. He may be the floofiest floofer I ever did see. Lost eyes saving a schoolbus from a volcano erpuption.  heroic as h*ck  
1022    Here's a golden floofer helping with the groceries. Bed got in way. Still  helpful af (vid by @categoen)                                
984     This is Blu. He's a wild bush Floofer. I wish anything made me as happy as bushes make Blu.  would frolic with                          
1110    This is Moose. He's a Polynesian Floofer. Dapper af.  would pet diligently                                                              
774     Atlas rolled around in some chalk and now he's a magical rainbow floofer.  please never take a bath                       

In [41]:
twitter_archive_clean[twitter_archive_clean.pupper == True].Full_text.sample(10)

1797    This is the happiest pupper I've ever seen.  would trade lives with                                                          
962     Meet Milo. He hauled ass until he ran out of treadmill and then passed out from exhaustion.  sleep tight pupper              
1625    This little fella really hates stairs. Prefers bush.  legendary pupper                                                       
1401    I know this is a tad late but here's a wonderful Valentine's Day pupper                                                      
1928    Herd of wild dogs here. Not sure what they're trying to do. No real goals in life.  find your purpose puppers                
993     This is one of the most reckless puppers I've ever seen. How she got a license in the first place is beyond me.              
1723    This pupper is not coming inside until she catches a snowflake on her tongue.  the determination is palpable                 
1720    Say hello to Kawhi. He was doing fine until his hat fe

In [42]:
twitter_archive_clean[twitter_archive_clean.puppo == True].Full_text.sample(10)

439     This is Oliver. He has dreams of being a service puppo so he can help his owner.  selfless af\n\nmake it happen:\n                      
228     Jerry just apuppologized to me. He said there was no ill-intent to the slippage. I overreacted I admit. Pupgraded to an  would pet      
554     This is Diogi. He fell in the pool as soon as he was brought home. Clumsy puppo.  would pet until dry                                   
85      Meet Venti, a seemingly caffeinated puppoccino. She was just informed the weekend would include walks, pats and scritches.  much excite 
94      This is Sebastian. He can't see all the colors of the rainbow, but he can see that this flag makes his human happy.  #PrideMonth puppo  
1048    This is Kilo. He cannot reach the snackum. Nifty tongue, but not nifty enough.  maybe one day puppo                                     
274     @0_kelvin_0 &gt; is reserved for puppos sorry Kevin                                                                       
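
The `puppo` sample above includes texts such as "apuppologized" and "puppoccino", where the key word only appears inside a longer word. A minimal sketch of a stricter match with word boundaries follows; the texts are shortened, hypothetical variants of those rows, and accepting the plural form is an assumption about what should count as a stage.

```python
import pandas as pd

# Hypothetical texts; the second and third only contain 'puppo' inside a longer word.
texts = pd.Series([
    'Clumsy puppo. would pet until dry',
    'Jerry just apuppologized to me',
    'a seemingly caffeinated puppoccino',
    'reserved for PUPPOS sorry',
])

# Plain substring match: every text counts as a puppo.
substring_match = texts.str.contains('puppo', case=False)

# \b marks a word boundary; the optional 's' still accepts the plural.
whole_word_match = texts.str.contains(r'\bpuppos?\b', case=False, regex=True)

print(substring_match.tolist())
print(whole_word_match.tolist())
```

The same idea carries over to the other three stages by swapping the key word in the pattern.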

**Issue:** table `twitter_archive_clean`: the `name` column contains incorrect names ('a' or 'None')

**Define:** use `str.extract` to find the correct names in `Full_text` and replace the incorrect names ('a' and 'None').

In [43]:
# get all correct names
# select the entries whose name is 'a' or 'None'
nan_name = twitter_archive_clean.query('name == "a" or name == "None"')
# pattern 1: named xxx, pattern 2: name is xxx
patterns = [r'(named\s(\w+))', r'(name\sis\s(\w+))']
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
corrected_names = pd.concat(
    [nan_name.Full_text.str.extract(pat)[1].dropna().to_frame() for pat in patterns]
)

In [44]:
# reset the indices and rename the columns in a more descriptive way.
corrected_names = corrected_names.reset_index()
corrected_names = corrected_names.rename(columns = {'index':'ind',1:'cor_name'})

In [45]:
twitter_archive_clean.name[603]

'None'

In [46]:
# replace the incorrect names with the correct names.
# .loc assigns by row label on the original frame, which avoids the
# SettingWithCopyWarning raised by chained indexing (name[...] = ...).
for ind, cor_name in zip(corrected_names.ind, corrected_names.cor_name):
    twitter_archive_clean.loc[ind, 'name'] = cor_name


In [47]:
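# test: inspect the names that are still "a" or "None"
# note: `and` binds tighter than `or`, so this keeps every "a" row plus
# the "None" rows whose text contains "name" or "call"
a = twitter_archive_clean.query('name == "a" or name == "None" and Full_text.str.contains("name|call")',engine='python')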

a[['name','Full_text']].sample(10)

Unnamed: 0,name,Full_text
1193,a,People please. This is a Deadly Mediterranean Plop T-Rex. We only rate dogs. Only send in dogs. Thanks you...
1361,a,This is a Butternut Cumberfloof. It's not windy they just look like that. back at it again with the red socks
1017,a,This is a carrot. We only rate dogs. Please only send in dogs. You all really should know this by now ...
2354,a,This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. would walk the shit out of
2327,a,This is a southern Vesuvius bumblegruff. Can drive a truck (wow). Made friends with 5 other nifty dogs (neat).
1002,a,This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys.
1877,a,C'mon guys. We've been over this. We only rate dogs. This is a cow. Please only submit dogs. Thank you......
600,,"RT @dog_rates: I shall call him squishy and he shall be mine, and he shall be my squishy."
1596,,When bae calls your name from across the room. (vid by @christinemcc98)
1854,a,Seriously guys?! Only send in dogs. I only rate dogs. This is a baby black bear...
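
The extract-and-replace steps above can be tried on a small frame first. The sketch below is a minimal illustration, not the notebook's exact code: the rows and the dog name in `toy` are invented, and the two patterns are combined into a single regular expression with one capture group per alternative.

```python
import pandas as pd

# Toy stand-in for twitter_archive_clean; all rows are invented for illustration.
toy = pd.DataFrame({
    'name': ['a', 'None', 'a', 'Bella'],
    'Full_text': [
        'This is a pupper named Wylie. Very fluffy.',
        'His name is Max and he loves snow.',
        'This is a carrot. We only rate dogs.',
        'This is Bella. She is a good girl.',
    ],
})

# Rows whose name still needs fixing.
needs_fix = toy['name'].isin(['a', 'None'])

# One capture group per alternative, so extract returns two columns (0 and 1).
found = toy.loc[needs_fix, 'Full_text'].str.extract(r'named\s(\w+)|name\sis\s(\w+)')

# Take whichever alternative matched, then drop rows where neither did.
new_names = found[0].fillna(found[1]).dropna()

# Label-based assignment on the original frame avoids chained indexing.
toy.loc[new_names.index, 'name'] = new_names
print(toy['name'].tolist())
```

Rows without a recognisable name, such as the carrot tweet, keep their placeholder, which matches what the sample above shows for the real data.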


**Issue:** table `twitter_archive_clean` contains retweeted tweets (without image)

**Define:** merge table `img_pred_clean` into `twitter_archive_clean` on `tweet_id`. The default inner join keeps only the tweets that have an image prediction.


In [48]:
twitter_archive_clean = twitter_archive_clean.merge(img_pred_clean,left_on='tweet_id',right_on='tweet_id')
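
Before relying on an inner merge to filter rows, it can help to see which rows it would discard. The following sketch assumes two tiny frames with invented tweet IDs (`archive` and `preds` are hypothetical names) and uses the `indicator` option of `merge` to audit the join.

```python
import pandas as pd

# Invented tweet IDs; only 1 and 3 have an image prediction.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['Phineas', 'None', 'Archie']})
preds = pd.DataFrame({'tweet_id': [1, 3, 4], 'p1': ['bagel', 'malamute', 'pug']})

# how='outer' with indicator=True adds a _merge column saying where each row came from.
audit = archive.merge(preds, on='tweet_id', how='outer', indicator=True)
dropped_by_inner = audit.loc[audit['_merge'] == 'left_only', 'tweet_id'].tolist()

# The default how='inner' keeps only IDs present in both tables.
kept = archive.merge(preds, on='tweet_id')
print(dropped_by_inner, kept['tweet_id'].tolist())
```

The `left_only` rows are the archive tweets that the inner merge removes.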

In [49]:
# test
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2074
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2075 non-null   int64              
 1   timestamp           2075 non-null   datetime64[ns, UTC]
 2   source              2075 non-null   object             
 3   text                2075 non-null   object             
 4   expanded_urls       2075 non-null   object             
 5   rating_numerator    2075 non-null   int32              
 6   rating_denominator  2075 non-null   int32              
 7   name                2075 non-null   object             
 8   doggo               2075 non-null   bool               
 9   floofer             2075 non-null   bool               
 10  pupper              2075 non-null   bool               
 11  puppo               2075 non-null   bool               
 12  url                 2075 non-null 

**Issue:** table `df_api`, column `display_text_range` stores a list (`[start, end]`), so each value includes the brackets and the starting point of the range.

**Define:** extract the length of the text (the end of the range) from the list objects.

In [50]:
# take the second element (the end of the range) of each list as the text length
df_api_clean['text_len'] = df_api_clean.display_text_range.apply(lambda rng: rng[1])

In [51]:
df_api_clean.head()

Unnamed: 0,id,display_text_range,retweet_count,favorite_count,text_len
0,666020888022790149,"[0, 131]",466,2434,131
1,666029285002620928,"[0, 139]",42,121,139
2,666033412701032449,"[0, 130]",41,113,130
3,666044226329800704,"[0, 137]",133,274,137
4,666049248165822465,"[0, 120]",41,99,120
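
As an alternative way to read the second element of each list, the `.str` accessor also indexes into list values. This is a small sketch under the assumption that every cell holds a two-element list; the ranges are copied from the output above, while the `id` values are placeholders.

```python
import pandas as pd

# Ranges copied from the head() output above; the ids are placeholders.
toy_api = pd.DataFrame({
    'id': [1, 2, 3],
    'display_text_range': [[0, 131], [0, 139], [0, 130]],
})

# .str[1] indexes into each list element-wise and keeps the original index.
toy_api['text_len'] = toy_api['display_text_range'].str[1].astype('int64')

# Sanity check: every range starts at 0, so the end value equals the length.
starts_at_zero = bool((toy_api['display_text_range'].str[0] == 0).all())
print(toy_api['text_len'].tolist(), starts_at_zero)
```

If some ranges did not start at 0, the length would be the difference between the two elements rather than the end value alone.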


In [52]:
df_api_clean = df_api_clean.rename(columns={'id':'tweet_id'})

In [53]:
df_api_clean = df_api_clean.drop('display_text_range',axis = 1)

In [54]:
df_api_clean.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count,text_len
0,666020888022790149,466,2434,131
1,666029285002620928,42,121,139
2,666033412701032449,41,113,130
3,666044226329800704,133,274,137
4,666049248165822465,41,99,120


**Issue:** table `df_api_clean`, columns `tweet_id`, `retweet_count` and `favorite_count` have erroneous datatypes.

**Define:** convert the string values to int64 using `astype()`


In [55]:
df_api_clean.retweet_count = df_api_clean.retweet_count.astype('int64')
df_api_clean.favorite_count = df_api_clean.favorite_count.astype('int64')
df_api_clean.tweet_id = df_api_clean.tweet_id.astype('int64')
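
One reason to convert `tweet_id` straight to `int64` is that tweet IDs are larger than 2**53, so any detour through `float64` rounds them. The sketch below illustrates this with two IDs that appear in this notebook's outputs; the variable names are hypothetical.

```python
import pandas as pd

# IDs as strings; both values appear in outputs elsewhere in this notebook.
ids = pd.Series(['892420643555336193', '666020888022790149'])

# Direct conversion keeps every digit.
as_int = ids.astype('int64')

# Going through float64 first rounds the value, because float64 has a 53-bit mantissa.
via_float = ids.astype('float64').astype('int64')

print((as_int == via_float).tolist())
```

A mismatch here would silently break the later merge on `tweet_id`, which is why the key column should never be stored as a float.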

In [56]:
# test
df_api_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2059 entries, 0 to 2058
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2059 non-null   int64
 1   retweet_count   2059 non-null   int64
 2   favorite_count  2059 non-null   int64
 3   text_len        2059 non-null   int64
dtypes: int64(4)
memory usage: 64.5 KB


Merge the `df_api_clean` table into the `twitter_archive_clean` table.

In [57]:
# merge df_api_clean and twitter_archive_clean to twitter_clean
twitter_clean = twitter_archive_clean.copy()

In [58]:
twitter_clean = twitter_clean.merge(df_api_clean,left_on='tweet_id',
    right_on='tweet_id')

In [59]:
twitter_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2059 entries, 0 to 2058
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2059 non-null   int64              
 1   timestamp           2059 non-null   datetime64[ns, UTC]
 2   source              2059 non-null   object             
 3   text                2059 non-null   object             
 4   expanded_urls       2059 non-null   object             
 5   rating_numerator    2059 non-null   int32              
 6   rating_denominator  2059 non-null   int32              
 7   name                2059 non-null   object             
 8   doggo               2059 non-null   bool               
 9   floofer             2059 non-null   bool               
 10  pupper              2059 non-null   bool               
 11  puppo               2059 non-null   bool               
 12  url                 2059 non-null 

**Issue:** table `twitter_archive_clean` contains retweeted tweets (text starting with RT, and identical retweet counts)

**Define:** remove the rows whose `Full_text` matches the pattern `RT @xxxx`

In [60]:
# raw string for the regex; na=False treats missing text as "not a retweet"
RT = twitter_clean.Full_text.str.contains(r'RT\s@\w+', na=False)
twitter_clean.loc[RT, 'Full_text']

32      RT @dog_rates: This is Lilly. She just parallel barked. Kindly requests a reward now.  would pet so well 
67      RT @rachel2195: @dog_rates the boyfriend and his soaking wet pupper h*cking love his new hat             
107     RT @rachaeleasler: these @dog_rates hats are  bean approved                                              
130     RT @tallylott: h*ckin adorable promposal.  @dog_rates                                                    
167     RT @eddie_coe98: Thanks @dog_rates completed my laptop.  would buy again                                 
                                          ...                                                                    
745     RT @dog_rates: This is Rubio. He has too much skin.                                                      
762     RT @dog_rates: Everyone needs to watch this.                                                             
1022    RT @twitter: @dog_rates Awesome Tweet! . Would Retweet. #LoveTwitter            

In [61]:
RT_ind = twitter_clean[RT].index

In [62]:
# reset_index() keeps the old row labels in a new `index` column
twitter_clean = twitter_clean.drop(RT_ind,axis=0).reset_index()

In [63]:
# test
sum(twitter_clean.Full_text.str.contains(r'RT\s@\w+', na=False))

0
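
The retweet filter can be checked on a few rows in isolation. This sketch is illustrative only: the first text is taken from the output above, the second is invented, and the third row stands in for a missing value. It uses a boolean mask instead of dropping by index, which is an equivalent variation rather than the notebook's exact approach.

```python
import pandas as pd

toy = pd.DataFrame({'Full_text': [
    'RT @dog_rates: This is Rubio. He has too much skin.',
    'This is Tilly. She is a good girl.',  # invented text
    None,                                  # missing text
]})

# na=False makes rows with missing text count as non-retweets instead of NaN.
is_rt = toy['Full_text'].str.contains(r'RT\s@\w+', na=False)

# Keep the non-retweets; drop=True discards the old index instead of storing it.
originals = toy[~is_rt].reset_index(drop=True)
print(is_rt.tolist(), len(originals))
```

With `drop=True` no extra `index` column is created, in contrast to a plain `reset_index()`.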

### 3.3 Export clean data 

In [64]:
# Drop columns that we are not interested in.
drop_columns = ['text','expanded_urls','url','rate','jpg_url']
twitter_clean.drop(drop_columns,axis=1,inplace=True)

In [65]:
# test
twitter_clean.head()

Unnamed: 0,index,tweet_id,timestamp,source,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favorite_count,text_len
0,0,892420643555336193,2017-08-01 16:23:56+00:00,Twitter for iPhone,13,10,Phineas,False,False,False,...,False,bagel,0.085851,False,banana,0.07611,False,7759,36489,85
1,1,892177421306343426,2017-08-01 00:17:27+00:00,Twitter for iPhone,13,10,Tilly,False,False,False,...,True,Pekinese,0.090647,True,papillon,0.068957,True,5752,31449,138
2,2,891815181378084864,2017-07-31 00:18:03+00:00,Twitter for iPhone,12,10,Archie,False,False,False,...,True,malamute,0.078253,True,kelpie,0.031379,True,3808,23699,121
3,3,891689557279858688,2017-07-30 15:58:51+00:00,Twitter for iPhone,13,10,Darla,False,False,False,...,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,7926,39825,79
4,4,891327558926688256,2017-07-29 16:00:24+00:00,Twitter for iPhone,12,10,Franklin,False,False,False,...,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,8555,38021,138


In [66]:
# create a csv file for clean data.
twitter_clean.to_csv('twitter_archive_master.csv', sep=',', encoding='utf-8',index=False);

In [67]:
# create a sqlite database for clean data.
engine = create_engine('sqlite:///twitter_archive_master.db', echo=False)
twitter_clean.to_sql('twitter_archive_master', con=engine,if_exists='replace')
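
To confirm that an export like this can be read back unchanged, a round trip is a quick check. The sketch below is an assumption-laden stand-in: it uses the standard library `sqlite3` module with an in-memory database instead of the SQLAlchemy engine and file used above, and a two-row frame whose values are copied from the `head()` output.

```python
import sqlite3
import pandas as pd

# Two rows with values copied from the head() output above.
toy = pd.DataFrame({'tweet_id': [892420643555336193, 892177421306343426],
                    'retweet_count': [7759, 5752]})

# In-memory database, so nothing is written to disk.
con = sqlite3.connect(':memory:')
toy.to_sql('twitter_archive_master', con=con, if_exists='replace', index=False)

# Read the table back and compare it with what was written.
restored = pd.read_sql('SELECT * FROM twitter_archive_master', con)
con.close()
print(restored.equals(toy))
```

Passing `index=False` keeps the row labels out of the table; without it, `to_sql` writes them as an additional column.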

## 4 Exploratory Data Analysis
In this section, we use the clean data to answer the following questions:
* Which tweet has the most retweets and/or the most likes?
* Which dog breed has the most tweets based on the predictions, and how does the breed trend change over time?
* What is the trend of average content length?
* Is there a relationship between retweet count and favorite count?
* What are the key metrics for retweet count?

In [68]:
df = pd.read_csv('twitter_archive_master.csv')
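
A CSV file stores timestamps as plain text, so `read_csv` returns them as strings unless told otherwise. As a sketch of one option, `parse_dates` converts the column while reading; the CSV text below is a hypothetical three-column excerpt with values copied from the `head()` output.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same layout as the exported file.
csv_text = (
    'tweet_id,timestamp,retweet_count\n'
    '892420643555336193,2017-08-01 16:23:56+00:00,7759\n'
    '892177421306343426,2017-08-01 00:17:27+00:00,5752\n'
)

# parse_dates converts the named column to datetimes during the read.
df_toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['timestamp'])
print(df_toy['timestamp'].dt.hour.tolist())
```

Calling `pd.to_datetime` on the column after loading gives the same result and is equally valid.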

In [69]:
df.head()

Unnamed: 0,index,tweet_id,timestamp,source,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favorite_count,text_len
0,0,892420643555336193,2017-08-01 16:23:56+00:00,Twitter for iPhone,13,10,Phineas,False,False,False,...,False,bagel,0.085851,False,banana,0.07611,False,7759,36489,85
1,1,892177421306343426,2017-08-01 00:17:27+00:00,Twitter for iPhone,13,10,Tilly,False,False,False,...,True,Pekinese,0.090647,True,papillon,0.068957,True,5752,31449,138
2,2,891815181378084864,2017-07-31 00:18:03+00:00,Twitter for iPhone,12,10,Archie,False,False,False,...,True,malamute,0.078253,True,kelpie,0.031379,True,3808,23699,121
3,3,891689557279858688,2017-07-30 15:58:51+00:00,Twitter for iPhone,13,10,Darla,False,False,False,...,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,7926,39825,79
4,4,891327558926688256,2017-07-29 16:00:24+00:00,Twitter for iPhone,12,10,Franklin,False,False,False,...,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,8555,38021,138


In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987 entries, 0 to 1986
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   index               1987 non-null   int64  
 1   tweet_id            1987 non-null   int64  
 2   timestamp           1987 non-null   object 
 3   source              1987 non-null   object 
 4   rating_numerator    1987 non-null   int64  
 5   rating_denominator  1987 non-null   int64  
 6   name                1987 non-null   object 
 7   doggo               1987 non-null   bool   
 8   floofer             1987 non-null   bool   
 9   pupper              1987 non-null   bool   
 10  puppo               1987 non-null   bool   
 11  Full_text           1987 non-null   object 
 12  img_num             1987 non-null   int64  
 13  p1                  1987 non-null   object 
 14  p1_conf             1987 non-null   float64
 15  p1_dog              1987 non-null   bool   
 16  p2    

In [71]:
df_clean = df.copy()

### 4.1 Data clean

Based on the questions mentioned above, a copy of table `df` (`df_clean`) is modified by the following steps:
* **Drop columns related to prediction 2 and prediction 3**: Prediction 1 has the highest confidence, so it is used as the final result.
* **Create new columns `Month`, `DayofWeek`, `PartsOfDay`**: month, day of week and hour of day are extracted and stored individually to study different trends over time.

#### 4.1.1 Drop columns related to prediction 2 and prediction 3

In [72]:
drop_cols= ['p2','p2_conf','p2_dog','p3','p3_conf','p3_dog']
df_clean.drop(drop_cols,axis=1,inplace = True)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987 entries, 0 to 1986
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   index               1987 non-null   int64  
 1   tweet_id            1987 non-null   int64  
 2   timestamp           1987 non-null   object 
 3   source              1987 non-null   object 
 4   rating_numerator    1987 non-null   int64  
 5   rating_denominator  1987 non-null   int64  
 6   name                1987 non-null   object 
 7   doggo               1987 non-null   bool   
 8   floofer             1987 non-null   bool   
 9   pupper              1987 non-null   bool   
 10  puppo               1987 non-null   bool   
 11  Full_text           1987 non-null   object 
 12  img_num             1987 non-null   int64  
 13  p1                  1987 non-null   object 
 14  p1_conf             1987 non-null   float64
 15  p1_dog              1987 non-null   bool   
 16  retwee

#### 4.1.2 Create new columns `Month`, `DayofWeek`, `PartsOfDay`

In [73]:
# convert the timestamp strings read from the CSV back to datetimes
df_clean.timestamp = pd.to_datetime(df_clean.timestamp)

In [74]:
df_clean.timestamp.min(), df_clean.timestamp.max()

(Timestamp('2015-11-15 22:32:08+0000', tz='UTC'),
 Timestamp('2017-08-01 16:23:56+0000', tz='UTC'))

In [75]:
df_clean['Month'] = df_clean.timestamp.dt.month

In [76]:
df_clean['DayofWeek'] = df_clean.timestamp.dt.dayofweek

In [77]:
df_clean['PartsOfDay'] = df_clean.timestamp.dt.hour

In [78]:
df_clean[['Month','DayofWeek','PartsOfDay']].describe()

Unnamed: 0,Month,DayofWeek,PartsOfDay
count,1987.0,1987.0,1987.0
mean,7.162557,2.842476,9.719175
std,4.123624,2.002226,8.618912
min,1.0,0.0,0.0
25%,3.0,1.0,1.0
50%,7.0,3.0,4.0
75%,11.0,5.0,18.0
max,12.0,6.0,23.0
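
The `PartsOfDay` column holds the hour (0 to 23). If labelled parts of the day are wanted, the hours can be binned with `pd.cut`. The sketch below is only one possible scheme: the bin edges and labels are assumptions chosen for illustration, and the hours are in UTC, not the local time of the author or readers.

```python
import pandas as pd

hours = pd.Series([0, 4, 9, 16, 23], name='hour')

# Bin edges and labels are assumptions for illustration; hours are UTC.
bins = [0, 6, 12, 18, 24]
labels = ['night', 'morning', 'afternoon', 'evening']

# right=False makes each bin include its left edge: [0, 6), [6, 12), ...
parts = pd.cut(hours, bins=bins, labels=labels, right=False)
print(parts.astype(str).tolist())
```

The result is a categorical column, which groups and plots in the order given by `labels`.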


### 4.2 Data analysis and Visualization 

#### 4.2.1  Which tweet has the most retweets and/or the most likes?

In [85]:
def top_10(df, column):
    """Return the 10 rows with the largest values in `column`."""
    top = df.sort_values(by=column, ascending=False)[[column, 'Full_text', 'p1']]
    return top.head(10)

In [86]:
top_10(df_clean,'retweet_count')

Unnamed: 0,retweet_count,Full_text,p1
769,78486,Here's a doggo realizing you can stand in a pool. enlightened af (vid by Tina Conrad),Labrador_retriever
804,58202,Here's a doggo blowing bubbles. It's downright legendary. would watch on repeat forever (vid by Kent Duryee),Eskimo_dog
396,57246,This is Stephan. He just wants to help. such a good boy,Chihuahua
305,44250,Here's a super supportive puppo participating in the Toronto #WomensMarch today.,Lakeland_terrier
58,40903,This is Duddles. He did an attempt. someone help him (vid by Georgia Felici),English_springer
329,37744,This is Bo. He was a very good First Doggo. would be an absolute honor to pet,standard_poodle
398,35797,"""Good afternoon class today we're going to learn what makes a good boy so good""",Arabian_camel
107,33221,"This is Jamesy. He gives a kiss to every other pupper he sees on his walk. such passion, much tender",French_bulldog
1417,31633,This made my day. please enjoy,swing
1475,30582,This is Kenneth. He's stuck in a bubble. hang in there Kenneth,bubble


In [87]:
top_10(df_clean,'favorite_count')

Unnamed: 0,favorite_count,Full_text,p1
769,157726,Here's a doggo realizing you can stand in a pool. enlightened af (vid by Tina Conrad),Labrador_retriever
305,134376,Here's a super supportive puppo participating in the Toronto #WomensMarch today.,Lakeland_terrier
396,121747,This is Stephan. He just wants to help. such a good boy,Chihuahua
107,117330,"This is Jamesy. He gives a kiss to every other pupper he sees on his walk. such passion, much tender",French_bulldog
804,116437,Here's a doggo blowing bubbles. It's downright legendary. would watch on repeat forever (vid by Kent Duryee),Eskimo_dog
58,100055,This is Duddles. He did an attempt. someone help him (vid by Georgia Felici),English_springer
329,88869,This is Bo. He was a very good First Doggo. would be an absolute honor to pet,standard_poodle
134,87243,We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you...,Angora
92,79103,This is Zoey. She really likes the planet. Would hate to see willful ignorance and the denial of fairly elemental science destroy it.,golden_retriever
1417,79020,This made my day. please enjoy,swing
