# Data Wrangling for WeRateDogs Twitter archive

## Table of Contents

<ul>
<li><a href="#intro">1 Introduction</a></li>
<li><a href="#wrangling">2 Data Wrangling</a></li>
<li><a href="#eda">3 Exploratory Data Analysis</a></li>
<li><a href="#conclusions">4 Conclusion and limitations</a></li>
<li><a href="#Appendix">5 Appendix</a></li>
</ul>



<a id='intro'></a>
## 1 Introduction
> This notebook covers the data wrangling part of the 'Wrangle and Analyze Data' project. The process has three components: gathering, assessing, and cleaning data. At the very end of this notebook, I will store the cleaned data in .csv files for later analysis and visualization.

In [558]:
import numpy as np
import pandas as pd
import requests
import io
import tweepy
from tweepy import OAuthHandler
import json
import timeit
import config # info of twitter API secrets and keys
import re
import datetime

## 2 Gathering Data
There are three data sources:
* Downloaded manually: `twitter-archive-enhanced.csv`
* Downloaded from Udacity's servers: `image-predictions.tsv`
* Retrieved with Tweepy: `tweet_json.txt`

>`twitter-archive-enhanced.csv`: This file is downloaded manually and stored in the same directory as this notebook.

>`image-predictions.tsv`: This file is downloaded with the `requests` library in section 2.1.

>`tweet_json.txt`: This file is built in section 2.2 by querying the Twitter API with Tweepy for each tweet ID.

### 2.1 Read `twitter-archive-enhanced.csv` from the local file

In [559]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

### 2.1 Extract `image-predictions.tsv` from Udacity's servers

In [560]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
urlData = requests.get(url).content
img_pred = pd.read_csv(io.StringIO(urlData.decode('utf-8')),sep='\t')

In [561]:
img_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
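The download step above parses whatever the server returns, even an error page. The following is a minimal sketch of a stricter variant that raises on HTTP errors and keeps a local copy. The helper names `parse_tsv` and `download_tsv` and the `path` argument are hypothetical and are not part of this notebook.

```python
import io

import pandas as pd
import requests


def parse_tsv(content: bytes) -> pd.DataFrame:
    """Decode raw TSV bytes and load them into a DataFrame."""
    return pd.read_csv(io.StringIO(content.decode('utf-8')), sep='\t')


def download_tsv(url: str, path: str) -> pd.DataFrame:
    """Download a TSV file, keep a local copy, and return it as a DataFrame.

    `path` is a hypothetical local file name chosen by the caller.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    with open(path, 'wb') as f:
        f.write(response.content)
    return parse_tsv(response.content)
```

Calling `download_tsv(url, 'image-predictions.tsv')` with the `url` defined above should produce the same table as `img_pred`, assuming the server still hosts the file.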


### 2.2 Extract data using twitter API

In [562]:
consumer_key = config.consumer_key
consumer_secret = config.consumer_secret
access_token = config.access_token
access_secret = config.access_secret

In [563]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [564]:
# start = timeit.default_timer() # start a timer
# fails_dict = {} # ids that could not be retrieved (e.g. deleted tweets)
# count = 0 # number of ids processed so far
# with open('tweet_json.txt', 'w') as outfile:
#     for twt_id in img_pred['tweet_id']:
#         try:
#             # rate-limit handling is configured on `api`, not on each call
#             tweet = api.get_status(twt_id, tweet_mode='extended')
#             print('{} record success'.format(count), end="\r")
#             json.dump(tweet._json, outfile)
#             outfile.write('\n')
#         except tweepy.TweepError as e:
#             print('Fail', end="\r")
#             fails_dict[twt_id] = e
#         count += 1
# end = timeit.default_timer()

In [565]:
df_api = pd.DataFrame(columns=['id','display_text_range','retweet_count','favorite_count'])
with open('tweet_json.txt') as json_file:
    for line in json_file:
        data_str = json.loads(line)
        data_parse = pd.DataFrame.from_dict(data_str,orient="index")
        data_interested = data_parse[0][['id','display_text_range','retweet_count','favorite_count']]
        df_api = df_api.append(data_interested,ignore_index=True)

In [566]:
df_api.head()

Unnamed: 0,id,display_text_range,retweet_count,favorite_count
0,666020888022790149,"[0, 131]",466,2434
1,666029285002620928,"[0, 139]",42,121
2,666033412701032449,"[0, 130]",41,113
3,666044226329800704,"[0, 137]",133,274
4,666049248165822465,"[0, 120]",41,99
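The cell above grows the table with `DataFrame.append`, which was removed in pandas 2.0. A minimal sketch of an alternative is to collect one dictionary per tweet and build the DataFrame once. The function name `read_tweet_json` and the sample line are illustrative assumptions. Unlike the cell above, this approach yields integer columns for the counts rather than `object` columns.

```python
import io
import json

import pandas as pd

FIELDS = ['id', 'display_text_range', 'retweet_count', 'favorite_count']


def read_tweet_json(lines) -> pd.DataFrame:
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        # keep only the fields of interest; missing fields become None
        rows.append({field: tweet.get(field) for field in FIELDS})
    return pd.DataFrame(rows, columns=FIELDS)


# Toy input standing in for tweet_json.txt (one JSON object per line)
sample = io.StringIO(
    '{"id": 666020888022790149, "display_text_range": [0, 131], '
    '"retweet_count": 466, "favorite_count": 2434, "ignored_field": "x"}\n'
)
df_sample = read_tweet_json(sample)
```

With the real file, the call would be `read_tweet_json(open('tweet_json.txt'))`, ideally inside a `with` block.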


## 3 Data Wrangling
In the previous section, three tables were generated from different sources. In this section, each table is assessed and cleaned. The three dataframes are listed below:

* `twitter_archive`: retrieved from a .csv file
* `img_pred`: retrieved from Udacity server
* `df_api`: retrieved from twitter

### 3.1 Data Assessing

#### 3.1 Data Assessing: `twitter_archive` table
**Quality issues**
* More than 90% of the values are missing in the `in_reply_to_*` and `retweeted_status_*` columns (78 and 181 non-null values out of 2356).
* Redundant HTML markup in the `source` column.
* Missing values are stored as the string 'None' in the columns `['doggo','floofer','pupper','puppo']`.
* Incorrect ratings.
* Incorrect values in `['doggo','floofer','pupper','puppo']`.
* Erroneous datatypes (`timestamp`, `source`, `doggo`, `floofer`, `pupper`, `puppo`).
* Incorrect dog names: some dogs are recorded as 'a' or 'None'. Some of these dogs have a name in the text and some of them do not.
* The table contains retweets (text contains 'RT').

**Tidiness issues**
* The `text` column contains multiple variables: text, rating and url.
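Several of the issues listed above can also be checked programmatically. The sketch below runs on a small toy frame that only borrows column names from `twitter_archive`. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as twitter_archive; values are illustrative only
toy = pd.DataFrame({
    'in_reply_to_status_id': [np.nan, np.nan, np.nan, 1.0],
    'retweeted_status_id': [np.nan, 2.0, np.nan, np.nan],
    'name': ['Phineas', 'a', 'None', 'a'],
    'rating_denominator': [10, 10, 10, 0],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Number of placeholder names ('a' or 'None')
placeholder_names = toy['name'].isin(['a', 'None']).sum()

# Retweets are rows with a non-null retweeted_status_id
n_retweets = toy['retweeted_status_id'].notna().sum()

# Ratings whose denominator is not 10 deserve a manual look
odd_denominators = toy[toy['rating_denominator'] != 10]
```

Replacing `toy` with `twitter_archive` should give the corresponding figures for the real table.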


In [567]:
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq,,,,https://twitter.com/dog_rates/status/666049248165822465/photo/1,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx,,,,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR,,,,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI,,,,https://twitter.com/dog_rates/status/666029285002620928/photo/1,7,10,a,,,,


In [568]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [569]:
twitter_archive.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                        91  
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                     33  
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>    11  
Name: source, dtype: int64

In [570]:
for i in range(0,100):
    print('record index: '+ str(i) + '\n'+ twitter_archive.text[i] + '\nstage: ' + twitter_archive.puppo[i])

record index: 0
This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
stage: None
record index: 1
This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
stage: None
record index: 2
This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
stage: None
record index: 3
This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
stage: None
record index: 4
This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
stage: None
record index: 5
Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fD

In [571]:
twitter_archive['doggo'].value_counts()

None     2259
doggo    97  
Name: doggo, dtype: int64

In [572]:
twitter_archive['name'].value_counts()

None         745
a            55 
Charlie      12 
Cooper       11 
Oliver       11 
             .. 
Kendall      1  
Kathmandu    1  
Flurpson     1  
Divine       1  
light        1  
Name: name, Length: 957, dtype: int64

In [573]:
pd.set_option('display.max_colwidth', None)
twitter_archive[twitter_archive.name == 'a'].text

56      Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF
649     Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq                           
801     Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn               
1002    This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys. 8/10 https://t.co/TGenMeXreW               
1004    Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R                 
1017    This is a carrot. We only rate dogs. Please only send in dogs. You all really should know this by now ...11/10 https://t.co/9e48aPrBm2                     
1049    This is 

####  3.2 Data Assessing: `img_pred` table

In [574]:
img_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [575]:
img_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [576]:
img_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


#### 3.3 Data Assessing: `df_api` table
**Quality issues**
* `display_text_range` is stored as a bracketed list (`[start, end]`). The brackets and the starting point of the range are not needed.
* Erroneous datatypes: `display_text_range`, `retweet_count` and `favorite_count` are stored as `object`.
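A possible cleaning step for these two issues is sketched below on toy rows. The new column names `text_start` and `text_end` are hypothetical, and the choice of `int64` is an assumption rather than something fixed by this notebook.

```python
import pandas as pd

# Toy rows mirroring df_api; astype(object) reproduces the all-object dtypes
toy_api = pd.DataFrame({
    'id': [666020888022790149, 666029285002620928],
    'display_text_range': [[0, 131], [0, 139]],
    'retweet_count': [466, 42],
    'favorite_count': [2434, 121],
}).astype(object)

clean = toy_api.copy()

# Split the [start, end] list into two integer columns (hypothetical names)
clean['text_start'] = clean['display_text_range'].apply(lambda r: r[0]).astype('int64')
clean['text_end'] = clean['display_text_range'].apply(lambda r: r[1]).astype('int64')
clean = clean.drop(columns='display_text_range')

# Convert the id and the counts from object to integers
for col in ['id', 'retweet_count', 'favorite_count']:
    clean[col] = clean[col].astype('int64')
```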


In [577]:
df_api

Unnamed: 0,id,display_text_range,retweet_count,favorite_count
0,666020888022790149,"[0, 131]",466,2434
1,666029285002620928,"[0, 139]",42,121
2,666033412701032449,"[0, 130]",41,113
3,666044226329800704,"[0, 137]",133,274
4,666049248165822465,"[0, 120]",41,99
...,...,...,...,...
2054,891327558926688256,"[0, 138]",8555,38021
2055,891689557279858688,"[0, 79]",7926,39825
2056,891815181378084864,"[0, 121]",3808,23699
2057,892177421306343426,"[0, 138]",5752,31449


In [578]:
df_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2059 entries, 0 to 2058
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  2059 non-null   object
 1   display_text_range  2059 non-null   object
 2   retweet_count       2059 non-null   object
 3   favorite_count      2059 non-null   object
dtypes: object(4)
memory usage: 64.5+ KB


### 4 Data Cleaning

In [579]:
twitter_archive_clean = twitter_archive.copy()
img_pred_clean = img_pred.copy()
df_api_clean = df_api.copy()

**Issue:** in table `twitter_archive_clean`, more than 90% of the values are missing in the `in_reply_to_*` and `retweeted_status_*` columns.

**Define:** since this information is not needed for the later analysis, these columns are dropped.

In [580]:
labels = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id',
          'retweeted_status_user_id', 'retweeted_status_timestamp']
twitter_archive_clean = twitter_archive_clean.drop(columns=labels)

In [581]:
# test
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int64 
 6   rating_denominator  2356 non-null   int64 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
dtypes: int64(3), object(9)
memory usage: 221.0+ KB
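The assessment lists retweets as a quality issue, and the `retweeted_status_*` columns dropped above are the most direct way to identify them. A minimal sketch of filtering the retweet rows before dropping the columns follows, on toy data and under the assumption that a retweet is any row with a non-null `retweeted_status_id`.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'text': ['This is Astrid.', 'RT @dog_rates: This is Astrid.', 'Meet Baloo.'],
    'in_reply_to_status_id': [np.nan, np.nan, np.nan],
    'retweeted_status_id': [np.nan, 1.0, np.nan],
})

# Keep only original tweets (rows without a retweeted status)
originals = toy[toy['retweeted_status_id'].isna()]

# Then drop the reply/retweet columns, which are no longer needed
labels = ['in_reply_to_status_id', 'retweeted_status_id']
originals = originals.drop(columns=labels).reset_index(drop=True)
```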


**Issue:** table `twitter_archive_clean` has redundant HTML markup in the `source` column

**Define:** remove the HTML tags using a regular expression

In [582]:
twitter_archive_clean = twitter_archive_clean.replace({'source': r'<[^>]*>'}, {'source': ''}, regex=True)

In [583]:
# test
twitter_archive_clean.source.value_counts()

Twitter for iPhone     2221
Vine - Make a Scene    91  
Twitter Web Client     33  
TweetDeck              11  
Name: source, dtype: int64

**Issue:** table `twitter_archive_clean`, `text` column contains multiple variables (text, rating, url)

**Define:** extract the url and the rating into new columns `url` and `rate` using regular expressions, then store the remaining text in a new column `Full_text`

In [584]:
twitter_archive_clean['url'] = twitter_archive_clean.text.str.extract(r'(http.*)')
twitter_archive_clean['rate'] = twitter_archive_clean.text.str.extract(r'(\d+/\d+)')

In [585]:
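# regex=True is required explicitly on pandas 2.0+, where str.replace defaults to literal matching
twitter_archive_clean['Full_text'] = (twitter_archive_clean.text
                                      .str.replace(r'http.*', '', regex=True)
                                      .str.replace(r'\d+/\d+', '', regex=True))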

In [586]:
# test I
twitter_archive_clean[['text','url','rate','Full_text']].sample(10)

Unnamed: 0,text,url,rate,Full_text
2068,Me running from commitment. 10/10 https://t.co/ycVJyFFkES,https://t.co/ycVJyFFkES,10/10,Me running from commitment.
120,Meet Stanley. He likes road trips. Will shift for you. One ear more effective than other. 13/10 we don't leave until you buckle pup Stanley https://t.co/vmCu3PFCQq,https://t.co/vmCu3PFCQq,13/10,Meet Stanley. He likes road trips. Will shift for you. One ear more effective than other. we don't leave until you buckle pup Stanley
1420,This is Franklin. He's a yoga master. Trying to get rid of those rolls. Dedicated af. 11/10 keep it up pup https://t.co/S712MJXulD,https://t.co/S712MJXulD,11/10,This is Franklin. He's a yoga master. Trying to get rid of those rolls. Dedicated af. keep it up pup
967,13/10 such a good doggo\n@spaghemily,,13/10,such a good doggo\n@spaghemily
1952,This is Shnuggles. I would kill for Shnuggles. 13/10 https://t.co/GwvpQiQ7oQ,https://t.co/GwvpQiQ7oQ,13/10,This is Shnuggles. I would kill for Shnuggles.
607,This is Cooper. His bow tie was too heavy for the front so he moved it to the side. Balanced af now. 13/10 https://t.co/jG1PAFkB81,https://t.co/jG1PAFkB81,13/10,This is Cooper. His bow tie was too heavy for the front so he moved it to the side. Balanced af now.
442,This is Jazzy. She just found out that sandwich wasn't for her. Shocked and puppalled. 13/10 deep breaths Jazzy https://t.co/52cItP0vIO,https://t.co/52cItP0vIO,13/10,This is Jazzy. She just found out that sandwich wasn't for her. Shocked and puppalled. deep breaths Jazzy
1422,This is Lily. She accidentally dropped all her Kohl's cash overboard. Day officially ruined. 10/10 hang in there pup https://t.co/BJbtCqGwZK,https://t.co/BJbtCqGwZK,10/10,This is Lily. She accidentally dropped all her Kohl's cash overboard. Day officially ruined. hang in there pup
1386,This is Vincent. He's the man your girl is with when she's not with you. 10/10 https://t.co/JQGMP7kzjD,https://t.co/JQGMP7kzjD,10/10,This is Vincent. He's the man your girl is with when she's not with you.
172,I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All 13/10 would put it on the fridge https://t.co/cUeDMlHJbq,https://t.co/cUeDMlHJbq,13/10,I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All would put it on the fridge


In [587]:
# test II
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int64 
 6   rating_denominator  2356 non-null   int64 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
 12  url                 2286 non-null   object
 13  rate                2356 non-null   object
 14  Full_text           2356 non-null   object
dtypes: int64(3), object(12)
memory usage: 276.2+ KB


**Issue:** table `twitter_archive_clean` has incorrect ratings in `rating_numerator` and `rating_denominator`

**Define:** replace these two columns by splitting the values in `rate`.

In [588]:
# expand=True returns one column per part, avoiding the deprecated unpacking of the .str accessor
twitter_archive_clean[['rating_numerator', 'rating_denominator']] = twitter_archive_clean.rate.str.split('/', expand=True)


In [589]:
twitter_archive_clean['rating_numerator'] = twitter_archive_clean['rating_numerator'].astype(int)
twitter_archive_clean['rating_denominator'] = twitter_archive_clean['rating_denominator'].astype(int)

In [590]:
# test
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int32 
 6   rating_denominator  2356 non-null   int32 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
 12  url                 2286 non-null   object
 13  rate                2356 non-null   object
 14  Full_text           2356 non-null   object
dtypes: int32(2), int64(1), object(12)
memory usage: 257.8+ KB


In [591]:
# test
twitter_archive_clean[['rating_numerator','rating_denominator']].describe()

Unnamed: 0,rating_numerator,rating_denominator
count,2356.0,2356.0
mean,13.126486,10.455433
std,45.876648,6.745237
min,0.0,0.0
25%,10.0,10.0
50%,11.0,10.0
75%,12.0,10.0
max,1776.0,170.0
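The summary above shows a minimum denominator of 0 and a maximum numerator of 1776, so some extracted ratings are not genuine "x/10" scores. In addition, the pattern `\d+/\d+` truncates decimal numerators: "13.5/10" yields "5/10". The sketch below captures decimal numerators and flags unusual denominators. The tweet texts are invented for illustration.

```python
import pandas as pd

texts = pd.Series([
    'Good dog. 13/10 would pet',
    'Very fluffy. 13.5/10',
    'Open 24/7 but still 11/10',
])

# Optional decimal part in the numerator; the first match in each text is used
parts = texts.str.extract(r'(\d+(?:\.\d+)?)/(\d+)')
ratings = pd.DataFrame({
    'rating_numerator': parts[0].astype(float),
    'rating_denominator': parts[1].astype(int),
})

# Rows whose denominator is not 10 need a manual check against the text
needs_review = ratings[ratings['rating_denominator'] != 10]
```

The third text shows why a manual check is still needed: the first fraction in a tweet is not always the rating.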


**Issue:** table `twitter_archive_clean`, `timestamp` column has the incorrect data type.

**Define:** convert `timestamp` to datetime datatype

In [592]:
twitter_archive_clean.timestamp = pd.to_datetime(twitter_archive_clean.timestamp)

In [593]:
# test
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2356 non-null   int64              
 1   timestamp           2356 non-null   datetime64[ns, UTC]
 2   source              2356 non-null   object             
 3   text                2356 non-null   object             
 4   expanded_urls       2297 non-null   object             
 5   rating_numerator    2356 non-null   int32              
 6   rating_denominator  2356 non-null   int32              
 7   name                2356 non-null   object             
 8   doggo               2356 non-null   object             
 9   floofer             2356 non-null   object             
 10  pupper              2356 non-null   object             
 11  puppo               2356 non-null   object             
 12  url                 2286 non-null 

**Issue:** table `twitter_archive_clean`, `['doggo','floofer','pupper','puppo']` columns have incorrect values.

**Define:** use `str.contains` to find the entries in the `Full_text` column that contain each keyword, ignoring case. Then replace these columns with the resulting boolean values (`True` and `False`).

In [594]:
# case-insensitive search for each dog stage in the cleaned text
stages = ['doggo', 'floofer', 'pupper', 'puppo']
for stage in stages:
    twitter_archive_clean[stage] = twitter_archive_clean.Full_text.str.contains(stage, case=False)

In [595]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2356 non-null   int64              
 1   timestamp           2356 non-null   datetime64[ns, UTC]
 2   source              2356 non-null   object             
 3   text                2356 non-null   object             
 4   expanded_urls       2297 non-null   object             
 5   rating_numerator    2356 non-null   int32              
 6   rating_denominator  2356 non-null   int32              
 7   name                2356 non-null   object             
 8   doggo               2356 non-null   bool               
 9   floofer             2356 non-null   bool               
 10  pupper              2356 non-null   bool               
 11  puppo               2356 non-null   bool               
 12  url                 2286 non-null 

In [596]:
# test
twitter_archive_clean[twitter_archive_clean.doggo == True].Full_text.sample(10)

489     This is Chubbs. He dug a hole and now he's stuck in it. Dang h*ckin doggo.  would assist                                     
1117    This is Kyle (pronounced 'Mitch'). He strives to be the best doggo he can be.  would pat on head approvingly                 
1051    For anyone who's wondering, this is what happens after a doggo catches it's tail...                                          
822     RT @dog_rates: This is just downright precious af.  for both pupper and doggo                                                
211     RT @dog_rates: This is Astrid. She's a guide doggo in training.  would follow anywhere                                       
363     This is Astrid. She's a guide doggo in training.  would follow anywhere                                                      
300     This is Meera. She just heard about taxes and how much a doghouse in a nice area costs. Not pupared to be a  doggo anymore.  
624     Elder doggo does a splash. Both  incredible stuff     

In [597]:
twitter_archive_clean[twitter_archive_clean.floofer == True].Full_text.sample(10)

1534    Here we are witnessing a rare High Stepping Alaskan Floofer.  dangerously petable (vid by @TheMrsNux)                                   
774     Atlas rolled around in some chalk and now he's a magical rainbow floofer.  please never take a bath                                     
582     This is Doc. He takes time out of every day to worship our plant overlords.  quite the floofer                                          
1091    Just wanted to share this super rare Rainbow Floofer in case you guys haven't seen it yet.  colorful af                                 
1110    This is Moose. He's a Polynesian Floofer. Dapper af.  would pet diligently                                                              
984     This is Blu. He's a wild bush Floofer. I wish anything made me as happy as bushes make Blu.  would frolic with                          
1614    Say hello to Petrick. He's an Altostratus Floofer. Just had a run in with a trash bag. Groovy checkered floor.            

In [598]:
twitter_archive_clean[twitter_archive_clean.pupper == True].Full_text.sample(10)

1660    Here we see a nifty leaping pupper. Feet look deadly. Sad that the holidays are over.  undeniably huggable                    
594     RT @dog_rates: Meet Baloo. He's expecting a fast ground ball, hence the wide stance. Prepared af.  nothing runs like a pupper 
1113    Like father (doggo), like son (pupper). Both                                                                                  
1476    This pupper is afraid of its own feet.  would comfort                                                                         
1063    This is just downright precious af.  for both pupper and doggo                                                                
453     RT @dog_rates: This is Chelsea. She forgot how to dog.  get it together pupper                                                
772     This is Huck. He's addicted to caffeine. Hope it's not too latte to seek help.  stay strong pupper                            
1657    Meet Brandy. She's a member of the Bloods. Mena

In [599]:
twitter_archive_clean[twitter_archive_clean.puppo == True].Full_text.sample(10)

413     Here's a super supportive puppo participating in the Toronto  #WomensMarch today.                                                   
546     RT @dog_rates: This is Reginald. He's one magical puppo. Aerodynamic af.  would catch                                               
438     RT @dog_rates: This is Oliver. He has dreams of being a service puppo so he can help his owner.  selfless af\n\nmake it happen:\n   
1048    This is Kilo. He cannot reach the snackum. Nifty tongue, but not nifty enough.  maybe one day puppo                                 
168     Sorry for the lack of posts today. I came home from school and had to spend quality time with my puppo. Her name is Zoey and she's  
1035    This is Abby. She got her face stuck in a glass. Churlish af.  rookie move puppo                                                    
736     I want to finally rate this iconic puppo who thinks the parade is all for him.  would absolutely attend                             
567     This 

**Issue:** table `twitter_archive_clean`, `name` column contains several incorrect names (assigned 'a')

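**Define:** for the rows whose `name` is 'a' or 'None', search `Full_text` for the patterns "named xxx" and "name is xxx" and collect the captured word as the corrected name.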

In [656]:
# GET ALL CORRECT NAMES
# select the rows whose name is 'a' or 'None'
nan_name = twitter_archive_clean.query('name == "a" or name == "None"')
# pattern 1: named xxx, pattern 2: name is xxx
patterns = [r'(named\s(\w+))', r'(name\sis\s(\w+))']
# column 1 of each extract holds the inner capture group, i.e. the name itself
names = [nan_name.Full_text.str.extract(pat)[1].dropna().to_frame() for pat in patterns]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
corrected_names = pd.concat(names)

In [658]:
corrected_names = corrected_names.reset_index()
test = corrected_names.copy()

In [659]:
# for ind in range(len(test)):
#         real_name = corrected_names['']
#         twitter_archive_clean.name[ind] = real_name
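One way to write extracted names back into the table is to keep the original row index and assign by label with `.loc`. This is a sketch on toy data and not necessarily how the commented-out loop above was meant to be completed. The names and texts are illustrative only.

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    'name': ['a', 'None', 'Charlie', 'a'],
    'Full_text': [
        'This is a dog named Wylie.',
        'Her name is Zoey and she is great.',
        'This is Charlie.',
        'This is a carrot.',
    ],
})

unnamed = toy[toy['name'].isin(['a', 'None'])]
patterns = [r'named\s(\w+)', r'name\sis\s(\w+)']

# One Series of captured names per pattern, indexed like the original rows
found = pd.concat([unnamed['Full_text'].str.extract(p)[0].dropna() for p in patterns])

# If a row matches both patterns, keep only the first captured name
found = found[~found.index.duplicated()]

# Write the captured names back by index label; unmatched rows are left unchanged
toy.loc[found.index, 'name'] = found
```

Rows without a match, such as the last one, keep their placeholder and still need a separate decision (for example, setting the name to missing).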

In [498]:
# parentheses make the intended grouping explicit: `and` binds tighter than `or`
a = twitter_archive_clean.query('(name == "a" or name == "None") and Full_text.str.contains("name|call")', engine='python')
a[a.Full_text.str.contains('call')].Full_text

7       When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy.  
523     I call this one "A Blep by the Sea"                                                                               
600     RT @dog_rates: I shall call him squishy and he shall be mine, and he shall be my squishy.                         
1596    When bae calls your name from across the room.  (vid by @christinemcc98)                                          
1858    I shall call him squishy and he shall be mine, and he shall be my squishy.                                        
2305    My goodness. Very rare dog here. Large. Tail dangerous. Kinda fat. Only eats leaves. Doesn't come when called     
Name: Full_text, dtype: object