## WeRateDogs Data Wrangling project


## Table of Contents

- [Introduction](#intro)
- [Data Wrangling](#wrangling)
    - [The First Dataset: twitter-archive-enhanced](#first)
    - [The Second Dataset: Image Predictions File](#second)
    - [The Third Dataset: Data via the Twitter API](#third)
- [Conclusion](#Conclusion)



<a id='intro'></a>
## Introduction
This project analyzes the tweet archive of WeRateDogs, a Twitter account that rates people's dogs with a humorous comment about each dog. The account has over 4 million followers and has received international media coverage. Additional data, such as retweet count and favorite count, are obtained from Twitter's API. Breed predictions for the dog in each tweet are also provided.

<a id='wrangling'></a>
## Data Wrangling

In [76]:
#import the required libraries
import numpy as np
import pandas as pd
import requests
import os
import tweepy
import json

In [134]:

consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
access_token = 'YOUR ACCESS TOKEN'
access_secret = 'YOUR ACCESS SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

<a id='first'></a>
# The First Dataset (twitter-archive-enhanced)

<a id='wrangling'></a>
## Gathering

In [77]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [78]:
# expand column width to max (pandas 1.0+ expects None instead of -1)
pd.set_option('display.max_colwidth', -1)

In [79]:
# View a random sample of 2 rows of the twitter_archive DataFrame
twitter_archive.sample(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1140,727685679342333952,,,2016-05-04 02:26:00 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Cilantro. She's a Fellation Gadzooks. Eyes are super magical af. 12/10 could get lost in https://t.co/yJ26LNuyj5,,,,https://twitter.com/dog_rates/status/727685679342333952/photo/1,12,10,Cilantro,,,,
731,781655249211752448,,,2016-09-30 00:41:48 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",This is Combo. The daily struggles of being a doggo have finally caught up with him. 11/10 https://t.co/LOKrNo0OM7,,,,https://vine.co/v/5rt6T3qm7hL,11,10,Combo,doggo,,,


## Assessing

### twitter_archive columns:

- **tweet_id**: the unique identifier for each tweet
- **in_reply_to_status_id**: if the tweet is a reply, the ID of the tweet being replied to
- **in_reply_to_user_id**: if the tweet is a reply, the ID of the user being replied to
- **timestamp**: time of the tweet
- **source**: utility used to post the tweet
- **text**: the tweet's text
- **retweeted_status_id**: if the tweet is a retweet, the ID of the retweeted status
- **retweeted_status_user_id**: if the tweet is a retweet, the user ID attached to the retweeted status
- **retweeted_status_timestamp**: if the tweet is a retweet, the timestamp of the retweeted status
- **expanded_urls**: the tweet's URLs
- **rating_numerator**: the rating given to the dog. It is almost always greater than 10 (11/10, 12/10, 13/10, etc.) because ["they're good dogs Brent"](https://knowyourmeme.com/memes/theyre-good-dogs-brent)
- **rating_denominator**: the rating's denominator, which is almost always 10
- **name**: the name of the dog
- **doggo**, **floofer**, **pupper** & **puppo**: dog stages


In [80]:
twitter_archive.shape

(2356, 17)

In [81]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [82]:
# Check if there are any duplicated tweet_ids
len(twitter_archive.tweet_id.unique())

2356

In [83]:
# Check if there are any duplicated dog names
len(twitter_archive.name.unique())

957

In [84]:
twitter_archive.groupby("name").size().sort_values(ascending=False)
 

name
None         745
a            55 
Charlie      12 
Oliver       11 
Lucy         11 
Cooper       11 
Lola         10 
Tucker       10 
Penny        10 
Bo           9  
Winston      9  
Sadie        8  
the          8  
an           7  
Toby         7  
Daisy        7  
Bailey       7  
Buddy        7  
Leo          6  
Scout        6  
Bella        6  
Dave         6  
Rusty        6  
Jack         6  
Jax          6  
Milo         6  
Koda         6  
Stanley      6  
Oscar        6  
very         5  
            ..  
Jiminus      1  
Jimbo        1  
Jim          1  
Jett         1  
Jessiga      1  
Jessifer     1  
Spencer      1  
Jersey       1  
Josep        1  
Juckson      1  
Kellogg      1  
Julio        1  
Sonny        1  
Keet         1  
Kayla        1  
Kawhi        1  
Katie        1  
Kathmandu    1  
Karma        1  
Karll        1  
Karl         1  
Kara         1  
Kanu         1  
Kane         1  
Kallie       1  
Kaiya        1  
Kaia         1  
Sora     

In [85]:
twitter_archive.name.unique()

array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax',
       'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver',
       'Jim', 'Zeke', 'Ralphus', 'Canela', 'Gerald', 'Jeffrey', 'such',
       'Maya', 'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey',
       'Lilly', 'Earl', 'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella',
       'Grizzwald', 'Rusty', 'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey',
       'Gary', 'a', 'Elliot', 'Louis', 'Jesse', 'Romeo', 'Bailey',
       'Duddles', 'Jack', 'Emmy', 'Steven', 'Beau', 'Snoopy', 'Shadow',
       'Terrance', 'Aja', 'Penny', 'Dante', 'Nelly', 'Ginger', 'Benedict',
       'Venti', 'Goose', 'Nugget', 'Cash', 'Coco', 'Jed', 'Sebastian',
       'Walter', 'Sierra', 'Monkey', 'Harry', 'Kody', 'Lassie', 'Rover',
       'Napolean', 'Dawn', 'Boomer', 'Cody', 'Rumble', 'Clifford',
       'quite', 'Dewey', 'Scout', 'Gizmo', 'Cooper', 'Harold', 'Shikha',
       'Jamesy', 'Lili', 'Sammy', 'Meatball', 'Paisley', 'Albus',
       'Nept

In [86]:
twitter_archive.nunique()    

tweet_id                      2356
in_reply_to_status_id         77  
in_reply_to_user_id           31  
timestamp                     2356
source                        4   
text                          2356
retweeted_status_id           181 
retweeted_status_user_id      25  
retweeted_status_timestamp    181 
expanded_urls                 2218
rating_numerator              40  
rating_denominator            18  
name                          957 
doggo                         2   
floofer                       2   
pupper                        2   
puppo                         2   
dtype: int64

In [87]:
# View descriptive statistics of twitter-archive DataFrame
twitter_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [88]:
twitter_archive['rating_numerator'].isnull().sum() 

0

In [89]:
# Total number of records with zero rating_numerator
(twitter_archive['rating_numerator']== 0).sum()

2

In [90]:
# The maximum numerator is an extreme outlier and will skew summary statistics
twitter_archive['rating_numerator'].max()

1776

In [91]:
(twitter_archive['rating_numerator']== 1776).sum()

1

In [92]:
# Show the information for the rating_numerator == 1776
twitter_archive.query("rating_numerator == 1776 ")

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
979,749981277374128128,,,2016-07-04 15:00:45 +0000,"<a href=""https://about.twitter.com/products/tweetdeck"" rel=""nofollow"">TweetDeck</a>",This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh,,,,https://twitter.com/dog_rates/status/749981277374128128/photo/1,1776,10,Atticus,,,,


In [93]:
twitter_archive.rating_numerator.unique()

array([  13,   12,   14,    5,   17,   11,   10,  420,  666,    6,   15,
        182,  960,    0,   75,    7,   84,    9,   24,    8,    1,   27,
          3,    4,  165, 1776,  204,   50,   99,   80,   45,   60,   44,
        143,  121,   20,   26,    2,  144,   88], dtype=int64)

In [94]:
twitter_archive.rating_denominator.unique()

array([ 10,   0,  15,  70,   7,  11, 150, 170,  20,  50,  90,  80,  40,
       130, 110,  16, 120,   2], dtype=int64)

In [95]:
twitter_archive.doggo.unique()

array(['None', 'doggo'], dtype=object)

In [96]:
twitter_archive.floofer.unique()

array(['None', 'floofer'], dtype=object)

In [97]:
twitter_archive.pupper.unique()

array(['None', 'pupper'], dtype=object)

In [98]:
twitter_archive.puppo.unique()

array(['None', 'puppo'], dtype=object)

### Quality

- Erroneous datatypes: tweet_id and timestamp
- The source column contains HTML `<a>` tags
- The denominator takes different values, not only 10
- One record has a very large numerator (1776): "This is Atticus. He's quite simply America af. 1776/10". This rating goes for dog stage. There are also some other large values.
- Some expanded_urls values contain more than one URL, and some are missing
- Invalid names (e.g., `a`, `an`, `the`, `very`, `such`, `quite`)
- Ratings need to be checked
- The data includes retweets
- Unneeded columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp
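
The `source` issue listed above has no cleaning step in this notebook. One possible approach is to keep only the link text of the anchor tag. The following is a minimal sketch, assuming every value has the `<a ...>label</a>` shape seen in the sampled rows; the toy DataFrame `demo` and the column name `source_label` are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical) with source strings shaped like the sampled rows
demo = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    ]
})

# Capture the text between the end of the opening tag and the closing </a>
demo["source_label"] = demo["source"].str.extract(r">([^<]*)</a>", expand=False)

print(demo["source_label"].tolist())
```

Values that do not match the pattern would come back as NaN, which makes unexpected formats easy to spot.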


### Tidiness
- **Dog stages** (i.e., doggo, floofer, pupper & puppo) should be in one column

# The Second Dataset (Image Predictions File)

## Gathering

In [99]:
# Download Image Predictions File from Udacity's servers 
prediction = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

In [100]:
# check the request 
prediction.status_code

200

In [101]:
# save Predictions File
with open("image_predictions.tsv",mode="wb") as file:
    file.write(prediction.content)

In [102]:
# open the tsv file as a data frame
prediction=pd.read_csv("image_predictions.tsv",sep="\t")

In [103]:
prediction.sample(15)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
72,667211855547486208,https://pbs.twimg.com/media/CUJppKJWoAA75NP.jpg,1,golden_retriever,0.462556,True,Labrador_retriever,0.454937,True,kuvasz,0.024193,True
1722,819711362133872643,https://pbs.twimg.com/media/C2AzHjQWQAApuhf.jpg,2,acorn_squash,0.848704,False,toilet_seat,0.044348,False,toy_poodle,0.022009,True
1262,748977405889503236,https://pbs.twimg.com/media/CmTm-XQXEAAEyN6.jpg,1,German_short-haired_pointer,0.742216,True,bluetick,0.15281,True,English_setter,0.051835,True
257,670778058496974848,https://pbs.twimg.com/media/CU8VFhuVAAAQW8B.jpg,1,pug,0.776612,True,Brabancon_griffon,0.112032,True,boxer,0.039051,True
351,672523490734551040,https://pbs.twimg.com/media/CVVIjGbWwAAxkN0.jpg,1,golden_retriever,0.565981,True,chow,0.081212,True,Irish_terrier,0.061596,True
1797,831552930092285952,https://pbs.twimg.com/media/C4pE-I0WQAABveu.jpg,1,Chihuahua,0.257415,True,Pembroke,0.161442,True,French_bulldog,0.092143,True
1127,727644517743104000,https://pbs.twimg.com/media/Chkc1BQUoAAa96R.jpg,2,Great_Pyrenees,0.457164,True,kuvasz,0.39171,True,Labrador_retriever,0.094523,True
204,669753178989142016,https://pbs.twimg.com/media/CUtw9SAVEAAtFUN.jpg,1,Pembroke,0.858494,True,hamster,0.026319,False,Shetland_sheepdog,0.022405,True
1551,793135492858580992,https://pbs.twimg.com/media/CwHIg61WIAApnEV.jpg,1,bakery,0.737041,False,saltshaker,0.052396,False,teddy,0.046593,False
717,685663452032069632,https://pbs.twimg.com/ext_tw_video_thumb/685663358637486080/pu/img/3cXSHFZAgJQ_dDCf.jpg,1,Chesapeake_Bay_retriever,0.171174,True,tennis_ball,0.090644,False,racket,0.048508,False


## Assessing

### prediction columns:

- tweet_id: ID of each tweet
- jpg_url: image URL
- img_num: number of the image in the tweet
- p1, p2, p3: the predictions for the image in the tweet (first, second, and third prediction)
- p1_conf, p2_conf, p3_conf: how confident each prediction is
- p1_dog, p2_dog, p3_dog: whether or not each prediction is a breed of dog

In [104]:
prediction.shape

(2075, 12)

In [105]:
prediction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [106]:
prediction.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [107]:
prediction[prediction['p1_dog']== False]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,4.588540e-02,False,terrapin,1.788530e-02,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,1.459380e-02,False,golden_retriever,7.958960e-03,True
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,1,hen,0.965932,False,cock,3.391940e-02,False,partridge,5.206580e-05,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,1,desktop_computer,0.086502,False,desk,8.554740e-02,False,bookcase,7.947970e-02,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,1,three-toed_sloth,0.914671,False,otter,1.525000e-02,False,great_grey_owl,1.320720e-02,False
22,666337882303524864,https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg,1,ox,0.416669,False,Newfoundland,2.784070e-01,True,groenendael,1.026430e-01,True
25,666362758909284353,https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg,1,guinea_pig,0.996496,False,skunk,2.402450e-03,False,hamster,4.608630e-04,False
29,666411507551481857,https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg,1,coho,0.404640,False,barracouta,2.714850e-01,False,gar,1.899450e-01,False
33,666430724426358785,https://pbs.twimg.com/media/CT-jNYqW4AAPi2M.jpg,1,llama,0.505184,False,Irish_terrier,1.041090e-01,True,dingo,6.207120e-02,False
43,666776908487630848,https://pbs.twimg.com/media/CUDeDoWUYAAD-EM.jpg,1,seat_belt,0.375057,False,miniature_pinscher,1.671750e-01,True,Chihuahua,8.695060e-02,True


> Some pictures are not predicted to be a dog at all, for example:
> - https://pbs.twimg.com/media/DDMD_phXoAQ1qf0.jpg (a giraffe)
> - https://pbs.twimg.com/ext_tw_video_thumb/729838572744912896/pu/img/RIl-XYmRxW-YLFSV.jpg (a man)
> - https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg (a chicken)
> - https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg (a rabbit)
> - some images that do show dogs

In [57]:
len(prediction[prediction['p1_dog']== False])

543

In [108]:
len(prediction[prediction['p2_dog']== False])

522

In [109]:
len(prediction[prediction['p3_dog']== False])

576

In [113]:
len(prediction[(prediction['p1_dog']== False) & (prediction['p2_dog']== False) & (prediction['p3_dog']== False)])

324

In [129]:
#Another way
prediction.query("p1_dog == False and p2_dog == False and p3_dog == False").count()[0]

324

In [131]:
len(prediction[prediction['p1_dog'] | prediction['p2_dog'] | prediction['p3_dog']])

1751

In [56]:
prediction.groupby("p1").size().sort_values(ascending=False)

p1
golden_retriever             150
Labrador_retriever           100
Pembroke                     89 
Chihuahua                    83 
pug                          57 
chow                         44 
Samoyed                      43 
toy_poodle                   39 
Pomeranian                   38 
malamute                     30 
cocker_spaniel               30 
French_bulldog               26 
miniature_pinscher           23 
Chesapeake_Bay_retriever     23 
seat_belt                    22 
Siberian_husky               20 
Staffordshire_bullterrier    20 
German_shepherd              20 
Cardigan                     19 
web_site                     19 
Shetland_sheepdog            18 
Eskimo_dog                   18 
beagle                       18 
Maltese_dog                  18 
teddy                        18 
Lakeland_terrier             17 
Rottweiler                   17 
Shih-Tzu                     17 
kuvasz                       16 
Italian_greyhound            16 
       

### Quality

- Erroneous Datatype: tweet_id
- Some predictions have no dog image
- Only one prediction and its confidence (p, p_conf) are needed per tweet, not all three

### Tidiness
- The dataset should be merged with the twitter-archive dataset

# The Third Dataset (Data via the Twitter API)

## Gathering

In [39]:
# As mentioned in Project Details:
# Each tweet's JSON data should be written to its own line. 
# Then read this .txt file line by line into a pandas DataFrame with (at minimum):
# tweet ID, retweet count, and favorite count

df_list = []
with open('tweet_json.txt', encoding ='utf-8') as file:
    for line in file:
        json_file = json.loads(line)
        tweet_id = json_file["id"]
        retweet_count = json_file["retweet_count"]
        favorite_count = json_file["favorite_count"]

        df_list.append({"tweet_id":tweet_id,
                        "retweet_count":retweet_count,
                       "favorite_count":favorite_count})
        
df_tweet_json = pd.DataFrame(df_list,columns=["tweet_id","retweet_count","favorite_count"])
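
The cell above assumes `tweet_json.txt` already exists; the step that produces it is not shown. The following is a sketch of how such a file might be written, one JSON object per line. The function name `write_tweet_json` and its `fetch` parameter are hypothetical: `fetch` is any callable that returns a tweet's data as a dict.

```python
import json


def write_tweet_json(tweet_ids, fetch, path):
    """Write one JSON object per line and return the IDs that could not be fetched."""
    failed = []
    with open(path, "w", encoding="utf-8") as out:
        for tweet_id in tweet_ids:
            try:
                data = fetch(tweet_id)
            except Exception:
                # A tweet that cannot be retrieved is recorded and skipped
                failed.append(tweet_id)
                continue
            # json.dumps emits no raw newlines, so each tweet stays on its own line
            out.write(json.dumps(data) + "\n")
    return failed
```

With the `api` object created earlier, `fetch` might be something like `lambda i: api.get_status(i, tweet_mode='extended')._json`. That call needs valid credentials and is not exercised here.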


## Assessing

In [40]:
df_tweet_json.sample(10)

Unnamed: 0,tweet_id,retweet_count,favorite_count
1204,715758151270801409,1596,4085
1775,677918531514703872,463,1476
914,756998049151549440,2271,6923
1651,683462770029932544,761,2676
842,766693177336135680,918,4484
1592,686377065986265092,637,2433
903,758099635764359168,11550,21302
2155,669583744538451968,1017,1587
2026,671866342182637568,548,1191
1109,733482008106668032,1065,3438


In [41]:
# check info
df_tweet_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
tweet_id          2354 non-null int64
retweet_count     2354 non-null int64
favorite_count    2354 non-null int64
dtypes: int64(3)
memory usage: 55.2 KB


In [42]:
df_tweet_json.shape

(2354, 3)

### Quality
- Erroneous datatype: tweet_id should be a string, not an integer

### Tidiness
- The three datasets need to be merged

## Cleaning

In [43]:
twitter_archive_clean = twitter_archive.copy()
prediction_clean = prediction.copy()
df_tweet_json_clean = df_tweet_json.copy()

### Quality

#### Define
- Change the tweet_id datatype from integer to string in all three datasets
- Change the timestamp datatype from string (object) to datetime in the twitter_archive dataset

#### Code

In [44]:
# Change the data type of timestamp from string (object) to datetime
twitter_archive_clean['timestamp'] = pd.to_datetime(twitter_archive_clean['timestamp']) 

In [45]:
# Change the tweet_id datatype from integer to string
twitter_archive_clean['tweet_id'] = twitter_archive_clean.tweet_id.astype(str)
prediction_clean['tweet_id'] = prediction_clean.tweet_id.astype(str)
df_tweet_json_clean['tweet_id'] = df_tweet_json_clean.tweet_id.astype(str)

#### Test

In [42]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2356 non-null object
timestamp             2356 non-null datetime64[ns, UTC]
source                2356 non-null object
text                  2356 non-null object
expanded_urls         2356 non-null object
rating_numerator      2356 non-null int64
rating_denominator    2356 non-null int64
name                  2356 non-null object
doggo                 2356 non-null object
floofer               2356 non-null object
pupper                2356 non-null object
puppo                 2356 non-null object
dtypes: datetime64[ns, UTC](1), int64(2), object(9)
memory usage: 221.0+ KB


In [49]:
prediction_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null object
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


In [50]:
df_tweet_json_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
tweet_id          2354 non-null object
retweet_count     2354 non-null int64
favorite_count    2354 non-null int64
dtypes: int64(2), object(1)
memory usage: 55.2+ KB


#### Define
- Fill the NaN values in the expanded_urls column of the twitter_archive dataset with a URL built from the tweet ID

#### Code

In [118]:
# set url for concatenating
url_main ="https://twitter.com/dog_rates/status/"

In [119]:
# if expanded urls col is na then add the defined url + tweet id for url
twitter_archive_clean.loc[twitter_archive_clean.expanded_urls.isna(),"expanded_urls"]=url_main + twitter_archive_clean["tweet_id"].map(str)

#### Test

In [138]:
twitter_archive_clean['expanded_urls'].isnull().sum() 

0

#### Define
- Drop the unneeded columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp from the twitter_archive dataset

#### Code

In [48]:
# Drop unneeded columns: the in_reply_to_* and retweeted_status_* columns
twitter_archive_clean.drop(['in_reply_to_status_id'  , 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1, inplace=True)  
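
The Quality list flags retweets, but the retweet columns are the only marker of a retweet, so any filtering would have to run before the drop above. The following is a sketch on a toy frame, assuming a non-null `retweeted_status_id` identifies a retweet; `demo` and `originals` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): the second row is a retweet
demo = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [np.nan, 7.0e17, np.nan],
})

# Keep original tweets only: retweets carry a non-null retweeted_status_id
originals = demo[demo["retweeted_status_id"].isna()].copy()

# Once the rows are filtered, the marker column can be dropped
originals = originals.drop(columns=["retweeted_status_id"])

print(originals["tweet_id"].tolist())
```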

#### Test

In [133]:
twitter_archive_clean.sample(2)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1436,697255105972801536,2016-02-10 03:05:46+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Charlie. He likes to kiss all the big milk dogs with the rad earrings. Passionate af. 10/10 just a great guy https://t.co/Oe0XSGmfoP,https://twitter.com/dog_rates/status/697255105972801536/photo/1,10,10,Charlie,,,,
579,800513324630806528,2016-11-21 01:37:04+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Chef. Chef loves everyone and wants everyone to love each other. 11/10 https://t.co/ILHGs0e6Dm,https://twitter.com/dog_rates/status/800513324630806528/photo/1,11,10,Chef,,,,


#### Define
- Delete the records with invalid names (e.g., `a`, `an`, `the`, `very`) from the twitter_archive dataset

#### Code
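No code is given for this step. One possible sketch is shown below, under the assumption that a name starting with a lowercase letter (such as `a`, `the`, or `quite` in the assessment output) is an extraction error rather than a real name. The toy frame `demo` is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical) mixing real names, bad extractions, and the 'None' placeholder
demo = pd.DataFrame({"name": ["Charlie", "a", "the", "None", "Lucy", "quite"]})

# Assumption: real names are capitalised, so a lowercase first letter marks an invalid name
invalid = demo["name"].str.match(r"^[a-z]")

# Drop the invalid rows, as the Define step describes
demo_clean = demo[~invalid].reset_index(drop=True)

print(demo_clean["name"].tolist())
```

The `'None'` placeholder is deliberately left alone here, since the tidiness step later replaces it with NaN. Setting invalid names to NaN instead of deleting the rows would be an alternative that keeps the tweets.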

#### Test

#### Define
- Review rating_numerator and rating_denominator: some denominators are not 10, and some numerators are extreme (e.g., 1776)

#### Code
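No code is given for this step. As a sketch of one way to approach it, the rows whose denominator is not 10 can be set aside for review, and a normalised rating can be computed while guarding against the zero denominator seen in the assessment. The toy rows and the `rating` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows; the pairings are for illustration only)
demo = pd.DataFrame({
    "rating_numerator": [12, 1776, 84, 960],
    "rating_denominator": [10, 10, 70, 0],
})

# Rows whose denominator is not 10 are candidates for manual review
needs_review = demo[demo["rating_denominator"] != 10]

# Normalised rating: a zero denominator becomes NaN instead of dividing by zero
denominator = demo["rating_denominator"].replace(0, np.nan)
demo["rating"] = demo["rating_numerator"] / denominator

print(needs_review)
print(demo)
```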

#### Test

#### Define
- Some predictions have no dog image

#### Code
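No code is given for this step. A possible sketch is to keep only the rows where at least one of the three predictions is a dog breed, which is the same condition counted during assessment. The toy frame `demo` is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical): the second row has no dog prediction at all
demo = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "p1_dog": [True, False, False],
    "p2_dog": [True, False, True],
    "p3_dog": [True, False, False],
})

# True when at least one of the three predictions is a dog breed
any_dog = demo[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)

# Keep only those rows
demo_dogs = demo[any_dog].reset_index(drop=True)

print(demo_dogs["tweet_id"].tolist())
```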

#### Test

#### Define
- Keep only one prediction and its confidence (p, p_conf) per tweet

#### Code
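No code is given for this step. One reading of it is to keep, for each tweet, the highest-ranked prediction that is a dog breed, together with its confidence. The sketch below follows that assumption. The values are rounded from rows shown in the outputs above, and the column names `p` and `p_conf` are taken from the Define step.

```python
import numpy as np
import pandas as pd

# Values rounded from rows shown in the assessment outputs
demo = pd.DataFrame({
    "p1": ["golden_retriever", "ox", "hen"],
    "p1_conf": [0.46, 0.42, 0.97],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "Newfoundland", "cock"],
    "p2_conf": [0.45, 0.28, 0.03],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "groenendael", "partridge"],
    "p3_conf": [0.02, 0.10, 0.00005],
    "p3_dog": [True, True, False],
})

# Start with "no dog prediction" and overwrite from rank 3 up to rank 1,
# so the highest-ranked dog prediction is the one that survives
best = pd.Series(np.nan, index=demo.index, dtype=object)
best_conf = pd.Series(np.nan, index=demo.index)
for rank in (3, 2, 1):
    is_dog = demo[f"p{rank}_dog"]
    best = demo[f"p{rank}"].where(is_dog, best)
    best_conf = demo[f"p{rank}_conf"].where(is_dog, best_conf)

result = pd.DataFrame({"p": best, "p_conf": best_conf})
print(result)
```

Rows with no dog prediction end up with NaN in both columns, so they can be dropped or kept depending on how the previous step was handled.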

#### Test

### Tidiness

#### Define
Combine the dog stage columns (i.e., doggo, floofer, pupper & puppo) into a single dog stage column

#### Code

In [44]:
# replace None with nan
twitter_archive_clean.replace("None",np.nan,inplace=True)

In [45]:
# replace nan with an empty string so the columns can be concatenated
twitter_archive_clean[["doggo","floofer","pupper","puppo"]]=twitter_archive_clean[["doggo","floofer","pupper","puppo"]].fillna("")

In [46]:
# check the result
twitter_archive_clean.sample()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
688,787810552592695296,2016-10-17 00:20:47+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Frank. He wears sunglasses and walks himself. 11/10 I'll never be this cool or independent https://t.co/pNNjBtHWPc,"https://twitter.com/dog_rates/status/787810552592695296/photo/1,https://twitter.com/dog_rates/status/787810552592695296/photo/1",11,10,Frank,,,,


In [47]:
# concatenate dog stage columns to create a new column "stage"
twitter_archive_clean["stage"]=(twitter_archive_clean["doggo"] + twitter_archive_clean["floofer"] + twitter_archive_clean["pupper"] + twitter_archive_clean["puppo"])

In [48]:
# check unique stages
twitter_archive_clean.stage.unique()

array(['', 'doggo', 'puppo', 'pupper', 'floofer', 'doggopuppo',
       'doggofloofer', 'doggopupper'], dtype=object)

**Some tweets have multiple dog stages. These need to be handled.**

In [49]:
# Handle multiple stages
twitter_archive_clean.loc[twitter_archive_clean.stage == 'doggopupper', 'stage'] = 'doggo,pupper'
twitter_archive_clean.loc[twitter_archive_clean.stage == 'doggopuppo', 'stage'] = 'doggo,puppo'
twitter_archive_clean.loc[twitter_archive_clean.stage == 'doggofloofer', 'stage'] = 'doggo,floofer'

# Handle missing values through change empty stages to na
twitter_archive_clean.loc[twitter_archive_clean.stage == '', 'stage'] = np.nan

In [50]:
# check the result
twitter_archive_clean[twitter_archive_clean["stage"].notna()].stage.unique()

array(['doggo', 'puppo', 'pupper', 'floofer', 'doggo,puppo',
       'doggo,floofer', 'doggo,pupper'], dtype=object)

In [51]:
# drop "doggo","floofer","pupper","puppo" columns
twitter_archive_clean.drop(columns=["doggo","floofer","pupper","puppo"],inplace=True)

In [52]:
# reset index
twitter_archive_clean.reset_index(inplace=True,drop=True)

#### Test

In [53]:
# check number of observations
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 9 columns):
tweet_id              2356 non-null object
timestamp             2356 non-null datetime64[ns, UTC]
source                2356 non-null object
text                  2356 non-null object
expanded_urls         2356 non-null object
rating_numerator      2356 non-null int64
rating_denominator    2356 non-null int64
name                  1611 non-null object
stage                 380 non-null object
dtypes: datetime64[ns, UTC](1), int64(2), object(6)
memory usage: 165.7+ KB


In [54]:
# check data structure
twitter_archive_clean.sample(2)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,stage
1320,706346369204748288,2016-03-06 05:11:12+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Koda. She's a Beneboom Cumberwiggle. 12/10 petable as hell https://t.co/VZV6oMJmU6,"https://twitter.com/dog_rates/status/706346369204748288/photo/1,https://twitter.com/dog_rates/status/706346369204748288/photo/1",12,10,Koda,
1781,677698403548192770,2015-12-18 03:54:25+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Sadie. She got her holidays confused. 9/10 damn it Sadie https://t.co/fm7HxOsuPK,https://twitter.com/dog_rates/status/677698403548192770/photo/1,9,10,Sadie,


In [55]:
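# spot-check against the original data
# note: tweet_id is now a string, so comparing it with an integer matches no rows
# (hence the empty result below); compare with '881536004380872706' instead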
twitter_archive_clean[twitter_archive_clean.tweet_id ==881536004380872706].stage

Series([], Name: stage, dtype: object)

In [56]:
twitter_archive[twitter_archive.tweet_id ==881536004380872706].pupper

56    pupper
Name: pupper, dtype: object

#### Define
Merge the three data sets

#### Code
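No code is given for this step. Since `tweet_id` has been converted to a string in all three cleaned tables, they can be joined on that column. The following is a minimal sketch on toy frames; the names `archive`, `predictions`, `counts`, and `master` are hypothetical, and the choice of an inner join is an assumption.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the three cleaned tables
archive = pd.DataFrame({"tweet_id": ["1", "2", "3"], "rating_numerator": [12, 11, 13]})
predictions = pd.DataFrame({"tweet_id": ["1", "2"], "p1": ["pug", "chow"]})
counts = pd.DataFrame({
    "tweet_id": ["1", "3"],
    "retweet_count": [463, 761],
    "favorite_count": [1476, 2676],
})

# Inner joins keep only the tweets that appear in all three tables
master = (
    archive
    .merge(predictions, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)

print(master)
```

Using `how="left"` from the archive instead would keep every archived tweet and leave NaN where predictions or counts are missing.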

#### Test