# WeRateDogs Twitter Archive Analysis


## Introduction

This project uses the [Twitter](https://twitter.com/) API and the #WeRateDogs Twitter archive. It focuses on gathering and cleaning the collected data, then drawing insights from it through data analysis.


## Table of Contents

1. <a href='#gather'>Data Gathering</a>
2. <a href='#assess'>Assessment</a>
3. <a href='#clean'>Data Cleaning</a>
4. <a href='#analysis'>Data Analysis</a>

<a id='gather'></a>
## Data Gathering

In [1]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tweepy
import json
import requests
import os
from tweepy import OAuthHandler
from timeit import default_timer as timer

%matplotlib inline

In [2]:
# Read In WeRateDogs Twitter archive as we_rd
we_rd = pd.read_csv('twitter-archive-enhanced.csv')

Download the tweet image predictions, which were generated by a neural network.

In [3]:
# First, create a folder to store the image predictions file
folder_name = 'image_predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [4]:
# Send a request to the necessary URL
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [5]:
# Save the requests response to a .tsv file
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
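
The folder creation and file saving steps above can be wrapped in one small helper. This is a minimal sketch rather than the notebook's own code: `save_bytes` is a hypothetical name, and the demo writes placeholder bytes to a temporary folder instead of contacting the server. With a real `requests` response, calling `response.raise_for_status()` before saving would surface HTTP errors early.

```python
import os
import tempfile

def save_bytes(content, url, folder_name):
    """Write raw bytes into folder_name, naming the file after the last URL segment."""
    os.makedirs(folder_name, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder_name, url.split('/')[-1])
    with open(path, mode='wb') as file:
        file.write(content)
    return path

# Demo with placeholder bytes (assumption: these stand in for response.content)
demo_url = 'https://example.com/data/image-predictions.tsv'
demo_folder = os.path.join(tempfile.mkdtemp(), 'image_predictions')
saved_path = save_bytes(b'tweet_id\tjpg_url\n', demo_url, demo_folder)
print(os.path.basename(saved_path))
```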

In [6]:
# Read in the image-predictions.tsv into a dataframe
predictions = pd.read_csv('image_predictions/image-predictions.tsv', sep='\t')

    Note for the instructor: I wanted to do the next step on my own, so I submitted my application to Twitter, but I have not heard back yet. That is why I had to use the ready-made tweet-json.txt.

In [7]:
# Read the tweet-json.txt file line by line and append the contents to an empty
# list
selected_attr = []
with open('tweet-json.txt', 'r') as json_file:
    for line in json_file:
        json_data = json.loads(line)
        selected_attr.append({
            'tweet_id': json_data['id'],
            'favorites': json_data['favorite_count'],
            'retweets': json_data['retweet_count'],
        })

In [8]:
# Create a dataframe from the list containing tweets data
tweets_selected = pd.DataFrame(selected_attr,
                               columns=['tweet_id', 'favorites', 'retweets'])
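
The line-by-line parsing above assumes that every line of the file is valid JSON. A slightly more defensive sketch is shown here. The `parse_tweet_lines` helper and the two sample lines are hypothetical, made up only to keep the example self-contained.

```python
import io
import json

def parse_tweet_lines(lines):
    """Collect id, favorite_count and retweet_count from JSON lines, skipping blanks."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines, e.g. a trailing newline at the end of the file
        data = json.loads(line)
        rows.append({
            'tweet_id': data['id'],
            'favorites': data['favorite_count'],
            'retweets': data['retweet_count'],
        })
    return rows

# Made-up sample standing in for tweet-json.txt
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 0}\n'
)
rows = parse_tweet_lines(sample)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `selected_attr`.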

<a id='assess'></a>
## Assessing

### Assessing the WeRateDogs archive

In [9]:
we_rd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [10]:
we_rd.tweet_id[:3]

0    892420643555336193
1    892177421306343426
2    891815181378084864
Name: tweet_id, dtype: int64

In [11]:
we_rd.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [12]:
we_rd[we_rd.in_reply_to_status_id.notnull()].head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,


In [13]:
we_rd.timestamp

0       2017-08-01 16:23:56 +0000
1       2017-08-01 00:17:27 +0000
2       2017-07-31 00:18:03 +0000
3       2017-07-30 15:58:51 +0000
4       2017-07-29 16:00:24 +0000
                  ...            
2351    2015-11-16 00:24:50 +0000
2352    2015-11-16 00:04:52 +0000
2353    2015-11-15 23:21:54 +0000
2354    2015-11-15 23:05:30 +0000
2355    2015-11-15 22:32:08 +0000
Name: timestamp, Length: 2356, dtype: object

In [14]:
we_rd[['doggo', 'puppo', 'pupper', 'floofer']].head()

Unnamed: 0,doggo,puppo,pupper,floofer
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,


In [15]:
predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [16]:
predictions.duplicated().sum()

0

In [17]:
predictions.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
61,667152164079423490,https://pbs.twimg.com/media/CUIzWk_UwAAfUNq.jpg,1,toy_poodle,0.535411,True,Pomeranian,0.087544,True,miniature_poodle,0.06205,True
1882,847157206088847362,https://pbs.twimg.com/media/C8G0_CMWsAAjjAY.jpg,2,Staffordshire_bullterrier,0.219609,True,American_Staffordshire_terrier,0.178671,True,pug,0.123271,True
895,699446877801091073,https://pbs.twimg.com/media/CbTvNpoW0AEemnx.jpg,3,Pembroke,0.9694,True,Cardigan,0.026059,True,Chihuahua,0.003505,True
1449,776201521193218049,https://pbs.twimg.com/media/CsWfKadWEAAtmlS.jpg,1,Rottweiler,0.502228,True,black-and-tan_coonhound,0.154594,True,bloodhound,0.135176,True
1703,817181837579653120,https://pbs.twimg.com/ext_tw_video_thumb/81596...,1,Tibetan_mastiff,0.506312,True,Tibetan_terrier,0.29569,True,otterhound,0.036251,True


In [18]:
tweets_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   tweet_id   2354 non-null   int64
 1   favorites  2354 non-null   int64
 2   retweets   2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB


In [19]:
tweets_selected.retweets.notnull().sum()

2354

In [20]:
tweets_selected.describe()

Unnamed: 0,tweet_id,favorites,retweets
count,2354.0,2354.0,2354.0
mean,7.426978e+17,8080.968564,3164.797366
std,6.852812e+16,11814.771334,5284.770364
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,1415.0,624.5
50%,7.194596e+17,3603.5,1473.5
75%,7.993058e+17,10122.25,3652.0
max,8.924206e+17,132810.0,79515.0


In [21]:
tweets_selected[tweets_selected.retweets == 0]

Unnamed: 0,tweet_id,favorites,retweets
290,838085839343206401,150,0


### Issues


#### Quality
**WeRateDogs Archive**
1. The 'doggo', 'puppo', 'pupper', 'floofer' columns contain the string "None" instead of NaN values.
2. Zeros in the numerator and denominator columns
3. Extremely large values in the numerator and denominator columns


**Image Predictions**
1. Inconsistent names for p1, p2, p3

#### Tidiness
**WeRateDogs Archive**
1. Four separate columns for a single variable (dog stage)
2. Text column contains more than one variable
3. Some tweets are retweets or replies (a non-null retweeted_status_id marks a retweet; a non-null in_reply_to_status_id marks a reply)
4. Ratings are given in two columns.
5. The in_reply_to_status_id, in_reply_to_user_id, timestamp, expanded_urls, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and source columns are not needed

**Image Predictions**
1. Might contain retweet information

**Tweets from API**
1. Might contain retweet information
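
To illustrate the "text column contains more than one variable" issue listed above, a rating can be recovered from the tweet text with a regular expression. This is only a sketch: the pattern and the `extract_rating` helper are assumptions and not part of the original notebook, and the sample strings are shortened stand-ins for real tweets.

```python
import re

# One or more digits (optionally with a decimal part), a slash, then digits
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) for the first 'N/D' pattern, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

print(extract_rating('These are good dogs but 17/10 is too much'))
print(extract_rating('14/10 confirmed'))
print(extract_rating('no rating here'))
```

Taking only the first match is a simplification. Tweets that contain dates or several fractions would need extra handling.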

<a id='clean'></a>
## Data Cleaning

### Cleaning for quality

**WeRateDogs Archive** 

#### Define
- The 'doggo', 'puppo', 'pupper', 'floofer' columns contain the string "None" instead of NaN values.

#### Code

In [22]:
# Apply an anonymous function to the 4 columns above to convert "None" strings to NaN
we_rd['doggo'] = we_rd['doggo'].apply(lambda x: np.nan if x == 'None' else x)
we_rd['puppo'] = we_rd['puppo'].apply(lambda x: np.nan if x == 'None' else x)
we_rd['pupper'] = we_rd['pupper'].apply(lambda x: np.nan if x == 'None' else x)
we_rd['floofer'] = we_rd['floofer'].apply(lambda x: np.nan if x == 'None' else x)
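
The same conversion can also be done in one step with `DataFrame.replace`. The following is a sketch on a made-up two-row frame (`demo` is hypothetical), not a change to the notebook's data.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'puppo', 'pupper', 'floofer']

# Made-up sample that mimics the "None" placeholder strings
demo = pd.DataFrame({
    'doggo': ['None', 'doggo'],
    'puppo': ['None', 'None'],
    'pupper': ['pupper', 'None'],
    'floofer': ['None', 'None'],
})

# Replace the placeholder string with NaN in all four columns at once
demo[stage_cols] = demo[stage_cols].replace('None', np.nan)
print(demo.isnull().sum())
```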

#### Test

In [23]:
# This should return True for cells that are now null
# Earlier checks showed no null values in these columns because of the "None" strings
we_rd[['doggo', 'puppo', 'pupper', 'floofer']].isnull()

Unnamed: 0,doggo,puppo,pupper,floofer
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
...,...,...,...,...
2351,True,True,True,True
2352,True,True,True,True
2353,True,True,True,True
2354,True,True,True,True


#### Define
- Zeros in the numerator and denominator columns

#### Code

In [24]:
# Get the rows where either rating column is 0
nulls = we_rd[(we_rd.rating_numerator == 0) | (we_rd.rating_denominator == 0)].copy()
nulls

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,
315,835152434251116546,,,2017-02-24 15:40:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you're so blinded by your systematic plag...,,,,https://twitter.com/dog_rates/status/835152434...,0,10,,,,,
1016,746906459439529985,7.468859e+17,4196984000.0,2016-06-26 03:22:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...","PUPDATE: can't see any. Even if I could, I cou...",,,,https://twitter.com/dog_rates/status/746906459...,0,10,,,,,


As there are only 3 such records, I will drop them.

In [25]:
we_rd.drop(nulls.index, axis=0, inplace=True)

#### Test

In [26]:
we_rd[(we_rd.rating_numerator == 0) | (we_rd.rating_denominator == 0)]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


#### Define
- Extremely large values for numerator and denominator columns

#### Code

I checked the original WeRateDogs Twitter account and found that their highest rating ever was 15/10. So I will cap every numerator above 15 at 15 and every denominator above 10 at 10.

In [27]:
we_rd['rating_numerator'] = we_rd['rating_numerator'].apply(
                                        lambda x: 15 if x > 15 else x)
we_rd['rating_denominator'] = we_rd['rating_denominator'].apply(
                                        lambda x: 10 if x > 10 else x)
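
The capping done by the two lambdas above can also be written with `Series.clip`. This is a sketch on a made-up frame (`ratings` and its values are illustrative only), not the notebook's data.

```python
import pandas as pd

# Illustrative values only
ratings = pd.DataFrame({
    'rating_numerator': [12, 17, 1776],
    'rating_denominator': [10, 10, 170],
})

# clip(upper=...) replaces every value above the bound with the bound itself
ratings['rating_numerator'] = ratings['rating_numerator'].clip(upper=15)
ratings['rating_denominator'] = ratings['rating_denominator'].clip(upper=10)
print(ratings)
```

Note that capping the two columns independently does not preserve the ratio of a rating whose denominator is not 10.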

#### Test

In [28]:
we_rd.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2353.0,76.0,76.0,181.0,181.0,2353.0,2353.0
mean,7.426913e+17,7.44309e+17,2.067175e+16,7.7204e+17,1.241698e+16,10.746281,9.995325
std,6.85577e+16,7.611756e+16,1.268953e+17,6.236928e+16,9.599254e+16,2.204426,0.176112
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,1.0,2.0
25%,6.783968e+17,6.756548e+17,342194300.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.193678e+17,7.031489e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.992971e+17,8.241444e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,15.0,10.0


**Image Predictions**

#### Define
- Inconsistent names for p1, p2, p3
 
#### Code

In [29]:
# Convert all the values in p1, p2, p3 columns to lowercase
predictions['p1'] = predictions['p1'].str.lower()
predictions['p2'] = predictions['p2'].str.lower()
predictions['p3'] = predictions['p3'].str.lower()
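
Since the same operation is applied to three columns, it can also be written as a loop. This is a sketch on a small made-up frame (`demo` is hypothetical) that reuses a few prediction names from the sample output earlier in the notebook.

```python
import pandas as pd

demo = pd.DataFrame({
    'p1': ['toy_poodle', 'Pembroke'],
    'p2': ['Pomeranian', 'Cardigan'],
    'p3': ['miniature_poodle', 'Chihuahua'],
})

# Lowercase each prediction column in turn
for col in ['p1', 'p2', 'p3']:
    demo[col] = demo[col].str.lower()
print(demo)
```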

#### Test

In [30]:
predictions.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
79,667453023279554560,https://pbs.twimg.com/media/CUNE_OSUwAAdHhX.jpg,1,labrador_retriever,0.82567,True,french_bulldog,0.056639,True,staffordshire_bullterrier,0.054018,True
1890,848690551926992896,https://pbs.twimg.com/media/C8cnjHuXsAAoZQf.jpg,1,flat-coated_retriever,0.823648,True,newfoundland,0.100571,True,groenendael,0.03831,True
938,703631701117943808,https://pbs.twimg.com/media/CcPNS4yW8AAd-Et.jpg,2,window_shade,0.909533,False,window_screen,0.011427,False,brass,0.008882,False
809,692158366030913536,https://pbs.twimg.com/media/CZsKVxfWQAAXy2u.jpg,1,pug,0.956565,True,swing,0.018907,False,toy_poodle,0.013544,True
644,681579835668455424,https://pbs.twimg.com/media/CXV1Ot_W8AEpkQO.jpg,1,rottweiler,0.760671,True,labrador_retriever,0.096585,True,staffordshire_bullterrier,0.040333,True
290,671166507850801152,https://pbs.twimg.com/media/CVB2TnWUYAA2pAU.jpg,1,refrigerator,0.829772,False,toilet_seat,0.030083,False,shower_curtain,0.015461,False
523,676588346097852417,https://pbs.twimg.com/media/CWO5gmCUYAAX4WA.jpg,1,boston_bull,0.976577,True,french_bulldog,0.014324,True,chihuahua,0.002302,True
828,693590843962331137,https://pbs.twimg.com/media/CaAhMb1XEAAB6Bz.jpg,1,dining_table,0.383448,False,grey_fox,0.103191,False,siamese_cat,0.098256,False
1963,867051520902168576,https://pbs.twimg.com/media/DAhiwb0XcAA8x5Q.jpg,1,samoyed,0.471403,True,pekinese,0.302219,True,pomeranian,0.156606,True
1431,773247561583001600,https://pbs.twimg.com/media/Crsgi9dWEAApQd8.jpg,1,seat_belt,0.713588,False,miniature_pinscher,0.083369,True,brabancon_griffon,0.075696,True


### Cleaning for Tidiness

**WeRateDogs Archive**

#### Define
- Four separate columns for a single variable (dog stage)

#### Code

Generally, to gather values that belong to a single variable from different columns, I would use the `pandas.melt()` function. However, in this case, as there are rows with multiple values for the dog stage, I will do it in a custom way.

In [70]:
# First concatenate the 4 columns as strings; NaN values become the string 'nan'
we_rd['dog_stage'] = (we_rd['doggo'].apply(str) + we_rd['floofer'].apply(str)
                      + we_rd['pupper'].apply(str) + we_rd['puppo'].apply(str))

In [71]:
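# Remove the 'nan' placeholders from all values. str.replace drops the literal
# substring, whereas str.strip('nan') strips leading/trailing 'n' and 'a' characters
we_rd['dog_stage'] = we_rd['dog_stage'].str.replace('nan', '', regex=False)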
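
When removing the 'nan' placeholders from the concatenated string, keep in mind that `strip` treats its argument as a set of characters, not as a substring. It gives the expected result for the four stage names only because none of them starts or ends with 'n' or 'a'. The sketch below uses plain Python strings to compare it with `replace`. The label 'newborn' is made up to show the difference.

```python
# How one concatenated row looks when only 'pupper' is set
combined = 'nan' + 'nan' + 'pupper' + 'nan'

stripped = combined.strip('nan')        # removes leading/trailing 'n' and 'a' characters
replaced = combined.replace('nan', '')  # removes every literal 'nan' substring
print(stripped, replaced)

# A made-up label that starts and ends with 'n' is damaged by strip
damaged = 'nannewbornnan'.strip('nan')
intact = 'nannewbornnan'.replace('nan', '')
print(damaged, intact)
```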

In [72]:
# Write a custom function
def convert_dog_stages(value):
    """
    Map a concatenated stage string to a single label: NaN when no stage is
    present, the stage name when exactly one is present, and 'multiple'
    otherwise.
    """
    dog_stages = ['doggo', 'floofer', 'puppo', 'pupper']

    if value == '':
        return np.nan
    if value in dog_stages:
        return value
    return 'multiple'

In [73]:
# Apply the above function to the dog stage column
we_rd['dog_stage'] = we_rd['dog_stage'].apply(convert_dog_stages)
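
An alternative that avoids string concatenation altogether is to look at which stage columns are non-null in each row. The following is a sketch under the assumption that the "None" strings have already been converted to NaN. `combine_stages` and the `demo` frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows: one stage, one stage, two stages, no stage
demo = pd.DataFrame({
    'doggo':   ['doggo', np.nan, 'doggo', np.nan],
    'floofer': [np.nan, np.nan, np.nan, np.nan],
    'pupper':  [np.nan, 'pupper', 'pupper', np.nan],
    'puppo':   [np.nan, np.nan, np.nan, np.nan],
})

def combine_stages(row):
    """Return the single stage in the row, 'multiple' for several, NaN for none."""
    found = [stage for stage in stage_cols if pd.notna(row[stage])]
    if not found:
        return np.nan
    if len(found) > 1:
        return 'multiple'
    return found[0]

demo['dog_stage'] = demo[stage_cols].apply(combine_stages, axis=1)
print(demo['dog_stage'])
```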

#### Test

In [74]:
we_rd.dog_stage.notnull().sum()

380

In [75]:
we_rd.dog_stage.value_counts()

pupper      245
doggo        83
puppo        29
multiple     14
floofer       9
Name: dog_stage, dtype: int64

Even though we labeled some rows as "multiple", it does not make sense for a dog to be in several dog stages at once, so I will drop those rows. Since we no longer need the other 4 columns, I will drop them too.

In [76]:
we_rd = we_rd[we_rd.dog_stage != 'multiple'].copy()
we_rd.drop(['puppo', 'pupper', 'doggo', 'floofer'], axis=1, inplace=True)

In [77]:
# Test
we_rd.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_stage
373,828376505180889089,,,2017-02-05 22:55:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...","This is Beebop. Her name means ""Good Dog"" in r...",,,,https://twitter.com/dog_rates/status/828376505...,13,10,Beebop,
325,833863086058651648,,,2017-02-21 02:17:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bentley. Hairbrushes are his favorite ...,,,,https://twitter.com/dog_rates/status/833863086...,12,10,Bentley,
650,792883833364439040,,,2016-10-31 00:20:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bailey. She's rather h*ckin hype for H...,,,,https://twitter.com/dog_rates/status/792883833...,12,10,Bailey,
279,839990271299457024,,,2017-03-10 00:04:21 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sojourner. His nose is a Fibonacci Spiral...,,,,https://twitter.com/dog_rates/status/839990271...,13,10,Sojourner,
824,769940425801170949,,,2016-08-28 16:51:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Klein. These pics were taken a month a...,,,,https://twitter.com/dog_rates/status/769940425...,12,10,Klein,


#### Define
- Some tweets are retweets or replies (a non-null retweeted_status_id marks a retweet; a non-null in_reply_to_status_id marks a reply)

#### Code

In [80]:
# Filter the dataframe for notnull retweet status_ids
retweeted = we_rd[we_rd.retweeted_status_id.notnull()].copy()
# Drop the rows with the indexes in retweeted
we_rd.drop(retweeted.index, axis=0, inplace=True)
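
The same filtering can be expressed with boolean masks instead of dropping by index. This is a sketch on a made-up frame (`demo` and its values are hypothetical). Whether replies should be removed along with retweets is an assumption here, since the notebook only inspects them in the next cell.

```python
import numpy as np
import pandas as pd

# Made-up sample: tweet 2 is a retweet, tweet 3 is a reply
demo = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 7.7e17, np.nan, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 8.8e17, np.nan],
})

is_retweet = demo['retweeted_status_id'].notnull()
is_reply = demo['in_reply_to_status_id'].notnull()

# Keep only rows that are neither retweets nor replies
originals = demo[~is_retweet & ~is_reply].copy()
print(originals['tweet_id'].tolist())
```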

In [81]:
# Now filter the dataframe for notnull in_reply tweets
in_reply = we_rd[we_rd.in_reply_to_status_id.notnull()].copy()
in_reply

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_stage
30,886267009285017600,8.862664e+17,2.281182e+09,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,15,10,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2038,671550332464455680,6.715449e+17,4.196984e+09,2015-12-01 04:44:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",After 22 minutes of careful deliberation this ...,,,,,1,10,,
2149,669684865554620416,6.693544e+17,4.196984e+09,2015-11-26 01:11:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",After countless hours of research and hundreds...,,,,,11,10,,
2169,669353438988365824,6.678065e+17,4.196984e+09,2015-11-25 03:14:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tessa. She is also very pleased after ...,,,,https://twitter.com/dog_rates/status/669353438...,10,10,Tessa,
2189,668967877119254528,6.689207e+17,2.143566e+07,2015-11-24 01:42:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",12/10 good shit Bubka\n@wane15,,,,,12,10,,


## Links

1. To read the JSON file and save its contents to a DataFrame, I used this [answer](https://knowledge.udacity.com/questions/68700#68752) on Udacity Knowledge because the article from Stack Abuse was not helpful.