# WeRateDogs Twitter Archive Analysis


## Introduction

This project uses the [Twitter](https://twitter.com/) API and the #WeRateDogs Twitter archive. It focuses on gathering and cleaning the collected data, then drawing insights from it through data analysis.


## Table of Contents

1. <a href='#gather'>Data Gathering</a>
2. <a href='#assess'>Assessing</a>
3. <a href='#clean'>Data Cleaning</a>
4. <a href='#analysis'>Data Analysis</a>

<a id='gather'></a>
## Data Gathering

In [1]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tweepy
import json
import requests
import os
from tweepy import OAuthHandler
from timeit import default_timer as timer

%matplotlib inline

In [2]:
# Read in the WeRateDogs Twitter archive as we_rd
we_rd = pd.read_csv('twitter-archive-enhanced.csv')

Download the tweet image predictions, which were generated using a neural network.

In [3]:
# First, create a folder to store the downloaded predictions file
folder_name = 'image_predictions'
os.makedirs(folder_name, exist_ok=True)

In [4]:
# Send a request to the file's URL and fail early on an HTTP error
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()

In [5]:
# Save the requests response to a .tsv file
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [6]:
# Read image-predictions.tsv into a dataframe
predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')

    Note for the instructor: I wanted to do the next step on my own, so I applied for Twitter API access, but I have not heard back yet. That is why I used the ready-made tweet-json.txt.

In [7]:
# Read tweet-json.txt line by line and collect the selected attributes of each
# tweet in a list
selected_attr = []
with open('tweet-json.txt', 'r') as json_file:
    for line in json_file:
        json_data = json.loads(line)
        selected_attr.append({
            'tweet_id': json_data['id'],
            'favorites': json_data['favorite_count'],
            'retweets': json_data['retweet_count'],
        })

In [8]:
# Create a dataframe from the list containing tweets data
tweets_selected = pd.DataFrame(selected_attr,
                               columns=['tweet_id', 'favorites', 'retweets'])
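
The loop in cell 7 assumes that every line of tweet-json.txt is a complete JSON object carrying all three keys. A slightly more defensive variant might look like the sketch below: it skips blank lines and uses `dict.get`, so a missing counter becomes `None` instead of raising `KeyError`. The helper name `parse_tweet_lines` is hypothetical, and the sample lines are made up for illustration, not taken from the real file.

```python
import io
import json


def parse_tweet_lines(lines):
    """Yield one dict per non-blank line of line-delimited tweet JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        tweet = json.loads(line)
        yield {
            'tweet_id': tweet['id'],
            'favorites': tweet.get('favorite_count'),
            'retweets': tweet.get('retweet_count'),
        }


# Illustrative input only; the real file holds one full tweet object per line.
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '\n'
    '{"id": 2, "favorite_count": 5}\n'
)
rows = list(parse_tweet_lines(sample))
```

With the real file, `list(parse_tweet_lines(json_file))` inside the existing `with` block could be passed to `pd.DataFrame` exactly as `selected_attr` is.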

<a id='assess'></a>
## Assessing

### Assessing the WeRateDogs archive

In [19]:
we_rd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [32]:
we_rd.tweet_id[:3]

0    892420643555336193
1    892177421306343426
2    891815181378084864
Name: tweet_id, dtype: int64

In [51]:
we_rd.describe()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | retweeted_status_id | retweeted_status_user_id | rating_numerator | rating_denominator |
|---|---|---|---|---|---|---|---|
| count | 2356.0 | 78.0 | 78.0 | 181.0 | 181.0 | 2356.0 | 2356.0 |
| mean | 7.427716e+17 | 7.455079e+17 | 2.014171e+16 | 7.7204e+17 | 1.241698e+16 | 13.126486 | 10.455433 |
| std | 6.856705e+16 | 7.582492e+16 | 1.252797e+17 | 6.236928e+16 | 9.599254e+16 | 45.876648 | 6.745237 |
| min | 6.660209e+17 | 6.658147e+17 | 11856340.0 | 6.661041e+17 | 783214.0 | 0.0 | 0.0 |
| 25% | 6.783989e+17 | 6.757419e+17 | 308637400.0 | 7.186315e+17 | 4196984000.0 | 10.0 | 10.0 |
| 50% | 7.196279e+17 | 7.038708e+17 | 4196984000.0 | 7.804657e+17 | 4196984000.0 | 11.0 | 10.0 |
| 75% | 7.993373e+17 | 8.257804e+17 | 4196984000.0 | 8.203146e+17 | 4196984000.0 | 12.0 | 10.0 |
| max | 8.924206e+17 | 8.862664e+17 | 8.405479e+17 | 8.87474e+17 | 7.874618e+17 | 1776.0 | 170.0 |


In [35]:
we_rd[we_rd.in_reply_to_status_id.notnull()].head(2)

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 886267009285017600 | 8.862664e+17 | 2281182000.0 | 2017-07-15 16:51:35 +0000 | `<a href="http://twitter.com/download/iphone" r...` | @NonWhiteHat @MayhewMayhem omg hello tanner yo... | | | | | 12 | 10 | | | | | |
| 55 | 881633300179243008 | 8.81607e+17 | 47384430.0 | 2017-07-02 21:58:53 +0000 | `<a href="http://twitter.com/download/iphone" r...` | @roushfenway These are good dogs but 17/10 is ... | | | | | 17 | 10 | | | | | |


In [36]:
we_rd.timestamp

0       2017-08-01 16:23:56 +0000
1       2017-08-01 00:17:27 +0000
2       2017-07-31 00:18:03 +0000
3       2017-07-30 15:58:51 +0000
4       2017-07-29 16:00:24 +0000
                  ...            
2351    2015-11-16 00:24:50 +0000
2352    2015-11-16 00:04:52 +0000
2353    2015-11-15 23:21:54 +0000
2354    2015-11-15 23:05:30 +0000
2355    2015-11-15 22:32:08 +0000
Name: timestamp, Length: 2356, dtype: object

In [39]:
we_rd[['doggo', 'puppo', 'pupper', 'floofer']].head()

| | doggo | puppo | pupper | floofer |
|---|---|---|---|---|
| 0 | | | | |
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |


In [10]:
predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [41]:
predictions.duplicated().sum()

0
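
A result of 0 from `duplicated()` only rules out rows that are identical in every column. The same image could still appear under two different tweet IDs, for example when a tweet is retweeted. A narrower check on `jpg_url` alone might look like this sketch; the toy frame and its URLs are invented for illustration, and whether the real `predictions` table contains such repeats is not established here.

```python
import pandas as pd

# Toy stand-in for `predictions`; ids and URLs are made up.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'jpg_url': [
        'https://example.com/a.jpg',
        'https://example.com/b.jpg',
        'https://example.com/a.jpg',
    ],
})

# Rows identical in every column (what the cell above counts).
full_row_dupes = int(toy.duplicated().sum())

# Rows whose image URL has already appeared in an earlier row.
same_image_dupes = int(toy.duplicated(subset='jpg_url').sum())
```

On the real data, `predictions.duplicated(subset='jpg_url').sum()` would be the equivalent call.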

In [42]:
predictions.sample(5)

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 722 | 686003207160610816 | https://pbs.twimg.com/media/CYUsRsbWAAAUt4Y.jpg | 1 | damselfly | 0.190786 | False | common_newt | 0.098131 | False | whiptail | 0.088958 | False |
| 1052 | 714141408463036416 | https://pbs.twimg.com/media/Cekj0qwXEAAHcS6.jpg | 1 | Labrador_retriever | 0.586951 | True | golden_retriever | 0.378812 | True | redbone | 0.003605 | True |
| 1475 | 780476555013349377 | https://pbs.twimg.com/tweet_video_thumb/CtTFZZ... | 1 | pug | 0.919255 | True | French_bulldog | 0.03235 | True | bull_mastiff | 0.028468 | True |
| 926 | 702598099714314240 | https://pbs.twimg.com/media/CcAhPevW8AAoknv.jpg | 1 | kelpie | 0.219179 | True | badger | 0.133584 | False | Siamese_cat | 0.07444 | False |
| 1971 | 869227993411051520 | https://pbs.twimg.com/media/DBAePiVXcAAqHSR.jpg | 1 | Pembroke | 0.664181 | True | Chihuahua | 0.169234 | True | Cardigan | 0.1327 | True |


In [12]:
tweets_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   tweet_id   2354 non-null   int64
 1   favorites  2354 non-null   int64
 2   retweets   2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB


In [46]:
tweets_selected.retweets.notnull().sum()

2354

In [49]:
tweets_selected.describe()

| | tweet_id | favorites | retweets |
|---|---|---|---|
| count | 2354.0 | 2354.0 | 2354.0 |
| mean | 7.426978e+17 | 8080.968564 | 3164.797366 |
| std | 6.852812e+16 | 11814.771334 | 5284.770364 |
| min | 6.660209e+17 | 0.0 | 0.0 |
| 25% | 6.783975e+17 | 1415.0 | 624.5 |
| 50% | 7.194596e+17 | 3603.5 | 1473.5 |
| 75% | 7.993058e+17 | 10122.25 | 3652.0 |
| max | 8.924206e+17 | 132810.0 | 79515.0 |


In [50]:
tweets_selected[tweets_selected.retweets == 0]

| | tweet_id | favorites | retweets |
|---|---|---|---|
| 290 | 838085839343206401 | 150 | 0 |


### Issues


#### Quality
**WeRateDogs Archive**
1. The columns in_reply_to_status_id, in_reply_to_user_id, timestamp, expanded_urls, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and source are not needed for the analysis
2. The 'doggo', 'puppo', 'pupper', and 'floofer' columns store missing values as strings rather than NaN
3. Zeros appear in the rating_numerator and rating_denominator columns (the minimum of both is 0)
4. Some values in rating_numerator and rating_denominator are extremely large (maxima of 1776 and 170)
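
The rating issues in this list can be quantified with boolean masks. The sketch below runs on a toy frame with invented values rather than on `we_rd`, and the cut-off of 20 for a "large" numerator is an arbitrary assumption made only for this illustration.

```python
import pandas as pd

# Toy stand-in for we_rd; the values are made up for illustration.
ratings = pd.DataFrame({
    'rating_numerator': [13, 0, 1776, 84],
    'rating_denominator': [10, 10, 10, 0],
})

# A zero in either part of the rating.
zero_rating = (ratings.rating_numerator == 0) | (ratings.rating_denominator == 0)

# Denominators other than the usual 10.
odd_denominator = ratings.rating_denominator != 10

# "Large" is a judgment call; 20 is an arbitrary threshold for this sketch.
large_numerator = ratings.rating_numerator > 20

summary = {
    'zero_rating': int(zero_rating.sum()),
    'odd_denominator': int(odd_denominator.sum()),
    'large_numerator': int(large_numerator.sum()),
}
```

The same masks applied to `we_rd` would select the rows worth inspecting by hand, since an unusual rating may come from text that was parsed incorrectly rather than from a genuine rating.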


**Image Predictions**
1. Inconsistent capitalization of the predicted names in p1, p2, and p3 (e.g. Labrador_retriever vs. golden_retriever)
2. May contain rows for retweets
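
One way to make the predicted names in p1, p2, and p3 consistent is to lower-case them and replace underscores with spaces. This is only a sketch of one possible convention; the helper name `normalize_breed` is hypothetical, and the sample labels are taken from the `predictions.sample(5)` output above.

```python
def normalize_breed(name):
    """Lower-case a predicted label and replace underscores with spaces."""
    return name.replace('_', ' ').lower()


labels = ['Labrador_retriever', 'golden_retriever', 'French_bulldog']
cleaned = [normalize_breed(label) for label in labels]
```

On the dataframe itself, the helper could be applied per column, for example `predictions['p1'].map(normalize_breed)`.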

**Tweets from API**
1. May contain rows for retweets

#### Tidiness
**WeRateDogs Archive**
1. One variable, dog stage, is spread across four columns (doggo, floofer, pupper, puppo)
2. The text column contains more than one variable
3. Some tweets are retweets or replies (a non-null retweeted_status_id marks a retweet; a non-null in_reply_to_status_id marks a reply)
4. The rating is split across two columns (rating_numerator and rating_denominator)
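
Two of these tidiness issues could be addressed along the lines of the sketch below, which keeps only original tweets and collapses the four stage columns into one. It runs on a toy frame with invented rows. It assumes that a missing stage is stored as the string 'None', which is an assumption about the archive rather than something shown above, and the column name `dog_stage` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for we_rd; assumes missing stages are the string 'None'.
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4, 5],
    'retweeted_status_id': [np.nan, 9.0, np.nan, np.nan, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 8.0, np.nan, np.nan],
    'doggo': ['doggo', 'None', 'None', 'None', 'None'],
    'floofer': ['None', 'None', 'None', 'None', 'None'],
    'pupper': ['None', 'None', 'None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None', 'None', 'None'],
})

# Keep original tweets only: neither retweets nor replies.
originals = archive[
    archive.retweeted_status_id.isna() & archive.in_reply_to_status_id.isna()
].copy()

# Turn the 'None' placeholders into NaN, then join whatever stages remain.
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
stages = originals[stage_cols]
stages = stages.where(stages != 'None')
originals['dog_stage'] = stages.apply(
    lambda row: ','.join(row.dropna()) or np.nan, axis=1
)
originals = originals.drop(columns=stage_cols)
```

Joining with a comma keeps tweets that mention more than one stage visible instead of silently picking one of them.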

<a id='clean'></a>
## Data Cleaning

<a id='analysis'></a>
## Data Analysis

## Links

1. To read the JSON file and save its contents to a DataFrame, I used this [answer](https://knowledge.udacity.com/questions/68700#68752) on Udacity Knowledge, because the Stack Abuse article was not helpful.