# WeRateDogs Twitter Archive Analysis


## Introduction

This project uses the [Twitter](https://twitter.com/) API and the #WeRateDogs Twitter archive. It focuses on gathering and cleaning the collected data so that insights can be drawn from it through data analysis.


## Table of Contents

1. <a href='#gather'>Data Gathering</a>
2. <a href='#assess'>Assessment</a>
3. <a href='#clean'>Data Cleaning</a>
4. <a href='#link'>Links</a>

<a id='gather'></a>
## Data Gathering

In [1]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tweepy
import json
import requests
import os
from tweepy import OAuthHandler
from timeit import default_timer as timer

%matplotlib inline

In [2]:
# Read In WeRateDogs Twitter archive as we_rd
we_rd = pd.read_csv('twitter-archive-enhanced.csv')

Download the tweet image predictions, which were generated by a neural network.

In [3]:
# First, create a folder to store the image predictions file
folder_name = 'image_predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [4]:
# Send a request to the necessary URL
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [5]:
# Save the requests response to a .tsv file
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [6]:
# Read image-predictions.tsv into a dataframe ('\t' marks tab-separated columns)
predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')
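
The save-then-read steps above can be sketched without network access. The bytes below are a made-up stand-in for `response.content`, and the URL and temporary directory are assumptions used only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Stand-in for response.content: a made-up, two-row TSV payload
fake_content = b"tweet_id\tp1\tp1_conf\n1\tpug\t0.9\n2\tcollie\t0.8\n"
url = 'https://example.com/files/image-predictions.tsv'  # hypothetical URL

with tempfile.TemporaryDirectory() as tmp_dir:
    folder_name = os.path.join(tmp_dir, 'image_predictions')
    os.makedirs(folder_name, exist_ok=True)

    # The file name is the last path segment of the URL
    file_path = os.path.join(folder_name, url.split('/')[-1])
    with open(file_path, mode='wb') as file:
        file.write(fake_content)

    # '\t' tells pandas that the columns are tab-separated
    sample = pd.read_csv(file_path, sep='\t')

print(sample.shape)
```

Because the file name comes from the URL, the path passed to `pd.read_csv` has to use the same folder and file name that were used when saving.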

    Note for the instructor: I wanted to do the next step on my own, so I sent my application to Twitter, but as of now, I still have not heard from them. That's why I had to use the ready-made tweet-json.txt. 

In [None]:
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# The instructor's original code read the tweet IDs from a DataFrame named df_1.
# In this notebook the twitter-archive-enhanced.csv data is loaded as we_rd, so
# we_rd is used below
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = we_rd.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            # TweepError is the Tweepy 3.x name; Tweepy 4.x renamed it to
            # tweepy.errors.TweepyException
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)

In [7]:
# Read the tweet-json.txt file line by line and append the contents to an empty
# list
selected_attr = []
with open('tweet-json.txt', 'r') as json_file:
    for line in json_file:
        json_data = json.loads(line)
        selected_attr.append({
            'tweet_id': json_data['id'],
            'favorites': json_data['favorite_count'],
            'retweets': json_data['retweet_count'],
        })

In [8]:
# Create a dataframe from the list containing tweets data
tweets_selected = pd.DataFrame(selected_attr,
                               columns=['tweet_id', 'favorites', 'retweets'])
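
The line-by-line parsing above can be tried on a small in-memory sample. This is a minimal sketch: the two JSON lines are made up, and only the three keys read by the notebook are assumed to exist in each record.

```python
import io
import json

# Two made-up lines standing in for the contents of tweet-json.txt
sample_file = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 3, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 25, "retweet_count": 7, "lang": "en"}\n'
)

rows = []
for line in sample_file:
    line = line.strip()
    if not line:  # skip blank lines rather than failing in json.loads
        continue
    record = json.loads(line)
    # Keep only the attributes needed for the analysis
    rows.append({
        'tweet_id': record['id'],
        'favorites': record['favorite_count'],
        'retweets': record['retweet_count'],
    })

print(rows)
```

Each line of the file is an independent JSON document, which is why `json.loads` is called once per line instead of once for the whole file.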

<a id='assess'></a>
## Assessing

In [9]:
we_rd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [10]:
we_rd.tweet_id[:3]

0    892420643555336193
1    892177421306343426
2    891815181378084864
Name: tweet_id, dtype: int64

In [11]:
we_rd.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [12]:
we_rd[we_rd.in_reply_to_status_id.notnull()].head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,


In [13]:
we_rd.timestamp

0       2017-08-01 16:23:56 +0000
1       2017-08-01 00:17:27 +0000
2       2017-07-31 00:18:03 +0000
3       2017-07-30 15:58:51 +0000
4       2017-07-29 16:00:24 +0000
                  ...            
2351    2015-11-16 00:24:50 +0000
2352    2015-11-16 00:04:52 +0000
2353    2015-11-15 23:21:54 +0000
2354    2015-11-15 23:05:30 +0000
2355    2015-11-15 22:32:08 +0000
Name: timestamp, Length: 2356, dtype: object

In [14]:
we_rd[['doggo', 'puppo', 'pupper', 'floofer']].head()

Unnamed: 0,doggo,puppo,pupper,floofer
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,


In [15]:
predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [16]:
predictions.duplicated().sum()

0

In [17]:
predictions.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1162,734912297295085568,https://pbs.twimg.com/media/CjLuzPvUoAAbU5k.jpg,1,Maltese_dog,0.847292,True,feather_boa,0.059379,False,Old_English_sheepdog,0.052758,True
191,669564461267722241,https://pbs.twimg.com/media/CUrFUvDVAAA9H-F.jpg,1,toy_poodle,0.623685,True,miniature_poodle,0.25992,True,standard_poodle,0.08253,True
916,701545186879471618,https://pbs.twimg.com/media/CbxjnyOWAAAWLUH.jpg,1,Border_collie,0.280893,True,Cardigan,0.11255,True,toy_terrier,0.053317,True
826,693280720173801472,https://pbs.twimg.com/media/CZ8HIsGWIAA9eXX.jpg,1,Labrador_retriever,0.340008,True,bull_mastiff,0.175316,True,box_turtle,0.164337,False
1863,842846295480000512,https://pbs.twimg.com/media/C7JkO0rX0AErh7X.jpg,1,Labrador_retriever,0.461076,True,golden_retriever,0.154946,True,Chihuahua,0.110249,True


In [18]:
tweets_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   tweet_id   2354 non-null   int64
 1   favorites  2354 non-null   int64
 2   retweets   2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB


In [19]:
tweets_selected.retweets.notnull().sum()

2354

In [20]:
tweets_selected.describe()

Unnamed: 0,tweet_id,favorites,retweets
count,2354.0,2354.0,2354.0
mean,7.426978e+17,8080.968564,3164.797366
std,6.852812e+16,11814.771334,5284.770364
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,1415.0,624.5
50%,7.194596e+17,3603.5,1473.5
75%,7.993058e+17,10122.25,3652.0
max,8.924206e+17,132810.0,79515.0


In [21]:
tweets_selected[tweets_selected.retweets == 0]

Unnamed: 0,tweet_id,favorites,retweets
290,838085839343206401,150,0


### Issues


#### Quality
**WeRateDogs Archive**
1. 'doggo' column has "None" strings instead of NaN values.
2. 'puppo' column has "None" strings instead of NaN values.
3. 'pupper' column has "None" strings instead of NaN values.
4. 'floofer' column has "None" strings instead of NaN values.
5. Zeros in the numerator and denominator columns
6. Extremely large values in the numerator and denominator columns


**Image Predictions**
1. Inconsistent capitalization of the names in p1, p2, p3

#### Tidiness
**WeRateDogs Archive**
1. One variable (dog stage) is spread across four columns
2. Text column contains more than one variable
3. Some tweets are retweets or replies (a non-null retweeted_status_id marks a retweet, and a non-null in_reply_to_status_id marks a reply)
4. Ratings are given in two columns.
5. in_reply_to_status_id, in_reply_to_user_id, timestamp, expanded_urls, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, source columns are not needed

**Image Predictions**
1. Might contain retweet information

**Tweets from API**
1. Might contain retweet information

<a id='clean'></a>
## Data Cleaning

### Cleaning for quality

**WeRateDogs Archive** 

#### Define
- 'doggo', 'puppo', 'pupper', 'floofer' columns have "None" strings instead of NaN values.

#### Code

In [22]:
# Apply an anonymous function to the 4 columns above to convert "None" strings to NaN
we_rd['doggo'] = we_rd['doggo'].apply(lambda x: np.nan if x == 'None' else x)
we_rd['puppo'] = we_rd['puppo'].apply(lambda x: np.nan if x == 'None' else x)
we_rd['pupper'] = we_rd['pupper'].apply(lambda x: np.nan if x == 'None' else x)
we_rd['floofer'] = we_rd['floofer'].apply(lambda x: np.nan if x == 'None' else x)
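
The same conversion can be sketched for all four columns at once. The rows below are made up, and `DataFrame.mask` is a swapped-in alternative to the per-column `apply` calls above, not the method used in this notebook.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows that mimic how the archive stores missing stages as "None"
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'floofer', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['puppo', 'None', 'None'],
})

# mask() puts NaN wherever the condition is True and keeps other values
toy[stage_cols] = toy[stage_cols].mask(toy[stage_cols] == 'None')

print(toy.isnull().sum())
```

After the conversion, `isnull()` counts the missing stages, which it could not do while they were stored as strings.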

#### Test

In [23]:
# This should create True values for null columns
# Earlier observations of this column did not have null values because of the strings
we_rd[['doggo', 'puppo', 'pupper', 'floofer']].isnull()

Unnamed: 0,doggo,puppo,pupper,floofer
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
...,...,...,...,...
2351,True,True,True,True
2352,True,True,True,True
2353,True,True,True,True
2354,True,True,True,True


#### Define
- 0s in numerator and denominator column

#### Code

In [24]:
# Get the ratings columns with 0 values
nulls = we_rd[(we_rd.rating_numerator == 0) | (we_rd.rating_denominator == 0)].copy()
nulls

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,
315,835152434251116546,,,2017-02-24 15:40:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you're so blinded by your systematic plag...,,,,https://twitter.com/dog_rates/status/835152434...,0,10,,,,,
1016,746906459439529985,7.468859e+17,4196984000.0,2016-06-26 03:22:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...","PUPDATE: can't see any. Even if I could, I cou...",,,,https://twitter.com/dog_rates/status/746906459...,0,10,,,,,


As there are only 3 such records, I will drop them.

In [25]:
we_rd.drop(nulls.index, axis=0, inplace=True)

#### Test

In [26]:
we_rd[(we_rd.rating_numerator == 0) | (we_rd.rating_denominator == 0)]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


#### Define
- Extremely large values for numerator and denominator columns

#### Code

I checked the original WeRateDogs Twitter account and found that its highest rating ever was 15/10. I will therefore cap numerators above 15 at 15 and denominators above 10 at 10.

In [27]:
we_rd['rating_numerator'] = we_rd['rating_numerator'].apply(
                                        lambda x: 15 if x > 15 else x)
we_rd['rating_denominator'] = we_rd['rating_denominator'].apply(
                                        lambda x: 10 if x > 10 else x)
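
The capping step can also be written with `Series.clip`. This is a minimal sketch on made-up values, shown as an alternative to the `apply` calls above.

```python
import pandas as pd

# Made-up ratings, including out-of-range values like those seen in describe()
toy = pd.DataFrame({
    'rating_numerator':   [12, 1776, 9, 15],
    'rating_denominator': [10, 10, 170, 7],
})

# clip(upper=...) replaces every value above the bound with the bound itself
toy['rating_numerator'] = toy['rating_numerator'].clip(upper=15)
toy['rating_denominator'] = toy['rating_denominator'].clip(upper=10)

print(toy)
```

Only an upper bound is applied, so values below it stay unchanged. The denominator of 7 in the sample is left as it is.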

#### Test

In [28]:
we_rd.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2353.0,76.0,76.0,181.0,181.0,2353.0,2353.0
mean,7.426913e+17,7.44309e+17,2.067175e+16,7.7204e+17,1.241698e+16,10.746281,9.995325
std,6.85577e+16,7.611756e+16,1.268953e+17,6.236928e+16,9.599254e+16,2.204426,0.176112
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,1.0,2.0
25%,6.783968e+17,6.756548e+17,342194300.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.193678e+17,7.031489e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.992971e+17,8.241444e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,15.0,10.0


**Image Predictions**

#### Define
- Inconsistent capitalization of the names in p1, p2, p3
 
#### Code

In [29]:
# Convert all the values in p1, p2, p3 columns to lowercase
predictions['p1'] = predictions['p1'].str.lower()
predictions['p2'] = predictions['p2'].str.lower()
predictions['p3'] = predictions['p3'].str.lower()

#### Test

In [30]:
predictions.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
202,669683899023405056,https://pbs.twimg.com/media/CUsx8q_WUAA-m4k.jpg,1,pomeranian,0.998275,True,chihuahua,0.000605,True,pekinese,0.000516,True
1676,813172488309972993,https://pbs.twimg.com/media/C0j4EESUsAABtMq.jpg,1,doormat,0.954844,False,golden_retriever,0.026193,True,cocker_spaniel,0.004386,True
590,679148763231985668,https://pbs.twimg.com/media/CWzSMmAWsAAyB1u.jpg,1,italian_greyhound,0.302685,True,hair_slide,0.124281,False,afghan_hound,0.059846,True
937,703611486317502464,https://pbs.twimg.com/media/CcO66OjXEAASXmH.jpg,1,pembroke,0.756441,True,basenji,0.126621,True,cardigan,0.080117,True
1463,778396591732486144,https://pbs.twimg.com/media/CcG07BYW0AErrC9.jpg,1,hippopotamus,0.581403,False,doormat,0.152445,False,sea_lion,0.026364,False
1918,855459453768019968,https://pbs.twimg.com/media/C98z1ZAXsAEIFFn.jpg,2,blenheim_spaniel,0.389513,True,pekinese,0.18822,True,japanese_spaniel,0.082628,True
1564,793614319594401792,https://pbs.twimg.com/media/CvyVxQRWEAAdSZS.jpg,1,golden_retriever,0.705092,True,labrador_retriever,0.219721,True,kuvasz,0.015965,True
1424,772152991789019136,https://pbs.twimg.com/media/Crc9DEoWEAE7RLH.jpg,2,golden_retriever,0.275318,True,irish_setter,0.100988,True,vizsla,0.073525,True
224,670319130621435904,https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg,1,irish_terrier,0.254856,True,briard,0.227716,True,soft-coated_wheaten_terrier,0.223263,True
1952,863553081350529029,https://pbs.twimg.com/ext_tw_video_thumb/86355...,1,eskimo_dog,0.41333,True,malamute,0.347646,True,siberian_husky,0.149536,True


### Cleaning for Tidiness

**WeRateDogs Archive**

#### Define
- Too many columns for a single variable > dog stages

#### Code

Generally, to gather values that belong to a single variable from different columns, I would use the `pandas.melt()` function. However, because some rows have more than one dog stage, I will do it in a custom way.

In [31]:
# First add all of the 4 columns as string because they also contain NaN values
we_rd['dog_stage'] = we_rd['doggo'].apply(str) + we_rd['floofer'].apply(str) \
                    + we_rd['pupper'].apply(str) + we_rd['puppo'].apply(str)

In [32]:
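# Remove the "nan" substrings left by the missing values. str.strip('nan') would
# strip the characters 'n' and 'a' from both ends rather than the substring
we_rd['dog_stage'] = we_rd['dog_stage'].str.replace('nan', '', regex=False)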

In [33]:
# Write a custom function
def convert_dog_stages(value):
    """
    Purpose: check whether the combined value is empty, a single dog stage or
    several stages joined together, and return NaN, the stage name or
    'multiple' accordingly.
    """
    dog_stages = ['doggo', 'floofer', 'puppo', 'pupper']

    if value == '':
        return np.nan
    elif value not in dog_stages:
        return 'multiple'
    else:
        return value

In [34]:
# Apply the above function to the dog stage column
we_rd['dog_stage'] = we_rd['dog_stage'].apply(convert_dog_stages)
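
The same three-way outcome (NaN, a single stage, or "multiple") can be derived without string concatenation by counting the non-null stage values in each row. This is a sketch on made-up rows; `combine_stages` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows: one stage, one stage, two stages, no stage
toy = pd.DataFrame({
    'doggo':   ['doggo', np.nan, 'doggo', np.nan],
    'floofer': [np.nan, np.nan, np.nan, np.nan],
    'pupper':  [np.nan, 'pupper', 'pupper', np.nan],
    'puppo':   [np.nan, np.nan, np.nan, np.nan],
})


def combine_stages(row):
    """Return NaN, the single stage name, or 'multiple' for one row."""
    found = [row[col] for col in stage_cols if pd.notna(row[col])]
    if not found:
        return np.nan
    if len(found) > 1:
        return 'multiple'
    return found[0]


toy['dog_stage'] = toy[stage_cols].apply(combine_stages, axis=1)
print(toy['dog_stage'])
```

Working from the original NaN values avoids depending on how the missing values are rendered as text.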

#### Test

In [35]:
we_rd.dog_stage.notnull().sum()

380

In [36]:
we_rd.dog_stage.value_counts()

pupper      245
doggo        83
puppo        29
multiple     14
floofer       9
Name: dog_stage, dtype: int64

Even though some rows now have the value "multiple", it does not make sense for a dog to be in several stages at once, so I will drop those rows. The four original stage columns are no longer needed, so I will drop them too.

In [37]:
we_rd = we_rd[we_rd.dog_stage != 'multiple'].copy()
we_rd.drop(['puppo', 'pupper', 'doggo', 'floofer'], axis=1, inplace=True)

In [38]:
# Test
we_rd.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_stage
1514,691090071332753408,,,2016-01-24 02:48:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Happy Saturday here's a dog in a mailbox. 12/1...,,,,https://twitter.com/dog_rates/status/691090071...,12,10,,
544,805932879469572096,,,2016-12-06 00:32:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Major. He put on a tie for his first r...,,,,https://twitter.com/dog_rates/status/805932879...,12,10,Major,
2139,670037189829525505,,,2015-11-27 00:31:29 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Awesome dog here. Not sure where it is tho. Sp...,,,,https://twitter.com/dog_rates/status/670037189...,5,10,,
1232,713175907180089344,,,2016-03-25 01:29:21 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Opie and Clarkus. Clarkus fell as...,,,,https://twitter.com/dog_rates/status/713175907...,10,10,Opie,
267,841680585030541313,,,2017-03-14 16:01:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Penny. She's a dragon slayer. Feared b...,,,,https://twitter.com/dog_rates/status/841680585...,12,10,Penny,


#### Define
- Some tweets are retweets or replies (a non-null retweeted_status_id marks a retweet, and a non-null in_reply_to_status_id marks a reply)

#### Code

In [39]:
# Filter the dataframe for notnull retweet status_ids
retweeted = we_rd[we_rd.retweeted_status_id.notnull()].copy()
# Drop the rows with the indexes in retweeted
we_rd.drop(retweeted.index, axis=0, inplace=True)

In [40]:
# Now filter the dataframe for notnull in_reply tweets
in_reply = we_rd[we_rd.in_reply_to_status_id.notnull()].copy()
# Drop the rows with the indexes in in_reply
we_rd.drop(in_reply.index, axis=0, inplace=True)
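
The two filtering steps can be combined into a single boolean mask. This is a minimal sketch on made-up rows; the column names match the archive, while `toy` and `originals` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows: an original tweet, a retweet, a reply, another original
toy = pd.DataFrame({
    'tweet_id': [101, 102, 103, 104],
    'retweeted_status_id': [np.nan, 9.0e17, np.nan, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 8.0e17, np.nan],
})

is_retweet = toy['retweeted_status_id'].notnull()
is_reply = toy['in_reply_to_status_id'].notnull()

# Keep only rows that are neither retweets nor replies
originals = toy[~is_retweet & ~is_reply].copy()
print(originals['tweet_id'].tolist())
```

The notebook keeps the intermediate `retweeted` and `in_reply` frames because they are reused later, so the mask form is an alternative only where those frames are not needed.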

#### Test

In [41]:
we_rd.retweeted_status_id.value_counts(), we_rd.in_reply_to_status_id.value_counts()

(Series([], Name: retweeted_status_id, dtype: int64),
 Series([], Name: in_reply_to_status_id, dtype: int64))

#### Define
- Ratings are given in two columns.

#### Code

In [42]:
# As the ratings are given in two columns, I will merge the columns to create a
# new column which represents ratings in "13/10" format as strings
we_rd['rating'] = we_rd.rating_numerator.apply(str) + '/' + we_rd.rating_denominator.apply(str)

#### Test

In [43]:
we_rd.sample().rating

1484    9/10
Name: rating, dtype: object

In [44]:
# Since we don't need the numerator and denominator columns, I will drop them
we_rd.drop(['rating_numerator', 'rating_denominator'], axis=1, inplace=True)

#### Define 
- in_reply_to_status_id, in_reply_to_user_id, timestamp, expanded_urls, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, source, name columns are not needed

#### Code

In [45]:
# Drop all the columns from above
we_rd.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 
            'expanded_urls', 'retweeted_status_id', 'retweeted_status_user_id',
            'retweeted_status_timestamp', 'source', 'name'], axis=1, inplace=True)

#### Test

In [46]:
we_rd.columns

Index(['tweet_id', 'text', 'dog_stage', 'rating'], dtype='object')

#### Define

- Image Predictions and Tweets from API might contain retweet information

#### Code

In [47]:
# Drop retweets and replies from the predictions dataframe using the retweeted
# and in_reply variables defined earlier. Match on tweet_id values rather than
# on index labels, which are row positions in the archive
all_retweets = set(retweeted.tweet_id) | set(in_reply.tweet_id)
new_retweet_indexes = predictions[predictions.tweet_id.isin(all_retweets)].index
predictions.drop(new_retweet_indexes, axis=0, inplace=True)

In [48]:
# Drop retweets and replies from the tweets_selected dataframe in the same way,
# filtering on its own tweet_id column
new_retweet_indexes = tweets_selected[tweets_selected.tweet_id.isin(all_retweets)].index
tweets_selected.drop(new_retweet_indexes, axis=0, inplace=True)

#### Test

In [49]:
print(predictions[predictions.tweet_id.isin(all_retweets)])
print(tweets_selected[tweets_selected.tweet_id.isin(all_retweets)])

Empty DataFrame
Columns: [tweet_id, jpg_url, img_num, p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog]
Index: []
Empty DataFrame
Columns: [tweet_id, favorites, retweets]
Index: []
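
When rows removed from one table also need to be removed from another, the match has to be made on `tweet_id` values, because the index labels of the first table are only row positions. This is a minimal sketch on made-up data; `removed`, `other` and `filtered` are hypothetical names.

```python
import pandas as pd

# Made-up archive rows identified as retweets; the index labels (5, 9) are
# row positions in the archive, not tweet IDs
removed = pd.DataFrame({'tweet_id': [222, 444]}, index=[5, 9])

# Made-up second table keyed by tweet_id
other = pd.DataFrame({
    'tweet_id': [111, 222, 333, 444],
    'favorites': [10, 20, 30, 40],
})

# Compare tweet_id values on both sides
removed_ids = set(removed['tweet_id'])
filtered = other[~other['tweet_id'].isin(removed_ids)].copy()

print(filtered['tweet_id'].tolist())
```

Comparing against `removed.index` instead would match nothing in this sample, and the check would still print an empty frame, so an empty result alone does not prove the rows were removed.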


In [50]:
# Save the image predictions dataframe to a new .csv file
predictions.to_csv('image_predictions_master.csv', index=False)

In [51]:
# Join WeRateDogs and tweets_selected dataframes into a single one to perform
# exploratory analyses later
master_df = pd.merge(we_rd, tweets_selected, on='tweet_id')
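
`pd.merge` performs an inner join by default, so only tweet IDs present in both frames survive. This is a minimal sketch on made-up data; the `validate` argument is an optional extra check, not something the notebook uses.

```python
import pandas as pd

# Made-up stand-ins for the cleaned archive and the API counts
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'rating': ['12/10', '13/10', '9/10'],
})
counts = pd.DataFrame({
    'tweet_id': [2, 3, 4],
    'favorites': [100, 50, 75],
    'retweets': [20, 5, 8],
})

# validate='one_to_one' raises an error if tweet_id is duplicated on either side
merged = pd.merge(archive, counts, on='tweet_id', how='inner',
                  validate='one_to_one')

print(merged)
```

Tweet 1 and tweet 4 each appear in only one frame, so neither reaches the merged result.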

In [52]:
# Save master_df to a new .csv file
master_df.to_csv('master_tweets.csv', index=False)

<a id='link'></a>
## Links

1. To read the json file and save its content to a DataFrame, I have used this [answer](https://knowledge.udacity.com/questions/68700#68752) on Knowledge because the article from Stack Abuse was not helpful
2. Image Predictions was downloaded using this [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)

Note: I will perform the visualisations and data analysis in a separate notebook.