# WeRateDogs - Query Twitter Data
<ul>
<li><a href="#intro">I. - Introduction</a></li>
<li><a href="#gathering">II. - Data Gathering</a></li>
<li><a href="#assessment">III. - Data Assessment</a></li>
<li><a href="#wrangling">IV. - Data Wrangling</a></li>
<li><a href="#eda">V. - Explanatory Analysis</a></li>
<li><a href="#conclusions">VI. - Conclusions</a></li>
<li><a href="#references">VII. - References</a></li>
</ul>

<a id='intro'></a>
## I. - Introduction

During my Udacity "Data Analyst" Nanodegree I analysed tweets from [WeRateDogs®](https://twitter.com/dog_rates?lang=eng). WeRateDogs® shows off dog pictures in all variations and considers itself the only source for professional dog ratings. As of March 2020, over 8.7 million Twitter accounts follow the supplier of cute doggo pictures.

This report aims to answer simple and important questions of online marketing: 
##### 1. When is the best time for a tweet?
To be precise, we want to analyze whether tweets posted on weekends are more popular than tweets posted on workdays. The same analysis will be done for hours. Since "popularity" is hard to quantify directly, the number of retweets and the number of favorites are used as proxies.

##### 2. Do some breeds outperform others?
Do some breeds receive significantly higher popularity and, if so, which breed is the most popular?

<a id='gathering'></a>
## II. - Data Gathering

#### II. A.) Importing Packages
The most important packages were imported, including Pandas, NumPy and Matplotlib. In addition, the packages Tweepy, Requests and JSON are needed to query data from Twitter. Datetime, Random and Image are optional for the project itself.

In [None]:
from IPython.display import Image

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import datetime
import random

import tweepy
import requests
import json

#### II. B.) Loading CSV Data
Afterwards, the CSV was loaded into the notebook and I had a first look at the data.

In [None]:
df = pd.read_csv('twitter-archive-enhanced.csv', index_col=['tweet_id'], parse_dates=['timestamp','retweeted_status_timestamp'])
df.head()

Some column names aren't intuitive. Hence, we check their meanings according to the [data dictionary](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).

**The key takeaways are:**

- in_reply_to_status_id and in_reply_to_user_id are filled when the tweet was a reply to another tweet.
- retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp represent the original tweet data in case of a retweet.
- doggo, floofer, pupper and puppo have been added by the Udacity team but are messy.
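
The first two takeaways mean that replies and retweets can be recognised by a filled ID column. The following is a minimal sketch of that idea on a toy frame; the IDs and the three-row frame are made up for illustration, only the column names come from the archive.

```python
import numpy as np
import pandas as pd

# Toy archive with hypothetical IDs; only the column names come from the dataset
archive = pd.DataFrame(
    {
        "in_reply_to_status_id": [np.nan, 1.0e17, np.nan],
        "retweeted_status_id": [np.nan, np.nan, 2.0e17],
    },
    index=pd.Index([101, 102, 103], name="tweet_id"),
)

# A filled ID marks the tweet as a reply or as a retweet
is_reply = archive["in_reply_to_status_id"].notna()
is_retweet = archive["retweeted_status_id"].notna()

# Original tweets are neither replies nor retweets
originals = archive[~is_reply & ~is_retweet]
print(originals.index.tolist())  # [101]
```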

#### II. C.) Query Twitter Data

In the next steps, the Twitter API was used to gather additional information about each of the given tweets. Tweepy was used to query the tweet data and write each tweet's data into a JSON file. The correct tweets are identified by their respective tweet ID.

In [None]:
access_token = 'qqq'
access_secret = 'qqq'
account_key = 'qqq'
account_secret = 'qqq'

auth = tweepy.OAuthHandler(account_key, account_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

try:
    api.verify_credentials()
    print("Authentication successful")
except tweepy.TweepError:
    print("Authentication error")

In the next code cell, tweet data is gathered with Tweepy. In case of an error, the respective tweet ID and the error message are written into a CSV. In addition, the datetime package was used to measure the runtime of the code cell.

In [None]:
start_date = datetime.datetime.today()

tweet_data = []
tweep_errors = []

id_list = list(df.index)

for x in id_list:    
    try:
        tweet_data.append(api.get_status(x, tweet_mode='extended')._json)
    except tweepy.TweepError as error:
        print('Unable to query:', str(x))
        tweep_errors.append({'id':x, 'error':error})
        
# Share of failed queries relative to all requested IDs, in percent
print('Summary: '+str(len(tweep_errors))+' Missing Tweets ('+str(round(len(tweep_errors)/len(id_list)*100,2))+'%)')

with open('tweet_json.txt', 'w') as x:
    json.dump(tweet_data, x)
    
pd.DataFrame(tweep_errors).to_csv('tweet_errors.csv', index=False)
    
print(str('Duration: '+str(datetime.datetime.today() - start_date)))

Unfortunately, 25 queries failed, i.e. these tweets will be missing when the datasets are merged, which needs further investigation.

**Before the tweet data can be used, it was necessary to flatten the JSON file**, since tweet objects are stored as nested JSON. The respective code snippet was taken from [TowardsDataScience](https://towardsdatascience.com/how-to-flatten-deeply-nested-json-objects-in-non-recursive-elegant-python-55f96533103d).

In [None]:
with open('tweet_json.txt', 'r') as x:
    load_data = json.load(x)

In [None]:
def flatten_json(x):
    out = {}

    def flatten(y, name=''):
        if type(y) is dict:
            for z in y:
                flatten(y[z], name + z + '_')
        elif type(y) is list:
            i = 0
            for z in y:
                flatten(z, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = y

    flatten(x)
    return out
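
To make the resulting column names easier to follow, here is a small usage sketch. The function is repeated (with the same logic) only so that the example runs on its own, and `sample_tweet` is a hypothetical, heavily shortened tweet object rather than real API output.

```python
def flatten_json(x):
    """Same logic as the function above, repeated so the example is self-contained."""
    out = {}

    def flatten(y, name=''):
        if isinstance(y, dict):
            for key in y:
                flatten(y[key], name + key + '_')
        elif isinstance(y, list):
            for i, item in enumerate(y):
                flatten(item, name + str(i) + '_')
        else:
            # Strip the trailing underscore from the accumulated key
            out[name[:-1]] = y

    flatten(x)
    return out


# Hypothetical, shortened tweet object (not real API output)
sample_tweet = {
    'id': 1,
    'user': {'followers_count': 10},
    'entities': {'media': [{'url': 'a'}, {'url': 'b'}]},
}

flat = flatten_json(sample_tweet)
print(flat)
# {'id': 1, 'user_followers_count': 10, 'entities_media_0_url': 'a', 'entities_media_1_url': 'b'}
```

Nested keys are joined with underscores and list positions become part of the key, which is why a column such as extended_entities_media_0_media_url_https appears after flattening.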

In [None]:
for x in range(len(load_data)):
    load_data[x] = flatten_json(load_data[x])
    
df_add = pd.DataFrame(load_data).set_index('id')
df_add.head()

In [None]:
df_add.shape

Please note that the difference in rows between the two datasets is explained by the 25 errors.

Since there are 911 columns currently, columns that are not needed for analysis will be dropped.

In [None]:
columns = ['full_text','extended_entities_media_0_media_url_https','source','in_reply_to_status_id','in_reply_to_user_id','user_followers_count','user_favourites_count','retweet_count','favorite_count','retweeted','is_quote_status']

# Select the needed columns and rename in one step (avoids modifying a slice in place)
df_add = df_add[columns].rename(columns={'full_text':'text','extended_entities_media_0_media_url_https':'expanded_urls'})

df_add.head()

#### II. D.) Breed Prediction Data

The third source for this project is the image-predictions dataset provided by Udacity as a TSV file.

In [None]:
df_pred = pd.read_csv('image-predictions.tsv', sep='\t', index_col='tweet_id')
df_pred.head()

In [None]:
df_pred.shape

The dataset looks clean so far, thus it can be joined to the rest of the data. However, the breed prediction dataset has fewer observations than the other datasets. Therefore, we use a left join when merging so that missing predictions and other errors remain visible.

#### II. E.) Merging Data

All three datasets were merged with pandas left joins. A left join was used because we don't want to lose any information; an inner join, for example, would drop the 25 error tweets. The same is true for the dog breed prediction data. Nevertheless, missing values have to be imputed in the data wrangling chapter.
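
The difference between the two join types can be illustrated on toy data. This is only a sketch: the frames `archive` and `api_data` and their values are invented, with tweet 2 standing in for a tweet that could not be queried.

```python
import pandas as pd

# Hypothetical archive with three tweets
archive = pd.DataFrame(
    {"text": ["a", "b", "c"]},
    index=pd.Index([1, 2, 3], name="tweet_id"),
)

# Hypothetical API result: tweet 2 is missing
api_data = pd.DataFrame(
    {"retweet_count": [10, 30]},
    index=pd.Index([1, 3], name="tweet_id"),
)

left = archive.join(api_data, how="left")    # keeps all archive rows
inner = archive.join(api_data, how="inner")  # keeps only matching rows

print(len(left), len(inner))                 # 3 2
print(left["retweet_count"].isna().sum())    # 1
```

With the left join the unmatched tweet stays in the data and shows up as a missing value, which is what makes the later quality checks possible.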

In [None]:
df_combined_pre = df.join(df_add, how='left', rsuffix='_add')
df_combined_pre.head()

In [None]:
df_combined_pre_c = df_combined_pre.copy()

df_combined = df_combined_pre_c.join(df_pred, how='left', rsuffix='_pred')
df_combined.head()

In [None]:
df_combined.info()

So far, there seems to be no reason for major concern. Nevertheless, data quality checks are performed in the next step.

#### II. F.) Data Quality Checks

Last but not least, we checked the quality of the merged dataset by comparing duplicated columns and checking our errors.

In [None]:
df_errors = pd.read_csv('tweet_errors.csv')
df_errors

A check of the errors revealed that almost all errors occurred due to code 144, i.e. the tweets have been deleted. Only one tweet raised an error because my account is not authorized to see the status. Overall, the number of missing tweets is immaterial and will not bias the following analysis.

To check the quality of our first join, text column was used because the column is filled with unique values.

In [None]:
df_combined[df_combined['text'] != df_combined['text_add']].head()

In [None]:
# In case of any mismatch .min() would evaluate to 0 (False)
(df_combined[df_combined['text'] != df_combined['text_add']].index ==  df_errors['id']).min()

There is a mismatch in texts, but this discrepancy can be fully explained by the 25 errors. Therefore, we conclude that our first join is correct and - more importantly - that **observations match**.

To evaluate the second join, expanded_urls_add and jpg_url were used.

In [None]:
# Select mismatches
df_combined[df_combined['expanded_urls_add'] != df_combined['jpg_url']][['expanded_urls_add','jpg_url']].head()

In [None]:
len(df_combined[df_combined['expanded_urls_add'] != df_combined['jpg_url']])

In [None]:
# Select mismatches where only 1 image exists  
df_combined[(df_combined['expanded_urls_add'] != df_combined['jpg_url'])&(df_combined['img_num']==1)][['expanded_urls_add','jpg_url']]

In [None]:
checklist = df_combined[(df_combined['expanded_urls_add'] != df_combined['jpg_url'])&(df_combined['img_num']==1)].index

# Check whether any observation in the checklist is also one of the errors
any(y in df_errors['id'].values for y in checklist)

The data suggests a mismatch of 589 rows. A closer look reveals that there are two types of mismatches. First, both URLs refer to the same tweet but to different pictures; an example is provided below. Second, "expanded_urls_add" is missing, while jpg_url isn't.

As seen before, a marginal mismatch is possible; however, 13 mismatches don't give rise to major concern. Therefore, we conclude that overall the join worked and that the number of mismatches is within a tolerable range.

In [None]:
Image(url='https://pbs.twimg.com/media/DFDw2tsUAAEw7XW.jpg', embed=True, width=260, height=260)

In [None]:
Image(url='https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg', embed=True, width=260, height=260)

<a id='assessment'></a>
## III. - Data Assessment

In [None]:
df_combined.head()

In [None]:
df_combined.info()

In [None]:
df_combined.describe()

Histograms were plotted in addition to the describe command in order to gain a better understanding of the data. The plots below show that retweet count and favorite count are highly skewed. This is not a problem per se, but it has to be considered when plotting.

In [None]:
hist_list = ['rating_numerator','rating_denominator','retweet_count','favorite_count']

for x in hist_list:
    df_combined[x].hist()
    plt.title(str(x).upper())
    plt.ylabel('Frequency')
    plt.show()
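
Skewness can also be checked numerically instead of by eye. The sketch below uses invented counts (one viral tweet among ordinary ones) and pandas' `Series.skew`; a log transform is shown as one common way to reduce the skew before plotting, which is an assumption on my side and not something done in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical retweet counts: most tweets are small, one goes viral
counts = pd.Series([120, 250, 300, 410, 520, 800, 1500, 80000])

raw_skew = counts.skew()            # strongly positive for a long right tail
log_skew = np.log1p(counts).skew()  # log(1 + x) compresses the large values

print(f"skew raw: {raw_skew:.2f}, skew after log1p: {log_skew:.2f}")
```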

In [None]:
# The index was checked instead of the row, since tweets should be unique 
df_combined.index.duplicated().sum()

A check for duplicates was done and shows a negative result, i.e. there aren't any duplicated rows.

**Summary**
1. As can be seen, in_reply_to_status_id and in_reply_to_user_id contain mostly missing values. The same is true for retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp. This makes sense, since WeRateDogs® tweets new pictures instead of replies. Retweets and replies have to be removed, since we only care about original dog ratings.
2. In addition, there are some missing URLs as well as a difference between expanded_urls (2297), expanded_urls_add (2059) and jpg_url (2075). Further investigation would be useful, although it is not mandatory because images don't play a major role in our analysis.
3. Dates are formatted as strings instead of datetime objects (Note: this has been fixed by passing "parse_dates" to the "pd.read_csv" command).
4. in_reply_to_status_id and in_reply_to_user_id are formatted as floats instead of strings. These columns are filled with numbers but these numbers are unique identifiers. Therefore, they cannot be used for calculation and will be converted to string.
5. retweet_count, favorite_count, user_followers_count and user_favourites_count are formatted as floats instead of integers. For convenience these columns will be converted to integers.
6. The source column contains full HTML tags. However, the tweet source should be something like "Web" or "Smartphone", which can be found between the HTML tags. Hence, the correct information needs to be extracted.
7. Columns doggo, pupper and puppo reflect the same information, namely the age of the dog.
8. rating_denominator can be smaller than 1 and even be 0. This is a problem for the calculation of the final rating, since division by 0 is undefined. Dividing by a number between 0 and 1 amplifies the numerator, which doesn't make sense for a rating.
9. The final rating is reflected in two columns, rating_numerator and rating_denominator. Since the final rating will be used in the analysis, we need to combine both fields, which is related to the previous problem.
10. Due to the left joins there are missing values for retweet_count, favorite_count, user_followers_count and user_favourites_count, which have to be imputed. For completeness, please note that retweeted and is_quote_status also contain missing values, but these columns are dropped anyway.
11. The same holds true for dog breed predictions. Nonetheless, it might be difficult to replace missing values.
12. To answer our first question a weekday feature must be calculated.

In [None]:
random_dog = random.choice(df_combined[df_combined.expanded_urls_add.notnull()].expanded_urls_add.values)
Image(url=random_dog, embed=True, width=260, height=260)

<a id='wrangling'></a>
## IV. - Data Wrangling

In [None]:
df_combined_c = df_combined.copy()

#### IV. 1.) Retweets and replies

Since we are only interested in original dog ratings, retweets, replies and quotes will be dropped. Retweets are another kind of duplicate, one that was not detected by our duplicate check.

In [None]:
# Keep only rows without a retweeted_status_id, i.e. drop retweets
df_combined_clean = df_combined_c.loc[df_combined_c['retweeted_status_id'].isna()].copy()
# Number of remaining retweets (should be 0)
df_combined_clean['retweeted_status_id'].notna().sum()

In [None]:
# Keep only rows that are not quote tweets (rows with a missing is_quote_status are dropped as well)
df_combined_cleaned = df_combined_clean.loc[df_combined_clean['is_quote_status']==False].copy()
df_combined_cleaned['is_quote_status'].sum()

#### IV. 2.) Missing URL's

In [None]:
# Select rows with a missing URL
df_combined_cleaned[df_combined_cleaned['expanded_urls'].isna()]
# Select rows with a missing URL that are not replies (expected to be empty)
df_combined_cleaned[(df_combined_cleaned['expanded_urls'].isna())&(df_combined_cleaned['in_reply_to_status_id'].isna())]

A quick check revealed that the tweets with missing URLs are replies.

#### IV. 4.) Formats - Floats to Strings

In [None]:
convert_strings = ['in_reply_to_status_id','in_reply_to_user_id','in_reply_to_status_id_add','in_reply_to_user_id_add','retweeted_status_id','retweeted_status_user_id']

for x in convert_strings:
    df_combined_cleaned[x] =  df_combined_cleaned[x].apply(lambda y: str(y))
    print(str(x)+': '+str(df_combined_cleaned[x].dtype))

#### IV. 5.) Formats - Floats to Integers

These columns are clearly numeric values that could be used for calculation.

At the same time, missing values were imputed with the median, since the columns are already being processed here.

In [None]:
convert_ints = ['retweet_count','favorite_count','user_followers_count','user_favourites_count']

for x in convert_ints:
    # Impute missing values with the median, then convert to integer
    df_combined_cleaned[x] = df_combined_cleaned[x].fillna(df_combined_cleaned[x].median()).astype(int)
    print(str(x)+': '+str(df_combined_cleaned[x].dtype))

#### IV. 6.) Source column

Since this column contains full HTML tags, it was decided to split the string at the angle brackets. Thus, we obtain the information between the HTML tags.

In [None]:
df_combined_cleaned['source'] =  df_combined_cleaned['source'].apply(lambda y: y.split('>')[1].split('<')[0])
df_combined_cleaned['source'].unique()
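
To show what the split does, here is a sketch on two sample anchor tags of the kind found in the source column. The regular-expression variant via `Series.str.extract` is an alternative I am adding for comparison; it is not part of the original workflow.

```python
import pandas as pd

# Sample values in the format of the source column
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Split-based extraction, same idea as in the cell above
by_split = source.apply(lambda s: s.split('>')[1].split('<')[0])

# Alternative: capture the text between the first '>' and the next '<'
by_regex = source.str.extract(r'>([^<]*)<', expand=False)

print(by_split.tolist())  # ['Twitter for iPhone', 'Twitter Web Client']
```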

#### IV. 7.) Age - doggo, puppo, pupper

As mentioned before, doggo, floofer, pupper and puppo are slang terms for dogs. A dog is mapped to the respective category by age. Hence, the true information in doggo, pupper and puppo is the same: age.

Thus, we combine the information from doggo, puppo and pupper in a new categorical column called age.

Before merging doggo, puppo and pupper, it was checked that the columns don't overlap.

In [None]:
# Select rows where pupper is true and check whether puppo is empty and vice versa
(df_combined_cleaned[df_combined_cleaned['pupper']=='pupper'].puppo != 'None').sum() == (df_combined_cleaned[df_combined_cleaned['puppo']=='puppo'].pupper != 'None').sum() == 0

In [None]:
# Select rows where pupper is true and check whether doggo is empty and vice versa
(df_combined_cleaned[df_combined_cleaned['pupper']=='pupper'].doggo != 'None').sum() == (df_combined_cleaned[df_combined_cleaned['doggo']=='doggo'].pupper != 'None').sum() == 0

In [None]:
df_combined_cleaned[(df_combined_cleaned['pupper']=='pupper')&(df_combined_cleaned['doggo']=='doggo')].head()

In [None]:
# Select rows where doggo is true and check whether puppo is empty and vice versa
(df_combined_cleaned[df_combined_cleaned['doggo']=='doggo'].puppo != 'None').sum() == (df_combined_cleaned[df_combined_cleaned['puppo']=='puppo'].doggo != 'None').sum() == 0

Since the columns overlap, merging is not possible without losing information. In addition, these columns are optional for our analysis. Hence, they were dropped in the next step.
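
The pairwise checks can also be condensed into a single count per row. The following is a sketch on invented data; it assumes, as in the cells above, that an unset stage is stored as the string 'None'.

```python
import pandas as pd

# Hypothetical rows; 'None' (a string) marks an unset stage, as in the archive
stages = pd.DataFrame({
    "doggo":  ["doggo", "None", "doggo", "None"],
    "pupper": ["None", "pupper", "pupper", "None"],
    "puppo":  ["None", "None", "None", "None"],
})

# Count how many stage columns are set in each row
n_stages = (stages != "None").sum(axis=1)

# Rows with more than one stage are the overlaps
overlap = stages[n_stages > 1]

print(n_stages.tolist())  # [1, 1, 2, 0]
print(len(overlap))       # 1
```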

In [None]:
df_combined_cleaned.drop(['doggo','pupper','puppo','floofer'], axis=1, inplace=True)

#### IV. 8.) Rating denominator

Rating is calculated in the next step. Before calculation, rating denominators were floored at 1 to prevent division by 0 and amplification in case a denominator is between 0 and 1.

In [None]:
# Build the mask from the same frame that is being modified
df_combined_cleaned.loc[df_combined_cleaned['rating_denominator']<=0, 'rating_denominator'] = 1
(df_combined_cleaned['rating_denominator']<=0).sum()
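
Flooring at 1 can alternatively be expressed with `Series.clip`. This is a small sketch with invented ratings, not the code used above.

```python
import pandas as pd

# Hypothetical ratings; the second row has a denominator of 0
ratings = pd.DataFrame({
    "rating_numerator": [12, 24, 13],
    "rating_denominator": [10, 0, 10],
})

# Every denominator below 1 is raised to 1
ratings["rating_denominator"] = ratings["rating_denominator"].clip(lower=1)

# The division is now safe
ratings["rating"] = ratings["rating_numerator"] / ratings["rating_denominator"]

print(ratings["rating"].tolist())  # [1.2, 24.0, 1.3]
```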

#### IV. 9.) Final rating

The final dog rating consists of two elements: a numerator and a denominator. Since the denominator can change, it would be misleading to compare dogs based on the numerator alone. With a constant denominator there is no difference between the final rating and the numerator, since the rank ordering would be the same.

In [None]:
df_combined_cleaned['rating'] = df_combined_cleaned['rating_numerator'] / df_combined_cleaned['rating_denominator']

In [None]:
df_combined_cleaned['rating'].hist()
plt.title(str('rating').upper())
plt.ylabel('Frequency')
plt.show()
plt.clf;

In [None]:
len(df_combined_cleaned[df_combined_cleaned['rating']>df_combined_cleaned['rating'].quantile(0.99)])

The histogram above is distorted by 14 outliers with a rating above the 99th percentile. Thus, values were capped at the 99th percentile.

In [None]:
df_combined_cleaned.loc[df_combined_cleaned['rating']>df_combined_cleaned['rating'].quantile(0.99), 'rating'] = df_combined_cleaned['rating'].quantile(0.99)

df_combined_cleaned['rating'].hist()
plt.title(str('rating').upper())
plt.ylabel('Frequency')
plt.show()
plt.clf;
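
Capping at the 99th percentile, as done in the cell above, can be sketched on toy data as follows. The values are invented (99 ordinary ratings plus one extreme one) and `Series.clip` is used here as a compact alternative to the `.loc` assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: 99 ordinary values and one extreme outlier
rating = pd.Series(list(np.linspace(0.5, 1.4, 99)) + [177.6])

# 99th percentile of the data (a quantile, not a confidence interval)
cap = rating.quantile(0.99)

# Values above the cap are replaced by the cap, all others stay unchanged
capped = rating.clip(upper=cap)

print(f"cap: {cap:.2f}, max before: {rating.max()}, max after: {capped.max():.2f}")
```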

In [None]:
df_combined_cleaned.drop(['rating_numerator','rating_denominator'], axis=1, inplace=True)

#### IV. 10.) Missing imputation

retweet_count, favorite_count, user_followers_count and user_favourites_count have already been imputed in IV. 5.).

#### IV. 11.) Missing dog breeds

Due to the left join used before, there are missing dog breeds. Nonetheless, it is quite difficult to impute 152 values by median or any other concept, since it's a categorical variable. 

In [None]:
print('Missing ratio: '+str(round(df_combined_cleaned.p1.isna().sum() / len(df_combined_cleaned)*100,2))+'%')

Unfortunately, there are too many missing values to drop these observations. Therefore, it was decided to keep them for the first question, but to omit them for the second question.
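
One way to implement this decision is to keep the full frame for the timing question and build a filtered view for the breed question. The sketch below uses invented rows; `timing_data` and `breed_data` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned data; the second tweet has no breed prediction
tweets = pd.DataFrame({
    "retweet_count": [100, 200, 300],
    "p1": ["golden_retriever", np.nan, "pug"],
})

# Question 1 (timing): all rows can be used
timing_data = tweets

# Question 2 (breeds): only rows with a predicted breed
breed_data = tweets.dropna(subset=["p1"])

print(len(timing_data), len(breed_data))  # 3 2
```

Note that `groupby('p1')` skips missing keys by default, so the aggregation in the analysis chapter omits these rows even without an explicit `dropna`.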

#### IV. 12.) Weekday feature

To answer our first question it is important to know the exact weekday of a tweet.

In [None]:
df_combined_cleaned['weekday'] = df_combined_cleaned['timestamp'].apply(lambda x: x.date().weekday()+1)
df_combined_cleaned['weekday']
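
The same feature can be derived with the vectorised `.dt` accessor. This sketch uses three invented timestamps; the `hour` and `is_weekend` columns are additions for illustration (the introduction mentions an analysis by hour, for which the hour would be needed).

```python
import pandas as pd

# Hypothetical timestamps: a Tuesday, a Saturday and a Sunday
timestamps = pd.Series(pd.to_datetime([
    "2017-08-01 16:23:56",
    "2017-08-05 09:00:00",
    "2017-08-06 21:15:00",
]))

features = pd.DataFrame({
    "weekday": timestamps.dt.weekday + 1,       # Monday = 1 ... Sunday = 7
    "day_name": timestamps.dt.day_name(),
    "hour": timestamps.dt.hour,
    "is_weekend": timestamps.dt.weekday >= 5,   # Saturday or Sunday
})

print(features["weekday"].tolist())  # [2, 6, 7]
```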

**Summary:**

Overall, the problems were solved. However, some features will not be used in the final analysis, since their use might lead to unreliable results due to sparse data.

Finally, columns related to retweets, quotes and replies can be dropped. In addition, columns that will not be used in the analysis (e.g. img_num) are dropped, as well as columns used only for quality checks. expanded_urls_add is preferred over jpg_url, due to fewer missing values.

In [None]:
df_combined_cleaned.drop(['in_reply_to_user_id','retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp','text_add','source_add','in_reply_to_status_id_add','in_reply_to_user_id_add','retweeted','is_quote_status','img_num','jpg_url'], axis=1, inplace=True)

In [None]:
df_combined_cleaned.to_csv('cleaned_data.csv', index=True)

In [None]:
random_dog = random.choice(df_combined[df_combined.expanded_urls_add.notnull()].expanded_urls_add.values)
Image(url=random_dog, embed=True, width=260, height=260)

<a id='eda'></a>
## V. - Explanatory Analysis

WeRateDogs® is a popular twitter blog, which tweets on a daily basis. A very important question for each blogger is: What drives my popularity?

In the following analysis we want to dive into the numbers and answer simple but important questions. Popularity is measured by the number of favorites or - more importantly - the number of retweets. [Why are retweets more important?](https://medium.com/@Encore/favorites-vs-retweets-and-why-one-is-more-important-than-the-other-ba12ee20e9ba) For these reasons we will focus on the number of retweets in this analysis.


**1. When is the best time for a tweet?**
To be precise, we want to analyze whether tweets posted on weekends are more popular than tweets posted on workdays. The same analysis will be done for hours. Since "popularity" is hard to quantify directly, the number of retweets and the number of favorites are used as proxies.

**2. Do some breeds outperform others?**
Do some breeds receive significantly higher popularity and, if so, which breed is the most popular?

In [None]:
df_analysis = pd.read_csv('cleaned_data.csv', index_col=['tweet_id'], parse_dates=['timestamp'])

##### 1. When is the best time for a tweet?

In order to answer our first question, line plots were drawn below. The plots show the weekdays on their x-axis, from Monday (1) to Sunday (7). The number of retweets is plotted on the y-axis.

In [None]:
aggr_weekday_sum = df_analysis.groupby('weekday').retweet_count.sum()

aggr_weekday_sum.plot();
plt.grid(which='major', axis='both')
plt.title('Total number of retweets grouped by day')
plt.ylabel('Number of retweets')
plt.xlabel('Weekday')
plt.show()
plt.clf;

aggr_weekday_max = df_analysis.groupby('weekday').retweet_count.max()

aggr_weekday_max.plot();
plt.grid(which='major', axis='both')
plt.title('Maximum number of retweets grouped by day')
plt.ylabel('Number of retweets')
plt.xlabel('Weekday')
plt.show()
plt.clf;

By looking at the first plot, we see that the best day for tweeting is Wednesday, followed by Monday. This is interesting because my first guess would have been Saturday, since people have more free time than on working days. On the other hand, the most popular tweet overall was posted on a Saturday, as can be seen in the second plot. Hence, we checked the retweet volatility in the next two plots to validate these results.

In [None]:
df_analysis.plot.scatter(x='weekday', y='retweet_count')
plt.grid(which='major', axis='both')
plt.title('Number of retweets grouped by day')
plt.ylabel('Number of retweets')
plt.xlabel('Weekday')
plt.show()
plt.clf;

df_analysis.plot.scatter(x='weekday', y='retweet_count')
plt.grid(which='major', axis='both')
plt.title('Number of retweets grouped by day')
plt.ylabel('Number of retweets')
plt.xlabel('Weekday')
plt.ylim([0,25000])
plt.show()
plt.clf;

The first plot suggests that volatility decreases. However, if we remove outliers the story is different. In the second plot we see that tweets on workdays perform better than tweets on weekends. Moreover, volatility seems to be almost constant.

**To check the results, an A/B test based on bootstrap resampling was performed.**

In [None]:
weekday_df = []

for _ in range(10000):
    # Draw one bootstrap sample per iteration so that all three statistics refer to the same data
    sample = df_analysis.sample(2500, replace=True)
    weekdays_mean = sample.query('weekday<6').retweet_count.mean()
    weekends_mean = sample.query('weekday>5').retweet_count.mean()
    weekday_df.append({'weekdays': weekdays_mean,
                       'weekends': weekends_mean,
                       'mean_diff': weekdays_mean - weekends_mean})
    
weekday_df = pd.DataFrame(weekday_df)

In [None]:
plt.hist(weekday_df['mean_diff'], alpha = 0.5, label='weekdays')
plt.axvline(weekday_df['mean_diff'].mean(), color='darkred')

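plt.title('Average difference in number of retweets grouped by weekday/weekend')
# Call plt.xlabel instead of assigning to it (an assignment would overwrite the function)
plt.xlabel('Difference in average number of retweets')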
plt.ylabel('Frequency')
plt.grid(True, which='major', axis='both')

plt.show()

print('Mean difference: '+str(weekday_df['mean_diff'].mean()))

The A/B test contradicts our previous results and shows that there is no statistically significant difference in the number of retweets, and by far no practical significance. Thus, we conclude that the day doesn't matter when tweeting. **It seems that tweets spread over time, an effect not covered in our analysis; future research has to prove this hypothesis.**
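
One common way to judge significance from bootstrapped differences is a percentile interval: if the central 95% of the differences contains 0, the difference is not significant at the 5% level. The following sketch illustrates this on synthetic data. The counts are random draws and not the WeRateDogs data, and each group is resampled separately, which is a variant of the procedure used above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic retweet counts; both groups come from the same distribution (assumption)
weekdays = rng.lognormal(mean=7.5, sigma=1.0, size=1500)
weekends = rng.lognormal(mean=7.5, sigma=1.0, size=600)

# Bootstrap the difference in means
diffs = np.empty(2000)
for i in range(diffs.size):
    wd = rng.choice(weekdays, size=weekdays.size, replace=True)
    we = rng.choice(weekends, size=weekends.size, replace=True)
    diffs[i] = wd.mean() - we.mean()

# 95% percentile interval of the bootstrapped differences
lower, upper = np.percentile(diffs, [2.5, 97.5])
contains_zero = bool(lower <= 0 <= upper)

print(f"95% interval: [{lower:.1f}, {upper:.1f}], contains zero: {contains_zero}")
```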

##### 2. Do some breeds outperform others?
Is there a dog breed that is more popular in general and leads to more popular tweets? This question is answered by the following bar plot, which shows the total number of retweets per breed for the five most popular breeds.



In [None]:
aggr_pred = df_analysis.groupby('p1').retweet_count.sum().nlargest(5)

aggr_pred.plot.bar();
plt.grid(which='major', axis='both')
plt.title('Total number of retweets grouped by breed - 5 largest')
plt.ylabel('Total number of retweets')
plt.xlabel('Breed')
plt.show()
plt.clf;

**The most popular breeds are:**

| Rank | Dog breed |
| :--- | :--- |
| 1 | Golden Retriever |
| 2 | Labrador Retriever |
| 3 | Welsh Corgi Pembroke |
| 4 | Chihuahua |
| 5 | Samoyed |

**Our data supports the cliché of the golden retriever as the most popular breed.** Thus, we conclude that tweets featuring golden retrievers are more likely to get retweeted and become more popular. Nevertheless, it would be wrong for WeRateDogs® to tweet golden retrievers only. What makes a blog interesting in the long run is change and diversity. But every now and then WeRateDogs® should think about tweeting golden retrievers.
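
Since the ranking above is based on total retweets, a breed that is tweeted often has an advantage. The sketch below, on invented numbers, shows how the count, total and mean per breed could be compared side by side; it is meant as a possible robustness check and not as a result of this analysis.

```python
import pandas as pd

# Hypothetical tweets: golden retrievers are tweeted most often
tweets = pd.DataFrame({
    "p1": ["golden_retriever"] * 3 + ["samoyed"] + ["pug"] * 2,
    "retweet_count": [3000, 2000, 4000, 5000, 1000, 1500],
})

# Number of tweets, total retweets and mean retweets per breed
by_breed = tweets.groupby("p1")["retweet_count"].agg(["count", "sum", "mean"])

top_total = by_breed["sum"].idxmax()  # breed with the most retweets in total
top_mean = by_breed["mean"].idxmax()  # breed with the most retweets per tweet

print(top_total, top_mean)  # golden_retriever samoyed
```

In this toy example the two rankings disagree, which is why it can be worth looking at both before drawing conclusions about a breed's popularity per tweet.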

In [None]:
golden_retriever = df_analysis[df_analysis['p1']=='golden_retriever']

random_retriever = random.choice(golden_retriever[golden_retriever.expanded_urls_add.notnull()].expanded_urls_add.values)
Image(url=random_retriever, embed=True, width=260, height=260)

<a id='conclusions'></a>
## VI. - Conclusions

The above analysis made three points clear:

**1. There is no significant difference between tweeting during the working week and tweeting during the weekend.**

**2. Tweets featuring golden retrievers collect more retweets in total than tweets featuring any other breed.**

**3. However, every blog should offer diversity - but every now and then a golden retriever!** 

In [None]:
golden_retriever = df_analysis[df_analysis['p1']=='golden_retriever']

random_retriever = random.choice(golden_retriever[golden_retriever.expanded_urls_add.notnull()].expanded_urls_add.values)
Image(url=random_retriever, embed=True, width=260, height=260)

<a id='references'></a>
## VII. - References


- [Writing to a JSON file](https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/)


- [Tweepy Error Messages](https://www.programcreek.com/python/example/13279/tweepy.TweepError)


- [Flatten Nested JSON files](https://towardsdatascience.com/how-to-flatten-deeply-nested-json-objects-in-non-recursive-elegant-python-55f96533103d)


- [Why are retweets more important?](https://medium.com/@Encore/favorites-vs-retweets-and-why-one-is-more-important-than-the-other-ba12ee20e9ba)