# Wrangle Project - WeRateDogs
## Michael Mosin

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Reference: https://stackoverflow.com/questions/25351968/how-to-display-full-non-truncated-dataframe-information-in-html-when-convertin/25352191
# None means "no limit"; the older -1 sentinel is deprecated in pandas 1.0+.
pd.set_option('display.max_colwidth', None)

# Wrangling:

## Gather Data:

### Import WeRateDogs Twitter archive

In [None]:
df_WRD_twitter = pd.read_csv('twitter-archive-enhanced.csv')

### Import image prediction file from url

In [None]:
# Reference: https://www.codementor.io/aviaryan/downloading-files-from-urls-in-python-77q3bs0un

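import requests
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
# Write the response body to disk; the context manager closes the file.
with open('image-predictions.tsv', 'wb') as f:
    f.write(r.content)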

In [None]:
df_img_pred = pd.read_csv('image-predictions.tsv', sep='\t')

### Import Tweet JSON data

In [None]:
# Reference: Dhaval P's answer on his question: https://knowledge.udacity.com/questions/47704
import json
data = []
with open('tweet_json.txt') as f:
    for line in f:
        data.append(json.loads(line))
df_twit_JSON = pd.DataFrame(data)


## Assess Data:

### Assessing WeRateDogs Twitter Archive Data:

#### Quality Issues:
- The 'name' column has 745 entries extracted as the non-null string 'None', and several extracted as 'a', 'the', or 'an'. Most of the 'None' values are appropriate, and most of the 'a', 'the', and 'an' entries should also be changed to 'None'. These are the 'a' and 'an' entries (listed by index) that need to be changed to real names:
    - 649 a - Forrest
    - 1853 a - Wylie
    - 1955 a - Kip
    - 2034 a - Yacōb
    - 2066 a - Rufus
    - 2116 a - Spork
    - 2125 a - Cherokee
    - 2128 a - Hemry
    - 2146 a - Alphred
    - 2161 a - Alfredo
    - 2191 a - Leroi
    - 2198 a - Toblerone
    - 2218 a - Chuk
    - 2235 a - Alfonso
    - 2249 a - Cheryl
    - 2255 a - Jessiga
    - 2264 a - Klint
    - 2273 a - Kohl
    - 2287 a - Daryl
    - 2304 a - Pepe
    - 2311 a - Octaviath
    - 2314 a - Johm
    - 2204 an - Berta
- There are 181 retweet entries. The project calls for original tweets only, so these should be removed.
- There are 78 reply entries. It is unclear whether a reply fits the definition of 'original tweet', even when it includes a new photo, name, and rating, so it is better to err on the side of caution and remove them.
- Several ratings need to be adjusted, or rows removed due to non-ratings:
    - Entry at index 313 extracted a rating of '960/0', and needs to be changed to the revised rating of '13/10'
    - Entries at indices 340 and 695 extracted a rating of '75/10', and need to be changed to the actual rating of '9.75/10'
    - Entry at index 342 actually doesn't have a rating ('11/15' was extracted, while it was simply a description of time). Row needs to be removed.
    - Entry at index 516 actually doesn't have a rating ('24/7' was extracted, while it was simply a description of time). Row needs to be removed.
    - Entry at index 763 extracted a rating of '27/10', and needs to be changed to the actual rating of '11.27/10'
    - Entry at index 1068 extracted a rating of '9/11', and needs to be changed to the actual rating of '14/10'
    - Entry at index 1165 extracted a rating of '4/20', and needs to be changed to the actual rating of '13/10'
    - Entry at index 1202 extracted a rating of '50/50', and needs to be changed to the actual rating of '11/10'
    - Entries at indices 1598 and 1663 were technically not officially given ratings by WeRateDogs, and should be removed.
    - Entry at index 1662 extracted a rating of '7/11', and needs to be changed to the actual rating of '10/10'
    - Entry at index 1712 extracted a rating of '26/10', and needs to be changed to the actual rating of '11.26/10'
    - Entry at index 2335 extracted a rating of '1/2', and needs to be changed to the actual rating of '9/10'
- Since some correct ratings contain decimal values, 'rating_numerator' and 'rating_denominator' need to be changed from int to float
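
The decimal cases above (for example '9.75/10' being read as '75/10') suggest that the original extraction only matched whole numbers. A minimal sketch of a pattern that tolerates a decimal numerator follows. The sample strings are invented for illustration and are not taken from the archive, and `extract_rating` is a hypothetical helper, not part of this notebook.

```python
import re

# Numerator may carry a decimal part; denominator is a whole number.
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) of the first rating-like match, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Invented sample texts, not taken from the archive.
samples = [
    'This is Bella. 9.75/10 would pet again',
    'Solid 12/10 pupper',
    'No rating in this one',
]
ratings = [extract_rating(s) for s in samples]
print(ratings)
```

A pattern like this would still pick up non-ratings such as dates or '24/7', so the manual review listed above would remain necessary.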

#### Tidiness Issues:
- Dog types (e.g. doggo, puppo) are stored in separate columns: if a dog is described with a type, the value is that type's name, and if not, the value is the non-null string 'None'. Instead, the columns could either be recast as Boolean 1's and 0's, or all placed into one 'dog_type' column.
    - Some entries have more than one dog type extracted from the text, which confirms that these columns are not mutually exclusive. Had the extraction lumped all of an entry's categories together, this would have been an issue.

In [None]:
df_WRD_twitter.head()

In [None]:
df_WRD_twitter.info()

#### Assessing dog types:

In [None]:
#Reference: https://stackoverflow.com/questions/33042633/selecting-last-n-columns-and-excluding-last-n-columns-in-dataframe
# The four dog type columns are the last four in the archive (as in the Clean section).
dog_type_cols = df_WRD_twitter.columns[-4:].values

for i in dog_type_cols:
    print(df_WRD_twitter[i].value_counts())

In [None]:
df_WRD_twitter[(df_WRD_twitter['doggo'] == 'doggo') & (df_WRD_twitter['puppo'] == 'puppo')]['text']

In [None]:
df_WRD_twitter[(df_WRD_twitter['doggo'] == 'doggo') & (df_WRD_twitter['floofer'] == 'floofer')]['text']

In [None]:
df_WRD_twitter[(df_WRD_twitter['doggo'] == 'doggo') & (df_WRD_twitter['pupper'] == 'pupper')]['text']

#### Assessing dog names:

In [None]:
df_WRD_twitter.name.value_counts().nlargest(20)

In [None]:
df_WRD_twitter[df_WRD_twitter['name']=='None']['text'].head(10)

In [None]:
df_WRD_twitter[df_WRD_twitter['name']=='a']['text']

In [None]:
df_WRD_twitter[df_WRD_twitter['name']=='an']['text']

In [None]:
df_WRD_twitter[df_WRD_twitter['name']=='the']['text']

#### Assessing dog ratings:

In [None]:
df_WRD_twitter.rating_numerator.value_counts()

In [None]:
df_WRD_twitter[df_WRD_twitter['rating_numerator'] >= 20][['rating_numerator','rating_denominator','text']]

In [None]:
df_WRD_twitter.rating_denominator.value_counts()

In [None]:
df_WRD_twitter[df_WRD_twitter['rating_denominator'] != 10][['rating_numerator','rating_denominator','text']]

### Assessing Image Prediction Data:

#### Quality Issues:
- There are 324 images that returned predictions that were not dogs ('p1_dog', 'p2_dog', 'p3_dog' all False). These rows are either evidence of the neural network discovering pictures that indeed don't contain dogs, or of the neural network having a difficult time finding the dog in the image.

In [None]:
df_img_pred.head()

In [None]:
df_img_pred.info()

In [None]:
df_img_pred.img_num.value_counts()

In [None]:
df_img_pred.tweet_id.duplicated().value_counts()

In [None]:
all_pict_false = df_img_pred[~df_img_pred.p1_dog & ~df_img_pred.p2_dog & ~df_img_pred.p3_dog]
all_pict_false.shape[0]

### Assessing Twitter JSON Data:

#### Quality Issues:
- There are 179 tweets that are retweets. These should be removed, as they are not originals.
- There are 28-29 tweets that are original responses to other tweets. As they are not necessarily stand-alone originals, they may be up for removal, unless the image dataset has extracted photos associated with the tweet.
- Essentially empty columns that should be dropped or excluded from the merge: 'contributors', 'coordinates', 'geo', 'place'.

#### Tidiness Issues:
- Columns in which entries contain multiple pieces of information: 'entities', 'extended_entities', 'quoted_status', 'retweeted_status', 'user'. These columns could be made into their own datasets, or their contents could be sorted into unique variables that would be attached onto the end of the main JSON dataset entries to which they belong.

In [None]:
df_twit_JSON.head()

In [None]:
df_twit_JSON.info()

In [None]:
df_twit_JSON[df_twit_JSON.retweeted_status.notna()]['retweeted_status'].iloc[0]

In [None]:
df_twit_JSON[df_twit_JSON.retweeted_status.notna()].iloc[0]

In [None]:
df_twit_JSON[df_twit_JSON.quoted_status.notna()].iloc[0]

In [None]:
df_twit_JSON.favorited.value_counts()

### Summary of Quality Issues:

#### WeRateDogs Twitter Archive Data:
- The 'name' column has 745 entries extracted as the non-null string 'None', and several extracted as 'a', 'the', or 'an'. Most of the 'None' values are appropriate, and most of the 'a', 'the', and 'an' entries should also be changed to 'None'. These are the 'a' and 'an' entries (listed by index) that need to be changed to real names:
    - 649 a - Forrest
    - 1853 a - Wylie
    - 1955 a - Kip
    - 2034 a - Yacōb
    - 2066 a - Rufus
    - 2116 a - Spork
    - 2125 a - Cherokee
    - 2128 a - Hemry
    - 2146 a - Alphred
    - 2161 a - Alfredo
    - 2191 a - Leroi
    - 2198 a - Toblerone
    - 2218 a - Chuk
    - 2235 a - Alfonso
    - 2249 a - Cheryl
    - 2255 a - Jessiga
    - 2264 a - Klint
    - 2273 a - Kohl
    - 2287 a - Daryl
    - 2304 a - Pepe
    - 2311 a - Octaviath
    - 2314 a - Johm
    - 2204 an - Berta
- There are 181 retweet entries. The project calls for original tweets only, so these should be removed.
- There are 78 reply entries. It is unclear whether a reply fits the definition of 'original tweet', even when it includes a new photo, name, and rating, so it is better to err on the side of caution and remove them.
- Several ratings need to be adjusted, or rows removed due to non-ratings:
    - Entry at index 313 extracted a rating of '960/0', and needs to be changed to the revised rating of '13/10'
    - Entries at indices 340 and 695 extracted a rating of '75/10', and need to be changed to the actual rating of '9.75/10'
    - Entry at index 342 actually doesn't have a rating ('11/15' was extracted, while it was simply a description of time). Row needs to be removed.
    - Entry at index 516 actually doesn't have a rating ('24/7' was extracted, while it was simply a description of time). Row needs to be removed.
    - Entry at index 763 extracted a rating of '27/10', and needs to be changed to the actual rating of '11.27/10'
    - Entry at index 1068 extracted a rating of '9/11', and needs to be changed to the actual rating of '14/10'
    - Entry at index 1165 extracted a rating of '4/20', and needs to be changed to the actual rating of '13/10'
    - Entry at index 1202 extracted a rating of '50/50', and needs to be changed to the actual rating of '11/10'
    - Entries at indices 1598 and 1663 were technically not officially given ratings by WeRateDogs, and should be removed.
    - Entry at index 1662 extracted a rating of '7/11', and needs to be changed to the actual rating of '10/10'
    - Entry at index 1712 extracted a rating of '26/10', and needs to be changed to the actual rating of '11.26/10'
    - Entry at index 2335 extracted a rating of '1/2', and needs to be changed to the actual rating of '9/10'
- Since some correct ratings contain decimal values, 'rating_numerator' and 'rating_denominator' need to be changed from int to float

#### Image Prediction Data:
- There are 324 images that returned predictions that were not dogs ('p1_dog', 'p2_dog', 'p3_dog' all False). These rows are either evidence of the neural network discovering pictures that indeed don't contain dogs, or of the neural network having a difficult time finding the dog in the image.

#### Twitter JSON Data:
- There are 179 tweets that are retweets. These should be removed, as they are not originals.
- There are 28-29 tweets that are original responses to other tweets. As they are not necessarily stand-alone originals, they may be up for removal, unless the image dataset has extracted photos associated with the tweet.
- Essentially empty columns that should be dropped or excluded from the merge: 'contributors', 'coordinates', 'geo', 'place'.

### Summary of Tidiness Issues:

#### WeRateDogs Twitter Archive Data:
- Dog types (e.g. doggo, puppo) are stored in separate columns: if a dog is described with a type, the value is that type's name, and if not, the value is the non-null string 'None'. Instead, the columns could either be recast as Boolean 1's and 0's, or all placed into one 'dog_type' column.
    - Some entries have more than one dog type extracted from the text, which confirms that these columns are not mutually exclusive. Had the extraction lumped all of an entry's categories together, this would have been an issue.

#### Twitter JSON Data:
- Columns in which entries contain multiple pieces of information: 'entities', 'extended_entities', 'quoted_status', 'retweeted_status', 'user'. These columns could be made into their own datasets, or their contents could be sorted into unique variables that would be attached onto the end of the main JSON dataset entries to which they belong.

## Clean:

#### Define

Convert existing dog type columns in WeRateDogs Twitter Archive Data into boolean variables, then consolidate dog type column data into singular column 'dog_type' which contains the summary of dog types extracted per entry.
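
As a rough illustration of this step, the sketch below builds Boolean flags on a toy frame and joins the flagged names row by row. The toy values are invented, `combine` is a hypothetical helper, and the row-wise join with a separator is a swapped-in alternative to the dot-product trick used in the code cell, chosen here only so that combined types stay readable.

```python
import pandas as pd

# Toy stand-in for the four dog type columns; values are invented for illustration.
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

# Boolean flags: True where a column holds its own name.
flags = pd.DataFrame({col: toy[col] == col for col in toy.columns})

def combine(row):
    """Join the names of the flagged types; None when no type was found."""
    found = [col for col in row.index if row[col]]
    return ', '.join(found) if found else None

dog_type = flags.apply(combine, axis=1)
print(dog_type)
```

The last toy row carries two flags, so its combined value lists both types, which is the non-exclusive case noted in the assessment.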

#### Code

In [None]:
df_WRD_twitter2 = df_WRD_twitter.copy()
dog_type_cols = df_WRD_twitter2.columns[-4:].values

In [None]:
# Turn each dog type column into a boolean flag (True where the type was extracted).
for i in dog_type_cols:
    df_WRD_twitter2[i] = df_WRD_twitter2[i] == i

subset = df_WRD_twitter2[df_WRD_twitter2.columns[-4:]].copy()

# Reference: https://stackoverflow.com/questions/26762100/reconstruct-a-categorical-variable-from-dummies-in-pandas
# For each row, concatenate the names of the columns that are True;
# rows with no dog type produce an empty string.
dog_type = subset.dot(subset.columns)

df_WRD_twitter2['dog_type'] = dog_type.astype('category')

In [None]:
df_WRD_twitter2['dog_type'] = np.where(df_WRD_twitter2['dog_type'] == '', None, df_WRD_twitter2['dog_type'])

#### Test

In [None]:
df_WRD_twitter2['dog_type'].value_counts()

In [None]:
df_WRD_twitter2[df_WRD_twitter2['dog_type'].notna()].head()

#### Define

Change 'name' values of 'a', 'an', and 'the' to 'None'.

#### Code

In [None]:
# Replace all three article "names" with the string 'None' in one pass.
df_WRD_twitter2['name'] = df_WRD_twitter2['name'].replace(['a', 'an', 'the'], 'None')

#### Test

In [None]:
df_WRD_twitter2.name.value_counts().nlargest(20)

#### Define

Rename some of the entries that had names 'a' or 'an' to their actual names.

#### Code

In [None]:
# Reference: https://www.tutorialspoint.com/How-to-create-Python-dictionary-from-list-of-keys-and-values
keys = [649,1853,1955,2034,2066,2116,2125,2128,2146,2161,2191,2198,2204,2218,
        2235,2249,2255,2264,2273,2287,2304,2311,2314]
values = ['Forrest', 'Wylie', 'Kip', 'Yacōb', 'Rufus', 'Spork', 'Cherokee',
          'Hemry', 'Alphred', 'Alfredo', 'Leroi', 'Toblerone', 'Berta', 'Chuk',
          'Alfonso', 'Cheryl', 'Jessiga', 'Klint', 'Kohl', 'Daryl', 'Pepe',
          'Octaviath', 'Johm']
d = dict(zip(keys,values))

In [None]:
# Assign through a single .loc call so the change is written to the DataFrame itself.
for k, v in d.items():
    df_WRD_twitter2.loc[k, 'name'] = v

#### Test

In [None]:
# Look rows up by index label, matching the labels used in the assignment.
for k in d:
    print(df_WRD_twitter2.loc[k, ['text', 'name']])

In [None]:
df_WRD_twitter2.name.value_counts().nlargest(20)

#### Define

Change rating variables to dtype float64, and clean up incorrect ratings in the WeRateDogs Archive dataset.

#### Code

In [None]:
df_WRD_twitter2.rating_numerator = df_WRD_twitter2.rating_numerator.astype('float')
df_WRD_twitter2.rating_denominator = df_WRD_twitter2.rating_denominator.astype('float')

In [None]:
# Use df.loc[row, column] rather than chained indexing so each assignment
# is applied to the DataFrame itself and not to a temporary copy.
df_WRD_twitter2.loc[313, 'rating_numerator'] = 13.00
df_WRD_twitter2.loc[[340, 695], 'rating_numerator'] = 9.75
df_WRD_twitter2.loc[763, 'rating_numerator'] = 11.27
df_WRD_twitter2.loc[1068, 'rating_numerator'] = 14.00
df_WRD_twitter2.loc[1165, 'rating_numerator'] = 13.00
df_WRD_twitter2.loc[1202, 'rating_numerator'] = 11.00
df_WRD_twitter2.loc[1662, 'rating_numerator'] = 10.00
df_WRD_twitter2.loc[1712, 'rating_numerator'] = 11.26
df_WRD_twitter2.loc[2335, 'rating_numerator'] = 9.00

df_WRD_twitter2.loc[[313, 1068, 1165, 1202, 1662, 2335], 'rating_denominator'] = 10.00

#### Test

In [None]:
df_WRD_twitter2.info()

In [None]:
df_WRD_twitter2[df_WRD_twitter2['rating_denominator'] != 10][['rating_numerator','rating_denominator','text']]

#### Define

Remove entries with non-ratings: entries at indices 342, 516, 1598, and 1663.

#### Code

In [None]:
df_WRD_twitter2 = df_WRD_twitter2.drop([342,516,1598,1663])

#### Test

In [None]:
print(342 in df_WRD_twitter2.index)
print(516 in df_WRD_twitter2.index)
print(1598 in df_WRD_twitter2.index)
print(1663 in df_WRD_twitter2.index)

#### Define

Merge the three datasets into one dataset:
- Main DataFrame will be that of the edited WeRateDogs archive - `df_WRD_twitter2`
- Tacked on to it will be the columns of interest from the JSON DataFrame - `df_twit_JSON`
- Tacked on to that will be the whole of the image prediction DataFrame - `df_img_pred`
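
One thing worth keeping in mind for this step is that `merge` defaults to an inner join, so any tweet missing from either of the added datasets drops out of the result. The sketch below shows the effect on toy frames. The ids and values are invented, and only the column names mirror those used in this notebook.

```python
import pandas as pd

# Toy frames with invented ids; column names mirror those used in this notebook.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['Rufus', 'None', 'Kip']})
counts = pd.DataFrame({'id': [1, 2], 'favorite_count': [100, 250], 'retweet_count': [10, 40]})
preds = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['pug', 'beagle']})

# merge() defaults to how='inner', so only ids present on both sides survive.
step1 = archive.merge(counts, left_on='tweet_id', right_on='id').drop(columns='id')
step2 = step1.merge(preds, on='tweet_id')
print(len(archive), len(step1), len(step2))

# An outer merge with indicator=True shows which side each row came from.
audit = archive.merge(preds, on='tweet_id', how='outer', indicator=True)
print(audit['_merge'].value_counts())
```

Comparing the shapes before and after each merge, as the Test cells do, is a quick way to see how many rows were lost to the inner join.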

#### Code

In [None]:
pre_master_df = df_WRD_twitter2.merge(df_twit_JSON[['id',
                                                    'favorite_count', 
                                                    'retweet_count']],
                                     left_on='tweet_id', right_on='id')
pre_master_df = pre_master_df.drop(columns='id')

In [None]:
master_df = pre_master_df.merge(df_img_pred, on = 'tweet_id')

#### Test

In [None]:
df_WRD_twitter2.shape

In [None]:
pre_master_df.shape

In [None]:
master_df.shape

In [None]:
master_df.head()

#### Define

Remove rows which contain retweets and reply tweets, and then drop related empty columns.
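
The same filter can also be written as a single Boolean mask that keeps the rows where both marker columns are empty. This is a minimal sketch on invented rows. The column names follow the archive, but the ids, timestamp, and user id are made up.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values; NaN marks "not a retweet" / "not a reply".
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_timestamp': [np.nan, '2017-01-01 00:00:00 +0000', np.nan, np.nan],
    'in_reply_to_user_id': [np.nan, np.nan, 12345.0, np.nan],
})

# Keep a row only if it is neither a retweet nor a reply.
is_original = toy['retweeted_status_timestamp'].isna() & toy['in_reply_to_user_id'].isna()
originals = toy[is_original]
print(originals['tweet_id'].tolist())
```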

#### Code

In [None]:
# Reference: https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression
master_df = master_df.drop(master_df[master_df.retweeted_status_timestamp.notna()].index)
master_df = master_df.drop(master_df[master_df.in_reply_to_user_id.notna()].index)

In [None]:
master_df = master_df.drop(columns = ['retweeted_status_id',
                                      'retweeted_status_user_id', 
                                      'retweeted_status_timestamp',
                                      'in_reply_to_user_id',
                                      'in_reply_to_status_id'])

#### Test

In [None]:
master_df.info()

## Save Final Dataset to CSV File

In [None]:
master_df.to_csv('twitter_archive_master.csv', index = False)

# Analysis and Visualization

## Read in Master Dataset

In [None]:
df = pd.read_csv('twitter_archive_master.csv')

## Analysis

### Review of Dogtionary Dog Type Usage in Tweet Text

In [None]:
df.dog_type.value_counts()

In [None]:
# Reference: https://stackoverflow.com/questions/32891211/limit-the-number-of-groups-shown-in-seaborn-countplot
ax = sns.countplot(data = df, x ='dog_type', 
                  color = sns.color_palette()[0], 
                  order=df.dog_type.value_counts().index);
plt.xticks(rotation=30)
plt.title('Dogtionary Dog Types in Tweets, \n Sorted by Count')
ax.set_xlabel('Dog Type(s)')
ax.set_ylabel('Number of Tweets Dog Types Appeared In')

#Reference: https://stackoverflow.com/questions/39519609/annotate-bars-with-values-on-pandas-on-seaborn-factorplot-bar-plot
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), 
                (p.get_x() + p.get_width() / 2.,
                 p.get_height()+2), 
                ha='center')

In [None]:
# Reference: https://stackoverflow.com/questions/41801419/drawing-bar-charts-from-boolean-fields
# pd.Series.value_counts replaces the deprecated top-level pd.value_counts.
df1 = df[['pupper','puppo','doggo','floofer']].apply(pd.Series.value_counts)
ax = df1.loc[True].plot.bar();

plt.title('Dogtionary Dog Types Appearances in Tweets, \n Sorted by Dog Size')
ax.set_xlabel('Dog Types')
ax.set_ylabel('Number of Tweets Dog Types Appeared In')

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), 
                (p.get_x() + p.get_width() / 2.,
                 p.get_height()+2), 
                ha='center')

### Insight #1:

It appears that when WeRateDogs does describe dogs using Dogtionary nomenclature, it most often does so for the smallest / youngest of dogs. This does not necessarily mean that the majority of WeRateDogs tweets are of 'puppers'; however, it does make that more likely.

### Review of Averaged Ratings in Dataset

In [None]:
df['avg_rating'] = df.rating_numerator / df.rating_denominator

In [None]:
bins = np.arange(-.25, df.avg_rating.max()+0.25, 0.1)
ax = plt.hist(df.avg_rating, bins = bins)

plt.xlim([-0.05,1.5])
locs, labels = plt.xticks()
plt.xticks(locs, (locs*10).astype('int64'));

plt.title('Histogram of Averaged Ratings,\n sans outliers')
plt.xlabel('Average Numerator (for a rating out of 10)');

### Insight #2:

Apart from a few outliers far out to the right, the general distribution of averaged ratings is left-skewed, with most ratings having a numerator of 10-13. It is interesting to see that, despite the WeRateDogs Twitter account being known for giving out ratings of 10 or greater, it still fairly often gives out ratings below 10.

### Review of Favorite Counts vs Retweet Counts

In [None]:
plt.scatter(df.favorite_count, df.retweet_count);
plt.title('Favorite Counts vs Retweet Counts')
plt.xlabel('Favorites Count');
plt.ylabel('Retweet Count');

In [None]:
plt.scatter(df.favorite_count, df.retweet_count, alpha = .25)
plt.xlim([0,45000])
plt.ylim([0,25000])
plt.title('Favorite Counts vs Retweet Counts, \n zoomed in')
plt.xlabel('Favorites Count');
plt.ylabel('Retweet Count');

In [None]:
df[['favorite_count','retweet_count']].corr()

### Insight #3:

It appears that, although 'favorite_count' and 'retweet_count' are strongly positively correlated (r = 0.913), there is also demonstrable heteroscedasticity: the spread of retweet counts is not constant across the range of favorite counts.
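
One simple way to probe for heteroscedasticity is to compare the spread of the residuals in the lower and upper halves of the x-axis. The sketch below does this on synthetic data only. The slope of 0.4, the noise level, and the `residual_spread` helper are arbitrary assumptions for illustration and say nothing about the actual dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the synthetic data is reproducible

# Synthetic data: y tracks x, but the noise grows in proportion to x.
xs = [random.uniform(1, 1000) for _ in range(2000)]
ys = [0.4 * x + random.gauss(0, 0.2 * x) for x in xs]

def residual_spread(points):
    """Standard deviation of y around the assumed trend line y = 0.4 * x."""
    return statistics.pstdev([y - 0.4 * x for x, y in points])

# Sort by x and split into a lower and an upper half.
pairs = sorted(zip(xs, ys))
half = len(pairs) // 2
low_spread = residual_spread(pairs[:half])
high_spread = residual_spread(pairs[half:])
print(low_spread < high_spread)
```

On the notebook's data, a comparable check could split `df` at the median `favorite_count` and compare the spread of `retweet_count` around a fitted line in each half.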

### Review of Distribution of Dog Type Relative to Averaged Rating

In [None]:
# Reference: https://stackoverflow.com/questions/8671808/matplotlib-avoiding-overlapping-datapoints-in-a-scatter-dot-beeswarm-plot
plt.figure(figsize= (8,4))
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.swarmplot(x='dog_type', y='avg_rating', data=df)
locs, labels = plt.yticks()
plt.yticks(locs, (locs*10).astype('int64'));
plt.xticks(rotation=30);
plt.title('Dog Types and Their Spread of Averaged Ratings')
plt.xlabel('Dog Type');
plt.ylabel('Averaged Rating');

In [None]:
print('Pupper - Mean Averaged Rating: ', df[df['pupper']==True]['avg_rating'].mean()*10)
print('Puppo - Mean Averaged Rating: ', df[df['puppo']==True]['avg_rating'].mean()*10)
print('Doggo - Mean Averaged Rating: ', df[df['doggo']==True]['avg_rating'].mean()*10)
print('Floofer - Mean Averaged Rating: ', df[df['floofer']==True]['avg_rating'].mean()*10)

### Insight #4:

Now, it might be due to a small, non-representative sample size (one that may have led to inflated values due to rarity), but of the dogs described with Dogtionary taxonomy, 'puppos' have the highest mean averaged rating. They are followed by 'floofers', then 'doggos', and then 'puppers' by quite a large margin. That may be because 'puppers' appear more often, and therefore have more opportunities to get stale. Or it may be because they are the youngest category, and therefore have the most room to grow, especially after - for instance - having been caught being naughty.
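
Since the caveat here is about sample size, it can help to report each group's count next to its mean. The sketch below does so with `groupby` on a toy frame. The ratings are invented, and the real values would come from the master dataset.

```python
import pandas as pd

# Invented ratings for illustration; not taken from the master dataset.
toy = pd.DataFrame({
    'dog_type': ['pupper', 'pupper', 'pupper', 'pupper', 'puppo', 'floofer'],
    'avg_rating': [1.0, 1.1, 0.9, 1.2, 1.3, 1.2],
})

# Count and mean side by side, so small groups are easy to spot.
summary = toy.groupby('dog_type')['avg_rating'].agg(['count', 'mean'])
print(summary)
```

A group with a count of one or two, like the toy 'puppo' and 'floofer' rows, has a mean that a single tweet can move considerably.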

### Review of Averaged Rating vs Retweet Count, then vs Favorite Count, all while incorporating Dog Types

In [None]:
# Recent seaborn versions require x and y as keyword arguments; hue sets the colors.
sns.scatterplot(x='avg_rating', y='retweet_count', data=df, hue='dog_type');
plt.xlim([0.25,1.5])

In [None]:
sns.scatterplot(x='avg_rating', y='favorite_count', data=df, hue='dog_type');
plt.xlim([0.25,1.5])

In [None]:
sns.pairplot(x_vars=['avg_rating'], y_vars=['retweet_count'], data=df, hue = 'dog_type', height = 6);
plt.xlim([0.25,1.5]);
plt.ylim([0,22000]);
locs, labels = plt.xticks()
plt.xticks(locs, (locs*10).astype('int64'));

In [None]:
print('Pupper - Median Retweet Count: ', df[df['pupper']==True]['retweet_count'].median())
print('Puppo - Median Retweet Count: ', df[df['puppo']==True]['retweet_count'].median())
print('Doggo - Median Retweet Count: ', df[df['doggo']==True]['retweet_count'].median())
print('Floofer - Median Retweet Count: ', df[df['floofer']==True]['retweet_count'].median())

In [None]:
sns.pairplot(x_vars=['avg_rating'], y_vars=['favorite_count'], data=df, hue = 'dog_type', height = 6);
plt.xlim([0.25,1.5]);
plt.ylim([0,55000]);
locs, labels = plt.xticks()
plt.xticks(locs, (locs*10).astype('int64'));

In [None]:
print('Pupper - Median Favorite Count: ', df[df['pupper']==True]['favorite_count'].median())
print('Puppo - Median Favorite Count: ', df[df['puppo']==True]['favorite_count'].median())
print('Doggo - Median Favorite Count: ', df[df['doggo']==True]['favorite_count'].median())
print('Floofer - Median Favorite Count: ', df[df['floofer']==True]['favorite_count'].median())

### Insight #5:

We see that as ratings increase, so do favorites and retweets. However, it's hard to tell from the plots how much more 'pup'-ular the different dog types are compared to one another, so some calculations were in order.

Since favorite counts and retweet counts appear to rise exponentially, the mean for any dog type would be skewed severely by posts that went viral. Looking instead at the median number of favorites and retweets for each dog type, we see that puppers are abysmally low relative to the rest of the pack. My unsubstantiated speculation is that, for the followers of WeRateDogs, the novelty of 'puppers' has worn off.
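
The reason for preferring the median can be seen in a small worked example. The counts below are hypothetical, with one post standing in for a tweet that went viral.

```python
import statistics

# Hypothetical retweet counts for one dog type; the last post "went viral".
retweets = [100, 120, 130, 150, 50000]

mean_count = statistics.mean(retweets)      # pulled far upward by the single viral post
median_count = statistics.median(retweets)  # stays near the typical post
print(mean_count, median_count)
```

Here the mean lands well above four of the five posts, while the median still describes a typical one.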