# Act Report

## Michael Mosin

#### 415-468 word written report that communicates the insights and displays the visualizations produced from my wrangled data.

Hello! Welcome to my WeRateDogs Twitter Data Wrangling Project. My name is Michael Mosin, and I have worked very hard on getting this data cleaned and worthy of analysis. Below are some of my assessments and associated visuals.

My variables of interest were dog types, ratings, count of favorites, and count of retweets.

### Assessment of Dogtionary Dog Type Usage in Tweet Text

#### Insight:

It appears that when WeRateDogs describes dogs using Dogtionary nomenclature, it most often does so for the smallest and youngest dogs. This does not establish that the majority of WeRateDogs tweets are about 'puppers'; it only makes that more likely.

In [None]:
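# Libraries used throughout this report; df is assumed to be the wrangled DataFrame loaded earlier
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Reference: https://stackoverflow.com/questions/32891211/limit-the-number-of-groups-shown-in-seaborn-countplot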
ax = sns.countplot(data = df, x ='dog_type', 
                  color = sns.color_palette()[0], 
                  order=df.dog_type.value_counts().index);
plt.xticks(rotation=30)
plt.title('Dogtionary Dog Types in Tweets, \n Sorted by Count')
ax.set_xlabel('Dog Type(s)')
ax.set_ylabel('Number of Tweets Dog Types Appeared In')

#Reference: https://stackoverflow.com/questions/39519609/annotate-bars-with-values-on-pandas-on-seaborn-factorplot-bar-plot
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), 
                (p.get_x() + p.get_width() / 2.,
                 p.get_height()+2), 
                ha='center')

In [None]:
# Reference: https://stackoverflow.com/questions/41801419/drawing-bar-charts-from-boolean-fields
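# Count True/False per column (top-level pd.value_counts is deprecated in recent pandas)
df1 = df[['pupper','puppo','doggo','floofer']].apply(lambda col: col.value_counts())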
ax = df1.loc[True].plot.bar();

plt.title('Dogtionary Dog Types Appearances in Tweets, \n Sorted by Dog Size')
ax.set_xlabel('Dog Types')
ax.set_ylabel('Number of Tweets Dog Types Appeared In')

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), 
                (p.get_x() + p.get_width() / 2.,
                 p.get_height()+2), 
                ha='center')
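
The caveat in the insight above can be made concrete by counting, next to the per-type totals, how many tweets use any Dogtionary term at all. The following is a minimal sketch on made-up data; `df_demo` and its values are hypothetical, and only the four boolean column names are taken from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the wrangled data: one row per tweet,
# one boolean column per Dogtionary dog type.
df_demo = pd.DataFrame({
    "pupper":  [True,  True,  False, False, False, True],
    "puppo":   [False, False, True,  False, False, False],
    "doggo":   [False, True,  False, False, False, False],
    "floofer": [False, False, False, False, False, False],
})
stage_cols = ["pupper", "puppo", "doggo", "floofer"]

# Summing booleans counts the True values in each column.
stage_counts = df_demo[stage_cols].sum()

# Share of tweets that mention at least one dog type.
share_with_any = df_demo[stage_cols].any(axis=1).mean()

print(stage_counts)
print(f"Share of tweets with any dog type: {share_with_any:.2f}")
```

If the share is small, the dominance of 'puppers' among labelled tweets says little about the unlabelled remainder.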

### Assessment of Averaged Ratings in Dataset

#### Insight:

Apart from a few extreme outliers on the right, the distribution of averaged ratings is left-skewed, with most ratings having a numerator between 10 and 13. It is interesting that, although the WeRateDogs Twitter account is known for giving ratings of 10 or greater, it still fairly often gives ratings below 10.

In [None]:
df['avg_rating'] = df.rating_numerator / df.rating_denominator

bins = np.arange(-.25, df.avg_rating.max()+0.25, 0.1)
# plt.hist returns (counts, bin edges, patches), not an Axes, so the result is not kept
plt.hist(df.avg_rating, bins = bins);

plt.xlim([-0.05,1.5])
locs, labels = plt.xticks()
plt.xticks(locs, (locs*10).astype('int64'));

plt.title('Histogram of Averaged Ratings,\n sans outliers')
plt.xlabel('Average Numerator (for a rating out of 10)');
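
The direction of the skew can also be checked numerically rather than read off the histogram. The sketch below uses invented numerators (not the real dataset) with a single extreme value, and assumes the same cut-off of 1.5 that the histogram's x-axis limit uses; `skew_all` and `skew_trimmed` are illustrative names.

```python
import pandas as pd

# Made-up numerators, all with a denominator of 10; the last one is an extreme outlier.
numerators = [12, 12, 12, 12, 13, 13, 11, 11, 11, 10, 10, 9, 8, 6, 3, 1776]
ratio = pd.Series(numerators) / 10

# Sample skewness with and without the outlier.
skew_all = ratio.skew()
skew_trimmed = ratio[ratio <= 1.5].skew()

print(f"Skewness, all ratings:      {skew_all:.2f}")
print(f"Skewness, outliers removed: {skew_trimmed:.2f}")
```

A positive value indicates a longer right tail and a negative value a longer left tail, so the sign flips once the outlier is excluded.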

### Assessment of Favorite Counts vs Retweet Counts

#### Insight:

Although 'favorite_count' and 'retweet_count' have a strong positive correlation (r = 0.913), there is also demonstrable heteroscedasticity: the spread of retweet counts widens as favorite counts grow.

In [None]:
df[['favorite_count','retweet_count']].corr()

In [None]:
plt.scatter(df.favorite_count, df.retweet_count, alpha = .25)
plt.xlim([0,45000])
plt.ylim([0,25000])
plt.title('Favorite Counts vs Retweet Counts, \n zoomed in')
plt.xlabel('Favorites Count');
plt.ylabel('Retweet Count');
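
One rough way to back up the heteroscedasticity claim is to fit a straight line and compare the residual spread for low and high favorite counts. This is only a sketch on synthetic data: the slope of 0.4, the noise level, and the names `demo`, `low_spread`, and `high_spread` are all assumptions, not values from the WeRateDogs dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: retweets scale with favorites, and the noise
# grows in proportion to the favorite count (heteroscedastic by design).
rng = np.random.default_rng(0)
favorites = rng.uniform(100, 40_000, size=500)
retweets = 0.4 * favorites * (1 + rng.normal(0, 0.1, size=500))
demo = pd.DataFrame({"favorite_count": favorites, "retweet_count": retweets})

# Pearson correlation between the two counts.
r = demo["favorite_count"].corr(demo["retweet_count"])

# Fit a straight line, then measure how far points fall from it.
slope, intercept = np.polyfit(demo["favorite_count"], demo["retweet_count"], deg=1)
residuals = demo["retweet_count"] - (slope * demo["favorite_count"] + intercept)

# Compare residual spread below and above the median favorite count.
is_high = demo["favorite_count"] > demo["favorite_count"].median()
low_spread = residuals[~is_high].std()
high_spread = residuals[is_high].std()

print(f"r = {r:.3f}")
print(f"Residual std, low favorites:  {low_spread:.1f}")
print(f"Residual std, high favorites: {high_spread:.1f}")
```

A clearly larger spread in the upper half is consistent with heteroscedasticity; a formal test would be needed to say more.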

### Assessment of Distribution of Dog Type Relative to Averaged Rating

#### Insight:

It might be due to a small, non-representative sample (one whose rarity may have led to inflated values), but of the dogs described with Dogtionary taxonomy, 'puppos' have the highest mean averaged rating. They are followed by 'floofers', then 'doggos', and then 'puppers' by quite a large margin. That may be because 'puppers' appear more often and therefore have more opportunities to get stale. Alternatively, they are the youngest category and therefore have the most room to grow, especially after, for instance, having been caught being naughty.

In [None]:
# Reference: https://stackoverflow.com/questions/8671808/matplotlib-avoiding-overlapping-datapoints-in-a-scatter-dot-beeswarm-plot
plt.figure(figsize= (8,4))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.swarmplot(x='dog_type', y='avg_rating', data=df)
locs, labels = plt.yticks()
plt.yticks(locs, (locs*10).astype('int64'));
plt.xticks(rotation=30);
plt.title('Dog Types and Their Spread of Averaged Ratings')
plt.xlabel('Dog Type');
plt.ylabel('Averaged Rating');

In [None]:
print('Pupper - Mean Averaged Rating: ', df[df['pupper']==True]['avg_rating'].mean()*10)
print('Puppo - Mean Averaged Rating: ', df[df['puppo']==True]['avg_rating'].mean()*10)
print('Doggo - Mean Averaged Rating: ', df[df['doggo']==True]['avg_rating'].mean()*10)
print('Floofer - Mean Averaged Rating: ', df[df['floofer']==True]['avg_rating'].mean()*10)
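
Because the insight hinges on sample size, it may help to report each group's size next to its mean. A minimal sketch follows, using invented values; `df_demo` and `summary` are hypothetical names, and only the column names mirror the cells above.

```python
import pandas as pd

# Hypothetical stand-in data: boolean dog-type flags plus an averaged rating.
df_demo = pd.DataFrame({
    "pupper":  [True, True, True, True, False, False, False, False],
    "puppo":   [False, False, False, False, True, False, False, False],
    "doggo":   [False, False, False, False, False, True, True, False],
    "floofer": [False, False, False, False, False, False, False, True],
    "avg_rating": [1.0, 1.1, 1.2, 0.9, 1.3, 1.2, 1.1, 1.2],
})

# Collect the group size and the mean rating (rescaled to "out of 10") per dog type.
rows = []
for stage in ["pupper", "puppo", "doggo", "floofer"]:
    ratings = df_demo.loc[df_demo[stage], "avg_rating"]
    rows.append({"dog_type": stage,
                 "n": len(ratings),
                 "mean_out_of_10": ratings.mean() * 10})

summary = pd.DataFrame(rows).set_index("dog_type")
print(summary)
```

A mean based on one or two tweets, as for 'puppo' and 'floofer' in this toy table, deserves far less weight than one based on many.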

### Assessment of Averaged Rating vs Retweet Count, then vs Favorite Count, all while incorporating Dog Types

#### Insight:

We see that as ratings increase, so do favorites and retweets. However, it is hard to tell from the plots how much more 'pup'-ular the different dog types are compared to one another, so some calculations were in order.

Since favorite counts and retweet counts appear to rise exponentially, the mean for any dog type would be skewed severely by posts that had gone viral. Looking instead at the median number of favorites and retweets for each dog type, we see that puppers are abysmally low relative to the rest of the pack. My unsubstantiated speculation is that, for the followers of WeRateDogs, the novelty of 'puppers' had worn off.

In [None]:
# First plot: averaged rating vs retweet count (the favorite-count plot follows in the next cell)
sns.pairplot(x_vars=['avg_rating'], y_vars=['retweet_count'], data=df, hue = 'dog_type', height = 6);
plt.xlim([0.25,1.5]);
plt.ylim([0,55000]);
locs, labels = plt.xticks()
plt.xticks(locs, (locs*10).astype('int64'));

In [None]:
sns.pairplot(x_vars=['avg_rating'], y_vars=['favorite_count'], data=df, hue = 'dog_type', height = 6);
plt.xlim([0.25,1.5]);
plt.ylim([0,55000]);
locs, labels = plt.xticks()
plt.xticks(locs, (locs*10).astype('int64'));

In [None]:
print('Pupper - Median Retweet Count: ', df[df['pupper']==True]['retweet_count'].median())
print('Puppo - Median Retweet Count: ', df[df['puppo']==True]['retweet_count'].median())
print('Doggo - Median Retweet Count: ', df[df['doggo']==True]['retweet_count'].median())
print('Floofer - Median Retweet Count: ', df[df['floofer']==True]['retweet_count'].median())

In [None]:
print('Pupper - Median Favorite Count: ', df[df['pupper']==True]['favorite_count'].median())
print('Puppo - Median Favorite Count: ', df[df['puppo']==True]['favorite_count'].median())
print('Doggo - Median Favorite Count: ', df[df['doggo']==True]['favorite_count'].median())
print('Floofer - Median Favorite Count: ', df[df['floofer']==True]['favorite_count'].median())
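
The reason for preferring the median here can be illustrated with a toy example. The numbers below are invented: four ordinary posts and one hypothetical viral post.

```python
import pandas as pd

# Made-up favorite counts: four typical posts and one viral post.
favorite_counts = pd.Series([1000, 1200, 900, 1100, 250000])

mean_favorites = favorite_counts.mean()
median_favorites = favorite_counts.median()

print(f"Mean:   {mean_favorites:.0f}")
print(f"Median: {median_favorites:.0f}")
```

The single viral post drags the mean far above what any typical post receives, while the median stays close to the ordinary posts.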

Thank you for taking the time to read this report!
I hope you found that there was something of interest to you along the way. 