## Report: act_report
* Create a **250-word-minimum written report** called "act_report.pdf" or "act_report.html" that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

# Report

After the data was cleaned and stored as "twitter_archive_master", the following were analyzed:

1. Top ten months with the highest favorite_count
2. Top ten months with the highest retweet_count
3. Influence of the day on the favorite_count
4. Finding the correlation between the Favorite and Retweet counts

# Insights

1. favorite_count and retweet_count both peak in June. This can plausibly be attributed to the dog festival that normally occurs during this period. January and December rank second and third, respectively, for both favorite_count and retweet_count. This may be due to an increase in festive activities during these periods.

2. Saturday usually has the highest favorite_count, followed by Friday. This is probably because most people do not work on weekends and have the time to scroll through Twitter.

3. As expected, the correlation between favorite_count and retweet_count is strong and positive (0.86). Hence, tweets that are favorited more often also tend to be retweeted more often.

4. On the other hand, the correlations of favorite_count and retweet_count with the ratings are very weak: positive for rating_numerator and negative for rating_denominator.

# Recommendations

1. It is preferable to publish posts on Fridays and Saturdays.
2. Dog events should be hosted around June, December, or January.
3. Another factor should be used to predict the likelihood of retweets, as the numerator and denominator ratings are not effective predictors.

## Write function for the visualization

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('twitter_archive_master.csv')

In [None]:
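def barhplot(x, y, xlabel, title):
    """Draw a horizontal bar chart with the first entry at the top."""
    plt.figure(figsize=(14, 8))
    plt.barh(x, y, align='center')
    # barh places the first entry at the bottom by default, so flip the axis
    plt.gca().invert_yaxis()
    plt.xlabel(xlabel, fontsize=18)
    plt.title(title, fontsize=18)
    plt.show()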

## Top ten months with the highest favorite_count

In [1]:
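# Sort tweets from most to least favorited and keep the ten highest.
# Rows are individual tweets, so the same month can appear more than once.
top = df.sort_values(by='favorite_count', ascending=False)
top_10 = top[['tweet_id', 'source', 'favorite_count', 'month']].head(10)
top_10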

NameError: name 'df' is not defined

In [None]:
barhplot(top_10.month, top_10.favorite_count, "Favorite Count", "Months with the highest favorite count")
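
The cells above rank individual tweets, so the chart shows the months of the ten most-favorited tweets rather than per-month totals. One way to rank the months themselves is to aggregate before sorting. The following is only a minimal sketch, not the method used to produce the insights in this report. The `sample` frame and the `top_groups` helper are hypothetical, and the sketch assumes the `month` column holds one label per tweet.

```python
import pandas as pd

# Hypothetical stand-in for twitter_archive_master.csv; the values are
# made up purely to show the shape of the computation.
sample = pd.DataFrame({
    "month": ["June", "June", "January", "December", "January"],
    "favorite_count": [100, 300, 150, 120, 50],
})


def top_groups(frame, group_col, value_col, n=10):
    """Return the n groups with the largest total of value_col, largest first."""
    totals = frame.groupby(group_col)[value_col].sum()
    return totals.sort_values(ascending=False).head(n)


monthly_favorites = top_groups(sample, "month", "favorite_count")
print(monthly_favorites)
```

With the real data, `top_groups(df, 'month', 'favorite_count')` would return a Series whose index and values could be passed to `barhplot`. Summing is one choice among several; a mean per month would reduce the effect of months that simply contain more tweets.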

## Top ten months with the highest retweet_count

In [None]:
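# Sort tweets from most to least retweeted and keep the ten highest.
# Rows are individual tweets, so the same month can appear more than once.
top = df.sort_values(by='retweet_count', ascending=False)
top_10 = top[['tweet_id', 'source', 'text', 'retweet_count', 'month']].head(10)
top_10.head()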

In [None]:
barhplot(top_10.month, top_10.retweet_count, "Retweet Count", "Months with the highest retweet count")

## Influence of the day on favorite_count

In [None]:
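# Keep the ten most-favorited tweets together with the day each was posted.
# (The earlier groupby result was overwritten before use, so it is dropped.)
top = df.sort_values(by='favorite_count', ascending=False)
top_10 = top[['tweet_id', 'source', 'text', 'favorite_count', 'day']].head(10)
top_10.head()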

In [None]:
barhplot(top_10.day, top_10.favorite_count, "Favorite Count", "Influence of the day on the favorite count")
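
Looking only at the ten most-favorited tweets says little about a typical day. A complementary view is the average favorite_count for each day of the week, listed in calendar order. This is a hedged sketch under stated assumptions: the `sample` frame, the `DAY_ORDER` list, and the `mean_by_day` helper are hypothetical, and the `day` column is assumed to hold English day names.

```python
import pandas as pd

# Assumed spelling of the values in the `day` column.
DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical stand-in data; the numbers are illustrative only.
sample = pd.DataFrame({
    "day": ["Saturday", "Saturday", "Friday", "Monday"],
    "favorite_count": [400, 200, 250, 100],
})


def mean_by_day(frame, value_col="favorite_count"):
    """Return the mean of value_col per day, ordered Monday to Sunday."""
    means = frame.groupby("day")[value_col].mean()
    # reindex puts the days in calendar order; days missing from the
    # data show up as NaN instead of being silently dropped.
    return means.reindex(DAY_ORDER)


daily = mean_by_day(sample)
print(daily)
```

Using the mean instead of the top rows keeps a single viral tweet from deciding which day looks best, although a median would be more robust still.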

## Finding the correlation between the Favorite and Retweet counts

In [None]:
df.favorite_count.corr(df.retweet_count)

## Visualization

In [None]:
plt.figure(figsize=(15, 13))
ax = plt.axes()
ax.scatter(df.favorite_count, df.retweet_count)

ax.set_xlabel('Favorite Count')
ax.set_ylabel('Retweet Count')
ax.set_title('Correlation Between the Favorite and Retweet Counts')

ax.axis('tight')

plt.show()

In [None]:
df.favorite_count.corr(df.rating_numerator)

In [None]:
df.favorite_count.corr(df.rating_denominator)

In [None]:
df.retweet_count.corr(df.rating_numerator)

In [None]:
df.retweet_count.corr(df.rating_denominator)
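
The five pairwise calls above can also be collected into a single correlation matrix, which makes it easier to compare the coefficients side by side. The sketch below uses a small hypothetical `sample` frame with made-up values. Only the column names are taken from this notebook, and `DataFrame.corr` uses the Pearson method by default, the same default as `Series.corr`.

```python
import pandas as pd

# Hypothetical stand-in data; the values are illustrative only and are
# not drawn from twitter_archive_master.csv.
sample = pd.DataFrame({
    "favorite_count": [10, 20, 30, 40],
    "retweet_count": [1, 3, 5, 7],
    "rating_numerator": [12, 10, 13, 11],
    "rating_denominator": [10, 10, 20, 10],
})

cols = ["favorite_count", "retweet_count",
        "rating_numerator", "rating_denominator"]

# Pairwise Pearson correlation between every pair of the listed columns.
corr_matrix = sample[cols].corr()
print(corr_matrix.round(2))
```

With the real data, `df[cols].corr()` would give the same coefficients as the individual `Series.corr` calls, arranged in one table. A column with no variation at all produces NaN entries, since its correlation is undefined.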