# Summaries

## Cleaning:

This project analyzed a single dataset that came relatively organized and needed little cleaning. My only actions in cleaning were:
- Renaming columns
- Filling missing values with 0s
- Creating an engagement_score column that weights reactions at 0.25, comments at 0.5, and shares at 1.0
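
The three steps above can be sketched roughly as follows. This is a minimal reconstruction, not the original cleaning code: the function name `clean_posts` and the `WEIGHTS` mapping are assumptions for illustration, and the sample rows are the first three rows of the dataset as shown in this notebook.

```python
import pandas as pd

# Weights taken from the description above; the dict itself is an assumption.
WEIGHTS = {"reaction_count": 0.25, "comment_count": 0.5, "share_count": 1.0}


def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to snake_case, zero-fill counts, add engagement_score."""
    out = df.copy()
    # "Post URL" -> "post_url", "Category" -> "category", and so on.
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    count_cols = list(WEIGHTS)
    # Only the count columns are filled; "debate" keeps its missing values.
    out[count_cols] = out[count_cols].fillna(0)
    # Weighted sum of the three count columns.
    out["engagement_score"] = sum(out[col] * w for col, w in WEIGHTS.items())
    return out


# First three rows of the original data (only the relevant columns).
sample = pd.DataFrame({
    "Post Type": ["video", "link", "link"],
    "Debate": [None, None, None],
    "share_count": [float("nan"), 1.0, 34.0],
    "reaction_count": [146.0, 33.0, 63.0],
    "comment_count": [15.0, 34.0, 27.0],
})
cleaned_sample = clean_posts(sample)
print(cleaned_sample[["share_count", "engagement_score"]])
```

The resulting scores (44.0, 26.25, 63.25) match the engagement_score values in the cleaned data shown in this section.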


The DataFrame started like this:

In [1]:
import pandas as pd
original_fb_df = pd.read_csv('csv_collection/facebook-fact-check.csv')
display(original_fb_df.head())
print(original_fb_df.info())
print(original_fb_df.isnull().sum())


Unnamed: 0,account_id,post_id,Category,Page,Post URL,Date Published,Post Type,Rating,Debate,share_count,reaction_count,comment_count
0,184096565021911,1035057923259100,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,video,no factual content,,,146.0,15.0
1,184096565021911,1035269309904628,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,link,mostly true,,1.0,33.0,34.0
2,184096565021911,1035305953234297,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,link,mostly true,,34.0,63.0,27.0
3,184096565021911,1035322636565962,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,link,mostly true,,35.0,170.0,86.0
4,184096565021911,1035352946562931,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,video,mostly true,,568.0,3188.0,2815.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2282 entries, 0 to 2281
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   account_id      2282 non-null   int64  
 1   post_id         2282 non-null   int64  
 2   Category        2282 non-null   object 
 3   Page            2282 non-null   object 
 4   Post URL        2282 non-null   object 
 5   Date Published  2282 non-null   object 
 6   Post Type       2282 non-null   object 
 7   Rating          2282 non-null   object 
 8   Debate          298 non-null    object 
 9   share_count     2212 non-null   float64
 10  reaction_count  2280 non-null   float64
 11  comment_count   2280 non-null   float64
dtypes: float64(3), int64(2), object(7)
memory usage: 214.1+ KB
None
account_id           0
post_id              0
Category             0
Page                 0
Post URL             0
Date Published       0
Post Type            0
Rating               0
Debate    

After cleaning it, it looked like this:

In [3]:
cleaned_df = pd.read_csv('csv_collection/cleaned_buzzfeed_data.csv')

display(cleaned_df.head(3))
print(cleaned_df.info())
print(cleaned_df.isnull().sum())

Unnamed: 0,account_id,post_id,category,page,post_url,date_published,post_type,rating,debate,share_count,reaction_count,comment_count,engagement_score
0,184096565021911,1035057923259100,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,video,no factual content,,0.0,146.0,15.0,44.0
1,184096565021911,1035269309904628,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,link,mostly true,,1.0,33.0,34.0,26.25
2,184096565021911,1035305953234297,mainstream,ABC News Politics,https://www.facebook.com/ABCNewsPolitics/posts...,2016-09-19,link,mostly true,,34.0,63.0,27.0,63.25


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2282 entries, 0 to 2281
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   account_id        2282 non-null   int64  
 1   post_id           2282 non-null   int64  
 2   category          2282 non-null   object 
 3   page              2282 non-null   object 
 4   post_url          2282 non-null   object 
 5   date_published    2282 non-null   object 
 6   post_type         2282 non-null   object 
 7   rating            2282 non-null   object 
 8   debate            298 non-null    object 
 9   share_count       2282 non-null   float64
 10  reaction_count    2282 non-null   float64
 11  comment_count     2282 non-null   float64
 12  engagement_score  2282 non-null   float64
dtypes: float64(4), int64(2), object(7)
memory usage: 231.9+ KB
None
account_id             0
post_id                0
category               0
page                   0
post_url           

## Explorations:

### Format Comparison:

This looks at:
- What formats of posts get the most engagement?
- What formats of posts are the most factual?

Total engagement scores:


<a href="summary_images/total_engagement.png" target="_blank">
  <img src="summary_images/total_engagement.png" alt="General Trend Graph" width="800" height="auto">
</a>


<a href="summary_images/total_counts.png" target="_blank">
  <img src="summary_images/total_counts.png" alt="General Trend Graph" width="800" height="auto">
</a>

These charts show that links are the most common posts, followed by videos and then photos, and that total engagement follows the same order. But let's look at average engagement now:

<a href="summary_images/average_engagement.png" target="_blank">
  <img src="summary_images/average_engagement.png" alt="General Trend Graph" width="800" height="auto">
</a>

Average and totals side by side:

<a href="summary_images/total_avg_combined.png" target="_blank">
  <img src="summary_images/total_avg_combined.png" alt="General Trend Graph" width="800" height="auto">
</a>

The above graphs show that link posts increase in engagement as their factuality decreases, photo posts decrease in engagement as their factuality decreases, and mostly true videos do not fare well at all.

I did some additional aggregations to see which combinations of factuality and post type scored best in terms of total and average engagement:

<a href="summary_images/aggregations.png" target="_blank">
  <img src="summary_images/aggregations.png" alt="General Trend Graph" width="800" height="auto">
</a>
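
An aggregation of this kind can be sketched with a two-key `groupby`. This is an illustrative version, not necessarily the code behind the image above: the function name `engagement_by_combo` is an assumption, and the toy rows are made up rather than taken from the dataset.

```python
import pandas as pd


def engagement_by_combo(df: pd.DataFrame) -> pd.DataFrame:
    """Post count, total and average engagement per rating/post_type pair."""
    return (
        df.groupby(["rating", "post_type"])["engagement_score"]
        .agg(posts="count", total="sum", average="mean")
        .sort_values("average", ascending=False)
    )


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "rating": ["mostly true", "mostly true", "mostly false", "mostly false"],
    "post_type": ["link", "link", "photo", "video"],
    "engagement_score": [10.0, 30.0, 50.0, 5.0],
})
combo_table = engagement_by_combo(toy)
print(combo_table)
```

Keeping the post count next to the total and the average makes it easier to tell whether a high total comes from many posts or from a few strong ones.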

Post-format summary:
- Posts with links are the most common, followed by videos then photos, and plain text posts are very rare.
- Posts with links gain the most engagement in total, followed by videos, then photos.
- Posts with videos gain the most engagement per post, followed by photos, then links.

Of these different types of posts, videos are the most factual, and posts with links and photos are equally factual.

In terms of factuality by itself:
- Mostly true posts generate the least engagement on average, and mixture of true and false posts gain the most.
- Mostly true posts are by far the most common, and mostly false posts are the least common.

When these two factors are combined, factuality and post type tend to produce predictable outcomes:
- Mostly true links are the most common.
- Mixture of true and false videos perform the best.

However, some combinations produce slightly surprising outcomes:
- Mostly false photos have a higher total engagement than expected, suggesting this is an especially common post combination.
- The average engagement of mixture of true and false videos is much greater than the sum of the average engagement scores for that rating and that post type on their own, making this a particularly powerful post type.
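
One way to read the last comparison is sketched below: the average for a rating/post-type combination is set against the average for that rating alone plus the average for that post type alone. The helper name `combo_vs_parts` is hypothetical, this is only one interpretation of the comparison, and the toy scores are made up.

```python
import pandas as pd


def combo_vs_parts(df: pd.DataFrame, rating: str, post_type: str) -> dict:
    """Compare one rating/post-type combination with each factor on its own."""
    score = df["engagement_score"]
    is_rating = df["rating"] == rating
    is_type = df["post_type"] == post_type
    rating_avg = score[is_rating].mean()          # all posts with this rating
    type_avg = score[is_type].mean()              # all posts of this type
    combo_avg = score[is_rating & is_type].mean() # posts matching both
    return {
        "rating_avg": rating_avg,
        "type_avg": type_avg,
        "combo_avg": combo_avg,
        "exceeds_sum": bool(combo_avg > rating_avg + type_avg),
    }


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "rating": ["mixture of true and false"] * 4 + ["mostly true"] * 3,
    "post_type": ["video", "link", "link", "link", "video", "video", "video"],
    "engagement_score": [500.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0],
})
comparison = combo_vs_parts(toy, "mixture of true and false", "video")
print(comparison)
```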

### Duplicating the BuzzFeed findings:

In [5]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.core.display import HTML
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)
from scipy.stats import spearmanr, kendalltau

buzzfeed_df = pd.read_csv('csv_collection/cleaned_buzzfeed_data.csv')

# Left calculation:
left_misleading = buzzfeed_df[(buzzfeed_df['category'] == 'left') & buzzfeed_df['rating'].isin(['mostly false', 'mixture of true and false' ])]
left_posts = buzzfeed_df[buzzfeed_df['category'] == 'left']
left_percent_misleading = (len(left_misleading) * 100) / len(left_posts)

# Right calculation:
right_misleading = buzzfeed_df[(buzzfeed_df['category'] == 'right') & buzzfeed_df['rating'].isin(['mostly false', 'mixture of true and false' ])]
right_posts = buzzfeed_df[buzzfeed_df['category'] == 'right']
right_percent_misleading = (len(right_misleading) * 100) / len(right_posts)

print(f"{left_percent_misleading:.3f}% of posts from left sources are mostly false or a mixture of true and false.")
print(f"{right_percent_misleading:.3f}% of posts from right sources are mostly false or a mixture of true and false.")
print('This duplicates the results from the BuzzFeed analysis.')

19.108% of posts from left sources are mostly false or a mixture of true and false.
37.688% of posts from right sources are mostly false or a mixture of true and false.
This duplicates the results from the BuzzFeed analysis.


### Rejecting the null hypothesis:

Grouping the data by factuality and affiliation and then graphing it gave me the graph below, which strongly indicated that I could reject my null hypothesis:

<a href="summary_images/null_hypothesis.png" target="_blank">
  <img src="summary_images/null_hypothesis.png" alt="General Trend Graph" width="800" height="auto">
</a>

I then ran Spearman's Rho and Kendall's Tau tests to check for correlations, and found these results:
- LEFT — Spearman: ρ = -0.500, p = 0.667 | Kendall: τ = -0.333, p = 1.000
- RIGHT — Spearman: ρ = 1.000, p = 0.000 | Kendall: τ = 1.000, p = 0.333

For the Left publications, Spearman's Rho and Kendall's Tau are both negative, meaning that as factuality decreases, average engagement tends to decrease as well. However, both p values are well above 0.05, so the result is not statistically significant; it only hints at a possible correlation.

For the Right publications, the Rho and Tau results show perfect 1.0 correlations, with a very low p value from Spearman but not from Kendall. These values are consistent with a test run on only three factuality levels per group, where a perfectly monotonic ordering yields ρ = 1.000 and Kendall's exact p value cannot fall below 0.333, so the Spearman p value should be read with caution. Even so, the direction is clear: Right-leaning publications tend to gain more engagement on average the less factual they are.

As shown by the Spearman and Kendall tests and the clear trends in the graph above, the correlation between factuality and post engagement differs between Right- and Left-leaning publications. On that basis I reject the null hypothesis in favor of the alternative hypothesis.
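
The correlation tests in this section can be sketched as follows. This is a minimal version under stated assumptions, not the original cell: the ordering in `RATING_ORDER`, the exclusion of "no factual content" posts, and the helper name `factuality_correlation` are all assumptions, and the toy scores are made up. `spearmanr` and `kendalltau` are the same SciPy functions imported earlier in this notebook.

```python
import pandas as pd
from scipy.stats import kendalltau, spearmanr

# Assumed ordering from most to least factual; "no factual content" is left out.
RATING_ORDER = ["mostly true", "mixture of true and false", "mostly false"]


def factuality_correlation(df: pd.DataFrame, category: str) -> dict:
    """Correlate factuality rank with mean engagement for one affiliation."""
    subset = df[(df["category"] == category) & df["rating"].isin(RATING_ORDER)]
    means = (
        subset.groupby("rating")["engagement_score"].mean().reindex(RATING_ORDER)
    )
    ranks = list(range(len(RATING_ORDER)))  # 0 = most factual
    rho, rho_p = spearmanr(ranks, means.to_numpy())
    tau, tau_p = kendalltau(ranks, means.to_numpy())
    return {"rho": rho, "rho_p": rho_p, "tau": tau, "tau_p": tau_p}


# Made-up scores for illustration only, not values from the dataset.
toy = pd.DataFrame({
    "category": ["right"] * 6,
    "rating": [
        "mostly true", "mostly true",
        "mixture of true and false", "mixture of true and false",
        "mostly false", "mostly false",
    ],
    "engagement_score": [10.0, 20.0, 40.0, 60.0, 90.0, 110.0],
})
result = factuality_correlation(toy, "right")
print(result)
```

With only three points per group, a perfectly increasing sequence gives ρ = 1 and τ = 1, and the smallest two-sided p value Kendall's exact test can return is 2/3! ≈ 0.333.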

## Conclusion

Exploring and testing this data delivered some expected results and some meaningful insights into how Facebook posts generate engagement. Some types of posts fare better the more factual they are, and some do the reverse. Mostly true posts accrue the most engagement in total, but that is mostly because they are the most common; it is the mixture of true and false posts that fare best on average. Videos rated as a mixture of true and false gain by far the most engagement on average, which shows their ability to go viral and spread disinformation. Combined with what I found when testing the null hypothesis, this suggests that this type of post is the most impactful on average for right-leaning consumers of Facebook posts.

Being able to duplicate the BuzzFeed findings further drives this point home: not only do right-leaning posts generate more engagement as their factuality decreases, but the **percent** of right-leaning posts that are less factual is also higher than for left-leaning posts (37.688% versus 19.108%). This compounds the disinformation effect, suggesting that Facebook spread right-leaning disinformation at a much higher rate than left-leaning disinformation. It would be interesting to find more current data on Facebook factuality to see how its spread of misinformation has evolved over time.