# What's in a Headline? Sentiment and Bias in Online Headlines


## Summary
The importance of understanding bias in media cannot be overstated. The divide between the political left and right has grown in recent years, due at least in part to the consumption of partisan media fed to users by social media algorithms. Furthermore, around 75% of shared news articles are passed on [based on their headlines alone](https://www.nature.com/articles/s41562-024-02067-4), without the user ever having read the article. This project therefore explores the relationship between bias, reliability, and tone in the headlines of popular media outlets. Using headline data from [NewsAPI](https://newsapi.org/) and bias and reliability ratings from the [Ad Fontes Media Bias Chart](https://app.adfontesmedia.com/chart/interactive?utm_source=adfontesmedia&utm_medium=website), the goal is to analyze whether headline sentiment tends to correlate with the political bias or reliability of media sources.

## Data Sourcing
### NewsAPI
[NewsAPI](https://newsapi.org/) is an HTTP REST API that can be used to search for and retrieve articles from more than 100 news sources from all over the internet.
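As a rough illustration of what a request to NewsAPI looks like, the sketch below builds a call to the sources endpoint using only the standard library. It is not the code in `fetch_newsapi_sources.py`. The helper names (`build_sources_request`, `parse_sources`, `fetch_sources`) are hypothetical, and the endpoint path and `X-Api-Key` header reflect the public NewsAPI v2 documentation as best understood here.

```python
import json
import urllib.request

# Assumed endpoint for listing sources in NewsAPI v2.
SOURCES_URL = "https://newsapi.org/v2/top-headlines/sources"


def build_sources_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request for the NewsAPI source list."""
    # The key is passed in a header so it does not end up in logged URLs.
    return urllib.request.Request(SOURCES_URL, headers={"X-Api-Key": api_key})


def parse_sources(payload: dict) -> list[dict]:
    """Return the list of source objects from a decoded response body."""
    if payload.get("status") != "ok":
        message = payload.get("message", "unknown error")
        raise RuntimeError(f"NewsAPI error: {message}")
    return payload.get("sources", [])


def fetch_sources(api_key: str) -> list[dict]:
    """Send the request and return the parsed sources (needs network access)."""
    request = build_sources_request(api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_sources(json.load(response))
```

Calling `fetch_sources("YOUR_API_KEY")` with a valid key should return a list of dictionaries, one per source, which can then be written to disk with `json.dump`.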

### Ad Fontes
[Ad Fontes](https://adfontesmedia.com/?utm_source=WEBAPP) is a public benefit corporation based in Colorado whose stated aim is to ["rate all the news to positively transform society"](https://adfontesmedia.com/about-ad-fontes-media/). Using analysts with political views spanning the whole of the political spectrum, Ad Fontes assigns each article or outlet a rating for both *bias* and *reliability*.

Politically neutral sources have a bias score close to 0, while those skewing left and right have negative and positive scores respectively. The magnitude of the score indicates the degree of bias (i.e. the higher the absolute value of the score, the more biased the source is).

Reliability is rated on a scale of 0 to 64, with the highest scores given to outlets with fact-based reporting and a high degree of effort applied towards neutrality. Toward the middle of the scale, media sources are less reliable in their reporting or publish a mixture of fact-based and opinion-based articles. Finally, sources at the bottom of the scale contain information that is misleading, inaccurate, or fabricated.
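The two scales above can be read programmatically along these lines. This is a minimal sketch, not code from this project: the function names are hypothetical, and the `neutral_band` cutoff for what counts as "close to 0" is an arbitrary assumption rather than an Ad Fontes threshold.

```python
RELIABILITY_MAX = 64.0  # top of the reliability scale described above


def bias_direction(score: float, neutral_band: float = 6.0) -> str:
    """Classify a bias score as left, right, or neutral.

    neutral_band is an assumed cutoff chosen for illustration only.
    """
    if abs(score) <= neutral_band:
        return "neutral"
    return "left" if score < 0 else "right"


def bias_strength(score: float) -> float:
    """Degree of bias regardless of direction (absolute value of the score)."""
    return abs(score)


def normalized_reliability(score: float) -> float:
    """Rescale a 0-64 reliability score to the range 0.0-1.0."""
    if not 0 <= score <= RELIABILITY_MAX:
        raise ValueError(f"reliability must be between 0 and 64, got {score}")
    return score / RELIABILITY_MAX
```

Separating direction from strength makes it possible to ask two different questions of the headlines: whether tone differs between left and right, and whether it differs between more and less biased outlets overall.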



## Data Collection

1. `fetch_newsapi_sources.py` - Retrieves the full list of news sources from NewsAPI and saves them in JSON format to `data/raw/newsapi_sources.json`.
2. `ad_fontes_sources.json` - A JSON-formatted list of news sources with their associated bias and reliability scores from Ad Fontes. Because Ad Fontes only provides its data on paid tiers, this list was compiled manually for the sources that overlap with NewsAPI.
3. `merge_sources.py` - Merges the sources from NewsAPI with their bias and reliability scores from Ad Fontes and saves them in JSON format to `data/raw/merged_sources.json`.
4. `fetch_hl_objects_from_api.py` - Fetches the headline/article objects from NewsAPI and saves them in JSON format to `data/raw/headline_objects.json`. This script can be executed multiple times: new data is integrated with the existing data, and duplicates are overwritten. All data for this analysis was collected between 10/4/2025 and 10/10/2025.
5. `flatten_articles.py` - Flattens all article objects into a CSV file saved to `data/flattened_articles.csv`.
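The merge, de-duplication, and flattening steps above could look roughly like the following. This is a sketch under stated assumptions, not the contents of the scripts themselves: the function names are hypothetical, the `name`, `bias`, and `reliability` keys of the Ad Fontes file are assumed, and matching on source name and de-duplicating on article `url` are illustrative choices. The article fields (`source`, `title`, `url`, `publishedAt`) follow the NewsAPI article object.

```python
def merge_sources(newsapi_sources: list[dict], ad_fontes_sources: list[dict]) -> list[dict]:
    """Attach bias and reliability scores to NewsAPI sources that have a rating."""
    # Assumed schema: each Ad Fontes entry has "name", "bias", "reliability".
    ratings = {entry["name"].lower(): entry for entry in ad_fontes_sources}
    merged = []
    for source in newsapi_sources:
        rating = ratings.get(source["name"].lower())
        if rating is None:
            continue  # no Ad Fontes rating available for this outlet
        merged.append({**source, "bias": rating["bias"], "reliability": rating["reliability"]})
    return merged


def integrate_articles(existing: list[dict], new: list[dict]) -> list[dict]:
    """Combine two article lists; a new article overwrites an old one with the same URL."""
    by_url = {article["url"]: article for article in existing}
    for article in new:
        by_url[article["url"]] = article
    return list(by_url.values())


def flatten_article(article: dict) -> dict:
    """Turn a nested article object into a flat row suitable for csv.DictWriter."""
    source = article.get("source") or {}
    return {
        "source_id": source.get("id"),
        "source_name": source.get("name"),
        "title": article.get("title"),
        "url": article.get("url"),
        "published_at": article.get("publishedAt"),
    }
```

Keying the stored articles by URL is what makes repeated fetches safe: running the collection step again adds unseen articles and replaces any that were already stored.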