# MediaCloud

<div style="text-align:center"><img src="png/mediacloud.png" /></div>

## What is MediaCloud?

[Media Cloud](https://www.mediacloud.org) is an open-source media research project, enabling the study of news and information flow globally. The project is administered as a consortium collaboration between the Media Ecosystems Analysis Group, the University of Massachusetts Amherst, and Northeastern University. It was originally incubated at Harvard University and the Massachusetts Institute of Technology.

In simple terms, it is a platform that **collects** and analyzes news articles from various sources, allowing researchers to study media trends, narratives, and the spread of information over time. It provides tools for data collection, text analysis, and visualization, making it a valuable resource for journalists, academics, and anyone interested in understanding the dynamics of media coverage.

However, Media Cloud is **not** a repository of news articles. Instead, it provides **metadata** and **text analysis** of articles collected from various sources. The actual articles themselves are not stored within Media Cloud; rather, the platform focuses on analyzing the content and patterns within the media landscape.

## Media Cloud API

You can interact with Media Cloud either through their web interface or programmatically using their API. The API allows you to query the Media Cloud database, retrieve metadata about articles, and perform various analyses. In general, there are two main components of the Media Cloud API:

1. Media Cloud API: Search against Media Cloud's new Online News Archive, with access to 200 million+ stories and growing every day. Search against hundreds of sources and collections we have developed.
2. Wayback Machine: Search against the Wayback Machine’s database through an API we developed to be able to search against the large number of sources and collections we have developed. Search is against title text.

In this notebook, we will focus on the Media Cloud API. Before using it programmatically, we will start by briefly exploring the web interface.

## Setting Up an Account

To use the Media Cloud API, you need to set up an account and obtain an API key. Follow these steps:

1. Visit the [Media Cloud website](https://www.mediacloud.org) and sign up for an account.
2. Once you are logged in, navigate to your profile settings and press Request API Access (see the image below).

<div style="text-align:center"><img src="png/mediacloud_api.png" /></div>

3. After requesting access, you will receive an email with a link to confirm your API access. Click the link to confirm.
4. Once confirmed, you can find your API key in your profile settings (see the image below).

<div style="text-align:center"><img src="png/mediacloud_api_key.png" /></div>

In the picture above, I hid some information because it contains the credentials (the API key) you will use to tell Media Cloud who you are. You should never share them with anyone, even your spouse or a firefighter! That is because they serve to identify you. If someone misuses them, it will be on you.

## Storing your credentials

There are multiple ways to store your credentials and passwords safely. We don't want them to be compromised, right? However, it is one thing to store them [safely](https://youtu.be/MnjQV--o1-0?si=hIlgl9sCyt4JhVUd) and another to have [strong passwords](https://youtu.be/mQ36sUT77qI?si=hxRw4O4UxKM_WUPy). We all know that we should use strong passwords, but do we really know why? The picture below shows how fast one can crack your password depending on its complexity.

<div style="text-align:center"><img src="png/password_table_2023.jpg"/></div>

Anyhow, the lesson we should take from the graph above is twofold:

1. Use strong passwords.
2. Use password managers to propose strong passwords and store them.

If, for any reason, you are still reluctant to trust password managers, at least create complex passwords by mixing nonsense words (it is the only place where making spelling errors helps) and special characters, for example:

>`$eating#keyborads-1ncreases_staminA`

In our case, we have already generated passwords and credentials which look pretty strong. How are we going to store them?

### Environmental variables

As you probably rightly suspect, in our case, we will need our credentials to connect to the API. We don't really want to store them in the notebook because we want to be able to share the notebook (you want to share it with me and I want to share it with you). We don't want to copy and paste them every time we use the notebook either, because that would be very inefficient and easy to forget. What are we going to do then?

We are going to use something called environmental variables. In other words, we are going to define variables either on our computer or in Colab, where they will be stored. In the notebook, we will just retrieve them by their names. In Colab, we do this by clicking the key icon in the left-hand sidebar. We need to define one variable:

* `API_KEY` -- this is our API Key.

In [None]:
## Load module
from google.colab import userdata

## Retrieve our environmental variables and assign them to names.
API_KEY = userdata.get("API_KEY")
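
The cell above only works inside Colab. If you run the notebook on your own computer, a possible fallback is sketched below. It assumes you have defined an ordinary environment variable with the same name (`API_KEY`); the helper `get_api_key` is a hypothetical name introduced here, not part of any library.

```python
import os


def get_api_key(name="API_KEY"):
    """Return the secret `name` from Colab if available, else from the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(name)

    # Outside Colab: fall back to an ordinary environment variable.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return value
```

With this helper, `API_KEY = get_api_key()` would work in both places, and the key itself still never appears in the notebook.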

## Media Cloud Module

So, now that we finally have our credentials in the notebook, what are we going to do next? We need to pass them somehow in a request to the Media Cloud API, right? Intuitively, we would do it with a payload and the `requests` module. This is a good intuition but, fortunately, we don't really have to do it this way. That is because many APIs have so-called wrappers: modules that allow us to connect to an API and send requests. We could still do it through our web browser, but the URL would be much more complicated.

That is why, in the case of Media Cloud, we will use the `mediacloud` module. It will serve us to connect to and get data from Media Cloud.
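
To get a feeling for what a wrapper saves us from, here is a rough sketch of the manual route: encoding a payload into a URL by hand. The base URL and the parameter names (`q`, `start`, `end`) are placeholders made up for illustration; they are not Media Cloud's real endpoint.

```python
from urllib.parse import urlencode

# Placeholder address for illustration only, NOT Media Cloud's real endpoint.
BASE_URL = "https://api.example.org/search"


def build_request_url(query, start, end):
    """Encode the search parameters (the payload) into a URL query string."""
    # The parameter names below are hypothetical.
    payload = {"q": query, "start": start, "end": end}
    return f"{BASE_URL}?{urlencode(payload)}"


url = build_request_url("'Caitlin Clark'", "2025-01-01", "2025-06-30")
print(url)
```

Even for this tiny query, the quotes and the space have to be escaped, which is exactly the kind of bookkeeping a wrapper does for us.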

In [None]:
## Install mediacloud module
!pip install mediacloud

In [None]:
## Import module
import mediacloud.api
import datetime as dt

## Connect to Media Cloud
search_api = mediacloud.api.SearchApi(API_KEY)

Once we have established a connection to Media Cloud, let's try to get some data. We will start by searching for the number of articles that contain the name of the basketball player [Caitlin Clark](https://en.wikipedia.org/wiki/Caitlin_Clark). There are multiple ways to construct the query string. How to do it is described in the [documentation](https://www.mediacloud.org/documentation/query-guide). Here, we will use the simplest form -- we will just type her name in quotes.

In [None]:
## The method story_count returns a dictionary with the number of stories that matched the query.
clark_all = search_api.story_count(
    query="'Caitlin Clark'", start_date=dt.date(2025, 1, 1), end_date=dt.date.today()
)
clark_all
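
Richer queries can be put together from smaller parts. The sketch below assumes, based on the query guide linked above, that phrases go in double quotes and that `AND` and `OR` are accepted as operators; check the guide before relying on it. The helper names are hypothetical.

```python
def quote_phrase(phrase):
    """Wrap a multi-word phrase in double quotes so it is matched as a unit."""
    return f'"{phrase}"'


def any_of(phrases):
    """Combine phrases with OR, grouped in parentheses."""
    return "(" + " OR ".join(quote_phrase(p) for p in phrases) + ")"


def all_of(parts):
    """Combine already-built query parts with AND."""
    return " AND ".join(parts)


# Two spellings of the same name, so that either one matches.
variants = any_of(["Aitana Bonmati", "Aitana Bonmatí"])
query = all_of([variants, "football"])
print(query)
```

The resulting string can be passed as the `query` argument in the same way as the simple name above.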

What is more, we can specify a source to search in, for example, the New York Times. However, to do that we need to know its media ID. We can find it through the Media Cloud web interface.

In [None]:
## Restrict the search to one source by its media ID (here, the New York Times).
clark_all_ny = search_api.story_count(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    source_ids=[1],
)
clark_all_ny
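
The raw number is easier to interpret as a share of everything the source published. A minimal sketch follows; it assumes the returned dictionary has a `total` key next to `relevant`, and it uses made-up numbers rather than real results.

```python
def share_of_coverage(counts):
    """Return matching stories as a fraction of all stories in the searched sources."""
    total = counts["total"]
    if total == 0:
        # Avoid dividing by zero when the source published nothing in the period.
        return 0.0
    return counts["relevant"] / total


# Made-up numbers shaped like the dictionary returned by story_count.
example_counts = {"relevant": 120, "total": 48000}
print(f"{share_of_coverage(example_counts):.4%}")
```

Calling `share_of_coverage(clark_all_ny)` would give the same kind of figure for the actual query.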

You can also see this number split by day.

In [None]:
## The method story_count_over_time returns a list with one entry per day (date, count, ratio).
clark_ny = search_api.story_count_over_time(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    source_ids=[1],
)
clark_ny
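
Daily numbers can be noisy, so it sometimes helps to sum them by month. The sketch below assumes each entry is a dictionary with a `date` that is a `datetime.date` object and an integer `count`; the rows are made up for illustration.

```python
import datetime as dt
from collections import defaultdict


def monthly_totals(daily):
    """Sum daily counts into (year, month) buckets, sorted chronologically."""
    totals = defaultdict(int)
    for item in daily:
        day = item["date"]
        totals[(day.year, day.month)] += item["count"]
    return dict(sorted(totals.items()))


# Made-up rows shaped like the entries of clark_ny.
example_days = [
    {"date": dt.date(2025, 1, 30), "count": 2, "ratio": 0.004},
    {"date": dt.date(2025, 1, 31), "count": 5, "ratio": 0.010},
    {"date": dt.date(2025, 2, 1), "count": 1, "ratio": 0.002},
]
print(monthly_totals(example_days))
```

Running `monthly_totals(clark_ny)` would produce the same summary for the real data.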

### Exercise

Find the day this year when Aitana Bonmati was mentioned the most in the New York Times.

In [None]:
## YOUR CODE

Now, we can also search through a collection of sources. Media Cloud has a set of collections of sources that are specific to a given country. For some reason, this works best with English-speaking countries, but at this point it should not come as a big surprise. You can find their IDs through the Media Cloud web interface.

In [None]:
## Search a whole collection of sources by passing collection_ids instead of source_ids.
clark_collection = search_api.story_count_over_time(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    collection_ids=[34412234],
)
clark_collection

In [None]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

x = [item["date"] for item in clark_ny]
y_count = [item["count"] for item in clark_ny]
y_ratio = [item["ratio"] for item in clark_ny]

fig, axs = plt.subplots(figsize=(9, 5), nrows=2)
axs[0].plot(x, y_count, color="#C8102E")
axs[0].set_ylabel("Count")
axs[1].plot(x, y_ratio, color="#041E42")
axs[1].set_ylabel("Ratio")
for ax in axs:
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.suptitle(
    f"Dynamics of the number of articles on Caitlin Clark in NY Times in 2025\n(N = {clark_all_ny['relevant']})"
)