# MediaCloud

<div style="text-align:center"><img src="png/mediacloud.png" /></div>

## What is MediaCloud?

[Media Cloud](https://www.mediacloud.org) is an open-source media research project, enabling the study of news and information flow globally. The project is administered as a consortium collaboration between the Media Ecosystems Analysis Group, the University of Massachusetts Amherst, and Northeastern University. It was originally incubated at Harvard University and the Massachusetts Institute of Technology.

In simple terms, it is a platform that **collects** and analyzes news articles from various sources, allowing researchers to study media trends, narratives, and the spread of information over time. It provides tools for data collection, text analysis, and visualization, making it a valuable resource for journalists, academics, and anyone interested in understanding the dynamics of media coverage.

However, Media Cloud is **not** a repository of news articles. Instead, it provides **metadata** and **text analysis** of articles collected from various sources. The actual articles themselves are not stored within Media Cloud; rather, the platform focuses on analyzing the content and patterns within the media landscape.

## Media Cloud API

You can interact with Media Cloud either through their web interface or programmatically using their API. The API allows you to query the Media Cloud database, retrieve metadata about articles, and perform various analyses. In general, there are two main components of the Media Cloud API:

1. Media Cloud API: Search against Media Cloud's new Online News Archive, with access to 200 million+ stories and growing every day. Search against hundreds of sources and collections we have developed.
2. Wayback Machine: Search against the Wayback Machine's database through an API we developed to be able to search against the large number of sources and collections we have developed. Search is against the title text.

In this notebook, we will focus on the Media Cloud API. However, first, we will start by briefly exploring the web interface.

## Setting Up an Account

To use the Media Cloud API, you need to set up an account and obtain an API key. Follow these steps:

1. Visit the [Media Cloud website](https://www.mediacloud.org) and sign up for an account.
2. Once you are logged in, navigate to your profile settings and press Request API Access (see the image below).

<div style="text-align:center"><img src="png/mediacloud_api.png" /></div>

3. After requesting access, you will receive an email with a link to confirm your API access. Click the link to confirm.
4. Once confirmed, you can find your API key in your profile settings (see the image below).

<div style="text-align:center"><img src="png/mediacloud_api_key.png" /></div>

In the picture above, I hid some information because those are the credentials (API key) you will use to tell Media Cloud who you are. You should never share them with anyone, even your spouse or a firefighter! That is because they identify you. If someone misuses them, it will be on you.

## Storing your credentials

There are multiple ways to store your credentials and passwords safely. We don't want them to be compromised, right? However, it is one thing to store them [safely](https://youtu.be/MnjQV--o1-0?si=hIlgl9sCyt4JhVUd) and another to have [strong passwords](https://youtu.be/mQ36sUT77qI?si=hxRw4O4UxKM_WUPy). We all know that we should use strong passwords, but do we really know why? The picture below shows how fast one can crack your password depending on its complexity.

<div style="text-align:center"><img src="png/password_table_2023.jpg"/></div>

Anyhow, the lesson we should take from the picture above is twofold:

1. Use strong passwords.
2. Use password managers to propose strong passwords and store them.

If, for any reason, you are still reluctant to trust password managers, at least create complex passwords by mixing nonsense words (it is the only place where making spelling errors helps) and special characters, for example:

>`$eating#keyborads-1ncreases_staminA`
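
As a rough illustration of why length and character variety matter, the sketch below counts how many candidate passwords exist for a given alphabet size and length, and draws a random password with Python's standard `secrets` module. The function names and the chosen alphabet are illustrative assumptions, not a description of what any particular password manager does.

```python
import math
import secrets
import string

# Illustrative alphabet (an assumption): letters, digits, and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def search_space(alphabet_size, length):
    """Number of distinct passwords of a given length over an alphabet."""
    return alphabet_size**length


def entropy_bits(alphabet_size, length):
    """Entropy (in bits) of a password whose characters are drawn uniformly."""
    return length * math.log2(alphabet_size)


def random_password(length=20, alphabet=ALPHABET):
    """Draw each character independently with a cryptographically secure RNG."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


# Compare 8 lowercase letters with 16 characters from the larger alphabet.
print(search_space(26, 8))
print(search_space(len(ALPHABET), 16))
print(round(entropy_bits(len(ALPHABET), 16), 1))
print(random_password())
```

Every extra character multiplies the search space by the alphabet size, which is why long passwords are so much harder to crack than short ones.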

In our case, we have already generated passwords and credentials that look pretty strong. How are we going to store them?

### Environment variables

As you probably suspect, we will need our credentials to connect to the API. We don't want to store them in the notebook because we want to be able to share the notebook (you want to share it with me, and I want to share it with you). We also don't want to copy and paste them every time we use the notebook, because that would be very inefficient. Also, it would be quite easy to forget about them. What are we going to do then?

We are going to use so-called environment variables. In other words, we define variables either on our computer or in Colab, and they are stored there. In the notebook, we just retrieve them by name. To define them in Colab, click the key icon in the left-hand sidebar. We need to define one variable:

* `API_KEY` -- this is our API Key.

In [None]:
## Load module
from google.colab import userdata

## Retrieve our environment variables and assign them to names.
API_KEY = userdata.get("API_KEY")
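
If you run the notebook outside Colab, `google.colab` is not available. A minimal sketch of a fallback, assuming the key is then stored in an operating-system environment variable with the same name; the helper `load_api_key` is a hypothetical name, not part of any library.

```python
import os


def load_api_key(name="API_KEY"):
    """Return the secret called `name` from Colab, or from os.environ elsewhere."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Outside Colab: fall back to a regular environment variable.
        value = os.environ.get(name)
        if value is None:
            raise RuntimeError(f"Environment variable {name!r} is not set.")
        return value
    # Inside Colab: read the secret defined behind the key icon.
    # userdata.get may raise its own error if the secret is missing.
    return userdata.get(name)


# Usage (assumes the variable has been defined beforehand):
# API_KEY = load_api_key("API_KEY")
```

Either way, the key itself never appears in the notebook, so the notebook stays safe to share.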

## Media Cloud Module

So, now that we finally have our credentials in the notebook, what next? We need to pass them to the Media Cloud API with each request. Intuitively, we would do that with a payload and the `requests` module. That is a good intuition, but fortunately, we don't have to do it this way, because many popular APIs have so-called wrappers. Those are modules that connect to the API and send requests for us. We could still do it through our web browser, but the URL would be much more complicated.

That is why, in the case of Media Cloud, we will use the `mediacloud` module. It will let us connect to Media Cloud and get data from it.
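
To get a feeling for what a wrapper saves us from, here is a sketch of assembling a query URL by hand. The base URL and the parameter names are hypothetical placeholders chosen for illustration; they are not the real Media Cloud endpoint.

```python
import datetime as dt
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.org/search/count"

params = {
    "q": "'Caitlin Clark'",
    "start": dt.date(2025, 1, 1).isoformat(),
    "end": dt.date(2025, 6, 30).isoformat(),
    "sources": "1",
}

# urlencode escapes spaces, quotes, and other special characters for us.
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

Even this toy URL is hard to read and easy to get wrong, and a real request would also need to carry the API key.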

In [None]:
## Install mediacloud module
!pip install mediacloud

In [None]:
## Import modules
import mediacloud.api
import datetime as dt

## Connect to Media Cloud
search_api = mediacloud.api.SearchApi(API_KEY)

This is a bit underwhelming because nothing was printed. To check whether everything worked well, we can just execute the following. It will return information about our user account.

In [None]:
## Print user's information
search_api.user_profile()

The most important information from this dictionary is about the quota. This tells us how many requests we made and how much we have left this week. 

## Counts

Anyway, now that we have established a connection to Media Cloud, let's try to get some data. We will start by counting the articles that mention the sensational [Caitlin Clark](https://en.wikipedia.org/wiki/Caitlin_Clark) -- the number 1 pick in the 2024 WNBA Draft. There are multiple ways to construct the query string. They are described in the [documentation](https://www.mediacloud.org/documentation/query-guide). Here, we will use the simplest one -- we will just type her name and surname in quotation marks, which will guarantee that we hit only articles with the `"Caitlin Clark"` string.

In [None]:
## The method search_api.story_count() returns a dictionary with the number of stories that matched the query.
clark_all = search_api.story_count(
    query="'Caitlin Clark'", start_date=dt.date(2025, 1, 1), end_date=dt.date.today()
)
clark_all

What is more, we can specify a source to search in, for example, The New York Times. To do that, we need to know its source ID. We can find it through the Media Cloud web interface or through a different API endpoint. For now, let's assume we use the web interface. At the end of the notebook, we will come back to exploring the available sources through the API.

In [None]:
## We can add a list of sources.
clark_all_ny = search_api.story_count(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    source_ids=[1],
)
clark_all_ny

However, most of the time, a single total for the whole search period is not very informative. It would probably be much better to aggregate daily. Say no more!

In [None]:
## Data aggregated daily.
clark_ny = search_api.story_count_over_time(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    source_ids=[1],
)
clark_ny
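
Daily counts can be noisy, so it is sometimes handy to roll them up further. The sketch below sums daily counts per month using only the standard library. It assumes each item is a dictionary with a `date` key (a `datetime.date`) and a `count` key, the same keys the plotting code later in this notebook relies on. The `rows` values are made-up placeholders, not real Media Cloud output, and `monthly_totals` is a hypothetical helper name.

```python
import datetime as dt
from collections import defaultdict

# Made-up placeholder rows that mimic the assumed shape of the daily results.
rows = [
    {"date": dt.date(2025, 1, 30), "count": 3},
    {"date": dt.date(2025, 1, 31), "count": 5},
    {"date": dt.date(2025, 2, 1), "count": 2},
]


def monthly_totals(daily):
    """Sum daily counts into (year, month) buckets, sorted chronologically."""
    totals = defaultdict(int)
    for item in daily:
        key = (item["date"].year, item["date"].month)
        totals[key] += item["count"]
    return dict(sorted(totals.items()))


print(monthly_totals(rows))
```

With real data, you would pass the list returned by the daily query instead of `rows`.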

### Exercise

Find the day this year when Caitlin Clark was mentioned the most in The New York Times.

In [None]:
## YOUR CODE

Now, we can also search through a collection of sources. Media Cloud has a set of collections of sources that are specific to a given country. Similarly to the source id, we need to know the id of a given collection. There are two ways of learning it, either through the web interface or through the API. Again, we will talk about the API pathway later.

In [None]:
## Example of getting data from a collection of sources.
clark_collection = search_api.story_count_over_time(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    collection_ids=[34412234],
)
clark_collection

Let's now see how it changed over time. 

In [None]:
## Import functions for plotting.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

## Prepare data.
x = [item["date"] for item in clark_ny]
y_count = [item["count"] for item in clark_ny]
y_ratio = [item["ratio"] for item in clark_ny]

## Define plot.
fig, axs = plt.subplots(figsize=(9, 5), nrows=2)
axs[0].plot(x, y_count, color="#C8102E")
axs[0].set_ylabel("Count")
axs[1].plot(x, y_ratio, color="#041E42")
axs[1].set_ylabel("Ratio")
## Edit xticks so they show the names of the months only.
for ax in axs:
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
## Add title.
plt.suptitle(
    f"Dynamics of the number of articles on Caitlin Clark in NY Times in 2025\n(N = {clark_all_ny['relevant']})"
)

Finally, we can search for the sources with the most articles with a given phrase. 

In [None]:
clark_sources = search_api.sources(
    "'Caitlin Clark'", start_date=dt.date(2025, 1, 1), end_date=dt.date.today()
)
clark_sources

## Words

The most analytical thing we can extract from Media Cloud is the list of the most popular words used in articles on a given topic. Unfortunately, it is only estimated from a sample of at most $5000$ articles.

In [None]:
## Get 100 top words from the articles.
clark_words = search_api.words(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    collection_ids=[34412234],
)
clark_words

### Exercise

Create a dictionary in which keys will represent terms and values will be term ratios. Filter out words: `"Caitlin"` and `"Clark"`.

In [None]:
## YOUR CODE
words = {}

Let's now draw a word cloud. This is one of the most useless graphs (I think it is even worse than a pie chart). However, sometimes it looks nice. And people do use it.

In [None]:
## Import modules
from wordcloud import WordCloud
import matplotlib.pyplot as plt

## Define the plot.
wordcloud = WordCloud(
    width=800, height=400, background_color="white"
).generate_from_frequencies(words)

## Define the size of the plot.
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Caitlin Clark Word Cloud")
plt.show()

## Lists

Finally, what is probably the most interesting and useful are lists of metadata. Using Media Cloud, you can find metadata about the articles on a given topic. In simple terms, you can get the titles of all articles on `"Caitlin Clark"` from 2025. The simplest way of getting the list of metadata is by using the `search_api.story_sample()` method. However, as the name suggests, it can only provide us with a sample of metadata. The maximum limit is $1250$.

In [None]:
## Let's get the list of metadata
clark_meta_sample = search_api.story_sample(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    limit=500,
    collection_ids=[34412234],
)

In [None]:
## Let's see the example of the result
clark_meta_sample[0]
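
Since every request counts against the weekly quota, it can be worth saving the metadata to disk instead of downloading it again. A minimal sketch using the JSON Lines format (one JSON object per line); the helper names, the file name, and the `demo` record are made-up placeholders, and the real records may contain other fields.

```python
import datetime as dt
import json
import tempfile
from pathlib import Path


def save_jsonl(records, path):
    """Write a list of dictionaries to a file, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            # default=str converts dates and datetimes to plain strings.
            fh.write(json.dumps(record, default=str, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Made-up placeholder record, not real Media Cloud output.
demo = [{"title": "Placeholder headline", "publish_date": dt.date(2025, 1, 2)}]
demo_path = Path(tempfile.gettempdir()) / "demo_meta.jsonl"

save_jsonl(demo, demo_path)
print(load_jsonl(demo_path))
```

Note that dates come back as strings after loading, so they need to be parsed again if you want to do date arithmetic.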

There is also a method that returns metadata for all the articles. However, it takes much longer, because the API sends the articles in batches of 1000 and this endpoint is rate limited to 2 requests per minute. In practice, we get 1000 articles every 30 seconds. How do we do that?

In [None]:
## Import module to control time
import time

## Output list
clark_meta_all = []

## The id of the next batch of data
pagination_token = None

## Control variable
more_stories = True

## How much data is there to collect
clark_all = search_api.story_count(
    "'Caitlin Clark'",
    start_date=dt.date(2025, 1, 1),
    end_date=dt.date.today(),
    collection_ids=[34412234],
)

while more_stories:
    ## Estimate time left
    estimated_time = 31 * (clark_all["relevant"] - len(clark_meta_all)) / 1000

    ## Print how much data was collected
    print(
        f"Collected {len(clark_meta_all)} out of {clark_all['relevant']}. Around {estimated_time} seconds left."
    )

    ## Wait to make another request
    time.sleep(31)

    ## search_api.story_list() returns a two-element tuple. The
    ## first element is a list of dictionaries. The second element is
    ## the token for the next batch of data
    page, pagination_token = search_api.story_list(
        "'Caitlin Clark'",
        start_date=dt.date(2025, 1, 1),
        end_date=dt.date.today(),
        collection_ids=[34412234],
        pagination_token=pagination_token,
    )

    ## Update the list
    clark_meta_all += page

    ## Update the control variable
    more_stories = pagination_token is not None
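
The waiting time printed in the loop above follows from simple arithmetic: with at most 1000 stories per request and a pause of 31 seconds before each request, the number of requests is the story count divided by 1000, rounded up. A minimal sketch, where the function name is hypothetical and the defaults are taken from the numbers used in this section:

```python
import math


def estimate_collection_time(n_stories, batch_size=1000, pause_seconds=31):
    """Return (number of requests, total waiting time in seconds)."""
    n_requests = math.ceil(n_stories / batch_size)
    return n_requests, n_requests * pause_seconds


# Example with a made-up story count.
requests_needed, seconds = estimate_collection_time(12_500)
print(f"{requests_needed} requests, about {seconds / 60:.1f} minutes")
```

Checking this number before starting the loop helps decide whether to narrow the date range or the collection first.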

### Exercise

Find all the sources that are overrepresented in the sample of 500 articles compared to the sample of 1250 articles.

In [None]:
## YOUR CODE