# News Domain YouTube Mapping

This notebook queries **Wikidata** for information about news and media organisations and their YouTube channels.

The goal is to produce a cleaned and deduplicated dataset that maps each organisation to its YouTube channel information (channel ID, handle, title) and website domain.  The final dataset is exported to CSV and JSON formats for easy reuse.

In [1]:

# Import standard libraries for HTTP requests, CSV/JSON handling and URL parsing
import requests
import csv
import json
from urllib.parse import urlparse

# In a Google Colab environment, `google.colab.files` provides helpers for downloading files.
# Outside Colab the import fails; `files` is then set to None so the query and export code can still run.
try:
    from google.colab import files
except ImportError:
    files = None

# Define a SPARQL query to retrieve news organisations and their YouTube data
SPARQL = """
SELECT DISTINCT ?item ?itemLabel ?website ?ytid ?yhandle ?channelTitle WHERE {
    # 1. News/media organisations (instance-of subtree of news organisation)
    ?item wdt:P31/wdt:P279* wd:Q1193236 ;
          wdt:P856 ?website ;
          wdt:P2397 ?ytid .
    # 2. Optional YouTube handle (main value)
    OPTIONAL { ?item wdt:P11245 ?yhandle . }
    # 3. Optional channel title (qualifier on the P2397 statement)
    OPTIONAL {
        ?item p:P2397 ?ytStmt .
        ?ytStmt ps:P2397 ?ytid .
        OPTIONAL { ?ytStmt pq:P1932 ?channelTitle . }
    }
    # 4. Get the organisation's label in the UI language
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
# Set the endpoint for Wikidata and HTTP request headers (identifying our script)
ENDPOINT = "https://query.wikidata.org/sparql"
HEADERS = {
    "Accept": "application/sparql-results+json",
    # Identify yourself with a descriptive user-agent; replace the email with your contact
    "User-Agent": "NewsYTMapping/1.2 (Google Colab; contact: your_email@example.com)",
}



## Understanding the SPARQL query

The SPARQL query above retrieves:

- Each news or media organisation item and its label (using the `item` and `itemLabel` variables).
- The official website (`P856`) associated with each organisation.  We later use the domain of this URL for deduplication.
- The YouTube channel identifier (`P2397`) for the organisation (stored in `ytid`).
- An optional YouTube handle (`P11245`), if recorded (`yhandle`).
- An optional YouTube channel title qualifier (`P1932` on the statement of `P2397`), stored in `channelTitle`.

It also uses the `wikibase:label` service to return labels in the user's interface language (falling back to English when necessary).

`SELECT DISTINCT` removes only rows that are identical in every column. Further deduplication is still needed, because several rows that differ in one optional value (for example the handle or the channel title) can refer to the same organisation and channel.
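
To make the result shape concrete, here is a minimal sketch of one entry of `results.bindings` in the SPARQL JSON results format. The sample values and the `binding_value` helper are assumptions made for illustration, not real Wikidata output. The point it shows is that a variable from an unmatched `OPTIONAL` pattern is simply missing from the row, so every lookup needs a default.

```python
# Hypothetical binding row, shaped like one entry of response.json()["results"]["bindings"].
# All values are made up for illustration.
sample_row = {
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
    "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Example News"},
    "website": {"type": "uri", "value": "https://www.example.com/"},
    "ytid": {"type": "literal", "value": "UC_example_channel_id"},
    # "yhandle" and "channelTitle" are absent: their OPTIONAL patterns did not match.
}


def binding_value(row, name, default=""):
    """Return the plain value of a SPARQL variable, or `default` if it is unbound."""
    return row.get(name, {}).get("value", default)


print(binding_value(sample_row, "itemLabel"))
print(repr(binding_value(sample_row, "yhandle")))
```

The same `row.get(name, {}).get("value", "")` pattern is what the processing cell uses for `yhandle` and `channelTitle`.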


In [2]:

# Query Wikidata for the rows defined by the SPARQL query
response = requests.get(ENDPOINT, params={'query': SPARQL}, headers=HEADERS, timeout=60)
response.raise_for_status()  # Raise an exception if the request failed
rows = response.json()["results"]["bindings"]

# Transform and deduplicate results
# We'll use a dictionary keyed by (domain, channel ID) so that each domain-channel pair appears only once.
unique = {}
for row in rows:
    # Extract the website URL and parse out the domain (drop a leading 'www.' only)
    site = row["website"]["value"]
    domain = urlparse(site).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]

    # Extract YouTube channel ID
    ytid = row["ytid"]["value"]
    key = (domain, ytid)

    # Extract optional YouTube handle and channel title if available
    handle = row.get("yhandle", {}).get("value", "")
    title  = row.get("channelTitle", {}).get("value", "")

    if key not in unique:
        # Create a new entry when encountering a domain-channel combination for the first time
        unique[key] = {
            "news_name": row["itemLabel"]["value"],
            "domain": domain,
            "youtube_channel_id": ytid,
            "youtube_url": f"https://www.youtube.com/channel/{ytid}",
            "youtube_handle": handle,
            "youtube_channel_title": title,
        }
    else:
        # If the key already exists, fill in any missing fields from additional rows
        record = unique[key]
        if not record["youtube_handle"] and handle:
            record["youtube_handle"] = handle
        if not record["youtube_channel_title"] and title:
            record["youtube_channel_title"] = title

# Convert the dictionary of unique records to a list for saving
data = list(unique.values())
print(f"Retrieved {len(data)} unique outlets from {len(rows)} raw rows")


Retrieved 2264 unique outlets from 2457 raw rows



## Deduplication strategy

Even with `SELECT DISTINCT`, the result can contain several rows for the same organisation: each additional website, handle, channel ID statement or channel title qualifier produces another row.

The deduplication step groups rows by the combination of the organisation's website domain and YouTube channel ID. For each unique combination, it keeps the first occurrence and then fills in missing values for the YouTube handle or channel title if subsequent rows provide them.

This yields a dataset in which each domain and channel pair appears exactly once. An organisation with several channels, or with websites on several domains, still contributes one record per pair.
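
The grouping logic can be sketched on a few hand-written rows. The helper names `extract_domain` and `merge_rows` and the sample data are assumptions for illustration, and the rows are already flattened to plain dictionaries rather than SPARQL bindings.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Lower-case the host part of a URL and drop a leading 'www.' if present."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def merge_rows(rows):
    """Keep one record per (domain, channel ID), filling blanks from later rows."""
    unique = {}
    for row in rows:
        key = (extract_domain(row["website"]), row["ytid"])
        record = unique.setdefault(key, {
            "domain": key[0],
            "youtube_channel_id": key[1],
            "youtube_handle": "",
            "youtube_channel_title": "",
        })
        # Only overwrite a field that is still empty.
        record["youtube_handle"] = record["youtube_handle"] or row.get("handle", "")
        record["youtube_channel_title"] = record["youtube_channel_title"] or row.get("title", "")
    return list(unique.values())


# Made-up rows: the first two describe the same domain and channel.
sample = [
    {"website": "https://www.example.com/", "ytid": "UC_a", "title": "Example TV"},
    {"website": "http://example.com/news", "ytid": "UC_a", "handle": "exampletv"},
    {"website": "https://example.org/", "ytid": "UC_b"},
]
merged = merge_rows(sample)
print(len(merged))
```

The first two sample rows collapse into one record that carries both the title and the handle. Note that `netloc` keeps any port number; `urlparse(url).hostname` would be an alternative if ports appear in the data.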


In [3]:

# Save the cleaned data to CSV and JSON formats
csv_filename = "news_youtube_mapping.csv"
json_filename = "news_youtube_mapping.json"

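# Write the CSV file (columns are listed explicitly so an empty result still produces a header)
fieldnames = ["news_name", "domain", "youtube_channel_id", "youtube_url",
              "youtube_handle", "youtube_channel_title"]
with open(csv_filename, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data)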

# Write the JSON file
with open(json_filename, "w", encoding="utf-8") as jsonfile:
    json.dump(data, jsonfile, indent=2, ensure_ascii=False)

# If running in Google Colab, make the files available for download
if files is not None:
    files.download(csv_filename)
    files.download(json_filename)



## Summary and next steps

This notebook demonstrates how to use the Wikidata SPARQL endpoint to retrieve a list of news organisations and their YouTube channels. The data is deduplicated by domain and channel ID, so each domain and channel pair appears only once. You can explore or visualize the resulting CSV/JSON files, or combine them with other datasets for further analysis.
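
As a small example of such exploration, the sketch below counts how many channels are mapped to each domain. The records are made up and only mimic the shape of the exported JSON; with the real export you would load the list from `news_youtube_mapping.json` via `json.load` instead.

```python
from collections import Counter

# Hypothetical records in the same shape as the exported JSON (values are made up).
records = [
    {"news_name": "Example News", "domain": "example.com", "youtube_channel_id": "UC_a"},
    {"news_name": "Example News Sport", "domain": "example.com", "youtube_channel_id": "UC_b"},
    {"news_name": "Other Daily", "domain": "example.org", "youtube_channel_id": "UC_c"},
]

# Count records per domain and keep the domains that map to more than one channel.
channels_per_domain = Counter(record["domain"] for record in records)
multi_channel = {domain: n for domain, n in channels_per_domain.items() if n > 1}
print(multi_channel)
```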

Feel free to adapt the SPARQL query to include additional fields (for example, country of headquarters or organisation type) or to filter for specific types of media outlets.
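
As one possible adaptation, the sketch below adds an optional country column. It assumes the `country` property (`P17`) is the field of interest; the names `SPARQL_WITH_COUNTRY`, `?country`, `?countryLabel` and `country_of` are chosen here for illustration. The label service fills `?countryLabel` because the variable follows the `?xLabel` naming pattern.

```python
# Sketch of an extended query: the same core pattern as the notebook's query,
# plus an optional country (P17). Handle and title clauses are omitted for brevity.
SPARQL_WITH_COUNTRY = """
SELECT DISTINCT ?item ?itemLabel ?website ?ytid ?country ?countryLabel WHERE {
    ?item wdt:P31/wdt:P279* wd:Q1193236 ;
          wdt:P856 ?website ;
          wdt:P2397 ?ytid .
    OPTIONAL { ?item wdt:P17 ?country . }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""


def country_of(row):
    """Read the optional country label from one result binding ('' if unbound)."""
    return row.get("countryLabel", {}).get("value", "")
```

An item with several country statements returns one row per country, so the merge step would need a rule for that column, such as keeping the first value seen.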
