### Getting Data for My Final Project

I want to use the 2023_pageviews data to get the following:
- article titles
- qid
- lang_code
- country_code
- pageviews

So far, I am a little confused about where I should be working on this project and how to use the DuckDB databases.

Eni said: use the wiki_pageviews duckdb.

You can follow the second tutorial, but swap the URL and the table name (instead of data_table, use wiki_pageviews in the queries) and most of the queries will work, although the ones that do aggregation are a bit slow on this big database. It's better to select the rows that you need and save them in a CSV and then do the operations on the file.

This is from the DuckDB_Tutorial tutorial

I don't know what rows I need though

1. Set up in Google Colab

In [2]:
import duckdb

# Placeholder for the database connection. It will be initialized later with the URL.
conn = duckdb.connect()
conn

<_duckdb.DuckDBPyConnection at 0x10ae0f4b0>

In [3]:
# Install and Load the HTTPFS extension
# This is required to access remote files over the web (HTTP/S)
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

<_duckdb.DuckDBPyConnection at 0x10ae0f4b0>

2. Connect to the Remote Database Source

In [4]:
# This is one of the several DuckDB databases hosted in the CS server.
# This database had the DPDP data for all countries.
database_url = "https://cs.wellesley.edu/~eni/duckdb/all_wiki.duckdb"

# Attach the remote file as a database named 'web_db' and start using it
try:
    conn.execute(f"ATTACH '{database_url}' AS web_db (READ_ONLY);")
    conn.execute("USE web_db;")
    print(f"Successfully attached database from: {database_url}")
except Exception as e:
    print(f"Error attaching database: {e}")

Successfully attached database from: https://cs.wellesley.edu/~eni/duckdb/all_wiki.duckdb


3. Show the Tables

In [5]:
query = "PRAGMA show_tables"
result = conn.sql(query)
result

┌────────────────┐
│      name      │
│    varchar     │
├────────────────┤
│ wiki_pageviews │
└────────────────┘

In [6]:
table_name = "wiki_pageviews"
query = f"PRAGMA table_info('web_db.{table_name}');"

# We can apply the method .df() to the result of the query to convert it into a dataframe
column_info_df = conn.sql(query).df()
column_info_df

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,date,DATE,False,,False
1,1,country_code,VARCHAR,False,,False
2,2,project,VARCHAR,False,,False
3,3,article,VARCHAR,False,,False
4,4,qid,VARCHAR,False,,False
5,5,pageviews,BIGINT,False,,False


I want the date, country_code, project (to get the language code), article, qid, and pageviews

4. SQL Commands for the Table

In [7]:
query_1 = """
SELECT * FROM wiki_pageviews
LIMIT 10;
"""
result_1 = conn.sql(query_1).df() # after executing, convert to df for better printout

result_1

Unnamed: 0,date,country_code,project,article,qid,pageviews
0,2023-02-06,DZ,ar.wikipedia,ÙØªÙØ§Ø²Ù_Ø£Ø¶ÙØ§Ø¹,Q45867,108
1,2023-02-06,DZ,ar.wikipedia,Ø§ÙØ£ÙØ¯ÙØ³,Q123559,145
2,2023-02-06,AR,en.wikipedia,Robledo_Puch,Q3181149,99
3,2023-02-06,AR,es.wikipedia,Ojo_de_Horus,Q211286,135
4,2023-02-06,AR,es.wikipedia,Estaciones_del_año,Q24384,171
5,2023-02-06,AR,es.wikipedia,Isla_de_Alcatraz,Q131354,126
6,2023-02-06,AR,es.wikipedia,Volkswagen_Gol,Q275442,148
7,2023-02-06,AR,es.wikipedia,Río_Cuarto_(ciudad),Q983451,179
8,2023-02-06,AR,es.wikipedia,Todo_Noticias,Q3244714,325
9,2023-02-06,AR,es.wikipedia,Tres_metros_sobre_el_cielo_(película_de_2010),Q944385,112


Now that I have my dataframe, I can get the specific information I need:
- date
- article titles
- qid
- lang_code
- country_code
- pageviews

I also need to pick countries to look at

Let me filter by the date first. For this first part, I only want to look at the data for one month, so I am going to pick March 2023 (2023-03).

The table info above lists the date column as type DATE, so I can compare it against a DATE literal in the query.

In [32]:
query_2 = """
SELECT date, country_code, project, article, qid, pageviews
FROM wiki_pageviews
WHERE DATE_TRUNC('month', date) = DATE '2023-03-01'
"""
df = conn.sql(query_2).df()

In [27]:
df.tail()

Unnamed: 0,date,country_code,project,article,qid,pageviews
11913520,2023-03-31,US,en.wikipedia,Jerry_Nadler,Q505598,512
11913521,2023-03-31,US,en.wikipedia,68–95–99.7_rule,Q847822,530
11913522,2023-03-31,US,fr.wikipedia,France,Q142,685
11913523,2023-03-31,US,uk.wikipedia,YouTube,Q866,3071
11913524,2023-03-31,US,zh.wikipedia,æ­æ´åÂ·ç¦å°æ©æ¯,Q4653,557


Let me get the lang_code from the project title with pandas first

In [33]:
df[['lang_code', 'project']] = df['project'].str.split('.', n=1, expand=True)
df.tail()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
11913520,2023-03-31,US,wikipedia,Jerry_Nadler,Q505598,512,en
11913521,2023-03-31,US,wikipedia,68–95–99.7_rule,Q847822,530,en
11913522,2023-03-31,US,wikipedia,France,Q142,685,fr
11913523,2023-03-31,US,wikipedia,YouTube,Q866,3071,uk
11913524,2023-03-31,US,wikipedia,æ­æ´åÂ·ç¦å°æ©æ¯,Q4653,557,zh
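
Note that the split above overwrites project with the part after the first dot. Below is a minimal sketch of a variant that leaves project unchanged and keeps the whole language code, however long it is. The toy rows (including simple.wikipedia and zh-min-nan.wikipedia) are assumed examples, not rows taken from the database.

```python
import pandas as pd

# Toy stand-in for the real dataframe; only the project column matters here.
toy = pd.DataFrame({
    "project": ["en.wikipedia", "fr.wikipedia", "simple.wikipedia", "zh-min-nan.wikipedia"]
})

# Everything before the first dot is the language code; project stays as it was.
toy["lang_code"] = toy["project"].str.split(".", n=1).str[0]

print(toy)
```

Slicing a fixed number of characters would cut codes like simple short, which is why this sketch splits on the dot instead.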


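This next cell is an earlier attempt (note the lower cell number), kept for reference. It has a bug: df["project"][0][0:2] takes the first two characters of the first row only and assigns that one value to every row, which is why lang_code is 'ar' everywhere in the output below. The split in the cell above is the version I use.

In [25]:
df["lang_code"] = df["project"][0][0:2]  # bug: one scalar from row 0 is broadcast to all rows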
df = df.drop('project', axis=1)
df.tail()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
11913520,2023-03-31,US,Jerry_Nadler,Q505598,512,ar
11913521,2023-03-31,US,68–95–99.7_rule,Q847822,530,ar
11913522,2023-03-31,US,France,Q142,685,ar
11913523,2023-03-31,US,YouTube,Q866,3071,ar
11913524,2023-03-31,US,æ­æ´åÂ·ç¦å°æ©æ¯,Q4653,557,ar


Okay, now I have my dataframe for March 2023. Next, I need to get my country dataframes.

I want to include the US, Japan, the UK, India, and Germany. These are supposedly the countries that use Wikipedia the most.

In [34]:
USdf = df[(df['country_code'] == 'US')]
USdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
1681,2023-03-01,US,wikipedia,Kawasaki_disease,Q265936,684,en
1682,2023-03-01,US,wikipedia,The_Elder_Scrolls_IV:_Oblivion,Q49607,530,en
1683,2023-03-01,US,wikipedia,Marathon_Man_(film),Q1195727,523,en
1684,2023-03-01,US,wikipedia,Eleanor_Tomlinson,Q1582005,697,en
1685,2023-03-01,US,wikipedia,Alice_Neel,Q460186,1044,en


In [35]:
JPdf = df[(df['country_code'] == 'JP')]
JPdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
951,2023-03-01,JP,wikipedia,Compartment_No._6,Q107092356,104,en
952,2023-03-01,JP,wikipedia,çå®é«ç°æ´¾,Q10437214,167,ja
953,2023-03-01,JP,wikipedia,ãã«ã¨ãã¹ãã¨å¬åç£,Q483263,285,ja
954,2023-03-01,JP,wikipedia,ä¼½è¶,Q28084,299,ja
955,2023-03-01,JP,wikipedia,å½é72ç³»é»è»,Q11421672,181,ja


In [36]:
UKdf = df[(df['country_code'] == 'GB')]
UKdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
1501,2023-03-01,GB,wikipedia,Bill_Murray,Q29250,386,en
1502,2023-03-01,GB,wikipedia,Russian_cruiser_Moskva,Q2992278,95,en
1503,2023-03-01,GB,wikipedia,Ashes_to_Ashes_(British_TV_series),Q725195,124,en
1504,2023-03-01,GB,wikipedia,Green_Boots,Q3541506,162,en
1505,2023-03-01,GB,wikipedia,Red_Dead_Redemption,Q548203,194,en


In [37]:
INdf = df[(df['country_code'] == 'IN')]
INdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
633,2023-03-01,IN,wikipedia,à¤à¤¶à¥à¤,Q8589,177,bh
634,2023-03-01,IN,wikipedia,à¦ªà¦¾à¦ à¦¾à¦¨_(à¦à¦²à¦à§à¦à¦¿à¦¤à§à¦°),Q114620212,98,bn
635,2023-03-01,IN,wikipedia,Hussain_Kuwajerwala,Q5949546,225,en
636,2023-03-01,IN,wikipedia,Resident_Evil_(film),Q153484,145,en
637,2023-03-01,IN,wikipedia,Sherilyn_Fenn,Q229993,109,en


In [38]:
DEdf = df[(df['country_code'] == 'DE')]
DEdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
470,2023-03-01,DE,wikipedia,Carles_Puigdemont,Q4740163,101,br
471,2023-03-01,DE,wikipedia,Liste_von_Pistolen,Q60526,149,de
472,2023-03-01,DE,wikipedia,Priyanka_Chopra_Jonas,Q158957,215,de
473,2023-03-01,DE,wikipedia,Denis_Wladimirowitsch_Puschilin,Q16514790,109,de
474,2023-03-01,DE,wikipedia,Dominica,Q784,211,de


Now I can create my CSV file

In [40]:
import pandas as pd

In [41]:
csv_data = pd.concat([USdf, JPdf, UKdf, INdf, DEdf], ignore_index=True)

In [43]:
csv_data.tail()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
5636456,2023-03-31,DE,wikipedia,The_Glory_(TV_series),Q113197148,211,en
5636457,2023-03-31,DE,wikipedia,Evan_Gershkovich,Q117337455,1032,en
5636458,2023-03-31,DE,wikipedia,ÙØ±ÛÙÛÙ_ÙÙÙØ±Ù,Q4616,121,fa
5636459,2023-03-31,DE,wikipedia,ÙØ¯ÛÙ_Ø¨Ø§Ø²ÙÙØ¯,Q106396209,93,fa
5636460,2023-03-31,DE,wikipedia,Fabio_Cannavaro,Q102027,142,it


In [44]:
csv_data.to_csv("final-project-data.csv", index=False)

Now, my next step is to get all the qids for my articles

In [45]:
qid_df = pd.read_csv('final-project-data.csv')

In [48]:
qid_df = qid_df[["article", "qid"]]

In [49]:
qid_df.head()

Unnamed: 0,article,qid
0,Kawasaki_disease,Q265936
1,The_Elder_Scrolls_IV:_Oblivion,Q49607
2,Marathon_Man_(film),Q1195727
3,Eleanor_Tomlinson,Q1582005
4,Alice_Neel,Q460186
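
Because the data has one row per article per day and country, the same qid will show up many times in this dataframe. Below is a minimal sketch of getting each qid only once, on toy rows. The missing qid in the last row is an assumption to show dropna; I have not checked whether the real data has any missing values.

```python
import pandas as pd

# Toy stand-in for qid_df, with repeats and one missing qid.
toy = pd.DataFrame({
    "article": ["Kawasaki_disease", "Alice_Neel", "Kawasaki_disease", "France", "France"],
    "qid": ["Q265936", "Q460186", "Q265936", "Q142", None],
})

# Drop missing values, then keep the first occurrence of each qid in order.
unique_qids = toy["qid"].dropna().drop_duplicates().tolist()

print(unique_qids)
```

Working from the unique list avoids asking the Wikidata API about the same item more than once.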


Next, we can get script information about the articles using the code from the 4_get_wikidata file

In [51]:
import requests
import json, os

WIKIDATA_API_ENDPOINT = "https://www.wikidata.org/w/api.php"

def fetch_complete_entity_data(qid):
    """
    Fetches all available structured data for a single Wikidata entity (QID)
    using the official Wikibase API action=wbgetentities.

    Args:
        qid (str): The Wikidata Item ID (e.g., 'Q83285' for Durres).

    Returns:
        dict: The complete raw JSON data for the entity, or an error dictionary.
    """

    # Parameters for the MediaWiki API, using the 'wbgetentities' action
    params = {
        'action': 'wbgetentities',
        'ids': qid,
        'format': 'json',
        # Request all relevant data: claims (properties), labels, descriptions, sitelinks (Wikipedia links)
        'props': 'claims|labels|descriptions|sitelinks|aliases',
    }

    # Add a User-Agent header as recommended by Wikidata API policies
    # https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#User-Agent
    headers = {
        'User-Agent': 'Colab-Wikidata-Example/1.0 (https://colab.research.google.com; colab-user@example.com)'
    }

    try:
        response = requests.get(WIKIDATA_API_ENDPOINT, 
                                params=params, 
                                headers=headers, 
                                timeout=10)
        response.raise_for_status()  # Raises an HTTPError for bad responses (4xx or 5xx)

        data = response.json()

        # Check for potential errors in the API response structure
        if 'error' in data:
            return {"error": f"API Error for {qid}: {data['error']['info']}"}

        # The core data is nested under ['entities'][qid]
        entity_data = data.get('entities', {}).get(qid)

        if entity_data:
            return entity_data
        else:
            return {"error": f"Entity {qid} not found or no data returned."}

    except requests.exceptions.RequestException as e:
        return {"error": f"Network or API request error: {e}"}
    except json.JSONDecodeError:
        return {"error": "Failed to decode JSON response."}

I am going to run this on 1 qid to understand better how this code works: 

In [52]:
fetch_complete_entity_data("Q265936")

{'type': 'item',
 'id': 'Q265936',
 'labels': {'de': {'language': 'de', 'value': 'Kawasaki-Syndrom'},
  'ar': {'language': 'ar', 'value': 'داء كاواساكي'},
  'ca': {'language': 'ca', 'value': 'malaltia de Kawasaki'},
  'dv': {'language': 'dv', 'value': 'ކަވަސާކީ ސިންޑްރޯމް'},
  'en': {'language': 'en', 'value': 'Kawasaki disease'},
  'es': {'language': 'es', 'value': 'Enfermedad de Kawasaki'},
  'et': {'language': 'et', 'value': 'Kawasaki haigus'},
  'fa': {'language': 'fa', 'value': 'نشانگان کاوازاکی'},
  'fi': {'language': 'fi', 'value': 'Kawasakin tauti'},
  'fr': {'language': 'fr', 'value': 'maladie de Kawasaki'},
  'he': {'language': 'he', 'value': 'מחלת קווסאקי'},
  'hu': {'language': 'hu', 'value': 'Kawasaki-szindróma'},
  'it': {'language': 'it', 'value': 'sindrome di Kawasaki'},
  'ja': {'language': 'ja', 'value': '川崎病'},
  'ms': {'language': 'ms', 'value': 'Penyakit Kawasaki'},
  'nl': {'language': 'nl', 'value': 'Ziekte van Kawasaki'},
  'pl': {'language': 'pl', 'value': 'Cho

I am not totally sure what all of this means
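
From what I can tell, the result is a nested dictionary. labels holds the item's name in each language, descriptions holds a short description per language, and claims maps a property ID (like P373, Commons category) to a list of statements, with each statement's value under mainsnak and then datavalue. The sketch below uses a hand-trimmed stand-in for the response, not the full output, and summarize_entity is a hypothetical helper.

```python
# Hand-trimmed stand-in for the dictionary returned above (not the full response).
entity = {
    "type": "item",
    "id": "Q265936",
    "labels": {
        "de": {"language": "de", "value": "Kawasaki-Syndrom"},
        "en": {"language": "en", "value": "Kawasaki disease"},
        "ja": {"language": "ja", "value": "川崎病"},
    },
    "descriptions": {
        "en": {
            "language": "en",
            "value": "human disease in which blood vessels throughout the body become inflamed",
        },
    },
    "claims": {
        "P373": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P373",
                    "datavalue": {"value": "Kawasaki disease", "type": "string"},
                },
                "type": "statement",
                "rank": "normal",
            }
        ],
    },
}


def summarize_entity(entity, lang="en"):
    """Pull a few human-readable fields out of a wbgetentities-style dict."""
    labels = entity.get("labels", {})
    return {
        "id": entity.get("id"),
        "label": labels.get(lang, {}).get("value"),
        "description": entity.get("descriptions", {}).get(lang, {}).get("value"),
        "label_languages": sorted(labels),
        "property_ids": sorted(entity.get("claims", {})),
    }


summary = summarize_entity(entity)
print(summary)
```

On the real response, label_languages shows which languages have a label for the item.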

I also have this code?

In [54]:
def _chunk_list(lst, n):
    """
    Yields successive n-sized chunks from lst.
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
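
To see what the chunking does, here is a small sketch with 120 made-up IDs. The function is repeated under the name chunk_list only so the example runs on its own.

```python
def chunk_list(lst, n):
    """Same logic as _chunk_list above, repeated so this cell runs on its own."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


# 120 made-up IDs, only to show the chunk sizes.
fake_ids = [f"Q{i}" for i in range(1, 121)]

chunks = list(chunk_list(fake_ids, 50))
sizes = [len(chunk) for chunk in chunks]
print(sizes)

# Each chunk is later joined with '|' to form one 'ids' parameter.
joined = "|".join(chunks[2][:3])
print(joined)
```

The last chunk is simply shorter when the list length is not a multiple of n.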

In [55]:
def fetch_labels_for_qids(qids: list[str], lang='en'):
    """
    Fetches labels for a list of Wikidata QIDs or Property IDs.
    Handles API limits by chunking the requests.

    Args:
        qids (list[str]): A list of Wikidata Item IDs or Property IDs (e.g., ['Q515', 'P31']).
        lang (str): The language code for the labels (default is 'en').

    Returns:
        dict: A dictionary mapping QID to its label, or an error dictionary.
    """
    if not qids:
        return {}

    # Wikidata API limit for 'ids' parameter is typically 50
    MAX_IDS_PER_REQUEST = 50
    all_labels_map = {}

    # Chunk the QID list to respect the API limit
    for qid_chunk in _chunk_list(qids, MAX_IDS_PER_REQUEST):
        params = {
            'action': 'wbgetentities',
            'ids': '|'.join(qid_chunk), # Join the chunk's IDs with '|' so one request covers all of them
            'format': 'json',
            'props': 'labels',
            'languages': lang,
        }

        headers = {
            'User-Agent': 'Colab-Wikidata-Example/1.0 (https://colab.research.google.com; colab-user@example.com)'
        }

        try:
            response = requests.get(WIKIDATA_API_ENDPOINT, params=params, headers=headers, timeout=10)
            response.raise_for_status() # Raises an HTTPError for bad responses (4xx or 5xx)

            data = response.json()

            if 'error' in data:
                # If an error occurs in one chunk, return it immediately or log and continue
                return {"error": f"API Error fetching labels for chunk {qid_chunk}: {data['error']['info']}"}

            for qid_key, entity_info in data.get('entities', {}).items():
                label = entity_info.get('labels', {}).get(lang, {}).get('value')
                if label:
                    all_labels_map[qid_key] = label

        except requests.exceptions.RequestException as e:
            return {"error": f"Network or API request error for chunk {qid_chunk}: {e}"}
        except json.JSONDecodeError:
            return {"error": "Failed to decode JSON response for a label chunk."}

    return all_labels_map

In [53]:
def extract_labeled_claim_values(claims: dict, property_labels: dict) -> dict:
    """
    Extracts the main value for each claim, resolves QID values to labels,
    and returns a dictionary of 'property_label': 'value' pairs.

    Args:
        claims (dict): The 'claims' section of a Wikidata entity's data.
        property_labels (dict): A dictionary mapping Property IDs (P-numbers) to their labels.

    Returns:
        dict: A dictionary where keys are property labels and values are their extracted/resolved values.
    """
    labeled_values = {}
    qids_to_resolve = set() # Collect all QIDs that need labels

    # First pass: Extract raw values and collect QIDs
    extracted_raw_values = {}
    for prop_id, statements in claims.items():
        prop_label = property_labels.get(prop_id, prop_id) # Use ID if label not found
        
        # We often care about the primary value of the first statement for simplicity
        if statements:
            main_snak = statements[0].get('mainsnak')
            if not main_snak or 'datavalue' not in main_snak: # Skip if no main value
                continue

            data_value = main_snak['datavalue']
            value_type = data_value.get('type')

            if value_type == 'wikibase-entityid':
                qid_value = data_value['value']['id']
                extracted_raw_values[prop_label] = qid_value # Store QID for later resolution
                qids_to_resolve.add(qid_value)
            elif value_type == 'string' or value_type == 'external-id':
                extracted_raw_values[prop_label] = data_value['value']
            elif value_type == 'quantity':
                # Format quantity with unit if available
                amount = data_value['value']['amount']
                unit = data_value['value'].get('unit', '').replace('http://www.wikidata.org/entity/', '')
                if unit and unit != '1': # '1' is the URI for dimensionless unit
                    # Attempt to add unit to QID list for resolution
                    if unit.startswith('Q'):
                        qids_to_resolve.add(unit)
                        extracted_raw_values[prop_label] = (amount, unit) # Store as tuple for later unit resolution
                    else:
                        extracted_raw_values[prop_label] = f"{amount} {unit}" # Simple string for non-QID units
                else:
                    extracted_raw_values[prop_label] = amount
            elif value_type == 'time':
                # Simple representation for time
                extracted_raw_values[prop_label] = data_value['value']['time']
            elif value_type == 'globecoordinate':
                latitude = data_value['value']['latitude']
                longitude = data_value['value']['longitude']
                extracted_raw_values[prop_label] = f"Lat: {latitude}, Lon: {longitude}"
            elif value_type == 'monolingualtext':
                extracted_raw_values[prop_label] = data_value['value']['text']
            # Add more types as needed
            else:
                # For unhandled types or complex structures, just show the raw datavalue
                extracted_raw_values[prop_label] = f"[Unhandled Type: {value_type}]"

    # Second pass: Resolve QID values and units to labels
    if qids_to_resolve:
        resolved_value_labels = fetch_labels_for_qids(list(qids_to_resolve))
        if "error" in resolved_value_labels:
            print(f"Warning: Could not resolve some value labels: {resolved_value_labels['error']}")
            # Proceed with raw QIDs if resolution fails
            pass

        for prop_label, value in extracted_raw_values.items():
            if isinstance(value, str) and value.startswith('Q'):
                labeled_values[prop_label] = resolved_value_labels.get(value, value) # Use raw QID if label not found
            elif isinstance(value, tuple) and len(value) == 2 and value[1].startswith('Q'): # Handle quantity with QID unit
                amount, unit_qid = value
                unit_label = resolved_value_labels.get(unit_qid, unit_qid)
                labeled_values[prop_label] = f"{amount} {unit_label}"
            else:
                labeled_values[prop_label] = value
    else:
        labeled_values = extracted_raw_values # No QIDs to resolve

    return labeled_values

In [56]:
def test_one(QID):
    """
    Demonstrates fetching the complete JSON data for a given QID string
    and then resolving labels for properties and their values.
    """
    # QID to look up, as passed in by the caller
    qid_example = QID
    print(f"--- Fetching ALL structured data for {qid_example} \n")

    entity_data = fetch_complete_entity_data(qid_example)

    if "error" in entity_data:
        print(f"Error: {entity_data['error']}")
        return

    # Display main entity's label and description
    print(f"--- Main Entity Details ({qid_example}) ---")
    entity_label = entity_data.get('labels', {}).get('en', {}).get('value', 'No label found')
    entity_description = entity_data.get('descriptions', {}).get('en', {}).get('value', 'No description found')
    print(f"Label: {entity_label}")
    print(f"Description: {entity_description}\n")

    print("--- Full Raw JSON Structure (Truncated for readability) ---")

    # We will print the Claims section specifically to show the attribute:value pairs
    claims = entity_data.get('claims', {}) # This is the full claims dict
    print(f"\nTotal Properties (Claims) Found: {len(claims)}\n")

    # Get labels for the property IDs themselves
    property_ids = list(claims.keys())
    property_labels = fetch_labels_for_qids(property_ids)
    if "error" in property_labels:
        #print(f"Error fetching property labels: {property_labels['error']}")
        property_labels = {pid: pid for pid in property_ids} # Fallback to IDs if labels fail
    else:
        #print("Property IDs found for this entity:")
        # Print property IDs with their labels
        labeled_properties_overview = {pid: property_labels.get(pid, 'Label Not Found') for pid in property_ids}
        #print(json.dumps(labeled_properties_overview, indent=2))

    # Now, extract and label the claim values
    print("\n--- Extracted Labeled Claim Values ---")
    labeled_claim_values = extract_labeled_claim_values(claims, property_labels)
    print(json.dumps(labeled_claim_values, indent=2, ensure_ascii=False))

    print("\n--- Details for 'P31' (instance of) ---")

    if 'P31' in claims:
        # P31 is 'instance of', and it will contain an array of statements
        p31_statements = claims['P31']

        # Iterate over the values found for P31
        extracted_value_qids = []
        for statement in p31_statements:
            # The value is usually nested deep in the datavalue section
            main_snak = statement.get('mainsnak', {})
            data_value = main_snak.get('datavalue')  # absent for 'novalue'/'somevalue' snaks
            if data_value and data_value.get('type') == 'wikibase-entityid':
                value_qid = data_value['value']['id']
                extracted_value_qids.append(value_qid)

        # Get the label for the P31 property itself
        p31_label = property_labels.get('P31', 'Label Not Found for P31')
        print(f"Property P31 label: '{p31_label}'")

        # Get the labels for the extracted QID values
        value_labels = fetch_labels_for_qids(extracted_value_qids)

        if "error" in value_labels:
            print(f"Error fetching value labels: {value_labels['error']}")
        else:
            print(f"Raw QID values for 'instance of' (P31): {extracted_value_qids}")
            labeled_values = [value_labels.get(qid, 'Label Not Found') for qid in extracted_value_qids]
            print(f"Labeled values for 'instance of' (P31): {labeled_values}")
    else:
        print("P31 property not found in claims.")

    print("\n------------------------------------------------------------")
    print("This raw data contains every single piece of structured information available for the entity.")



In [63]:
def process_qids_to_jsonl(qid_list, output_filename="entity_data.jsonl"):
    """
    Processes a list of QIDs, fetches structured data, labels it, and stores
    the results (or errors) into a JSONL file.
    
    Args:
        qid_list (list): A list of QID strings (e.g., ['Q534', 'Q142', 'Q999']).
        output_filename (str): The name of the JSONL file to write results to.
    """
    print(f"Starting processing for {len(qid_list)} QIDs.")
    print(f"Results will be written to '{output_filename}'.")
    
    successful_count = 0
    failed_count = 0

    with open(output_filename, 'w', encoding='utf-8') as f:
        for qid in qid_list:
            print(f"Processing {qid}...")
            
            # Initialize the base record structure
            record = {"QID": qid, "status": "failed", "error_message": None}
            
            try:
                # 1. Fetch raw entity data (using your existing function)
                entity_data = fetch_complete_entity_data(qid)

                if "error" in entity_data:
                    # Handle API/Not Found error directly
                    record["error_message"] = entity_data['error']
                    failed_count += 1
                else:
                    # 2. Extract basic details
                    entity_label = entity_data.get('labels', {}).get('en', {}).get('value', 'No label found')
                    entity_description = entity_data.get('descriptions', {}).get('en', {}).get('value', 'No description found')
                    claims = entity_data.get('claims', {})
                    
                    # 3. Get labels for the properties themselves (using your existing function)
                    property_ids = list(claims.keys())
                    property_labels = fetch_labels_for_qids(property_ids)

                    if "error" in property_labels:
                        # Fallback for label fetching error
                        property_labels = {pid: pid for pid in property_ids} 
                        print(f"  Warning: Failed to fetch property labels for {qid}. Using IDs.")
                    
                    # 4. Extract and label all claim values (using your existing function)
                    labeled_claim_values = extract_labeled_claim_values(claims, property_labels)

                    # 5. Structure the final dictionary for successful outcome
                    record.update({
                        "status": "success",
                        "label": entity_label,
                        "description": entity_description,
                        "attributes": labeled_claim_values
                    })
                    record.pop("error_message") # Remove error key on success
                    successful_count += 1
            
            except Exception as e:
                # Catch any unexpected execution errors
                record["error_message"] = f"Unexpected execution error: {type(e).__name__} - {e}"
                failed_count += 1

            # 6. Write the final record (whether success or failure) to the JSONL file
            json_line = json.dumps(record, ensure_ascii=False)
            f.write(json_line + '\n')
    
    print("\n--- Processing Complete ---")
    print(f"Total Processed: {len(qid_list)}")
    print(f"Successful Records: {successful_count}")
    print(f"Failed Records: {failed_count}")
    print("---------------------------\n")

In [57]:
test_one("Q265936")

--- Fetching ALL structured data for Q265936 

--- Main Entity Details (Q265936) ---
Label: Kawasaki disease
Description: human disease in which blood vessels throughout the body become inflamed

--- Full Raw JSON Structure (Truncated for readability) ---

Total Properties (Claims) Found: 53


--- Extracted Labeled Claim Values ---
{
  "Commons category": "Kawasaki disease",
  "OMIM ID": "611775",
  "MedlinePlus ID": "000989",
  "DiseasesDB": "7121",
  "eMedicine ID": "965367",
  "NDL Authority ID": "00565244",
  "Freebase ID": "/m/040k6g",
  "image": "Kawasaki Disease.png",
  "Gran Enciclopèdia Catalana ID (former scheme)": "0262801",
  "Patientplus ID": "kawasaki-disease-pro",
  "Disease Ontology ID": "DOID:13378",
  "NCI Thesaurus ID": "C34825",
  "subclass of": "lymphadenitis",
  "health specialty": "immunology",
  "genetic association": "PPM1L",
  "exact match": "http://purl.obolibrary.org/obo/DOID_13378",
  "UMLS CUI": "C2936917",
  "symptoms and signs": "strawberry tongue",
  "Q

Now I will do this for the first 20 qids in my dataframe

In [59]:
top20 = qid_df.head(20)

In [60]:
top20 = top20['qid'].tolist()

In [61]:
top20

['Q265936',
 'Q49607',
 'Q1195727',
 'Q1582005',
 'Q460186',
 'Q486306',
 'Q869018',
 'Q857634',
 'Q709133',
 'Q675937',
 'Q962932',
 'Q18432',
 'Q3311525',
 'Q2181925',
 'Q122248',
 'Q192814',
 'Q254038',
 'Q4357239',
 'Q30113',
 'Q381941']
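
One thing to keep in mind: head(20) takes the first 20 rows in file order, which is not the same as the 20 most viewed articles. If a ranking by views is what is wanted (that is an assumption about the goal), a sketch on toy rows could look like this. On the real data it would need a dataframe that still has the pageviews column, such as csv_data.

```python
import pandas as pd

# Toy stand-in: the same qid can appear on several days or in several countries.
toy = pd.DataFrame({
    "qid": ["Q265936", "Q142", "Q265936", "Q866", "Q142", "Q460186"],
    "pageviews": [684, 685, 500, 3071, 400, 1044],
})

# Sum the views per qid, then take the qids with the largest totals.
totals = toy.groupby("qid")["pageviews"].sum()
top3 = totals.nlargest(3).index.tolist()

print(totals.nlargest(3))
```

Replacing 3 with 20 would give a top-20 list in the same way.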

In [64]:
if __name__ == "__main__":
    #test_one("Q83285") # Article about Durres
    #test_one("Q7186")  # Article about Marie Curie

    # I'm putting the list here, but you'll have a file with a list of QIDs here.
    qid_list_to_process = top20
    
    output_file = "entity_results.jsonl"

    # Run the main function
    process_qids_to_jsonl(qid_list_to_process, output_file)
    

Starting processing for 20 QIDs.
Results will be written to 'entity_results.jsonl'.
Processing Q265936...
Processing Q49607...
Processing Q1195727...
Processing Q1582005...
Processing Q460186...
Processing Q486306...
Processing Q869018...
Processing Q857634...
Processing Q709133...
Processing Q675937...
Processing Q962932...
Processing Q18432...
Processing Q3311525...
Processing Q2181925...
Processing Q122248...
Processing Q192814...
Processing Q254038...
Processing Q4357239...
Processing Q30113...
Processing Q381941...

--- Processing Complete ---
Total Processed: 20
Successful Records: 20
Failed Records: 0
---------------------------
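
To use the results later, the JSONL file can be read back one line at a time. The sketch below writes a small sample file first so it runs on its own. The first record reuses values from the test_one output above, the failed record is made up, and read_jsonl is a hypothetical helper.

```python
import json
import os
import tempfile

# Sample records in the same shape process_qids_to_jsonl writes.
sample_records = [
    {
        "QID": "Q265936",
        "status": "success",
        "label": "Kawasaki disease",
        "description": "human disease in which blood vessels throughout the body become inflamed",
        "attributes": {"Commons category": "Kawasaki disease", "subclass of": "lymphadenitis"},
    },
    # Made-up failure record, only to show how failed rows can be skipped.
    {"QID": "Q0", "status": "failed", "error_message": "Entity Q0 not found or no data returned."},
]

path = os.path.join(tempfile.mkdtemp(), "entity_results_sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in sample_records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = read_jsonl(path)
ok = [r for r in records if r["status"] == "success"]

# Example lookup: one attribute per successfully processed QID.
subclass_by_qid = {r["QID"]: r["attributes"].get("subclass of") for r in ok}
print(subclass_by_qid)
```

With the real file, path would be "entity_results.jsonl" and the sample-writing part would be dropped.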

