### **APIs with Python: Google Sheets Monitoring - Cocaine Seizures**
#### InSight Crime’s MAD Unit - (June, 2025)

##### Luis Felipe Villota Macías

---------------------



### 1. Goals

* Monitor and validate data in a shared Google Sheet using automated checks to ensure accuracy, consistency, and data quality.

* Highlight invalid or suspicious entries directly in the sheet and optionally generate weekly reports to support governance and oversight.

* Automate the process with Google Apps Script or Python, running validations on a schedule (e.g. every Friday) with minimal manual effort.




________________

### 2. Project Setup

#### 2.1 Version Control

I created a single GitHub repository ([FelipeVillota/db-check-cocaine-seizures](https://github.com/FelipeVillota/db-check-cocaine-seizures)). The repository is kept `private`, and access to it can be granted at any time.

#### 2.2 Reproducible Environment

In [None]:
# IMPORTANT
# To create the virtual environment:
# python -m venv venv-apis

# To activate the environment, run in the terminal (Windows PowerShell):
# (optional, temporary permission for the current session only)
# Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
# venv-apis\Scripts\activate

# Then select the respective kernel (the ipykernel package must be installed to connect to it)

# To update the master list of dependencies:
# pip freeze > requirements.txt

In [None]:
# Checking venv-apis works
import sys
print(sys.executable)

c:\Users\USER\Desktop\codebaker\all_multi\ic-ct\apis\venv-apis\Scripts\python.exe
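As a complementary check, the standard library can also report whether the interpreter is running inside a virtual environment at all, rather than only printing its path. The following is a minimal sketch for environments created with `python -m venv`; the helper name `in_virtualenv` is hypothetical and not part of the original notebook.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```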


#### 2.3 Loading Libraries

pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib pandas


In [None]:
import os
import re
import requests
import pandas as pd
from datetime import datetime

----------------------

### 3.  Approach

As general background:

* I conceive an API as a portal that communicates data between different software systems, so it is key to know how to make `requests` (queries) to different `endpoints` (specific URLs that act as gateways to the API) to extract data, and to know the level of `authentication` (permissions) required for access (Goodwin, 2024). Additionally, one has to consider the `status` (success or failure) of the eventual `response` and the `data format` of the output, which is processed later (Goodwin, 2024; IBM Technology, 2020).

* To interact with any API effectively, I first get familiar with the developer's documentation (in this case, also the dataset codebooks). It is the guide to how the API works and how to use its functionality: it explains the protocols for accessing the service, making calls, and receiving responses.

* In this case, the UCDP API is `openly available` and has a `RESTful architecture` (Representational State Transfer, a client-server model). This is a standard design, and it means the API communicates through `HTTP` (Hypertext Transfer Protocol) methods (IBM Technology, 2020). The `requests` (queries to the source) are therefore made via operations that follow the CRUD logic (create, read, update, delete) (ibid.); this exercise only needs the read operation (an HTTP GET).



* The basics of the UCDP API are:

    Base URL: `https://ucdpapi.pcr.uu.se/api/`

    Endpoint structure (RESTful format): `https://ucdpapi.pcr.uu.se/api/<resource>/<version>?<pagesize=x>&<page=x>`

    Where the parameters are:

    Target dataset: `<resource>`, replaced with `gedevents` for the UCDP Georeferenced Event Dataset (GED)

    Dataset version: `<version>`, replaced with `24.1` (the latest official release) or `24.01.24.12` (the yearly candidate release)

    Pagination parameters: `<pagesize=x>&<page=x>`

    Format of requested data: `JSON`, an array of objects (a common notation in APIs), each representing a GED event


* And, regarding rate limits (focusing on `gedevents`):

    `Requires paging:  1 to 1,000 rows per page`

    `Allows 5,000 requests/day`

    `Counters reset at midnight (UTC)`



* My general idea for this exercise is to create a modular client (frontend) call that extracts only the subset of data required from the UCDP API server (backend) to answer the questions, and to make it easily reusable for future queries.

```python

# MAIN SCRIPT (planning outline: the function bodies are deliberately left as stubs)

# API call
# Creating a request to extract and analyze conflict data for:
    # Deadliest municipalities (overall and civilians) in 2024
    # Provincial violence changes (max increase and max decrease) since Dec 2023
# Establish parameters (e.g. pagination at 1,000 records/page)
# Establish a cache to avoid duplicate calls

# Key objects (placeholder values)
base_UCDP_API_url = "URL"

# Focus
target_dataset = "dataset"  # gedevents

# Accommodating different versions to filter from
dataset_versions = [
    "official",
    "candidate_yearly",
    "candidate_monthly",
]

# Date filters, under UCDP API logic
dates_full_2024 = {
    "startdate": "YYYY-MM-DD",
    "enddate": "YYYY-MM-DD",
}
dates_since_dec_2023 = {
    "startdate": "YYYY-MM-DD",
}

# Further conditions can be defined here too (e.g. Id, Country, Dyad)

# Set of functions:

def build_api_url(target_dataset, dataset_versions, parameters=None, conditions=None):
    # Construct the body of a custom URL from all relevant objects (empty and concatenated) and store it
    # Return a "generic" endpoint URL (which is going to receive future inputs)
    raise NotImplementedError("planning stub")

def get_data(api_url):
    # Use the modular endpoint structure
    # Make the initial GET request, then overwrite the model endpoint URL by
        # LOOPING over the desired filtering instances (dataset_versions, conditions)
            # WHILE they exist
        # Check the response status and print it
        # Store the JSON data from a successful response
    # Return the responses (subsets of UCDP data) as df objects
    raise NotImplementedError("planning stub")

# The broad logic would be something like (plus other filters):
# data_2024 = get_data(build_api_url(target_dataset, dataset_versions, conditions={"startdate": "2024-01-01", "enddate": "2024-12-31"}))
# data_dec2023 = get_data(build_api_url(target_dataset, dataset_versions, conditions={"startdate": "2023-12-01"}))


# Data Analysis

# Filter each df by the different date ranges and conditions (can be applied before or here)
# Group by adm_1, adm_2 in the respective date-filtered df
# Sum fatalities or count unique events in each, according to the question
# Sort those metrics in descending order (or select the desired cases for min or max values)
# Print results

# (NOTE: I keep the latter rationale as simple as possible here. After exploring the dataset versions and their date ranges, I considered different combinations (filters, merges of df) to approach the questions. Please see the respective comments below.)
```
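The paging and rate-limit figures above translate into a simple request budget. The sketch below estimates how many paged requests each download in this notebook needs, assuming one request per page of up to 1,000 rows and using the row counts reported in Section 4.1; the names `pages_needed`, `PAGE_SIZE`, and `DAILY_REQUEST_LIMIT` are illustrative only.

```python
import math

PAGE_SIZE = 1000            # maximum rows per page, as stated above
DAILY_REQUEST_LIMIT = 5000  # daily request allowance, as stated above

def pages_needed(total_rows: int, page_size: int = PAGE_SIZE) -> int:
    """Number of paged requests needed to download total_rows records."""
    return max(1, math.ceil(total_rows / page_size))

# Row counts reported by the .info() calls in Section 4.1
row_counts = {
    "official 24.1 (all events)": 348_733,
    "candidate 24.01.24.12 (full 2024)": 29_203,
    "official 24.1 (since 2023-12-01)": 778,
}

total_requests = 0
for label, rows in row_counts.items():
    n = pages_needed(rows)
    total_requests += n
    print(f"{label}: {n} requests")

print(f"Total: {total_requests} of {DAILY_REQUEST_LIMIT} daily requests")
```

Under these assumptions the three downloads together need 380 requests, well within the daily allowance, although repeated full downloads in one day would add up quickly.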
____________________

### 4. Execution

Step-by-step implementation.

#### 4.1 Accessing the API

In [None]:
# Defining constant vars 
site = "https://ucdpapi.pcr.uu.se/api/"
resource_dataset = "gedevents"
version = {
    "latest_dataset": "24.1",  # Official yearly release, covers events from 1989-01-01 up until 2023-12-31
    "candidate_yearly_dataset": "24.01.24.12"  # Yearly candidate, includes full 2024
}
pag_max = 1000  

In [None]:
# Building a general function that constructs a full endpoint URL
def ucdp_api_url(version=None, start_date=None, end_date=None, page=1):
    url = f"{site}{resource_dataset}/{version}?pagesize={pag_max}&page={page}"
    if start_date:
        url += f"&StartDate={start_date}"
    if end_date:
        url += f"&EndDate={end_date}"
    return url

# Just an example of the output:
print(ucdp_api_url(version="24.1", start_date="2020-01-01", end_date="2020-12-31"))


https://ucdpapi.pcr.uu.se/api/gedevents/24.1?pagesize=1000&page=1&StartDate=2020-01-01&EndDate=2020-12-31
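An alternative to concatenating the query string by hand is to let the standard library encode it. The following is a minimal sketch, not the notebook's own function: `build_url` and `BASE_URL` are hypothetical names, and the filter names `StartDate` and `EndDate` are simply the ones already used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://ucdpapi.pcr.uu.se/api/"

def build_url(resource, version, page=1, pagesize=1000, **filters):
    """Assemble an endpoint URL, skipping filters whose value is None."""
    params = {"pagesize": pagesize, "page": page}
    params.update({key: value for key, value in filters.items() if value is not None})
    # urlencode escapes any characters that are not safe inside a query string
    return f"{BASE_URL}{resource}/{version}?{urlencode(params)}"

print(build_url("gedevents", "24.1", StartDate="2020-01-01", EndDate="2020-12-31"))
```

For this input the sketch produces the same URL as the example output above. The practical gain is that additional filters can be passed as keyword arguments without editing the function body.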


In [None]:
# Defining a function to get results accommodating a combined filter (version + dates)

def get_filtered_data(version=None, start_date=None, end_date=None):
    
    print(f"\nUCDP GED data from version {version} between {start_date} and {end_date if end_date else 'latest'}")
    
    all_data = []
    page = 1

    while True:
        
        url = ucdp_api_url(version=version, start_date=start_date, end_date=end_date, page=page) # Rebuilding the endpoint URL for the current page
        response = requests.get(url) # Calling the API endpoint
        
        if response.status_code != 200: # If not successful
            raise Exception(f"Request failed on page {page} with status {response.status_code}") # Reporting the page and the status code of the failure
        
        data = response.json().get("Result", []) # Extracting the list of events from the JSON response
        all_data.extend(data) # Appending this page's records

        print(f"Fetched page {page} with {len(data)} records.") # Reporting progress
        if len(data) < pag_max:
            break
        page += 1

    return pd.DataFrame(all_data)
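The stopping rule of the loop above (keep requesting pages until one comes back shorter than the page size) can be exercised without touching the network by injecting the page-fetching step. This is a hedged sketch: `collect_pages`, `fake_fetch`, and the `max_pages` safety cap are hypothetical additions, not part of the original function.

```python
from typing import Callable, Dict, List

def collect_pages(fetch_page: Callable[[int], List[Dict]], page_size: int,
                  first_page: int = 1, max_pages: int = 10_000) -> List[Dict]:
    """Call fetch_page(page) until a page comes back shorter than page_size."""
    records: List[Dict] = []
    for page in range(first_page, first_page + max_pages):
        batch = fetch_page(page)
        records.extend(batch)
        if len(batch) < page_size:  # a short (or empty) page is the last one
            return records
    raise RuntimeError(f"No short page after {max_pages} pages; aborting")

# Stand-in for the API: 25 fake events served 10 per page
fake_events = [{"id": i} for i in range(25)]

def fake_fetch(page: int, page_size: int = 10) -> List[Dict]:
    start = (page - 1) * page_size
    return fake_events[start:start + page_size]

result = collect_pages(fake_fetch, page_size=10)
print(len(result))
```

The cap on the number of pages is a design choice for this sketch: it prevents an endless loop if a server keeps returning full pages.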

In [None]:
# Actual filtering and storage of results

# Getting the complete 24.1 official version as context and verification (without any filters)

full_dataset = get_filtered_data(
    version=version["latest_dataset"]
)


full_dataset.info() # RangeIndex: 348,733 entries, 0 to 348,732 x 49 columns

overall_min = full_dataset['date_end'].min()
overall_max = full_dataset['date_end'].max()

print(overall_min) # 1989-01-01 00:00:00
print(overall_max) # 2023-12-31 00:00:00

# Takes ~ 17 mins to fetch all records



UCDP GED data from version 24.1 between None and latest
Fetched page 1 with 1000 records.
Fetched page 2 with 1000 records.
Fetched page 3 with 1000 records.
Fetched page 4 with 1000 records.
Fetched page 5 with 1000 records.
Fetched page 6 with 1000 records.
Fetched page 7 with 1000 records.
Fetched page 8 with 1000 records.
Fetched page 9 with 1000 records.
Fetched page 10 with 1000 records.
Fetched page 11 with 1000 records.
Fetched page 12 with 1000 records.
Fetched page 13 with 1000 records.
Fetched page 14 with 1000 records.
Fetched page 15 with 1000 records.
Fetched page 16 with 1000 records.
Fetched page 17 with 1000 records.
Fetched page 18 with 1000 records.
Fetched page 19 with 1000 records.
Fetched page 20 with 1000 records.
Fetched page 21 with 1000 records.
Fetched page 22 with 1000 records.
Fetched page 23 with 1000 records.
Fetched page 24 with 1000 records.
Fetched page 25 with 1000 records.
Fetched page 26 with 1000 records.
Fetched page 27 with 1000 records.
Fetched
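Since the full download takes about 17 minutes, the cache mentioned in the planning outline could be as simple as a local file that is read back when it already exists. The sketch below assumes a CSV cache and a caller-supplied fetch function; `load_or_fetch` and the demo file name are hypothetical, and a stand-in fetcher replaces the real API call.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_or_fetch(cache_file, fetch):
    """Return the cached DataFrame if cache_file exists, else fetch and cache it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return pd.read_csv(cache_file)
    df = fetch()
    df.to_csv(cache_file, index=False)
    return df

# Demonstration with a stand-in fetcher that counts how often it is called
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    return pd.DataFrame({"id": [1, 2, 3], "best": [0, 4, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ged_demo.csv"
    first = load_or_fetch(path, fake_fetch)   # calls fake_fetch and writes the file
    second = load_or_fetch(path, fake_fetch)  # reads the file, no second call

print(calls["n"])
```

In the notebook, the fetch argument could be something like `lambda: get_filtered_data(version=version["latest_dataset"])`. Note that a CSV round trip does not preserve dtypes such as datetimes, so date columns would need to be parsed again after loading.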

In [None]:
# FULL YEAR 2024, using candidate yearly dataset version

full_2024 = get_filtered_data(
    version=version["candidate_yearly_dataset"], 
    start_date="2024-01-01", # events on this day included
    end_date="2024-12-31"  # events on this day included
)

full_2024.info() # RangeIndex: 29203 entries, 0 to 29202 x 49 columns



UCDP GED data from version 24.01.24.12 between 2024-01-01 and 2024-12-31
Fetched page 1 with 1000 records.
Fetched page 2 with 1000 records.
Fetched page 3 with 1000 records.
Fetched page 4 with 1000 records.
Fetched page 5 with 1000 records.
Fetched page 6 with 1000 records.
Fetched page 7 with 1000 records.
Fetched page 8 with 1000 records.
Fetched page 9 with 1000 records.
Fetched page 10 with 1000 records.
Fetched page 11 with 1000 records.
Fetched page 12 with 1000 records.
Fetched page 13 with 1000 records.
Fetched page 14 with 1000 records.
Fetched page 15 with 1000 records.
Fetched page 16 with 1000 records.
Fetched page 17 with 1000 records.
Fetched page 18 with 1000 records.
Fetched page 19 with 1000 records.
Fetched page 20 with 1000 records.
Fetched page 21 with 1000 records.
Fetched page 22 with 1000 records.
Fetched page 23 with 1000 records.
Fetched page 24 with 1000 records.
Fetched page 25 with 1000 records.
Fetched page 26 with 1000 records.
Fetched page 27 with 1000

In [None]:
# SINCE DEC 1st, 2023, using latest official dataset version
since_dec_2023 = get_filtered_data(
    version=version["latest_dataset"], 
    start_date="2023-12-01" # events on this day included onwards until latest date available (which is 2023-12-31)
)

since_dec_2023.info() # RangeIndex: 778 entries, 0 to 777 x 49 columns


UCDP GED data from version 24.1 between 2023-12-01 and latest
Fetched page 1 with 778 records.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 778 entries, 0 to 777
Data columns (total 49 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 778 non-null    int64  
 1   relid              778 non-null    object 
 2   year               778 non-null    int64  
 3   active_year        778 non-null    bool   
 4   code_status        778 non-null    object 
 5   type_of_violence   778 non-null    int64  
 6   conflict_dset_id   778 non-null    object 
 7   conflict_new_id    778 non-null    int64  
 8   conflict_name      778 non-null    object 
 9   dyad_dset_id       778 non-null    object 
 10  dyad_new_id        778 non-null    int64  
 11  dyad_name          778 non-null    object 
 12  side_a_dset_id     778 non-null    object 
 13  side_a_new_id      778 non-null    int64  
 14  side_a             778 non

In [None]:
# MERGING BOTH: FULL OFFICIAL DATASET + FULL 2024 

combined_dfs = pd.concat([full_2024, full_dataset], ignore_index=True)
combined_dfs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377936 entries, 0 to 377935
Data columns (total 49 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 377936 non-null  int64  
 1   relid              377936 non-null  object 
 2   year               377936 non-null  int64  
 3   active_year        377936 non-null  bool   
 4   code_status        377936 non-null  object 
 5   type_of_violence   377936 non-null  int64  
 6   conflict_dset_id   377936 non-null  object 
 7   conflict_new_id    377936 non-null  int64  
 8   conflict_name      377936 non-null  object 
 9   dyad_dset_id       377936 non-null  object 
 10  dyad_new_id        377936 non-null  int64  
 11  dyad_name          377936 non-null  object 
 12  side_a_dset_id     377936 non-null  object 
 13  side_a_new_id      377936 non-null  int64  
 14  side_a             377936 non-null  object 
 15  side_b_dset_id     377936 non-null  object 
 16  si
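When stacking two dataset versions, it may be worth confirming that no event appears twice. The sketch below uses toy frames rather than the real data and assumes that `id` values are comparable across versions, which should be checked against the codebook; the overlap on `id` 3 is introduced on purpose.

```python
import pandas as pd

# Toy stand-ins for the two versions; id 3 overlaps on purpose
official = pd.DataFrame({"id": [1, 2, 3],
                         "date_end": ["2023-11-30", "2023-12-15", "2023-12-31"]})
candidate = pd.DataFrame({"id": [3, 4],
                          "date_end": ["2023-12-31", "2024-01-05"]})

combined = pd.concat([candidate, official], ignore_index=True)

# duplicated() flags every repeat of an id after its first occurrence
n_duplicates = int(combined["id"].duplicated().sum())
print(f"Duplicated ids: {n_duplicates}")

# Keep the first occurrence of each id (here, the candidate row)
deduplicated = combined.drop_duplicates(subset="id", keep="first")
print(len(deduplicated))
```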

#### 4.2 Data Analysis

In [None]:
# 1. What are the five deadliest municipalities (adm_2) overall in 2024?

# Focus on variable:
# best (integer): "The best (most likely) estimate of total fatalities resulting from an event." (Högbladh, 2024; Sundberg & Melander, 2013) 

full_2024.groupby('adm_2')['best'].sum().nlargest(5)

adm_2
Pokrovsk raion                     17849
Deir al-Balah governorate           3929
Bakhmut raion                       3603
Gaza governorate                    3386
Gaza ash Shamaliyah governorate     2851
Name: best, dtype: int64
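Two details of grouping by `adm_2` alone may matter for the ranking: rows with a missing `adm_2` are dropped by default, and identically named units in different countries would be summed together. The toy example below illustrates both effects; it assumes a `country` column is available alongside `adm_2`, and the values are invented for illustration.

```python
import pandas as pd

events = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "adm_2": ["Central district", "Central district", "Central district", None, None],
    "best": [5, 7, 20, 30, 1],
})

# Grouping by name only: missing adm_2 rows vanish, same-named units are merged
by_name = events.groupby("adm_2")["best"].sum()

# Grouping by country and name, keeping missing adm_2 as its own group
by_country_and_name = events.groupby(["country", "adm_2"], dropna=False)["best"].sum()

print(by_name)
print(by_country_and_name)
```

Whether either effect changes the top five in the real data is an open question; the sketch only shows how to check.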

In [None]:
# 2. What are the five deadliest municipalities (adm_2) just for civilians in 2024?

# Focus on variable:
# deaths_civilians (integer): "The best estimate of dead civilians in the event. For non-state or state-based events, this is the number of collateral damage resulting in fighting between side a and side b. For one-sided violence, it is the number of civilians killed by side a." (Högbladh, 2024; Sundberg & Melander, 2013) 

full_2024.groupby('adm_2')['deaths_civilians'].sum().nlargest(5)

adm_2
Deir al-Balah governorate          1801
Gaza governorate                   1234
Gaza ash Shamaliyah governorate     871
El Fasher district                  797
Rafah governorate                   729
Name: deaths_civilians, dtype: int64

In [None]:
# For Q3 & Q4: Given the instructions and the datasets consulted, I see different possible ways to answer them.


# Option 1 -> Strictly using the latest official dataset and the defined date filter, I can only find the highest and lowest counts of unique events in DEC 2023

# 3. Which province (adm_1) has seen the largest increase in overall violence since December 2023?
# id (integer) = "A unique numeric ID identifying each event." (Högbladh, 2024; Sundberg & Melander, 2013) 
print("OPTION 1: The province with the largest increase in overall violence (in number of unique events) since December 2023 is:")
print(since_dec_2023.groupby('adm_1')['id'].count().nlargest(1))

# 4. Which province (adm_1) has seen the largest decrease in overall violence since December 2023?
print("OPTION 1: The province with the largest decrease in overall violence (in number of unique events) since December 2023 is:")
since_dec_2023.groupby('adm_1')['id'].count().nsmallest(1) # The lowest possible count is 1 event; several provinces have just 1, and nsmallest(1) returns only the first of them


OPTION 1: The province with the largest increase in overall violence (in number of unique events) since December 2023 is:
adm_1
Gaza Strip    65
Name: id, dtype: int64
OPTION 1: The province with the largest decrease in overall violence (in number of unique events) since December 2023 is:


adm_1
Adamawa state    1
Name: id, dtype: int64

In [None]:
# Option 2 -> using the merged df and comparing event counts before and after DEC 2023

combined_dfs['date_end'] = pd.to_datetime(combined_dfs['date_end']) # ensuring a datetime dtype so the date comparisons below work

# Date splits
before = combined_dfs[combined_dfs['date_end'] < '2023-12-01'] # events ending before DEC 2023
after = combined_dfs[combined_dfs['date_end'] >= '2023-12-01'] # events ending from DEC 2023 onwards

change = (
    after['adm_1'].value_counts() - 
    before['adm_1'].value_counts()).fillna(0) # per-province difference in event counts (positive = increase); provinces present in only one period become NaN and are set to 0

print(f"OPTION 2: Largest increase since DEC 2023 (province level): {change.idxmax()} ({int(change.max())} more events)") # printing the index label (province) of the largest value and the count itself
print(f"OPTION 2: Largest decrease since DEC 2023 (province level): {change.idxmin()} ({int(change.min())} fewer events)") # printing the index label (province) of the lowest value and the count itself

OPTION 2: Largest increase since DEC 2023 (province level): Kursk oblast (498 more events)
OPTION 2: Largest decrease since DEC 2023 (province level): Rif Dimashq governorate (-17401 fewer events)
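One detail that may affect the comparison above: subtracting two `value_counts()` series yields `NaN` for any province that appears in only one period, and `fillna(0)` then records no change for it. A hedged alternative is `Series.sub` with `fill_value=0`, which treats a missing count as zero events. The toy series below are invented to show the difference.

```python
import pandas as pd

before_counts = pd.Series({"Province A": 10, "Province B": 4})
after_counts = pd.Series({"Province A": 3, "Province C": 6})

# Plain subtraction: labels missing on either side become NaN, then 0 after fillna
naive = (after_counts - before_counts).fillna(0)

# Series.sub with fill_value=0: a missing count is treated as zero events
aligned = after_counts.sub(before_counts, fill_value=0)

print(naive.to_dict())
print(aligned.to_dict())
```

In this toy case the aligned version reports Province C as the largest increase (+6) and Province B as a decrease (-4), both of which the plain subtraction reports as 0. The two periods compared in the notebook also differ greatly in length, so a per-month rate might be a fairer basis for comparison.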


### References

Davies, S., Engström, G., Pettersson, T., & Öberg, M. (2024). Organized violence 1989-2023, and the prevalence of organized crime groups. Journal of Peace Research, 61(4).

Goodwin, M. (2024, April 9). What Is an API (Application Programming Interface)? IBM. https://www.ibm.com/think/topics/api

Högbladh, S. (2024). UCDP GED Codebook version 24.1. Department of Peace and Conflict Research, Uppsala University.

IBM Technology (Director). (2020, October 23). What is a REST API? [Video recording]. https://www.youtube.com/watch?v=lsMQRaeKNDk

JSON. (n.d.). Retrieved April 7, 2025, from https://www.json.org/json-en.html

Sundberg, R., & Melander, E. (2013). Introducing the UCDP Georeferenced Event Dataset. Journal of Peace Research, 50(4), 523-532.

UCDP Application Programming Interface (API). (n.d.). Retrieved April 7, 2025, from https://ucdp.uu.se/apidocs/


________________