# Extract data from NOMIS

This notebook extracts data from the NOMIS API. It targets the following table:

+ NM_608_1 - Ethnicity by LSOA (Lower Super Output Area)

The notebook downloads data for each Lower Super Output Area (listed in nomis_lsoa_codes.csv), covering all ethnicity groups and the total across rural and urban areas.

The notebook assumes that all LSOA codes are contiguous, so they can be batched into ranges to reduce the number of API calls.



Data is exported to the folder "X_Output", which must be created before the notebook is run.
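
The export steps write into "X_Output" without creating it, so one way to prepare the folder is sketched below. This assumes the notebook is run from the directory that should contain "X_Output"; the name `output_dir` is illustrative and is not used elsewhere in the notebook.

```python
from pathlib import Path

# Assumption: the current working directory is where X_Output should live.
output_dir = Path("X_Output")

# Create the folder if it is missing; do nothing if it already exists.
output_dir.mkdir(parents=True, exist_ok=True)
```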

In [36]:
import math
import pandas as pd

from tqdm.notebook import tqdm
from pyjstat import pyjstat
from typing import List

**Set the upper and lower limits for NOMIS LSOA geography codes, the number of cells returned per LSOA, and the maximum query size**

In [37]:
NOMIS_LSOA_MIN = 1249902593
NOMIS_LSOA_MAX = 1249935436
NOMIS_COLS_PER_LSOA = 20 # Number of cells of data (columns) returned per LSOA
NOMIS_MAX_QUERY = 20000  # Maximum number of cells that can be returned by NOMIS API per query (note: actually 25,000 but lower limit here for safety)
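
As a rough check on what these constants imply, the arithmetic below works out how many LSOAs fit in one request and how many requests are needed. The values are copied from the cell above; the names `lsoa_count`, `batch_size` and `batch_count` are illustrative only.

```python
import math

# Values copied from the constants above.
NOMIS_LSOA_MIN = 1249902593
NOMIS_LSOA_MAX = 1249935436
NOMIS_COLS_PER_LSOA = 20
NOMIS_MAX_QUERY = 20000

# Number of geography codes in the (assumed contiguous) range, inclusive.
lsoa_count = NOMIS_LSOA_MAX - NOMIS_LSOA_MIN + 1

# Largest number of LSOAs whose cells fit within one query.
batch_size = NOMIS_MAX_QUERY // NOMIS_COLS_PER_LSOA

# Number of API calls needed to cover every code.
batch_count = math.ceil(lsoa_count / batch_size)

print(lsoa_count, batch_size, batch_count)  # 32844 1000 33
```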

**Define a Python generator that yields ranges of NOMIS geography codes**

Each range it yields becomes one API call in the main loop.

In [38]:
from typing import Iterator


def nomis_code_range(nomis_min: int, nomis_max: int, nomis_step: int) -> Iterator[str]:
    """Yield "start...end" strings covering nomis_min to nomis_max in blocks of nomis_step."""
    iterations = math.ceil((nomis_max - nomis_min + 1) / nomis_step)
    for i in range(iterations):
        # Start of this batch; the end is capped so the last batch stops at nomis_max.
        start = nomis_min + i * nomis_step
        end = min(nomis_max, start + nomis_step - 1)
        yield f"{start}...{end}"
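
To see what the generator produces, the sketch below restates the same logic under a hypothetical name (`code_ranges`) so that it runs on its own, and applies it to a small range of codes. Real NOMIS codes are much larger, but the batching is the same.

```python
import math
from typing import Iterator


def code_ranges(code_min: int, code_max: int, step: int) -> Iterator[str]:
    """Yield "start...end" strings covering code_min..code_max in blocks of step."""
    for i in range(math.ceil((code_max - code_min + 1) / step)):
        start = code_min + i * step
        # Cap the end so the final block does not run past code_max.
        end = min(code_max, start + step - 1)
        yield f"{start}...{end}"


print(list(code_ranges(1, 10, 4)))  # ['1...4', '5...8', '9...10']
```

The last range is shorter than the others because it is capped at the upper limit.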

In [39]:
def nomis_url(table_name: str, geography: str) -> str:
    """Build the NOMIS API URL for one table and one geography code range."""
    # Supported tables:
    # NM_608_1 - Ethnicity (LSOA)

    url_base = f"https://www.nomisweb.co.uk/api/v01/dataset/{table_name}.jsonstat.json?"
    url_geography_base = "geography="
    url_date_base = "date="

    url_params = {}
    url_params["NM_608_1"] = "&rural_urban=0&cell=0...18&measures=20100"

    dates = [
        "latest",
    ]
    date_enc = ",".join(dates)

    url = (
        url_base
        + url_geography_base
        + geography
        + "&"
        + url_date_base
        + date_enc
        + url_params[table_name]
    )

    return url
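
For reference, the snippet below writes out the URL that the function above should assemble for the first batch of 1,000 codes and splits its query string back into parameters. The URL is derived by hand from the string pieces in the function, not fetched from the API; the names `example_url` and `params` are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# URL expected for table NM_608_1 and the first geography batch,
# assembled by hand from the pieces used in nomis_url.
example_url = (
    "https://www.nomisweb.co.uk/api/v01/dataset/NM_608_1.jsonstat.json?"
    "geography=1249902593...1249903592"
    "&date=latest"
    "&rural_urban=0&cell=0...18&measures=20100"
)

# Split the query string into a dict of parameter name -> list of values.
params = parse_qs(urlsplit(example_url).query)

for name, values in params.items():
    print(name, values)
```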


In [40]:
def write_list(output_list: List[str], output_filename: str) -> None:
    """Write each string in output_list on its own line to a file in ./X_Output."""
    with open(f"./X_Output/{output_filename}", "w") as textfile:
        for el in output_list:
            textfile.write(el + "\n")

**Iterate over geographic ranges**

In [41]:
population_urls = []
frames = []

# Number of LSOAs per request, so that each response stays under the cell limit
batch_size = math.floor(NOMIS_MAX_QUERY / NOMIS_COLS_PER_LSOA)
iterations = math.ceil((NOMIS_LSOA_MAX - NOMIS_LSOA_MIN + 1) / batch_size)

for geography in tqdm(nomis_code_range(NOMIS_LSOA_MIN, NOMIS_LSOA_MAX, batch_size), total=iterations):
    url = nomis_url("NM_608_1", geography)
    population_urls.append(url)  # keep a record of each request URL
    dataset = pyjstat.Dataset.read(url)

    df: pd.DataFrame = dataset.write('dataframe')  # type: ignore
    frames.append(df)

# Concatenate once at the end rather than growing the DataFrame on every iteration
population = pd.concat(frames, axis=0)


**Output data to CSV file**

In [42]:
population.to_csv("./X_Output/lsoa_ethnicity_all.csv", index=False)