# Nomis API Tutorial

This tutorial shows how to access data through the NOMIS API. Nomis is a service provided by the Office for National Statistics (ONS) that gives free access to detailed and up-to-date UK labour market statistics from official sources.

https://www.nomisweb.co.uk/api/v01/help

In [1]:
import requests as rq
import pandas as pd
from fuzzywuzzy import fuzz

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

## Listing all data in NOMIS

Data sets in NOMIS have an ID, a name, and a description, all of which are useful when searching for data sets. NOMIS stores this information in objects it calls "key families". We can build a function that gets the ID, name, and description of every data set in NOMIS so that we can search them locally on our computer.

In [2]:
def list_data_sets():
    
    # This function obtains a list of all the NOMIS data sets available,
    # plus a dictionary of metadata for each data set keyed by its ID
    
    # Make the request to the API
    root = "http://www.nomisweb.co.uk"
    ext = "/api/v01/dataset/def.sdmx.json"
    response = rq.get(root + ext).json()
    
    # Process the response to get the useful info from the key families if it is available
    key_fams = response['structure']['keyfamilies']['keyfamily']
    data_sets = []
    meta_data = {}
    for key_fam in key_fams:

        row = {}

        # Each field is optional, so a missing one is skipped rather than treated as an error
        try:
            row['description'] = key_fam['description']['value']
        except (KeyError, TypeError):
            pass

        try:
            row['id'] = key_fam['id']
            meta_data[key_fam['id']] = {}
        except (KeyError, TypeError):
            pass

        try:
            row['name'] = key_fam['name']['value']
        except (KeyError, TypeError):
            pass

        data_sets.append(row)

        # Annotations hold metadata such as units, status and release dates
        try:
            annotations = key_fam['annotations']['annotation']
            for annotation in annotations:
                meta_data[key_fam['id']][annotation['annotationtitle']] = annotation['annotationtext']
        except (KeyError, TypeError):
            pass

        # Components list the dimensions, primary measure and time dimension of the data set
        try:
            components = key_fam['components']
            meta_data[key_fam['id']]['dimensions'] = []
            for dimension in components['dimension']:
                meta_data[key_fam['id']]['dimensions'].append(dimension['conceptref'])

            meta_data[key_fam['id']]['primarymeasure'] = components['primarymeasure']['conceptref']
            meta_data[key_fam['id']]['timedimension'] = components['timedimension']['conceptref']
        except (KeyError, TypeError):
            pass

    return data_sets, meta_data

In [3]:
data_sets, meta_data = list_data_sets()
df_data_sets_list = pd.DataFrame(data_sets)
df_data_sets_list.head()

Unnamed: 0,description,id,name
0,"Records the number of people claiming Jobseeker's Allowance (JSA) and National Insurance credits at Jobcentre Plus local offices. This is not an official measure of unemployment, but is the only indicative statistic available for areas smaller than Local Authorities.",NM_1_1,Jobseeker's Allowance with rates and proportions
1,A quartery count of claimants who were claiming Jobseeker's Allowance on the count date analysed by their age and the duration.,NM_2_1,claimant count - age and duration
2,A monthly count of Jobseeker's Allowance (JSA) claimants broken down by age and the duration of claim. Totals exclude non-computerised clerical claims (approx. 1%). Available for areas smaller than Local Authorities.,NM_4_1,Jobseeker's Allowance by age and duration
3,A midyear estimate of the workforce (the denominator) which was used for calculating claimant count rates prior to January 2003. The estimates are broken down by sex but not age.,NM_5_1,claimant count denominators - historical workforce series
4,A quarterly count of Jobseeker's Allowance claimants analysed by their sought and usual occupation.,NM_6_1,claimant count - occupation
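
To make the shape of a key family more concrete, here is a minimal sketch that uses a hand-written dictionary in place of an API response. The dictionary mirrors only the fields that `list_data_sets` reads and is an assumption about the structure, not a copy of real output. The helper `summarise_key_family` is hypothetical and shows how `dict.get` with a default can stand in for the `try`/`except` blocks when a field such as the description is missing.

```python
# Hand-written stand-in for one key family; NOT a real API response.
# It has no 'description' key, as many data sets in the listing have none.
sample_key_family = {
    'id': 'NM_127_1',
    'name': {'value': 'model-based estimates of unemployment'},
    'annotations': {'annotation': [
        {'annotationtitle': 'Units', 'annotationtext': 'Persons'},
        {'annotationtitle': 'Mnemonic', 'annotationtext': 'umb'},
    ]},
    'components': {
        'dimension': [{'conceptref': 'GEOGRAPHY'}, {'conceptref': 'ITEM'}],
        'primarymeasure': {'conceptref': 'OBS_VALUE'},
        'timedimension': {'conceptref': 'TIME'},
    },
}


def summarise_key_family(key_fam):
    # Hypothetical helper: .get() returns a default instead of raising
    # KeyError when an optional field is absent.
    row = {
        'id': key_fam.get('id'),
        'name': key_fam.get('name', {}).get('value'),
        'description': key_fam.get('description', {}).get('value'),
    }
    annotations = key_fam.get('annotations', {}).get('annotation', [])
    meta = {a['annotationtitle']: a['annotationtext'] for a in annotations}
    components = key_fam.get('components', {})
    meta['dimensions'] = [d['conceptref'] for d in components.get('dimension', [])]
    return row, meta


row, meta = summarise_key_family(sample_key_family)
print(row)
print(meta)
```

The missing description comes back as `None`, which pandas would show as an empty cell in the table.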


## Searching the list of data sets

We can search for data on different topics by looking for key words in the list of data sets. We can search the id, the description, or the name. We can build a function to help with this.

In [5]:
def search_data_list(df_data_sets_list, search_field, search_term):
    
    # This function will search a field (e.g., id, description or name) to see 
    # how well it matches the terms in the search_term
    # Results are returned in a table with the best matches at the top
    
    df = df_data_sets_list.copy()
    # Missing values (e.g., data sets with no description) are scored as empty strings
    df['search_score'] = df[search_field].fillna('').apply(lambda x: fuzz.token_sort_ratio(x, search_term))
    df = df.sort_values('search_score', ascending=False)
    
    return df

In [6]:
df_search_results = search_data_list(df_data_sets_list, 'name', 'unemploy')
df_search_results.head(20)

Unnamed: 0,description,id,name,search_score
96,,NM_127_1,model-based estimates of unemployment,36
166,,NM_509_1,Population,33
244,,NM_589_1,Occupation,33
951,,NM_1613_1,KS011a - Industry of employment,32
11,"A residence based labour market survey encompassing population, economic activity (employment and unemployment), economic inactivity and qualifications. These are broken down where possible by gender, age, ethnicity, industry and occupation. Available at Local Authority level and above. Updated quarterly.",NM_17_1,annual population survey,31
97,,NM_129_1,Annual Civil Service Employment Survey,30
241,,NM_586_1,Economic Position - Females,30
527,,NM_918_1,ST606EWla - Occupation,29
1022,,NM_1684_1,UV068 - Household type,29
952,,NM_1614_1,KS011b - Industry of employment - males,28
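
The scores above are low even for the best match (36), and unrelated names such as "Population" rank close behind it, because `token_sort_ratio` compares whole strings rather than looking for the term inside the name. For a partial term like "unemploy", a plain substring filter may work better. The sketch below uses a hypothetical helper, `contains_search`, on three names copied from the results above. It is an alternative to the fuzzy search, not a replacement prescribed by NOMIS.

```python
import pandas as pd

# Three rows copied from the search results above
names = pd.DataFrame({
    'id': ['NM_127_1', 'NM_509_1', 'NM_17_1'],
    'name': [
        'model-based estimates of unemployment',
        'Population',
        'annual population survey',
    ],
})


def contains_search(df, search_field, search_term):
    # Case-insensitive literal substring match.
    # na=False treats missing values as "no match" instead of NaN.
    mask = df[search_field].str.contains(search_term, case=False, regex=False, na=False)
    return df[mask]


matches = contains_search(names, 'name', 'unemploy')
print(matches)
```

Only the row for `NM_127_1` is kept, since it is the only name containing the term.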


## Checking which dimensions are available on a particular data set

In [7]:
meta_data['NM_127_1']

{'Status': 'Current (being actively updated)',
 'Units': 'Persons',
 'contenttype/sources': 'aps',
 'SubDescription': 'previously unavailable',
 'Mnemonic': 'umb',
 'FirstReleased': '2006-03-29 09:30:00',
 'LastUpdated': '2019-07-16 09:30:00',
 'LastRevised': '2019-04-16 09:30:00',
 'dimensions': ['GEOGRAPHY', 'ITEM', 'MEASURES', 'FREQ'],
 'primarymeasure': 'OBS_VALUE',
 'timedimension': 'TIME'}

In [8]:
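def dimension_values(data_set_id, dimension):
    
    # This function gets the values available for one dimension of a data set
    # (e.g., 'GEOGRAPHY', 'time', or a path such as 'geography/2092957699TYPE432')
    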
    root = "http://www.nomisweb.co.uk"
    ext = f'/api/v01/dataset/{data_set_id}/{dimension}.def.sdmx.json'
    response = rq.get(root + ext).json()
    dim_vals = [
        {
            'description': item['description']['value'],
            'value': item['value'],
            'dimension': dimension
        } for item in response['structure']['codelists']['codelist'][0]['code']
    ]
    return dim_vals


def all_dimension_values(meta_data, data_set_id):
    
    # This function gets the values that are available for different dimensions on a data set
    
    dimensions = meta_data[data_set_id]['dimensions']
    all_dim_vals = []
    for dimension in dimensions:
        all_dim_vals += dimension_values(data_set_id, dimension)
        
    return all_dim_vals

In [9]:
pd.DataFrame(all_dimension_values(meta_data, 'NM_127_1'))

Unnamed: 0,description,value,dimension
0,Great Britain,2092957698,GEOGRAPHY
1,England,2092957699,GEOGRAPHY
2,Wales,2092957700,GEOGRAPHY
3,Scotland,2092957701,GEOGRAPHY
4,England and Wales,2092957703,GEOGRAPHY
5,Unemployment count (model based),1,ITEM
6,Unemployment rate (model based),2,ITEM
7,value,20100,MEASURES
8,confidence,20701,MEASURES
9,Quarterly,Q,FREQ


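## Available geography levels for a data set

Appending a geography value to the dimension path (for example 'geography/2092957699' for England) lists the geography types available within that area. Appending one of those type codes (for example 'geography/2092957699TYPE432') lists the individual areas of that type.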

In [10]:
pd.DataFrame(dimension_values('NM_127_1', 'geography/2092957699')).head(100)

Unnamed: 0,description,value,dimension
0,England,2092957699,geography/2092957699
1,local authorities: county / unitary (as of April 2021) within England,2092957699TYPE431,geography/2092957699
2,local authorities: district / unitary (as of April 2021) within England,2092957699TYPE432,geography/2092957699
3,local authorities: district / unitary (as of April 2019) within England,2092957699TYPE434,geography/2092957699
4,combined authorities within England,2092957699TYPE442,geography/2092957699
5,local enterprise partnerships (as of April 2021) within England,2092957699TYPE459,geography/2092957699
6,local authorities: county / unitary (prior to April 2015) within England,2092957699TYPE463,geography/2092957699
7,local authorities: district / unitary (prior to April 2015) within England,2092957699TYPE464,geography/2092957699
8,english counties within England,2092957699TYPE469,geography/2092957699
9,regions within England,2092957699TYPE480,geography/2092957699


In [11]:
pd.DataFrame(dimension_values('NM_127_1', 'geography/2092957699TYPE432')).head(100)

Unnamed: 0,description,value,dimension
0,Darlington,1811939329,geography/2092957699TYPE432
1,County Durham,1811939330,geography/2092957699TYPE432
2,Hartlepool,1811939331,geography/2092957699TYPE432
3,Middlesbrough,1811939332,geography/2092957699TYPE432
4,Northumberland,1811939334,geography/2092957699TYPE432
5,Redcar and Cleveland,1811939335,geography/2092957699TYPE432
6,Stockton-on-Tees,1811939336,geography/2092957699TYPE432
7,Gateshead,1811939338,geography/2092957699TYPE432
8,Newcastle upon Tyne,1811939339,geography/2092957699TYPE432
9,North Tyneside,1811939340,geography/2092957699TYPE432
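
Once the areas are listed, a common next step is to pick out the codes for a few of them by name. The sketch below uses four rows copied from the output above, in the list-of-dictionaries shape that `dimension_values` returns. The helper `find_codes` is hypothetical. The codes are written as integers here, and the real response may return them as strings.

```python
# Four rows copied from the output above (value types are an assumption)
areas = [
    {'description': 'Darlington', 'value': 1811939329, 'dimension': 'geography/2092957699TYPE432'},
    {'description': 'Gateshead', 'value': 1811939338, 'dimension': 'geography/2092957699TYPE432'},
    {'description': 'Newcastle upon Tyne', 'value': 1811939339, 'dimension': 'geography/2092957699TYPE432'},
    {'description': 'North Tyneside', 'value': 1811939340, 'dimension': 'geography/2092957699TYPE432'},
]


def find_codes(dim_vals, text):
    # Hypothetical helper: map description -> code for every entry whose
    # description contains `text`, ignoring case.
    text = text.lower()
    return {d['description']: d['value'] for d in dim_vals if text in d['description'].lower()}


tyne = find_codes(areas, 'tyne')
# Join the codes with commas, the same format used for multi-value filters
geography_filter = ','.join(str(code) for code in tyne.values())
print(tyne)
print(geography_filter)
```

The resulting string could be used as the `GEOGRAPHY` value in a query in place of a whole geography type.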


## Available time periods for a data set

The following notes on selecting time periods come from the NOMIS API documentation.

Useful time options:
 - "latest" - the latest available data for this dataset
 - "previous" - the date prior to "latest"
 - "prevyear" - the date one year prior to "latest"
 - "first" - the oldest available data for this dataset

With the "time" parameter you can enter at most two dates, a start and an end. All dates between these are returned.

The "date" parameter is more flexible. It accepts relative dates: for example, to get the latest date together with the dates three months and six months prior to it, specify "date=latest,latestMINUS3,latestMINUS6". It also accepts ranges: to get the data for 12 months ago together with all dates in the last six months up to the latest, specify "date=prevyear,latestMINUS5-latest".

To illustrate the difference between "date" and "time": "time=first,latest" returns all dates from first to latest inclusive, whereas "date=first,latest" returns only the first and latest dates.
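
The difference between the two parameters is easiest to see side by side in the request URLs. This sketch only builds the strings and sends no request. `build_url` is a hypothetical helper, and the parameter values are the examples quoted above.

```python
from urllib.parse import urlencode

BASE = "http://www.nomisweb.co.uk/api/v01/dataset"


def build_url(data_set_id, **params):
    # safe=',' keeps the commas that separate values unescaped
    return f"{BASE}/{data_set_id}.data.csv?{urlencode(params, safe=',')}"


# "time": a start and an end, with every period in between returned
range_url = build_url("NM_127_1", time="first,latest")

# "date": only the periods that are listed
points_url = build_url("NM_127_1", date="latest,latestMINUS3,latestMINUS6")

# "date" with a range mixed in
mixed_url = build_url("NM_127_1", date="prevyear,latestMINUS5-latest")

print(range_url)
print(points_url)
print(mixed_url)
```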

In [12]:
pd.DataFrame(dimension_values('NM_127_1', 'time')).head(100)

Unnamed: 0,description,value,dimension
0,Jan 2004-Dec 2004,2004-12,time
1,Apr 2004-Mar 2005,2005-03,time
2,Jul 2004-Jun 2005,2005-06,time
3,Oct 2004-Sep 2005,2005-09,time
4,Jan 2005-Dec 2005,2005-12,time
5,Apr 2005-Mar 2006,2006-03,time
6,Jul 2005-Jun 2006,2006-06,time
7,Oct 2005-Sep 2006,2006-09,time
8,Jan 2006-Dec 2006,2006-12,time
9,Apr 2006-Mar 2007,2007-03,time


## Downloading a data set

To get a data set from the NOMIS API we put together a query. A query is a string that describes the data we are requesting, such as the data set ID, the geographical area, and the time period. We can define a function that builds the query string from this information, and then pass the query string to pandas to request the data.

In [13]:
def make_query(data_set_id, dimension_filters=None):
    
    # This function makes a query string for requesting data from the NOMIS API
    # dimension_filters maps dimension names to comma-separated values
    
    root = "http://www.nomisweb.co.uk"
    ext = "/api/v01/dataset/"
    
    # Default to no filters (None avoids sharing one mutable default dict between calls)
    dimension_filters = dimension_filters or {}
    params = '&'.join(f'{dimension.lower()}={value}' for dimension, value in dimension_filters.items())
    if len(params) > 0:
        params = f'?{params}'
        
    query = f'{root}{ext}{data_set_id}.data.csv{params}'

    return query

In [14]:
dimension_filters = {
    'GEOGRAPHY': '2092957699TYPE432',
    'MEASURES': '20100',
    'TIME': '2020-12,2019-12,2018-12',
    'ITEM': '1'
}

query_string = make_query('NM_127_1', dimension_filters)
query_string

'http://www.nomisweb.co.uk/api/v01/dataset/NM_127_1.data.csv?geography=2092957699TYPE432&measures=20100&time=2020-12,2019-12,2018-12&item=1'

In [15]:
df = pd.read_csv(query_string)

In [16]:
df.shape

(2781, 28)

In [17]:
df.columns

Index(['DATE', 'DATE_NAME', 'DATE_CODE', 'DATE_TYPE', 'DATE_TYPECODE',
       'DATE_SORTORDER', 'GEOGRAPHY', 'GEOGRAPHY_NAME', 'GEOGRAPHY_CODE',
       'GEOGRAPHY_TYPE', 'GEOGRAPHY_TYPECODE', 'GEOGRAPHY_SORTORDER', 'ITEM',
       'ITEM_NAME', 'ITEM_CODE', 'ITEM_TYPE', 'ITEM_TYPECODE',
       'ITEM_SORTORDER', 'MEASURES', 'MEASURES_NAME', 'OBS_VALUE',
       'OBS_STATUS', 'OBS_STATUS_NAME', 'OBS_CONF', 'OBS_CONF_NAME', 'URN',
       'RECORD_OFFSET', 'RECORD_COUNT'],
      dtype='object')
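
The `RECORD_OFFSET` and `RECORD_COUNT` columns suggest that the API can return results in pages. The sketch below builds one URL per page, assuming the API accepts `recordoffset` and `recordlimit` query parameters. That assumption is not confirmed anywhere in this tutorial, so check the API help page linked at the top before relying on it. `page_urls` is a hypothetical helper, and the page size of 1000 is arbitrary.

```python
def page_urls(base_query, record_count, page_size):
    # One URL per page, with offsets 0, page_size, 2 * page_size, ...
    # 'recordoffset' and 'recordlimit' are ASSUMED parameter names.
    sep = '&' if '?' in base_query else '?'
    return [
        f'{base_query}{sep}recordoffset={offset}&recordlimit={page_size}'
        for offset in range(0, record_count, page_size)
    ]


base = 'http://www.nomisweb.co.uk/api/v01/dataset/NM_127_1.data.csv?item=1'
urls = page_urls(base, record_count=2781, page_size=1000)
for url in urls:
    print(url)
```

Each URL could then be passed to `pd.read_csv` and the pages combined with `pd.concat`.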

In [18]:
df.head(100)

Unnamed: 0,DATE,DATE_NAME,DATE_CODE,DATE_TYPE,DATE_TYPECODE,DATE_SORTORDER,GEOGRAPHY,GEOGRAPHY_NAME,GEOGRAPHY_CODE,GEOGRAPHY_TYPE,GEOGRAPHY_TYPECODE,GEOGRAPHY_SORTORDER,ITEM,ITEM_NAME,ITEM_CODE,ITEM_TYPE,ITEM_TYPECODE,ITEM_SORTORDER,MEASURES,MEASURES_NAME,OBS_VALUE,OBS_STATUS,OBS_STATUS_NAME,OBS_CONF,OBS_CONF_NAME,URN,RECORD_OFFSET,RECORD_COUNT
0,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939329,Darlington,E06000005,local authorities: district / unitary (as of April 2021),432,0,1,Unemployment count (model based),1,item,0,0,20100,Value,2500.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939329d1d20100,0,2781
1,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939330,County Durham,E06000047,local authorities: district / unitary (as of April 2021),432,1,1,Unemployment count (model based),1,item,0,0,20100,Value,12200.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939330d1d20100,1,2781
2,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939331,Hartlepool,E06000001,local authorities: district / unitary (as of April 2021),432,2,1,Unemployment count (model based),1,item,0,0,20100,Value,3700.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939331d1d20100,2,2781
3,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939332,Middlesbrough,E06000002,local authorities: district / unitary (as of April 2021),432,3,1,Unemployment count (model based),1,item,0,0,20100,Value,4500.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939332d1d20100,3,2781
4,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939334,Northumberland,E06000057,local authorities: district / unitary (as of April 2021),432,4,1,Unemployment count (model based),1,item,0,0,20100,Value,6300.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939334d1d20100,4,2781
5,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939335,Redcar and Cleveland,E06000003,local authorities: district / unitary (as of April 2021),432,5,1,Unemployment count (model based),1,item,0,0,20100,Value,3200.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939335d1d20100,5,2781
6,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939336,Stockton-on-Tees,E06000004,local authorities: district / unitary (as of April 2021),432,6,1,Unemployment count (model based),1,item,0,0,20100,Value,5200.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939336d1d20100,6,2781
7,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939338,Gateshead,E08000037,local authorities: district / unitary (as of April 2021),432,7,1,Unemployment count (model based),1,item,0,0,20100,Value,4800.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939338d1d20100,7,2781
8,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939339,Newcastle upon Tyne,E08000021,local authorities: district / unitary (as of April 2021),432,8,1,Unemployment count (model based),1,item,0,0,20100,Value,8000.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939339d1d20100,8,2781
9,2018-12,Jan 2018-Dec 2018,2018-12,date,0,0,1811939340,North Tyneside,E08000022,local authorities: district / unitary (as of April 2021),432,9,1,Unemployment count (model based),1,item,0,0,20100,Value,4700.0,A,Normal Value,F,Free (free for publication),Nm-127d1d32300e0d1811939340d1d20100,9,2781


In [19]:
len(set(df.GEOGRAPHY))

309
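
The 2781 rows reported by `df.shape` equal 309 × 9, which suggests the request returned nine time periods per area rather than the three listed in the `TIME` filter. That would be consistent with the earlier note that "time" returns every date between a start and an end, but it is an inference from the row count, not something the output states. A quick way to check is sketched below on a toy frame with placeholder values. `shape_check` is a hypothetical helper.

```python
import pandas as pd


def shape_check(df):
    # Count distinct periods and areas, and test whether the frame holds
    # exactly one row for every (period, area) combination.
    n_dates = df['DATE'].nunique()
    n_areas = df['GEOGRAPHY'].nunique()
    return {
        'dates': n_dates,
        'areas': n_areas,
        'rows': len(df),
        'complete': len(df) == n_dates * n_areas,
    }


# Toy frame: 2 areas x 3 periods (placeholder codes, not real data)
toy = pd.DataFrame({
    'DATE': ['2018-12', '2019-03', '2019-06'] * 2,
    'GEOGRAPHY': [1, 1, 1, 2, 2, 2],
})
check = shape_check(toy)
print(check)
```

Running `shape_check(df)` on the downloaded frame would show the actual number of distinct dates. If only the listed periods are wanted, the "date" parameter described earlier may be the better fit.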

In [20]:
len(set(df.MEASURES_NAME))

1
