Long-term international migration 2.01a, citizenship, UK and England and Wales

In [1]:
from gssutils import *
scraper = Scraper('https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/' \
                  'internationalmigration/datasets/longterminternationalmigrationcitizenshiptable201a')
scraper

## Long-term international migration 2.01a, citizenship, UK and England and Wales

Nationality of migrants. Estimates of Long-Term International Migration, annual table.

### Distributions

1. Long-term international migration 2.01a, citizenship, UK and England and Wales ([MS Excel Spreadsheet](https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/internationalmigration/datasets/longterminternationalmigrationcitizenshiptable201a/current/2.01altimcitizenship2004to2017.xls))


In [2]:
tab = next(t for t in scraper.distribution().as_databaker() if t.name == 'Table 2.01a')

Observations come in pairs: an estimate followed by its confidence interval (value ± CI). The table has also been revised since the 2011 census, and it records which observations were revised and what their original estimates were.
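
A minimal sketch of that estimate/CI pairing, independent of databaker. The header names and numbers are illustrative only, and `pair_estimates` is a hypothetical helper, not part of gssutils.

```python
# Hypothetical sheet row laid out as estimate, CI, estimate, CI, ...
header = ['All citizenships', 'British', 'Non-British']
row = [589, 40, 92, 14, 497, 38]

def pair_estimates(header, row):
    """Pair each estimate with the confidence interval in the column to its right."""
    if len(row) != 2 * len(header):
        raise ValueError('expected one estimate and one CI per header')
    # Even positions hold estimates, odd positions hold their intervals.
    pairs = zip(row[0::2], row[1::2])
    return [
        {'Citizenship': name, 'Value': value, 'CI': ci}
        for name, (value, ci) in zip(header, pairs)
    ]

records = pair_estimates(header, row)
```

This mirrors what the later `HDim(CI, 'CI', DIRECTLY, RIGHT)` dimension does: each observation picks up the cell immediately to its right as its interval.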

In [3]:
cell = tab.filter('Year')
cell.assert_one()
citizenship = cell.fill(RIGHT).is_not_blank().is_not_whitespace()  | \
            cell.shift(0,1).fill(RIGHT).is_not_blank().is_not_whitespace() | \
            cell.shift(0,2).expand(RIGHT).is_not_blank().is_not_whitespace().is_not_bold() \
            .filter(lambda x: type(x.value) != str or 'All' not in x.value)
citizenship

{<I13 'European Union2'>, <E12 'British\n(Including Overseas Territories)'>, <T12 'Non-European Union3'>, <M13 'European Union EU8'>, <B12 'All citizenships'>, <K13 'European Union EU15'>, <AI13 'Rest of the World'>, <T13 'All3'>, <X13 'Asia'>, <AK14 'Sub-Saharan Africa'>, <AQ14 'Central and South America'>, <AX12 'All citizenships'>, <Q13 'European Union Other'>, <O13 'European Union EU2'>, <G12 'Non-British'>, <AF14 'South East Asia'>, <AO14 'North America'>, <AX14 'Original Estimates1'>, <AM14 'North Africa'>, <Z14 'Middle East and Central Asia'>, <AB14 'East Asia'>, <V13 'Other Europe3'>, <AS14 'Oceania'>, <AD14 'South Asia'>, <B14 '2011 Census Revisions1'>, <I12 'European Union2'>, <AU13 'Stateless'>}

In [5]:
# Observations sit below every 'Estimate' sub-heading in the header row.
observations = cell.shift(RIGHT).fill(DOWN).filter('Estimate').expand(RIGHT).filter('Estimate') \
                .fill(DOWN).is_not_blank().is_not_whitespace()
# Non-numeric cells to the right of each 'Significant Change?' label are not observations.
Str = tab.filter(contains_string('Significant Change?')).fill(RIGHT).is_not_number()
observations = observations - (tab.excel_ref('A1').expand(DOWN).expand(RIGHT).filter(contains_string('Significant Change')))
# Original (pre-revision) estimates are converted separately, so exclude them here.
original_estimates = tab.filter(contains_string('Original Estimates')).fill(DOWN).is_number()
observations = observations - original_estimates - Str

In [6]:
CI = observations.shift(RIGHT)

In [7]:
Year = cell.fill(DOWN) 
Year = Year.filter(lambda x: type(x.value) != str or 'Significant Change?' not in x.value)

In [8]:
Geography = cell.fill(DOWN).one_of(['United Kingdom', 'England and Wales'])
Flow = cell.fill(DOWN).one_of(['Inflow', 'Outflow', 'Balance'])

In [9]:
csObs = ConversionSegment(observations, [
    HDim(Year,'Year', DIRECTLY, LEFT),
    HDim(Geography,'Geography', CLOSEST, ABOVE),
    HDim(citizenship, 'Citizenship', DIRECTLY, ABOVE),
    HDim(Flow, 'Flow', CLOSEST, ABOVE),
    HDimConst('Measure Type', 'Count'),
    HDimConst('Unit','People (thousands)'),
    HDim(CI,'CI',DIRECTLY,RIGHT),
    HDimConst('Revision', '2011 Census Revision')
])
savepreviewhtml(csObs)
tidy_revised = csObs.topandas()

[Preview output from savepreviewhtml: dimension key OBS, Year, Geography, Citizenship, Flow, CI, followed by a rendering of sheet "Table 2.01a" (Long-Term International Migration, time series, 2004 to 2017; United Kingdom, England and Wales; Series MN).]





In [12]:
csRevs = ConversionSegment(original_estimates, [
    HDim(Year, 'Year', DIRECTLY, LEFT),
    HDim(Geography,'Geography', CLOSEST, ABOVE),
    HDim(citizenship, 'Citizenship', DIRECTLY, ABOVE),
    HDim(Flow, 'Flow', CLOSEST, ABOVE),
    HDimConst('Measure Type', 'Count'),
    HDimConst('Unit','People (thousands)'),
    HDim(original_estimates.shift(RIGHT), 'CI', DIRECTLY, RIGHT),
    HDimConst('Revision', 'Original Estimate')
])
orig_estimates = csRevs.topandas()




In [13]:
tidy = pd.concat([tidy_revised, orig_estimates], axis=0, join='outer', ignore_index=True, sort=False)

Ignore data markers for now and ensure all observations are integers.
**Todo: figure out what to do with data markers.**
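
One possible way to approach this todo, sketched under assumptions: keep each marker in its own field rather than dropping it. The marker set (`:`, `z`, `0~`) is taken from the values this notebook handles; the label names are placeholders and `split_marker` is a hypothetical helper.

```python
# Placeholder labels for the markers seen in this notebook; the names are assumptions.
MARKER_LABELS = {':': 'not-available', 'z': 'not-applicable', '0~': 'rounded-to-zero'}

def split_marker(raw):
    """Split a raw cell into (value, marker); value is None when only a marker is present."""
    text = str(raw).strip()
    if text in MARKER_LABELS:
        # '0~' still carries a numeric value of zero; the other markers carry none.
        value = 0 if text == '0~' else None
        return value, MARKER_LABELS[text]
    # Plain cells arrive as numbers or strings such as '589.0'.
    return int(float(text)), ''

cells = ['589.0', ':', '0~', 14.0]
parsed = [split_marker(c) for c in cells]
```

Keeping the marker alongside the value would let the output carry a separate marker column instead of losing the information.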

In [14]:
import numpy as np
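# Treat empty observations as missing and drop them (assign back rather than
# calling replace(inplace=True) on a column selection).
tidy['OBS'] = tidy['OBS'].replace('', np.nan)
tidy.dropna(subset=['OBS'], inplace=True)
tidy.drop(columns=['DATAMARKER'], inplace=True)
tidy.rename(columns={'OBS': 'Value', 'Citizenship': 'LTIM Citizenship'}, inplace=True)
tidy['Value'] = tidy['Value'].astype(int)
# ':' becomes a blank, strings such as '40.0' become ints, anything else is flagged 'ERR'.
tidy['CI'] = tidy['CI'].map(lambda x: '' if x == ':' else int(x[:-2]) if x.endswith('.0') else 'ERR')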

Check each observation has a year and use ints.
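
A small sketch of what such a check could look like on raw cell values. `to_year` is a hypothetical helper and the sample inputs are made up; it fails loudly on stray labels instead of coercing them.

```python
def to_year(raw):
    """Convert a cell such as '2004.0' to an int year, rejecting anything else."""
    # float() raises ValueError for labels such as 'Significant Change?'.
    number = float(str(raw).strip())
    if number != int(number):
        raise ValueError(f'not a whole year: {raw!r}')
    return int(number)

years = [to_year(v) for v in ['2004.0', 2017, '2010']]
```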

In [15]:
tidy['Year'] = tidy['Year'].apply(lambda x: pd.to_numeric(x, downcast='integer'))
tidy['Year'] = tidy['Year'].astype(int)

In [16]:
for col in tidy.columns:
    if col not in ['Value', 'Year', 'CI']:
        tidy[col] = tidy[col].astype('category')
        display(col)
        display(tidy[col].cat.categories)

'Geography'

Index(['England and Wales', 'United Kingdom'], dtype='object')

'LTIM Citizenship'

Index(['2011 Census Revisions1', 'All3', 'Asia',
       'British\n(Including Overseas Territories)',
       'Central and South America', 'East Asia', 'European Union EU15',
       'European Union EU2', 'European Union EU8', 'European Union Other',
       'European Union2', 'Middle East and Central Asia', 'Non-British',
       'North Africa', 'North America', 'Oceania', 'Original Estimates1',
       'Other Europe3', 'Rest of the World', 'South Asia', 'South East Asia',
       'Stateless', 'Sub-Saharan Africa'],
      dtype='object')

'Flow'

Index(['Balance', 'Inflow', 'Outflow'], dtype='object')

'Measure Type'

Index(['Count'], dtype='object')

'Unit'

Index(['People (thousands)'], dtype='object')

'Revision'

Index(['2011 Census Revision', 'Original Estimate'], dtype='object')

In [17]:
tidy['Geography'] = tidy['Geography'].cat.rename_categories({
    'United Kingdom': 'K02000001',
    'England and Wales': 'K04000001'
})
tidy['LTIM Citizenship'] = tidy['LTIM Citizenship'].cat.rename_categories({
    'All citizenships' : 'all-citizenships',
    'All3' : 'non-european-union-all',
    'Asia' :'non-european-union-asia-all',
    'British\n(Including Overseas Territories)' : 'british-including-overseas-territories',
    'Central and South America' : 'non-european-union-rest-of-the-world-central-and-south-america', 
    'East Asia' : 'non-european-union-asia-east-asia',
    'European Union EU15' : 'european-union-european-union-eu15',
    'European Union EU2' : 'european-union-european-union-eu2',
    'European Union EU8':'european-union-european-union-eu8',
    'European Union Other' : 'european-union-european-union-other',
    'European Union2' : 'european-union-european-union' , 
    'Middle East and Central Asia' : 'non-european-union-asia-middle-east-and-central-asia', 
    'Non-British' : 'non-british' ,
    'North Africa' : 'non-european-union-rest-of-the-world-north-africa',
    'North America' : 'non-european-union-rest-of-the-world-north-america', 
    'Oceania' : 'non-european-union-rest-of-the-world-oceania', 
    'Other Europe3' : 'non-european-union-other-europe',
    'Rest of the World' : 'non-european-union-rest-of-the-world-all', 
    'South Asia' : 'non-european-union-asia-south-asia', 
    'South East Asia' : 'non-european-union-asia-south-east-asia', 
    'Stateless' : 'non-european-union-stateless',
    'Sub-Saharan Africa' : 'non-european-union-rest-of-the-world-sub-saharan-africa'
            
})
tidy['Flow'] = tidy['Flow'].cat.rename_categories({
    'Balance': 'balance', 
    'Inflow': 'inflow',
    'Outflow': 'outflow'
})
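
With a dict argument, pandas `rename_categories` passes through any category that is missing from the mapping, so an unmapped label survives silently. Below is a minimal sketch of a check that could run before renaming. `unmapped_labels` is a hypothetical helper, and the sample labels are a subset of the categories displayed by the previous cell.

```python
def unmapped_labels(labels, mapping):
    """Return the labels that have no entry in the mapping, sorted for stable output."""
    return sorted(set(labels) - set(mapping))

sample_labels = ['Asia', 'Stateless', '2011 Census Revisions1', 'Original Estimates1']
sample_mapping = {
    'Asia': 'non-european-union-asia-all',
    'Stateless': 'non-european-union-stateless',
}
missing = unmapped_labels(sample_labels, sample_mapping)
```

In the notebook, the same helper could be given `tidy['LTIM Citizenship'].cat.categories` and the dict passed to `rename_categories`.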

Todo: some values (estimates and CIs) have been rounded to zero and are shown as `0~` in the spreadsheet, but this appears to be done with conditional formatting of some kind and doesn't come through in the extracted values. We need to add data markers.

For CI we use a blank string wherever a marker or a zero appears; otherwise we convert the value to an int so that it is written to the CSV as a whole number.
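
To illustrate how a column mixing blank strings and ints is serialised, here is a small sketch using the standard-library `csv` module rather than pandas. The two rows are made up for the example.

```python
import csv
import io

rows = [
    {'Value': 589, 'CI': 40},
    {'Value': 0, 'CI': ''},  # marker replaced by a blank string
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Value', 'CI'], lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The blank CI ends up as an empty field, so a consumer reading the file has to treat that column as nullable.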

In [18]:
tidy['CI'] = tidy['CI'].apply(
    lambda x: '' if str(x) in ['', '0', ':', 'z'] else int(float(x))
)
tidy

Unnamed: 0,Value,Year,Geography,LTIM Citizenship,Flow,Measure Type,Unit,CI,Revision
0,589,2004,K02000001,2011 Census Revisions1,inflow,Count,People (thousands),40,2011 Census Revision
1,92,2004,K02000001,british-including-overseas-territories,inflow,Count,People (thousands),14,2011 Census Revision
2,497,2004,K02000001,non-british,inflow,Count,People (thousands),38,2011 Census Revision
3,127,2004,K02000001,european-union-european-union,inflow,Count,People (thousands),22,2011 Census Revision
4,76,2004,K02000001,european-union-european-union-eu15,inflow,Count,People (thousands),15,2011 Census Revision
5,51,2004,K02000001,european-union-european-union-eu8,inflow,Count,People (thousands),16,2011 Census Revision
7,0,2004,K02000001,european-union-european-union-other,inflow,Count,People (thousands),1,2011 Census Revision
8,370,2004,K02000001,non-european-union-all,inflow,Count,People (thousands),30,2011 Census Revision
9,17,2004,K02000001,non-european-union-other-europe,inflow,Count,People (thousands),5,2011 Census Revision
10,192,2004,K02000001,non-european-union-asia-all,inflow,Count,People (thousands),24,2011 Census Revision


Re-order the columns and output as CSV with some metadata.

In [19]:
tidy = tidy[['Geography','Year','LTIM Citizenship','Flow','Measure Type','Value','CI','Unit', 'Revision']]
from pathlib import Path
destinationFolder = Path('out')
destinationFolder.mkdir(exist_ok=True, parents=True)

tidy.to_csv(destinationFolder / 'observations.csv', index=False)

from gssutils.metadata import THEME

scraper.dataset.family = 'migration'
scraper.dataset.theme = THEME['population']

with open(destinationFolder / 'dataset.trig', 'wb') as metadata:
    metadata.write(scraper.generate_trig())