# Migration between Scotland and Overseas

In [1]:
from gssutils import *
scraper = Scraper('https://www.nrscotland.gov.uk/statistics-and-data/statistics/'
                  'statistics-by-theme/migration/migration-statistics/migration-between-scotland-and-overseas')
scraper

## Migration between Scotland and Overseas

### Description

Migration between Scotland and overseas refers to people moving between
Scotland and any country outside the UK.

Due to the sources of data used by National Records of Scotland (NRS) to
estimate migration, the country or group of countries that overseas migrants
come from cannot be identified.



### Distributions

1. Migration between administrative areas and overseas by sex ([MS Excel Spreadsheet](https://www.nrscotland.gov.uk/files//statistics/migration/2018-july/tab-z1-overseas-mig-flows-admin-sex-hb-2001-02-latest-july-18.xlsx))
1. Migration between administrative areas and overseas by sex ([text/csv](https://www.nrscotland.gov.uk/files//statistics/migration/2018-july/tab-z1-overseas-mig-flows-admin-sex-hb-2001-02-latest-july-18.zip))
1. Migration between Scotland and overseas by age ([MS Excel Spreadsheet](https://www.nrscotland.gov.uk/files//statistics/migration/2018-july/tab-z2-overseas-mig-flows-by-age-scotland-2001-02-latest-july-18.xlsx))
1. Migration between Scotland and overseas by age ([text/csv](https://www.nrscotland.gov.uk/files//statistics/migration/2018-july/tab-z2-overseas-mig-flows-by-age-scotland-2001-02-latest-july-18.zip))


In [2]:
scraper.dataset.theme = metadata.THEME['population']
scraper.dataset

In [3]:
databaker_sheets = {sheet.name: sheet for sheet in scraper.distribution(
    title='Migration between administrative areas and overseas by sex',
    mediaType=Excel).as_databaker()}

In [4]:
next_table = pd.DataFrame()

In [5]:
%%capture

tab = databaker_sheets['Net-Council Area-Sex']
%run "migration-admin-areas-by-sex-net.ipynb"
next_table = pd.concat([next_table, new_table])

tab = databaker_sheets['In-Council Area-Sex']
%run "migration-admin-areas-by-sex-in.ipynb"
next_table = pd.concat([next_table, new_table])

tab = databaker_sheets['Out-Council Area-Sex']
%run "migration-admin-areas-by-sex-out.ipynb"
next_table = pd.concat([next_table, new_table])
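
The cell above repeats one pattern three times: pick a sheet, run a transform notebook, append the result. A minimal sketch of the same accumulation follows. It assumes each transform yields a DataFrame; the `transform` function, the `sheets` dict and the toy values are hypothetical stand-ins for the `%run` notebooks and `databaker_sheets`, not the real logic.

```python
import pandas as pd

# Hypothetical stand-in for the databaker tabs: sheet name -> toy rows.
sheets = {
    'Net-Council Area-Sex': [('Aberdeen City', 1)],
    'In-Council Area-Sex': [('Aberdeen City', 2)],
    'Out-Council Area-Sex': [('Aberdeen City', 3)],
}

def transform(sheet_name, rows):
    """Hypothetical transform: tag each row with the sheet-name prefix."""
    prefix = sheet_name.split('-')[0]  # 'Net', 'In' or 'Out'
    return pd.DataFrame(rows, columns=['Area', 'Value']).assign(Source=prefix)

# Collect the pieces in a list and concatenate once at the end.
frames = [transform(name, rows) for name, rows in sheets.items()]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

`ignore_index=True` gives the combined table a fresh 0..n-1 index. The cells in this notebook keep each piece's own index, which is why the `next_table.tail()` output further down shows index 4617 for a table of 19439 rows.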



In [6]:
distribution = scraper.distribution(
    title='Migration between Scotland and overseas by age',
    mediaType='application/vnd.ms-excel')
tabs = distribution.as_databaker()

In [7]:
%run "migration-by-age-2001-to-2017.ipynb"
next_table = pd.concat([next_table, Final_table])

In [8]:
tab = distribution.as_pandas(sheet_name = 'SYOA Females (2001-)')
%run "migration-by-age-2001-to-2017-females.ipynb"
next_table = pd.concat([next_table, Final_table])

In [9]:
%run "migration-by-age-2001-to-2017-persons.ipynb"
next_table = pd.concat([next_table, Final_table])




In [10]:
%run "migration-by-age-2001-to-2017-males.ipynb"
next_table = pd.concat([next_table, Final_table])

ERROR:File `'\'"migration-by-age-2001-to-2017-males.ipynb"\'.py'` not found.
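
The `%run` above reported a missing file, but IPython only logs that error and carries on, so the `pd.concat` line in the same cell may have appended the `Final_table` left over from the previous cell a second time. One way to catch this early is to check that every notebook exists before running any of them. The sketch below is an illustration only: the `missing_notebooks` helper is hypothetical, and it is demonstrated in a temporary folder rather than the real project directory.

```python
from pathlib import Path
import tempfile

def missing_notebooks(names, folder='.'):
    """Return the filenames from `names` that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]

# Demonstration in a temporary folder so nothing in the real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'migration-by-age-2001-to-2017-females.ipynb').write_text('{}')
    wanted = [
        'migration-by-age-2001-to-2017-females.ipynb',
        'migration-by-age-2001-to-2017-males.ipynb',
    ]
    missing = missing_notebooks(wanted, tmp)

print(missing)  # ['migration-by-age-2001-to-2017-males.ipynb']
```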


In [11]:
next_table.count()

Area of Destination or Origin    19439
Mid Year                         19439
Sex                              19439
Age                              19439
Flow                             19439
Measure Type                     19439
Value                            19439
Unit                             19439
dtype: int64

In [12]:
next_table.head()

Unnamed: 0,Area of Destination or Origin,Mid Year,Sex,Age,Flow,Measure Type,Value,Unit
0,Aberdeen City,2001-06-30T00:00:00/P1Y,T,all,Balance,Count,833,People
1,Aberdeen City,2002-06-30T00:00:00/P1Y,T,all,Balance,Count,385,People
2,Aberdeen City,2003-06-30T00:00:00/P1Y,T,all,Balance,Count,874,People
3,Aberdeen City,2004-06-30T00:00:00/P1Y,T,all,Balance,Count,2150,People
4,Aberdeen City,2005-06-30T00:00:00/P1Y,T,all,Balance,Count,2138,People


In [13]:
next_table.tail()

Unnamed: 0,Area of Destination or Origin,Mid Year,Sex,Age,Flow,Measure Type,Value,Unit
4613,Scotland,2016-17,T,year/86,Balance,Count,-10,People
4614,Scotland,2016-17,T,year/87,Balance,Count,-5,People
4615,Scotland,2016-17,T,year/88,Balance,Count,-4,People
4616,Scotland,2016-17,T,year/89,Balance,Count,-4,People
4617,Scotland,2016-17,T,year/90,Balance,Count,-21,People
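
The `head()` and `tail()` outputs above show two different `Mid Year` formats: an interval such as `2001-06-30T00:00:00/P1Y` and a plain label such as `2016-17`. A minimal sketch of bringing the second form into line with the first is given below. It rests on an assumption that is not confirmed by the source, namely that `YYYY-YY` means the one-year period starting on 30 June of the first year; the `to_mid_year_interval` helper is hypothetical.

```python
import re

def to_mid_year_interval(label):
    """Convert 'YYYY-YY' to 'YYYY-06-30T00:00:00/P1Y'; leave other values unchanged.

    Assumption: the period starts on 30 June of the first year named.
    """
    match = re.fullmatch(r'(\d{4})-(\d{2})', label)
    if match is None:
        return label
    return f'{match.group(1)}-06-30T00:00:00/P1Y'

print(to_mid_year_interval('2016-17'))
print(to_mid_year_interval('2001-06-30T00:00:00/P1Y'))
```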


In [14]:
next_table = next_table.rename(columns={'Area of Destination or Origin': 'Area of Destination or Origin1'})

In [15]:
import pandas as pd
# Lookup from area label to geography code (the 'notation' column).
c = pd.read_csv("scottish-geo-lookup.csv")

In [16]:
c

Unnamed: 0,label,notation
0,Scotland,S92000003
1,Clackmannanshire,S12000005
2,Glasgow City,S12000046
3,Dumfries and Galloway,S12000006
4,East Ayrshire,S12000008
5,East Lothian,S12000010
6,East Renfrewshire,S12000011
7,Falkirk,S12000014
8,Fife,S12000015
9,Highland,S12000017


In [17]:
Final_table = pd.merge(next_table, c, how = 'left', left_on = 'Area of Destination or Origin1', right_on = 'label')

In [18]:
Final_table

Unnamed: 0,Area of Destination or Origin1,Mid Year,Sex,Age,Flow,Measure Type,Value,Unit,label,notation
0,Aberdeen City,2001-06-30T00:00:00/P1Y,T,all,Balance,Count,833,People,Aberdeen City,S12000033
1,Aberdeen City,2002-06-30T00:00:00/P1Y,T,all,Balance,Count,385,People,Aberdeen City,S12000033
2,Aberdeen City,2003-06-30T00:00:00/P1Y,T,all,Balance,Count,874,People,Aberdeen City,S12000033
3,Aberdeen City,2004-06-30T00:00:00/P1Y,T,all,Balance,Count,2150,People,Aberdeen City,S12000033
4,Aberdeen City,2005-06-30T00:00:00/P1Y,T,all,Balance,Count,2138,People,Aberdeen City,S12000033
5,Aberdeen City,2006-06-30T00:00:00/P1Y,T,all,Balance,Count,3752,People,Aberdeen City,S12000033
6,Aberdeen City,2007-06-30T00:00:00/P1Y,T,all,Balance,Count,3030,People,Aberdeen City,S12000033
7,Aberdeen City,2008-06-30T00:00:00/P1Y,T,all,Balance,Count,3669,People,Aberdeen City,S12000033
8,Aberdeen City,2009-06-30T00:00:00/P1Y,T,all,Balance,Count,3742,People,Aberdeen City,S12000033
9,Aberdeen City,2010-06-30T00:00:00/P1Y,T,all,Balance,Count,4130,People,Aberdeen City,S12000033
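
A left merge keeps every row of the left table, but it can still go wrong in two quiet ways: labels with no match get a missing code, and labels listed more than once in the lookup multiply the matching rows. The second case is one possible explanation for the row count rising from 19439 at `In [11]` to 19467 at `In [34]`, although the notebook does not confirm it. The sketch below shows how both cases could be detected, using toy frames in place of `next_table` and `c`; the `Somewhere Else` label and the duplicated lookup row are made up for illustration.

```python
import pandas as pd

# Toy data; the real frames are next_table and the lookup c.
observations = pd.DataFrame({
    'area': ['Fife', 'Fife', 'Highland', 'Somewhere Else'],
    'value': [1, 2, 3, 4],
})
lookup = pd.DataFrame({
    'label': ['Fife', 'Highland', 'Highland'],  # 'Highland' is listed twice on purpose
    'notation': ['S12000015', 'S12000017', 'S12000017'],
})

# Labels that occur more than once in the lookup multiply rows in a left merge.
duplicate_labels = lookup.loc[lookup['label'].duplicated(), 'label'].unique().tolist()

# indicator=True adds a '_merge' column marking rows that found no match.
merged = observations.merge(lookup, how='left', left_on='area',
                            right_on='label', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'area'].unique().tolist()

print(duplicate_labels)                 # ['Highland']
print(unmatched)                        # ['Somewhere Else']
print(len(observations), len(merged))   # 4 5
```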


In [19]:
Final_table = Final_table.rename(columns={'notation': 'Area of Destination or Origin'})

In [20]:
Final_table['Area of Destination or Origin'].unique()

array(['S12000033', 'S12000034', 'S12000041', 'S12000035', 'S12000036',
       'S12000005', 'S12000006', 'S08000017', 'S12000042', 'S12000008',
       'S12000045', 'S12000010', 'S12000011', 'S12000014', 'S12000015',
       'S08000018', 'S12000046', 'S12000017', 'S08000022', 'S12000018',
       'S12000019', 'S12000020', 'S12000013', 'S12000021', 'S12000044',
       'S12000023', 'S12000024', 'S12000038', 'S12000026', 'S12000027',
       'S12000028', 'S12000029', 'S12000030', 'S12000039', 'S12000040',
       'S92000003'], dtype=object)

In [21]:
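# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in newer pandas.
Final_table['Area of Destination or Origin'] = Final_table['Area of Destination or Origin'].fillna('None')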

In [22]:
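def user_perc(x, y):
    # x is the matched geography code, or 'None' when the lookup had no entry;
    # y is the original area label, used as the fallback.
    if x == 'None':
        return y
    else:
        return x

Final_table['Area of Destination or Origin'] = Final_table.apply(lambda row: user_perc(row['Area of Destination or Origin'], row['Area of Destination or Origin1']), axis = 1)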
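
The fill-then-apply steps above amount to "use the code if one was matched, otherwise keep the original label". As a hedged alternative, `Series.fillna` accepts another Series and fills each gap from the same row, which avoids the `'None'` placeholder and the row-wise `apply`. The sketch uses a toy frame with hypothetical column names rather than `Final_table`.

```python
import pandas as pd

# Toy frame: one row matched a code, one did not.
frame = pd.DataFrame({
    'label_text': ['Fife', 'Somewhere Else'],
    'notation': ['S12000015', None],
})

# Use the code where one was matched, otherwise fall back to the original text.
frame['area'] = frame['notation'].fillna(frame['label_text'])
print(frame['area'].tolist())  # ['S12000015', 'Somewhere Else']
```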


In [23]:
Final_table['Area of Destination or Origin'].unique()

array(['S12000033', 'S12000034', 'S12000041', 'S12000035', 'S12000036',
       'S12000005', 'S12000006', 'S08000017', 'S12000042', 'S12000008',
       'S12000045', 'S12000010', 'S12000011', 'S12000014', 'S12000015',
       'S08000018', 'S12000046', 'S12000017', 'S08000022', 'S12000018',
       'S12000019', 'S12000020', 'S12000013', 'S12000021', 'S12000044',
       'S12000023', 'S12000024', 'S12000038', 'S12000026', 'S12000027',
       'S12000028', 'S12000029', 'S12000030', 'S12000039', 'S12000040',
       'S92000003'], dtype=object)

In [24]:
Final_table.head()

Unnamed: 0,Area of Destination or Origin1,Mid Year,Sex,Age,Flow,Measure Type,Value,Unit,label,Area of Destination or Origin
0,Aberdeen City,2001-06-30T00:00:00/P1Y,T,all,Balance,Count,833,People,Aberdeen City,S12000033
1,Aberdeen City,2002-06-30T00:00:00/P1Y,T,all,Balance,Count,385,People,Aberdeen City,S12000033
2,Aberdeen City,2003-06-30T00:00:00/P1Y,T,all,Balance,Count,874,People,Aberdeen City,S12000033
3,Aberdeen City,2004-06-30T00:00:00/P1Y,T,all,Balance,Count,2150,People,Aberdeen City,S12000033
4,Aberdeen City,2005-06-30T00:00:00/P1Y,T,all,Balance,Count,2138,People,Aberdeen City,S12000033


In [25]:
Final_table = Final_table[['Area of Destination or Origin','Mid Year','Sex','Age', 'Flow','Measure Type','Value','Unit']]

In [26]:
Final_table.head()

Unnamed: 0,Area of Destination or Origin,Mid Year,Sex,Age,Flow,Measure Type,Value,Unit
0,S12000033,2001-06-30T00:00:00/P1Y,T,all,Balance,Count,833,People
1,S12000033,2002-06-30T00:00:00/P1Y,T,all,Balance,Count,385,People
2,S12000033,2003-06-30T00:00:00/P1Y,T,all,Balance,Count,874,People
3,S12000033,2004-06-30T00:00:00/P1Y,T,all,Balance,Count,2150,People
4,S12000033,2005-06-30T00:00:00/P1Y,T,all,Balance,Count,2138,People


In [27]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning here.
Final_table = Final_table.assign(Value=Final_table['Value'].astype(int))


In [28]:
Final_table = Final_table[Final_table['Mid Year'] != '']

In [29]:
Final_table = Final_table[Final_table['Mid Year'] != 'Year']
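
The two filters above drop rows whose `Mid Year` is empty or is a stray header value. A minimal sketch of doing both in one step with `isin`, on a toy frame rather than `Final_table`:

```python
import pandas as pd

table = pd.DataFrame({
    'Mid Year': ['2016-17', '', 'Year', '2015-16'],
    'Value': [1, 2, 3, 4],
})

# Keep only the rows whose 'Mid Year' is not one of the unwanted values.
cleaned = table[~table['Mid Year'].isin(['', 'Year'])]
print(cleaned['Mid Year'].tolist())  # ['2016-17', '2015-16']
```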

In [30]:
# Final_table.drop_duplicates(keep='first', inplace=True)
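
Before deciding whether to enable the commented-out `drop_duplicates` call, it can help to look at which rows are actually repeated. The sketch below shows one way to do that on a toy frame (the rows are illustrative, not taken from `Final_table`).

```python
import pandas as pd

table = pd.DataFrame({
    'Area': ['S92000003', 'S92000003', 'S92000003'],
    'Age': ['year/86', 'year/86', 'year/87'],
    'Value': [-10, -10, -5],
})

# keep=False marks every member of a duplicated group, not just the later copies.
repeated = table[table.duplicated(keep=False)]
deduplicated = table.drop_duplicates(keep='first')

print(len(repeated), len(deduplicated))  # 2 2
```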

In [31]:
from pathlib import Path
out = Path('out')
out.mkdir(exist_ok=True)
Final_table.to_csv(out / 'tidy.csv', index = False)

In [32]:
scraper.dataset.family = 'migration'
scraper.dataset.license = 'http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/'

# Named trig_file so the handle does not shadow the `metadata` name used in In [2].
with open(out / 'dataset.trig', 'wb') as trig_file:
    trig_file.write(scraper.generate_trig())

In [33]:
Final_table.dtypes

Area of Destination or Origin    object
Mid Year                         object
Sex                              object
Age                              object
Flow                             object
Measure Type                     object
Value                             int32
Unit                             object
dtype: object

In [34]:
Final_table.count()

Area of Destination or Origin    19467
Mid Year                         19467
Sex                              19467
Age                              19467
Flow                             19467
Measure Type                     19467
Value                            19467
Unit                             19467
dtype: int64