# get-recent-migration-stats

This script wrangles the census data we have on state-to-state migration and foreign immigration for 2005-2019.

The output isn't currently used, but a later version could incorporate it to add complexity around state-to-state migration or foreign immigration for record-linkage purposes.

----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2022-12-22</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>

## 0. Import libraries, define fpaths

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import re
from pandas.errors import SettingWithCopyWarning
import warnings

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

In [None]:
fpath_migration = '../../../SupportingDocs/State-to-State-Migration'

## 1. Define functions

The functions below process the different layouts of migration data taken from the census.
The data is available at the [link here](https://www.census.gov/data/tables/time-series/demo/geographic-mobility/state-to-state-migration.html).

We define two separate functions because the data formatting changes at the year 2010, so the two versions must be wrangled in different ways.
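
As a minimal sketch of that split, the helper below maps a year to the layout it would need. The name `choose_format` is hypothetical and not part of this script, and the sketch assumes the 2010 cutoff is the only rule.

```python
def choose_format(year):
    """Return which layout a given year uses (hypothetical helper)."""
    # Cast to int so that '2009' and 2009 are handled the same way
    return 'pre2010' if int(year) < 2010 else 'post2010'


labels = {year: choose_format(year) for year in ['2005', '2009', '2010', '2019']}
print(labels)
```

Casting to `int` is a design choice here; comparing four-digit strings also works, but only while every year has exactly four digits.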

In the earliest iteration of this project, migration is ignored, but having the data present is never a bad thing for future iterations.

### 1.1 Formatting prior to 2010

In [None]:
def extract_info_format_pre2010(year, fpath):
    '''
    (Python language)
  
    Extracts relevant data from our census migration files for years leading up to 2010 (exclusive)
  
    Parameters
    --------------------------------------------
    year -- string, year that we want data for
    fpath -- string, full filepath to data location
    
    Outputs
    --------------------------------------------
    pandas Dataframe
    '''
    # Define rows to skip for this format
    rows_to_skip = [0, 1, 2, 3, 4, 5, 8, 75, 76, 77, 78, 79]

    # read data, skipping relevant rows for this type of format
    df = pd.read_excel(fpath, skiprows=rows_to_skip)

    # Rename the state column, drop unnamed columns, drop rows with NULL values in key columns (follows pattern), set index
    df.rename(columns={'Unnamed: 0':'CurrentState'}, inplace=True)
    df = df[df.columns.drop(list(df.filter(regex='Unnamed')))]
    df = df.dropna(subset=['Arizona','CurrentState'], how='any')
    df.set_index('CurrentState',inplace=True)

    # Define the diagonal, represents people who moved WITHIN that state
    df['Diagonal'] = [df.iloc[i,i] for i in np.arange(0,len(df))]

    # Sum of all rows within a column minus the diagonal (moved within state) gives number of emigrants per state
    emigrants = list(df.iloc[:,:-1].sum(axis=0).to_numpy() - df['Diagonal'].to_numpy())
    df['Total Emigrants'] = emigrants

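    # Sum of the state columns within a row minus the diagonal (moved within state) gives number of immigrants per state
    # Exclude the last two columns ('Diagonal', 'Total Emigrants') added above so they are not counted
    df['Domestic Immigrants'] = df.iloc[:,:-2].sum(axis=1) - df['Diagonal']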

    # Format index, copy dataframe key columns
    df.reset_index(inplace=True)
    sub_df = df[['CurrentState','Domestic Immigrants','Total Emigrants']].copy()

    # Define null columns for output, unavailable data given this style of formatting
    sub_df['Foreign Immigrants'] = np.nan
    sub_df['Population Retained'] = np.nan # Population retained impossible to determine; diagonal represents internal moves within state, not actual population retained including non-movers
    sub_df['Assumed Population'] = np.nan # Assumed population impossible to determine given the available variables

    # Define total immigrants (assumed to be same as domestic given our available data), also current year
    sub_df['Total Immigrants'] = sub_df['Domestic Immigrants']
    sub_df['Year'] = year

    cols_to_save = ['CurrentState', 'Year', 'Population Retained', 'Domestic Immigrants',
       'Foreign Immigrants', 'Total Immigrants', 'Total Emigrants',
       'Assumed Population']

    # Return our desired output dataframe
    return sub_df[cols_to_save]    
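
The arithmetic in the function above can be checked on a small made-up table. This is a sketch with three hypothetical states and invented counts, assuming the same layout as the function: rows are the current state and columns are the state of residence one year ago.

```python
import numpy as np
import pandas as pd

# Invented counts: rows = current state, columns = state one year ago
states = ['A', 'B', 'C']
flows = pd.DataFrame([[100, 5, 2],
                      [3, 200, 4],
                      [1, 6, 300]], index=states, columns=states)

# Diagonal entries: people whose current and previous state match
diagonal = pd.Series(np.diag(flows.to_numpy()), index=states)

# Column sums minus the diagonal: people who left each state
emigrants = flows.sum(axis=0) - diagonal

# Row sums minus the diagonal: people who arrived from another state
immigrants = flows.sum(axis=1) - diagonal

print(emigrants.to_dict())
print(immigrants.to_dict())
```

Every move out of one state is a move into another, so the two totals agree (21 each for these invented counts).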

### 1.2 Formatting 2010 and forward

In [None]:
def extract_info_format_post2010(year, fpath):
    '''
    (Python language)
  
    Extracts relevant data from our census migration files for 2010 and later years
  
    Parameters
    --------------------------------------------
    year -- string, year that we want data for
    fpath -- string, full filepath to data location
    
    Outputs
    --------------------------------------------
    pandas Dataframe
    '''
    # Define rows to skip for this format
    rows_to_skip = [0, 1, 2, 3, 4, 5, 8, 43, 44, 45, 46, 47, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
    
    # Read in data
    df = pd.read_excel(fpath, skiprows=rows_to_skip)
    
    # Rename relevant columns
    df.rename(columns={'Unnamed: 0':'CurrentState',
                       'Unnamed: 5':'Diagonal1',
                       'Unnamed: 3':'Diagonal2',
                       'Total.1':'Foreign Immigrants'}, inplace=True)
    
    # Determine "population retained" -> people who do not move + people who move internally within state
    df['Diagonal'] = df['Diagonal1'] + df['Diagonal2']
    
    # Drop columns we don't care about and rows that don't contain any information on state data
    df = df[df.columns.drop(list(df.filter(regex='Unnamed')))]
    cols_to_drop = ['Puerto Rico','U.S. Island Area','Foreign Country','Foreign Country4','Diagonal1','Diagonal2']
    df.drop(columns=cols_to_drop, inplace=True, errors='ignore')
    df = df.dropna(subset=['CurrentState'], how='any')
    
    # Drop rows that are not state specific (totals for all of the United States)
    df = df.query('(CurrentState != "United States2")&(CurrentState != "United States1")')
    
    # The diagonal, people who were in same state 1 year ago, is NULL.  Fill it with the proper information
    for column in df.columns:
        df.loc[:,column] = df[column].fillna(df['Diagonal'])

    
    # # Show that the "Total" column truly reflects the sum of people who had a different state of residence 1 year ago
    # df.iloc[:,2:-2].sum(axis=1) - df['Diagonal'] == df.Total
    
    # Sum of all rows within a column (exclude a couple columns) minus the diagonal (moved within state) gives number of emigrants per state
    emigrants = list(df.iloc[:-1,2:-2].sum(axis=0).to_numpy() - df.iloc[:-1,:]['Diagonal'].to_numpy())
    emigrants.extend([np.nan]) #extend for puerto rico
    df.loc[:,'Total Emigrants'] = emigrants
    
    # Copy important columns to separate dataframe, rename columns
    sub_df = df.copy()[['CurrentState','Total','Total Emigrants','Foreign Immigrants','Diagonal']]
    sub_df.rename(columns={'Diagonal':'Population Retained',
                           'Total':'Domestic Immigrants'}, inplace=True)
    
    # Calculate total immigrants, assumed population, and provide year
    sub_df['Total Immigrants'] = sub_df['Domestic Immigrants'] + sub_df['Foreign Immigrants']
    sub_df['Assumed Population'] = sub_df['Population Retained'] + sub_df['Total Immigrants'] - sub_df['Total Emigrants']
    sub_df['Year'] = year

    cols_to_save = ['CurrentState', 'Year', 'Population Retained', 'Domestic Immigrants',
       'Foreign Immigrants', 'Total Immigrants', 'Total Emigrants',
       'Assumed Population']
    
    # Return proper cols
    return sub_df[cols_to_save]
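
The final bookkeeping step in the function above can be illustrated on invented numbers. This sketch assumes a frame that already carries the renamed columns; the two states and all values are made up.

```python
import pandas as pd

# Invented values for two hypothetical states
toy = pd.DataFrame({
    'CurrentState': ['A', 'B'],
    'Population Retained': [1000, 2000],
    'Domestic Immigrants': [50, 80],
    'Foreign Immigrants': [10, 20],
    'Total Emigrants': [40, 90],
})

# Total immigrants = domestic + foreign arrivals
toy['Total Immigrants'] = toy['Domestic Immigrants'] + toy['Foreign Immigrants']

# Assumed population = retained + arrivals - departures
toy['Assumed Population'] = (toy['Population Retained']
                             + toy['Total Immigrants']
                             - toy['Total Emigrants'])

print(toy[['CurrentState', 'Total Immigrants', 'Assumed Population']])
```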

## 2. Load, wrangle data

As noted above, this data was captured from the census, [linked here](https://www.census.gov/data/tables/time-series/demo/geographic-mobility/state-to-state-migration.html).

In [None]:
# Check folder contents
contents_folder = os.listdir(f'{fpath_migration}/01_Raw')

# Initialize empty list for output
df_outputs = []

# Looping through folder contents...
for fname in contents_folder:
    
    # Determine current year from the file name (raw string keeps \d from being read as a string escape)
    current_year = re.search(r'\d{4}', fname)[0]
    fpath = f'{fpath_migration}/01_Raw/{fname}'
    
    # Different years have different formats, split at the year 2010, so for all years before 2010...
    if int(current_year) < 2010:
        
        # Extract data using first formatting function
        df = extract_info_format_pre2010(year=current_year, fpath=fpath)

    else:
    
        # Extract data using second formatting function
        df = extract_info_format_post2010(year=current_year, fpath=fpath)
        
    # Append to list of dataframes
    df_outputs.append(df)
        
# Concat all of the data
output = pd.concat(df_outputs)
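
A file name without a four-digit run makes `re.search` return `None`, and indexing that result raises a `TypeError`. A minimal sketch of a guard follows; the file names are hypothetical and the real folder contents may be named differently.

```python
import re

# Hypothetical file names; the real folder contents may be named differently
file_names = ['state_to_state_2009.xls', 'README.txt', 'State_to_State_2010.xls']

years = {}
for name in file_names:
    match = re.search(r'\d{4}', name)
    if match is None:
        # No four-digit run in the name, so skip the file instead of failing
        continue
    years[name] = match[0]

print(years)
```

Skipping silently is only one option; raising an error that names the unexpected file may suit a stricter pipeline better.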

## 3. Save

In [None]:
output.to_csv(f'{fpath_migration}/02_Wrangled/StateMigrationData.csv', index=False, header=True)
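
`to_csv` does not create missing folders, so the save fails if `02_Wrangled` does not exist yet. The sketch below writes a stand-in frame with invented values to a temporary directory, creating the folder first, and reads it back; it assumes nothing about the real output beyond the file and folder names used above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the wrangled output, with invented values
toy_output = pd.DataFrame({'CurrentState': ['A'], 'Year': ['2010']})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, '02_Wrangled')
    # Create the output folder if it is missing
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, 'StateMigrationData.csv')
    toy_output.to_csv(out_path, index=False, header=True)
    # Read Year back as text, matching the string years extracted from file names
    round_trip = pd.read_csv(out_path, dtype={'Year': str})

print(round_trip)
```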