## Block Number Interpolation

(Currently uses the output of disambiguation v02.)

1. **Sequence Restriction**

This notebook interpolates block numbers: a run of unknown dwellings is filled in when it lies between known dwellings that share the same
    1. block number
    2. block number and distance sequence
    3. block number, distance sequence, and enum_dist
    4. block number, distance sequence, enum_dist, and other sequence

2. **Max Distance Restriction** and **In between number of dwellings**
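
A minimal sketch of the restrictions listed above, on toy data. It assumes a single ward whose rows are already in enumeration order. `fill_between_same` and its parameters are hypothetical names used only for illustration; they are not the `interpolation` package's API, and the notebook itself relies on `interpolation.same_next` and `dataprocessing.all_dwellings_sequenced`.

```python
import numpy as np
import pandas as pd


def fill_between_same(df, fill_col, check_cols, max_between=None):
    """Fill NaN in `fill_col` when the nearest known rows on both sides agree on `check_cols`.

    Hypothetical helper: `fill_col` must be one of `check_cols`.
    """
    assert fill_col in check_cols
    out = df.copy()
    known = out[fill_col].notna()

    # Values of the nearest known dwelling before / after each row.
    prev_vals = pd.DataFrame({c: out[c].where(known).ffill() for c in check_cols})
    next_vals = pd.DataFrame({c: out[c].where(known).bfill() for c in check_cols})
    same = (prev_vals == next_vals).all(axis=1)

    # Number of unknown dwellings in the gap each row belongs to.
    gap_len = (~known).astype(int).groupby(known.cumsum()).transform("sum")

    fillable = ~known & same
    if max_between is not None:
        fillable &= gap_len <= max_between
    out.loc[fillable, fill_col] = prev_vals.loc[fillable, fill_col]
    return out


toy = pd.DataFrame({
    "dwelling_id": range(1, 9),
    "CD_BLOCK_NUM": [10, np.nan, 10, np.nan, np.nan, 10, np.nan, 11],
    "sequence_id": [1, np.nan, 1, np.nan, np.nan, 2, np.nan, 2],
})

# Restriction 1: same block number on both sides.
by_block = fill_between_same(toy, "CD_BLOCK_NUM", ["CD_BLOCK_NUM"])
# Restriction 2: same block number and same distance sequence.
by_block_and_seq = fill_between_same(toy, "CD_BLOCK_NUM", ["CD_BLOCK_NUM", "sequence_id"])
# Cap on the number of unknown dwellings in between.
capped = fill_between_same(toy, "CD_BLOCK_NUM", ["CD_BLOCK_NUM"], max_between=1)
```

Each added check column can only reduce the number of filled dwellings, which is why the four restrictions are compared against each other later in the notebook.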

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from interpolation import interpolation, CensusData, dataprocessing, sequences
from interpolation import disambiguation_analysis as da
from interpolation import interpolation_fillin

# Setup

* Read in datasets and set column names

In [2]:
filled_1850 = pd.read_csv("../../data/dwelling_filled_sum_1850_mn_v02.csv")
cd_1850 = pd.read_csv("../../data/cd_1850_mn_20200918.csv") #For calculating centroids
enumerators = pd.read_csv("../../data/census_1850_enumerationDetail_mn_union_20201202.csv")

filled_1850['CENSUS_PAGENUM'] = filled_1850['CENSUS_PAGENUM']//10

In [3]:
ward_col = "CENSUS_WARD_NUM"
dwelling_col = "dwelling_id"
block_col = "CD_BLOCK_NUM"
cd_ward_col = "CD_WARD_NUM"
cd_block_col = "CD_BLOCK_NUM"
dwelling_num_col = "CENSUS_DWELLING_NUM"
cd_address = "CD_H_ADDRESS"
pagenum = "CENSUS_PAGENUM"
x_col = "CD_X"
y_col = "CD_Y"

In [4]:
filled_1850.columns

Index(['CD_BLOCK_NUM', 'CD_H_ADDRESS', 'CD_X', 'CD_Y', 'CENSUS_AGE',
       'CENSUS_CITY', 'CENSUS_DWELLING_NUM', 'CENSUS_DWELLING_SEQ',
       'CENSUS_DWELLING_SIZE', 'CENSUS_FIRST_NAME', 'CENSUS_GENDER',
       'CENSUS_GEOG', 'CENSUS_HH_NUM', 'CENSUS_IMPREL', 'CENSUS_INDEX',
       'CENSUS_IPUMS_UID', 'CENSUS_LABFORCE', 'CENSUS_LAST_NAME',
       'CENSUS_LINE', 'CENSUS_MARST', 'CENSUS_OCCUPATION', 'CENSUS_PAGENUM',
       'CENSUS_RACE', 'CENSUS_REEL', 'CENSUS_SEQ_NUM', 'CENSUS_SERIAL',
       'CENSUS_WARD_NUM', 'dwelling_id', 'spatial_weight',
       'spatial_weight_sum'],
      dtype='object')

## Append page sequence id for merging with enum data

Initially, the census and enumeration data files were merged on the `ward` and `pagenum` columns. We later found that this pair of columns is not a unique key: within a ward, `pagenum` can restart from 1 after running for some rows. We therefore add a label, `page_sequence_id`, to both the census and the enumeration data files, indicating which run a `pagenum` belongs to. That is, two rows in the census and enumeration files represent the same record if they have

1. the same ward
2. the same pagenum in the same run.

We then merge the files on the ward, pagenum, and `page_sequence_id` columns.

In [5]:
filled_1850_2 = append_page_sequence_id(filled_1850, ward_col, pagenum)
enumerators_2 = append_page_sequence_id(enumerators, ward_col, 'CENSUS_PAGENNO')

NameError: name 'append_page_sequence_id' is not defined
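
The `NameError` above means `append_page_sequence_id` was never imported or defined in this session, and its real implementation is not shown in this notebook. The sketch below is one possible version, written under two assumptions: rows are in their original file order, and a new run starts whenever the page number drops relative to the previous row in the same ward.

```python
import pandas as pd


def append_page_sequence_id(df, ward_col, pagenum_col):
    """Hypothetical sketch: number the runs of page numbers within each ward."""
    out = df.copy()
    # True on the first row of every new run (page number decreased within the ward).
    restarted = out.groupby(ward_col)[pagenum_col].diff() < 0
    # Count the restarts seen so far within each ward: 0 for the first run, 1 for the next, ...
    out["page_sequence_id"] = restarted.astype(int).groupby(out[ward_col]).cumsum()
    return out


toy = pd.DataFrame({
    "CENSUS_WARD_NUM": [1, 1, 1, 1, 1, 2, 2, 2],
    "CENSUS_PAGENUM": [1, 1, 2, 1, 2, 1, 2, 1],
})
labelled = append_page_sequence_id(toy, "CENSUS_WARD_NUM", "CENSUS_PAGENUM")
```

In the toy frame, ward 1 restarts at its fourth row and ward 2 at its third row, so those rows and the ones after them receive `page_sequence_id` 1.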

In [None]:
print(enumerators.shape)
print(enumerators_2.shape)

In [None]:
filled_ch = filled_1850_2[[ward_col, 'page_sequence_id']].groupby(ward_col)['page_sequence_id'].agg('nunique')
enum_ch = enumerators_2[[ward_col, 'page_sequence_id']].groupby(ward_col)['page_sequence_id'].agg('nunique')
assert (filled_ch != enum_ch).sum() == 0, 'Census and enumeration data have different numbers of page sequences in at least one ward'

In [None]:
filled_1850_2.columns

In [None]:
# filled_1850_new_pagenum = filled_1850_2.merge(dwelling_to_new_pagenum, on =[ward_col, dwelling_col])
census_enumerators = filled_1850_2.merge(enumerators_2,  how = "left", 
                                                   left_on= [ward_col, 'CENSUS_PAGENUM', 'page_sequence_id'], 
                                                   right_on = ["CENSUS_WARD_NUM", "CENSUS_PAGENNO", 'page_sequence_id'])
census_enumerators.drop(columns=['CENSUS_PAGENNO', 'CENSUS_PAGENO_HOUSEHOLD', 'Notes'], inplace=True)

In [None]:
# census_enumerators.to_csv('../../data/1850_census_enumerators_120820.csv', index=False)

In [None]:
## Check that the left merge did not duplicate census rows. Must be True
print(filled_1850_2.shape)
print(census_enumerators.shape)

census_enumerators.shape[0] == filled_1850_2.shape[0]
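
As a complement to comparing shapes after the fact, pandas can enforce key uniqueness during the merge through the `validate` argument of `DataFrame.merge`. The sketch below uses toy frames with made-up column names and values; only the `validate="many_to_one"` idea carries over.

```python
import pandas as pd

census = pd.DataFrame({"ward": [1, 1, 2], "page": [1, 2, 1], "dwelling_id": [10, 11, 12]})
enum_ok = pd.DataFrame({"ward": [1, 1, 2], "page": [1, 2, 1], "enumerator": ["A", "A", "B"]})
# Same table with one key repeated, mimicking a page number that restarts.
enum_dup = pd.concat([enum_ok, enum_ok.iloc[[0]]], ignore_index=True)

# Unique right-hand keys: the merge succeeds and keeps the census row count.
merged = census.merge(enum_ok, how="left", on=["ward", "page"], validate="many_to_one")

# Duplicated right-hand keys: pandas raises instead of silently adding rows.
try:
    census.merge(enum_dup, how="left", on=["ward", "page"], validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
```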

## Ward 12 has no block number

In [None]:
filled_1850_temp = pd.read_csv("../../data/dwelling_filled_sum_1850_mn_v02.csv")
filled_1850_temp.loc[filled_1850_temp[ward_col] == 12][block_col].unique()

## Generate sequences

In [None]:
census_enum_seq = CensusData(census_enumerators, ward_col=ward_col, dwelling_col=dwelling_col, 
                             block_col =  block_col, x_col = x_col, y_col = y_col, pagenum = pagenum)
census_enum_seq.apply_sequencing(enumerator_dist = True, dwelling = True, 
                                 fixed = True, distance = True, d=0.1)
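
The internals of `apply_sequencing` are not shown here. As a rough illustration of what a distance sequence with threshold `d` could look like, the sketch below starts a new sequence whenever two consecutive known dwellings are more than `d` apart. `distance_sequence_id` is a hypothetical helper, and it uses plain Euclidean distance in whatever units the coordinates carry, which may differ from the package's actual distance calculation.

```python
import numpy as np
import pandas as pd


def distance_sequence_id(df, x_col, y_col, d):
    """Hypothetical sketch: break the sequence where consecutive points are more than d apart."""
    out = df.copy()
    # Distance from each dwelling to the previous one (NaN for the first row).
    step = np.hypot(out[x_col].diff(), out[y_col].diff())
    # Each jump larger than d opens a new sequence.
    out["sequence_id"] = (step > d).cumsum()
    return out


toy = pd.DataFrame({"CD_X": [0.0, 0.05, 0.1, 0.5, 0.55], "CD_Y": [0.0, 0.0, 0.0, 0.0, 0.0]})
tight = distance_sequence_id(toy, "CD_X", "CD_Y", d=0.1)
loose = distance_sequence_id(toy, "CD_X", "CD_Y", d=0.5)
```

With `d=0.1` the jump of 0.4 between the third and fourth points splits the toy data into two sequences; with `d=0.5` everything stays in one sequence.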

In [None]:
census_all_dwellings = census_enum_seq.df.groupby([ward_col, dwelling_col], as_index = False).first()
dwellings_sequence = census_all_dwellings.dropna(subset=[block_col])

## 1. Get dwellings that are followed by unknown dwellings whose block num can be interpolated
## dwellings_sequence => known dwellings
dwelling_sequence_sames = interpolation.same_next(dwellings_sequence, column = block_col)
# dwelling_sequence_sames = dwellings_sequence.groupby(ward_col).agg(interpolation.same_next, block_col)

## 2. Merge dwelling_sequence_sames back into the df of all known dwellings so that
## `CD_BLOCK_NUM_next`, `num_between_real`, and `header` are available for every known dwelling.
dwellings_sequence_with_next_info = dwellings_sequence.merge(dwelling_sequence_sames[[ward_col, dwelling_col,
                                                                        block_col+'_next', 
                                                                                      'num_between_real',
                                                                                      'header']], 
                                                             on=[ward_col, dwelling_col], how='left')

In [None]:
all_dwellings_1 = dataprocessing.all_dwellings_sequenced(census_all_dwellings, dwellings_sequence_with_next_info, 
                                                       block_col = block_col, fill_column = block_col,
                                                       check_column = [block_col], ward_col = ward_col, dwelling_col = dwelling_col)

all_dwellings_2 = dataprocessing.all_dwellings_sequenced(census_all_dwellings, dwellings_sequence_with_next_info, 
                                                       block_col = block_col, fill_column = block_col,
                                                       check_column = [block_col, 'sequence_id'], ward_col = ward_col, dwelling_col = dwelling_col)

all_dwellings_3 = dataprocessing.all_dwellings_sequenced(census_all_dwellings, dwellings_sequence_with_next_info, 
                                                       block_col = block_col, fill_column = block_col,
                                                       check_column = [block_col, 'sequence_id', 'enum_dist_id'], ward_col = ward_col, dwelling_col = dwelling_col)

all_dwellings_4 = dataprocessing.all_dwellings_sequenced(census_all_dwellings, dwellings_sequence_with_next_info, 
                                                       block_col = block_col, fill_column = block_col,
                                                       check_column = [block_col, 'sequence_id', 
                                                                       'enum_dist_id', 'fixed_seq',
                                                                      'dwelling_seq_id'], ward_col = ward_col, dwelling_col = dwelling_col)

In [None]:
## Check for rows where sequence_id is set but enum_dist_id is NaN. Should not happen (expect an empty frame)
temp = all_dwellings_3[[ward_col, block_col, dwelling_col, 'sequence_id', 'enum_dist_id', pagenum]]
temp.loc[(~temp['sequence_id'].isnull()) & (temp['enum_dist_id'].isnull())]

## Result

In [None]:
total_num_dwellings = census_all_dwellings.groupby([ward_col, dwelling_col]).ngroups
known_num_dwellings = census_all_dwellings.loc[~census_all_dwellings[block_col].isnull()].groupby([ward_col, dwelling_col]).ngroups

all_dwelling_list = [all_dwellings_1,all_dwellings_2, all_dwellings_3, all_dwellings_4]
in_between_num = 15
num_assigned_dwelling = []

print('\nTotal number of Dwellings: ',total_num_dwellings, '\n')
for all_dwelling_x in all_dwelling_list:
    ##interpolated portion
    total_assigned_dwellings = all_dwelling_x.loc[~all_dwelling_x[block_col].isnull()].groupby([ward_col, dwelling_col]).ngroups
#     total_assigned_dwellings = with_block_num_dwellings.shape[0]
    num_assigned_dwelling.append(round((total_assigned_dwellings - known_num_dwellings)/total_num_dwellings, 5))
    
    print("Maximum of {} dwellings between".format(str(in_between_num)))
    print("Number of dwellings that would be assigned a block:", total_assigned_dwellings - known_num_dwellings)
    print("Proportion increase dwellings assigned a block:", round((total_assigned_dwellings - known_num_dwellings)/total_num_dwellings, 5), "\n")
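
The proportion reported by the cell above is measured against all dwellings, not only the unknown ones. In notation introduced here for illustration, let $N_{\text{total}}$ be the number of distinct dwellings, $N_{\text{known}}$ the number with a block number before interpolation, and $N_{\text{assigned}}$ the number with a block number after interpolation. Then

$$
\text{proportion increase} = \frac{N_{\text{assigned}} - N_{\text{known}}}{N_{\text{total}}}
$$

As a purely hypothetical example, 200 dwellings of which 120 are known and 150 have a block number after interpolation give $(150 - 120) / 200 = 0.15$.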


In [None]:
fig, ax = plt.subplots(1,1, figsize=(5,3))
ax.barh(['Block Num', 'Block Num + Distance', 'Block Num + Distance + Enum_Dist', 'Block Num + All Sequences'], num_assigned_dwelling)
# ax.scatter(num_between, num_assigned_dwelling)
ax.set_title("Increase Proportion of Dwellings assigned a block")

## Proportion increased by wards

In [None]:
total_num_dwellings = census_all_dwellings.groupby(ward_col)[dwelling_col].agg('nunique')
known_num_dwellings = census_all_dwellings.loc[~census_all_dwellings[block_col].isnull()].groupby(ward_col)[dwelling_col].agg('nunique')

fig, ax = plt.subplots(1,1, figsize=(10, 5))
for i in range(4): 
    
    assigned_num_dwellings = all_dwelling_list[i].loc[~all_dwelling_list[i][block_col].isnull()].groupby(ward_col)[dwelling_col].agg('nunique')
    additional_assigned_dwellings = assigned_num_dwellings - known_num_dwellings
    increase_proportion = additional_assigned_dwellings/total_num_dwellings
    
    ax.scatter(increase_proportion.index, increase_proportion.values, label=f'Restriction {i+1}')
    ax.plot(increase_proportion.index, increase_proportion.values)
    
ax.set_title('Proportion Increase from Interpolation')
ax.set_xlabel('Ward')
ax.legend()

## Take a look at where they differ

### 1 vs 2

In [None]:
(all_dwellings_1[block_col].replace(np.nan, -1) != all_dwellings_2[block_col].replace(np.nan, -1)).sum()

In [None]:
np.where(all_dwellings_1[block_col].replace(np.nan, -1) != all_dwellings_2[block_col].replace(np.nan, -1))
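
The `replace(np.nan, -1)` step matters because `NaN != NaN` evaluates to True, so a direct comparison would count every dwelling that is unknown in both results as a difference. A small sketch on made-up values shows the effect and an equivalent mask that does not depend on a sentinel value:

```python
import numpy as np
import pandas as pd

a = pd.Series([10, np.nan, 12, np.nan])
b = pd.Series([10, np.nan, 13, 14])

# Naive comparison: the row where both are NaN is counted as different.
naive_diff = (a != b).sum()

# Treat two NaNs as equal, without relying on -1 never being a real block number.
differs = (a != b) & ~(a.isna() & b.isna())
positions = np.flatnonzero(differs)
```

The positions returned this way are positional, like those from `np.where`, so using them with `.iloc` assumes both data frames share the same row order.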

In [None]:
## sample of where they are different
all_dwellings_2[[ward_col, 'dwelling_id', block_col, 'sequence_id']].iloc[4685:4695]

In [None]:
## sample of where they are different
all_dwellings_2[[ward_col, 'dwelling_id', block_col, 'sequence_id']].iloc[25645:25660]

### 2 vs 3

In [None]:
(all_dwellings_3[block_col].replace(np.nan, -1) != all_dwellings_2[block_col].replace(np.nan, -1)).sum()

In [None]:
np.where(all_dwellings_3[block_col].replace(np.nan, -1) != all_dwellings_2[block_col].replace(np.nan, -1))[0][100:200]

In [None]:
## sample of where they are different
all_dwellings_3[[ward_col, 'dwelling_id', block_col, 'sequence_id', 'enum_dist_id']].iloc[490:505]

* The same known block number appears both before and after the unknown dwellings, so we are more confident filling down in this case. Requiring a matching enum_dist_id fails to capture this.


In [None]:
## sample of where they are different
all_dwellings_3[[ward_col, 'dwelling_id', block_col, 'sequence_id', 'enum_dist_id']].iloc[3575:3595]#36163

* Questionable fill down

# 2. Distance Threshold for the Distance Sequence
# 3. Number of Dwellings In Between

In [None]:
distance_threshold = [0.05, 0.1, 0.2, 0.3, 0.4]
in_between_num_list = [5, 10, 20, 25,40, None]
result_inbetween_num = {}
for in_between_num in in_between_num_list:
    result_max_distance = {}
    for max_dist in distance_threshold:

        census_enum_seq.apply_sequencing(enumerator_dist = True, dwelling = True, 
                                         fixed = True, distance = True, d = max_dist)

        census_all_dwellings = census_enum_seq.df.groupby([ward_col, dwelling_col], as_index = False).first()
        dwellings_sequence = census_all_dwellings.dropna(subset=[block_col])

        ## 1. Get dwellings that are followed by unknown dwellings whose block num can be interpolated
        ## dwellings_sequence => known dwellings
        dwelling_sequence_sames = interpolation.same_next(dwellings_sequence, column = block_col)

        ## 2. Merge dwelling_sequence_sames back into the df of all known dwellings so that
        ## `CD_BLOCK_NUM_next`, `num_between_real`, and `header` are available for every known dwelling.
        dwellings_sequence_with_next_info = dwellings_sequence.merge(dwelling_sequence_sames[[ward_col, dwelling_col,
                                                                                block_col+'_next', 
                                                                                              'num_between_real',
                                                                                              'header']], 
                                                                     on=[ward_col, dwelling_col], how='left')

        all_dwellings = dataprocessing.all_dwellings_sequenced(census_all_dwellings, dwellings_sequence_with_next_info, 
                                                               block_col = block_col, fill_column = block_col,
                                                               check_column = [block_col, 'sequence_id'], ward_col = ward_col, 
                                                                 dwelling_col = dwelling_col, dwelling_max = in_between_num)
        result_max_distance[max_dist] = all_dwellings
    result_inbetween_num[in_between_num] = result_max_distance


In [None]:

marginal_proportion_inbetween = []
for ibtwnum, result in result_inbetween_num.items():
    marginal_proportion_max_dist = {}
    for max_dist, current_all_dwellings in result.items():
        ##interpolated portion
        total_num_dwellings = current_all_dwellings.groupby(ward_col)[dwelling_col].agg('nunique')
        # Known dwellings are those with a block number before interpolation, as in the Result section
        known_num_dwellings = census_all_dwellings.loc[~census_all_dwellings[block_col].isnull()].groupby(ward_col)[dwelling_col].agg('nunique')
        total_assigned_dwellings = current_all_dwellings.loc[~current_all_dwellings[block_col].isnull()].groupby(ward_col)[dwelling_col].agg('nunique')
        marginal_proportion_max_dist[max_dist] = round((total_assigned_dwellings - known_num_dwellings)/total_num_dwellings, 5)
    
    marginal_proportion_inbetween.append(pd.DataFrame(marginal_proportion_max_dist).reset_index())

# print("Maximum of {} miles between each dwelling in a sequence".format(str(max_dist)))
# print("Maximum of {} dwellings in between".format(str(in_between_num)))
# print("Number of dwellings that would be assigned a block:", total_assigned_dwellings - known_num_dwellings)
# print("Proportion increase dwellings assigned a block:", round((total_assigned_dwellings - known_num_dwellings)/total_num_dwellings, 5), "\n")

# result_inbetween_distthreshold[in_between_num] = result_list

In [None]:
result_list = []
for i in range(len(in_between_num_list)):
    temp = marginal_proportion_inbetween[i]
    temp['in_between_num'] = in_between_num_list[i]
    result_list.append(temp)
grid_search_summary = pd.concat(result_list, axis=0)
# grid_search_summary.reset_index(inplace=True)
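
The summary built above is wide, with one column per distance threshold. If a long table is easier to filter or plot, `DataFrame.melt` can reshape it. The sketch below runs on a small frame with made-up proportions that only mimics the layout of `grid_search_summary`; the names `max_dist` and `proportion_increase` for the new columns are illustrative choices.

```python
import pandas as pd

# Made-up values in the same layout: ward, one column per threshold, in_between_num.
wide = pd.DataFrame({
    "CENSUS_WARD_NUM": [1, 2, 1, 2],
    0.05: [0.10, 0.20, 0.11, 0.21],
    0.1: [0.12, 0.22, 0.13, 0.23],
    "in_between_num": [5, 5, 10, 10],
})

# One row per (ward, in_between_num, max_dist) combination.
long_summary = wide.melt(
    id_vars=["CENSUS_WARD_NUM", "in_between_num"],
    var_name="max_dist",
    value_name="proportion_increase",
)

# Selecting one grid cell becomes a plain row filter.
subset = long_summary[(long_summary["max_dist"] == 0.1) & (long_summary["in_between_num"] == 10)]
```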

In [None]:
grid_search_summary

In [None]:
marginal_proportion_inbetween[0]

In [None]:
fig, ax = plt.subplots(len(distance_threshold), 1, figsize=(7, 15))
for i in range(len(distance_threshold)):
    dist = distance_threshold[i]
    current_summ = grid_search_summary[['CENSUS_WARD_NUM', dist, 'in_between_num']]
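    for inbtw_num in in_between_num_list:
        # `== None` is False for every row, so the unrestricted (None) run is selected with isna()
        mask = (current_summ['in_between_num'].isna() if inbtw_num is None
                else current_summ['in_between_num'] == inbtw_num)
        current_summ_2 = current_summ.loc[mask]
        ax[i].scatter(current_summ_2['CENSUS_WARD_NUM'], current_summ_2[dist], 
                      label = f'inbtw_num = {inbtw_num}')
        ax[i].plot(current_summ_2['CENSUS_WARD_NUM'], current_summ_2[dist])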
    ax[i].set_title(f'distance threshold {dist}')
    ax[i].legend(bbox_to_anchor=(1.05, 1), loc='upper left')

* Once the maximum number of dwellings allowed between two known dwellings exceeds 5, the amount we can interpolate changes little.
* If a threshold of 10 makes intuitive sense, **setting the number to 10 is recommended**: larger values barely increase the interpolation rate, and a smaller threshold makes mis-interpolation less likely.

In [None]:
fig, ax = plt.subplots(len(in_between_num_list), 1, figsize=(7, 18))
for i in range(len(in_between_num_list)):
    current_summary = marginal_proportion_inbetween[i]
    for max_dist in distance_threshold:
        ax[i].scatter(current_summary['CENSUS_WARD_NUM'], current_summary[max_dist], 
                      label = f'max_dist={max_dist}')
        ax[i].plot(current_summary['CENSUS_WARD_NUM'], current_summary[max_dist])
    ax[i].set_title(f'in_between_num = {in_between_num_list[i]}')
    ax[i].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
 

* Given in_between_num, the max_dist used to generate the distance sequence does not produce alarmingly different interpolation results.
* Once the distance threshold is over 0.05 miles, the interpolation rate changes little.
* The value of the threshold can therefore **depend solely on how we want the distance sequences to look**. Threshold = 0.25 generates the most even sequences (in terms of sequence length across the data).
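
How "even" was measured is not shown in this notebook. One possible way to quantify it, offered only as an assumption, is the coefficient of variation of sequence lengths (standard deviation divided by mean), where lower values mean more even sequences. `sequence_length_spread` is a hypothetical helper.

```python
import pandas as pd


def sequence_length_spread(sequence_ids):
    """Hypothetical metric: coefficient of variation of the sequence lengths."""
    lengths = pd.Series(sequence_ids).value_counts()
    return lengths.std(ddof=0) / lengths.mean()


even_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]    # three sequences of length 3
uneven_ids = [0, 0, 0, 0, 0, 0, 0, 1, 2]  # one long sequence and two singletons

even_spread = sequence_length_spread(even_ids)
uneven_spread = sequence_length_spread(uneven_ids)
```

Computing this value for the `sequence_id` column produced by each threshold would give one number per threshold to compare.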