## Public Opinion Data Preprocessing 
### Pew Public Opinion Data 2017 - 2021

The goal of this worksheet is to preprocess the data for easy use moving forward. We hope to: 
- remove unnessecary data 
- clean all data so that it has the same labelling conventions (ordinality & missing data point labelling) 
- combine common data across years for analysis 
- export individual datasets and a common one across years

## Methodology 

- Import all .sav files as csv with their original labels. This ensures that ordinal variables are transformed in the same direction 
    - Note: This has typically been produced by using SPSS in the VCL and exporting data with original labels. You also want to grab a dictionary of all the question definitions while you are in there. 
- Squash down all of the categorical varaibles that are labelled with individual country values 
- Create a series of dictionaries to transform the values of the rest of the dataset 
- Drop irrelevant values in each individual dataframe (this will allow a clean dataset to be used for intracountry comparisons) 
- Determine relevant variables to be used across time (this will allow a clean dataset for intercountry comparisons) 
    - This dataset would contain individual responses logged with time (year) with the same column values for the same responses. It would NOT be aggregated so that statistics could be filtered by demongraphic, region, etc before comparisons across time. 
- Export all of this data (1 dataset per year + 1 common dataset) 

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

import ipysheet as ip
from ipysheet import sheet, cell, row, column, cell_range, from_dataframe, to_dataframe 

import warnings
warnings.filterwarnings('ignore')

### Import all of the data with original labelling

In [2]:
years = [2021, 2020, 2019, 2018, 2017]

# grabbing the data across the years data is found 

d = {}

path_raw_file = '/Users/natalie_kraft/Documents/LAS/raw/PewL'
for year in years: 
    d[year] = pd.read_csv(path_raw_file + str(year) + '.csv')
    d[year].columns = d[year].columns.str.lower()

# access dataframe per year through d[year]

In [3]:
# replace Don't Know and Refused with similar verbage 
# this reduces the number of transformations we need to utilize
    
for year in years:   
    d[year] = d[year].replace({"(VOL) Don\'t know" : "Don't know", 
                 "Don\'t know (DO NOT READ)" : "Don't know", 
                 "Dont know (DO NOT READ)" : "Don't know",
                 "(VOL)\xa0Don't know" : "Don't know",
                 "Refused (DO NOT READ)" : "Refused", 
                 "(VOL) Refused" : "Refused", 
                 '(VOL)\xa0Refused' : "Refused", 
                 ' ': "Don't know", 
                 'Don’t know (DO NOT READ)' : "Don't know",
                 "Don\x92t know (DO NOT READ)" : "Don't know",
                 "Refused (DO NOT READ" : "Refused", 
                 'Dont know' : "Don't know"})
    
# attempted this shortcut, but was affected too high proportion of cells 
#     d[year] = d[year].replace({r"(.*)Don(.*)know(.*)" : "Don't know", 
#             r"(.*)Refused(.*)" : "Refused"}, regex=True)

In [4]:
dOriginal = d.copy()

In [5]:
# some variables are the same, but need slight recoding. Do that here. 
# this block focuses on 'intthreat_uspower' and 'intthreat_chpower'

dOriginal[2019]['us_influ_econ2'].replace({ "Positive" : "Not a threat", 
 "Negative" : "Threat", 
 "Neither/both (DO NOT READ)" : "Not a threat"
}, inplace = True) 

dOriginal[2019].rename(columns={'us_influ_econ2': 'intthreat_uspower'})

dOriginal[2018]['intthreat_uspower'].replace({"Major threat" : "Threat", "Minor threat": "Threat"}, inplace=True)
dOriginal[2017]['intthreat_uspower'].replace({"Major threat" : "Threat", "Minor threat": "Threat"}, inplace=True)

dOriginal[2019]['china_influ_econ2'].replace({ "Positive" : "Not a threat", 
 "Negative" : "Threat", 
 "Neither/both (DO NOT READ)" : "Not a threat"
}, inplace = True) 

dOriginal[2019].rename(columns={'china_influ_econ2': 'intthreat_chpower'})

dOriginal[2018]['intthreat_chpower'].replace({"Major threat" : "Threat", "Minor threat": "Threat"}, inplace=True)
dOriginal[2017]['intthreat_chpower'].replace({"Major threat" : "Threat", "Minor threat": "Threat"}, inplace=True)

In [6]:
# focus on us/china influence scales. Combining variables for consistency. 

infl_map = {"Fair amount of influence" : "Tries to influence the internal affairs of other countries", 
                                             "Great deal of influence" : "Tries to influence the internal affairs of other countries", 
                                             "Not too much influence" : "Mostly stays out of the internal affairs of other countries" , 
                                             "No influence at all" : "Mostly stays out of the internal affairs of other countries"}

dOriginal[2019]['china_influ_econ'].replace(infl_map, inplace=True)
dOriginal[2019]['us_influ_econ'].replace(infl_map, inplace=True)
dOriginal[2019].rename(columns={'china_influ_econ': 'influaffairs_china', 'us_influ_econ' : 'influaffairs_us'}, inplace=True)

In [7]:
dOriginal[2021].rename(columns={'polsys_country': 'polsys_reform'}, inplace=True)

In [8]:
# updating a variable to showcase the relationship countries have with the US

dOriginal[2021]['reliable_us'].replace({'Somewhat reliable': 'Somewhat good', 'Not too reliable' : 'Somewhat bad', 
                                       'Very reliable' : 'Very good', 'Not at all reliable' : 'Very bad' }, inplace=True)

dOriginal[2021].rename(columns={'reliable_us': 'us_relationship'}, inplace=True)
dOriginal[2019].rename(columns={'relations_us': 'us_relationship'}, inplace=True)

In [9]:
# one variable - econ_ties_china - is in 2 years with different meanings. 

dOriginal[2020].rename(columns={'econ_ties_china' : 'econ_ties_usch'}, inplace=True)

In [10]:
# prepare for categorical variables to be kept throughout dataset for legibility in mapping 

categorical_vars = ['children_betteroff2', 'fav_us', 'fav_china', 'fav_russia', 'country_satis', 'satisfied_democracy', 
                   'confid_biden', 'confid_trump', 'confid_xi', 'confid_putin', 'econ_sit', 'econ_power', 'us_or_china', 
                   'religion_import', 'respect_china', 'respect_us', 'intthreat_uspower', 'intthreat_chpower', 'improve_econ', 
                   'intthreat_econcondition', 'us_relationship', 'china_econ', 'china_military']

only_categorical_vars = ['global_trade', 'global_movement', 'global_information', 'china_jobloss', 'china_deficit', 
                         'china_taiwan', 'china_environ', 'china_debt', 'china_terrdisputes', 'us_def_china', 
                         'china_econ_military', 'us_world_role', 'world_role_china', 'world_role_russia', 
                         'usrel_betterworse', 'involved_us', 'influaffairs_russia', 'interest_surveycountry', 
                         'econ_ties_china', 'econ_ties_us', 'threats_new_1', 'allies_new_1', 'foreigncom_buy', 
                         'foreigncom_new', 'china_invest', 'future_usrel', 'close_relationship', 'us_mil_asia', 
                         'china_us_enemy', 'econsys_reform', 'china_tough', 'china_tough_econ', 'intthreat_ruspower', 
                        'influaffairs_china', 'influaffairs_us', 'polsys_reform', 'econ_ties_usch']

for year in years: 
    for c in categorical_vars: 
        if c in d[year].columns: 
            d[year][c + "_categorical"] = d[year][c]
    for c in only_categorical_vars: 
        if c in d[year].columns: 
            d[year][c + "_categorical"] = d[year][c]

### Reduce dimensionality before preprocessing

The following variables are individually labelled per country: 
- 'd_ptyid_' : party identification 
- 'd_relig_' : religious affiliation 
- 'd_income2_' : wealthy (rich/poor binary var based on cost of living for country) 
- 'd_educ_' : level of education respondant recieved

The country name already provides this affiliation, dimensionality will be reduced to one common variable for each categorical one. The variable names will be transformed such that: 
- 'd_ptyid_' : 'political_affiliation'
- 'd_relig_' : 'religious_affiliation' 
- 'd_income2_' : 'wealthy'
- 'd_educ_' : 'education_level'

In [11]:
def categorical_squash(regex_conven, mapping, new_name, y): 
    
    for year in y: 
        p_temp = d[year].filter(regex=regex_conven).replace(mapping)
        p_temp[new_name] = p_temp.iloc[:, 0]

        for index, name in enumerate(p_temp.columns): 
            p_temp[new_name] = p_temp[new_name].combine_first(p_temp.iloc[:, index])
        
        d[year][new_name] = p_temp[new_name]

In [12]:
# reducing political affiliation
# keep in mind, this only is a label of affiliation, not a favorability toward individual parties 
# favorability exists, but it will not be considered in this round of preprocessing 

categorical_squash('d_ptyid_', {"Don't know" : None}, 'political_affiliation', years)

# reducing religious affiliation 
categorical_squash('d_relig_', {"Don't know" : None}, 'religious_affiliation', years)

# reducing income level
# TODO: This needs updating in accordance with 'd_income_'
categorical_squash('d_income2_', {"Don't know" : None}, 'wealthy', years)
# identifing commonalities across respondant's country
for year in years: 
    d[year]['wealthy'].map(lambda x: 1 if (x is not None) and ('More' in x) else 0).value_counts()
    
categorical_squash('d_educ_', {"Don't know" : None}, 'education_level', years)

__Categorical Squash for Regional Geocoding__ 

In [13]:
# geocoding for 2021, 2020 is region

categorical_squash('region_', {"Don't know" : None}, 'regional_location', [2020, 2021])

# geocoding for 2019 is region or qs5 

categorical_squash('region_', {"Don't know" : None}, 'temp_r', [2019])
categorical_squash('qs5', {"Don't know" : None}, 'temp_q', [2019])
d[2019]['regional_location'] = d[2019]['temp_r'].combine_first(d[2019]['temp_q'])
d[2019] = d[2019].drop(columns=['temp_r', 'temp_q'])

# geocoding for 2018/2017 is qs5

categorical_squash('qs5', {"Don't know" : None}, 'regional_location', [2018, 2017] )

### Reduce dimensionality before preprocessing
- If the data is a known categorical variable, squash into a common varaible across regions and don't run it through the preprocesser. Or remove categorical variable in its entirity. 
- If the data is not needed, drop it from the dataset 

In [14]:
# Drop unneeded data from the dataset 
# This is a listing of variable names per dataset that is unneeded 

# Absolute column names can be added here
drop = {
    2021 : ['phone_sample','survey', 'weight', 'diversity_goodbad', 'healthsys_reform', 
            'basic_facts', 'public', 'polsys_countryfu', 'climate_behavior', 'usdemocracy_example', 'eu_germanyinfluence', 
            'biden_relations'], 
    2020 : ['phone_sample', 'cregion_us', 'density_us', 'covid_change', 'covid_ownfaith', 'covid_countryfaith', 'covid_family',
             'covid_united', 'covid_cooperation', 'pray', 'd_political_scale_us', 'd_ptylean_us',
            'qs8', 'survey', 'weight', 'd_born_us', 'compromise'], 
    2019 : ['phone_sample', 'cregion_us', 'density_us', 'fav_hezbollah', 'german_unification', 'germany_standard', 
            'mex_live_us', 'mex_wo_auth', 'survey', 'weight', 'd_born_us', 'relparticipate_story', 'fav_muslims_country',  
            'fav_roma', 'fav_germany', 'receive_money', 'equal_leaders', 'state_us', 
            'influence_finance', 'fav_muslimbulg', 'neighboring_countries', 'eastwest_ger', "influence_relig", 
           'influence_raise', 'econ_integration', 'country_born', 'fav_jews91', "kind_of_marriage", "same_rights", 
           'd_political_scale_us', 'country_national', 'women_rights', 'econ_communism', 'd_political_scale_us', 
           'close_relationship', 'd_ptylean_us', 'nato_def', 'better_gender', 'us_mil_asia', 
           'confid_orban', 'confid_kim', 'confid_salman', 'id_religion', 'id_nationality', 'id_occupation', 
            'id_polparty'], 
    2018 : ['survey', 'weight', 'd_born_us', 'state', 'density', 'usr', 'scregion', 'sstate',
            'susr','sdensity', 'kashmir_military', 'sanc_effrus', 'mex_live_us', 'workauto50yr', 'good_live_us', 
           'receive_money', 'immig_moreless'], 
    2017 : ['survey', 'weight', 'd_born_us', 'dem_stable', 'defense_spending', 'desc_day', 
           'dissol_goodbad', 'eu_leavestay', 'euexit_referendum', 'fav_aap', 'fav_india','fav_pak',
            'fav_japan', 'fav_saudi', 'fav_turkey', 'fav_skorea', 'fav_nkorea', 'fav_cuba', 'fav_boko', 'fav_mex',
           'fav_eu', 'fav_germany', 'fav_britain', 'fav_nato', 'swe_join_nato', 'turkey_eu_member', 'dissol_goodbad', 
            'me_role_egypt', 'me_role_saudiarabia', 'me_role_turkey','me_role_iran','me_role_israel',
           'fav_sisi','fav_erdogan','fav_assad','fav_netanyahu','fav_salman','fav_rouhani','fav_abdullahii','refugee_iraqsyr',
            'war_syria_length', 'fav_adtlpolcnty_rousseff', 'fav_adtlpolcnty_luiz', 'fav_adtlpolcnty_temer',
            'fav_lopez', 'fav_radonski', 'fav_allup', 'fav_pri','fav_pan', 'fav_morena', 'fav_prd','fav_modi',
            'fav_kejriwal', 'fav_bjp', 'fav_inc','isr_pal_coexist', 'jewish_settlements', 'd_numcell', 'kashmir_military', 
            'influence_humanrightsorgs', 'nafta_goodbad', 'qsplit',  'racethn', 'racecmb', 'me_role_us', 'prob_kashmir',
           'd_density', 'receive_money', 'concern_country', 'humanrights_motive']
    
}

# to reduce names, all partial (or sets) of columns can be added here
# if the column name contains any part of this value it will be removed 
drop_inc_all = ['partyfav', "d_income", "d_race", 'd_ethnicity', "d_ptyid_", "d_educ_", "d_relig_"
               'psu_', 'stratum_', 'american_', 'language', 'pray', "abortion", "covid", 'ladder', 
               't.sample', 'homephone', 'confid_johnson', 'confid_macron','confid_merkel', 'confid_castro', 'confid_abe',
               'confid_modi', 'd_hhcell', 'fav_eu', 'fav_un', 'fav_iran', 'fav_nato', 'fav_india', 'fav_japan', 
               'fav_ep', 'fav_ec', 'd_working_cell']

drop_inc = {
    2021 : ['usbest_', 'conflict_', 'climate_intl', 'discrimprob_'], 
    2020 : ['sdlkjafsldjflakjsdlfjl'],
    2019 : ['multiparty', 'churches_', 'language_home', 'ukr_lang', 'brexit_', 'religion20yr',  
           'equal_'], # testing without 'id_'
    2018 : ['qs6', 'qs7', 'qs8', 'qs9', 'qs10', 'qs11', 'cregion', 'robjob4', 'whymove', 
            'fiveyears_', 'indiaus', 'eu_', 'cyberattack_', '20yr', 'planmove', 'modern_educ', 'friends_', 
           'officials_', 'pray_', 'relbehavior', 'pairs_'], 
    2017 : ['brexit_', 'cell_12months', 'church_', 'trump_', 'obama_', 'mfollow_', 'brexit_policy', 'eu_', 'defend_', 
            'smartphone', 'textfreq', 'turkey_', 'maduro', 'nieto', 'mex_', 'gandhi', 'modi', 'putin_', 'duterte', 
           'phil', 'italy_pride', 'stayintouch', 'friends_', 'pray_', 'd_tenure', 'qs6', 'qs7', 'qs8',
            'qs9', 'qs10', 'qs11', 'nkorea', 'd_relig_practice', 'humanrights_priority_'], 
}


In [15]:
# make sure not to drop these varaibles, dispite their missing-ness 
# highly valuable
protected_vars = ['econ_ties_china', 'china_us_enemy', 'us_mil_asia', 'close_relationship',
                  'global_money', 'global_information', 'china_jobloss', 'china_deficit', 
                  'china_taiwan', 'china_environ', 'china_debt', 'china_terrdisputes', 'us_def_china', 
                  'us_world_role', 'world_role_China', 'world_role_russia', 'usrel_betterworse', 'involved_US', 
                  'influaffairs_russia', 'interest_surveycountry',
                  'china_tough', 'china_tough_econ', 'influaffairs_china'
                 ]

protected_vars.extend([x + '_categorical' for x in only_categorical_vars])

for year in years: 
    sizeInit = d[year].shape[1]
    
    # ensure that the rest of the data has proper visibility
    # Note 6/2 originally removed with percent of left over data. Manually selected questions moving forward. 
    for i in d[year].columns:
        try:
            number = d[year][i].value_counts()["Don't know"]
            if (number > (d[year].shape[0] * .9)) & ~(i in protected_vars): 
                drop[year].append(i)
        except KeyError: 
            # do nothing
            number = 0 
            
    # drop all listed and size-constrained variables 
    d[year] = d[year].drop(columns=drop[year])
    for x in drop_inc[year]: 
        d[year] = d[year].drop([col for col in d[year].columns if x in col], axis=1)
    for x in drop_inc_all: 
        d[year] = d[year].drop([col for col in d[year].columns if x in col], axis=1)

    sizeEnd = d[year].shape[1]
    print("The data from year " + str(year) + " was reduced by " + str(sizeInit - sizeEnd) + " columns.")
    print("The data is now " + str(sizeEnd))

The data from year 2021 was reduced by 215 columns.
The data is now 61
The data from year 2020 was reduced by 188 columns.
The data is now 57
The data from year 2019 was reduced by 516 columns.
The data is now 124
The data from year 2018 was reduced by 472 columns.
The data is now 132
The data from year 2017 was reduced by 784 columns.
The data is now 129


 __We remove all categorical variables in the dataset through dummy variable transformations.__

In [16]:
# listing of all categorical variables to be connected 

found = {
    2021: [],
    2020: [], 
    2019: [], 
    2018: [], 
    2017: []
}

In [17]:
def create_dummy_var(dataset, variable_name, mapping, found): 
    for year in years: 
        if variable_name in dataset[year].columns: 
            print(variable_name + ' variable found in ' + str(year))
            dummy_demo = pd.get_dummies(dataset[year][variable_name].map(mapping))
            found[year].extend(list(dummy_demo.columns))
            
            # need to merge dummy variables into df 
            dataset[year] = pd.concat([dataset[year], dummy_demo], axis=1)

In [18]:
# some variables are categorical 
# they will be transformed into dummy variables and their original label will be removed 

# econ_power
econ_power_mapping = {
    "The United States": "us_econ_power", 
    "China": "china_econ_power",
    "Japan": "japan_econ_power",
    "The countries of the European Union": "eu_econ_power",
    "(VOL) None / There is no leading economic power": "no_econ_power",
    "None / There is no leading economic power (DO NOT READ)" : "no_econ_power",
    "(VOL) Other": "other_econ_power", 
    "Other (DO NOT READ)" : "other_econ_power"
}

# maps onto dummy variables
create_dummy_var(d, 'econ_power', econ_power_mapping, found)

econ_power variable found in 2020
econ_power variable found in 2019
econ_power variable found in 2018
econ_power variable found in 2017


In [19]:
# 'us_or_china'

econ_us_china = {
    "The United States" : "prefer_us_econ", 
    "China" : "prefer_china_econ", 
    "Economic ties to both countries are equally important (DO NOT READ)" : "both_china_econ"
}

create_dummy_var(d, 'us_or_china', econ_us_china, found)
for year in years: 
    if 'both_china_econ' in d[year].columns: 
        d[year]['prefer_us_econ'] = d[year]['both_china_econ'] + d[year]['prefer_us_econ']
        d[year]['prefer_china_econ'] = d[year]['both_china_econ'] + d[year]['prefer_china_econ']
        d[year].drop(columns=['both_china_econ', 'no_econ_power', 'other_econ_power'])

us_or_china variable found in 2021
us_or_china variable found in 2019


In [20]:
# econ_power
world_leader_mapping = {
    "The U.S. is the world’s leading power": "US_better_worldleader", 
    "China is the world’s leading power": "China_better_worldleader",
    "Both (DO NOT READ)": "both_better_worldleader"
}

# maps onto dummy variables
create_dummy_var(d, 'worldleader_uschina', world_leader_mapping, found)
for year in years: 
    if 'both_better_worldleader' in d[year].columns: 
        d[year]['US_better_worldleader'] = d[year]['both_better_worldleader'] + d[year]['US_better_worldleader']
        d[year]['China_better_worldleader'] = d[year]['both_better_worldleader'] + d[year]['China_better_worldleader']
        d[year].drop(columns='both_better_worldleader')

worldleader_uschina variable found in 2018


### Identify mapping for transformations
- Create a list for variables where no transformations are needed 
- Create all mappings for variables
- Search through all variables and map those with similar corresponding labels 

In [21]:
# These column values don't need to be transformed, but we do want to keep them in the dataset 
# They are either discrete values or they are regional/naming conventions. 


keep = ['id', 'country', 'sex', 'age', "d_density", 'd_hhpeople', 'political_scale2', 'qdate_s', 'qdate_e', 
        'd_adults', "d_density", 'muslim_branch', 'political_affiliation', 'religious_affiliation', 
        'education_level', 'wealthy', 'regional_location', 'd_adult_us', 'worldleader_uschina']

# make sure to add in the categorical vars to use later on
keep.extend([x + "_categorical" for x in categorical_vars])
keep.extend([x + "_categorical" for x in only_categorical_vars])

keep_inc = {
    2021 : [ 'id'], 
    2020 : [ 'state_us', 'china_us_enemy'],
    2019 : [ "allies_new_1", "threats_new_1", 'close_relationship'], 
    2018 : [ 'qlang', 'influaffairs_china'], 
    2017 : [ 'qlang'], 
}


In [22]:
missing_vars = {
    "Don't know": 8, 
    "Refused" : 9
}

sat_bin = {
    "Dissatisfied": 0, 
    "Satisfied": 1, 
}

sat_q = {
    "Not too satisfied": 2, 
    "Not at all satisfied" : 1, 
    "Somewhat satisfied": 3, 
    "Very satisfied": 4, 
}

good_bad_q = {
    "Somewhat good" : 3,
    "Somewhat bad" : 2,
    "Very bad" : 1,
    "Very good" : 4
}

better_t = {
    "Worse off" : 1, 
    "Gotten worse" : 1,
    "Worse" : 1,
    "Better off" : 3, 
    "Better" : 3, 
    "Gotten better" : 3, 
    "Same (DO NOT READ)" : 2, 
    "Stayed about the same" : 2, 
    "About the same" : 2
}

prob_q = {
    "Very big problem" : 4, 
    "Moderately big problem" : 3, 
    "Small problem" : 2, 
    "Not a problem at all" : 1, 
}

fav_q = {
    "Somewhat favorable" : 3, 
    "Somewhat unfavorable" : 2, 
    "Very favorable": 4, 
    "Very unfavorable" : 1
}

amount_q = {
    "Great deal" : 4, 
    "A great deal" : 4,
    "Fair amount" : 3, 
    "A fair amount" : 3,
    "Not too much" : 2, 
    "Not at all" : 1
}

approval_q = {
    "Approve": 3, 
    "Strongly approve": 4, 
    "Disapprove" : 2, 
    "Strongly disapprove" : 1
} 

confid_q = {
    "No confidence at all" : 1, 
    "Not too much confidence" : 2, 
    "Some confidence" : 3, 
    "A lot of confidence" : 4
}

right_t = {
    "About right" : 2, 
    "Too great" : 3, 
    "Too small" : 1
}

yesno_bin = {
    "No" : 0, 
    "Yes" : 1
}

influe_q = {
    "Great deal of influence" : 4,
    "Very good influence" : 4, 
    "Fair amount of influence" : 3, 
    "Good influence" : 3, 
    "Bad influence" : 2, 
    "Very bad influence" : 1, 
    "Not too much influence" : 2, 
    "No influence at all" : 1, 
    "No influence (DO NOT READ)" : 8
}
  
mil_bin = {
    "Yes, would use military force" : 1, 
    "Yes, should use military force" : 1, 
    "No, would not use military force" : 0, 
    "No, should not use military force" : 0
}

import_q = {
    "Very important" : 4,  
    "Somewhat important" : 3,
    "Not very important" : 2,
    "Not too important" :2, 
    "Not at all important": 1, 
    "Not important at all": 1  
}

roles_t = {
    "More important role" : 3, 
    "Doing more" : 3, 
    "Less important role" : 1, 
    "Doing less" : 1, 
    "As important as 10 years ago" : 2, 
    "U.S. does not help (DO NOT READ)" : 2, 
    "About the same" : 2
}

threat_t = {
    "Major threat" : 3, 
    "Not a threat" : 1, 
    "Minor threat" : 2, 
    'Very concerned' : 3,
    'Very serious' : 3, 
    'Somewhat concerned' : 2, 
    'Somewhat serious' : 2, 
    'Not too serious' : 2, 
    'Not too concerned' : 2, 
    'Not at all concerned' : 1, 
    'Not a problem' : 1
}

threat_t2 = {    "Threat" : 1, 
    "Not a threat" : 0}

changes_t = {
    'It needs to be completely reformed' : 4, 
    'It needs major changes' : 3, 
    'It needs minor changes' : 2, 
    'It doesn’t need to be changed' : 1
}

god_bin = {
    "It is necessary to believe in God in order to be moral and have good values" : 1, 
    "It is not necessary to believe in God in order to be moral and have good values" : 0
}

china_bin = {
    "The U.S. should try to promote human rights in China, even if it harms economic relations with China" : 1,
    '(response in COUNTRY) should try to promote human rights in China, even if it harms econo' : 1, 
    "The U.S. should prioritize strengthening economic relations with China, even if it means not addressing human rights iss" : 0,
    '(response in COUNTRY) should prioritize strengthening economic relations with China, even if it means not addressing' : 0
}

priority_q = {
    "Top priority" : 4, 
    "Important but lower priority" : 3, 
    "Not too important" : 2, 
    "Should not work on this issue" : 1
}

trust_bin = {
    "In general, most people can be trusted" : 1,
    "In general, most people cannot be trusted" :0
}

econ_q = {
    "Improve a lot" : 5, 
    "Improve a little" : 4,
    "Worsen a little" : 2,
    "Worsen a lot" :1, 
    "Remain the same" : 3
}

trust_q = {
    "A lot" : 4, 
    "Somewhat" : 3, 
    "Not much" : 2, 
    "Not at all" : 1
}

enemy_t = {
    "Competitor" : 2,     
    "Enemy" :3,
    "Partner" :1
}

agree_q = {
    "Mostly disagree" : 2, 
    "Completely disagree" :1, 
    "Mostly agree" : 3, 
    "Completely agree" : 4, 
}

goodbad_b = {
    "Bad thing" : 0, 
    "Good thing" : 1, 
    "Neither good nor bad" : 0, 
    "Neither (DO NOT READ)" : 0,
    "Both (DO NOT READ)" :0
}

goodbad_b2 = {
    "Investment from China is a good thing" : 1, 
    "Investment from China is a bad thing" : 0
}

posneg_b  = {
    "Positive" : 1, 
    "Negative" : 0, 
    "Neither/both (DO NOT READ)" : 0   
}

opto_b = {
    "Optimistic" : 1, 
    "Pessimistic" : 0, 
    "Neither (DO NOT READ)" : 0
}

smart_b = {
    "Smartphone" : 1,
    "Not a smartphone" : 0
}

cell_b = {
    "Yes, someone in household has cell phone" : 1, 
    "No" : 0 
}

global_b = {
    "should act as part of a global community that works together to solve problems" : 1,                      
    "should act as independent nations that compete with other countries and pursue their own interests" : 0, 
    "Both (DO NOT READ)" : 1, 
    "Neither (DO NOT READ)" : 0                  
}

homo_b = {
    "Homosexuality should be accepted by society" : 1,
    "Homosexuality should not be accepted by society" : 0,  
}

relat_b = {
    "Building a strong relationship with China on economic issues" : 0,
    "Getting tougher with China on economic issues" : 1
}

news_q = {
    "Very well" : 4, 
    "Somewhat well" :3, 
    "Not too well" : 2, 
    "Not well at all" : 1, 
    "News organizations should not do this (DO NOT READ)" : 0
}

news_b = {
    "It is never acceptable for a news organization to favor one political party over others when reporting news" : 0, 
    "It is sometimes acceptable for a news organization to favor one political party over others when reporting news" : 1
}

respect_b = {
    "Yes, respects personal freedoms" : 1, 
    "No, does not respect personal freedoms" : 0 
}

support_b = {
    "Support" : 1, 
    "Oppose" : 0
}
    
civic_q = {
    "Have done in the past year" : 4, 
    "Have done in the more distant past" : 3, 
    "Have not done, but might do" : 2, 
    "Have not done and would never, under any circumstances, do" : 1
}

increase_t = {
    "Increase" : 3, 
    "Decrease" : 1,
    "Does not make a difference" : 2
}

nukes_t = {
    "Too much" : 3, 
    "About what needs to be done OR" : 2, 
    "Too little" : 1
}

jobs_t = {
    "Job creation" : 3, 
    "Job losses" : 1, 
    "Does not make a difference" : 2
}

likely_q = {
    "Very likely" : 4, 
    "Somewhat likely" : 3, 
    "Not too likely" : 2, 
    "Not at all likely" : 1
}

social_s = {
    "Several times a day" : 7,  
    "Once a day" : 6,       
    "Several times a week" : 5, 
    "Once a week" : 4, 
    "Several times a month" : 3, 
    "Once a month" : 2, 
    "Less than once a month" : 1,  
    "Never" : 0,                    
}

better_place_t = {
    "A better place to live" : 3, 
    "A worse place to live" : 1, 
    "Doesn\'t make much difference either way (DO NOT READ)" : 2
}

reliability = {
    "Very reliable" : 4,
    "Somewhat reliable" : 3, 
    "Not too reliable" : 2,
    "Not at all reliable" : 1, 
}

dictionaries = [sat_bin, good_bad_q, sat_q, better_t, fav_q, amount_q, approval_q, confid_q, right_t, 
               yesno_bin, influe_q, mil_bin, import_q, roles_t, threat_t, prob_q, trust_q, threat_t2]

dictionaries_niche = [god_bin, trust_bin, china_bin, econ_q, enemy_t, agree_q, goodbad_b, goodbad_b2,  
                      posneg_b, opto_b, smart_b, global_b, homo_b, relat_b, news_q, respect_b, support_b, 
                     civic_q, increase_t, nukes_t, jobs_t, likely_q, social_s, priority_q, news_b, 
                     cell_b, better_place_t, reliability, changes_t]

# ensures the missing variables are included in the datasets 

for i in dictionaries: 
    i.update(missing_vars)
    
for i in dictionaries_niche: 
    i.update(missing_vars)

In [23]:
# This function provides a matching mechanism for data labels into a numeric scale 
# This scale is constant across years (where positive responses are ranked highest)
# The original dataset is overrridden with these transformations 
def preprocess(year, dictionaries, found):
    
    for i in d[year].columns: 
        if ('qs' not in i) and ('region' not in i) and (i not in protected_vars) and (i not in keep) and (i not in keep_inc[year]) and (i not in found[year]): 
            for di in dictionaries: 
                if len(set(d[year][i]).difference(set(di.keys()))) == 0: 
                    found[year].append(i)
                    d[year][i] = d[year][i].map(di)
                    break
                       
            for di in dictionaries_niche: 
                if len(set(d[year][i]).difference(set(di.keys()))) == 0: 
                    found[year].append(i)
                    d[year][i] = d[year][i].map(di)
                    break
                       
        else: 
            found[year].append(i)
                
    return d[year], found[year]

In [24]:
# NOTE 2021 columns "reliable_us" --> "relations_us" and "climate_concern" --> "intthreat_climatechange"

In [25]:
for year in years: 
    
    print("Currently, we are preprocessing year " + str(year))
    d[year], found[year] = preprocess(year, dictionaries, found)
    print("There were " + str(len(found[year])) + " columns preprocessed.")
    print("This means that there were " + str(len(set(d[year].columns).difference(set(found[year])))) + " columns left to support: ")
    print(set(d[year].columns).difference(set(found[year])))
    print("")
    print("--------------------------------------------")

Currently, we are preprocessing year 2021
There were 64 columns preprocessed.
This means that there were 1 columns left to support: 
{'us_or_china'}

--------------------------------------------
Currently, we are preprocessing year 2020
There were 66 columns preprocessed.
This means that there were 1 columns left to support: 
{'econ_power'}

--------------------------------------------
Currently, we are preprocessing year 2019
There were 139 columns preprocessed.
This means that there were 3 columns left to support: 
{'econ_power', 'influaffairs_us', 'us_or_china'}

--------------------------------------------
Currently, we are preprocessing year 2018
There were 145 columns preprocessed.
This means that there were 1 columns left to support: 
{'econ_power'}

--------------------------------------------
Currently, we are preprocessing year 2017
There were 140 columns preprocessed.
This means that there were 1 columns left to support: 
{'econ_power'}

-------------------------------------

## Identify Commonalities 

Through individual parsing of data, it is possible that variables denoting the same question with different variable names have been listed as seperate columns in the merged dataset. We look to identify any variables of identical questions that were NOT merged accordingly and manually adjust the final spreadsheet. 

Currently, we have a secondary sheet logged within each original file which contains the actual questioned asked of respondants. Our goal is to identify similarity between these questions, we then can confirm these mappings and then transform the overlapping variables to the same name. Here is the methodology: 

- Create a mapping between the question variable name and the question itself for each variable in every year. 
- ~~Use WordMoverDistance to identify semantic similarities~~
- Provide a listing of variables with the questions listed for approval 
- Verified pairs will be listed in a dataframe 
- All verified pairs will be replaced with a common variable name 

Then we can return to merging our data together. 

*Note: This process was instead conducted through manually matching our variables of interest to save time.*

In [26]:
# definition of a map to transform variable names to a common entity

varTransformAll = {
    'children_betteroff2' : 'children_betteroff', 
    'id' : 'id_survey'
}
varTransform = {
    2021: {'d_adults' : 'd_adult_us', 
           'confid_biden' : 'confid_uspres', 
           'reliable_us' : 'us_relation', 
           'polsys_country' : 'polsys_reform', 
          }, 
    2020: {'confid_trump' : 'confid_uspres', 
           'china_tough_categorical' : 'china_tough_econ_categorical'
          }, 
    2019: {'confid_trump' : 'confid_uspres', 
           'relations_us' : 'us_relation', 
           'allies_new_1' : 'foreign_allies', 
           'threats_new_1' : 'foreign_threats', 
           'china_influ_econ' : 'china_influence', 
           'intthreat_uspower' : 'us_threat', 
           'intthreat_chpower' : 'china_threat'
          },
    2018: {'confid_trump' : 'confid_uspres', 
           'influaffairs_china' : 'china_influence',
           'intthreat_uspower' : 'us_threat', 
           'intthreat_chpower' : 'china_threat'
          },
    2017: {'confid_trump' : 'confid_uspres',
           'intthreat_uspower' : 'us_threat', 
           'intthreat_chpower' : 'china_threat'
          }
}

for year in years: 
    d[year].rename(columns=varTransformAll, inplace=True)
    d[year].rename(columns=varTransform[year], inplace=True)

In [27]:
c21 = set(d[2021].columns)
c20 = set(d[2020].columns)
c19 = set(d[2019].columns)
c18 = set(d[2018].columns)
c17 = set(d[2017].columns)
#assert len(var_set) == 15

In [28]:
c21.intersection(c20).intersection(c19).intersection(c18).intersection(c17)

{'age',
 'confid_putin',
 'confid_uspres',
 'confid_xi',
 'confid_xi_categorical',
 'd_adult_us',
 'econ_sit',
 'econ_sit_categorical',
 'education_level',
 'fav_china',
 'fav_china_categorical',
 'fav_us',
 'fav_us_categorical',
 'id_survey',
 'political_affiliation',
 'qdate_e',
 'qdate_s',
 'regional_location',
 'religion_import',
 'religion_import_categorical',
 'religious_affiliation',
 'sex',
 'wealthy'}

## Data Merge - reducing dimensionality 

- each year has a time frame added when the survey was conducted 
- data is merged on like column names 
- dimensionality reduction showcased 

In [29]:
# adding the survey year to the dataframe 
for year in years: 
    d[year]['survey_year'] = year

In [30]:
# assessing data dimensionality 
count = 0
col = 0
for year in years: 
    count = d[year].shape[0] + count
    col = d[year].shape[1] + col
    print("The dimensionality of this year is " + str(count) + " by " + str(col))

The dimensionality of this year is 16254 by 64
The dimensionality of this year is 30530 by 127
The dimensionality of this year is 68956 by 261
The dimensionality of this year is 99065 by 401
The dimensionality of this year is 141018 by 537


In [31]:
# data merge 

df = pd.DataFrame()
for year in years: 
    df = df.append(d[year].reset_index())
    
df = df.reset_index()
print("The dimensionality of the combined data is " + str(df.shape[0]) + " by " + str(df.shape[1]))
print("In total, we have captured " + str(round(count * 100 / df.shape[0], 2)) + "% of the data after the merge.")
print("There has been a " + str(round((col - df.shape[1]) * 100 / df.shape[1], 2)) + "% decrease in column through overlapping.")

The dimensionality of the combined data is 141018 by 320
In total, we have captured 100.0% of the data after the merge.
There has been a 67.81% decrease in column through overlapping.


In [32]:
print("Across all of our years of data " + str(df.dropna(axis=1).shape[1]) + " columns are present across the dataset. ")

Across all of our years of data 21 columns are present across the dataset. 


## Tranformations

For consistency, conduct your final transformations prior to export. This includes mapping all religious affiliations to categories. 

Note: consider bridging over political affiliations as well. Requires lots of research on political groups. 

In [33]:
# transformation for all religious beliefs. 
# If the label CONTAINS the key, it should be grouped into the value categories. 
relig_transform  = {
    r"(.*)Christian(.*)" : "Christian", 
    r"(.*)Unitarian(.*)" : "Christian", 
    r"(.*)Agnostic(.*)" : "Agnostic", 
    r"(.*)African(.*)" : "traditional African religion", 
    r"(.*)Atheist(.*)" : "Atheist", 
    r"(.*)Baha(.*)" : "Bahai", 
    r"(.*)Buddhis(.*)" : "Buddhist",
    r"(.*)Buddist(.*)" : "Buddhist", 
    r"(.*)Catholic(.*)" : "Catholic", 
    r"(.*)Confucianism(.*)" : "Confucianism", 
    r"(.*)Congregationalist(.*)" : "Protestant", 
    r"(.*)Druze(.*)" : "Druze", 
    r"(.*)Evangelical(.*)" : "Protestant", 
    r"(.*)Hindu(.*)" : "Hindu", 
    r"(.*)Iglesia ni Cristo(.*)" : "Christian", 
    r"(.*)Indigenous religion(.*)" : "Indigenous religion", 
    r"(.*)Jain(.*)" : "Jain", 
    r"(.*)Jehova(.*)" : "Restorationist Christian",  
    r"(.*)Jew(.*)" : "Jewish", 
    r"(.*)Lutheran(.*)" : "Protestant", 
    r"(.*)Mormon(.*)" : "Restorationist Christian", 
    r"(.*)Muslim(.*)" : "Muslim", 
    r"(.*)No(.*)" : None, 
    r"(.*)Orthodox(.*)" : "Catholic", 
    r"(.*)Pentecostal(.*)" : "Protestant", 
    r"(.*)Presbyterian(.*)" : "Protestant", 
    r"(.*)Protestant(.*)" : "Protestant", 
    r"(.*)Sikh(.*)" : "Sikh", 
    r"(.*)Something else(.*)" : "religious",
    r"(.*)Spiritist(.*)" : "Spiritist", 
    r"(.*)Refused(.*)" : None,
    r"(.*)Afrobrazilian religion(.*)" : "Afrobrazilian religion", 
    r"(.*)Unification(.*)" : "Christian", 
    r"(.*)Unitarian(.*)" : "Christian", 
    r"(.*)Yes(.*)" : "religious", 
}

df['religious_affiliation'] = df['religious_affiliation'].replace(regex=relig_transform)

### Consolidating needed variables 

We have now cleaned the whole dataset. Let's now make it applicable to China. 

In [34]:
df.drop(columns=['level_0', 'index'], inplace=True)
df['id'] = np.arange(1, len(df) + 1)

In [35]:
needed_vars = ['survey_year', 'country', 'id', 'id_survey', 'regional_location', 'qdate_s', 'sex', 'male', 'age', 'd_hhpeople', 
               'd_adult_us', 'wealthy', 'political_affiliation', 'religious_affiliation', 'education_level', 
               'us_econ_power', 'china_econ_power', 'japan_econ_power', 'eu_econ_power', 
               'other_econ_power', 'prefer_us_econ', 'prefer_china_econ', 'children_betteroff', 
               'confid_uspres_categorical', 'confid_uspres', 'us_threat', 'china_threat'] 

categorical_vars.remove('confid_biden')
categorical_vars.remove('confid_trump')
add_categorical = [x + '_categorical' for x in categorical_vars]
categorical_vars.remove('intthreat_uspower')
categorical_vars.remove('intthreat_chpower')

only_categorical_vars.remove('china_tough')
add_only_categorical = [x + '_categorical' for x in only_categorical_vars]
categorical_vars.remove('children_betteroff2')
categorical_vars.remove('econ_power')
categorical_vars.remove('us_or_china')

needed_vars.extend(categorical_vars)
needed_vars.extend(add_categorical)
needed_vars.extend(add_only_categorical)

In [36]:
# clean d_adult_us

adult_us = {
    "Don't know" : 88,
    "Don’t know" : 88,
    "8 or more" : 8, 
    "9" : 8, 
    "11" : 8, 
    "10" : 8, 
    "15" : 8,
    "12" : 8, 
    "2.65415019762846" : 3, 
    "2.80942828485456" : 3,
    "4.06118355065196" : 4,
    "Refused" : 88, 
    "97+" : 100
}

df['d_adult_us'] = [int(x) for x in df['d_adult_us'].replace(adult_us)]
df['d_hhpeople'] = [int(x) if pd.notna(x) else x for x in df['d_hhpeople'].replace(adult_us)]

In [37]:
# find the country variable for 2021 data to be gathered before geocoding

country_map = {
    "singapore" : "Singapore", 
    "greece" : "Greece", 
    "sweden" : "Sweden", 
    "japan" : "Japan", 
    "germany_2017" : "Germany",
    "canada_2017" : "Canada", 
    "skorea_2017" : "South Korea", 
    "italy_2017" : "Italy", 
    "uk_2017" : "United Kingdom", 
    "spain" : "Spain", 
    "taiwan" : "Taiwan", 
    "australia_2017" : "Australia", 
    "newzealand" : "New Zealand", 
    "belgium" : "Belgium",
    "france" : "France" , 
    "netherlands" : "Netherlands", 
    "germany_vocational" : "Germany", 
    "australia_2017a" : "Australia"
}

s = dOriginal[2021].filter(regex=("(d_educ_.*)"))
s.columns= s.columns.str.split("d_educ_", expand=True)

s = (s.replace("Don't know", np.nan))

s = s.stack().reset_index().drop_duplicates(subset='level_0')

dict_country = dict(zip(s['level_0'], s['level_1']))

df['temp_country'] = df['id'].map(dict_country).replace(country_map)

df['country'] = [x if pd.isna(y) else y for x, y in zip(df['country'], df['temp_country'])]

# Note: still need to geocode countries and regions in seperate notebook 

In [38]:
# transform sex to be able to be used in modeling 

males = {
    'Male' : 1, 
    'Female' : 0
}

df['male'] = df['sex'].map(males)

In [39]:
# transform d_hhpeople for 2021 due to missing data

# calculating the average difference between entire households and just adults. 
t = df.loc[(df['survey_year'] < 2021) & (df['d_hhpeople'] <15) & (df['d_adult_us'] <15)]
t['diff'] = (t['d_hhpeople'] - t['d_adult_us'])

# from the results, we can infer how many children are in household 
print(t[t['diff'] > 0]['diff'].mean())

# then we add this delta (2) to the d_hhpeople variable 
# any grouping larger than 8 will be categorized as 8 and understood as '8+'
df['d_hhpeople'] = [int(x + 2) if (pd.isna(y)) & (x != 88) else (x if (x == 88) else int(y)) for x, y in zip(df['d_adult_us'], df['d_hhpeople'])]
df['d_hhpeople'] = [x if (x < 9) | (x == 88) else 8 for x in df['d_hhpeople']]

2.102923335811842


In [40]:
# mapping for econ power categorical variable 

econ_power_mapping = {
    "(VOL) None / There is no leading economic power": "There is no leading economic power", 
    "None / There is no leading economic power (DO NOT READ)" : "There is no leading economic power",
    'None/There is no leading economic power (DO NOT READ)' : "There is no leading economic power",
    "(VOL) Other": "There is an economic power (not China, US, Japan, or EU)", 
    "Other (DO NOT READ)" : "There is an economic power (not China, US, Japan, or EU)", 
    "Other (DO NOT  READ)" : "There is an economic power (not China, US, Japan, or EU)", 
    'Refused' : 'Refused to answer', 
    '(VOL)\xa0None / There is no leading economic power' : "There is no leading economic power"
}

df['econ_power_categorical'] = df['econ_power_categorical'].replace(econ_power_mapping)

In [41]:
df['us_or_china_categorical'] = [x.replace(r" (DO NOT READ)", "") if pd.notna(x) else x for x in df['us_or_china_categorical']]

us_or_china_map = {
    'Economic ties to both countries\xa0are equally important' : 'Economic ties to both countries are equally important', 
    'Neither' : 'Economic ties to neither country is important'
}

df['us_or_china_categorical'] = df['us_or_china_categorical'].replace(us_or_china_map)

In [42]:
# mergeing categorical results for US presidents 

df['confid_uspres_categorical'] = [x if pd.notna(x) else y for x, y in zip(df['confid_biden_categorical'], df['confid_trump_categorical'])]

In [43]:
# removing unneeded labels in categorical variables 

df['children_betteroff2_categorical'] = [x.replace(" (DO NOT READ)", "") if pd.notna(x) else x for x in df['children_betteroff2_categorical']]

In [44]:
# restrict to the supportive variables 
df = df[needed_vars]

In [45]:
df.rename(columns={'threats_new_1_categorical':'foreign_threats_categorical', 'allies_new_1_categorical':'foreign_allies_categorical', 'close_relationship_categorical': 'russia_relationship_categorical'}, inplace=True)

In [46]:
# make sure all the questions left as categorical responses are ready to be mapped over 

# read in the dictionary of variables to provide the mapping 
var_list = pd.read_excel("pewQVDict.xlsx", sheet_name='Final Variable Listing')

# build the dictionary 

mini_vars = var_list[pd.notna(var_list['categorical'])]
keys = [x + '_categorical' for x in mini_vars['question']]

dictionary = dict(zip(keys, mini_vars['variable_name']))

# map over the variables to the new names 
df = df.rename(columns=dictionary)

# remove labels that contain the phrase '(DO NOT READ)'
df = df.replace('\(DO NOT READ\)', '', regex=True)

In [47]:
# do a couple of last mappings to clean up the categorical responses
df['china_econ_military: Does China\'s economic or military strength concern you more?'] = df['china_econ_military: Does China\'s economic or military strength concern you more?'].replace({'Its economic strength [OR]' : 'Its economic strength'})

In [48]:
df.columns


Index(['survey_year', 'country', 'id', 'id_survey', 'regional_location',
       'qdate_s', 'sex', 'male', 'age', 'd_hhpeople', 'd_adult_us', 'wealthy',
       'political_affiliation', 'religious_affiliation', 'education_level',
       'us_econ_power', 'china_econ_power', 'japan_econ_power',
       'eu_econ_power', 'other_econ_power', 'prefer_us_econ',
       'prefer_china_econ', 'children_betteroff',
       'confid_uspres: How much confidence you have in the U.S. President to do the right thing regarding world affairs? ',
       'confid_uspres', 'us_threat', 'china_threat', 'fav_us', 'fav_china',
       'fav_russia', 'country_satis', 'satisfied_democracy', 'confid_xi',
       'confid_putin', 'econ_sit', 'religion_import', 'respect_china',
       'respect_us', 'improve_econ', 'intthreat_econcondition',
       'us_relationship', 'china_econ', 'china_military',
       'children_betteroff: When children today in (survey country) grow up, do you think they will be better off or worse off fi

In [49]:
# at this point in time, only geocoded variables should be left
set(var_list['variable_name']).difference(set(df.columns))

{'country_id', 'gl2_id'}

In [51]:
var_list = pd.read_excel("pewQVDict.xlsx", sheet_name='Final Variable Listing')
len(var_list['variable_name'])

102

In [52]:
df.shape

(141018, 100)

## Export Data 

Export data to begin geocoding. 

In [54]:
df.to_csv("pew_processed.csv", index=True)

## Appendix 

Throughout the preprocessing of this data, several areas of expansion were identified. These include: 
- __parsing of political affililation__ - data currently includes favorability to 'mainstream' parties and party affiliation. Transformation to leaning across countries could be valuable. Currently, all political identification moved toward one generic variable "political_party". 
- __updating income level__ - there is a variable 'd_income2_' that supposedly categorizes wealth. This label is inaccurate and doesn't log all of respondants wealth, regardless of a variable 'd_income_' that has this granularity. 