## Geocoding Pew Public Opinion Data 

This notebook geocodes the Pew Public Opinion Data (as it stood on June 6th, covering survey years 2017-2021). 

Geocoding is done at the country level and at the first-level administrative boundary (ADM1) level. 

*Note: we have been forced to reconsider our geocoding scheme with the adoption of the Pew Public Opinion Data* 

In [None]:
import pandas as pd
import geopandas as gpd
import json

import requests

import numpy as np
from fuzzywuzzy import fuzz

In [2]:
def load_dict(path): 
    # read a JSON-formatted text file into a dictionary
    # the context manager closes the file even if parsing fails
    with open(path, "r") as file:
        return json.loads(file.read())

## Grab Data

In [3]:
df = pd.read_csv("pew_processed.csv", index_col=0)
countries = pd.read_csv("../../data_final/countries.csv", dtype={'country_id':str}, index_col=0)



In [4]:
# map the names to a common value for searching for the geometry 
# grabs it from txt file for consistency of country naming conventions

recipient_mapping = load_dict("../country_config.txt")
df['country'] = df['country'].replace(recipient_mapping)
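
A minimal sketch of the name normalization above, assuming the config file holds a flat JSON object that maps each alias to its canonical country name. The aliases in `config_text` are made up for illustration; the real contents of `country_config.txt` are not shown in this notebook.

```python
import json

import pandas as pd

# hypothetical config contents: alias -> canonical country name
config_text = '{"USA": "United States", "Korea, South": "South Korea"}'
name_map = json.loads(config_text)

names = pd.Series(['USA', 'Canada', 'Korea, South'])

# values found in the mapping are replaced; everything else is left untouched
normalized = names.replace(name_map)
print(normalized.tolist())
```

Names that are absent from the mapping pass through unchanged, which is why the consistency check in the next section is still needed.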

## Geocode Country Level

In [5]:
# check that the country naming conventions are consistent
set(df['country']).difference(set(countries['country']))

# an empty set means all countries are accounted for
# {nan} means the only unmatched values are rows with a missing country

{nan}
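
The `{nan}` result comes from rows whose `country` is missing, not from a naming mismatch. A minimal sketch of separating the two cases, using made-up frames (`survey` and `lookup` are hypothetical stand-ins for `df` and `countries`):

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for df and countries
survey = pd.DataFrame({'country': ['Canada', 'Peru', np.nan, 'Canada']})
lookup = pd.DataFrame({'country': ['Canada', 'Peru', 'Chile']})

# drop missing names first so NaN is not reported as an unmatched country
unmatched = set(survey['country'].dropna()) - set(lookup['country'])

# count rows with no country at all and report them separately
n_missing = int(survey['country'].isna().sum())

print("unmatched names:", unmatched)
print("rows without a country:", n_missing)
```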

In [6]:
countries_map = dict(zip(countries['country'], countries['country_id']))
df['country_id'] = df['country'].map(countries_map)

## Geocode at the Regional Level 

### Utilize GeoBoundaries to grab up to date information on administrative boundaries

Administrative boundaries courtesy of <a href= 'https://www.geoboundaries.org'>geoBoundaries</a>

In [7]:
# the following is an example of how to utilize the geoBoundaries API
j = requests.get("https://www.geoboundaries.org/api/current/gbOpen/RUS/ADM1")
path = j.json()['gjDownloadURL']
shape = requests.get(path).json()

In [8]:
rus = pd.json_normalize(
    shape, 
    record_path = ['features'])
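
To see what `pd.json_normalize` produces without calling the API, here is a small offline sketch. The `FeatureCollection` below is hand-written with made-up region names; it only mimics the `features` / `properties.shapeName` layout that the code in this notebook relies on.

```python
import pandas as pd

# hand-written stand-in for a downloaded GeoJSON file (not real geoBoundaries data)
shape_example = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"shapeName": "Region A"},
         "geometry": {"type": "Point", "coordinates": [0.0, 0.0]}},
        {"type": "Feature",
         "properties": {"shapeName": "Region B"},
         "geometry": {"type": "Point", "coordinates": [1.0, 1.0]}},
    ],
}

# one row per feature; nested keys are flattened with a '.' separator
flat = pd.json_normalize(shape_example, record_path=['features'])

print(flat.columns.tolist())
print(flat['properties.shapeName'].tolist())
```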

### Build a common variable name to use for matching with ADM1 versus NUTS2 Boundaries 

Countries using ADM1 should be marked with adm_region.    
Countries using NUTS2 should be marked with nuts_region. 

Both will be run through independent sources to grab shapefiles. Regional codes will be exported as a list of the ADM1 boundaries to which the data is attributed.  
*Note: This aims to increase granularity while ensuring standardization across datasets*

In [9]:
# have a temporary dataset of locations
region_temps = pd.DataFrame(columns = ['adm_temp', 'nuts_temp', 'temp_temp'])

In [10]:
# each country/year combination stores its region in a different column, so each case needs its own filter
# build a filter and add the filtered data to temp_temp
# combine the information into the relevant 'adm_temp' or 'nuts_temp'
# once everything is combined, transfer back to df and check length and completeness
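
The filter, stage and combine pattern described in these comments can be sketched on a toy frame. All values below are made up, and `toy` and `temps` are hypothetical stand-ins for `df` and `region_temps`. The point to note is the direction of `combine_first`: `a.combine_first(b)` keeps `a` and only fills its gaps from `b`, so swapping the operands turns a fill into an override.

```python
import pandas as pd

toy = pd.DataFrame({
    'survey_year': [2021, 2019, 2019, 2018],
    'region':      ['North', None, 'East', None],
    'stratum':     [None, 'South', 'Unused', 'West'],
    'qs5':         [None, None, 'Centre', None],
})
temps = pd.DataFrame(index=toy.index)

# step 1: keep 'region' only for the 2021 rows, NaN elsewhere
temps['adm_temp'] = toy['region'].where(toy['survey_year'] == 2021)

# step 2: for 2019 rows, stage 'region' and fall back to 'stratum' when it is missing
is_2019 = toy['survey_year'] == 2019
temps['temp_temp'] = toy['region'].where(is_2019).combine_first(toy['stratum'].where(is_2019))

# step 3: fill only the gaps in adm_temp with the staged values
temps['adm_temp'] = temps['adm_temp'].combine_first(temps['temp_temp'])

# step 4: an override puts the new values first, so they win wherever they exist
override = toy['qs5'].where(is_2019)
temps['adm_temp'] = override.combine_first(temps['adm_temp'])

print(temps['adm_temp'].tolist())
```

The 2018 row stays missing because no filter has claimed it yet, which is the situation the completeness check further down is meant to surface.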

In [11]:
# 2021 and 2020 
# all based off of ADM codes
filter_1 = (df['survey_year'] == 2021) | (df['survey_year'] == 2020)
region_temps['adm_temp'] = df['region'].where(filter_1)

In [12]:
# 2019 
filter_1 = (df['survey_year'] == 2019)

region_temps['temp_temp'] = [x if pd.notna(x) else y for x, y in zip(df['region'].where(filter_1), df['stratum'].where(filter_1))]
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# override with a qs5 value for Israel
filter_2 = (df['survey_year'] == 2019) & (df['country'] == 'Israel')
region_temps['temp_temp'] = df['qs5'].where(filter_2)
region_temps['adm_temp'] = region_temps['temp_temp'].combine_first(region_temps['adm_temp'])

In [13]:
# 2018 

# EU Nuts coded
filter_1 = ((df['survey_year'] == 2018) & ((df['country'] == 'Greece') | 
                (df['country'] == 'Italy') | (df['country'] == 'Hungary') | (df['country'] == 'Poland')))
region_temps['temp_temp'] = df['qs5'].where(filter_1)
region_temps['nuts_temp'] = region_temps['nuts_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from stratum
filter_2 = ((df['survey_year'] == 2018) & ((df['country'] == 'Russia') | (df['country'] == 'India') | 
            (df['country'] == 'Philippines') | (df['country'] == 'Tunisia') | (df['country'] == 'Kenya') | 
            (df['country'] == 'South Africa') | (df['country'] == 'Argentina') | (df['country'] == 'Brazil') |
            (df['country'] == 'Mexico') | (df['country'] == 'Nigeria')))
region_temps['temp_temp'] = df['stratum'].where(filter_2)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from qs5
filter_3 = ((df['survey_year'] == 2018) & ((df['country'] == 'Australia') | (df['country'] == 'Indonesia') | 
           (df['country'] == 'Japan') | (df['country'] == 'Israel')))
region_temps['temp_temp'] = df['qs5'].where(filter_3)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from qs11
filter_4 = ((df['survey_year'] == 2018) & ((df['country'] == 'Canada')))
region_temps['temp_temp'] = df['qs11'].where(filter_4)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# NUTS coded from qs11
filter_5 = ((df['survey_year'] == 2018) & ((df['country'] == 'France') | (df['country'] == 'Germany') 
                                          | (df['country'] == 'Netherlands') | (df['country'] == 'Spain') 
                                          | (df['country'] == 'Sweden') | (df['country'] == 'United Kingdom')))
region_temps['temp_temp'] = df['qs11'].where(filter_5)
region_temps['nuts_temp'] = region_temps['nuts_temp'].combine_first(region_temps['temp_temp'])



# TODO: missing values currently from westernized areas... look into                       

In [14]:
# 2017 

# ADM coded in qs12
filter_1 = ((df['survey_year'] == 2017) & ((df['country'] == 'Canada') | (df['country'] == 'Germany'))) 
region_temps['temp_temp'] = df['qs12'].where(filter_1)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded in qs5
filter_2 = ((df['survey_year'] == 2017) & ((df['country'] == 'Australia') | (df['country'] == 'Israel')
           | (df['country'] == 'Venezuela'))) 
region_temps['temp_temp'] = df['qs5'].where(filter_2)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from stratum
filter_3 = ((df['survey_year'] == 2017) & ((df['country'] == 'Mexico') | (df['country'] == 'Russia') 
            | (df['country'] == 'India') | (df['country'] == 'Indonesia') | (df['country'] == 'Philippines')
            | (df['country'] == 'Argentina') | (df['country'] == 'Brazil') | (df['country'] == 'Chile')
            | (df['country'] == 'Colombia') | (df['country'] == 'Ghana') | (df['country'] == 'Kenya') 
            | (df['country'] == 'Nigeria')  | (df['country'] == 'South Africa') | (df['country'] == 'Senegal')
            | (df['country'] == 'Tunisia') | (df['country'] == 'Vietnam') | (df['country'] == 'Jordan') 
            | (df['country'] == 'Lebanon') | (df['country'] == 'Tanzania') | (df['country'] == 'Peru')))
region_temps['temp_temp'] = df['stratum'].where(filter_3)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# EU Nuts coded
filter_4 = ((df['survey_year'] == 2017) & ((df['country'] == 'Greece') | (df['country'] == 'Italy') | 
           (df['country'] == 'Hungary') | (df['country'] == 'Poland') | (df['country'] == 'Turkey'))) 
region_temps['temp_temp'] = df['stratum'].where(filter_4)
region_temps['nuts_temp'] = region_temps['nuts_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from qs11
filter_5 = ((df['survey_year'] == 2017) & ((df['country'] == 'France') | (df['country'] == 'Netherlands') 
            | (df['country'] == 'South Korea') | (df['country'] == 'Spain') | (df['country'] == 'Sweden')
            | (df['country'] == 'United Kingdom')))
region_temps['temp_temp'] = df['qs11'].where(filter_5)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])



# missing values from US/South Korea/UK/France/Netherlands/Spain/Sweden/Japan

In [15]:
# Override with US States for 2020-2017

filter_1 = (df['country'] == 'United States')
region_temps['temp_temp'] = df['state_us'].where(filter_1)
region_temps['adm_temp'] = region_temps['temp_temp'].combine_first(region_temps['adm_temp'])

In [16]:
print("You still have " + str(region_temps.shape[0] - region_temps['adm_temp'].combine_first(region_temps['nuts_temp']).value_counts().sum()) + " variables missing")

You still have 2984 variables missing


In [17]:
# put all adm_temp and nuts_temp values in the database for replacing 
df['adm_region'] = [[x] if pd.notna(x) else [] for x in region_temps['adm_temp']]
df['nuts_region'] = [[x] if pd.notna(x) else [] for x in region_temps['nuts_temp']]

df['regional_location_original'] = [x if pd.notna(x) else y for x, y in zip(region_temps['adm_temp'], region_temps['nuts_temp'])]
df['regional_location_original'] = [x if pd.notna(x) else np.nan for x in df['regional_location_original']].copy()

# Fixing Cases for Groups 
df['adm_region'] = [[x[0].title()] if len(x) > 0 else x for x in df['adm_region']]

In [18]:
# run to see missing variables
# use this query to explore a particular country/year further 
# df[(df['survey_year'] == 2017) & (df['country'] == 'Japan')][['region', 'stratum', 'psu', 'qs5', 'qs6', 'qs8', 'qs11', 'qs12']]

# use this query to identify missing variables 
df[pd.isna(df['regional_location_original'])].groupby(['country', 'survey_year'])[['region', 'stratum', 'psu', 'qs5', 'qs6', 'qs8', 'qs11', 'qs12']].count()

country,survey_year,region,stratum,psu,qs5,qs6,qs8,qs11,qs12
France,2018,0,0,0,0,0,0,0,0
France,2019,0,0,0,0,0,26,0,0
France,2020,0,0,0,0,0,0,0,0
France,2021,0,0,0,0,0,0,0,0
Germany,2018,0,0,0,0,0,0,0,0
Germany,2019,0,0,0,0,0,226,0,0
Germany,2020,0,0,0,0,0,0,0,0
Germany,2021,0,0,0,0,0,0,0,0
Italy,2021,0,0,0,0,0,0,0,0
Japan,2017,0,1009,0,507,502,0,0,0


In [19]:
# for ease moving forward, any row without a value for adm/nuts will be filled with ['None']

df['adm_region'] = [x if len(x) > 0 else ['None'] for x in df['adm_region']]
df['nuts_region'] = [x if len(x) > 0 else ['None'] for x in df['nuts_region']]

### Geocoding Methodology

We need to map Pew's regional locations onto ADM1 boundaries. The following method creates that pairing. 

- Use a country's ISO3 code to download the GeoJSON file of its ADM1 boundaries. 
- Use fuzzy matching to identify the regional locations that can be matched automatically to the correct ADM1 shape. (Tune the similarity threshold to get the best result.) 
- Identify the regional locations that fell below the similarity threshold, and create dictionaries to geocode those locations manually. *Note: more recent data is typically more accurate. The regional locations have changed across survey years, so some parsing is needed for consistency.*
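
The matching steps in the list above can be sketched as follows. This is an illustration only: it swaps in the standard library's `difflib.SequenceMatcher` for `fuzz.partial_ratio` so that it runs without extra dependencies, and the scores of the two are not interchangeable. `best_match`, `adm1_names` and `manual` are hypothetical names introduced for the example.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, threshold=80):
    """Return (candidate, score) for the closest candidate, or (None, score) if it is too dissimilar."""
    scored = [
        (round(100 * SequenceMatcher(None, name.lower(), c.lower()).ratio()), c)
        for c in candidates
    ]
    score, winner = max(scored)
    return (winner if score > threshold else None), score


adm1_names = ['Quebec', 'Ontario', 'British Columbia']  # illustrative subset of ADM1 names
manual = {'Montreal': 'Quebec'}  # hand-built dictionary for names below the threshold

for raw in ['Ontario', 'Brittish Columbia', 'Montreal']:
    match, score = best_match(raw, adm1_names)
    if match is None:
        # fall back to the manual dictionary when fuzzy matching is not confident
        match = manual.get(raw)
    print(raw, '->', match, '(score', str(score) + ')')
```

A city name such as Montreal shares few characters with its province, so no threshold setting will rescue it; that is the case the manual dictionaries are for.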



In [20]:
# example to identify the changes in regional location over time within pew
# note that it is sometimes ambiguous which ADM1 boundary a location belongs to
# df.groupby('survey_year')['regional_location_original'].value_counts()

In [21]:
df.head()

Unnamed: 0,survey_year,country,id,id_survey,region,stratum,psu,qs5,qs6,qs8,...,china_tough_econ: Is it more important to be tough on China's territorial disputes or build a strong economic relationship with them?,russia_threat: Is Russia's power and influence a threat to your country?,china_influence: How much influence does the China exert on other countries?,us_influence: How much influence does the U.S. exert on other countries?,polsys_reform: Do you think that the political system of (survey country) needs to be reformed?,econ_ties_usch: How do you view current economic ties between the U.S. and China?,country_id,adm_region,nuts_region,regional_location_original
0,2021,Canada,1,8300005,Rest of British Columbia,,,,,,...,,,,,It needs minor changes,,14,[Rest Of British Columbia],[None],Rest of British Columbia
1,2021,Canada,2,8300007,Rest of British Columbia,,,,,,...,,,,,It needs major changes,,14,[Rest Of British Columbia],[None],Rest of British Columbia
2,2021,Canada,3,8300054,Rest of British Columbia,,,,,,...,,,,,It needs major changes,,14,[Rest Of British Columbia],[None],Rest of British Columbia
3,2021,Canada,4,8300133,Rest of Quebec,,,,,,...,,,,,It needs minor changes,,14,[Rest Of Quebec],[None],Rest of Quebec
4,2021,Canada,5,8300240,Montreal,,,,,,...,,,,,It needs minor changes,,14,[Montreal],[None],Montreal


__Update Country Listings__

*testing automation* 

In [22]:
# get a list of regions 
regions = df['adm_region']

mapping = {
    'Montreal' : ['Quebec'],
    'Vancouver' : ['British Columbia'], 
    'Toronto' : ['Ontario'], 
    'Bc' : ['British Columbia'], 
    'Qc' : ['Quebec'],
    'On' : ['Ontario'],
    'Mb' : ['Manitoba'], 
    'Ab' : ['Alberta'], 
    'Pei' : ['Prince Edward Island'], 
    'Sk' : ['Saskatchewan'], 
    'Atlantic Provinces' : ['New Brunswick', 'Nova Scotia', 'Prince Edward Island']
}

regionsMap = [[mapping[innerX] if (innerX in mapping.keys()) else [innerX] for innerX in x] for x in regions]

len(regions) == len(regionsMap)

regionsSmall = [ele if isinstance(ele, list) else [ele] for sublist in regionsMap for ele in sublist]

len(regionsSmall) == len(regions)

True
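
A pandas-free sketch of the same mapping idea, written so that every input row yields exactly one output list even when a region expands to several ADM1 units. `expand_regions` is a hypothetical helper, not part of this notebook; the second pass shows that a later mapping can be applied to rows that already hold several names.

```python
def expand_regions(rows, mapping):
    """Map each name in each row to its ADM1 list, keeping one output list per input row."""
    expanded = []
    for row in rows:
        new_row = []
        for name in row:
            # mapped names contribute all of their ADM1 units; other names pass through
            new_row.extend(mapping.get(name, [name]))
        expanded.append(new_row)
    return expanded


rows = [['Montreal'], ['Atlantic Provinces'], ['Junin']]
canada_map = {
    'Montreal': ['Quebec'],
    'Atlantic Provinces': ['New Brunswick', 'Nova Scotia', 'Prince Edward Island'],
}
peru_map = {'Junin': ['Junín']}

first_pass = expand_regions(rows, canada_map)
second_pass = expand_regions(first_pass, peru_map)
print(second_pass)
```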

__Beginning Mapping Techniques__ 

In [23]:
def update_country(dff, name, mapping, loc_name): 
    # `name` documents which country the mapping was written for;
    # the mapping itself is applied to every row of dff[loc_name]
    regionsReduced = []
    for row in dff[loc_name]:
        new_row = []
        for innerX in row:
            # mapped values are lists of ADM1 names; unmapped values pass through unchanged
            new_row.extend(mapping[innerX] if (innerX in mapping.keys()) else [innerX])
        regionsReduced.append(new_row)
    
    # checking to ensure dimensionality is correct: one list per input row
    assert len(regionsReduced) == len(dff[loc_name])
    
    return regionsReduced

In [26]:
canada = {
    'Montreal' : ['Quebec'],
    'Vancouver' : ['British Columbia'], 
    'Toronto' : ['Ontario'], 
    'Bc' : ['British Columbia'], 
    'Qc' : ['Quebec'],
    'On' : ['Ontario'],
    'Mb' : ['Manitoba'], 
    'Ab' : ['Alberta'], 
    'Pei' : ['Prince Edward Island'], 
    'Sk' : ['Saskatchewan'], 
    'Atlantic Provinces' : ['New Brunswick', 'Nova Scotia', 'Prince Edward Island']
}

df['adm_region'] = update_country(df, 'Canada', canada, 'adm_region')

In [24]:
# PERU 
peru = {
    'Junin' : ['Junín'],
}

df['adm_region'] = update_country(df, 'Peru', peru, 'adm_region')

In [25]:
## END WORK READY TO GEOCODE WITH MORE COUNTRIES 

0         [Rest Of British Columbia]
1         [Rest Of British Columbia]
2         [Rest Of British Columbia]
3                   [Rest Of Quebec]
4                         [Montreal]
                     ...            
141013       [Lima Provincias Rural]
141014       [Lima Provincias Rural]
141015       [Lima Provincias Rural]
141016       [Lima Provincias Rural]
141017       [Lima Provincias Rural]
Name: adm_region, Length: 141018, dtype: object

In [None]:
# MEXICO 

# Electoral regions are groupings of ADM1 
# TODO: Map or create unique regions 

In [None]:
# CHILE 

In [None]:
# ARGENTINA 

# argen = {
#     'amba' : 'Autonomous City of Buenos Aires', 
# }

In [None]:
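def load_boundary(iso3): 
    # download the geoBoundaries ADM1 GeoJSON for an ISO3 code and flatten it into a dataframe
    j = requests.get("https://www.geoboundaries.org/api/current/gbOpen/" + iso3 + "/ADM1")
    try: 
        # the API response points to the GeoJSON download location
        path = j.json()['gjDownloadURL']
        shape = requests.get(path).json()

        # one row per ADM1 feature, with nested keys flattened (e.g. 'properties.shapeName')
        country_df = pd.json_normalize(
                            shape, 
                            record_path = ['features'])

        return country_df
    
    except Exception: 
        print("The country " + iso3 + " is not available within geoboundaries.")
        # re-raise so the caller can skip this country
        raise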

In [None]:
# original find_adm for flat string conversion... moving over to list conversion
# def find_adm(adm_df, entity_df, sim_threshold): 

#     adm_dict = dict(zip(range(0, len(adm_df)), adm_df['properties.shapeName']))
#     entity_df['fuzzy_matching'] = [[fuzz.partial_ratio(x, y) for x in adm_df['properties.shapeName']] for y in entity_df['adm_region']]
#     entity_df['adm1'] = [adm_dict[x.index(max(x))] if max(x) > sim_threshold else "Not found" for x in entity_df['fuzzy_matching']]
    
#     print("There are " + str(len(entity_df[entity_df['adm1'] == 'Not found'])) + " instances not identified")
#     print("Here are the missing elements: ")
#     print(set(entity_df[entity_df['adm1'] == 'Not found']['adm_region']))
    
    
#     return dict(zip(entity_df['adm_region'], adm_df['properties.shapeName']))

def find_adm(adm_df, entity_df, sim_threshold): 

    # position -> ADM1 name, used to look up the best-scoring candidate
    adm_dict = dict(zip(range(0, len(adm_df)), adm_df['properties.shapeName']))
    # score every regional location against every ADM1 name (0-100, higher is more similar)
    entity_df['fuzzy_matching'] = [[fuzz.partial_ratio(x, y) for x in adm_df['properties.shapeName']] for y in entity_df['adm_region']]
    # keep the best-scoring ADM1 name only when it clears the threshold
    entity_df['adm1'] = [adm_dict[x.index(max(x))] if max(x) > sim_threshold else "Not found" for x in entity_df['fuzzy_matching']]
    
    print("There are " + str(len(entity_df[entity_df['adm1'] == 'Not found'])) + " instances not identified")
    print("Here are the missing elements: ")
    print(set(entity_df[entity_df['adm1'] == 'Not found']['adm_region']))
    
    
    # pair each regional location with its own match; unmatched locations map to "Not found"
    return dict(zip(entity_df['adm_region'], entity_df['adm1']))

In [None]:
set([item for elem in df[df['country'] == countries['country'][13]]['adm_region'] for item in elem])

In [None]:
for country in range(0, len(countries[0:20])): 
    
    # find all the unique values of regional locations for the country of interest 
    all_loc = set([item for elem in df[df['country'] == countries['country'][country]]['adm_region'] for item in elem])
    all_loc.discard(np.nan)
    unique_loc = pd.DataFrame(all_loc, columns = ['adm_region'])
    if len(unique_loc) == 0: 
        # print("The country " + countries['country'][country] + " is not in Pew dataset.")
        continue 
        
    # grab the geoBoundary for the country of interest 
    try: 
        shape = load_boundary(countries['iso3'][country])
    except: 
        print("\n" + countries['country'][country] + ":")
        print("ERROR")
        continue

    # get the ADM name for all of the regional locations
    # ignore 'Refused' and 'DK' values. They will only have location coded at the country level. 

    # encode == True will go ahead and add the Adm1_id code to the dataframe. It's recommended that
    # you leave this as False when testing the dataset
    
    print("\n" + countries['country'][country] + ":")
    dict_country = find_adm(shape, unique_loc, 80)
    
    print('--------------')
    print('encoding adm1_id codes')
    # df['adm1'] = df['adm_region'].replace(dict_country) | outdated for single value
    df['adm1'] = [[dict_country[x] if (x in dict_country.keys()) else x for x in innerList] for innerList in df['adm_region']]

In [None]:
l = [['lel'], ['lel', 'asdfasd'], ['adfasdf']]

diction = {'lel' : 'test', 'asdfasd' : 'test'}
[[diction[x] if (x in diction.keys()) else x for x in ll] for ll in l]

### Debugging Center 

Set `testVar` to the name of a country to see:
- the listing of years and common regional locations being displayed
- the values of the ADM boundaries it is trying to match with

In [None]:
testVar='Canada'

df2 = df[df['country'] == testVar]
df2.groupby('survey_year')['adm_region'].value_counts()

In [None]:
test = load_boundary(countries.at[countries.loc[countries['country']==testVar].index[0], 'iso3'])
test

## Clean and Prepare Data for Export

In [None]:
df.rename(columns={'qdate_s' : 'qdate'}, inplace=True)

In [None]:
# ensure all the variables are as expected and that we have all of the variables we would like to be collecting
vars_list = pd.read_excel("pewQVDict.xlsx", sheet_name="Final Variable Listing")

assert len(set(vars_list['variable_name']).difference(set(df.columns))) == 0
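
When the assertion above fails it does not say which variables are absent. A small sketch of reporting them by name; `missing_columns` is a hypothetical helper and the variable names are illustrative, not taken from `pewQVDict.xlsx`.

```python
def missing_columns(expected, actual):
    """Return the expected names that are absent from the actual columns, sorted for stable output."""
    return sorted(set(expected) - set(actual))


expected_vars = ['survey_year', 'country', 'adm_region', 'qdate']  # illustrative listing
actual_cols = ['survey_year', 'country', 'adm_region', 'qdate_s']  # e.g. before a rename

missing = missing_columns(expected_vars, actual_cols)
if missing:
    print("Missing from dataframe:", missing)
```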

## Export Geocoded Dataframe

In [None]:
df.to_csv("../../data_final/pew.csv", index=False)

In [None]:
nig = df[(df['country'] == 'Nigeria') & ((df['survey_year'] == 2019) | (df['survey_year'] == 2018) | (df['survey_year'] == 2017))].describe()

In [None]:
nig.transpose()['count']