## Geocoding Pew Public Opinion Data 

This notebook geocodes the Pew Public Opinion Data (as it stood on June 6th, covering surveys from 2017-2021). 

Geocoding occurs at the country level and at the first-level administrative boundary (ADM1) level. 

*Note: we have been forced to reconsider our geocoding scheme with the adoption of the Pew Public Opinion Data* 

In [56]:
import pandas as pd
import geopandas as gpd
import json
import requests

import numpy as np
from fuzzywuzzy import fuzz

In [57]:
def load_dict(path): 
    # read a JSON-formatted text file into a dictionary
    with open(path, "r") as file:
        return json.load(file)

## Grab Data

In [58]:
df = pd.read_csv("pew_processed.csv", index_col=0)
countries = pd.read_csv("../../data_final/countries.csv", dtype={'country_id':str}, index_col=0)


In [59]:
# map the names to a common value for searching for the geometry 
# grabs it from txt file for consistency of country naming conventions

recipient_mapping = load_dict("../country_config.txt")
df['country'] = df['country'].replace(recipient_mapping)

## Geocode Country Level

In [60]:
# check that the country naming conventions are consistent across both datasets
set(df['country']).difference(set(countries['country']))

# an empty set means all countries are accounted for

{nan}
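
The `{nan}` result above indicates rows with no `country` value rather than a naming mismatch. A minimal sketch of the same check with missing values handled separately, using small made-up frames (`survey` and `reference` are hypothetical stand-ins for `df` and `countries`):

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for df and countries; the values are made up
survey = pd.DataFrame({'country': ['Peru', 'Chile', np.nan, 'Peru']})
reference = pd.DataFrame({'country': ['Peru', 'Chile', 'Mexico']})

# drop missing names before comparing so NaN is not reported as a mismatch
unmatched = set(survey['country'].dropna()) - set(reference['country'])

# count the rows that have no country at all and report them separately
n_missing = int(survey['country'].isna().sum())

print("Unmatched names:", unmatched)
print("Rows without a country:", n_missing)
```

Splitting the two cases makes it easier to tell a naming problem apart from incomplete source rows.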

In [61]:
countries_map = dict(zip(countries['country'], countries['country_id']))
df['country_id'] = df['country'].map(countries_map)

## Geocode at the Regional Level 

### Utilize GeoBoundaries to grab up to date information on administrative boundaries

Administrative boundaries courtesy of <a href= 'https://www.geoboundaries.org'>geoBoundaries</a>

In [62]:
# the following is an example of how to utilize the geoBoundaries API
j = requests.get("https://www.geoboundaries.org/api/current/gbOpen/RUS/ADM1")
path = j.json()['gjDownloadURL']
shape = requests.get(path).json()

In [63]:
rus = pd.json_normalize(
    shape, 
    record_path = ['features'])

### Build a common variable name to use for matching with ADM1 versus NUTS2 Boundaries 

Countries using ADM1 should be marked with adm_region.    
Countries using NUTS2 should be marked with nuts_region. 

Both will be run through independent sources to grab shapefiles. Regional codes will be exported as a list of the ADM1 boundaries to which the data is attributed.  
*Note: This looks to increase granularity while ensuring standardization across datasets*
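
The cells that follow build these columns by layering filtered values with `Series.where` and `Series.combine_first`. A minimal sketch of that pattern on made-up data (the frame `toy` and its values are hypothetical; only the column names mirror this notebook):

```python
import numpy as np
import pandas as pd

# hypothetical survey rows; values are made up for illustration
toy = pd.DataFrame({
    'survey_year': [2021, 2019, 2019, 2018],
    'region': ['Lima', np.nan, 'Cusco', np.nan],
    'stratum': [np.nan, 'Norte', 'Sur', 'Centro'],
})

# keep 'region' only where the filter holds, NaN elsewhere
base = toy['region'].where(toy['survey_year'] >= 2019)

# candidate values taken from another column for a narrower filter
candidate = toy['stratum'].where(toy['survey_year'] == 2019)

# base.combine_first(candidate) only fills the gaps left in base
filled = base.combine_first(candidate)

# candidate.combine_first(base) lets the candidate override base
overridden = candidate.combine_first(base)

print(filled.tolist())
print(overridden.tolist())
```

The operand order decides precedence: the Series on the left wins wherever it has a value, which is why the override steps in this notebook put `temp_temp` on the left.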

In [64]:
# have a temporary dataset of locations
region_temps = pd.DataFrame(columns = ['adm_temp', 'nuts_temp', 'temp_temp'])

In [65]:
# each country/year stores its regional value in a different column, so handle them case by case:
# build a filter and copy the filtered data into temp_temp,
# then combine that information into the relevant 'adm_temp' or 'nuts_temp' column.
# once everything is combined, transfer back to df and check length and completeness.

In [66]:
# 2021 and 2020 
# all based off of ADM codes
filter_1 = (df['survey_year'] == 2021) | (df['survey_year'] == 2020)
region_temps['adm_temp'] = df['region'].where(filter_1)

In [67]:
# 2019 
filter_1 = (df['survey_year'] == 2019)

region_temps['temp_temp'] = df['region'].where(filter_1).combine_first(df['stratum'].where(filter_1))
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# override with the qs5 value for Israel
filter_2 = (df['survey_year'] == 2019) & (df['country'] == 'Israel')
region_temps['temp_temp'] = df['qs5'].where(filter_2)
region_temps['adm_temp'] = region_temps['temp_temp'].combine_first(region_temps['adm_temp'])

In [68]:
# 2018 

# EU NUTS coded from qs5
filter_1 = (df['survey_year'] == 2018) & df['country'].isin(['Greece', 'Italy', 'Hungary', 'Poland'])
region_temps['temp_temp'] = df['qs5'].where(filter_1)
region_temps['nuts_temp'] = region_temps['nuts_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from stratum
filter_2 = (df['survey_year'] == 2018) & df['country'].isin([
    'Russia', 'India', 'Philippines', 'Tunisia', 'Kenya',
    'South Africa', 'Argentina', 'Brazil', 'Mexico', 'Nigeria'])
region_temps['temp_temp'] = df['stratum'].where(filter_2)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from qs5
filter_3 = (df['survey_year'] == 2018) & df['country'].isin(['Australia', 'Indonesia', 'Japan', 'Israel'])
region_temps['temp_temp'] = df['qs5'].where(filter_3)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from qs11
filter_4 = (df['survey_year'] == 2018) & (df['country'] == 'Canada')
region_temps['temp_temp'] = df['qs11'].where(filter_4)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# NUTS coded from qs11
filter_5 = (df['survey_year'] == 2018) & df['country'].isin([
    'France', 'Germany', 'Netherlands', 'Spain', 'Sweden', 'United Kingdom'])
region_temps['temp_temp'] = df['qs11'].where(filter_5)
region_temps['nuts_temp'] = region_temps['nuts_temp'].combine_first(region_temps['temp_temp'])



# TODO: missing values currently from westernized areas... look into                       

In [69]:
# 2017 

# ADM coded in qs12
filter_1 = (df['survey_year'] == 2017) & df['country'].isin(['Canada', 'Germany'])
region_temps['temp_temp'] = df['qs12'].where(filter_1)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded in qs5
filter_2 = (df['survey_year'] == 2017) & df['country'].isin(['Australia', 'Israel', 'Venezuela'])
region_temps['temp_temp'] = df['qs5'].where(filter_2)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from stratum
filter_3 = (df['survey_year'] == 2017) & df['country'].isin([
    'Mexico', 'Russia', 'India', 'Indonesia', 'Philippines',
    'Argentina', 'Brazil', 'Chile', 'Colombia', 'Ghana', 'Kenya',
    'Nigeria', 'South Africa', 'Senegal', 'Tunisia', 'Vietnam',
    'Jordan', 'Lebanon', 'Tanzania', 'Peru'])
region_temps['temp_temp'] = df['stratum'].where(filter_3)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])

# EU NUTS coded from stratum
filter_4 = (df['survey_year'] == 2017) & df['country'].isin([
    'Greece', 'Italy', 'Hungary', 'Poland', 'Turkey'])
region_temps['temp_temp'] = df['stratum'].where(filter_4)
region_temps['nuts_temp'] = region_temps['nuts_temp'].combine_first(region_temps['temp_temp'])

# ADM coded from qs11
filter_5 = (df['survey_year'] == 2017) & df['country'].isin([
    'France', 'Netherlands', 'South Korea', 'Spain', 'Sweden', 'United Kingdom'])
region_temps['temp_temp'] = df['qs11'].where(filter_5)
region_temps['adm_temp'] = region_temps['adm_temp'].combine_first(region_temps['temp_temp'])



# missing values from US/South Korea/UK/France/Netherlands/Spain/Sweden/Japan

In [70]:
# Override with US states for 2017-2020

filter_1 = (df['country'] == 'United States')
region_temps['temp_temp'] = df['state_us'].where(filter_1)
region_temps['adm_temp'] = region_temps['temp_temp'].combine_first(region_temps['adm_temp'])

In [76]:
# count the rows that have neither an ADM nor a NUTS value
missing = region_temps['adm_temp'].combine_first(region_temps['nuts_temp']).isna().sum()
print(f"You still have {missing} variables missing")

You still have 2984 variables missing


In [73]:
# run to see missing variables
# use this query to explore a particular country/year further 
# df[(df['survey_year'] == 2017) & (df['country'] == 'Japan')][['region', 'stratum', 'psu', 'qs5', 'qs6', 'qs8', 'qs11', 'qs12']]

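# use this query to identify the country/years still missing a region
# (requires df['adm_region'] and df['nuts_region'], which are assigned in the next cell)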
df[pd.isna(df['adm_region']) & pd.isna(df['nuts_region'])].groupby(['country', 'survey_year'])[['region', 'stratum', 'psu', 'qs5', 'qs6', 'qs8', 'qs11', 'qs12']].count()

country,survey_year,region,stratum,psu,qs5,qs6,qs8,qs11,qs12
France,2018,0,0,0,0,0,0,0,0
France,2019,0,0,0,0,0,26,0,0
France,2020,0,0,0,0,0,0,0,0
France,2021,0,0,0,0,0,0,0,0
Germany,2018,0,0,0,0,0,0,0,0
Germany,2019,0,0,0,0,0,226,0,0
Germany,2020,0,0,0,0,0,0,0,0
Germany,2021,0,0,0,0,0,0,0,0
Italy,2021,0,0,0,0,0,0,0,0
Japan,2017,0,1009,0,507,502,0,0,0


In [None]:
# copy the adm_temp and nuts_temp values back into the main dataframe
df['adm_region'] = region_temps['adm_temp']
df['nuts_region'] = region_temps['nuts_temp']

### Geocoding Methodology

We need to map Pew's regional locations to ADM1 boundaries. The following method has been identified to create that pairing. 

- Use a country's ISO3 code to grab the JSON file of the ADM1 boundaries for that country. 
- Use fuzzy matching to identify the regional locations that can be matched automatically with the correct ADM1 boundary. (Tune the matching threshold to get the best result.) 
- Identify which regional locations fell below the accuracy threshold, and create dictionaries to support the geocoding of those particular locations. *Note: more recent data is typically more accurate. The regional locations have changed across years, so some parsing will be needed for consistency.*
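
The threshold step in the list above can be sketched as follows. This is an illustration only: it swaps in the standard library's `difflib.SequenceMatcher` (scores from 0 to 1) for fuzzywuzzy's `partial_ratio` (scores from 0 to 100) so that it runs without extra dependencies, and the helper `best_match`, the 0.8 threshold, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return the candidate most similar to `name`, or None if below the threshold."""
    # score the name against every candidate, ignoring case
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# hypothetical ADM1 names and survey regions
adm1_names = ['Quebec', 'Ontario', 'British Columbia']
survey_regions = ['Québec', 'Ontario', 'Atlantic Region']

# automatic matches, with None marking the regions that need a manual dictionary
matched = {region: best_match(region, adm1_names) for region in survey_regions}
unmatched = [region for region, match in matched.items() if match is None]

print(matched)
print("Needs a manual mapping:", unmatched)
```

Regions that come back as `None` are the ones that would go into a hand-written dictionary, like the per-country mappings defined further down.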



In [9]:
# example query to see how regional locations change over time within Pew
# note that it is sometimes ambiguous which ADM1 boundary should be used
# df.groupby('survey_year')['regional_location'].value_counts()

In [10]:
df['regional_location_original'] = df['adm_region'].combine_first(df['nuts_region'])

# Fixing Cases for Groups 

df['adm_region'] = [x.title() if pd.notna(x) else x for x in df['adm_region']]

__Update Country Listings__

In [11]:
def update_country(dff, name, mapping): 
    # apply the mapping only to the rows that belong to the named country
    mask = dff['country'] == name
    dff.loc[mask, 'regional_location'] = dff.loc[mask, 'regional_location'].replace(mapping)
    return dff

In [12]:
# MEXICO 

# Electoral regions are groupings of ADM1 
# TODO: Map or create unique regions 

In [13]:
# CANADA 

canada = {
    'Montreal' : 'Quebec',
    'Vancouver' : 'British Columbia', 
    'Toronto' : 'Ontario', 
    # 'Atlantic Region' : 'New Brunswick', 'Nova Scotia', 'Prince Edward Island'
}

df = update_country(df, 'Canada', canada)

In [14]:
# CHILE 

In [15]:
# PERU 

peru = {
    'Junin' : 'Junín'
}

df = update_country(df, 'Peru', peru)

In [16]:
# ARGENTINA 

# argen = {
#     'amba' : 'Autonomous City of Buenos Aires', 
# }

In [17]:
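def load_boundary(iso3): 
    
    # query the geoBoundaries API for the country's ADM1 metadata
    j = requests.get("https://www.geoboundaries.org/api/current/gbOpen/" + iso3 + "/ADM1")
    try: 
        # the metadata holds the download URL of the GeoJSON boundary file
        path = j.json()['gjDownloadURL']
        shape = requests.get(path).json()
    except Exception as err: 
        print("The country " + iso3 + " is not available within geoboundaries.")
        raise ValueError("No ADM1 boundary found for " + iso3) from err

    # flatten the GeoJSON features into one row per boundary
    country_df = pd.json_normalize(
                        shape, 
                        record_path = ['features'])

    return country_df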

In [18]:
def find_adm(adm_df, entity_df, sim_threshold): 

    # look up ADM1 names by their position in adm_df
    adm_dict = dict(zip(range(0, len(adm_df)), adm_df['properties.shapeName']))
    # score every regional location against every ADM1 name
    entity_df['fuzzy_matching'] = [[fuzz.partial_ratio(x, y) for x in adm_df['properties.shapeName']] for y in entity_df['regional_location']]
    # keep the best-scoring ADM1 name when it clears the threshold
    entity_df['adm1'] = [adm_dict[x.index(max(x))] if max(x) > sim_threshold else "Not found" for x in entity_df['fuzzy_matching']]
    
    print("There are " + str(len(entity_df[entity_df['adm1'] == 'Not found'])) + " instances not identified")
    print("Here are the missing elements: ")
    print(set(entity_df[entity_df['adm1'] == 'Not found']['regional_location']))
    
    # map each regional location to its matched ADM1 name (or "Not found")
    return dict(zip(entity_df['regional_location'], entity_df['adm1']))

In [19]:
for country in range(0, len(countries)): 
    
    # find all the unique values of regional locations for the country of interest 
    all_loc = set(df[df['country'] == countries['country'][country]]['regional_location'])
    all_loc.discard(np.nan)
    unique_loc = pd.DataFrame(all_loc, columns = ['regional_location'])
    if len(unique_loc) == 0: 
        # print("The country " + countries['country'][country] + " is not in Pew dataset.")
        continue 
        
    # grab the geoBoundary for the country of interest 
    try: 
        shape = load_boundary(countries['iso3'][country])
    except: 
        print("\n" + countries['country'][country] + ":")
        print("ERROR")
        continue

    # get the ADM name for all of the regional locations
    # ignore 'Refused' and 'DK' values. They will only have location at the country level coded. 

    # encode == True will go ahead and add the adm1_id code to the dataframe. It's recommended that
    # you leave this as False when testing the dataset
    
    print("\n" + countries['country'][country] + ":")
    dict_country = find_adm(shape, unique_loc, 80)
    
    print('--------------')
    print('encoding adm1_id codes')
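    # only update the rows for the current country so earlier countries are not overwritten
    in_country = df['country'] == countries['country'][country]
    df.loc[in_country, 'adm1'] = df.loc[in_country, 'regional_location'].replace(dict_country)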


Mexico:
There are 15 instances not identified
Here are the missing elements: 
{'Circunscripcion 2', 'Circunscripcion 1', 'Circunscripcion 4', 'Circunscripcion 3', 'Electoral Region 5', 'Electoral Region 4', 'Electoral Region 3', 'Circunscripción 5', 'Electoral Region 1', 'Circunscripción 2', 'Circunscripción 1', 'Circunscripción 4', 'Circunscripcion 5', 'Circunscripción 3', 'Electoral Region 2'}
--------------
encoding adm1_id codes

Canada:
There are 2 instances not identified
Here are the missing elements: 
{'Refused', 'Atlantic Region'}
--------------
encoding adm1_id codes

Argentina:
There are 10 instances not identified
Here are the missing elements: 
{'North', 'Centro- Gran Rosario', 'Amba', 'Centro- La Plata', 'South', 'Sur', 'Centro- Mar Del Plata', 'Norte', 'Centro', 'Central'}
--------------
encoding adm1_id codes

Chile:
There are 3 instances not identified
Here are the missing elements: 
{'Norte', 'Sur', 'Central'}
--------------
encoding adm1_id codes

Peru:
There are 0 


Belgium:
There are 11 instances not identified
Here are the missing elements: 
{'Flemish Brabant Province', 'Alberta', 'Refused', 'Liege Province', 'Limburg Province', 'Antwerp Province', 'Luxembourg Province', 'Walloon Brabant Province', 'Hainaut Province', 'Don’T Know', 'Namur Province'}
--------------
encoding adm1_id codes

Germany:
There are 247 instances not identified
Here are the missing elements: 
{'130361', '130348', '130181', '130004', '130138', '130249', '130162', '130300', '130089', 'Saxony-Anhalt', '130385', '130135', '130278', '130111', '130371', '130101', '130247', '130265', '130139', '130332', '130110', '130030', '130064', '130152', '130080', '130073', '130187', '130063', '130370', '130191', '130364', '130012', '130074', '130009', '130108', '130376', '130274', '130055', '130180', '130272', '130242', '130188', '130090', '130373', '130390', '130357', '130308', '130194', '130301', '130052', '130121', '130142', 'Bavaria', '130022', '130178', '130327', '130193', '130126', 


China:
There are 22 instances not identified
Here are the missing elements: 
{'Chunghwa County', 'Hsinchu City', 'Taitung County', 'Chiayi County', 'Chiayi City', 'Yunlin County', 'Nantou County', 'Kaohsiung City', 'Penghu County', 'Taoyuan City', 'Yilan County', 'Refused', 'Kinmen County', 'Miaoli County', 'Pingtung County', 'New Taipei City', 'Taipei City', 'Taichung City', 'Hualien County', 'Keelung City', 'Hsinchu County', 'Chungcheongbuk-Do'}
--------------
encoding adm1_id codes

Philippines:
There are 12 instances not identified
Here are the missing elements: 
{'Visayas (Excl Cebu City)', 'City Of Antipolo (Balance Luzon)', 'Cebu City (Visayas)', 'Balance Luzon (Excl City Of Antipolo)', 'Kalookan City (Ncr)', 'Ncr', 'Quezon City (Ncr)', 'National Capital Region (Excl Quezon City, Manila City, And Kalookan City)', 'National Capital Region (Ncr)', 'Manila City (Ncr)', 'Mindanao (Excl Davao City)', 'Davao City (Mindanao)'}
--------------
encoding adm1_id codes

South Korea:
There 

### Debugging Center 

Set testVar to the name of a country to see:
- the listing of years and common regional locations being displayed
- the values of the ADM boundaries it is trying to match with

In [24]:
testVar='Russia'

df2 = df[df['country'] == testVar]
df2.groupby('survey_year')['regional_location'].value_counts()

survey_year  regional_location
2017         Central Region       282
             Volga                198
             Siberia              132
             Northwestern          96
             Southern              96
             Ural                  84
             North Caucasus        72
             Far East              42
2018         Volga                210
             Central              190
             Siberian             140
             Southern              90
             Urals                 90
             Moscow                80
             North Caucasus        60
             North Western         60
             Far East              40
             St.Petersburg         40
2019         Volga                226
             Central              209
             Siberia              128
             South                 98
             Moscow                90
             Ural                  90
             North-Western         70
             North 

In [25]:
test = load_boundary(countries.loc[countries['country'] == testVar, 'iso3'].iloc[0])
test

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.shapeName,properties.Level,properties.shapeISO,properties.shapeID,properties.shapeGroup,properties.shapeType
0,Feature,Polygon,"[[[85.1159569, 54.43783], [85.0334526, 54.4089...",Altai Krai,ADM1,RU-ALT,RUS-ADM1-28173009B49567218,RUS,ADM1
1,Feature,MultiPolygon,"[[[[42.442326, 54.8253986], [42.4073558, 54.84...",Republic of Mordovia,ADM1,RU-MO,RUS-ADM1-28173009B40201491,RUS,ADM1
2,Feature,Polygon,"[[[37.2687091, 54.8418015], [37.2565636, 54.83...",Tula Oblast,ADM1,RU-TUL,RUS-ADM1-28173009B92489997,RUS,ADM1
3,Feature,Polygon,"[[[62.0610577, 56.1308206], [62.070083, 56.129...",Kurgan Oblast,ADM1,RU-KGN,RUS-ADM1-28173009B91354212,RUS,ADM1
4,Feature,Polygon,"[[[44.8479393, 43.5655338], [44.8277321, 43.55...",Ingushetia,ADM1,RU-IN,RUS-ADM1-28173009B89166101,RUS,ADM1
...,...,...,...,...,...,...,...,...,...
78,Feature,MultiPolygon,"[[[[47.6840096, 43.9085911], [47.6834087, 43.9...",Dagestan,ADM1,RU-DA,RUS-ADM1-28173009B43574749,RUS,ADM1
79,Feature,MultiPolygon,"[[[[19.648615891827397, 54.4532277146658], [19...",Kaliningrad,ADM1,RU-KGD,RUS-ADM1-28173009B12217295,RUS,ADM1
80,Feature,Polygon,"[[[50.7681628, 51.7730366], [50.871413, 51.753...",Orenburg Oblast,ADM1,RU-ORE,RUS-ADM1-28173009B57910928,RUS,ADM1
81,Feature,MultiPolygon,"[[[[132.4063493, 44.545623], [132.4066543, 44....",Primorsky Krai,ADM1,RU-PRI,RUS-ADM1-28173009B5071071,RUS,ADM1


## Clean and Prepare Data for Export

In [22]:
df.rename(columns={'qdate_s' : 'qdate'}, inplace=True)

In [23]:
# ensure every variable we want to collect is present in the dataframe
vars_list = pd.read_excel("pewQVDict.xlsx", sheet_name="Final Variable Listing")

assert len(set(vars_list['variable_name']).difference(set(df.columns))) == 0

AssertionError: 
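
The bare `AssertionError` above does not say which variables are absent. A small sketch of a check that names them, assuming a hypothetical helper `missing_columns` and made-up example data:

```python
import pandas as pd

def missing_columns(expected, frame):
    """Return the expected column names that are absent from `frame`, sorted."""
    return sorted(set(expected) - set(frame.columns))

# hypothetical example data standing in for df and the variable listing
toy = pd.DataFrame({'country': ['Peru'], 'qdate': ['2019-05-01']})
expected_vars = ['country', 'qdate', 'adm1']

# list the absent variables instead of failing without a message
absent = missing_columns(expected_vars, toy)
if absent:
    print("Missing variables:", absent)
```

In this notebook the same helper could be called with `vars_list['variable_name']` and `df` to see which names trigger the failure.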

## Export Geocoded Dataframe

In [None]:
df.to_csv("../../data_final/pew.csv", index=False)