## Introduction

The full Tracking the Sun dataset contains over 1 million data points for solar cell systems in the United States. To prepare the dataset for plotting, the location of each system needs to be geocoded. This notebook finds the coordinates of each system using the geocoding service Nominatim through geopy.

## Geocoding: Step by Step

This notebook walks the user through the geocoding program. First, the user imports the functions from the geocoding Python module. The data file path is then specified and passed into read_data to load the dataset.

In [1]:
import geocoding
from geocoding import read_data
from geocoding import find_unique_cities
from geocoding import create_save_files
from geocoding import geocode_unique
from geocoding import assign_lat_long_columns
from geocoding import geocode_full

In [2]:
data_file_path = 'data/TTS_sample.csv'
data = read_data(data_file_path)
data

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,dataProvider1,dataProvider2,program1ProjectID,PTODate_orProxy_,systemSizeInDCSTC_KW_,totalInstalledCost___,Up_FrontCashIncentive___,customerSegment,...,inv_model1_clean,inverterQuantity_1,additionalInverterModels_Y_N_,inv_microinv1,inv_battery_hybrid1,inv_builtin_meter1,inv_outputcapacity1,dc_optimizer,ILR,TotalModuleQty
0,302186,302186,"Washington, D.C. Public Service Commission",-1,DC-199230-SUN-I,28-Mar-19,2.925,-1.00,-1.0,-1,...,-1,-1.0,-1,-1,-1,-1,-1.000,-1,-1.000000,17.167495
1,588569,588569,California Public Utilities Commission,-1,SDGE-INT-264056,18-May-20,6.615,23351.00,0.0,RES,...,SE5000H-US [240V],1.0,0,0,0,1,5.052,1,1.309382,41.194444
2,499861,499861,Utah Office of Energy Development,-1,-1,28-Apr-16,7.245,17322.00,0.0,RES,...,M250-60-2LL-S25-NA (240V),-1.0,-1,1,0,0,0.240,0,-1.000000,-1.000000
3,367091,367091,Pennsylvania Department of Environmental Prote...,-1,PSP-00046,NaT,3.800,37010.00,8550.0,RES,...,PVP2000,2.0,0,0,0,1,2.000,0,0.950000,37.137289
4,588328,588328,California Public Utilities Commission,-1,PGE-INT-119016533,18-May-20,5.680,24289.00,0.0,RES,...,IQ7PLUS-72-2-US [240V],16.0,0,1,0,0,0.290,0,1.224138,31.212158
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,349837,349837,New Hampshire Public Utilities Commission,-1,140504,26-Jan-15,7.700,35000.00,3750.0,RES,...,SE7600A-US [240V],-1.0,-1,0,1,1,7.616,1,-1.000000,-1.000000
305,391418,391418,Massachusetts Clean Energy Center,Massachusetts Department of Energy Resources,CS2-C22943-015382,20-Feb-14,4.251,23456.00,3400.8,RES,...,PVI-4.2-OUTD-US (208V),1.0,0,0,0,0,4.200,0,1.012143,25.200526
306,154931,154931,Connecticut Green Bank,-1,RPV-46962,31-Dec-19,11.030,36155.03,3402.0,RES,...,IQ7-60-2-US [240V],35.0,0,1,0,0,0.240,0,1.313095,64.194444
307,372239,372239,Pennsylvania Department of Environmental Prote...,-1,PSP-05223,6-Jul-11,70.300,332000.00,37650.0,NON-RES,...,IG PLUS V 7.5-240,9.0,0,0,0,1,7.500,0,1.041481,739.148828


Geocoding the entire dataset row by row would take over 7 days for 1 million data points. Instead, only the unique locations are geocoded so that computing time and power are not wasted on duplicate locations. In this example, the 308 rows contain 223 unique cities, so the number of locations to geocode drops by roughly 28%. Using this method on the full dataset results in only about 18% of the data points needing to be geocoded.
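
The time savings can be estimated with a few lines of arithmetic. The sketch below assumes one request per second, the maximum rate the public Nominatim usage policy allows; the function name and the rate are illustrative assumptions, and a slower effective rate would lengthen both figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def geocoding_days(n_requests, seconds_per_request=1.0):
    """Estimated wall-clock days to geocode n_requests one after another."""
    return n_requests * seconds_per_request / SECONDS_PER_DAY


full_rows = 1_000_000   # approximate size of the full dataset
unique_share = 0.18     # share of rows that are unique locations

days_all = geocoding_days(full_rows)
days_unique = geocoding_days(full_rows * unique_share)

print(f"row by row:       {days_all:.1f} days")
print(f"unique locations: {days_unique:.1f} days")
```

Under these assumptions the row-by-row estimate is about 11.6 days, and geocoding only the unique locations takes about 2.1 days.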

The find_unique_cities function creates a dataframe of the unique cities in the dataset and prepares the locations for geocoding by creating a 'city_state_country' column. This relies on the original dataframe having columns for the city and state, and assumes all locations are in the USA. Including the country makes the geocoding service, Nominatim, more accurate: without it, there is a chance the geocoder will pick a city with the same name in the wrong country.
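
As a rough illustration of the idea (not the actual implementation of find_unique_cities), the unique location strings could be built with pandas as follows. The column names `city` and `state` and the helper name `build_unique_locations` are assumptions made for this sketch.

```python
import pandas as pd


def build_unique_locations(df, city_col="city", state_col="state"):
    """Return one row per distinct 'City, ST, USA' string found in df."""
    # Appending the country keeps the geocoder from matching a same-named
    # city in another country.
    locations = df[city_col].str.strip() + ", " + df[state_col].str.strip() + ", USA"
    unique = locations.drop_duplicates().reset_index(drop=True)
    return unique.to_frame(name="city_state_country")


sample = pd.DataFrame({
    "city": ["OAKLAND", "Narberth", "OAKLAND"],
    "state": ["CA", "PA", "CA"],
})
unique_locations = build_unique_locations(sample)
print(unique_locations)
```

Note that this sketch compares strings exactly, so "OAKLAND" and "Oakland" would count as two different locations unless the case is normalized first.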

In [3]:
unique_cities = find_unique_cities(data)
unique_cities

Unnamed: 0,city_state_country
0,"Washington D.C., DC, USA"
1,"OCEANSIDE, CA, USA"
2,"St. George, UT, USA"
3,"Narberth, PA, USA"
4,"OAKLAND, CA, USA"
...,...
219,"Somerset, MA, USA"
220,"East Meadow, NY, USA"
221,"Etna, NH, USA"
222,"Falmouth, MA, USA"


If this is the user's first time geocoding their data, they should use the create_save_files function to create blank CSV files that store the geocoding progress as it runs. One file stores the geocoding results and the other stores cities that result in geocoding errors. Many of these errors are due to typos in the city names in the original dataset. Collecting them lets the user go back and fix the dataset manually after this code runs, if they wish.

WARNING: Running this function after the geocoding is part of the way through will delete the progress. Only run this once before geocoding a new dataset.
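
The warning reflects how blank files are typically created: opening a file in write mode truncates whatever it already holds. The sketch below shows one possible safeguard, a helper that refuses to reset an existing file unless asked. The helper `create_blank_csv`, its `overwrite` flag, and the file name are hypothetical; create_save_files in this notebook takes no arguments.

```python
import csv
import tempfile
from pathlib import Path


def create_blank_csv(path, header, overwrite=False):
    """Write a header-only CSV; refuse to clobber an existing file unless asked."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to reset it")
    with path.open("w", newline="") as f:  # mode "w" truncates existing content
        csv.writer(f).writerow(header)


# Demonstrate in a temporary directory so no real progress file is touched.
with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "geocode_progress.csv"  # hypothetical file name
    create_blank_csv(progress, ["city_state_country", "geopy location", "coordinates"])
    try:
        create_blank_csv(progress, ["city_state_country"])  # accidental second call
        second_call_blocked = False
    except FileExistsError:
        second_call_blocked = True

print(second_call_blocked)
```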

In [None]:
create_save_files()

The geocode_unique function geocodes all the unique cities in the original dataset, saving its progress as it goes (through the geocode_save function) along with the cities that result in errors. It takes the unique cities dataframe as an argument and returns a dataframe with the unique cities geocoded. It also saves this dataframe as a .csv file for reference.

Note: If the geocoding below is interrupted, the user can simply run the cell below again and the geocoding will pick up where it left off. (Again, do NOT run the create_save_files function at this stage.)
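
The resume behavior described above can be sketched as a loop that appends each result to disk immediately and skips anything already recorded. This is a minimal illustration, not the code behind geocode_unique: the helper names, file names, and the decision to skip cities already in the error file are all assumptions. The geocoder is passed in as a function so the example runs offline with a stub.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path):
    """Return all non-empty rows of a CSV file, or [] if it does not exist."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(newline="") as f:
        return [row for row in csv.reader(f) if row]


def geocode_with_resume(cities, geocode, progress_path, error_path):
    """Geocode each city once, appending results to CSV files as it goes."""
    # Cities already written to either file were handled by an earlier run.
    done = {row[0] for row in read_rows(progress_path) + read_rows(error_path)}
    for city in cities:
        if city in done:
            continue
        result = geocode(city)  # expected: (address, lat, lon) or None
        if result is None:
            target, row = error_path, [city]
        else:
            target, row = progress_path, [city, *result]
        # Append immediately so an interruption loses at most one lookup.
        with Path(target).open("a", newline="") as f:
            csv.writer(f).writerow(row)
        done.add(city)


# With geopy, `geocode` could wrap Nominatim, for example:
#   from geopy.geocoders import Nominatim
#   from geopy.extra.rate_limiter import RateLimiter
#   locator = Nominatim(user_agent="my-app")  # placeholder user agent
#   limited = RateLimiter(locator.geocode, min_delay_seconds=1)
#   def geocode(city):
#       loc = limited(city)
#       return None if loc is None else (loc.address, loc.latitude, loc.longitude)

# Offline stub standing in for the geocoding service.
STUB_RESULTS = {"OAKLAND, CA, USA": ("Oakland (stub address)", 37.8044557, -122.2713563)}
calls = []


def stub_geocode(city):
    calls.append(city)
    return STUB_RESULTS.get(city)


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.csv"
    errors = Path(tmp) / "errors.csv"
    cities = ["OAKLAND, CA, USA", "Oaklnd, CA, USA"]  # second entry has a typo
    geocode_with_resume(cities, stub_geocode, progress, errors)
    geocode_with_resume(cities, stub_geocode, progress, errors)  # simulated restart
    saved_rows = read_rows(progress)
    error_rows = read_rows(errors)

print(len(calls), saved_rows, error_rows)
```

The second call makes no further lookups because both cities are already on disk, which is what lets an interrupted run continue instead of starting over.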

In [4]:
unique_geo_df = geocode_unique(unique_cities)

In [5]:
unique_geo_df

Unnamed: 0,city_state_country,geopy location,coordinates
0,"Washington D.C., DC, USA","Washington, District of Columbia, United States","(38.8950368, -77.0365427)"
1,"OCEANSIDE, CA, USA","Oceanside, San Diego County, California, 92054...","(33.1958696, -117.3794834)"
2,"St. George, UT, USA","St. George, Washington County, Utah, United St...","(37.104153, -113.5841313)"
3,"Narberth, PA, USA","Narberth, Montgomery County, Pennsylvania, 190...","(40.0084456, -75.26046)"
4,"OAKLAND, CA, USA","Oakland, Alameda County, California, United St...","(37.8044557, -122.2713563)"
...,...,...,...
219,"Somerset, MA, USA","Somerset, Bexar County, Texas, United States","(29.2263504, -98.6577985)"
220,"East Meadow, NY, USA","East Meadow, Town of Hempstead, Nassau County,...","(40.721929599999996, -73.55839696869465)"
221,"Etna, NH, USA","Etna, Hanover, Grafton County, New Hampshire, ...","(43.6928489, -72.2217561)"
222,"Falmouth, MA, USA","Falmouth, Cumberland County, Maine, 04105, Uni...","(43.729525, -70.241993)"


Latitude and longitude coordinates are broken out into two separate columns through the assign_lat_long_columns function. This makes analyzing and plotting the data easier down the line.
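
One way such a split could work is sketched below, assuming the 'coordinates' column holds either tuples or their string form (as it would after a round trip through a CSV file). The helper name `split_coordinates` is hypothetical and is not the notebook's assign_lat_long_columns.

```python
import ast

import pandas as pd


def split_coordinates(df, source_col="coordinates"):
    """Return a copy of df with 'latitude' and 'longitude' float columns added."""
    def parse(value):
        # Values read back from a CSV are strings such as "(38.89, -77.03)".
        pair = ast.literal_eval(value) if isinstance(value, str) else value
        return float(pair[0]), float(pair[1])

    parsed = [parse(value) for value in df[source_col]]
    out = df.copy()
    out["latitude"] = [lat for lat, _ in parsed]
    out["longitude"] = [lon for _, lon in parsed]
    return out


geo = pd.DataFrame({
    "city_state_country": ["Washington D.C., DC, USA", "Narberth, PA, USA"],
    "coordinates": ["(38.8950368, -77.0365427)", "(40.0084456, -75.26046)"],
})
geo = split_coordinates(geo)
print(geo[["latitude", "longitude"]])
```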

In [6]:
unique_geo_df = assign_lat_long_columns(unique_geo_df)

In [7]:
unique_geo_df

Unnamed: 0,city_state_country,geopy location,coordinates,latitude,longitude
0,"Washington D.C., DC, USA","Washington, District of Columbia, United States","(38.8950368, -77.0365427)",38.895037,-77.036543
1,"OCEANSIDE, CA, USA","Oceanside, San Diego County, California, 92054...","(33.1958696, -117.3794834)",33.195870,-117.379483
2,"St. George, UT, USA","St. George, Washington County, Utah, United St...","(37.104153, -113.5841313)",37.104153,-113.584131
3,"Narberth, PA, USA","Narberth, Montgomery County, Pennsylvania, 190...","(40.0084456, -75.26046)",40.008446,-75.260460
4,"OAKLAND, CA, USA","Oakland, Alameda County, California, United St...","(37.8044557, -122.2713563)",37.804456,-122.271356
...,...,...,...,...,...
219,"Somerset, MA, USA","Somerset, Bexar County, Texas, United States","(29.2263504, -98.6577985)",29.226350,-98.657798
220,"East Meadow, NY, USA","East Meadow, Town of Hempstead, Nassau County,...","(40.721929599999996, -73.55839696869465)",40.721930,-73.558397
221,"Etna, NH, USA","Etna, Hanover, Grafton County, New Hampshire, ...","(43.6928489, -72.2217561)",43.692849,-72.221756
222,"Falmouth, MA, USA","Falmouth, Cumberland County, Maine, 04105, Uni...","(43.729525, -70.241993)",43.729525,-70.241993


Finally, the full dataset is geocoded by running the geocode_full function, which uses the dataframe of unique cities with latitude and longitude values to assign coordinates to every location in the original dataframe.
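
Conceptually this step is a lookup join on the 'city_state_country' column. The sketch below uses a pandas left merge and checks that every row found a match. The helper name `attach_coordinates` is hypothetical, and it raises a ValueError for missing locations, whereas the notebook's geocode_full reports that case with an AttributeError.

```python
import pandas as pd


def attach_coordinates(rows, unique_geo, key="city_state_country"):
    """Left-join latitude/longitude from the unique-city table onto every row."""
    lookup = unique_geo[[key, "latitude", "longitude"]]
    # validate="many_to_one" fails early if a city appears twice in the lookup.
    merged = rows.merge(lookup, on=key, how="left", validate="many_to_one")
    missing = merged.loc[merged["latitude"].isna(), key].unique()
    if len(missing) > 0:
        raise ValueError(
            f"{len(missing)} location(s) have no coordinates, e.g. {missing[0]!r}"
        )
    return merged


rows = pd.DataFrame({
    "city_state_country": ["OAKLAND, CA, USA", "Narberth, PA, USA", "OAKLAND, CA, USA"],
})
lookup_table = pd.DataFrame({
    "city_state_country": ["OAKLAND, CA, USA", "Narberth, PA, USA"],
    "latitude": [37.804456, 40.008446],
    "longitude": [-122.271356, -75.260460],
})
full = attach_coordinates(rows, lookup_table)
print(full)
```

Because the join is a left merge, the output keeps one row per original record, and duplicate cities simply reuse the same coordinates.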

In [10]:
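import pandas as pd
# Toy frames with their own names so the real `data` and `unique_geo_df` are not overwritten.
demo_data = pd.DataFrame({'city_state_country':["Washington D.C., DC, USA", "OCEANSIDE, CA, USA"]})
# Deliberately incomplete lookup table: geocode_full is expected to raise the error shown below.
demo_unique_geo_df = pd.DataFrame({'coordinates':["(38.8950368, -77.0365427)"]})
geocode_full(demo_data, demo_unique_geo_df)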

AttributeError: All unique cities in dataframe have not been geocoded

In [None]:
data = geocode_full(data, unique_geo_df)

In [None]:
data[['city_state_country', 'latitude', 'longitude']]

## Geocoding: One Shot

To geocode the dataset without seeing the step-by-step results, run create_save_files and geocode_go. The output is saved to 'TTS_fully_geocoded_sample.csv', which is the original data with the latitude and longitude coordinates added for each location.

In [None]:
from geocoding import geocode_go

In [None]:
create_save_files()

In [None]:
data = geocode_go('data/TTS_sample.csv')
data[['city_state_country', 'latitude', 'longitude']]