## Connecting Cleaned Data with the Closest Station Name 

1. Load the newest CBC data
2. Confirm that all the circles are in the United States
3. Clean up the lat/long data
4. Merge on row number (DEPRECATED: merge on lat/long location)

*Note: This file originally merged on lat/long. After some code modifications to the script that creates the NOAA station matches, it can now merge on row number instead.*

**Author:** rectheworld

**Date Updated:** 2020-01-31

**Inputs**
- A Christmas Bird Count file (CSV), cleaned, including only US stations
  - As of 2020-01-31, the most recent file is cleaned_cbc_usa.csv

- Closest NOAA stations file
  - As of 2020-01-01, the most recent file is closest_stations_geoyear_20200127.csv

**Outputs** 
A CSV file called "cbc_cleaned_usa_statid.csv"
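
Before running the notebook, it can help to confirm that both input files carry the columns used further down. The sketch below is only an illustration: `missing_columns` is a hypothetical helper that is not part of the original notebook, the column names are the ones referenced in the later cells, and the toy frame (with made-up values) stands in for a real input file.

```python
import pandas as pd

# Columns referenced later in this notebook
CBC_REQUIRED = ["country_state", "lat", "lon", "Unnamed__0"]
STATION_REQUIRED = [
    "circle_lat", "circle_lng", "circle_num",
    "closest_station_id", "closest_station_lat",
    "closest_station_lng", "distance",
]

def missing_columns(df, required):
    """Hypothetical helper: return the required columns that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for the CBC file; the values are made up
toy_cbc = pd.DataFrame({"country_state": ["US-XX"], "lat": [40.0], "lon": [-75.0]})
print(missing_columns(toy_cbc, CBC_REQUIRED))  # prints ['Unnamed__0']
```

An empty list means every required column is present.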


In [None]:
# Imports 
import pandas as pd
from datetime import date

## Set Directories
### Uncomment if Running in Google Drive

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
base_dir = "<BASE DRIVE>"
CBC_CIRCLE_FILE_PATH = "<CBC_CIRCLE_FILE_PATH>"
CLOSEST_NOAA_STATION_PATH = "<CLOSEST_NOAA_STATION_PATH>"

### Load in the cleaned CBC data for the United States

In [None]:
cbc_cir_df = pd.read_csv(CBC_CIRCLE_FILE_PATH, encoding ="latin_1")

In [None]:
print(cbc_cir_df.shape)
cbc_cir_df.head(10)

In [None]:
# Drop all the locations that are not in the United States
# (na=False treats rows with a missing country_state as non-US)
indexNamesNUSA = cbc_cir_df[~cbc_cir_df['country_state'].str.contains("US-", na=False)].index

# Delete these row indexes from the DataFrame
cbc_cir_df.drop(indexNamesNUSA, inplace=True)

In [None]:
print(cbc_cir_df.shape)
cbc_cir_df.head(10)
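
The US filter can be tried on a small toy frame. The values below are made up for illustration; only the `country_state` column name and the `"US-"` pattern come from this notebook. Using `na=False` is an assumption that rows with a missing `country_state` should be treated as non-US.

```python
import pandas as pd

# Toy data; the codes are illustrative, not taken from the real file
toy = pd.DataFrame({"country_state": ["US-CA", "CA-ON", None, "US-NY"]})

# Keep rows whose country_state contains "US-"; missing values count as non-US
is_usa = toy["country_state"].str.contains("US-", na=False)
toy_usa = toy[is_usa]

print(toy_usa["country_state"].tolist())  # prints ['US-CA', 'US-NY']
```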

## Prep the Closest Station File 
The closest stations are determined by lat, long, and the time the station was active.

Below we will:
1. Load the file
2. DEPRECATED: Remove the duplicate entries (mostly so I can export a shorter file later). **No longer performed, because even duplicate locations can have different stations active over the years.**
3. Add precision decimals to the lat/long coordinates

In [None]:
# Load the Closest Station File 
clos_st_nm = pd.read_csv(CLOSEST_NOAA_STATION_PATH, header = 0, encoding ="latin_1")

print(clos_st_nm.shape)
clos_st_nm.head()

In [None]:
clos_st_nm['circle_coordinates'] = "(" + clos_st_nm.circle_lat.map(str) + ", " + clos_st_nm.circle_lng.map(str) + ")"

In [None]:
# The station coordinates in the closest station file (output.csv) are shorter/less precise
# than the data that is in the main CBC file. We will restore the precision of these decimals.

# Split the circle coordinates into two columns for a sec
temp_latlon = clos_st_nm["circle_coordinates"].str.split(",", n=1, expand=True)
temp_latlon.head()

# Remove the "(" and ")" (assign back rather than replacing in place on a column selection)
temp_latlon[0] = temp_latlon[0].str.replace("(", "", regex=False)
temp_latlon[1] = temp_latlon[1].str.replace(")", "", regex=False)
temp_latlon.head(20)

# Round to 6 decimals, then right-pad the strings with zeros to a minimum width of 8 (lat) or 9 (lon) characters
temp_latlon = temp_latlon.astype(float).round(6)
to_8digit_str = lambda flt: str(flt).ljust(8, "0")
to_9digit_str = lambda flt: str(flt).ljust(9, "0")

temp_latlon[0] = temp_latlon[0].map(to_8digit_str)
temp_latlon[1] = temp_latlon[1].map(to_9digit_str)


# Recombine into a tuple-style string
temp_latlon["circle_coordinates"] = "(" + temp_latlon[0].astype(str) + ", " + temp_latlon[1].astype(str) + ")"

temp_latlon.head(20)

# Save the result back to the closest station data
clos_st_nm["circle_coordinates"] = temp_latlon["circle_coordinates"]
clos_st_nm.head(20)
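
To see what the rounding and padding step does to individual values, here is a small pure-Python sketch. It applies the same `round(6)` and `ljust` logic to made-up coordinates; `to_padded_str` is a hypothetical name used only for this illustration. Note that `ljust` pads to a minimum total string width rather than to a fixed number of decimals, so the number of decimals in the result depends on how many characters sit before the decimal point.

```python
def to_padded_str(value, width):
    """Round to 6 decimals, then right-pad the string with zeros to `width` characters."""
    return str(round(value, 6)).ljust(width, "0")

# Made-up coordinates for illustration
print(to_padded_str(41.5, 8))        # 41.50000
print(to_padded_str(-71.25, 9))      # -71.25000
print(to_padded_str(41.1234567, 8))  # 41.123457 (already longer than 8 characters, so no padding)
```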


## Prep Merge Between CBC Circle Data and the Closest Station Data

In [None]:
# Create a key in cbc_cir_df from the lat/lon as a tuple-style string
cbc_cir_df['station_key'] = "(" + cbc_cir_df['lat'].round(6).map(to_8digit_str) + ", " + cbc_cir_df['lon'].round(6).map(to_9digit_str) + ")" 


In [None]:
print(cbc_cir_df.shape)
cbc_cir_df.head(53)

## Merge the Dataframes Based on Row Number
Originally we merged on lat/long; now we merge on the row number.


In [None]:
# Merge the dataframes on row number ('Unnamed__0' in the CBC data, 'circle_num' in the closest station data)
station_cols = ['circle_coordinates', 'closest_station_id', 'closest_station_lat', 'closest_station_lng', 'distance', 'circle_num']
results = pd.merge(cbc_cir_df, clos_st_nm[station_cols], how="left", left_on='Unnamed__0', right_on='circle_num')


In [None]:
print(results.shape)
results.head(50)

In [None]:
# Count the rows that have no match 
results['closest_station_id'].isna().sum(axis = 0)
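
A related check, sketched on toy frames: passing `indicator=True` to `pd.merge` adds a `_merge` column that marks rows with no match in the right-hand frame, and `validate="many_to_one"` raises an error if a key is duplicated on the right. The frames and values below are made up; only the column names `Unnamed__0`, `circle_num`, and `closest_station_id` come from this notebook, and the validation assumes each row number appears at most once in the station table.

```python
import pandas as pd

# Toy stand-ins for the CBC circles and the closest-station table
circles = pd.DataFrame({"Unnamed__0": [0, 1, 2], "lat": [40.0, 41.0, 42.0]})
stations = pd.DataFrame({"circle_num": [0, 2], "closest_station_id": ["STN_A", "STN_B"]})

merged = pd.merge(
    circles,
    stations,
    how="left",
    left_on="Unnamed__0",
    right_on="circle_num",
    indicator=True,          # adds a "_merge" column
    validate="many_to_one",  # assumption: each circle_num appears only once on the right
)

# Rows present only in the left frame have no station match
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(n_unmatched)  # prints 1
```

On this toy data the count agrees with `merged["closest_station_id"].isna().sum()`, which is the check used in the cell above.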

In [None]:
# Save the File 
results.to_csv(base_dir + "cbc_cleaned_usa_statid.csv", index = False)