## Determine if a weather file is for a US location or not.

Weather data can be downloaded from the NOAA website at this link: https://www.ncei.noaa.gov/data/global-hourly/archive/csv/.
For this project, the 2010.tar.gz through 2019.tar.gz archives and the 2023.tar.gz archive were downloaded. They were downloaded on June 19th and contain data up to June 17th.
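
Unpacking one yearly archive could look like the sketch below. It uses only the standard-library `tarfile` module. The function name `extract_year_archive` and the example paths in the comment are assumptions for illustration, not part of the original notebook.

```python
import tarfile
from pathlib import Path

def extract_year_archive(archive_path, dest_dir):
    """Unpack one yearly .tar.gz archive into dest_dir and return the member count."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the year folder if needed
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        tar.extractall(dest)
    return len(members)

# Hypothetical usage; where the archives were saved is an assumption:
# extract_year_archive(basepath + '/Weather/Hourly/2010.tar.gz',
#                      basepath + '/Weather/Hourly/2010')
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.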

There are thousands of weather files to sift through, and they take up a lot of space.<br>
Sorting them was the first step: it narrows down the files to save space and was a chance to learn how to work with them.<br>
<br>
The files are moved into one of the following folders based on the ID provided as the name of the file:<br>
* US
* NonUS

In [None]:
import os
import pandas as pd
basepath = 'your_path_here'

The LCD Station Numbers file maps the ID used as each file name to a location (state and country).<br>
Unfortunately, it doesn't provide a cross-reference to the airport code used in the OnTime data, but it at least lets us narrow down the files.

In [None]:
# Cross reference for Station IDs to Country/State...
sid_xref_path = basepath + '/Weather/stations/LCD Station Numbers.xlsx'
# Passing the path lets pandas open and close the file itself
sid_xref = pd.read_excel(sid_xref_path, sheet_name='LCD Station List')
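
One thing worth checking after loading is the type of the `Station ID` column. The IDs taken from file names are always strings, so if pandas read the column as numbers, the lookup would silently find nothing. The sketch below shows the effect on made-up data; the IDs are placeholders, and whether the real spreadsheet needs the conversion is an assumption to verify with `sid_xref['Station ID'].dtype`.

```python
import pandas as pd

# Toy stand-in for the cross-reference; the IDs are made-up placeholders
toy_xref = pd.DataFrame({'Station ID': [11111111111, 22222222222],
                         'CTRY': ['US', 'CA']})

sid = '11111111111'  # a file name with its extension removed is a string

# Integer column compared with a string: no rows match
matches_before = int((toy_xref['Station ID'] == sid).sum())

# After converting the column to strings, the lookup finds the station
toy_xref['Station ID'] = toy_xref['Station ID'].astype(str)
matches_after = int((toy_xref['Station ID'] == sid).sum())

print(matches_before, matches_after)
```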

The Station IDs were checked for duplicate records. There are none, so each ID matches exactly one location.

In [None]:
# All IDs are unique
dup_ids = sid_xref[sid_xref.duplicated(subset=['Station ID'],keep=False)]
print(dup_ids.shape)

(0, 12)
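
For comparison, here is what the same check reports when a duplicate does exist. This is a toy example with placeholder IDs, not data from the LCD list. With `keep=False`, every row of a duplicated ID is returned, and `Series.is_unique` gives the same answer as a single boolean.

```python
import pandas as pd

# Toy data with one Station ID repeated; values are placeholders
toy = pd.DataFrame({'Station ID': ['11111111111', '22222222222', '11111111111'],
                    'CTRY': ['US', 'CA', 'US']})

# keep=False marks all rows that share a duplicated ID, not just the repeats
dups = toy[toy.duplicated(subset=['Station ID'], keep=False)]
print(dups.shape)

# A one-line alternative to the shape check
print(toy['Station ID'].is_unique)
```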


# A function was set up to run each year's set of files through
* It gets the list of file names from the directory they're stored in.
* For each file name, it looks up the Station ID in the LCD Station list and checks the country.
  * If it's US, it moves the file to the US subfolder.
  * If not, or if the ID isn't in the list, it moves the file to NonUS.

The function is called for each year.<br>
After this code was run, all files were in either US or NonUS. The NonUS files were then deleted.
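
Because the move is followed by a delete, it can help to preview the outcome first. The sketch below is one possible way to do that, not the notebook's own method: `classify_files` decides US or NonUS for each file name without touching the disk, and `ensure_subfolders` creates the two target folders, which must exist before files are moved into them. Both function names and the toy IDs are hypothetical.

```python
import os
import pandas as pd

def classify_files(filenames, xref):
    """Map each file name to 'US' or 'NonUS' without moving anything."""
    # Build the set of US station IDs once, so each lookup is a cheap set test
    us_ids = set(xref.loc[xref['CTRY'] == 'US', 'Station ID'].astype(str))
    return {name: ('US' if os.path.splitext(name)[0] in us_ids else 'NonUS')
            for name in filenames}

def ensure_subfolders(year_dir):
    """Create the US and NonUS subfolders if they are missing."""
    for sub in ('US', 'NonUS'):
        os.makedirs(os.path.join(year_dir, sub), exist_ok=True)

# Toy cross-reference and file names; the IDs are placeholders
toy_xref = pd.DataFrame({'Station ID': ['11111111111', '22222222222'],
                         'CTRY': ['US', 'CA']})
plan = classify_files(['11111111111.csv', '22222222222.csv', '33333333333.csv'],
                      toy_xref)
print(plan)
```

An ID that is missing from the cross-reference ends up in NonUS, which matches the rule described in the list above.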

In [None]:
# Within the YearFolder, used as a parameter, we have these subfolders: US and NonUS
def determineUSfiles(YearFolder):
    Direc = basepath + "/Weather/Hourly/" + YearFolder
    # Keep plain files only; this skips the US and NonUS subfolders themselves
    files = [f for f in os.listdir(Direc) if os.path.isfile(os.path.join(Direc, f))]

    for file in files:
        # The Station ID is the file name without its 4-character extension (e.g. ".csv")
        sid = file[0:-4]
        country = sid_xref[sid_xref['Station ID'] == sid]['CTRY'].values.tolist()
        # IDs with no match in the cross-reference are treated as NonUS
        if len(country) > 0 and country[0] == 'US':
            dest = "US"
        else:
            dest = "NonUS"
        os.rename(os.path.join(Direc, file), os.path.join(Direc, dest, file))

# Years 2010 through 2019, plus 2023
for year in [str(y) for y in range(2010, 2020)] + ['2023']:
    determineUSfiles(year)
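
Before deleting anything, a quick count can confirm that every file was sorted. The helper below is a hypothetical addition, assuming the folder layout used above: a year folder containing `US` and `NonUS` subfolders.

```python
import os

def count_by_folder(year_dir):
    """Count loose files in year_dir and the entries in its US/NonUS subfolders."""
    counts = {
        # Files still sitting directly in the year folder were not sorted
        'unsorted': sum(os.path.isfile(os.path.join(year_dir, f))
                        for f in os.listdir(year_dir))
    }
    for sub in ('US', 'NonUS'):
        sub_dir = os.path.join(year_dir, sub)
        counts[sub] = len(os.listdir(sub_dir)) if os.path.isdir(sub_dir) else 0
    return counts

# Hypothetical usage:
# print(count_by_folder(basepath + '/Weather/Hourly/2010'))
```

An `unsorted` count of zero means nothing was left behind in the year folder.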
