# Pulling netCDF Files from NASA, Aggregating, Filtering, and Converting to CSV
### Objective:
- NASA data has to be pulled down in a specific way (Earthdata credentials plus wget)
- netCDF files are large and take considerably more resources to process during data analysis
- Converting netCDF to CSV and filtering along the way reduces the size of the data


### STEPS: 
* Ensure that you have the correct credentials to access the data
* Set up your environment to work with wget
* Use wget to pull down daily files from a subset link provided by NASA
* Do some preliminary filtering
* Concatenate the data files
* Final output: CSV files for future analysis

### Ensure that you have the following completed: 
* You have registered for an EarthData Login Profile: https://wiki.earthdata.nasa.gov/display/EL/How+To+Register+For+an+EarthData+Login+Profile
* You have authorized "NASA GESDISC DATA ARCHIVE": https://disc.gsfc.nasa.gov/earthdata-login
* Generate the prerequisite files needed for the wget tool: https://disc.gsfc.nasa.gov/information/howto?title=How%20to%20Generate%20Earthdata%20Prerequisite%20Files
* Download the wget tool 
    * This is a little tricky: I ran into authorization issues on the NASA site with the most recent version of wget, but had success with version 1.19.2
    * You can download from: https://eternallybored.org/misc/wget/
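
The prerequisite files from the how-to linked above can also be created from Python. This is a minimal sketch, assuming the file formats described in the GES DISC how-to at the time of writing (a one-line `.netrc`, an empty `.urs_cookies`, and a `.dodsrc` pointing at both). The helper name `write_earthdata_files` and its arguments are hypothetical, so check the result against the how-to before relying on it.

```python
from pathlib import Path


def write_earthdata_files(home, username, password):
    """Create .netrc, .urs_cookies and .dodsrc under `home` (hypothetical helper)."""
    home = Path(home)
    netrc = home / ".netrc"
    cookies = home / ".urs_cookies"
    dodsrc = home / ".dodsrc"

    # Assumed format: one line naming the Earthdata login host and your credentials
    netrc.write_text(
        f"machine urs.earthdata.nasa.gov login {username} password {password}\n"
    )
    # The cookie file only needs to exist; wget fills it in later
    cookies.touch()
    # Assumed format: .dodsrc points at the two files created above
    dodsrc.write_text(f"HTTP.NETRC={netrc}\nHTTP.COOKIEJAR={cookies}\n")
    return netrc, cookies, dodsrc
```

Note that `.netrc` stores the password in plain text, so keep it out of shared folders and version control.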

## Subset the data as needed

At the moment, the dataset that we are using can be found here: https://disc.gsfc.nasa.gov/api/jobs/results/6480ba7f9c692c7cd8c4a794

A handy subsetting tool is available:

 ![alt text](Images\subset_image.jpg)

 From there, you can subset on a specific region, date range, etc. 

 ![Alt text](Images\subset_region.jpg)

Once you are finished, you will be directed to this screen, where you need to download the list of links provided. The list should be as long as the number of days that you are subsetting from, since each link points to a data file for one day of readings.

 ![Alt text](Images\download_links_2014_2016.jpg)
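
Since each link should correspond to one day, a quick sanity check is to compare the number of links with the number of days in the chosen date range. The sketch below assumes the link list is a plain text file with one URL per line; the function names and the sample URLs are hypothetical. A mismatch is only a prompt to look closer, because the list may contain non-data links or some days may have no file.

```python
from datetime import date


def expected_daily_files(start, end):
    """Number of calendar days in the inclusive range start..end."""
    return (end - start).days + 1


def count_links(link_text):
    """Count lines that look like download URLs, skipping blanks and other lines."""
    return sum(
        1 for line in link_text.splitlines()
        if line.strip().lower().startswith("http")
    )


# Made-up two-line link list, for illustration only
sample = "https://example.invalid/file_a.nc4\n\nhttps://example.invalid/file_b.nc4\n"
print(count_links(sample), "links;",
      expected_daily_files(date(2014, 9, 6), date(2014, 9, 7)), "days")
```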

## Get the data with wget

If you have correctly completed all the previous steps, this should work for you.
* Open a command prompt on your PC
* Set your working directory to the location of your link list as well as the .dodsrc file you created (see the prerequisite files mentioned above)
* From the command line, run: 

`wget --load-cookies <path of .urs_cookies file> --save-cookies <path of .urs_cookies file> --keep-session-cookies --user=<YOURUSERNAME> --ask-password -P <folder you want to save the data to> --content-disposition -i <link list .txt>`

For example:

`wget --load-cookies C:\Users\badger\.urs_cookies --save-cookies C:\Users\badger\.urs_cookies --keep-session-cookies --user=bbadger --ask-password -P data_2014/ --content-disposition -i 2014_subset_data_links.txt`

* For reference, it takes me roughly 5 seconds to download one file

You should now have your netCDF files downloaded.
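
If you would rather launch the download from Python, the same command can be assembled as an argument list and passed to `subprocess`. This is a sketch only: it uses the flags from the command above and nothing else, and the helper names `build_wget_command` and `download` are hypothetical. It assumes `wget` is on your PATH.

```python
import subprocess


def build_wget_command(cookie_file, username, out_dir, link_list):
    """Assemble the wget call shown above as a list of arguments."""
    return [
        "wget",
        "--load-cookies", cookie_file,
        "--save-cookies", cookie_file,
        "--keep-session-cookies",
        f"--user={username}",
        "--ask-password",
        "-P", out_dir,
        "--content-disposition",
        "-i", link_list,
    ]


def download(cmd):
    """Run wget; it will prompt for the password because of --ask-password."""
    subprocess.run(cmd, check=True)


cmd = build_wget_command(
    r"C:\Users\badger\.urs_cookies", "bbadger", "data_2014/",
    "2014_subset_data_links.txt",
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces. At roughly 5 seconds per file, a list of about 2,900 links would take on the order of four hours.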


## Convert and concatenate multiple netCDF files to master CSV file

In [2]:
import pandas as pd
import os
import netCDF4 as nc
import numpy as np
import dask.dataframe as dd
import geopandas as gpd
from shapely.geometry import Point
from datetime import datetime

## Path to NETCDF files and open counties shapefile
- Set `path_a` to the directory on your PC that holds the downloaded netCDF files, and load the county boundaries shapefile

In [3]:
counties = gpd.GeoDataFrame.from_file(r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\us-county-boundaries\us-county-boundaries.shp")
path_a= (r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_all")

# Collect the path of each individual file
file_names = []

for file in os.listdir(path_a):
    # Keep only files with the .nc4 extension
    if file.endswith(".nc4"):
        # Build the full path in an OS-independent way
        file_path = os.path.join(path_a, file)

        # Store the path of each individual file
        file_names.append(file_path)


# Check the first 10 file paths
#file_names[:10]
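
An alternative sketch using `pathlib`, which is not what the cell above does but gives the same list in sorted order. `os.listdir` makes no promise about ordering, so sorting makes `file_names[0]` predictable. The helper name `list_nc4_files` is hypothetical.

```python
from pathlib import Path


def list_nc4_files(directory):
    """Return the .nc4 files in `directory` as sorted path strings."""
    return sorted(str(p) for p in Path(directory).glob("*.nc4"))
```

Usage would be `file_names = list_nc4_files(path_a)`.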

# Check the total number of files in the directory

In [5]:
countFiles=0

for j in file_names:
    if j.endswith(".nc4"):
        countFiles+=1
        #print(j)
        
print('\nTotalFiles: ', countFiles)


TotalFiles:  2943


# Example: 
### Opening a single file in netCDF format

In [6]:
df_xco2= nc.Dataset(file_names[0])
list(df_xco2.variables.keys())

['sounding_id_idx',
 'longitude',
 'latitude',
 'time',
 'epoch_dimension',
 'co2_profile_apriori',
 'date',
 'file_index',
 'pressure_levels',
 'pressure_weight',
 'sensor_zenith_angle',
 'solar_zenith_angle',
 'vertex_latitude',
 'vertex_longitude',
 'xco2',
 'xco2_apriori',
 'xco2_averaging_kernel',
 'xco2_qf_bitflag',
 'xco2_quality_flag',
 'xco2_uncertainty',
 'sounding_id',
 'levels',
 'vertices']

## Function for DateTime format change

In [7]:
# DATE time function
def conv_date(d):
    return datetime.strptime(str(d), '%Y%m%d%H%M%S%f')
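
To see what `conv_date` does, here is a small standalone check. The `sample_id` value is an assumption chosen for illustration, not read from a file. The format string consumes year, month, day, hour, minute and second, and `%f` reads whatever digits remain as a fraction of a second.

```python
from datetime import datetime


def conv_date(d):
    return datetime.strptime(str(d), '%Y%m%d%H%M%S%f')


# Hypothetical 16-digit sounding_id: 2014-09-07 17:37:46 plus the digits "36"
sample_id = 2014090717374636
print(conv_date(sample_id))  # prints 2014-09-07 17:37:46.360000
```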

### Function for Dataframe building
- Ensure that the netCDF file has an xco2 variable; if it does not, record the file so that we can try to redownload it and append it to the final dataframe
- Convert the date/time columns to the correct type
- Create a 'coords' column populated with Point types so that we can match state, county, geoid to each xco2 reading
- Only use variables of interest

In [5]:
# FUNCTION to convert data

# Keep track of any files that are missing xco2 data
missing_data=[]

# The function takes the path to a single netCDF file; quality filtering is applied later
def convHdf(path_file):

    # Open the file as netCDF
    data= nc.Dataset(path_file)

    # If the xco2 variable is absent, report it, add the file to missing_data, and return None
    if 'xco2' not in data.variables.keys():
        print(path_file," is missing xco2 data")
        missing_data.append(path_file)
        return 

    #creating an empty dataframe to populate
    df_xco2= pd.DataFrame()

    #creating and populating columns
    df_xco2['xco2']= data.variables['xco2'][:]
    df_xco2['Latitude']= data.variables['latitude'][:]
    df_xco2['Longitude']= data.variables['longitude'][:] 
    df_xco2['xco2_quality_flag']= data.variables['xco2_quality_flag'][:]

    # Convert soundingID to datetime format
    df_xco2['DateTime']= data.variables['sounding_id'][:]
    # Applying the conv_date function we created previously
    df_xco2['DateTime']= df_xco2['DateTime'].apply(conv_date)
    df_xco2['DateTime']= pd.to_datetime(df_xco2['DateTime'])
    
    # Creating specific date columns
    df_xco2['Year']= df_xco2['DateTime'].dt.year
    df_xco2['Month']= df_xco2['DateTime'].dt.month
    df_xco2['Day']= df_xco2['DateTime'].dt.day

    #Creating a point field from longitude and latitude for matching the geoid,county,state info from the .shp geo data file
    df_xco2['coords'] = list(zip(df_xco2['Longitude'],df_xco2['Latitude']))
    df_xco2['coords'] = df_xco2['coords'].apply(Point)
    points = gpd.GeoDataFrame(df_xco2, geometry='coords', crs=counties.crs)
    df_xco2 = gpd.tools.sjoin(points, counties, predicate="within", how='left')

    #dropping any rows that have not been assigned a county name, we can assume the reading was taken outside of the united states
    df_xco2.dropna(subset=['name'], inplace=True)

    #renaming and choosing only the fields that we want
    df_xco2.rename(columns={"name": "county_name"},inplace=True)
    df_xco2=df_xco2[["county_name","state_name","geoid","DateTime","Year","Month","Day","Latitude","Longitude","xco2","xco2_quality_flag",]]

    return df_xco2

# Testing: Single file transformation

In [8]:
test=convHdf(file_names[1])

print(len(test),"non filtered rows")
display(test[:10])

9617 non filtered rows


| | county_name | state_name | geoid | DateTime | Year | Month | Day | Latitude | Longitude | xco2 | xco2_quality_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 503 | Hancock | Maine | 23009 | 2014-09-07 17:37:46.360 | 2014 | 9 | 7 | 44.158154 | -68.420464 | 393.300018 | 0 |
| 504 | Hancock | Maine | 23009 | 2014-09-07 17:37:46.370 | 2014 | 9 | 7 | 44.154636 | -68.406906 | 395.017761 | 0 |
| 505 | Hancock | Maine | 23009 | 2014-09-07 17:37:46.770 | 2014 | 9 | 7 | 44.174461 | -68.413887 | 393.613922 | 0 |
| 506 | Hancock | Maine | 23009 | 2014-09-07 17:37:49.340 | 2014 | 9 | 7 | 44.34333 | -68.511208 | 393.447845 | 0 |
| 507 | Hancock | Maine | 23009 | 2014-09-07 17:37:49.350 | 2014 | 9 | 7 | 44.340012 | -68.497513 | 391.911407 | 1 |
| 508 | Hancock | Maine | 23009 | 2014-09-07 17:37:49.750 | 2014 | 9 | 7 | 44.359821 | -68.504669 | 391.107697 | 1 |
| 509 | Hancock | Maine | 23009 | 2014-09-07 17:37:50.330 | 2014 | 9 | 7 | 44.406162 | -68.546303 | 395.608612 | 1 |
| 510 | Hancock | Maine | 23009 | 2014-09-07 17:37:50.710 | 2014 | 9 | 7 | 44.43235 | -68.581123 | 393.809479 | 0 |
| 511 | Hancock | Maine | 23009 | 2014-09-07 17:37:50.720 | 2014 | 9 | 7 | 44.429253 | -68.567215 | 392.958099 | 0 |
| 512 | Hancock | Maine | 23009 | 2014-09-07 17:37:50.730 | 2014 | 9 | 7 | 44.426067 | -68.553291 | 391.530762 | 0 |


## Concatenate all files into one master dataframe (quality flags not yet filtered) and identify which days may be missing data

In [70]:
# Use the function to read every file in the directory and combine the results

# Convert each file; convHdf returns None for files without xco2, so skip those
frames = []
for path in file_names:
    df = convHdf(path)
    if df is not None:
        frames.append(df)

# Concatenate once at the end instead of growing the dataframe inside the loop
df_xco2 = pd.concat(frames, ignore_index=True)


display(df_xco2.head())
display(df_xco2.tail())

C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_all\oco2_LtCO2_141023_B11014Ar_230329200159s.SUB.nc4  is missing xco2 data
C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_all\oco2_LtCO2_150129_B11014Ar_230322180219s.SUB.nc4  is missing xco2 data
C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_all\oco2_LtCO2_151026_B11014Ar_230222200809s.SUB.nc4  is missing xco2 data
C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_all\oco2_LtCO2_161229_B11014Ar_230111184801s.SUB.nc4  is missing xco2 data
C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_all\oco2_LtCO2_171024_B11014Ar_221212230141s.SUB.nc4  is missing xco2 data
C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_all\oco

| | county_name | state_name | geoid | DateTime | Year | Month | Day | Latitude | Longitude | xco2 | xco2_quality_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Moore | North Carolina | 37125 | 2014-09-06 18:30:51.370 | 2014 | 9 | 6 | 35.101131 | -79.464561 | 388.310669 | 1 |
| 1 | Moore | North Carolina | 37125 | 2014-09-06 18:30:51.730 | 2014 | 9 | 6 | 35.141167 | -79.513474 | 385.072388 | 1 |
| 2 | Moore | North Carolina | 37125 | 2014-09-06 18:30:52.080 | 2014 | 9 | 6 | 35.135918 | -79.465286 | 388.949951 | 1 |
| 3 | Moore | North Carolina | 37125 | 2014-09-06 18:30:52.380 | 2014 | 9 | 6 | 35.155876 | -79.470993 | 386.402069 | 1 |
| 4 | Moore | North Carolina | 37125 | 2014-09-06 18:30:53.330 | 2014 | 9 | 6 | 35.240818 | -79.542213 | 391.193634 | 1 |


| | county_name | state_name | geoid | DateTime | Year | Month | Day | Latitude | Longitude | xco2 | xco2_quality_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 14960030 | North Slope | Alaska | 2185 | 2023-03-31 22:48:39.780 | 2023 | 3 | 31 | 68.875946 | -163.871399 | 414.965546 | 1 |
| 14960031 | North Slope | Alaska | 2185 | 2023-03-31 22:48:40.040 | 2023 | 3 | 31 | 68.943138 | -164.020432 | 401.649536 | 1 |
| 14960032 | North Slope | Alaska | 2185 | 2023-03-31 22:48:40.050 | 2023 | 3 | 31 | 68.941261 | -163.990616 | 402.592468 | 1 |
| 14960033 | North Slope | Alaska | 2185 | 2023-03-31 22:48:40.070 | 2023 | 3 | 31 | 68.936943 | -163.931183 | 403.559906 | 1 |
| 14960034 | North Slope | Alaska | 2185 | 2023-03-31 22:48:40.080 | 2023 | 3 | 31 | 68.93438 | -163.901611 | 406.664581 | 1 |


## Identify missing data and search through the NASA subset function

### For whatever reason, I found that these days in the NASA database are missing xco2 data, which is something we should make note of

In [None]:
path_b= (r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\data_netcdf_missing")

# Collect the path of each individual file
file_names_missing= []

for file in os.listdir(path_b):
    # Keep only files with the .nc4 extension
    if file.endswith(".nc4"):
        # Build the path from path_b, the directory being listed
        file_path = os.path.join(path_b, file)

        # Store the path of each individual file
        file_names_missing.append(file_path)


# Check the first 10 file paths
print(file_names_missing[:10])

# Print the variables of every file in the missing-data directory, starting from the first
for path in file_names_missing:
    # Use a separate name so the missing_data list built earlier is not overwritten
    missing_ds = nc.Dataset(path)

    print(missing_ds.variables.keys())
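
To turn the "is missing xco2 data" log into a list of affected days, the date can be read from each file name. This is a sketch under the assumption that the six digits after `LtCO2_` are the observation date in YYMMDD form, which is consistent with the file names printed earlier but is not confirmed here against NASA's naming documentation. The helper name `observation_date` is hypothetical.

```python
import re
from datetime import datetime


def observation_date(file_name):
    """Extract the YYMMDD field that follows 'LtCO2_' and return it as a date."""
    match = re.search(r"LtCO2_(\d{6})_", file_name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%y%m%d").date()


# Two of the file names reported in the log output earlier in this notebook
log_names = [
    "oco2_LtCO2_141023_B11014Ar_230329200159s.SUB.nc4",
    "oco2_LtCO2_150129_B11014Ar_230322180219s.SUB.nc4",
]
for name in log_names:
    print(observation_date(name), name)
```

The same function can be applied to every entry collected in the list of files missing xco2 to get the dates to search for in the NASA subset tool.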


## Write base data to csv

In [273]:
local_path=r'C:/Users/ddrye/OneDrive/Documents/OMSA_Program/OMSA 2023/Summer2023/Practicum/off_git/data/'

df_xco2.to_csv(local_path+'OCO2_BASE_2014-2023_V1.csv',index=False) 

## Filter out bad quality flags and calculate stats and rates of interest

In [343]:
def variables(df,filter_quality=True):

    # Keep only readings whose quality flag is 0 (not flagged as bad)
    if filter_quality:
        df=df[df['xco2_quality_flag']==0]
    
    # Only keep years where we have values for the entire year, not just a few months
    df = df.loc[(df['Year'] < 2023) & (df['Year'] > 2014) ]
    
    #add new variables
    counts=df.groupby(['geoid','county_name','state_name','Year'], as_index=False)["xco2"].size().rename(columns={'size':'readings_count'})
    mean=df.groupby(['geoid','Year'], as_index=False)["xco2"].mean().rename(columns={'xco2':'avg_xco2'})
    std_deviation=df.groupby(['geoid','Year'], as_index=False)["xco2"].std().rename(columns={'xco2':'stddev_xco2'})
    
    intermediate_df=pd.merge(counts, mean[['geoid','Year','avg_xco2']], on=['geoid','Year'])

    # Year-over-year ratio of the county mean (pct_change + 1), computed within each geoid
    intermediate_df['pct_change'] = intermediate_df.groupby('geoid')['avg_xco2'].pct_change() + 1

    intermediate_df=pd.merge(intermediate_df, std_deviation[['geoid','Year','stddev_xco2']], on=['geoid','Year'])

    intermediate_df["delta"] = intermediate_df.groupby(['geoid'])['avg_xco2'].diff()
    intermediate_df["total_delta"] = intermediate_df.groupby(['geoid'])['delta'].cumsum()
    

    final_df=intermediate_df[["geoid","county_name","state_name","Year","readings_count","stddev_xco2","avg_xco2","delta","total_delta","pct_change"]]
            
    return final_df
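
The arithmetic behind `delta`, `total_delta` and `pct_change` can be checked by hand. The sketch below redoes it in plain Python for a single county, using three yearly means from the Autauga rows of the output table in this notebook. The helper name `yearly_changes` is hypothetical and is not part of the pipeline.

```python
def yearly_changes(yearly_means):
    """Given {year: mean xco2}, return rows of (year, delta, total_delta, ratio)."""
    rows = []
    previous = None
    running_total = 0.0
    for year in sorted(yearly_means):
        mean = yearly_means[year]
        if previous is None:
            # The first year has nothing to compare against
            rows.append((year, None, None, None))
        else:
            delta = mean - previous          # change from the previous year
            running_total += delta           # cumulative change since the first year
            rows.append((year, delta, running_total, mean / previous))
        previous = mean
    return rows


autauga = {2015: 396.257507, 2016: 403.380249, 2017: 403.876709}
for row in yearly_changes(autauga):
    print(row)
```

Note that the `pct_change` column is a ratio rather than a percentage: a value of about 1.018 means the yearly mean rose by about 1.8% over the previous year.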

## Apply the new function

In [344]:
df_copy=df_xco2.copy()
df_xco2_v1=variables(df_copy,filter_quality=True)

display(df_xco2_v1.head())
display(df_xco2_v1.tail())

| | geoid | county_name | state_name | Year | readings_count | stddev_xco2 | avg_xco2 | delta | total_delta | pct_change |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Autauga | Alabama | 2015 | 43 | 0.945703 | 396.257507 | | | |
| 1 | 1001 | Autauga | Alabama | 2016 | 268 | 1.087421 | 403.380249 | 7.122742 | 7.122742 | 1.017975 |
| 2 | 1001 | Autauga | Alabama | 2017 | 67 | 1.19092 | 403.876709 | 0.49646 | 7.619202 | 1.001231 |
| 3 | 1001 | Autauga | Alabama | 2018 | 437 | 2.605162 | 408.490875 | 4.614166 | 12.233368 | 1.011425 |
| 4 | 1001 | Autauga | Alabama | 2019 | 355 | 1.044961 | 410.446869 | 1.955994 | 14.189362 | 1.004788 |


| | geoid | county_name | state_name | Year | readings_count | stddev_xco2 | avg_xco2 | delta | total_delta | pct_change |
|---|---|---|---|---|---|---|---|---|---|---|
| 21781 | 56045 | Weston | Wyoming | 2018 | 449 | 2.742509 | 406.202148 | 0.424316 | 6.01474 | 1.001046 |
| 21782 | 56045 | Weston | Wyoming | 2019 | 270 | 0.492759 | 410.860779 | 4.65863 | 10.67337 | 1.011469 |
| 21783 | 56045 | Weston | Wyoming | 2020 | 625 | 2.966075 | 412.226837 | 1.366058 | 12.039429 | 1.003325 |
| 21784 | 56045 | Weston | Wyoming | 2021 | 69 | 1.06886 | 413.128815 | 0.901978 | 12.941406 | 1.002188 |
| 21785 | 56045 | Weston | Wyoming | 2022 | 350 | 1.919243 | 418.56955 | 5.440735 | 18.382141 | 1.01317 |


## Write cleaned df to csv

In [345]:
df_xco2_v1.to_csv(local_path+'OCO2_CLEANED_2015-2022_V1.csv',index=False) 