# Pulling netCDF Files from NASA, Aggregating Them, and Converting to CSV
### Objective:
- NASA data has to be pulled down in a specific way (EarthData credentials plus wget)
- netCDF files are large and take considerably more resources to process for data analysis
- Convert netCDF to CSV, keeping only the variables and rows we need, to reduce the size of the data


### STEPS: 
* Ensure that you have the correct credentials to access the data
* Set up your environment to work with wget
* Use wget to pull down daily files from a subset link list provided by NASA
* Do some preliminary filtering
* Concatenate the data files
* Final output: a master CSV file

### Ensure that you have the following completed: 
* You have registered for an EarthData Login Profile: https://wiki.earthdata.nasa.gov/display/EL/How+To+Register+For+an+EarthData+Login+Profile
* You have authorized "NASA GESDISC DATA ARCHIVE": https://disc.gsfc.nasa.gov/earthdata-login
* You have generated the prerequisite files needed for the wget tool: https://disc.gsfc.nasa.gov/information/howto?title=How%20to%20Generate%20Earthdata%20Prerequisite%20Files
* You have downloaded the wget tool
    * This is a little tricky: I ran into authorization issues on the NASA site with the most recent version of wget, but had success with version 1.19.2
    * You can download from: https://eternallybored.org/misc/wget/
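
Before moving on, it can help to confirm that the prerequisite files actually exist. The sketch below is only an illustration: `missing_prerequisites` is a hypothetical helper, and the default file names (`.netrc`, `.urs_cookies`, `.dodsrc`) are an assumption based on the prerequisite-files how-to linked above, so adjust them to whatever you created.

```python
from pathlib import Path

# Assumed file names, based on the prerequisite-files how-to linked above.
PREREQUISITE_FILES = (".netrc", ".urs_cookies", ".dodsrc")


def missing_prerequisites(folder, names=PREREQUISITE_FILES):
    """Return the names from `names` that do not exist as files in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


if __name__ == "__main__":
    # Assumption: the files were created in your home directory.
    missing = missing_prerequisites(Path.home())
    if missing:
        print("Missing prerequisite files:", ", ".join(missing))
    else:
        print("All prerequisite files found in", Path.home())
```

An empty list means every expected file was found in the folder you checked.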

## Subset the data as needed

At the moment, the dataset that we are using can be found here: https://disc.gsfc.nasa.gov/api/jobs/results/6480ba7f9c692c7cd8c4a794

A handy subsetting tool is available on that page:

 ![Alt text](subset_image.jpg)

 From there, you can subset on a specific region, date range, etc. 

 ![Alt text](subset_region.jpg)

Once you are finished, you will be directed to this screen, where you need to download the list of links provided. The list should contain one link per day in the date range you subset, since each link points to a data file with one day of readings.

 ![Alt text](download_links.jpg)
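
A quick way to sanity-check the downloaded link list is to count its links and compare that with the number of days in your date range. This is a minimal sketch: `read_links` and `days_in_range` are hypothetical helpers, and it assumes the list is a plain text file with one URL per line.

```python
from datetime import date


def read_links(path):
    """Return the stripped lines of the link list that look like URLs."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip().startswith("http")]


def days_in_range(start, end):
    """Number of calendar days from start to end, both inclusive."""
    return (end - start).days + 1


# Example usage (the file name and dates are placeholders for your own subset):
# links = read_links("2014_subset_data_links.txt")
# print(len(links), "links for", days_in_range(date(2014, 9, 6), date(2014, 12, 31)), "days")
```

The list may also contain links that are not daily data files, so treat a small difference as a prompt to open the file and look rather than as an error.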

## Get the data with wget

If you have correctly completed all the previous steps, this should work for you:
* Open a command prompt on your PC
* Set your working directory to the location of your link list and the .dodsrc file you created (see the prerequisite files mentioned above)
* From the command line, run: 

wget --load-cookies <path of .urs_cookies file> --save-cookies <path of .urs_cookies file> --keep-session-cookies --user=< YOURUSERNAME > --ask-password -P <folder you want to save the data to> --content-disposition -i <link list .txt>

For example:

wget --load-cookies C:\Users\bucky\.urs_cookies --save-cookies C:\Users\bucky\.urs_cookies --keep-session-cookies --user=bbucky --ask-password -P data_2014/ --content-disposition -i 2014_subset_data_links.txt

* For reference, it takes me roughly 10 seconds to download one file

You should now have your netCDF files downloaded.
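
If you would rather assemble the wget call from Python (for example to loop over several years), the sketch below builds the same argument list as the command above. `build_wget_command` is a hypothetical helper and the example values are the ones from the example command; nothing is downloaded unless you pass the list to `subprocess.run` yourself.

```python
def build_wget_command(cookies_path, username, output_dir, link_list):
    """Assemble the wget call shown above as an argument list."""
    return [
        "wget",
        "--load-cookies", cookies_path,
        "--save-cookies", cookies_path,
        "--keep-session-cookies",
        f"--user={username}",
        "--ask-password",
        "-P", output_dir,
        "--content-disposition",
        "-i", link_list,
    ]


if __name__ == "__main__":
    command = build_wget_command(
        r"C:\Users\bucky\.urs_cookies",
        "bbucky",
        "data_2014/",
        "2014_subset_data_links.txt",
    )
    # Only print the command, so nothing is downloaded by accident.
    print(" ".join(command))
```

To run it, something like `subprocess.run(command, check=True)` from the standard library should work; wget will still prompt for the password because of `--ask-password`.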


## Convert and concatenate multiple netCDF files to master CSV file

In [1]:
import pandas as pd
import os
import netCDF4 as nc

# converting the datetime format
from datetime import datetime

## Path to netCDF files
- Point `path_a` at the directory on your PC that holds the downloaded netCDF files

In [2]:
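# Raw string so the backslash is not treated as an escape character
path_a = r'data\data_netcdf'

# Collect the path of each individual file
file_names = []

# os.listdir gives no ordering guarantee, so sort for a stable, name-based order
for file in sorted(os.listdir(path_a)):
    # Keep only the netCDF files (.nc4 extension)
    if file.endswith(".nc4"):
        # os.path.join inserts the correct separator for the operating system
        file_path = os.path.join(path_a, file)

        # Store the path of this file
        file_names.append(file_path)


# Check the first 10 file paths
file_names[:10]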

['data\\data_netcdf\\oco2_LtCO2_140906_B11014Ar_230329213249s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140907_B11014Ar_230329213320s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140908_B11014Ar_230329213350s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140909_B11014Ar_230329213426s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140910_B11014Ar_230329213451s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140911_B11014Ar_230329213526s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140912_B11014Ar_230329213557s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140913_B11014Ar_230329213633s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140914_B11014Ar_230329213705s.SUB.nc4',
 'data\\data_netcdf\\oco2_LtCO2_140915_B11014Ar_230329213733s.SUB.nc4']
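
The file names shown above appear to carry the date of the readings as `YYMMDD` right after `oco2_LtCO2_`. Assuming that layout holds for every file, the sketch below extracts the date and lists days with no file, which is a cheap check that the download is complete. `file_date` and `missing_days` are hypothetical helpers, not part of the original notebook.

```python
import re
from datetime import date, timedelta

# Assumption: the field after "oco2_LtCO2_" is the date as YYMMDD.
FILE_DATE = re.compile(r"oco2_LtCO2_(\d{2})(\d{2})(\d{2})_")


def file_date(file_path):
    """Return the date encoded in the file name, or None if it is absent."""
    match = FILE_DATE.search(file_path)
    if match is None:
        return None
    yy, mm, dd = (int(part) for part in match.groups())
    return date(2000 + yy, mm, dd)


def missing_days(file_paths):
    """Days between the earliest and latest file date that have no file."""
    found = {d for d in map(file_date, file_paths) if d is not None}
    if not found:
        return []
    day, last = min(found), max(found)
    gaps = []
    while day <= last:
        if day not in found:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps
```

Calling `missing_days(file_names)` returns an empty list when every day between the first and last file is present.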

# Check the total number of files in the directory

In [9]:
countFiles=0

for j in file_names:
    if j.endswith(".nc4"):
        countFiles+=1
        #print(j)
        
print('\nTotalFiles: ', countFiles)


TotalFiles:  777


# Example: 
### Opening a single file in netCDF format

In [10]:
df_xco2= nc.Dataset(file_names[9])
list(df_xco2.variables.keys())

['sounding_id_idx',
 'longitude',
 'latitude',
 'time',
 'epoch_dimension',
 'co2_profile_apriori',
 'date',
 'file_index',
 'pressure_levels',
 'pressure_weight',
 'sensor_zenith_angle',
 'solar_zenith_angle',
 'vertex_latitude',
 'vertex_longitude',
 'xco2',
 'xco2_apriori',
 'xco2_averaging_kernel',
 'xco2_qf_bitflag',
 'xco2_quality_flag',
 'xco2_uncertainty',
 'sounding_id',
 'levels',
 'vertices']

# DateTime format Change

In [11]:
# Convert a sounding_id (digits for year, month, day, hour, minute, second, fraction) to a datetime
def conv_date(d):
    return datetime.strptime(str(d), '%Y%m%d%H%M%S%f')
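
To see what the format string does, the sketch below runs the same conversion on a single value. The `sample_id` is hypothetical, chosen only to match the digit layout that the format string expects; `conv_date` is repeated so the example runs on its own.

```python
from datetime import datetime


def conv_date(d):
    # Same conversion as above, repeated so this example is self-contained.
    return datetime.strptime(str(d), '%Y%m%d%H%M%S%f')


# Hypothetical sounding_id: 2014-09-15 00:14:23 followed by the digits "34".
sample_id = 2014091500142334
parsed = conv_date(sample_id)
print(parsed)  # 2014-09-15 00:14:23.340000
```

Note that `%f` reads all remaining digits (up to six) as a fraction of a second, so the trailing `34` becomes 0.34 s. If those trailing digits encode anything other than fractional seconds, they would need to be split off before parsing.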

### NOTE:
- Keep only the rows of the dataframe with a good quality flag (`xco2_quality_flag` equal to 0)

In [12]:
# Function to convert one netCDF file into a filtered DataFrame

def convHdf(path_file, n=0):
    # n is not used inside the function; it is kept so calls like convHdf(path, j) still work

    # Open the file; the with-block closes it again when we are done
    with nc.Dataset(path_file) as data:

        # If the file has no xco2 variable, report it and return None
        if 'xco2' not in data.variables.keys():
            print(path_file," is missing xco2 data")
            return None

        df_xco2= pd.DataFrame()

        # To experiment with other variables, pick names from
        # list(data.variables.keys()) and add them the same way as below:
        #     df_xco2[name]= data.variables[name][:]

        df_xco2['Xco2']= data.variables['xco2'][:]
        df_xco2['Latitude']= data.variables['latitude'][:]
        df_xco2['Longitude']= data.variables['longitude'][:]
        df_xco2['xco2_quality_flag']= data.variables['xco2_quality_flag'][:]

        # sounding_id holds the date and time of each reading
        df_xco2['DateTime']= data.variables['sounding_id'][:]

        # First sounding_id as a string, only needed for the optional per-file CSV below
        date= str(data.variables['sounding_id'][0])

        # Optional: create a CSV per file and store it in a new folder: csv_files
        # (uses the filtered dataframe, so move it below the filter if you enable it)
        #df_xco2.to_csv('csv_files'+'/'+ data.Sensor+'_xco2_'+ date+'_.txt', index= False)

    # Convert sounding_id to datetime format
    df_xco2['DateTime']= df_xco2['DateTime'].apply(conv_date)
    df_xco2['DateTime']= pd.to_datetime(df_xco2['DateTime'])

    # Year, month and day columns
    df_xco2['Year']= df_xco2['DateTime'].dt.year
    df_xco2['Month']= df_xco2['DateTime'].dt.month
    df_xco2['Day']= df_xco2['DateTime'].dt.day

    # Keep only the rows with a good quality flag (0)
    # NOTE: this reduces the size of the data
    df_xco2= df_xco2[df_xco2['xco2_quality_flag'] == 0]

    return df_xco2

# Testing: Single file transformation

In [13]:
convHdf(file_names[9])

| | Xco2 | Latitude | Longitude | xco2_quality_flag | DateTime | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|
| 7 | 392.113342 | 23.887823 | -165.322113 | 0 | 2014-09-15 00:14:23.340 | 2014 | 9 | 15 |
| 12 | 395.041077 | 23.906345 | -165.327133 | 0 | 2014-09-15 00:14:23.740 | 2014 | 9 | 15 |
| 22 | 394.904938 | 23.969070 | -165.351288 | 0 | 2014-09-15 00:14:24.730 | 2014 | 9 | 15 |
| 26 | 393.841583 | 23.980486 | -165.347305 | 0 | 2014-09-15 00:14:25.040 | 2014 | 9 | 15 |
| 33 | 395.470764 | 23.999025 | -165.352356 | 0 | 2014-09-15 00:14:25.340 | 2014 | 9 | 15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20998 | 392.676025 | 47.035870 | -159.644638 | 0 | 2014-09-15 23:25:51.750 | 2014 | 9 | 15 |
| 20999 | 393.736023 | 47.032795 | -159.626373 | 0 | 2014-09-15 23:25:51.760 | 2014 | 9 | 15 |
| 21001 | 393.991211 | 47.026253 | -159.590057 | 0 | 2014-09-15 23:25:51.780 | 2014 | 9 | 15 |
| 21002 | 393.972900 | 47.053600 | -159.653290 | 0 | 2014-09-15 23:25:52.050 | 2014 | 9 | 15 |


## Concatenate all files into one master dataframe

In [14]:
# Use convHdf to read every netCDF file from the directory and collect the results

frames = []

for j in range(len(file_names)):
    df = convHdf(file_names[j], j)

    # convHdf returns None for files without xco2 data; skip those
    if df is not None:
        frames.append(df)

# Concatenate once at the end instead of growing df_all inside the loop
df_all = pd.concat(frames, ignore_index=True)

df_all[:10]

data\data_netcdf\oco2_LtCO2_141023_B11014Ar_230329200159s.SUB.nc4  is missing xco2 data
data\data_netcdf\oco2_LtCO2_150129_B11014Ar_230322180219s.SUB.nc4  is missing xco2 data
data\data_netcdf\oco2_LtCO2_151026_B11014Ar_230222200809s.SUB.nc4  is missing xco2 data


| | Xco2 | Latitude | Longitude | xco2_quality_flag | DateTime | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|
| 0 | 389.764069 | 81.391083 | -74.009087 | 0 | 2014-09-06 13:48:24.760 | 2014 | 9 | 6 |
| 1 | 388.759888 | 81.399887 | -73.966896 | 0 | 2014-09-06 13:48:24.770 | 2014 | 9 | 6 |
| 2 | 389.952484 | 81.41291 | -74.224922 | 0 | 2014-09-06 13:48:25.370 | 2014 | 9 | 6 |
| 3 | 392.939789 | 77.179932 | -66.461746 | 0 | 2014-09-06 15:25:20.030 | 2014 | 9 | 6 |
| 4 | 389.582825 | 77.195396 | -66.5214 | 0 | 2014-09-06 15:25:20.330 | 2014 | 9 | 6 |
| 5 | 391.755737 | 77.201897 | -66.670692 | 0 | 2014-09-06 15:25:20.710 | 2014 | 9 | 6 |
| 6 | 390.422668 | 77.206406 | -66.626122 | 0 | 2014-09-06 15:25:20.720 | 2014 | 9 | 6 |
| 7 | 391.380463 | 77.221825 | -66.686157 | 0 | 2014-09-06 15:25:21.020 | 2014 | 9 | 6 |
| 8 | 390.483276 | 77.241692 | -66.701355 | 0 | 2014-09-06 15:25:21.330 | 2014 | 9 | 6 |
| 9 | 393.851318 | 77.252609 | -66.806648 | 0 | 2014-09-06 15:25:21.720 | 2014 | 9 | 6 |


## Write dataframe to csv

In [15]:
df_all.to_csv("data_master_csv")
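
Since the objective is a smaller dataset, it can be worth comparing the size of the netCDF folder with the master file once it is written. The sketch below is one way to do that with the standard library only; `total_size_bytes` is a hypothetical helper, and the paths in the commented usage are the ones used earlier in this notebook.

```python
from pathlib import Path


def total_size_bytes(folder, suffix):
    """Sum the sizes of all files directly inside `folder` ending in `suffix`."""
    return sum(
        p.stat().st_size
        for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


# Example usage (assumes the folder and file names used above):
# netcdf_bytes = total_size_bytes("data/data_netcdf", ".nc4")
# csv_bytes = Path("data_master_csv").stat().st_size
# print(f"netCDF: {netcdf_bytes / 1e6:.1f} MB, CSV: {csv_bytes / 1e6:.1f} MB")
```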

## Check the breakdown of readings per day

In [16]:
df_grouped=df_all.groupby(['Year','Month','Day']).size().reset_index().rename(columns={0:'count'})
df_grouped.to_csv("OVERVIEW_data_master")
