# Extracting TROPOMI data for Nepal through Earthdata

<b>This notebook is a complete guide to extracting TROPOMI data for Nepal.</b>

The data is available here: [Link](https://measures.gesdisc.eosdis.nasa.gov/data/MINDS/TROPOMI_MINDS_NO2.1.1/)

## Step 1: Downloading the data

It is recommended that you download the data in bulk using <b>wget</b>. Once the data for 2019 has been successfully downloaded, the directory structure should look like this:

    2019 -> 
        001 
        002
        003
        005
        ---
        ---
        364
        365

Each of the subfolders inside the 2019 directory will contain 14-15 .nc files, one per orbit. Therefore, for a given year, there will be roughly 365 * 14 = 5,110 files.

Our task is to go through each subfolder inside a given year's folder, such as 2019, and extract the TROPOMI data for Nepal.

To do so, we will use Python, along with the following libraries:
1. NumPy
2. pandas
3. netCDF4

NumPy will be used for array manipulation, pandas for storing the data in tabular format, and netCDF4 for reading data from the .nc files.
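
Before extracting anything, it can help to confirm that the download produced the layout described above. The following is a minimal sketch and not part of the original notebook: the helper name `count_nc_files_per_day` is hypothetical, and it only assumes that each day folder has a purely numeric name and holds `.nc` files.

```python
import os


def count_nc_files_per_day(year_dir):
    """Return {day_folder_name: number_of_nc_files} for one year's folder.

    Hypothetical helper: only subfolders with purely numeric names
    (such as 001 ... 365) are treated as day folders.
    """
    counts = {}
    for entry in sorted(os.listdir(year_dir)):
        day_path = os.path.join(year_dir, entry)
        if os.path.isdir(day_path) and entry.isdigit():
            # Count only the .nc files; other file types are ignored
            counts[entry] = sum(
                1 for name in os.listdir(day_path) if name.endswith(".nc")
            )
    return counts
```

Calling `count_nc_files_per_day("2019")` (with the path adjusted to your own download folder) returns one entry per day folder, so days with zero or unusually few files are easy to spot before running the extraction.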

# 1. Importing dependencies

In [1]:
# This module will be used to access data from your local computer
import os 

# This module is used to stop the program when a specific condition is met,
# for example when there is no data inside the folder: sys.exit() then stops the program execution
import sys 

# Datetime module is simply being used to calculate the total time taken by the program to extract the data
from datetime import datetime

# Importing the pandas library as pd alias to store and manipulate data in dataframes
import pandas as pd

# This library will be used to access data from .nc files
from netCDF4 import Dataset

# Importing NumPy to perform array manipulation, as the data in the .nc files is stored as multidimensional arrays
import numpy as np

# 2. Setting up the code to dynamically access specific day's data inside 2019 folder

In [2]:
# Initializing an empty list to store the names of .nc files that were corrupted 
files_with_errors = []

# Initializing an empty list that will contain the path of every file present in the selected subdirectories
path_  = []

# Initializing an empty pandas dataframe that will hold the data for necessary features like Latitude, Longitude etc.
concatenated_dataframe = pd.DataFrame({'ColumnAmountNO2Trop':[], 'Latitude': [], 'Longitude': [], 'Cloud_Fraction': [], 'QA_Value':[],
'Date': [], 'Time': []})

In [5]:
print("\nPlease press enter if the python file and .nc file are in same path\n")
selected_directory = input("Enter the path to the directory where corresponding files are stored: ")
if selected_directory == "":
    selected_directory = os.getcwd()

# Here, an infinite loop is created to ensure that the user will select two folders in ascending order
''' 
Example: Start = 001, End = 100
This is valid and Nepal's data from subfolder 001 to 100 inside 2019 folder will be extracted

However, this is invalid: Start = 040, End = 030
'''
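while True:
    try:
        start = int(input('Starting Folder: '))
        end = int(input('Ending Folder: '))
    except ValueError:
        # int() raises ValueError when the input is not a whole number
        print('Folder values must be whole numbers, for example 1 and 100')
        print('please proceed again')
        continue
    if start > end:
        print('Starting folder value cannot be greater than ending folder')
        print('please proceed again')
    else:
        break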


Please press enter if the python file and .nc file are in same path

Enter the path to the directory where corresponding files are stored: Enter the path in your system where data exists
Starting Folder: 1
Ending Folder: 100
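
The validation done by the loop above can also be written as a small function, which makes it easier to test without typing input by hand. This is only a sketch: the name `parse_day_range` is hypothetical, and the upper limit of 366 is an assumption based on day-of-year folder names, not something the original code enforces.

```python
def parse_day_range(start_text, end_text):
    """Convert two user-supplied strings into a validated (start, end) pair.

    Raises ValueError when either value is not a whole number from 1 to 366
    (an assumed limit for day-of-year folders) or when start is greater
    than end.
    """
    start = int(start_text)  # int() itself raises ValueError on bad input
    end = int(end_text)
    if not (1 <= start <= 366 and 1 <= end <= 366):
        raise ValueError("day-of-year folders must be between 1 and 366")
    if start > end:
        raise ValueError("starting folder cannot be greater than ending folder")
    return start, end
```

For example, `parse_day_range("001", "100")` returns `(1, 100)`, because `int()` ignores the leading zeros, while `parse_day_range("40", "30")` raises `ValueError`.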


In [9]:
'''
The next try/except block does the following:
    1. Starting from the directory entered by the user, where the TROPOMI data is stored, os.walk is used to visit every
       folder and collect its path and filenames.
    2. Using the start and end values entered in the previous step, only those subfolders inside 2019 whose names lie
       in the range [start, end] are used.
    3. Subfolders outside this range are skipped, and the path of every file inside the selected subfolders is
       appended to the path_ list.
    4. As a result, path_ holds the paths of all files from the selected range of days.
    5. Next, every path in path_ whose extension isn't .nc is excluded.
    6. This ensures we don't proceed with other file types such as pdf, txt or csv, as our data is only in .nc files.
    7. If any error occurs inside the outer try block, the except block is executed.
    8. The except block terminates the program with sys.exit(), since no .nc files could be collected.
    9. The same happens after the block if the list of selected .nc files is empty.

'''
try:
    for (dir_path, dir_names, file_names) in os.walk(selected_directory):
        folder_name = os.path.basename(dir_path)
        try:
            day_number = int(folder_name)
        except ValueError:
            # the folder name is not a number, so it cannot be a day folder
            continue
        if start <= day_number <= end:
            for name in file_names:
                path_.append(os.path.join(dir_path, name))
    # this list will contain the absolute path to .nc files present in the selected subfolders
    selected_files_path = [filename for filename in  path_ if filename.endswith('.nc')]
    # this list will contain the names of .nc files present in the selected subfolders 
    # os.path.basename works with both Windows and POSIX path separators of the current system
    selected_files = [os.path.basename(file) for file in selected_files_path]
except:
    print('\nSorry, there are no detected files in the given path')
    sys.exit()


if len(selected_files) == 0:
    print('\nSorry, there are no detected files in the given path')
    sys.exit()
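
The file discovery steps above can be sketched as one reusable function. This is an illustrative variant rather than the notebook's own code: the name `find_nc_files` is hypothetical, and it assumes day folders are named with digits only.

```python
import os


def find_nc_files(root_dir, start, end):
    """Return sorted paths of .nc files in day folders numbered start..end."""
    matches = []
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        folder_name = os.path.basename(dir_path)
        if not folder_name.isdigit():
            continue  # skip folders that are not named after a day number
        if start <= int(folder_name) <= end:
            for name in file_names:
                if name.endswith(".nc"):
                    matches.append(os.path.join(dir_path, name))
    return sorted(matches)
```

Returning a sorted list keeps the processing order stable between runs, and `os.path.basename(path)` recovers each filename regardless of the operating system's path separator.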

# 3. Starting the Loop

In [None]:
# the total time taken for data extraction will be recorded after the next line of code executes
startTime = datetime.now()


''' Using the for loop, we will access the names and absolute path to selected .nc files '''
for filepath, filename in zip(selected_files_path, selected_files):
        # if there is some problem while reading a specific nc file, like file corruption, this try block will raise an error.
        # And, the except block is executed which will record the names of the files.
        try:
            print("\nSelected filename: ", filename)
            
            # Reading the .nc file 
            fh = Dataset(filepath, mode='r')

            # Selecting the corresponding variables.
            # Inside each .nc file, the necessary variables are stored in separate groups: SCIENCE_DATA,
            # GEOLOCATION_DATA and ANCILLARY_DATA. Values that are masked in the original file are replaced
            # with -1 so they can be easily identified and removed later.
            
            # Accessing the ColumnAmountNO2Trop variable
            no2 = fh.groups['SCIENCE_DATA']['ColumnAmountNO2Trop'][:].filled(-1.)
            no2 = no2.astype('float64')
            
            # Accessing the latitude variable
            latitude = fh.groups['GEOLOCATION_DATA']['Latitude'][:].filled(-1.)
            latitude = latitude.astype('float64')
            
            # Accessing the Longitude variable

            longitude = fh.groups['GEOLOCATION_DATA']['Longitude'][:].filled(-1.)
            longitude = longitude.astype('float64')
            
            # Accessing the cloud_fraction variable
            cloud_fraction = fh.groups['ANCILLARY_DATA']['CloudRadianceFraction'][:].filled(-1.)
            cloud_fraction = cloud_fraction.astype('float64')

            # Accessing the qa_value
            qa_value = fh.groups['SCIENCE_DATA']['qa_value'][:].filled(-1.)
            qa_value = qa_value.astype('float64')
            
            # accessing the lat corner (4 latitude values for a given data)
            lat_corner = fh.groups['GEOLOCATION_DATA']['CornerLatitude'][:].data
            lat_corner = lat_corner.astype('float64')
            
            # Accessing the lon corner (4 longitude values for a given data)
            lon_corner = fh.groups['GEOLOCATION_DATA']['CornerLongitude'][:].data
            lon_corner = lon_corner.astype('float64')

            ''' Closing the file'''
            # It is best practice to close a file after reading
            fh.close()

            ''' converting the 2-d array into 1-d array '''
            no2 = no2.ravel()
            latitude = latitude.ravel()
            longitude = longitude.ravel()
            cloud_fraction = cloud_fraction.ravel()
            qa_value = qa_value.ravel()
            
            # changing the shapes of our arrays
            ''' You can access this link to understand how np.reshape works: 
                w3schools.com/python/numpy/numpy_array_reshape.asp
            '''
            lat_corner = lat_corner.reshape(len(no2), 4)
            lon_corner = lon_corner.reshape(len(no2), 4)
            
            # accessing 8 new variables that contain the 4 pairs of lat-lon for each data
            LAT1 = lat_corner[:, 0]
            LAT2 = lat_corner[:, 1]
            LAT3 = lat_corner[:, 2]
            LAT4 = lat_corner[:, 3]
            LON1 = lon_corner[:, 0]
            LON2 = lon_corner[:, 1]
            LON3 = lon_corner[:, 2]
            LON4 = lon_corner[:, 3]

            ''' creating the dataframe'''
            # Accessing the date and time for the data from the filename
            date = filename[33:37]+'-'+filename[38:40]+'-'+filename[40:42]
            time = filename[43:45]+'-'+filename[45:47]+'-'+filename[47:49]
            df = pd.DataFrame({
                'ColumnAmountNO2Trop':no2,
                'Latitude':latitude,
                'Longitude':longitude,
                'Cloud_Fraction':cloud_fraction,
                'QA_Value':qa_value,
                'Date':[date for i in no2],
                'Time':[time for i in no2],
                'LAT1':LAT1,
                'LAT2':LAT2,
                'LAT3':LAT3,
                'LAT4':LAT4,
                'LON1':LON1,
                'LON2':LON2,
                'LON3':LON3,
                'LON4':LON4
            })
            
            # excluding the masked values from dataframe i.e -1
            df = df[(df.ColumnAmountNO2Trop != -1.) & (df.Latitude != -1.)\
                & (df.Longitude != -1.) & (df.Cloud_Fraction != -1.) & (df.QA_Value != -1.)].reset_index(drop = True)

            # keeping only the data points inside a coarse bounding box around Nepal:
            # latitude 0 to 40 and longitude 60 to 100
            # this box also contains data from neighbouring regions such as Tibet, Bihar and Uttarakhand
            # since there are roughly 2 million records in the original dataset, checking whether each
            # coordinate lies within the actual boundary of Nepal would be too expensive in terms of time,
            # so these coordinates were chosen to filter the data

            df = df[(df.Latitude >= 0.0) & (df.Latitude <= 40.0) & (df.Longitude >= 60.0) & \
                (df.Longitude <= 100.0)].reset_index(drop = True)

            if len(df) == 0:
                print('\nSorry, there is no data associated to ASIA on the file: '+filename)
            else:
                # Finally, concatenating the local datafame to our global dataframe
                concatenated_dataframe = pd.concat([concatenated_dataframe, df]) 
        except Exception:
            ''' If any file raised an error while reading, this block of code executes. Therefore, the names of files with 
                errors are recorded'''
            print('Filename has errors: ', filename)
            files_with_errors.append(filename)
# This block runs once, after the loop over all files has finished
if len(concatenated_dataframe) == 0:
    print("There is no data for Nepal after concatenation.")
else:
    # converting the string date column to datetime objects
    concatenated_dataframe['Date'] = pd.to_datetime(concatenated_dataframe['Date'])

    # Accessing year-month-day of the oldest data
    least_recent_date = str(concatenated_dataframe['Date'].min())[:10]

    # Accessing year-month-day of the latest data
    most_recent_date = str(concatenated_dataframe['Date'].max())[:10]

    # saving the data in parquet format
    complete_name = least_recent_date+'_'+most_recent_date
    concatenated_dataframe.to_parquet(complete_name+'.parquet', index = True)

    # if files_with_errors is not empty, one or more files raised an error while being read,
    # so the names of those files are saved to a csv file
    if len(files_with_errors) > 0:
        pd.DataFrame({'Filename':files_with_errors}).to_csv('files_with_Error.csv', index=True)

    # Finally, the path where the extracted data is saved is displayed to the user
    print("The concatenated data is saved as: ", complete_name+'.parquet')
    print("\nAll corresponding files have been saved to following path: ", os.getcwd())
    print("Thank you")

    # The total time taken for data extraction is also displayed to the user
    print('\nTotal time taken: ', str(datetime.now() - startTime))
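
To see what the fill, flatten and reshape steps inside the loop do, here is a small sketch on synthetic arrays instead of a real .nc file. The values are made up, the helper name `flatten_masked` is hypothetical, and the corner array is assumed to keep its four corner values on the last axis, which is what the reshape to `(len(no2), 4)` in the loop relies on.

```python
import numpy as np


def flatten_masked(variable, fill_value=-1.0):
    """Fill masked entries and flatten a masked array to 1-D float64."""
    return np.ma.filled(variable, fill_value).astype("float64").ravel()


# Synthetic 2 x 3 swath with one masked (missing) pixel
no2 = np.ma.masked_array(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    mask=[[False, True, False], [False, False, False]],
)
flat_no2 = flatten_masked(no2)

# Synthetic corner array: 4 corner values for each of the 2 x 3 pixels
# (assumed layout: corner dimension on the last axis)
corners = np.arange(24, dtype="float64").reshape(2, 3, 4)
flat_corners = corners.reshape(len(flat_no2), 4)

print(flat_no2)            # the masked pixel appears as -1.0
print(flat_corners.shape)  # one row of 4 corner values per pixel
```

Because both arrays are flattened in the same row-major order, row `i` of `flat_corners` still belongs to element `i` of `flat_no2`, which is why the columns can be placed side by side in one dataframe.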

# 4. Notes

<b>

1. For a given year, the data for Nepal is extracted and saved in Parquet format.

2. The data for Nepal is a subset of the extracted data, because the extraction keeps every data point whose latitude lies in [0, 40] and whose longitude lies in [60, 100].

3. Therefore, for time-series plots, the extracted data needs to be cleaned further to keep only Nepal's data.

</b>
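
One possible first cleaning step is to narrow the extracted data to a tighter box around Nepal. The sketch below is an assumption-laden illustration: the bounding-box numbers are approximate values chosen here, the helper name `filter_to_box` is hypothetical, and the sample rows are made up rather than taken from the real data.

```python
import pandas as pd

# Assumed approximate bounding box for Nepal, in degrees
NEPAL_BOX = {"lat_min": 26.3, "lat_max": 30.5, "lon_min": 80.0, "lon_max": 88.3}


def filter_to_box(df, box=NEPAL_BOX):
    """Keep only rows whose pixel centre lies inside the bounding box."""
    inside = (
        df["Latitude"].between(box["lat_min"], box["lat_max"])
        & df["Longitude"].between(box["lon_min"], box["lon_max"])
    )
    return df[inside].reset_index(drop=True)


# Made-up sample rows using the column names of the extracted dataframe
sample = pd.DataFrame({
    "ColumnAmountNO2Trop": [1.0e15, 2.0e15, 3.0e15],
    "Latitude": [27.7, 19.1, 28.2],
    "Longitude": [85.3, 72.9, 84.0],
})
nepal_only = filter_to_box(sample)
print(nepal_only)
```

A rectangular box still includes parts of neighbouring regions, so an exact result would need a test against Nepal's actual boundary polygon; the box filter only reduces the data to a manageable size first.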