# Data Acquisition, Processing and Collation
The following Jupyter Notebook collects the two datasets required for the MAPWAPS project application: flux tower data and satellite data. It processes the two datasets individually and then combines them into a single final dataset ready to be used in machine learning model training, validation and testing.

## Library and Function Imports

This cell imports the libraries used throughout the notebook, authenticates and initialises the Earth Engine Python API, and mounts Google Drive so that later cells can read and write files.

In [1]:
!pip install rasterio
import ee
import rasterio
import numpy as np
import pandas as pd
import geemap
import re
import os
from IPython.display import Image, display

# Authenticate with Earth Engine (requires user interaction)
ee.Authenticate()

# Initialize the Earth Engine Python API
ee.Initialize()

from google.colab import drive
drive.mount('/content/drive', force_remount=True)


## Miscellaneous Functions

### save_df_to_drive
Function that saves a Pandas DataFrame as a CSV file at the specified Google Drive file path

In [2]:
def save_df_to_drive(df, file_path_in_google_drive):
  """
  Function that saves a Pandas DataFrame as a CSV file at the specified Google Drive file path

    parameter:  df -> the Pandas DataFrame to be saved
                file_path_in_google_drive -> Google Drive file path (including the file name) where the CSV file will be stored
    return:     void

  """
  try:
      # Ensure the destination directory exists
      destination_dir = os.path.dirname(file_path_in_google_drive)
      os.makedirs(destination_dir, exist_ok=True)

      # Save the DataFrame to the specified file path in Google Drive
      df.to_csv(file_path_in_google_drive, index=False)

      print(f"DataFrame saved to Google Drive at '{file_path_in_google_drive}'")

  except Exception as e:
      print("An error occurred:", str(e))

## Flux Tower Data

### timestamp_separate
Function that uses string handling techniques to separate each timestamp (YYYYMMDDHHMM) in TIMESTAMP_START and TIMESTAMP_END into a date (YYYYMMDD) and a time (HHMM), in order to isolate the date variable for later use.

In [3]:
def timestamp_separate(df):

  """
  Function that uses string handling techniques to separate the flux tower variables TIMESTAMP_START and TIMESTAMP_END (YYYYMMDDHHMM)
  into date (YYYYMMDD) and time (HHMM) columns in order to isolate the date variable for later use

    parameter:  df -> the Pandas DataFrame storing the original, unfiltered flux tower data
    return:     df_time_seperated -> the Pandas DataFrame with manipulated date and time variables

  """

  # Manipulates the TIMESTAMP (both START and END) columns to split into DATE and TIME separately
  # Convert TIMESTAMP columns to string
  df['TIMESTAMP_START'] = df['TIMESTAMP_START'].astype(str)
  df['TIMESTAMP_END'] = df['TIMESTAMP_END'].astype(str)

  # Extract TIMESTAMP_DATE (YYYYMMDD) and TIMESTAMP_TIME (HHMM) (for both START and END) with string handling principles
  df['TIMESTAMP_START_DATE'] = df['TIMESTAMP_START'].str[:8]
  df['TIMESTAMP_START_TIME'] = df['TIMESTAMP_START'].str[8:]
  df['TIMESTAMP_END_DATE'] = df['TIMESTAMP_END'].str[:8]
  df['TIMESTAMP_END_TIME'] = df['TIMESTAMP_END'].str[8:]

  # At this stage COUNT should be 48 for every complete day, because the sampling interval is 30 minutes (24 hrs x 2) and no entries have been removed
  df['COUNT'] = df.groupby('TIMESTAMP_START_DATE')['TIMESTAMP_START_DATE'].transform('count')

  # Specify TIMESTAMP columns order
  start_columns = [
      'TIMESTAMP_START', 'TIMESTAMP_START_DATE', 'TIMESTAMP_START_TIME',
      'TIMESTAMP_END', 'TIMESTAMP_END_DATE', 'TIMESTAMP_END_TIME'
  ]

  # Define the df order with TIMESTAMP columns first and remaining columns in their original order
  desired_order = start_columns + [col for col in df.columns if col not in start_columns]
  df_time_seperated = df[desired_order]

  return df_time_seperated
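
As a quick illustration of the slicing used above, the sketch below applies the same `.str[:8]` / `.str[8:]` split to a two-row toy DataFrame. The timestamp values are made up for the example and do not come from an AmeriFlux file.

```python
import pandas as pd

# Hypothetical half-hourly timestamps in YYYYMMDDHHMM form (illustrative values only)
toy = pd.DataFrame({'TIMESTAMP_START': [201401010000, 201401010030]})

# Cast to string first: integer columns have no .str accessor
toy['TIMESTAMP_START'] = toy['TIMESTAMP_START'].astype(str)

# The first 8 characters are the date, the remaining 4 are the time
toy['TIMESTAMP_START_DATE'] = toy['TIMESTAMP_START'].str[:8]
toy['TIMESTAMP_START_TIME'] = toy['TIMESTAMP_START'].str[8:]

print(toy)
```

Because the split is purely positional, it only gives sensible results when every timestamp has exactly 12 characters.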

### check_LE
Function that checks whether a dataset has an LE column and, if not, finds an LE column derivation and renames it LE for simplified use

- Note: LE column derivation refers to a column that stores LE values but under a different column name (e.g. LE_1.1.1)

In [4]:
def check_LE(df):
    """
    Function that checks whether a dataset has an LE column and, if not, finds an LE column derivation and renames it LE for simplified future use
      Note: LE column derivation refers to a column that stores LE values but under a different column name (e.g. LE_1.1.1)

    parameter:  df -> the Pandas DataFrame storing the original flux tower data (with LE or LE derived columns)
    return:     df_with_LE -> the Pandas DataFrame with LE column name

    """
    column_name = 'LE'                        # Column of interest is the Latent Heat Flux (LE)
    df_with_LE = df

    if column_name not in df.columns:         # The Pandas DataFrame does not have a column with name 'LE'
      try:
          # Find the first column that starts with 'LE'
          LE_column = next(col for col in df.columns if col.startswith('LE'))

          # Rename the column to 'LE' (rename returns a new DataFrame; with inplace=True it would return None)
          df_with_LE = df.rename(columns={LE_column: 'LE'})
      except StopIteration:
          # The Pandas DataFrame does not have a column name that starts with 'LE'
          print("No column starts with 'LE'")

    return df_with_LE
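
One pandas detail is worth keeping in mind when renaming columns: `DataFrame.rename` returns a new DataFrame by default, but returns `None` when called with `inplace=True`, so its result should not be assigned in that case. The sketch below shows both behaviours; the column name `LE_1_1_1` is only an illustrative LE derivation.

```python
import pandas as pd

toy = pd.DataFrame({'LE_1_1_1': [10.0, 12.5]})  # hypothetical LE-derived column name

# Default behaviour: a renamed copy is returned and the original is left untouched
renamed = toy.rename(columns={'LE_1_1_1': 'LE'})

# With inplace=True the DataFrame is modified in place and the call returns None
result = toy.copy().rename(columns={'LE_1_1_1': 'LE'}, inplace=True)

print(list(renamed.columns), result)
```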

### remove_null
Function that removes any data entry/row that has a null (-9999) Latent Heat Flux (LE) value
- There is a commented-out section of code that removes any entry with a null value in any variable (LE or otherwise)

In [5]:
def remove_null(df):

  """
  Function that removes any data entry/ row that has a null (-9999) Latent Heat Flux (LE) variable

    parameter:  df -> the Pandas DataFrame with LE null (-9999) values
    return:     df_without_LE_null_values -> the Pandas DataFrame with removed LE null (-9999) values

  """

  # # ----------------------- use if all entries with null values need removal -----------------------
  # # Removes any row with a null (-9999) value
  # columns_to_check = df.columns[2:]
  # void_filter_boolean = df[columns_to_check].apply(lambda x: (x != -9999).all(), axis=1) # creates a boolean mask to identify the presence of -9999 values
  # df_without_null_values = df[void_filter_boolean] # applies mask to df
  # # ------------------------------------------------------------------------------------------------

  # Removes any row with a LE column null (-9999) value (copied so that later edits do not act on a view of the original)
  df = df[df['LE'] != -9999].copy()
  del df['COUNT']
  df['COUNT'] = df.groupby('TIMESTAMP_START_DATE')['TIMESTAMP_START_DATE'].transform('count')     # Counts the number of entries per date
  df_without_LE_null_values = df

  return df_without_LE_null_values

### group_df
Function that groups a DataFrame by date and sums each of its remaining columns, because we want to work with daily ET estimates.
- It also filters the DataFrame to keep only the necessary variables: Date, LE, COUNT
- If the summed COUNT is not 2304 (48 x 48), the row is removed. Every half-hourly row of a complete day carries COUNT = 48, so any other total signals that null value removal left less than a full day's worth of data.

In [6]:
def group_df(df):

  """
  Function that groups a Pandas DataFrame by the date variable and adds up the other columns

    parameter:  df -> the Pandas DataFrame with half hourly data readings (i.e. 48 readings per day - unless null values have been removed)
    return:     df_grouped -> the Pandas DataFrame with grouped entries/ rows and daily LE values
                              (only entries with a full day of recordings will be included - i.e. the daily count of 2304 = 48 x 48)

  """

  df_simplified = df[['TIMESTAMP_START_DATE','LE','COUNT']]         # Creates a new filtered Pandas DataFrame with only the important columns
  # df_simplified.head()           # Uncomment to ensure proper dataframe simplification

  # Group by 'TIMESTAMP_START_DATE' and sum the 'LE' and 'COUNT' columns
  df_grouped = df_simplified.groupby('TIMESTAMP_START_DATE').agg({'LE': 'sum', 'COUNT': 'sum'}).reset_index()
  df_grouped.rename(columns={'TIMESTAMP_START_DATE': 'DATE', 'LE': 'DAILY LE', 'COUNT': 'DAILY COUNT'}, inplace=True) # Rename the columns

  # Drop all 'incomplete' entries (those that do not have a full day's worth of data)
  df_grouped = df_grouped[df_grouped['DAILY COUNT'] == 2304]

  return df_grouped
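
The 2304 threshold follows from summing a column whose every row already holds the daily count: a complete day has 48 rows, each with COUNT = 48, so the sum is 48 x 48 = 2304. A minimal sketch with made-up LE values (one complete day and one day with only 47 readings) shows the effect:

```python
import pandas as pd

# Hypothetical data: 48 readings on the first date, 47 on the second
dates = ['20140101'] * 48 + ['20140102'] * 47
toy = pd.DataFrame({'TIMESTAMP_START_DATE': dates, 'LE': 1.0})

# Per-row count of readings that share the same date
toy['COUNT'] = toy.groupby('TIMESTAMP_START_DATE')['TIMESTAMP_START_DATE'].transform('count')

# Summing COUNT squares the per-day count: 48 * 48 = 2304 and 47 * 47 = 2209
daily = toy.groupby('TIMESTAMP_START_DATE').agg({'LE': 'sum', 'COUNT': 'sum'}).reset_index()

# Only the complete day survives the filter
complete = daily[daily['COUNT'] == 2304]
print(daily)
```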

## Landsat Satellite Data

### Image Bands and Plant Indices

#### get_landsat_bands
Function that extracts the multispectral bands from a Landsat 8 satellite image of a co-ordinate specified location

In [7]:
def get_landsat_bands(image_id, latitude, longitude):

  """
  Function that extracts the multispectral bands from a Landsat 8 satellite image of a co-ordinate specified location

    parameter:  image_id -> unique Landsat 8 image ID
                latitude -> latitude co-ordinate of desired location
                longitude -> longitude co-ordinate of desired location
    return:     band_dict -> dictionary of multispectral band values for a specified location with labels B1 - B11

  """
  try:
      # Load the Landsat image by its ID
      landsat_image = ee.Image(image_id)

      # Define a point geometry for the specified latitude and longitude
      point_geometry = ee.Geometry.Point([longitude, latitude])

      # Use the .sample() method to extract pixel values at the specified geometry
      # This will create a feature collection containing the pixel values
      pixel_values = landsat_image.sample(point_geometry, 30)  # 30 meters scale for Landsat

      # Initialize an empty dictionary to store the band values
      band_dict = {}

      # Extract band values from the feature collection and add them to the dictionary
      for band_name in landsat_image.bandNames().getInfo():
          band_value = pixel_values.first().get(band_name).getInfo()
          band_dict[band_name] = band_value

      return band_dict

  except ee.EEException as e:
      return {"error": "An Earth Engine exception occurred: " + str(e)}
  except Exception as e:
      return {"error": "An unexpected error occurred: " + str(e)}

#### get_cloud_cover
Function that extracts the cloud cover percentage from a satellite image specified by a Landsat 8 image ID


In [8]:
def get_cloud_cover(image_id):
  """
  Function that extracts the cloud cover percentage from a satellite image specified by a Landsat 8 image ID

    parameter:  image_id -> Landsat 8 image ID
    return:     cloud_cover_rounded -> percentage cloud cover rounded to 3 decimal places

  """

  # Load the Landsat image using the provided image_id
  image = ee.Image(image_id)

  # Get the cloud cover property
  cloud_cover = image.get('CLOUD_COVER').getInfo()

  # Round the cloud cover percentage to 3 decimal places
  cloud_cover_rounded = round(cloud_cover, 3)

  return cloud_cover_rounded

#### calculate_ndvi
Function that calculates the Normalised Difference Vegetation Index (NDVI)

$ NDVI = \frac{NIR - Red}{NIR + Red} \in [-1, 1]$


In [9]:
def calculate_ndvi(band_dict):

  """
  Function that calculates the Normalised Difference Vegetation Index (NDVI) from the multispectral bands of a satellite image

    parameter:  band_dict -> dictionary of multispectral band values
    return:     ndvi -> Normalised Difference Vegetation Index (NDVI) value

  """

  try:
      # Extract the values for the NIR (Near Infrared) and Red bands from the dictionary
      nir = band_dict['B5']  # Assuming 'B5' is the NIR band
      red = band_dict['B4']  # Assuming 'B4' is the Red band

      # Calculate NDVI
      ndvi = (nir - red) / (nir + red)

      return ndvi

  except KeyError:
      return {"error": "Required bands (NIR and Red) not found in the band dictionary."}
  except ZeroDivisionError:
      return {"error": "Division by zero error. Check if NIR + Red is zero."}
  except Exception as e:
      return {"error": "An unexpected error occurred: " + str(e)}
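
As a worked example of the NDVI formula, the sketch below uses made-up reflectance values for a single pixel (NIR = 0.5, Red = 0.1), giving (0.5 - 0.1) / (0.5 + 0.1) = 0.4 / 0.6, or roughly 0.667. The values are illustrative and not taken from any Landsat image.

```python
# Hypothetical reflectance values for one pixel (illustrative only)
band_dict = {'B4': 0.1, 'B5': 0.5}   # B4 = Red, B5 = NIR on Landsat 8

nir, red = band_dict['B5'], band_dict['B4']

# Same arithmetic as the formula above: (NIR - Red) / (NIR + Red)
ndvi = (nir - red) / (nir + red)

print(round(ndvi, 3))
```

Dense green vegetation reflects strongly in the NIR and weakly in the red, which pushes the index towards 1.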

#### calculate_vari
Function that calculates the Visible Atmospherically Resistant Index (VARI)

$VARI = \frac{Green - Red}{Green + Red - Blue}$

In [10]:
def calculate_vari(band_dict):
  """
  Function that calculates the Visible Atmospherically Resistant Index (VARI) from the multispectral bands of a satellite image

    parameter:  band_dict -> dictionary of multispectral band values
    return:     vari -> Visible Atmospherically Resistant Index (VARI) value

  """

  try:
      # Extract the values for the Blue, Red, and Green bands from the dictionary
      blue = band_dict['B2']  # Assuming 'B2' is the Blue band
      red = band_dict['B4']   # Assuming 'B4' is the Red band
      green = band_dict['B3'] # Assuming 'B3' is the Green band

      # Calculate VARI
      vari = (green - red) / (green + red - blue)

      return vari

  except KeyError:
      return {"error": "Required bands (Blue, Red, and Green) not found in the band dictionary."}
  except ZeroDivisionError:
      return {"error": "Division by zero error. Check if Green + Red - Blue is zero."}
  except Exception as e:
      return {"error": "An unexpected error occurred: " + str(e)}

#### calculate_savi
Function that calculates the Soil Adjusted Vegetation Index (SAVI)

$ SAVI = \frac{(1 + L) \cdot (NIR - Red)}{NIR + Red + L} \in [-1, 1]$

In [11]:
def calculate_savi(band_dict):
  """
  Function that calculates the Soil Adjusted Vegetation Index (SAVI) from the multispectral bands of a satellite image

    parameter:  band_dict -> dictionary of multispectral band values
    return:     savi -> Soil Adjusted Vegetation Index (SAVI) value

  """

  try:
      # Extract the values for the NIR (Near Infrared) and Red bands from the dictionary
      nir = band_dict['B5']  # Assuming 'B5' is the NIR band
      red = band_dict['B4']  # Assuming 'B4' is the Red band

      # Set the soil adjustment factor (L)
      L = 0.5

      # Calculate SAVI
      savi = ((1 + L) * (nir - red)) / (nir + red + L)

      return savi

  except KeyError:
      return {"error": "Required bands (NIR and Red) not found in the band dictionary."}
  except ZeroDivisionError:
      return {"error": "Division by zero error. Check if NIR + Red + L is zero."}
  except Exception as e:
      return {"error": "An unexpected error occurred: " + str(e)}

#### calculate_ndwi

Function that calculates the Normalised Difference Water Index (NDWI)

$ NDWI = \frac{Green - NIR}{Green + NIR} \in [-1, 1]$

In [12]:
def calculate_ndwi(band_dict):
  """
  Function that calculates the Normalised Difference Water Index (NDWI) from the multispectral bands of a satellite image

    parameter:  band_dict -> dictionary of multispectral band values
    return:     ndwi -> Normalised Difference Water Index (NDWI) value

  """

  try:
      # Extract the values for the Green and NIR bands from the dictionary
      green = band_dict['B3']  # Assuming 'B3' is the Green band
      nir = band_dict['B5']    # Assuming 'B5' is the NIR band

      # Calculate NDWI
      ndwi = (green - nir) / (green + nir)

      return ndwi

  except KeyError:
      return {"error": "Required bands (Green and NIR) not found in the band dictionary."}
  except ZeroDivisionError:
      return {"error": "Division by zero error. Check if Green + NIR is zero."}
  except Exception as e:
      return {"error": "An unexpected error occurred: " + str(e)}

#### calculate_evi

Function that calculates the Enhanced Vegetation Index (EVI)

$ EVI =  \frac{2.5 \cdot (NIR - Red)}{NIR + 6 \cdot Red - 7.5 \cdot Blue + 1} $

In [13]:
def calculate_evi(band_dict):
  """
  Function that calculates the Enhanced Vegetation Index (EVI) from the multispectral bands of a satellite image

    parameter:  band_dict -> dictionary of multispectral band values
    return:     evi -> Enhanced Vegetation Index (EVI) value

  """

  try:
      # Extract the values for the Blue, Red, and NIR bands from the dictionary
      blue = band_dict['B2']  # Assuming 'B2' is the Blue band
      red = band_dict['B4']   # Assuming 'B4' is the Red band
      nir = band_dict['B5']   # Assuming 'B5' is the NIR band

      # Parameters for EVI calculation
      G = 2.5
      C1 = 6.0
      C2 = 7.5
      L = 1.0  # Can be adjusted for different regions

      # Calculate EVI
      evi = G * (nir - red) / (nir + (C1 * red) - (C2 * blue) + L)

      return evi

  except KeyError:
      return {"error": "Required bands (Blue, Red, and NIR) not found in the band dictionary."}
  except ZeroDivisionError:
      return {"error": "Division by zero error. Check if NIR + C1*Red - C2*Blue + L is zero."}
  except Exception as e:
      return {"error": "An unexpected error occurred: " + str(e)}
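
To make the EVI coefficients concrete, here is a worked example with made-up reflectance values (Blue = 0.05, Red = 0.1, NIR = 0.5). The denominator is 0.5 + 6 x 0.1 - 7.5 x 0.05 + 1 = 1.725 and the numerator is 2.5 x 0.4 = 1.0, so EVI is roughly 0.58. The pixel values are illustrative only.

```python
# Hypothetical reflectance values for one pixel (illustrative only)
band_dict = {'B2': 0.05, 'B4': 0.1, 'B5': 0.5}   # B2 = Blue, B4 = Red, B5 = NIR

blue, red, nir = band_dict['B2'], band_dict['B4'], band_dict['B5']

# Coefficients as used in the formula above
G, C1, C2, L = 2.5, 6.0, 7.5, 1.0

denominator = nir + C1 * red - C2 * blue + L
evi = G * (nir - red) / denominator

print(round(denominator, 3), round(evi, 3))
```

Unlike NDVI, the blue term enters the denominator with a negative sign, so unusually bright blue values can drive the denominator towards zero.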

### Image Collection

#### get_collection_landsat_image_ids
Function that extracts a list of Landsat 8 image IDs of a co-ordinate specified location within a date range

In [14]:
def get_collection_landsat_image_ids(start_date, end_date, latitude, longitude):
  """
  Function that extracts a list of Landsat 8 image IDs of a co-ordinate specified location within a date range

    parameter:  start_date -> start date of the date range
                end_date -> end date of the date range
                latitude -> latitude co-ordinate of desired location
                longitude -> longitude co-ordinate of desired location
    return:     image_id_list -> list of Landsat 8 image IDs

  """

  # Create a point of interest (POI) as a geometry
  poi = ee.Geometry.Point(longitude, latitude)

  # Create an image collection for Landsat imagery
  landsat_collection = ee.ImageCollection('LANDSAT/LC08/C01/T1_TOA') \
        .filterBounds(poi) \
        .filterDate(ee.Date.parse('YYYYMMdd', start_date), ee.Date.parse('YYYYMMdd', end_date))

  # Get a list of image IDs in the collection
  image_ids = landsat_collection.aggregate_array('system:id')

  # Get the image IDs as a Python list
  image_id_list = image_ids.getInfo()

  return image_id_list

#### create_df_from_image_ids
Function that creates a Pandas DataFrame from the Landsat 8 image collection: Image ID, Date, Co-ordinates, Band Values, Cloud Cover Percentage and Plant Indices.

In [15]:
def create_df_from_image_ids(image_ids, latitude, longitude):
  """
  Function that creates a Pandas DataFrame from the Landsat 8 image collection: Image ID, Date, Co-ordinates, Band Values, Cloud Cover Percentage and Plant Indices.

    parameter:  image_ids -> list of Landsat 8 image IDs
                latitude -> latitude co-ordinate of desired location
                longitude -> longitude co-ordinate of desired location
    return:     df -> Pandas DataFrame of satellite data information for each Landsat 8 image ID

  """


  data = []

  for image_id in image_ids:
      # Extract date from the image ID and remove hyphens
      date_str = image_id.split('_')[3]
      date = ''.join(date_str.split('-'))

      # Get Landsat band values for the current image
      band_values = get_landsat_bands(image_id, latitude, longitude)

      cloud = get_cloud_cover(image_id)

      # Calculate Plant Indices (NDVI, EVI, SAVI, NDWI, VARI) for the current image
      ndvi = calculate_ndvi(band_values)
      evi = calculate_evi(band_values)
      savi = calculate_savi(band_values)
      ndwi = calculate_ndwi(band_values)
      vari = calculate_vari(band_values)

      # Append the data as a dictionary to the list, including band values
      data.append({'Landsat Image ID': image_id, 'Date': date, 'Latitude': latitude, 'Longitude': longitude, **band_values, 'Cloud Cover': cloud, 'NDVI': ndvi, 'EVI': evi, 'SAVI': savi, 'VARI': vari, 'NDWI': ndwi})

  # Create the Pandas DataFrame using pandas.concat
  df = pd.concat([pd.DataFrame([d]) for d in data], ignore_index=True)

  return df
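
The date extraction relies on the layout of the image ID. The sketch below uses an illustrative ID in the `LANDSAT/LC08/C01/T1_TOA` format to show why the date sits at index 3 after splitting on underscores: the underscore in `T1_TOA` also counts as a separator.

```python
# Illustrative image ID in the collection's ID format (path/row 044034, acquired 2014-03-18)
image_id = 'LANDSAT/LC08/C01/T1_TOA/LC08_044034_20140318'

parts = image_id.split('_')
# parts is ['LANDSAT/LC08/C01/T1', 'TOA/LC08', '044034', '20140318']

# The acquisition date (YYYYMMDD) is the fourth element
date = parts[3]

print(parts, date)
```

If a different collection ID were used (one without an underscore in its name, for instance), the index would shift, so the extraction is tied to this particular collection.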

## Ameriflux Application

Note: the code's comments often refer to two attempts (one is always commented out and the other implemented). Attempt 1 refers to the spatially restricted set of 66 datasets and attempt 2 refers to the all-inclusive set of 375 datasets. Before running, make sure that each of the following cells implements the same attempt (i.e. all commented-out and active sections agree on which attempt is being used).

### Ameriflux Site Overview Data

In [16]:
# AmeriFlux allows you to download a CSV file summarising the flux tower sites whose data you have chosen to use, along with their characteristics.
#     -   this file is needed to link each flux tower site (Site ID) with its co-ordinate point

# Specify the Google Drive file path of the AmeriFlux Site Overview Dataset
# ------------------------------- ATTEMPT 1 -------------------------------
site_overview_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/AmeriFlux Site Description.csv'
# -------------------------------------------------------------------------

# ------------------------------- ATTEMPT 2 -------------------------------
# site_overview_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/AmeriFlux Site Description Extended.csv'
# -------------------------------------------------------------------------

# Load the AmeriFlux Site Overview Dataset into a Pandas DataFrame
df_site_overview = pd.read_csv(site_overview_file_path, delimiter=';')

# df_site_overview.head()                               # Uncomment to ensure proper csv file upload

# Filter the AmeriFlux Site Overview Dataset
df_site_overview['Years of AmeriFlux BASE Data'] = df_site_overview['Years of AmeriFlux BASE Data'].astype(str)                               # Converts list to string
df_site_overview['Start Year'] = df_site_overview['Years of AmeriFlux BASE Data'].apply(lambda x: min(map(int, x.strip('()').split(','))))    # Extracts the first year from the list
df_site_overview['End Year'] = df_site_overview['Years of AmeriFlux BASE Data'].apply(lambda x: max(map(int, x.strip('()').split(','))))      # Extracts the last year from the list

# df_site_overview.head()                               # Uncomment to ensure proper start and end year extraction

important_columns = ['Site ID', 'Latitude (degrees)', 'Longitude (degrees)', 'Start Year', 'End Year']    # Specifies important columns (the rest will be discarded for simplicity)
df_site_overview_filtered = df_site_overview[important_columns]                                           # Creates a new filtered Pandas DataFrame with only the important columns

# Saves the filtered AmeriFlux Site Overview Dataset to a specified Google Drive file path
# ------------------------------- ATTEMPT 1 -------------------------------
save_df_to_drive(df_site_overview_filtered, '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/AmeriFlux Site Description Filtered.csv')
# -------------------------------------------------------------------------

# ------------------------------- ATTEMPT 2 -------------------------------
# save_df_to_drive(df_site_overview_filtered, '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/AmeriFlux Site Description Extended Filtered.csv')
# -------------------------------------------------------------------------

# df_site_overview_filtered.head()                      # Uncomment to view simplified/ filtered database containing the needed variables
# print(df_site_overview_filtered.info())               # Uncomment to view the Pandas DataFrame's characteristics (# columns, # rows, variables, variable types)

DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/AmeriFlux Site Description Filtered.csv'
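
The start and end year extraction can be checked in isolation. The sketch below assumes the 'Years of AmeriFlux BASE Data' field holds a parenthesised, comma-separated list of years, which is what the `strip('()')` and `split(',')` calls above imply; the value shown is made up.

```python
# Assumed field format: a parenthesised, comma-separated list of years (illustrative value)
years_field = '(2012, 2013, 2014, 2015)'

# Remove the parentheses, split on commas and convert each piece to an integer
# (int() tolerates the leading space left over after the split)
years = [int(y) for y in years_field.strip('()').split(',')]

start_year, end_year = min(years), max(years)

print(start_year, end_year)
```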


### Ameriflux Individual Datasets

#### Individual Dataset Functions

##### get_date_range_and_coordinates
Function that obtains the co-ordinates and operational date range of a flux tower from the AmeriFlux Site Overview dataset, to be used as input parameters when extracting the Landsat satellite image dataset

In [17]:
def get_date_range_and_coordinates(site_overview_df, file_name):

  """
  Function that obtains the co-ordinates and operational date range of a flux tower from the AmeriFlux Site Overview dataset
    to be used as input parameters in extracting the landsat satellite image dataset

    parameter:  site_overview_df -> overview Pandas DataFrame describing the Ameriflux flux towers and their characteristics
                file_name -> file name of an individual Ameriflux flux tower dataset
    return:     data_array -> array of an individual Ameriflux flux tower's characteristics (co-ordinates and date range)

  """
  # Extract the Site ID from the flux tower dataset
  pattern = r'AMF_(.*?)_BASE'                 # Define a regex pattern to match the desired substring - in this case the Site ID of the flux tower
  match = re.search(pattern, file_name)       # Use re.search to find the Site ID substring in the file name

  if match is None:                           # Fail early with a clear message if the file name does not follow the expected pattern
      raise ValueError(f"Could not extract a Site ID from file name '{file_name}'")

  site_id = match.group(1)
  row_number = site_overview_df[site_overview_df['Site ID'] == site_id].index[0]      # Obtain the row number of the matched Site ID in the AmeriFlux Site Overview dataset
  longitude = site_overview_df.loc[row_number, 'Longitude (degrees)']                 # Extract the longitude from the specified row
  latitude = site_overview_df.loc[row_number, 'Latitude (degrees)']                   # Extract the latitude from the specified row
  start_year = site_overview_df.loc[row_number, 'Start Year']                         # Extract the start year from the specified row
  end_year = site_overview_df.loc[row_number, 'End Year']                             # Extract the end year from the specified row

  data_array = [latitude, longitude, start_year, end_year]              # Append extracted parameters to a list

  return data_array
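
The Site ID extraction can be tried on its own. The file name below is hypothetical and only follows the `AMF_<Site ID>_BASE...` shape that the regex pattern expects; the site `US-XYZ` is not a real tower.

```python
import re

# Hypothetical file name in the AMF_<Site ID>_BASE... shape (US-XYZ is not a real site)
file_name = 'AMF_US-XYZ_BASE_HH_1-5.csv'

# Non-greedy capture of everything between 'AMF_' and the first '_BASE'
match = re.search(r'AMF_(.*?)_BASE', file_name)

# re.search returns None when nothing matches, so guard before calling .group()
site_id = match.group(1) if match else None

print(site_id)
```

A file name that does not contain both `AMF_` and `_BASE` yields no match, which is worth checking before looking up the site in the overview table.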

##### data_aquisition
Function that creates and stores a filtered flux tower dataset and a generated Landsat 8 satellite dataset for each Ameriflux flux tower and saves them to specified Google Drive file paths

In [18]:
def data_aquisition(file_name):

  """
  Function that creates and stores a filtered flux tower dataset and a generated Landsat 8 satellite dataset for each Ameriflux flux tower and saves them to Google Drive folders

    parameter:  file_name -> file name of an individual Ameriflux flux tower dataset
    return:     void

  """

  csv_path = folder_path+"/"+file_name                            # Specify the Google Drive file path of the individual AmeriFlux flux tower Dataset using input parameter file_name
  df = pd.read_csv(csv_path, delimiter=',', skiprows=2)           # Load the individual AmeriFlux flux tower Dataset into a Pandas DataFrame

  # Filter the individual AmeriFlux flux tower dataset using pre-defined functions
  df_time_seperated = timestamp_separate(df)                      # separate the timestamp variables of the Pandas DataFrame
  df_with_LE = check_LE(df_time_seperated)                        # ensure the Pandas DataFrame has a LE column
  df_without_LE_null_values = remove_null(df_with_LE)             # remove all null LE values from the Pandas DataFrame
  df_grouped = group_df(df_without_LE_null_values)                # group the Pandas DataFrame by date to obtain daily entries


  # Specify the Google Drive file path where the filtered individual AmeriFlux flux tower Dataset must be saved
  # ------------------------------- ATTEMPT 1 -------------------------------
  flux_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered/'+file_name+' - Flux Data.csv'
  # -------------------------------------------------------------------------

  # ------------------------------- ATTEMPT 2 -------------------------------
  # flux_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered Extended/'+file_name+' - Flux Data Extended.csv'
  # -------------------------------------------------------------------------

  # Save the filtered individual AmeriFlux flux tower Dataset to the specified Google Drive file path
  save_df_to_drive(df_grouped, flux_file_path)

  # Generate the individual Landsat 8 satellite datasets using pre-defined functions
  parameter_array = get_date_range_and_coordinates(df_site_overview_filtered, file_name)  # obtain the operation date range and co-ordinate specified location of the flux tower
  latitude = parameter_array[0]                 # Extract the latitude
  longitude = parameter_array[1]                # Extract the longitude
  start_year = str(parameter_array[2])          # Extract the start year
  end_year = str(parameter_array[3])            # Extract the end year

  # Set the start date to the 1st of January [start year] and the end date to the 31st of December [end year]
  start_date = start_year+'0101'
  end_date = end_year+'1231'

  # Check that satellite data can be extracted: Landsat 8 has only been operational since 2013
  if int(end_year) < 2013:
    print('Dataset cannot be extracted - outside of Landsat 8 range')
  else:
    landsat_ids = get_collection_landsat_image_ids(start_date, end_date, latitude, longitude)             # Create the Landsat image ID list

    # This ensures the code continues when the landsat_ids list is empty
    try:
      df_landsat = create_df_from_image_ids(landsat_ids, latitude, longitude)                             # Generate the individual Landsat 8 satellite dataset

      # Specify the Google Drive file path where the filtered individual Landsat 8 satellite Dataset must be saved
      # ------------------------------- ATTEMPT 1 -------------------------------
      landsat_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Satellite Data/First/'+file_name+' - Landsat Data.csv'
      # -------------------------------------------------------------------------

      # ------------------------------- ATTEMPT 2 -------------------------------
      # landsat_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Satellite Data/Extended/'+file_name+' - Landsat Data Extended.csv'
      # -------------------------------------------------------------------------

      # Save the individual Landsat 8 satellite Dataset to the specified Google Drive file path
      save_df_to_drive(df_landsat, landsat_file_path)

    except ValueError as ve:
      print(f"ValueError: {ve}") # Handle the ValueError (No objects to concatenate)

##### merge_df
Function that merges the individual filtered AmeriFlux flux tower dataset and the Landsat 8 satellite dataset for each Ameriflux flux tower and saves them to Google Drive file paths
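
To illustrate the difference between the two merge types produced here, the sketch below joins two toy DataFrames on a shared 'Date' column. All values are made up, and both frames are assumed to already use the same column name for the date.

```python
import pandas as pd

# Hypothetical daily flux values and satellite overpasses (illustrative only)
flux = pd.DataFrame({'Date': [20140101, 20140102, 20140103],
                     'DAILY LE': [100.0, 110.0, 120.0]})
landsat = pd.DataFrame({'Date': [20140102, 20140104],
                        'NDVI': [0.61, 0.58]})

# Inner merge: only dates present in both DataFrames
inner = pd.merge(flux, landsat, on='Date', how='inner')

# Outer merge: every date from either DataFrame, with NaN where one side has no data
outer = pd.merge(flux, landsat, on='Date', how='outer')

print(inner)
print(outer)
```

The inner result is the one that pairs each satellite observation with a same-day flux value, while the outer result is useful for checking how many dates were lost on either side.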

In [19]:
def merge_df(file_name):

  """
  Function that merges the individual filtered AmeriFlux flux tower dataset and the Landsat 8 satellite dataset for each Ameriflux flux tower and saves them to Google Drive file paths

    parameter:  file_name -> file name of an individual Ameriflux flux tower dataset
    return:     void

  """

  # Specify the Google Drive file paths where the filtered individual AmeriFlux flux tower Dataset and Landsat 8 satellite Dataset that corresponds to the input parameter file_name are stored
  # ------------------------------- ATTEMPT 1 -------------------------------
  file_path_1 = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered/'+file_name+' - Flux Data.csv'
  file_path_2 = '/content/drive/My Drive/Colab Notebooks/Data/Satellite Data/First/'+file_name+' - Landsat Data.csv'
  # -------------------------------------------------------------------------

  # ------------------------------- ATTEMPT 2 -------------------------------
  # file_path_1 = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered Extended/'+file_name+' - Flux Data Extended.csv'
  # file_path_2 = '/content/drive/My Drive/Colab Notebooks/Data/Satellite Data/Extended/'+file_name+' - Landsat Data Extended.csv'
  # -------------------------------------------------------------------------

  # Load the CSV files into pandas DataFrames
  df1 = pd.read_csv(file_path_1)                                # Pandas DataFrame that stores the individual AmeriFlux flux tower Dataset
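  # on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which is deprecated and was removed in pandas 2.0
  df2 = pd.read_csv(file_path_2, on_bad_lines='skip')           # Pandas DataFrame that stores the individual Landsat 8 satellite Dataset (malformed rows are skipped)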

  # Keep only the part of the file name that precedes any ' - ' suffix
  file_name = file_name.split(' - ')[0]

  # Merge the Pandas DataFrames on their common 'Date' column
  merged_df_outer = pd.merge(df1, df2, on='Date', how='outer')
  merged_df_inner = pd.merge(df1, df2, on='Date', how='inner')

  # 'how' parameter specifies the type of merge:
  # - 'inner': Keeps only rows with a matching 'Date' in both DataFrames (default).
  # - 'left': Keeps all rows from df1 and matching rows from df2.
  # - 'right': Keeps all rows from df2 and matching rows from df1.
  # - 'outer': Keeps all rows from both DataFrames.


  # Specify the Google Drive file paths where the outer merged datasets should be saved
  # ------------------------------- ATTEMPT 1 -------------------------------
  merged_outer_data_path = '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/'+file_name+' - Outer Merged.csv'
  # -------------------------------------------------------------------------

  # ------------------------------- ATTEMPT 2 -------------------------------
  # merged_outer_data_path = '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged Extended/'+file_name+' - Outer Merged Extended.csv'
  # -------------------------------------------------------------------------

  # Save the outer merged Dataset to the specified Google Drive file path
  save_df_to_drive(merged_df_outer, merged_outer_data_path)


  # Specify the Google Drive file paths where the inner merged datasets should be saved
  # ------------------------------- ATTEMPT 1 -------------------------------
  merged_inner_data_path = '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/'+file_name+' - Inner Merged.csv'
  # -------------------------------------------------------------------------

  # ------------------------------- ATTEMPT 2 -------------------------------
  # merged_inner_data_path = '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged Extended/'+file_name+' - Inner Merged Extended.csv'
  # -------------------------------------------------------------------------

  # Save the inner merged Dataset to the specified Google Drive file path
  save_df_to_drive(merged_df_inner, merged_inner_data_path)
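
The difference between the two merges saved by `merge_df` can be seen on toy data. The sketch below uses made-up values; the `B4` column name is an assumption chosen for illustration, and `indicator=True` is an optional pandas argument the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for one tower's flux data and Landsat data (values are made up)
flux = pd.DataFrame({"Date": ["2015-01-01", "2015-01-02", "2015-01-03"],
                     "LE": [10.0, 12.5, 11.0]})
landsat = pd.DataFrame({"Date": ["2015-01-02", "2015-01-04"],
                        "B4": [0.21, 0.25]})       # 'B4' is a hypothetical band column

# Inner merge: keeps only dates present in both DataFrames
inner = pd.merge(flux, landsat, on="Date", how="inner")

# Outer merge: keeps every date; indicator=True adds a '_merge' column
# recording whether each row came from the left, the right, or both
outer = pd.merge(flux, landsat, on="Date", how="outer", indicator=True)

print(inner)
print(outer)
```

The inner result is the one suited to model training because every row has both flux and satellite values, while the outer result shows how many dates were lost on each side.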

#### Individual Dataset Application

##### Extracting Original Flux Tower Datasets

In [20]:
# Specify the Google Drive file path of the individual AmeriFlux Flux Tower Datasets
# ------------------------------- ATTEMPT 1 -------------------------------
folder_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Original'
# -------------------------------------------------------------------------

# ------------------------------- ATTEMPT 2 -------------------------------
# folder_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Original Extended'
# -------------------------------------------------------------------------

# Change the current working directory to the specified folder
os.chdir(folder_path)

# Create a list to store the file names
Ameriflux_datasets = []

# List files in the current directory and sort them alphabetically
file_names = sorted(os.listdir())

# Append the sorted names of regular files (sub-folders are skipped) to the Ameriflux_datasets list
for file_name in file_names:
    if os.path.isfile(file_name):
        Ameriflux_datasets.append(file_name)

# SANITY CHECK: confirm the expected number of file names/datasets
print(len(Ameriflux_datasets))


66
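
The listing step above changes the working directory with `os.chdir`, which also affects any relative path used later in the notebook. A minimal alternative sketch, assuming a hypothetical helper `list_files` and a temporary folder standing in for the Google Drive folder:

```python
import tempfile
from pathlib import Path

def list_files(folder):
    """Return the sorted names of regular files (not sub-folders) in folder."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())

# Demonstrate on a throwaway folder containing two files and one sub-folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv"]:
        (Path(tmp) / name).write_text("x\n")
    (Path(tmp) / "subfolder").mkdir()
    names = list_files(tmp)

print(names)
print(len(names))   # same kind of sanity check as in the cell above
```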


##### Acquire the Filtered Flux Tower and Satellite Datasets

In [21]:
# Loop through the file names in the list and apply the 'data_aquisition' function
for file_name in Ameriflux_datasets:
  try:
    print("-------------------------------- "+file_name+" -----------------------------------")

    # To avoid repeating work after an error interrupts the loop, skip any file whose filtered flux dataset has already been saved

    # ------------------------------- ATTEMPT 1 -------------------------------
    flux_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered/'+file_name+' - Flux Data.csv'
    # -------------------------------------------------------------------------

    # ------------------------------- ATTEMPT 2 -------------------------------
    # flux_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered Extended/'+file_name+' - Flux Data Extended.csv'
    # -------------------------------------------------------------------------

    if os.path.exists(flux_file_path):
      print('Dataset has been extracted')
    else:
      data_aquisition(file_name)

  except KeyError as e:
    # Handle the KeyError
    print(f"Error: {e} - LE column does not exist")

-------------------------------- AMF_US-ASH_BASE_HH_1-5.csv -----------------------------------
Dataset has been extracted
-------------------------------- AMF_US-ASL_BASE_HH_1-5.csv -----------------------------------
Dataset has been extracted
-------------------------------- AMF_US-ASM_BASE_HH_1-5.csv -----------------------------------
Dataset has been extracted
-------------------------------- AMF_US-Bi1_BASE_HH_9-5.csv -----------------------------------
Dataset has been extracted
-------------------------------- AMF_US-Bi2_BASE_HH_14-5.csv -----------------------------------
Dataset has been extracted
-------------------------------- AMF_US-Blo_BASE_HH_4-5.csv -----------------------------------
Dataset has been extracted
-------------------------------- AMF_US-CGG_BASE_HH_1-5.csv -----------------------------------
Dataset has been extracted
-------------------------------- AMF_US-CMW_BASE_HH_2-5.csv -----------------------------------
Dataset has been extracted
---------------
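
The skip-if-already-saved check used in the loop above can be written as a small reusable function. This is a sketch under stated assumptions: `process_pending` and `fake_process` are hypothetical names, and a temporary folder stands in for the Google Drive output folder.

```python
import os
import tempfile

def process_pending(names, out_dir, process):
    """Run process(name, out_path) only for names whose output file is missing."""
    done, skipped = [], []
    for name in names:
        out_path = os.path.join(out_dir, name + " - Flux Data.csv")
        if os.path.exists(out_path):
            skipped.append(name)      # already produced by an earlier run
            continue
        process(name, out_path)
        done.append(name)
    return done, skipped

def fake_process(name, out_path):
    # Stand-in for the real acquisition step: just create the output file
    with open(out_path, "w") as fh:
        fh.write("Date,LE\n")

with tempfile.TemporaryDirectory() as tmp:
    names = ["site_A.csv", "site_B.csv"]
    first_run = process_pending(names, tmp, fake_process)    # processes both
    second_run = process_pending(names, tmp, fake_process)   # skips both

print(first_run)
print(second_run)
```

Because the check is on the output file, rerunning the cell after a crash resumes at the first file that has not yet been written.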

##### Merge the Filtered Flux Tower and Satellite Datasets Together

In [22]:
# Loop through the file names in the list and apply the 'merge_df' function
for file_name in Ameriflux_datasets:
  print("-------------------------------- "+file_name+" -----------------------------------")

  # ------------------------------- ATTEMPT 1 -------------------------------
  flux_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered/'+file_name+' - Flux Data.csv'
  landsat_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Satellite Data/First/'+file_name+' - Landsat Data.csv'
  # -------------------------------------------------------------------------

  # ------------------------------- ATTEMPT 2 -------------------------------
  # flux_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Flux Tower Data/Ameriflux/Filtered Extended/'+file_name+' - Flux Data Extended.csv'
  # landsat_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Satellite Data/Extended/'+file_name+' - Landsat Data Extended.csv'
  # -------------------------------------------------------------------------

  if os.path.exists(flux_file_path) and os.path.exists(landsat_file_path):
    merge_df(file_name)
  else:
    print('No Satellite Date (Likely it preceeds 2013)')

-------------------------------- AMF_US-ASH_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-ASH_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-ASH_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-ASL_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-ASL_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-ASL_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-ASM_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-ASM_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-ASM_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Bi1_BASE_HH_9-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Bi1_BASE_HH_9-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Bi1_BASE_HH_9-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Bi2_BASE_HH_14-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Bi2_BASE_HH_14-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Bi2_BASE_HH_14-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Blo_BASE_HH_4-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-CGG_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-CGG_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-CGG_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-CMW_BASE_HH_2-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-CMW_BASE_HH_2-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-CMW_BASE_HH_2-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-DPW_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-DPW_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-DPW_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-DS3_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-DS3_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-DS3_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Dia_BASE_HH_1-1.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-Dmg_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Dmg_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Dmg_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-EDN_BASE_HH_2-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-EDN_BASE_HH_2-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-EDN_BASE_HH_2-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Elm_BASE_HH_4-1.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Elm_BASE_HH_4-1.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Elm_BASE_HH_4-1.csv - Inner Merged.csv'
-------------------------------- AMF_US-Esm_BASE_HH_5-1.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Esm_BASE_HH_5-1.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Esm_BASE_HH_5-1.csv - Inner Merged.csv'
-------------------------------- AMF_US-Fmf_BASE_HH_6-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-Fuf_BASE_HH_6-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-Fwf_BASE_HH_8-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-Hsm_BASE_HH_2-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Hsm_BASE_HH_2-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Hsm_BASE_HH_2-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-KS1_BASE_HH_3-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-KS2_BASE_HH_3-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-KS3_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-KS3_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-KS3_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-LS1_BASE_HH_1-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-LS2_BASE_HH_1-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-Lin_BASE_HH_2-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-MtB_BASE_HH_4-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-MtB_BASE_HH_4-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-MtB_BASE_HH_4-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Myb_BASE_HH_13-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Myb_BASE_HH_13-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Myb_BASE_HH_13-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-ONA_BASE_HH_3-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-ONA_BASE_HH_3-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-ONA_BASE_HH_3-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-PAS_BASE_HH_1-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-PSH_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-PSH_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-PSH_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-PSL_BASE_HH_1-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-RGB_BASE_HH_3-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-RGB_BASE_HH_3-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-RGB_BASE_HH_3-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-RGo_BASE_HH_2-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-RGo_BASE_HH_2-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-RGo_BASE_HH_2-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-SP1_BASE_HH_4-1.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-SP2_BASE_HH_3-1.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-SP3_BASE_HH_3-1.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-SP4_BASE_HH_3-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-SRC_BASE_HH_6-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-SRC_BASE_HH_6-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-SRC_BASE_HH_6-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-SRG_BASE_HH_15-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-SRG_BASE_HH_15-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-SRG_BASE_HH_15-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-SRM_BASE_HH_26-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-SRM_BASE_HH_26-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-SRM_BASE_HH_26-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-SRS_BASE_HH_3-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-SRS_BASE_HH_3-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-SRS_BASE_HH_3-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Snd_BASE_HH_2-1.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Snd_BASE_HH_2-1.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Snd_BASE_HH_2-1.csv - Inner Merged.csv'
-------------------------------- AMF_US-Sne_BASE_HH_7-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Sne_BASE_HH_7-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Sne_BASE_HH_7-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Snf_BASE_HH_3-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Snf_BASE_HH_3-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Snf_BASE_HH_3-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Srr_BASE_HH_1-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Srr_BASE_HH_1-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Srr_BASE_HH_1-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-SuM_BASE_HH_2-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-SuM_BASE_HH_2-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-SuM_BASE_HH_2-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-SuS_BASE_HH_2-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-SuS_BASE_HH_2-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-SuS_BASE_HH_2-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-SuW_BASE_HH_2-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-Ton_BASE_HH_17-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Ton_BASE_HH_17-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Ton_BASE_HH_17-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Tw1_BASE_HH_9-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Tw1_BASE_HH_9-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Tw1_BASE_HH_9-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Tw2_BASE_HH_2-5.csv -----------------------------------
No Satellite Date (Likely it preceeds 2013)
-------------------------------- AMF_US-Tw3_BASE_HH_5-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Tw3_BASE_HH_5-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Tw3_BASE_HH_5-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Tw4_BASE_HH_12-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Tw4_BASE_HH_12-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Tw4_BASE_HH_12-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Tw5_BASE_HH_3-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Tw5_BASE_HH_3-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Tw5_BASE_HH_3-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Twt_BASE_HH_7-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Twt_BASE_HH_7-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Twt_BASE_HH_7-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Var_BASE_HH_18-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Var_BASE_HH_18-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Var_BASE_HH_18-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Whs_BASE_HH_21-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Whs_BASE_HH_21-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Whs_BASE_HH_21-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-Wkg_BASE_HH_21-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-Wkg_BASE_HH_21-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-Wkg_BASE_HH_21-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xCL_BASE_HH_6-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xCL_BASE_HH_6-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xCL_BASE_HH_6-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xDS_BASE_HH_6-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xDS_BASE_HH_6-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xDS_BASE_HH_6-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xPU_BASE_HH_5-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xPU_BASE_HH_5-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xPU_BASE_HH_5-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xSB_BASE_HH_6-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xSB_BASE_HH_6-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xSB_BASE_HH_6-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xSJ_BASE_HH_6-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xSJ_BASE_HH_6-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xSJ_BASE_HH_6-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xSP_BASE_HH_7-5.csv -----------------------------------




DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xSP_BASE_HH_7-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xSP_BASE_HH_7-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xSR_BASE_HH_7-5.csv -----------------------------------




  df2 = pd.read_csv(file_path_2, error_bad_lines=False)


DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xSR_BASE_HH_7-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xSR_BASE_HH_7-5.csv - Inner Merged.csv'
-------------------------------- AMF_US-xTE_BASE_HH_7-5.csv -----------------------------------




  df2 = pd.read_csv(file_path_2, error_bad_lines=False)


DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Outer Merged/AMF_US-xTE_BASE_HH_7-5.csv - Outer Merged.csv'
DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged/AMF_US-xTE_BASE_HH_7-5.csv - Inner Merged.csv'
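
The `df2 = pd.read_csv(file_path_2, error_bad_lines=False)` line that recurs in the output above appears to be the source line echoed by a pandas deprecation warning: `error_bad_lines` was deprecated in pandas 1.3 in favour of `on_bad_lines` and removed in pandas 2.0. A minimal sketch of the replacement call is given below. The in-memory CSV text is made up for illustration, and the choice of `'skip'` over `'warn'` is an assumption that should be adjusted to how malformed rows ought to be handled.

```python
import io

import pandas as pd

# Hypothetical CSV text: the third line has one field too many
csv_text = (
    "Date,Daily LE\n"
    "20161019,3743.28\n"
    "20161026,3511.88,extra\n"
    "20161104,2136.69\n"
)

# on_bad_lines='skip' drops rows that do not fit the header;
# on_bad_lines='warn' drops them too but reports each one
df2 = pd.read_csv(io.StringIO(csv_text), on_bad_lines='skip')

print(df2.shape)
```

In the notebook itself, `io.StringIO(csv_text)` would be replaced by `file_path_2`.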


##### Combine All Individual Datasets into One

In [23]:
# Define the path to the Google Drive folder where your CSV files are located

# ------------------------------- ATTEMPT 1 -------------------------------
folder_path = '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged'
# -------------------------------------------------------------------------

# ------------------------------- ATTEMPT 2 -------------------------------
# folder_path = '/content/drive/My Drive/Colab Notebooks/Data/Merged Data/Inner Merged Extended'
# -------------------------------------------------------------------------

# Initialize an empty list to store DataFrames
dfs = []

# Loop through the files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file is a CSV file
    if file_name.endswith('.csv'):
        # Read the CSV file into a DataFrame and append it to the list
        df = pd.read_csv(os.path.join(folder_path, file_name))
        dfs.append(df)

# Concatenate all DataFrames in the list into one DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# combined_df_file_path = folder_path = '/content/drive/My Drive/Colab Notebooks/Data/Combined Dataset with error.csv' # Including error messages and missing band values
# save_df_to_drive(combined_df, combined_df_file_path)

# Specify the column in which NaN values are allowed
column_to_exclude = 'error'  # Replace with the name of the specific column
# Remove rows with NaN values in any other column.
# The column list is taken from combined_df rather than df, which only holds the last file read in the loop.
df_combined_and_cleaned = combined_df.dropna(subset=[col for col in combined_df.columns if col != column_to_exclude])

# ------------------------------- ATTEMPT 1 -------------------------------
combined_and_cleaned_df_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Combined Dataset.csv'
# -------------------------------------------------------------------------

# ------------------------------- ATTEMPT 2 -------------------------------
# combined_and_cleaned_df_file_path = folder_path = '/content/drive/My Drive/Colab Notebooks/Data/Combined Dataset Extended.csv'
# -------------------------------------------------------------------------

save_df_to_drive(df_combined_and_cleaned, combined_and_cleaned_df_file_path)

DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Combined Dataset.csv'
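
The cleaning step above keeps a row only if every column other than `error` holds a value. The sketch below illustrates that behaviour on a small, made-up frame; `toy_df` and its values are hypothetical stand-ins for `combined_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for combined_df
toy_df = pd.DataFrame({
    'Daily LE': [3743.28, np.nan, 2136.69],
    'B1': [0.129, 0.125, np.nan],
    'error': [np.nan, np.nan, 'missing band'],
})

# Every column except 'error' must be non-null for a row to survive
required = [col for col in toy_df.columns if col != 'error']
cleaned = toy_df.dropna(subset=required)

print(cleaned.index.tolist())
```

Only the first row survives, even though its `error` value is NaN. `dropna` keeps the original index labels, so the index of the cleaned frame can contain gaps; calling `reset_index(drop=True)` afterwards is one way to renumber the rows if that matters downstream.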


In [24]:
# SANITY CHECK: check number of columns and rows, variables and their data types
print(df_combined_and_cleaned.info())
df_combined_and_cleaned.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1442 entries, 0 to 1506
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              1442 non-null   object 
 1   Daily LE          1442 non-null   float64
 2   Daily Count       1442 non-null   object 
 3   Landsat Image ID  1442 non-null   object 
 4   Latitude          1442 non-null   float64
 5   Longitude         1442 non-null   float64
 6   B1                1442 non-null   float64
 7   B2                1442 non-null   float64
 8   B3                1442 non-null   float64
 9   B4                1442 non-null   float64
 10  B5                1442 non-null   float64
 11  B6                1442 non-null   float64
 12  B7                1442 non-null   float64
 13  B8                1442 non-null   float64
 14  B9                1442 non-null   float64
 15  B10               1442 non-null   float64
 16  B11               1442 non-null   float64


Unnamed: 0,Date,Daily LE,Daily Count,Landsat Image ID,Latitude,Longitude,B1,B2,B3,B4,...,B10,B11,BQA,Cloud Cover,NDVI,EVI,SAVI,VARI,NDWI,error
0,20161019,3743.284896,2304,LANDSAT/LC08/C01/T1_TOA/LC08_042035_20161019,36.1697,-120.201,0.129312,0.107969,0.094198,0.080152,...,294.060547,292.998322,2720.0,0.15,0.524457,0.476226,0.316796,0.211592,-0.463478,
1,20161026,3511.882929,2304,LANDSAT/LC08/C01/T1_TOA/LC08_043035_20161026,36.1697,-120.201,0.125262,0.104519,0.09053,0.077565,...,296.055389,294.163574,2720.0,4.15,0.515012,0.445805,0.301392,0.203928,-0.455997,
2,20161104,2136.6915,2304,LANDSAT/LC08/C01/T1_TOA/LC08_042035_20161104,36.1697,-120.201,0.170604,0.14783,0.132317,0.113736,...,284.938782,283.150085,6816.0,11.01,0.467943,0.563557,0.323537,0.18917,-0.406802,
3,20161111,2517.055224,2304,LANDSAT/LC08/C01/T1_TOA/LC08_043035_20161111,36.1697,-120.201,0.153352,0.127818,0.108407,0.090966,...,289.738831,288.493958,2720.0,35.87,0.47281,0.484864,0.289609,0.243744,-0.401956,
4,20161127,2055.056845,2304,LANDSAT/LC08/C01/T1_TOA/LC08_043035_20161127,36.1697,-120.201,0.129083,0.102393,0.081025,0.059895,...,284.973633,284.453094,2720.0,29.63,0.582329,0.510228,0.318404,0.548454,-0.473752,


##### Process the Single Dataset for Machine Learning Application

In [25]:
# Extract only the necessary variables for the machine learning model dataset

## ------------------------------------------------------------------------ Without Date ------------------------------------------------------------------------
# ml_columns = ['Daily LE', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10', 'B11', 'BQA', 'Cloud Cover', 'NDVI', 'EVI', 'SAVI', 'VARI', 'NDWI']
# ml_df = df_combined_and_cleaned[ml_columns]

# # ------------------------------- ATTEMPT 1 -------------------------------
# # ml_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset.csv'
# # -------------------------------------------------------------------------

# # ------------------------------- ATTEMPT 2 -------------------------------
# ml_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset Extended.csv'
# # -------------------------------------------------------------------------
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------

# -------------------------------------------------------------------------- With Date --------------------------------------------------------------------------
ml_columns = ['Daily LE', 'Date', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10', 'B11', 'BQA', 'Cloud Cover', 'NDVI', 'EVI', 'SAVI', 'VARI', 'NDWI']
ml_df = df_combined_and_cleaned[ml_columns]

# ------------------------------- ATTEMPT 1 -------------------------------
ml_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset (Date).csv'
# -------------------------------------------------------------------------

# ------------------------------- ATTEMPT 2 -------------------------------
# ml_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset Extended (Date).csv'
# -------------------------------------------------------------------------
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------

save_df_to_drive(ml_df, ml_file_path)

ml_df.head()


DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset (Date).csv'


Unnamed: 0,Daily LE,Date,B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,BQA,Cloud Cover,NDVI,EVI,SAVI,VARI,NDWI
0,3743.284896,20161019,0.129312,0.107969,0.094198,0.080152,0.256945,0.156945,0.087725,0.09032,0.00113,294.060547,292.998322,2720.0,0.15,0.524457,0.476226,0.316796,0.211592,-0.463478
1,3511.882929,20161026,0.125262,0.104519,0.09053,0.077565,0.242298,0.150072,0.084479,0.084928,0.001056,296.055389,294.163574,2720.0,4.15,0.515012,0.445805,0.301392,0.203928,-0.455997
2,2136.6915,20161104,0.170604,0.14783,0.132317,0.113736,0.313797,0.187037,0.115338,0.123078,0.030207,284.938782,283.150085,6816.0,11.01,0.467943,0.563557,0.323537,0.18917,-0.406802
3,2517.055224,20161111,0.153352,0.127818,0.108407,0.090966,0.254131,0.143038,0.080938,0.100026,0.012391,289.738831,288.493958,2720.0,35.87,0.47281,0.484864,0.289609,0.243744,-0.401956
4,2055.056845,20161127,0.129083,0.102393,0.081025,0.059895,0.226909,0.106365,0.04933,0.071254,0.001072,284.973633,284.453094,2720.0,29.63,0.582329,0.510228,0.318404,0.548454,-0.473752
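
The column selection above can be made a little more defensive. The sketch below is one possible approach, not the notebook's method: `select_ml_columns` is a hypothetical helper that fails with a clear message when an expected column is absent and returns an independent copy, so later column assignments do not touch the source frame. The toy data is made up.

```python
import pandas as pd


def select_ml_columns(df, columns):
    """Return an independent copy of df restricted to `columns` (hypothetical helper)."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"Columns missing from the dataset: {missing}")
    return df.loc[:, columns].copy()


# Hypothetical stand-in for df_combined_and_cleaned
toy_df = pd.DataFrame({
    'Date': ['20161019', '20161026'],
    'Daily LE': [3743.28, 3511.88],
    'Latitude': [36.1697, 36.1697],
    'NDVI': [0.524, 0.515],
})

toy_ml_df = select_ml_columns(toy_df, ['Daily LE', 'Date', 'NDVI'])
toy_ml_df['NDVI'] = toy_ml_df['NDVI'].round(2)  # changes the copy only

print(toy_ml_df.columns.tolist())
print(toy_df['NDVI'].tolist())
```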


##### Date Variable Variation

In [26]:
# Until this point the Date variable was being 'fed' into the machine learning (ML) model as a string.
# This is an investigation into whether performance could be improved by using a different data type.

# NOTE: this binds a second name to ml_df rather than copying it,
# so the column assignments below also modify ml_df
df_epoch = ml_df

# Convert Date to datetime, then to Epoch (numeric representation)
df_epoch['Date'] = pd.to_datetime(df_epoch['Date'], format='%Y%m%d')
df_epoch['Date Epoch'] = df_epoch['Date'].astype(int) // 10**9  # Convert to seconds since epoch
df_epoch = df_epoch.drop('Date', axis=1)

df_epoch_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset (Date) - epoch.csv'
save_df_to_drive(df_epoch, df_epoch_file_path)

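# In addition to the Epoch value, include the year, month and day of the date as separate columns.
# ml_df already holds the datetime 'Date' and the 'Date Epoch' columns, because `df_epoch = ml_df` above did not make a copy.
df_epoch_and = ml_df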
df_epoch_and['Year'] = df_epoch_and['Date'].dt.year
df_epoch_and['Month'] = df_epoch_and['Date'].dt.month
df_epoch_and['Day'] = df_epoch_and['Date'].dt.day
df_epoch_and = df_epoch_and.drop('Date', axis=1)

df_epoch_and_file_path = '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset (Date) - epoch, year, month, day.csv'
save_df_to_drive(df_epoch_and, df_epoch_and_file_path)

df_epoch.head()
df_epoch_and.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_epoch['Date'] = pd.to_datetime(df_epoch['Date'], format='%Y%m%d')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_epoch['Date Epoch'] = df_epoch['Date'].astype(int) // 10**9  # Convert to seconds since epoch


DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset (Date) - epoch.csv'


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_epoch_and['Year'] = df_epoch_and['Date'].dt.year
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_epoch_and['Month'] = df_epoch_and['Date'].dt.month
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_epoch_and['Day'] = df_epoch_and['Date'].dt.day


DataFrame saved to Google Drive at '/content/drive/My Drive/Colab Notebooks/Data/Machine Learning Dataset (Date) - epoch, year, month, day.csv'


Unnamed: 0,Daily LE,B1,B2,B3,B4,B5,B6,B7,B8,B9,...,Cloud Cover,NDVI,EVI,SAVI,VARI,NDWI,Date Epoch,Year,Month,Day
0,3743.284896,0.129312,0.107969,0.094198,0.080152,0.256945,0.156945,0.087725,0.09032,0.00113,...,0.15,0.524457,0.476226,0.316796,0.211592,-0.463478,1476835200,2016,10,19
1,3511.882929,0.125262,0.104519,0.09053,0.077565,0.242298,0.150072,0.084479,0.084928,0.001056,...,4.15,0.515012,0.445805,0.301392,0.203928,-0.455997,1477440000,2016,10,26
2,2136.6915,0.170604,0.14783,0.132317,0.113736,0.313797,0.187037,0.115338,0.123078,0.030207,...,11.01,0.467943,0.563557,0.323537,0.18917,-0.406802,1478217600,2016,11,4
3,2517.055224,0.153352,0.127818,0.108407,0.090966,0.254131,0.143038,0.080938,0.100026,0.012391,...,35.87,0.47281,0.484864,0.289609,0.243744,-0.401956,1478822400,2016,11,11
4,2055.056845,0.129083,0.102393,0.081025,0.059895,0.226909,0.106365,0.04933,0.071254,0.001072,...,29.63,0.582329,0.510228,0.318404,0.548454,-0.473752,1480204800,2016,11,27
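
The warnings in the output above arise because columns are assigned to a frame that pandas treats as derived from another one. A minimal sketch of the same date features built without in-place assignment on the source frame is given below. It is an alternative under stated assumptions, not the notebook's code: `toy_ml_df` is a made-up stand-in for `ml_df`, and the epoch is computed by timedelta division rather than by `astype(int)`, so the result does not depend on the datetime resolution pandas chooses.

```python
import pandas as pd

# Hypothetical stand-in for ml_df, with Date stored as a YYYYMMDD string
toy_ml_df = pd.DataFrame({
    'Date': ['20161019', '20161104'],
    'Daily LE': [3743.28, 2136.69],
})

# Parse the dates once, without writing back into toy_ml_df
dates = pd.to_datetime(toy_ml_df['Date'], format='%Y%m%d')

# drop() returns a new DataFrame, so the assignments below leave toy_ml_df untouched
features = toy_ml_df.drop(columns='Date')
features['Date Epoch'] = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
features['Year'] = dates.dt.year
features['Month'] = dates.dt.month
features['Day'] = dates.dt.day

print(features)
```

The two epoch values, 1476835200 and 1478217600, match the `Date Epoch` entries shown for the same dates in the table above.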
