# STEM Enhancement in Earth Science 2025 Summer Internship

**Team:** Virtual - Hack the GLOBE!

**Subteam:** Metadata Augmentation

**Members:** Jordan Tran

**Dataset:** GLOBE Surface Temperature Protocol

**Original Repository:** https://github.com/DerpCake1/Hack-The-Globe-SEES

**Summary:** 
Over two weeks, working with a sample of GLOBE's surface temperature dataset, I explored ways to improve data quality through an interpretable flagging system and generated metadata.

#### Exploration:
To improve the surface temperature dataset and make its data more accessible to future researchers, I explored metadata augmentation: deriving new columns from existing ones that could feed models or reveal trends not obvious in the original data.

Country/Continent/Biome data was added to enrich the dataset with new information derived from the site coordinates.

Year/Month/Country Code were generated to make future data projects with the dataset easier.

#### Takeaways:

Through this project I learned how to use dataframes to derive data that wasn't explicitly given, what that new data could be used for, the science behind surface temperature, and strategies for classifying outliers in this specific dataset.



# Importing Libraries


In [1]:
!pip install pycountry-convert



In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# libraries for metadata augmentation
import reverse_geocode
import pycountry_convert as pc
import pyarrow
import geopandas as gpd
import folium

# Initialize Data

Pandas will convert the original surface temperature parquet into a dataframe that can be used to generate new columns to improve the data.

To generate **Biome** data, an external shapefile (.shp) and its helper files must be downloaded. Complete the following steps to generate the biome column.

1. Download and unpack the zip file onto your local device. (https://files.worldwildlife.org/wwfcmsprod/files/Publication/file/6kcchn7e3u_official_teow.zip)
2. Move ONLY the **.dbf, .prj, .shp, and .shx** files to the current directory.
3. When reading the shapefile (see the **Biomes** section), make sure the file path points to where the file is located on your LOCAL device.
```
global_biomes = gpd.read_file("INSERT PATH HERE/wwf_terr_ecos.shp")
```
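
Before calling `gpd.read_file`, it can help to confirm that the companion files from step 2 sit next to the `.shp`. The following is a minimal sketch using only the standard library; the helper name `missing_shapefile_parts` is hypothetical and not part of geopandas, and the path is a placeholder.

```python
from pathlib import Path

# Extensions listed in step 2; all must share the .shp file's base name
REQUIRED_EXTENSIONS = (".dbf", ".prj", ".shp", ".shx")

def missing_shapefile_parts(shp_path):
    """Return the required companion files that are absent next to shp_path."""
    shp = Path(shp_path)
    return [
        shp.with_suffix(ext).name
        for ext in REQUIRED_EXTENSIONS
        if not shp.with_suffix(ext).exists()
    ]

# Replace with the same local path passed to gpd.read_file
missing_parts = missing_shapefile_parts("wwf_terr_ecos.shp")
if missing_parts:
    print("Missing files:", ", ".join(missing_parts))
else:
    print("All shapefile parts found.")
```

An empty list means every required file was found in the same directory as the `.shp`.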

In [3]:
# Modifies Pandas settings to view only a set amount of rows and columns; used to display only a small subset of values while viewing all columns
pd.options.display.max_columns = 70
pd.options.display.max_rows = 20

# Initializes the dataframe, provides the data types of all columns, and displays the dataset in a table
# IF DOWNLOADED ON LOCAL DEVICE, PLEASE CHANGE THE FILE PATH TO WHERE THE PARQUET IS STORED
df = pd.read_parquet('https://github.com/IGES-Geospatial/Hack-the-GLOBE-2025/raw/refs/heads/main/inputs/mv_surface_temperatures_wide.parquet')
display(df.info())
display(df)

# Percent of missing values per column (WRITTEN BY MARLIN WONG)
pd.options.display.max_rows = 100
missing = df.isnull().mean().sort_values(ascending=False) * 100
print(missing[missing > 0])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 716031 entries, 0 to 716030
Data columns (total 64 columns):
 #   Column                            Non-Null Count   Dtype         
---  ------                            --------------   -----         
 0   st_id                             716031 non-null  Int64         
 1   site_id                           716031 non-null  Int64         
 2   measured_at                       716031 non-null  datetime64[ns]
 3   protocol_id                       716031 non-null  Int64         
 4   userid                            716031 non-null  Int64         
 5   surface_condition                 584465 non-null  object        
 6   organizationid                    716031 non-null  int64         
 7   usertype                          716031 non-null  Int64         
 8   submission_id                     90065 non-null   Int64         
 9   st_updated_at                     716031 non-null  object        
 10  st_created_at                   

None

Unnamed: 0,st_id,site_id,measured_at,protocol_id,userid,surface_condition,organizationid,usertype,submission_id,st_updated_at,st_created_at,sts_id,sample_number,sample_surface_temperature_c,sample_snow_depth_mm,sample_snow_depth_flag,version_id,version,site_version_activated_at,version_date,site_version_comments,homogeneous_site_short_length_m,homogeneous_site_long_length_m,surface_cover_type,instrument_type,protocol_name,protocol_model,protocol_association_name,protocol_alt_name,protocol_investigation_area,user_type_description,submission_comments,submission_developer_key_id,submission_access_code_id,submission_latitude,submission_longitude,submission_elevation,submission_point,submission_data,protocol_set_name,protocol_set_code,site_name,site_activated_at,site_deactivated_at,site_comments,site_latitude,site_longitude,site_elevation,site_elevation_type,site_location_source,site_point,site_developer_key_id,site_is_citizen_science,site_nickname,site_true_latitude,site_true_longitude,site_true_elevation,site_true_point,site_photo_measured_at,site_photo_primary_thumb_url,site_photo_primary_photo_url,site_photo_photo_data,developer_key_name,developer_key_is_citizen_science
0,1957,10652,2008-01-21 11:00:00,8,-1,,166361,-1,,2012-07-03 13:56:42.722429,2012-07-03 13:56:42.722416,12952,1,6.5,0.0,,7248,1,2008-01-21 11:00:00,2008-02-05 09:20:51,please replace with Surface Temperature Site C...,30.0,30.0,short grass,raytek st20,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,not categorized,,,,,,,,,,,Gim7RZ/PL/143:ATM189,2008-01-31,NaT,,50.146600,22.173800,179.1,ellipsoidal,gps,01010000A0E6100000780B24287E2C3640FBCBEEC9C312...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
1,1957,10652,2008-01-21 11:00:00,8,-1,,166361,-1,,2012-07-03 13:56:42.722429,2012-07-03 13:56:42.722416,12959,2,6.0,0.0,,7248,1,2008-01-21 11:00:00,2008-02-05 09:20:51,please replace with Surface Temperature Site C...,30.0,30.0,short grass,raytek st20,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,not categorized,,,,,,,,,,,Gim7RZ/PL/143:ATM189,2008-01-31,NaT,,50.146600,22.173800,179.1,ellipsoidal,gps,01010000A0E6100000780B24287E2C3640FBCBEEC9C312...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
2,1957,10652,2008-01-21 11:00:00,8,-1,,166361,-1,,2012-07-03 13:56:42.722429,2012-07-03 13:56:42.722416,12966,3,6.5,0.0,,7248,1,2008-01-21 11:00:00,2008-02-05 09:20:51,please replace with Surface Temperature Site C...,30.0,30.0,short grass,raytek st20,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,not categorized,,,,,,,,,,,Gim7RZ/PL/143:ATM189,2008-01-31,NaT,,50.146600,22.173800,179.1,ellipsoidal,gps,01010000A0E6100000780B24287E2C3640FBCBEEC9C312...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
3,1958,10652,2008-01-22 11:00:00,8,-1,,166361,-1,,2012-07-03 13:56:42.743904,2012-07-03 13:56:42.743891,12953,1,6.0,0.0,,7248,1,2008-01-21 11:00:00,2008-02-05 09:20:51,please replace with Surface Temperature Site C...,30.0,30.0,short grass,raytek st20,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,not categorized,,,,,,,,,,,Gim7RZ/PL/143:ATM189,2008-01-31,NaT,,50.146600,22.173800,179.1,ellipsoidal,gps,01010000A0E6100000780B24287E2C3640FBCBEEC9C312...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
4,1958,10652,2008-01-22 11:00:00,8,-1,,166361,-1,,2012-07-03 13:56:42.743904,2012-07-03 13:56:42.743891,12960,2,6.0,0.0,,7248,1,2008-01-21 11:00:00,2008-02-05 09:20:51,please replace with Surface Temperature Site C...,30.0,30.0,short grass,raytek st20,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,not categorized,,,,,,,,,,,Gim7RZ/PL/143:ATM189,2008-01-31,NaT,,50.146600,22.173800,179.1,ellipsoidal,gps,01010000A0E6100000780B24287E2C3640FBCBEEC9C312...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716026,166969,318180,2025-03-27 17:40:00,8,107437128,dry,107437161,11,59928075,2025-03-27 18:13:17.113169,2025-03-27 18:13:17.113169,729233,5,31.9,,,33853,1,2023-06-14 18:39:37.794328,2023-06-14 18:39:37.794322,,53.0,50.0,asphalt,Etekcity,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,non-student user - trained,,5,,40.117903,-83.160308,279.5,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,,Surface Temperature,9808,asphalt parking lot,2023-06-14,NaT,Asphalt Parking,40.117903,-83.160308,279.5,,other,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
716027,166969,318180,2025-03-27 17:40:00,8,107437128,dry,107437161,11,59928075,2025-03-27 18:13:17.113169,2025-03-27 18:13:17.113169,729234,6,32.0,,,33853,1,2023-06-14 18:39:37.794328,2023-06-14 18:39:37.794322,,53.0,50.0,asphalt,Etekcity,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,non-student user - trained,,5,,40.117903,-83.160308,279.5,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,,Surface Temperature,9808,asphalt parking lot,2023-06-14,NaT,Asphalt Parking,40.117903,-83.160308,279.5,,other,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
716028,166969,318180,2025-03-27 17:40:00,8,107437128,dry,107437161,11,59928075,2025-03-27 18:13:17.113169,2025-03-27 18:13:17.113169,729235,7,31.8,,,33853,1,2023-06-14 18:39:37.794328,2023-06-14 18:39:37.794322,,53.0,50.0,asphalt,Etekcity,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,non-student user - trained,,5,,40.117903,-83.160308,279.5,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,,Surface Temperature,9808,asphalt parking lot,2023-06-14,NaT,Asphalt Parking,40.117903,-83.160308,279.5,,other,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False
716029,166969,318180,2025-03-27 17:40:00,8,107437128,dry,107437161,11,59928075,2025-03-27 18:13:17.113169,2025-03-27 18:13:17.113169,729236,8,31.5,,,33853,1,2023-06-14 18:39:37.794328,2023-06-14 18:39:37.794322,,53.0,50.0,asphalt,Etekcity,Surface Temperature,SurfaceTemperature,surface_temperature,Surface Temperature,Atmosphere,non-student user - trained,,5,,40.117903,-83.160308,279.5,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,,Surface Temperature,9808,asphalt parking lot,2023-06-14,NaT,Asphalt Parking,40.117903,-83.160308,279.5,,other,01010000A0E610000061527C7C42CA54C0D68C0C72170F...,1,False,,,,,,NaT,,,,GLOBE Data Entry Web Forms,False


submission_access_code_id           100.000000
site_nickname                       100.000000
site_true_point                      99.988269
site_true_elevation                  99.988269
site_true_longitude                  99.988269
site_true_latitude                   99.988269
site_deactivated_at                  99.913132
submission_comments                  98.645729
submission_data                      96.842734
sample_snow_depth_flag               96.560903
submission_point                     90.061185
submission_elevation                 90.061185
submission_longitude                 90.061185
submission_latitude                  90.061185
protocol_set_name                    90.061185
protocol_set_code                    90.061185
submission_developer_key_id          87.422332
submission_id                        87.421634
site_photo_primary_photo_url         86.794706
site_photo_primary_thumb_url         86.794706
site_photo_measured_at               86.794147
site_photo_ph
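
Columns that are almost entirely empty, such as `submission_access_code_id` and `site_nickname` above, carry little usable signal. Below is a small sketch of how the same `isnull().mean()` calculation could be used to list them. The toy dataframe and the 95% cutoff are assumptions chosen for illustration, not values used in this project.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up)
toy = pd.DataFrame({
    "site_nickname": [None, None, None, None],
    "surface_condition": ["dry", None, "wet", "dry"],
    "st_id": [1, 2, 3, 4],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100

# Columns above the (assumed) cutoff are candidates to ignore in later analysis
THRESHOLD_PCT = 95
mostly_empty = missing_pct[missing_pct > THRESHOLD_PCT].index.tolist()
print(mostly_empty)  # ['site_nickname']
```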

# Metadata Augmentation

Metadata augmentation uses the provided data and generates new data that can be used to improve future models and provide more information about a given datapoint.

Columns were added to provide information that was not previously in the dataset and to make available information easier to access for future data projects.

Metadata added:
- Countries
- Country Code
- Continent
- Year
- Month
- Biome

### Country / Country Code / Continent

Useful for comparing surface temperature between countries (e.g., the average surface temperature of one country vs. another).
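
As an illustration of that comparison, a `groupby` on a `country` column gives a per-country average. The sketch below uses a few made-up rows rather than the real dataset; `sample_surface_temperature_c` is the temperature column from the dataframe above.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "country": ["Poland", "Poland", "United States", "United States"],
    "sample_surface_temperature_c": [6.5, 6.0, 31.9, 32.0],
})

# Average surface temperature per country, warmest first
avg_by_country = (
    toy.groupby("country")["sample_surface_temperature_c"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_country)
```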

In [4]:
# Uses an offline library to reverse geocode the country where each measurement was taken, based on its site latitude and longitude.
# Missing coordinates return (None, None); an empty search result returns ("", "").
def reverse_geo(lat, lon):
    if pd.isna(lat) or pd.isna(lon):
        return None, None

    coords = [(lat, lon)]
    # Perform reverse geocoding
    location = reverse_geocode.search(coords)
    
    # Extract country information
    if location:
        return location[0].get("country"), location[0].get("country_code")
    else:
        return "", ""

# function that takes a given row's latitude and longitude in order to use .apply and make a new column with countries
def get_country_from_row(row):
    lat = row['site_latitude']
    lon = row['site_longitude']
    return reverse_geo(lat, lon)


reverse_geo(37.698551,-122.073959) # test case
df[["country", 'country_code']] = df.apply(get_country_from_row, axis=1, result_type='expand')
display(df["country_code"])

0         PL
1         PL
2         PL
3         PL
4         PL
          ..
716026    US
716027    US
716028    US
716029    US
716030    US
Name: country_code, Length: 716031, dtype: object
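
Row-wise `apply` calls the geocoder once per row, yet many rows share the same site coordinates. One possible speed-up, sketched here under assumptions, is to geocode each unique coordinate pair once and merge the result back. The helper `geocode_unique` is hypothetical. Its `lookup` argument stands for any function that takes a list of `(lat, lon)` tuples and returns one dict per tuple with `country` and `country_code` keys, which is the shape `reverse_geo` relies on from `reverse_geocode.search`. A stub is used below so the example runs without the library.

```python
import pandas as pd

def geocode_unique(df, lookup):
    """Add country columns by geocoding each distinct coordinate pair once."""
    coords = (
        df[["site_latitude", "site_longitude"]]
        .dropna()
        .drop_duplicates()
        .reset_index(drop=True)
    )
    results = lookup(list(coords.itertuples(index=False, name=None)))
    coords["country"] = [r.get("country") for r in results]
    coords["country_code"] = [r.get("country_code") for r in results]
    # Left merge keeps every original row in order; rows with missing
    # coordinates get NaN. Note that merge returns a fresh 0..n-1 index.
    return df.merge(coords, on=["site_latitude", "site_longitude"], how="left")

# Stub lookup standing in for the real geocoder (hypothetical data)
def fake_lookup(pairs):
    table = {(50.1466, 22.1738): ("Poland", "PL")}
    out = []
    for pair in pairs:
        country, code = table.get(pair, ("", ""))
        out.append({"country": country, "country_code": code})
    return out

toy = pd.DataFrame({
    "site_latitude": [50.1466, 50.1466, None],
    "site_longitude": [22.1738, 22.1738, None],
})
print(geocode_unique(toy, fake_lookup))
```

In the notebook, passing `reverse_geocode.search` as `lookup` would be the natural substitution, though that pairing has not been run against the full dataset here.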

In [5]:
# uses pycountry_convert to turn a country name into its respective continent
# MISSING DATA RETURNS "" FOR CONTINENT
def country_to_continent(country):
    try:
        country_alpha2 = pc.country_name_to_country_alpha2(country)
        country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
        return country_continent_name
    except Exception:  # unknown or missing country names fall back to ""
        return ""

def get_cont_from_row(row):
    return country_to_continent(row["country"])

print(country_to_continent("Poland")) # test case
df["continent"] = df.apply(get_cont_from_row, axis=1)
display(df["continent"])


Europe


0                Europe
1                Europe
2                Europe
3                Europe
4                Europe
              ...      
716026    North America
716027    North America
716028    North America
716029    North America
716030    North America
Name: continent, Length: 716031, dtype: object
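
Because `country_to_continent` returns an empty string whenever the conversion fails, it is worth checking which country names ended up without a continent. Here is a short sketch on made-up rows; the values below are illustrative, not results from the dataset.

```python
import pandas as pd

# Made-up rows; "Atlantis" stands in for a name the converter cannot resolve
toy = pd.DataFrame({
    "country": ["Poland", "United States", "Atlantis", None],
    "continent": ["Europe", "North America", "", ""],
})

# Rows where the conversion fell through to the empty-string fallback
unmatched = toy[toy["continent"] == ""]

# Distinct country names to investigate, keeping missing names visible
print(unmatched["country"].value_counts(dropna=False))
```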

### Year / Month

In [6]:
# `measured_at` is datetime64[ns], so year and month can be extracted with the pandas .dt accessor
df['year'] = df['measured_at'].dt.year
df['month'] = df['measured_at'].dt.month
display(df['month'])

0         1
1         1
2         1
3         1
4         1
         ..
716026    3
716027    3
716028    3
716029    3
716030    3
Name: month, Length: 716031, dtype: int32
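
With `year` and `month` available as plain integer columns, grouping by time period needs no further datetime handling. The following is a minimal sketch on made-up measurements, reusing the column names from the dataframe above.

```python
import pandas as pd

# Made-up measurements for illustration only
toy = pd.DataFrame({
    "measured_at": pd.to_datetime(
        ["2008-01-21 11:00", "2008-01-22 11:00", "2025-03-27 17:40"]
    ),
    "sample_surface_temperature_c": [6.5, 6.0, 31.9],
})
toy["year"] = toy["measured_at"].dt.year
toy["month"] = toy["measured_at"].dt.month

# Mean temperature for each (year, month) pair
monthly_mean = toy.groupby(["year", "month"])["sample_surface_temperature_c"].mean()
print(monthly_mean)
```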

### Biomes

The global biome data is sourced from World Wildlife Fund's (WWF) Terrestrial Ecoregions of the World. 

DOI: https://doi.org/10.1641/0006-3568(2001)051[0933:TEOTWA]2.0.CO;2

WWF's Publication: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world

Download URL: https://files.worldwildlife.org/wwfcmsprod/files/Publication/file/6kcchn7e3u_official_teow.zip

In [7]:
# Loads WWF's ecoregion polygons from the .shp file and overlays the site coordinates on them to determine each point's biome
global_biomes = gpd.read_file("https://github.com/IGES-Geospatial/Hack-the-GLOBE-2025/raw/refs/heads/main/inputs/custom/wwf/wwf_terr_ecos.shp")

# turns site coordinates into 'Point' geometries for geopandas (x = longitude, y = latitude)
points = gpd.points_from_xy(df['site_longitude'], df['site_latitude'])

gdf = gpd.GeoDataFrame(df, geometry=points, crs="EPSG:4326")

biomes = global_biomes.to_crs("EPSG:4326")

# left spatial join: keeps every site row and attaches the attributes of the polygon it intersects
result = gpd.sjoin(gdf, biomes, how="left", predicate='intersects')
# a point on a shared polygon border can match more than once; keep the first match so the index aligns with df
result = result[~result.index.duplicated(keep="first")]
# ECO_NAME is the ecoregion name field in the WWF shapefile
df['biome'] = result["ECO_NAME"]
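
The spatial join above boils down to asking, for each site, which polygon contains it. The sketch below shows that idea with a plain-Python ray-casting test on two made-up rectangles. It only illustrates the concept, since geopandas performs the join with compiled geometry routines and a spatial index rather than a loop like this. The region names, shapes, and the helpers `point_in_polygon` and `lookup_region` are all hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the ring is closed implicitly.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Two made-up rectangular "ecoregions" (hypothetical names and shapes)
regions = {
    "Toy Forest": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Toy Desert": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

def lookup_region(lon, lat):
    """Return the name of the first region containing the point, else None."""
    for name, ring in regions.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None

print(lookup_region(5, 5))    # Toy Forest
print(lookup_region(15, 5))   # Toy Desert
print(lookup_region(25, 5))   # None
```

A point outside every polygon returns `None` here, which corresponds to the missing values a left join leaves for sites that fall outside all polygons.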

In [8]:
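# Writes the augmented dataframe back to a parquet file in the 'outputs' directory for future use
from pathlib import Path
Path("../outputs").mkdir(parents=True, exist_ok=True)  # to_parquet does not create missing directories
df.to_parquet("../outputs/JordanTran_Output.parquet")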