## Hack The Globe Notebook: Jordan Tran

Over the past two weeks, using a sample of GLOBE's dataset, I explored ways to improve the dataset's quality through an interpretable flagging system and generated metadata.


#### Exploration:
To improve the surface temperature dataset and make its data more available to future researchers, I explored metadata augmentation, which introduces new pieces of data that can be used in models or to identify trends that are not obvious from the original data.

Country, continent, and biome data were added to enrich the dataset with new information derived from the site coordinates.

Year, month, and country code were generated to make future data projects with the dataset easier.

#### Takeaways:

Through this project I learned how to use dataframes to generate data that wasn't explicitly given, what that new data could be used for, the science behind surface temperature, and strategies for classifying outliers in our specific dataset.


# Importing Libraries

**Pandas** loads the parquet file into a dataframe, which allows vectorized operations on entire columns.
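
As a small illustration of what a column-wise (vectorized) operation looks like, the sketch below converts a temperature column in one expression instead of looping over rows. The frame and the column names `temp_c` and `temp_f` are made up for this example and are not columns of the GLOBE dataset.

```python
import pandas as pd

# Toy frame; `temp_c` is a hypothetical column, not one from the GLOBE parquet.
toy = pd.DataFrame({"temp_c": [20.0, 25.0, 30.0]})

# Vectorized: the arithmetic is applied to the whole column at once.
toy["temp_f"] = toy["temp_c"] * 9 / 5 + 32

# Row-wise equivalent with .apply, shown only for comparison; it gives the
# same values but calls a Python function once per row.
row_wise = toy["temp_c"].apply(lambda c: c * 9 / 5 + 32)

print(toy)
```

On a large dataframe the vectorized form is usually the faster of the two, which is why it is preferred where a column expression exists.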


**pycountry_convert, reverse_geocode, and folium** are not preinstalled in the Kaggle environment and must be *pip installed*.

In [None]:
!pip install pycountry_convert
!pip install reverse_geocode
!pip install folium
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# libraries for metadata augmentation
import reverse_geocode
import pycountry_convert as pc
import geopandas as gpd

# Initialize Data

Pandas will convert the original surface temperature parquet into a dataframe that can be used to generate new columns to improve the data.

In order to generate **Biome** data, an external shape file (.shp) along with its helper files must be downloaded. Complete the following steps to generate the biome column.

1. Download and unpack the zip file on your local device. (https://files.worldwildlife.org/wwfcmsprod/files/Publication/file/6kcchn7e3u_official_teow.zip)
2. Move ONLY **.dbf, .prj, .shp, .shx** to the current directory.
3. When accessing the shapefile (see the **Biomes** section), ensure that the file path points to where the file is located on your LOCAL device.
```
global_biomes = gpd.read_file("INSERT PATH HERE/wwf_terr_ecos.shp")
```
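
Before reading the shapefile, it can help to confirm that the four files from step 2 all sit next to each other. The helper below is a minimal sketch; `missing_shapefile_parts` is a name made up here, and the path passed to it is whatever location you chose in step 3.

```python
from pathlib import Path

# The four extensions listed in step 2.
REQUIRED_EXTENSIONS = (".dbf", ".prj", ".shp", ".shx")

def missing_shapefile_parts(shp_path):
    """Return the required extensions that have no file next to shp_path."""
    base = Path(shp_path)
    return [ext for ext in REQUIRED_EXTENSIONS
            if not base.with_suffix(ext).exists()]

# An empty list means all four files were found.
print(missing_shapefile_parts("wwf_terr_ecos.shp"))
```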

In [None]:
# Modifies Pandas settings to view only a set amount of rows and columns; used to display only a small subset of values while viewing all columns
pd.options.display.max_columns = 70
pd.options.display.max_rows = 20

# Initializes the dataframe, provides the data types of all columns, and displays the dataset in a table
# IF DOWNLOADED ON LOCAL DEVICE, PLEASE CHANGE THE FILE PATH TO WHERE THE PARQUET IS STORED
df = pd.read_parquet('/kaggle/input/hack-the-globe/mv_surface_temperatures_wide.parquet')
df.info()  # prints the summary itself and returns None, so wrapping it in display() is not needed
display(df)

# Percent of missing values per column (WRITTEN BY MARLIN WONG)
pd.options.display.max_rows = 100
missing = df.isnull().mean().sort_values(ascending=False) * 100
print(missing[missing > 0])

# Metadata Augmentation

Metadata augmentation uses the provided data and generates new data that can be used to improve future models and provide more information about a given datapoint.

Columns were added to provide information that was not previously in the dataset and to make available information easier to access in future data projects.

Metadata added:
- Countries
- Country Code
- Continent
- Year
- Month
- Biome

### Country / Country Code / Continent

Useful for comparing surface temperature between countries, e.g. the average surface temperature of one country vs. another.
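
A comparison of that kind reduces to a group-by once a `country` column is available. The sketch below uses a toy frame; `surface_temp_c` is a placeholder name, since the actual temperature column of the parquet is not shown in this notebook.

```python
import pandas as pd

# Toy data; `surface_temp_c` is a hypothetical column name.
toy = pd.DataFrame({
    "country": ["Poland", "Poland", "Kenya", "Kenya"],
    "surface_temp_c": [10.0, 14.0, 28.0, 30.0],
})

# Mean surface temperature per country, highest first.
avg_by_country = (
    toy.groupby("country")["surface_temp_c"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_country)
```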

In [None]:
# Uses an offline library to reverse geocode the country in which a measurement was taken, based on the site latitude and longitude.
# MISSING COORDINATES RETURN None FOR BOTH COUNTRY AND COUNTRY CODE; AN EMPTY LOOKUP RESULT RETURNS ""
def reverse_geo(lat, lon):
    if pd.isna(lat) or pd.isna(lon):
        return None, None

    coords = [(lat, lon)]
    # Perform reverse geocoding
    location = reverse_geocode.search(coords)
    
    # Extract country information
    if location:
        return location[0].get("country"), location[0].get("country_code")
    else:
        return "", ""

# function that takes a given row's latitude and longitude in order to use .apply and make a new column with countries
def get_country_from_row(row):
    lat = row['site_latitude']
    lon = row['site_longitude']
    return reverse_geo(lat, lon)


print(reverse_geo(37.698551, -122.073959)) # test case
df[["country", 'country_code']] = df.apply(get_country_from_row, axis=1, result_type='expand')
display(df["country_code"])
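
If many measurements share the same site (an assumption about this dataset, not something checked here), one possible speed-up is to look up each distinct coordinate pair once and merge the result back, instead of geocoding every row. This is a sketch of that idea, not the method used above. `fake_search` and `add_country_columns` are names written for this example; `fake_search` only stands in for `reverse_geocode.search` so the block runs without the geocoding library.

```python
import pandas as pd

def fake_search(coords):
    # Hypothetical stand-in for reverse_geocode.search: one dict per coordinate.
    return [{"country": "North" if lat >= 0 else "South",
             "country_code": "NN" if lat >= 0 else "SS"} for lat, lon in coords]

def add_country_columns(frame, search):
    # Distinct, non-missing coordinate pairs only.
    sites = (frame[["site_latitude", "site_longitude"]]
             .dropna()
             .drop_duplicates()
             .reset_index(drop=True))
    hits = search(list(zip(sites["site_latitude"], sites["site_longitude"])))
    sites["country"] = [h.get("country") for h in hits]
    sites["country_code"] = [h.get("country_code") for h in hits]
    # Left merge keeps every measurement; rows without coordinates get NaN.
    return frame.merge(sites, on=["site_latitude", "site_longitude"], how="left")

toy = pd.DataFrame({
    "site_latitude": [10.0, 10.0, -5.0, None],
    "site_longitude": [20.0, 20.0, 30.0, 40.0],
})
out = add_country_columns(toy, fake_search)
print(out)
```

Note that `merge` returns a new dataframe with a fresh index rather than modifying the original in place.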

In [None]:
# uses pycountry_convert to turn a country name into its respective continent
# MISSING DATA RETURNS "" FOR CONTINENT
def country_to_continent(country):
    try:
        country_alpha2 = pc.country_name_to_country_alpha2(country)
        country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
        return country_continent_name
    except Exception:  # e.g. an unknown or missing country name
        return ""

def get_cont_from_row(row):
    return country_to_continent(row["country"])

print(country_to_continent("Poland")) # test case
df["continent"] = df.apply(get_cont_from_row, axis=1)
display(df["continent"])
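
The helper above signals a failed lookup with an empty string, which `isnull()` does not count as missing. If the new columns should show up in a missing-value summary, one option is to convert empty strings to `pd.NA` first. A minimal sketch on toy data:

```python
import pandas as pd

# Toy columns mimicking a mix of successful and failed lookups.
toy = pd.DataFrame({
    "country": ["Poland", "", None],
    "continent": ["Europe", "", ""],
})

# Treat empty strings as missing so they are counted by isnull().
cleaned = toy.replace("", pd.NA)

# Percent of missing values per column.
missing_pct = cleaned.isnull().mean() * 100
print(missing_pct)
```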


### Year / Month

In [None]:
# "measured_at" is stored as datetime64[ns], so year and month can be extracted with the pandas .dt accessor
df['year'] = df['measured_at'].dt.year
df['month'] = df['measured_at'].dt.month
display(df['month'])
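
The `.dt` accessor only works on datetime columns. If a copy of the data stores `measured_at` as text (for example after a CSV export, which is an assumption and not the case for the parquet used here), the column needs converting first. A small sketch with made-up timestamps:

```python
import pandas as pd

# Made-up timestamps stored as strings.
toy = pd.DataFrame({"measured_at": ["2021-03-15 10:30:00", "2022-11-02 08:00:00"]})

# Convert to a datetime column so that the .dt accessor becomes available.
toy["measured_at"] = pd.to_datetime(toy["measured_at"])

toy["year"] = toy["measured_at"].dt.year
toy["month"] = toy["measured_at"].dt.month
print(toy)
```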

### Biomes

The global biome data is sourced from World Wildlife Fund's (WWF) Terrestrial Ecoregions of the World. 

DOI: https://doi.org/10.1641/0006-3568(2001)051[0933:TEOTWA]2.0.CO;2

WWF's Publication: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world

Download URL: https://files.worldwildlife.org/wwfcmsprod/files/Publication/file/6kcchn7e3u_official_teow.zip

In [None]:
# Creates polygons out of WWF's .shp file, and overlays site coordinates over it to determine each point's biome
# IF RUNNING ON A LOCAL DEVICE, CHANGE THE FILE PATH TO WHERE THE .shp FILE IS STORED
global_biomes = gpd.read_file("/kaggle/input/global-biome-dataset/wwf_terr_ecos.shp")

# turns site coordinates into 'Point' geometries for geopandas
points = gpd.points_from_xy(df['site_longitude'],df["site_latitude"])

gdf = gpd.GeoDataFrame(df, geometry=points, crs="EPSG:4326")

biomes = global_biomes.to_crs("EPSG:4326")

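# spatially joins each site point to the WWF polygon it intersects (a left join keeps points without a match)
result = gpd.sjoin(gdf, biomes, how="left", predicate='intersects')
# a point on a shared boundary can match more than one polygon; keep the first match so the index stays unique
result = result[~result.index.duplicated(keep="first")]
# ECO_NAME is the ecoregion name field of the WWF shapefile
df['biome'] = result["ECO_NAME"]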
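
To make the overlay step less opaque: a spatial join of points against polygons comes down to asking, for each point, which polygon contains it. The block below is a plain-Python ray-casting test that illustrates this question on a toy square. It is only a conceptual sketch with made-up names (`point_in_polygon`, `square`); geopandas answers the same question with compiled geometry routines and handles holes, boundaries, and multi-part shapes, which this sketch does not.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Toy "ecoregion": a square in (longitude, latitude) order, like points_from_xy.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

print(point_in_polygon(5.0, 5.0, square))
print(point_in_polygon(15.0, 5.0, square))
```

An odd number of crossings means the point is inside the polygon, an even number means it is outside.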

In [None]:
# Return the dataframe back to a parquet that is exported to the current directory for future use
df.to_parquet("updated_meta_HTG_data.parquet")