## Aim

This notebook does the following:
- Extracts the data for India from the [Planted Trees map](https://www.globalforestwatch.org/blog/data-and-tools/updated-planted-trees-map-near-global-coverage/) published by the World Resources Institute (WRI). The data can also be downloaded from this link.
- Converts the GeoPackage data into shapefiles so that they can be ingested into Google Earth Engine (GEE).

More information on this dataset can also be found [here](https://www.wri.org/research/spatial-database-planted-trees-sdpt-version-2?ap3c=IGkMIRz9Ic8arE0EAGkMIRxggQ62SFTnOwQXKFOP5omGChdpJA).


In this notebook, GDAL is used to extract separate shapefiles (.shp extension), one per planted year. Converting the whole layer at once was first attempted with the following GDAL command:

```bash
ogr2ogr -f "ESRI Shapefile" india_plant_v2.shp india_plant_v2.gpkg ind_plant_v2
```

The command produced several warnings, including "2GB file size limit reached for india_plant_v2.dbf". Hence, the large .gpkg file is split into several smaller files, one per planted year. If desired, the split could also be made by plantation class.

For installing GDAL for use with Python, an excellent resource is the course material of Ujaval Gandhi, ["Mastering GDAL Tools (Full Course)"](https://courses.spatialthoughts.com/).

## Reading in the Python packages

In [35]:
import glob
import re
import subprocess
import os

import pandas as pd

In [3]:
## input_dir = "data" # I am running the program in the same directory
input_dir_complete = r"D:\sdpt_v2_v20231128.gpkg"# Path for the entire big file - External hard disk in my case
gpkg_india = 'india_plant_v2.gpkg'

## Extracting India data

First, get some information about the entire file:

In [4]:
cmd = ["ogrinfo", input_dir_complete]

result = subprocess.run(cmd, capture_output=True, text=True)
## print(result.stdout) ## Uncomment this line to see the results. This shows multi polygons per country 


Extract India:

In [5]:
cmd = [
    "ogr2ogr",
    "-f", "GPKG",
    gpkg_india,
    input_dir_complete,
    "ind_plant_v2"
]

result = subprocess.run(cmd, capture_output=True, text=True)

print(result.stdout)
print(result.stderr)  # shows warnings/errors
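
The run-and-print pattern above recurs throughout the notebook. A minimal sketch of a reusable helper follows; `run_command` is a hypothetical name introduced here, not part of GDAL or of the original notebook. It raises an exception on a non-zero return code, so a failed extraction cannot go unnoticed. The demonstration calls the Python interpreter itself so that the sketch runs even where GDAL is not installed.

```python
import subprocess
import sys


def run_command(cmd):
    """Run a command, print its output, and raise if it exits with a non-zero code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.stdout:
        print(result.stdout, end="")
    if result.stderr:
        # Warnings can appear on stderr even when the command itself succeeds
        print("Warnings / errors:\n" + result.stderr, end="")
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[0]} failed with return code {result.returncode}")
    return result


# Demonstration with the Python interpreter, so no GDAL installation is needed here
demo = run_command([sys.executable, "-c", "print('hello')"])
```

With GDAL available, the same helper could be called as `run_command(["ogrinfo", input_dir_complete])`.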






## Converting the GeoPackage to shapefiles
First, just check the newly created India file:

In [11]:
cmd = [
        "ogrinfo",
        "-so", "-al",
         gpkg_india
        ]
# check=True is omitted on purpose: it would raise before the else branch below could report the error
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("ogrinfo ran successfully!\n")
    # Split output into lines and format it
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    for line in lines:
        print(line)
else:
    print("Error running ogrinfo:")
    print(result.stderr)

ogrinfo ran successfully!

INFO: Open of `india_plant_v2.gpkg'
using driver `GPKG' successful.
Layer name: ind_plant_v2
Geometry: Multi Polygon
Feature Count: 1247437
Extent: (69.021570, 6.752422) - (97.180654, 34.820436)
Layer SRS WKT:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
MEMBER["World Geodetic System 1984 (G2296)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0

Also check the coordinate reference system, since GEE needs EPSG:4326:

In [19]:
cmd = [
    "gdalsrsinfo",
    "-o", "epsg",
    gpkg_india  
]

# Run the command and capture the output
result = subprocess.run(cmd, capture_output=True, text=True)

# Print the EPSG code
print("EPSG code:", result.stdout.strip())

EPSG code: EPSG:4326
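
The check above relies on reading the printed value by eye. As a small sketch, the code can also be parsed from the `gdalsrsinfo` output and compared programmatically. `parse_epsg` is a hypothetical helper name, and the sample strings stand in for `result.stdout`.

```python
import re


def parse_epsg(text):
    """Return the first EPSG code found in the text as an int, or None if there is none."""
    match = re.search(r"EPSG:(\d+)", text)
    return int(match.group(1)) if match else None


# Sample strings standing in for result.stdout of the gdalsrsinfo call
print(parse_epsg("\nEPSG:4326\n") == 4326)
print(parse_epsg("no code here"))
```

If the code were anything other than 4326, the layer could be reprojected with the `-t_srs EPSG:4326` option of `ogr2ogr` before conversion.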


Check the different planted years and their counts, and collect the years in a list to get an idea of the distribution across the years:

In [25]:
layer_name = "ind_plant_v2"

cmd = [
    "ogrinfo",
    gpkg_india,
    "-sql",
    f'SELECT plantedYear, COUNT(*) AS count FROM {layer_name} GROUP BY plantedYear ORDER BY plantedYear'
]

# Run the command
result = subprocess.run(cmd, capture_output=True, text=True)

# Raw output
output = result.stdout

# Optional: print warnings/errors
if result.stderr:
    print("Warnings / Errors:\n", result.stderr)

features = re.split(r'OGRFeature\(.+\)', output)  # split by features
print("PlantedYear counts:")

PlantedYear counts:
1982: 184032
1983: 1
1986: 1
1987: 17
1988: 7990
1989: 13976
1990: 15701
1991: 5257
1992: 66324
1993: 14831
1994: 128666
1995: 3918
1996: 22741
1997: 19858
1998: 18231
1999: 10988
2000: 8179
2001: 7883
2002: 35753
2003: 3652
2004: 19302
2005: 47721
2006: 9905
2007: 18733
2008: 9917
2009: 27190
2010: 51044
2011: 6666
2012: 93325
2013: 10676
2014: 14050
2015: 12441
2016: 8978
2017: 10393
2018: 13776
2019: 11666


In [23]:
planted_years = []

print("PlantedYear counts:")
for feat in features:
    if not feat.strip():
        continue
    year_match = re.search(r'plantedYear.*= (\d+)', feat)
    count_match = re.search(r'count.*= (\d+)', feat)
    if year_match and count_match:
        year = int(year_match.group(1))
        count = int(count_match.group(1))
        print(f"{year}: {count}")
        planted_years.append(year)

# Now planted_years contains all unique years
print("List of planted years:", planted_years)
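
Parsing the text output of `ogrinfo` with regular expressions works, but it depends on the exact output layout. Since a GeoPackage is an SQLite database, the same counts can be sketched with Python's built-in `sqlite3` module. This is an alternative to the approach above, not the method used in this notebook, and `count_by_field` is a hypothetical helper name.

```python
import sqlite3


def count_by_field(gpkg_path, layer, field):
    """Return {value: feature count} for one attribute of a GeoPackage layer."""
    # Table and column names cannot be bound as parameters, so they are quoted by hand
    sql = (
        f'SELECT "{field}", COUNT(*) FROM "{layer}" '
        f'GROUP BY "{field}" ORDER BY "{field}"'
    )
    conn = sqlite3.connect(gpkg_path)
    try:
        return dict(conn.execute(sql).fetchall())
    finally:
        conn.close()


# Intended use with the file from this notebook (not run here):
# year_counts = count_by_field("india_plant_v2.gpkg", "ind_plant_v2", "plantedYear")
# planted_years = sorted(year_counts)
```

The function only reads attribute columns, so it does not need GDAL or any spatial extension.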

Do the same for the different plantation types (`originalName`):

In [28]:
cmd = [
    "ogrinfo",
    gpkg_india,
    "-sql",
    f'SELECT originalName, COUNT(*) AS count FROM {layer_name} GROUP BY originalName ORDER BY originalName'
]

result = subprocess.run(cmd, capture_output=True, text=True)
output = result.stdout

if result.stderr:
    print("Warnings / Errors:\n", result.stderr)

features = re.split(r'OGRFeature\(.+\)', output)

print("OriginalName counts:")

for feat in features:
    if not feat.strip():
        continue
    name_match = re.search(r'originalNa.*= (.+)', feat)
    count_match = re.search(r'count.*= (\d+)', feat)
    if name_match and count_match:
        name = name_match.group(1).strip()
        count = count_match.group(1)
        print(f"{name}: {count}")


OriginalName counts:
Acacia: 3121
Almonds: 740
Alnus: 19
Apples: 151
Areca: 12853
Cashew nut: 62
Casuriana: 335
Citrus: 11
Coconut: 1023
Coffee: 10
Cryptomeria: 146
Eucalyptus: 12131
Gliricidia plantation: 153
Mango: 3002
Mixed plantation: 55795
Oil Palm: 8859
Oil palm: 89
Orchard: 298261
Padauk: 109
Pine: 667078
Red oil palm: 36
Rubber: 157
Sal: 116489
Tea: 523
Teak: 66284


If we convert the entire GeoPackage to a single shapefile, we get some warnings (which may potentially lead to problems). Run the cell below to see the output of the command (and especially the warnings in the last lines):

In [29]:
import subprocess

cmd = [
    "ogr2ogr",
    "-f", "ESRI Shapefile",
    "india_plant_v2.shp",
    "india_plant_v2.gpkg",
    "ind_plant_v2"
]

print("Running GDAL command:\n", " ".join(cmd), "\n")

# Run command and stream GDAL output live to the console
process = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

for line in process.stdout:
    print(line, end="")  # Print each line as it comes

process.wait()

if process.returncode == 0:
    print("\nConversion completed successfully!")
else:
    print(f"\nogr2ogr failed with return code {process.returncode}")


Running GDAL command:
 ogr2ogr -f ESRI Shapefile india_plant_v2.shp india_plant_v2.gpkg ind_plant_v2 


Conversion completed successfully!


Let us extract a shapefile for a single year and upload it to GEE:

In [30]:
year = 1982
output_shp = f"plantedYear_{year}.shp"
sql = f"SELECT * FROM ind_plant_v2 WHERE plantedYear={year}"

cmd = [
        "ogr2ogr",
        "-f", "ESRI Shapefile",
        output_shp,
        gpkg_india,
        "-sql", sql,
        "-nln", f"plantedYear_{year}"
]

In [31]:
process = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

for line in process.stdout:
    print(line, end="")  # print each line as it arrives

process.wait()



0

It looks like there are no errors. Let us check the size of the file too:

In [33]:
import os
import glob

base = "plantedYear_1982"  

## Collect all parts of the shape file
files = glob.glob(base + ".*")

total_bytes = sum(os.path.getsize(f) for f in files)
total_gb = total_bytes / (1024 ** 3)

print("Shapefile components:")
for f in files:
    size_mb = os.path.getsize(f) / (1024 ** 2)
    print(f"  {os.path.basename(f)}: {size_mb:.2f} MB")

print(f"\nTotal size: {total_gb:.3f} GB")


Shapefile components:
  plantedYear_1982.dbf: 1545.86 MB
  plantedYear_1982.prj: 0.00 MB
  plantedYear_1982.shp: 128.75 MB
  plantedYear_1982.shx: 1.40 MB

Total size: 1.637 GB
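
The sizes above allow a rough check of why the full conversion hit the 2 GB limit of the .dbf file and why the per-year split avoids it. The sketch below uses only numbers printed in this notebook. It assumes that every record takes the same space in the .dbf file as the records in the 1982 file, which is an approximation.

```python
DBF_LIMIT_BYTES = 2 * 1024 ** 3       # 2 GB limit mentioned in the GDAL warning

dbf_bytes_1982 = 1545.86 * 1024 ** 2  # size of plantedYear_1982.dbf printed above
records_1982 = 184032                 # feature count for 1982
total_features = 1247437              # feature count of the whole India layer

# Average space one record takes in the .dbf file
bytes_per_record = dbf_bytes_1982 / records_1982

# Estimated .dbf size if all features were written to one shapefile
estimated_full_dbf_gb = bytes_per_record * total_features / 1024 ** 3

# Largest number of records that still fits under the limit
max_records_per_file = int(DBF_LIMIT_BYTES / bytes_per_record)

print(f"Bytes per record: {bytes_per_record:.0f}")
print(f"Estimated full .dbf size: {estimated_full_dbf_gb:.1f} GB")
print(f"Max records per shapefile: {max_records_per_file}")
```

Under this assumption a record takes roughly 8.8 KB, so a single .dbf for the whole layer would be about 10 GB, far above the limit, while 1982, the year with the most features, stays below it.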


After uploading this file as a GEE asset (which took about 

We can repeat the same across the different years using a loop. The different shape files are written to an external hard drive:

In [None]:
base_path = 

In [39]:
for year in planted_years:  # list of years collected in the parsing cell above
    print(f"Processing year {year}:")
    output_shp = rf"D:\plantedYear_{year}.shp"
    sql = f"SELECT * FROM ind_plant_v2 WHERE plantedYear={year}"

    cmd = [
        "ogr2ogr",
        "-f", "ESRI Shapefile",
        output_shp,
        gpkg_india,
        "-sql", sql,
        "-nln", f"plantedYear_{year}"
    ]

    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True
    )

    for line in process.stdout:
        print(line, end="")  # print each line as it arrives

    process.wait()




Processing year 1982:
Processing year 1983:
Processing year 1986:
Processing year 1987:
Processing year 1988:
Processing year 1989:
Processing year 1990:
Processing year 1991:
Processing year 1992:
Processing year 1993:
Processing year 1994:
Processing year 1995:
Processing year 1996:
Processing year 1997:
Processing year 1998:
Processing year 1999:
Processing year 2000:
Processing year 2001:
Processing year 2002:
Processing year 2003:
Processing year 2004:
Processing year 2005:
Processing year 2006:
Processing year 2007:
Processing year 2008:
Processing year 2009:
Processing year 2010:
Processing year 2011:
Processing year 2012:
Processing year 2013:
Processing year 2014:
Processing year 2015:
Processing year 2016:
Processing year 2017:
Processing year 2018:
Processing year 2019:
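
As noted in the Aim, the split could also be made by class instead of by year. A minimal sketch of a command builder that handles both cases follows. `build_split_cmd` and its parameters are hypothetical names introduced here, and replacing spaces by underscores in the output name is an assumption about what makes a convenient file name.

```python
from pathlib import Path


def build_split_cmd(gpkg, layer, field, value, out_dir="."):
    """Build the ogr2ogr argument list that extracts one value of one field."""
    # Numbers go into the SQL as-is; strings are single-quoted, with quotes doubled
    if isinstance(value, (int, float)):
        literal = str(value)
    else:
        literal = "'" + str(value).replace("'", "''") + "'"
    # File and layer name without spaces, e.g. originalName_Mixed_plantation
    name = f"{field}_{value}".replace(" ", "_")
    return [
        "ogr2ogr",
        "-f", "ESRI Shapefile",
        str(Path(out_dir) / f"{name}.shp"),
        gpkg,
        "-sql", f"SELECT * FROM {layer} WHERE {field}={literal}",
        "-nln", name,
    ]


cmd_class = build_split_cmd(
    "india_plant_v2.gpkg", "ind_plant_v2", "originalName", "Mixed plantation"
)
cmd_year = build_split_cmd("india_plant_v2.gpkg", "ind_plant_v2", "plantedYear", 1982)
print(cmd_class)
print(cmd_year)
```

The returned list can be passed to `subprocess.Popen` or `subprocess.run` in the same way as the commands in the loop above.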
