# Geospatial Machine Learning: Classifying Land Use Based on Environmental and Economic Features

# 1. Introduction

In this project, we analyze the **Geospatial Environmental and Socioeconomic Data** collection. Our goal is to build a **classification model** that **predicts the primary land cover type** of a geographic area from spatial indicators such as population density, GDP, temperature, and proximity to roads and cities. We use **machine learning algorithms** to test whether **socioeconomic and environmental conditions** follow patterns that distinguish different types of **land use**.


# 2. Description of the Dataset

## Geospatial Environmental and Socioeconomic Data  
**by Cathetorres (2020) – Kaggle**

This dataset is a comprehensive compilation of global **geospatial vector and raster data**, integrating diverse environmental and socioeconomic indicators. Collected from multiple authoritative sources, it provides detailed spatial layers useful for analysis in sustainability, development, and environmental monitoring.

The dataset is structured into **12 thematic folders**, each covering a distinct data category:

1. **Cities and Towns** – Geospatial data for urban areas in point and polygon formats.  
2. **Roads and Railroads** – Global transportation infrastructure networks.  
3. **Airports and Ports** – Locations of major air and sea transportation hubs.  
4. **Power Plants** – Spatial distribution of energy production facilities.  
5. **Gridded Population (2015)** – High-resolution (250 m) global population estimates.  
6. **Gridded GDP and HDI (1990–2015)** – Socioeconomic data showing economic productivity and development trends over time.  
7. **Land Cover (2015)** – Classification of global land cover types based on satellite data.  
8. **Tree Cover Loss by Dominant Driver** – Causes of deforestation disaggregated by primary driving factors.  
9. **Carbon Accumulation Potential** – Estimates of carbon sequestration from natural forest regrowth in forests and savannas.  
10. **Solar Energy Potential** – Spatial potential for solar power generation across the globe.  
11. **Air Temperature** – Global surface temperature data for climate studies.  
12. **Global Cattle Distribution (2010)** – Spatial density of cattle populations with 5-arcminute resolution.

## Dataset Folders Used

For this project, we utilize selected folders from the **Geospatial Environmental and Socioeconomic Data** collection to build a classification model that predicts the **primary land cover type** of a geographic area based on environmental and socioeconomic indicators. Below is a breakdown of the relevant folders used in the analysis:

### Target Variable

- **7. Land Cover (2015)**  
  This layer provides the land cover class for each grid cell (the file used here is a 300 m product, according to its filename). It serves as the **label** for our machine learning model, representing categories such as forest, cropland, urban area, grassland, and water.

### Input Features

#### Human and Economic Indicators
- **5. Gridded Population 2015 (250 m)**  
  Provides population density data per grid cell, which may correlate with urban or peri-urban land use.

- **6. Gridded GDP and HDI (1990–2015)**  
  Economic indicators used to understand development intensity, which may affect land use patterns.

#### Environmental Variables

- **11. Air Temperature**  
  Climatic data used to correlate temperature patterns with land cover types like tundra, desert, or tropical forest.

- **12. Global Cattle Distribution (2010)**  
  Used as a proxy for agricultural activity, particularly pastoral land use.
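Because the label and every feature are defined per grid cell, the layers can be turned into a table with one row per cell once they share a grid. The sketch below is a minimal illustration of that idea, not the notebook's actual pipeline. It assumes the rasters are already aligned to the same shape and that missing values are stored as NaN; the helper name `rasters_to_table` and the toy values are hypothetical.

```python
import numpy as np

def rasters_to_table(label, features):
    """Flatten aligned 2-D rasters into (X, y), dropping cells with missing data.

    Assumes every array has the same shape and that missing values are NaN.
    """
    names = list(features)
    # One column per feature layer, one row per grid cell
    X = np.column_stack([features[n].ravel() for n in names]).astype("float64")
    y = label.ravel()
    # Keep only cells where every feature has a value
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep], names

# Toy 2x2 rasters (hypothetical values, for illustration only)
label = np.array([[10, 20], [30, 10]])
features = {
    "population": np.array([[5.0, np.nan], [0.0, 2.0]]),
    "temperature": np.array([[27.1, 26.5], [25.0, 28.3]]),
}
X, y, names = rasters_to_table(label, features)
print(X.shape, y)
```

The cell with a missing population value is dropped from both `X` and `y`, so the rows stay paired with their labels.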


# 3. Importing Files and Python Libraries

Here are the libraries and modules that will be used in this notebook:

In [3]:
import rasterio
from rasterio.mask import mask
import geopandas as gpd
import numpy as np
import xarray as xr
import rioxarray

from rasterio.enums import Resampling
from skimage.transform import resize

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

### Importing the Files

In [4]:
paths = {
    'land_cover': 'ESACCI-LC-L4-LCCS-Map-300m-P1Y-2015-v2.0.7.tif',
    'population': 'GHS_POP_E2015_GLOBE_R2019A_54009_250_V1_0.tif',
    'temperature': 'TEMP.tif',
    'cattle': '6_Ct_2010_Aw.tif',
    'gdp_nc': 'GDP_PPP_1990_2015_5arcmin_v2.nc',
    'philippines_boundary': 'gadm41_PHL_1.shp'
}

# 4. Data Preprocessing and Cleaning

**Loading the Philippines Boundary**

In [6]:
ph_boundary = gpd.read_file(paths['philippines_boundary'])
ph_boundary = ph_boundary.to_crs("EPSG:4326")

**Loading .tif raster layers**

In [None]:
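geo_data = {}
transforms = {}  # affine transform of each clipped raster, keyed like geo_data

for key in ['land_cover', 'population', 'temperature', 'cattle']:
    with rasterio.open(paths[key]) as src:
        # mask() expects shapes in the raster's CRS; ph_boundary is in EPSG:4326,
        # so a raster in another CRS needs ph_boundary.to_crs(src.crs) first.
        clipped, transform = mask(src, ph_boundary.geometry, crop=True)
        geo_data[key] = clipped[0]  # First band
        transforms[key] = transform  # kept for later resampling

        # Print the first non-NaN value. Nodata fill values are not NaN,
        # so they can still show up here.
        sample = geo_data[key][~np.isnan(geo_data[key])].flatten()
        print(f"Sample value from {key}: {sample[0] if len(sample) > 0 else 'No valid data'}")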

Sample value from land_cover: 0
Sample value from population: -200.0
Sample value from temperature: 0.0
Sample value from cattle: -1.7e+308
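
Values such as -200.0 and -1.7e+308 look like nodata fill values rather than real measurements, which would explain why the NaN filter lets them through. A minimal sketch of one way to handle this is shown below: convert the band to float and replace the fill value with NaN. The helper name `mask_nodata` and the toy array are hypothetical; in practice the fill value would come from the raster's metadata (for example rasterio's `src.nodata`), and the value used here is only an assumption.

```python
import numpy as np

def mask_nodata(band, nodata):
    """Return a float copy of `band` with nodata cells replaced by NaN."""
    out = band.astype("float64")
    if nodata is not None:
        out[band == nodata] = np.nan
    return out

# Toy 2x3 band using -200.0 as an assumed nodata value
band = np.array([[-200.0, 12.5, 3.0],
                 [7.0, -200.0, 0.0]])
clean = mask_nodata(band, nodata=-200.0)

# After masking, the NaN filter keeps only real measurements
valid = clean[~np.isnan(clean)]
print(valid)
```

Note that a genuine zero is kept as valid data; only cells equal to the declared fill value are masked.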


**Loading .nc**

In [9]:
gdp_data = xr.open_dataset(paths['gdp_nc'])
gdp_data = gdp_data.rio.write_crs("EPSG:4326", inplace=False)
gdp_data_ph = gdp_data.rio.clip(ph_boundary.geometry, ph_boundary.crs)

print("\nGDP NetCDF Sample Values:")
for var in gdp_data_ph.data_vars:
    arr = gdp_data_ph[var].values
    sample = arr[~np.isnan(arr)].flatten()
    print(f"- {var}: {sample[0] if len(sample) > 0 else 'No valid data'}")


GDP NetCDF Sample Values:
- GDP_PPP: 0.0


**Array Shapes**

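As the output below shows, the clipped layers have different shapes because their source grids differ in resolution. The population layer clips to a single cell, (1, 1), which suggests a coordinate system mismatch: its filename indicates a Mollweide projection (54009), while the boundary is in EPSG:4326, so the boundary would need to be reprojected to the raster's CRS before masking. Note also that the GDP shape is printed from the unclipped global dataset (gdp_data), not from the Philippines clip (gdp_data_ph).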

In [10]:
print("\nRaster file shapes:")
for key, data in geo_data.items():
    print(f"{key}: {data.shape}")

print("\nNetCDF variables:")
for var in gdp_data.data_vars:
    print(f"{var}: {gdp_data[var].shape}")


Raster file shapes:
land_cover: (5935, 3484)
population: (1, 1)
temperature: (1979, 1162)
cattle: (198, 117)

NetCDF variables:
GDP_PPP: (26, 2160, 4320)
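
The first GDP dimension has 26 layers, which matches one layer per year for 1990 to 2015. To pair GDP with the 2015 land cover and population layers, a single year has to be selected. The sketch below shows the index arithmetic on a small stand-in array. It assumes the layers are ordered by ascending year starting at 1990, which the source does not confirm; the helper name `year_index` is hypothetical.

```python
import numpy as np

def year_index(year, first_year=1990, last_year=2015):
    """Map a calendar year to its position on the time axis.

    Assumes one layer per year in ascending order (an assumption, not
    something the dataset output above confirms).
    """
    if not first_year <= year <= last_year:
        raise ValueError(f"year must be between {first_year} and {last_year}")
    return year - first_year

# Small stand-in for the (26, 2160, 4320) GDP array
gdp = np.zeros((26, 4, 8))
gdp[year_index(2015)] = 1.0  # mark the 2015 layer so the selection is visible

gdp_2015 = gdp[year_index(2015)]
print(gdp_2015.shape)
```

With xarray, the same selection could be made by position along the time dimension, whose name is not shown in the output above.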


**Resampling**

New Array Shape

In [None]:
land_cover = geo_data['land_cover']
transform = transforms['land_cover']
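
One way to bring the coarser layers onto the land cover grid is nearest-neighbour resampling. The sketch below is a plain NumPy illustration of the idea and is not necessarily what this notebook ends up using (the imports suggest rasterio's `Resampling` or `skimage.transform.resize`). It assumes both arrays cover exactly the same geographic extent, since it works on array indices only and ignores the affine transforms. The helper name `resample_nearest` is hypothetical.

```python
import numpy as np

def resample_nearest(src, target_shape):
    """Resize a 2-D array to `target_shape` by nearest-neighbour lookup.

    Works on array indices only, so both grids must cover the same extent.
    """
    rows, cols = target_shape
    # Map each target cell centre back to the source cell that contains it
    row_idx = ((np.arange(rows) + 0.5) * src.shape[0] / rows).astype(int)
    col_idx = ((np.arange(cols) + 0.5) * src.shape[1] / cols).astype(int)
    # Guard against rounding past the last source row or column
    row_idx = np.clip(row_idx, 0, src.shape[0] - 1)
    col_idx = np.clip(col_idx, 0, src.shape[1] - 1)
    return src[np.ix_(row_idx, col_idx)]

# Toy example: upsample a 2x2 grid to 4x4
coarse = np.array([[1, 2],
                   [3, 4]])
fine = resample_nearest(coarse, (4, 4))
print(fine)
```

Nearest-neighbour lookup keeps the original values unchanged, which matters for categorical layers; continuous layers such as temperature could instead use an interpolating method.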

# 5. Exploratory Data Analysis


# Resampling

# Scaling

# Splitting the Dataset into Training, Validation, and Test Sets
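
A minimal sketch of a three-way split using `train_test_split`, which this notebook already imports. The 70/15/15 ratio, the random seed, and the synthetic `X` and `y` are assumptions for illustration; they are not results or settings taken from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the per-cell feature table and land cover labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 3, size=1000)

# First split off 30% of the rows, then halve that part into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the class proportions similar across the three sets. Neighbouring grid cells tend to resemble each other, so a purely random split can make test scores look better than they would on a new region; splitting by spatial blocks is a common alternative.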

# Machine Learning

## Initial Model Training
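
A minimal sketch of a baseline using `RandomForestClassifier` and `classification_report`, both already imported in this notebook. The data is synthetic and the hyperparameters are assumptions, so the printed report says nothing about performance on the real land cover data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic features; labels depend on the first two columns so there is
# a pattern for the model to learn (three classes: 0, 1, 2)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Assumed baseline settings, not tuned values
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = (y_pred == y_test).mean()
print(classification_report(y_test, y_pred))
```

The per-class precision and recall in the report are a natural starting point for the error analysis that follows, since land cover classes are rarely balanced.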

## Error Analysis

## Improving Model Performance

# Insights and Conclusion

# References