# Workflow for the entire project

> ## "What's this?"

Before I write any code, I want to make sure that I have a detailed plan to follow and a clear goal at each stage.

This project is concerned with predicting the intensity of the Urban Heat Island effect (the temperature essentially) in places where there is inadequate monitoring technology.

This will be done by analysing the environmental variables that ARE available and using them as predictors of Urban Heat Island (UHI) intensity.

> ## "But what is an Urban Heat Island?"

Well, I'm glad you asked!

A **UHI** is a place where the kinds of structures present make it difficult for heat to dissipate easily, and days and (especially?) nights are significantly warmer as a result.

The two biggest culprits for this effect are:  
1. Close building proximities (because buildings today just loooove to retain heat)
2. The lack of vegetation (commonly referred to as green cover) to either shield from or distribute (I don't know how yet) the heat.

> ## "Okay, I get that. But why does this matter?"

You are very curious, Padawan.

It is often the case that cities build first and think about liveability later. This is certainly the case in Lagos, Nigeria's (Africa's, for the hopeful) largest economy, with a population that overwhelms its infrastructure.

In this city, green spaces are a luxury and builders are constantly trying to fit the most people in the smallest possible spaces.  
Human rights concerns aside, this makes it very difficult for a city perfectly in the tropics to manage the inevitable buildup of heat.

The nail in this coffin is the absence of any kind of monitoring technology. This means that many people probably die unnoticed in the warmer months.  
This doesn't have to be the case.

> ## "Oof. So how do we start modelling?"

Good question. The first thing any statistician needs (after identifying a problem) is data.

What data do we need? What are the variables we're examining?

We've mentioned building proximity and vegetation (green cover). To help us find more, we can look at related studies.

What did THEY examine?

### 1. Ana Oliveira et al.
**Method**: Random Forest

#### **Response**
Nocturnal Landsat LST

#### **Predictors**
1. lat
2. long
3. alt
4. nocturnal lst
5. diurnal ndvi
6. diurnal latent heat flux
7. diurnal sensible heat flux
8. diurnal storage heat flux

### 2. Andreas Wicki et al.
**Method**: Multiple Linear Regression

#### **Response**
Nocturnal Air Temperature

#### **Predictors**
1. Landsat Data (LST, NDVI, and Albedo)
2. Urban Morphology, which is the 3D arrangement and density of buildings and vegetation
3. Land Cover (Building to Surface Fraction, Impervious Surface Fraction, and Pervious Surface Fraction)

### 3. Shun Fu et al.
**Method**: CNN

#### **Response**
LST

#### **Predictors**
Spectral/Land Cover Indices:
   1. NDVI (Normalized Difference Vegetation Index)
   2. NDBI (Normalized Difference Built-up Index)
   3. MNDWI (Modified Normalized Difference Water Index)
   4. SAVI (Soil Adjusted Vegetation Index)

### 4. Alireza Attarhay Tehrani et al.
**Method**: Artificial Neural Network, Deep Neural Network, Gated Recurrent Unit

#### **Response**
UHI Intensity - Measured as difference in temperature between urban and rural areas.

#### **Predictors**
1. Topological & Spatial Features
   1. Geographical Location of Urban Areas
   2. Azimuth Angle (Building Orientation Relative to the Sun)
2. Urban Form Metrics
   1. Building Density (% of land covered by buildings)
   2. Average Building Height
   3. Average Building Volume
   4. Facade-to-Site Ratio (Ratio of Building Surface Area to Plot Area)
3. Land Use & Green Infrastructure:
   1. Green Space Ratio (% of vegetation cover)
   2. Occupied Area (space covered by buildings)
   3. Unoccupied Area (open/vacant land)
4. Climatic Data (from UWG, the Urban Weather Generator):
   1. Dry-bulb temperature (from Typical Meteorological Year 3 (TMY3))
   2. Humidity, wind patterns (indirectly considered via UWG simulations)


> ## "So, what did you find?"

In these four works, we can see the following commonalities:

**Response: Urban Heat Island Intensity**  
This is measured as:  
- Nocturnal Landsat Land Surface Temperature
- Nocturnal Air Temperature
- Temperature difference between urban areas and rural surroundings
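
As a rough illustration of the third definition, UHI intensity can be sketched as the mean urban temperature minus the mean rural temperature. The readings below are made-up numbers for illustration, not measurements from Lagos, and `uhi_intensity` is a hypothetical helper name.

```python
from statistics import mean


def uhi_intensity(urban_temps_c, rural_temps_c):
    """Return mean urban temperature minus mean rural temperature (in °C)."""
    return mean(urban_temps_c) - mean(rural_temps_c)


# Hypothetical night-time air temperatures in °C (illustrative only)
urban = [29.0, 30.0, 31.0]
rural = [26.0, 27.0, 28.0]

intensity = uhi_intensity(urban, rural)
print(f"UHI intensity: {intensity:.1f} °C")
```

A positive value means the urban area is warmer than its rural surroundings.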

**Predictors:**
1. Geographical Location
2. Nocturnal Land Surface Temperature
3. Normalized Difference Vegetation Index (Land Cover)  
   Measures vegetation health and density.  
   **Near-Infrared (NIR)** – High reflectance from healthy plants.  
   **Red** – Absorbed by chlorophyll.
   $$
     NDVI = \frac{NIR - Red}{NIR + Red}
   $$
   - High NDVI (0.6–1.0): Dense vegetation (forests, crops).  
   - Low NDVI (0–0.2): Urban areas, bare soil.  
   - Negative NDVI: Water bodies.  
4. Normalized Difference Built-up Index (Land Cover)  
   **Short-Wave Infrared (SWIR)** – High reflectance from impervious surfaces.  
   **Near-Infrared (NIR)** – Low reflectance from built-up areas.
   $$
     NDBI = \frac{SWIR - NIR}{SWIR + NIR}
   $$
   - High NDBI (>0): Urbanized regions (concrete, asphalt).  
   - Low/Negative NDBI: Vegetation or water.  
5. Modified Normalized Difference Water Index (Land Cover)  
   **Green** – High reflectance from water.  
   **Short-Wave Infrared (SWIR)** – Low reflectance from water.  
   $$
     MNDWI = \frac{Green - SWIR}{Green + SWIR}
   $$
   - High MNDWI (>0.2): Water bodies (lakes, rivers).  
   - Low MNDWI (<0): Urban/vegetated areas.  
6. Soil Adjusted Vegetation Index (Land Cover)
7. Albedo
8. Building Density
9. Altitude (Elevation)
10. Vegetation Density (Land Cover)
11. Green Space Ratio (Land Cover)
12. Humidity
13. Building Orientation Relative to the Sun (Azimuth Angle)
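
NDVI, NDBI and MNDWI in the list above all share the same "normalized difference" shape, so a single helper covers them. This is a minimal sketch in plain Python; the reflectance values are hypothetical numbers for one vegetated pixel, and `normalized_difference` is an illustrative name, not part of any library.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    if total == 0:
        return None
    return (a - b) / total


# Hypothetical surface reflectances (0 to 1) for one vegetated pixel
green, red, nir, swir = 0.08, 0.05, 0.45, 0.20

ndvi = normalized_difference(nir, red)      # vegetation index
ndbi = normalized_difference(swir, nir)     # built-up index
mndwi = normalized_difference(green, swir)  # water index

print(f"NDVI={ndvi:.2f}, NDBI={ndbi:.2f}, MNDWI={mndwi:.2f}")
```

For this pixel NDVI comes out high while NDBI and MNDWI are negative, which matches the "dense vegetation" reading of the thresholds listed above.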

> ## "That's... a lot of features"

Right on.

After I select the model I'm going to use for this, I will perform **feature reduction** (using, say, Principal Component Analysis) and use the optimal number of features.
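
A quick way to see why reduction is worth doing: several of these features measure nearly the same thing. The sketch below computes the Pearson correlation between NDVI and SAVI over a handful of made-up (NIR, Red) reflectance pairs. It is a redundancy check only, not PCA itself, and the pixel values are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ndvi(nir, red):
    return (nir - red) / (nir + red)


def savi(nir, red, soil_factor=0.5):
    return (nir - red) / (nir + red + soil_factor) * (1 + soil_factor)


# Hypothetical (NIR, Red) reflectance pairs, from dense vegetation to water
pixels = [(0.45, 0.05), (0.30, 0.10), (0.25, 0.15), (0.20, 0.18), (0.05, 0.08)]

ndvi_values = [ndvi(nir, red) for nir, red in pixels]
savi_values = [savi(nir, red) for nir, red in pixels]

r = pearson(ndvi_values, savi_values)
print(f"Correlation between NDVI and SAVI: {r:.3f}")
```

When two features are this strongly correlated, a method like PCA will fold most of their shared variance into a single component.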

> ## "So what now?"

We find our data.

From my discussion with GPT, I have been able to understand the following.

|Feature|Purpose|
|---|---|
|NDVI|Measures vegetation density|
|NDBI|Measures built-up urban areas|
|MNDWI|Identifies water bodies|
|SAVI|Like NDVI, but corrects for soil effects|
|Albedo|Reflectivity of surfaces, which affects heat absorption|
|Elevation|Higher altitudes are generally cooler|
|Land Cover Class|Helps distinguish urban, vegetation, water, etc.|


## Data Collection

The first thing we do is collect our Landsat images from Google Earth Engine. This is done in the code below:

```python
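# Assumes `import ee`, ee.Initialize(), and `lagos_aoi` (an ee.Geometry) are set up earlier.
collection = (
    ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")  # Landsat 8, Collection 2, Level-2
    .filterBounds(lagos_aoi)  # scenes that intersect the Lagos boundary
    .filterDate("2021-11-01", "2024-03-31")  # end date is exclusive
    .filter(ee.Filter.lt("CLOUD_COVER", 20))  # under 20% scene cloud cover
)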
```

This selects the Landsat 8 images taken between November 2021 and March 2024 that intersect the Lagos boundary I defined and have less than 20% cloud cover (so I can actually see the ground).

### Image Bands

Each of the Landsat images actually has several bands (e.g., Red, NIR, SWIR, etc., sort of the way ordinary images store R, G and B values), so the next step is to collapse the whole collection into a single image with this code:

```python
image = collection.median().clip(lagos_aoi)
```

For every pixel, and within each band (like Red), `median()` calculates the median value across the entire date range (2021 to 2024). Essentially, it crunches all the values observed at that pixel into a single representative figure.
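
To make the idea concrete without Earth Engine, here is a plain-Python sketch of a per-pixel median over a tiny stack of scenes. The 2x2 Red-band values are invented, and one pixel on the second date is deliberately bright to mimic a leftover cloud; `median_composite` is a hypothetical helper, not an Earth Engine function.

```python
from statistics import median

# Hypothetical Red-band values for the same 2x2 patch on three dates.
# The second date has one cloud-contaminated (very bright) pixel.
scenes = [
    [[0.10, 0.12], [0.11, 0.30]],
    [[0.11, 0.90], [0.10, 0.31]],
    [[0.12, 0.13], [0.12, 0.29]],
]


def median_composite(stack):
    """Return one grid holding the per-pixel median across all scenes."""
    rows, cols = len(stack[0]), len(stack[0][0])
    return [
        [median(scene[r][c] for scene in stack) for c in range(cols)]
        for r in range(rows)
    ]


composite = median_composite(scenes)
print(composite)
```

The bright 0.90 value never makes it into the composite, which is one reason a median is a popular choice over a mean for this step.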

> "Why the hell are we doing this?"

Well, some of the final spectral indices we want, like NDVI, actually rely on these values. Remember that NDVI is:

$$
	NDVI = \frac{NIR - Red}{NIR + Red}
$$

That median **red** we calculate at that point is what we use in this formula to calculate the NDVI at that point.

The `.clip(lagos_aoi)` call in the code above restricts the result to the Lagos boundary.

---

**Comment:**
The NDVI formula above compares how much near-infrared light a surface reflects with how much red light it reflects. Healthy vegetation reflects strongly in NIR and absorbs red, so the index indicates whether a surface is vegetated and how healthy that vegetation is.

- Healthy plants: Closer to 1
- Bare soil and built-up surfaces: Closer to 0
- Water: Negative values

## Calculating Spectral Indices

The lines below calculate the rest of the spectral indices, load elevation and land cover, and combine everything with the already-calculated `ndvi` band into the **features** image.

```python
# NDBI = (SWIR1 - NIR) / (SWIR1 + NIR)
ndbi = image.normalizedDifference(["SR_B6", "SR_B5"]).rename("NDBI")

# MNDWI = (Green - SWIR1) / (Green + SWIR1)
mndwi = image.normalizedDifference(["SR_B3", "SR_B6"]).rename("MNDWI")

savi = image.expression(
    "((NIR - RED) / (NIR + RED + L)) * (1 + L)",
    {
        "NIR": image.select("SR_B5"),
        "RED": image.select("SR_B4"),
        "L": 0.5,  # soil brightness correction factor
    },
).rename("SAVI")

albedo = image.expression(
    "0.356 * B2 + 0.130 * B4 + 0.373 * B5 + 0.085 * B6 + 0.072 * B7 - 0.0018",
    {
        "B2": image.select("SR_B2"),  # Blue
        "B4": image.select("SR_B4"),  # Red
        "B5": image.select("SR_B5"),  # NIR
        "B6": image.select("SR_B6"),  # SWIR1
        "B7": image.select("SR_B7"),  # SWIR2
    },
).rename("Albedo")

elevation = ee.Image("USGS/SRTMGL1_003").clip(lagos_aoi).rename("Elevation")

landcover = ee.Image("ESA/WorldCover/v100/2020").clip(lagos_aoi).rename("LandCover")

# `ndvi` and `lst` are assumed to have been computed earlier (not shown in this excerpt).
features = ndvi.addBands([ndbi, mndwi, savi, albedo, lst, elevation, landcover])
```
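
One caveat worth checking: the `SR_B*` bands in Landsat Collection 2 Level-2 are stored as scaled integers, and reflectance is recovered as `DN * 0.0000275 - 0.2`. The SAVI and albedo expressions expect reflectance in the 0 to 1 range, so if `image` was not rescaled somewhere earlier (the excerpts shown here don't do it), those two bands will be off. The plain-Python sketch below uses made-up digital numbers for one pixel to show the difference; the helper names are hypothetical.

```python
SCALE, OFFSET = 0.0000275, -0.2  # Collection 2 Level-2 surface reflectance scaling


def to_reflectance(dn):
    """Convert a raw SR digital number to surface reflectance."""
    return dn * SCALE + OFFSET


def savi(nir, red, soil_factor=0.5):
    return (nir - red) / (nir + red + soil_factor) * (1 + soil_factor)


def albedo(b2, b4, b5, b6, b7):
    # Same coefficients as the expression in the block above
    return 0.356 * b2 + 0.130 * b4 + 0.373 * b5 + 0.085 * b6 + 0.072 * b7 - 0.0018


# Hypothetical raw digital numbers for one vegetated pixel
dn = {"B2": 8000, "B4": 8500, "B5": 20000, "B6": 13000, "B7": 10000}
refl = {band: to_reflectance(value) for band, value in dn.items()}

savi_raw = savi(dn["B5"], dn["B4"])
savi_scaled = savi(refl["B5"], refl["B4"])
albedo_raw = albedo(*(dn[b] for b in ("B2", "B4", "B5", "B6", "B7")))
albedo_scaled = albedo(*(refl[b] for b in ("B2", "B4", "B5", "B6", "B7")))

print(f"SAVI   raw: {savi_raw:.3f}  scaled: {savi_scaled:.3f}")
print(f"Albedo raw: {albedo_raw:.1f}  scaled: {albedo_scaled:.3f}")
```

With raw digital numbers the albedo lands far outside 0 to 1 and the soil factor `L` in SAVI becomes negligible, so rescaling before computing these two indices is the safer route.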

## Sampling from Feature Image

We could probably use every pixel in our analysis, but at 30 m resolution that could be hundreds of thousands to millions of rows of data. 500 should be enough to start with, and we request that many points in the line below (masked pixels are dropped, so slightly fewer can come back):

```python
points = features.sample(region=lagos_aoi, scale=30, numPixels=500, geometries=True)
```
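
To put "hundreds of thousands to millions of rows" in perspective, here is a back-of-the-envelope count. The 3,500 km² area is an assumption standing in for the real size of `lagos_aoi`, which depends on how the boundary was drawn, and `pixel_count` is a hypothetical helper.

```python
def pixel_count(area_km2, resolution_m):
    """Approximate number of square pixels needed to cover an area."""
    pixel_area_m2 = resolution_m ** 2
    return int(area_km2 * 1_000_000 / pixel_area_m2)


ASSUMED_AREA_KM2 = 3_500  # assumption: replace with the real area of lagos_aoi
total_pixels = pixel_count(ASSUMED_AREA_KM2, 30)
sample_fraction = 500 / total_pixels

print(f"About {total_pixels:,} pixels at 30 m")
print(f"500 points is roughly {sample_fraction:.4%} of them")
```

Under that assumption, 500 points is a very thin sample, so it is worth revisiting `numPixels` if the model turns out to be starved for data.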

We export those points to CSV in the line below using `geemap` and continue the rest of our analysis with pandas.

```python
geemap.ee_to_csv(points, filename="resources/data/UHI_features.csv")
```
