# Predicting Population Density within US Census Tracts 
##### By: Manny Fors, Liam Smith, Alex Xu

## Abstract


Our project aims to tackle the challenge of inadequate census population data in underdeveloped countries, while also providing a means to forecast populations in new developments. Leveraging landcover image data and tract geometry, our approach computes zonal statistics and fits regression models to them. We evaluate the project itself by whether we meet our deliverables, and we evaluate our models by their accuracy, which gives a quantitative measure of how well this approach could address the lack of population data in certain areas.

## Introduction

In countries such as the US, there is a large amount of accurate census data. However, there are many countries where the resources for gathering census data are [scarcer](https://nigeria.iom.int/news/npc-we-lack-accurate-figures-nigerias-population). This is potentially due to geographic inaccessibility, political conflict, administrative failure, and, as mentioned previously, a lack of resources. **Thus, we want a way to predict human populations around the world with data about the land itself: satellite imagery**. With this imagery, the land is divided into landcover classes, which we can then use as variables for our model. Research into this topic has stagnated to a degree; however, [Tian et al.](https://www.ccap.pku.edu.cn/docs/2018-04/20180406194434899406.pdf) produced a hallmark paper which tested the effectiveness of modeling population with land cover data. It found that such a model is "feasible" and can have "high accuracy". They utilized Linear Regression and manually divided China into even 1km by 1km cells. Because of the availability of census data, we instead used census tracts, but we continued with the idea of utilizing Linear Regression. With some exploratory graphs of Connecticut, we discovered there might be a spatial pattern within our data. In order to take this into account during modeling, we started researching machine learning algorithms with a spatial component. We came across a paper by [Liu et al.](https://www.mdpi.com/2220-9964/11/4/242), which concluded that models with a spatial component, such as spatial lag, garner better results than those without. They used spatial lag and eigenvector spatial filtering to predict quantities beyond the scope of our datasets, such as soil types. Thus, we sought to create Linear Regression models and Spatial Autoregressive models, and to compare them to see which is more effective in predicting population density based on land cover.

## Values Statement

In a webinar session called “Humanitarian Applications Using NASA Earth Observations”, NASA presented how satellite remote-sensing data could be useful in monitoring humanitarian conditions at refugee settlements. Human settlements could be detected through remote sensing images and therefore could be used to predict the population in a region. This talk alerted us that we still lack necessary population data in many parts of the world, but also demonstrated how remote sensing could be a powerful tool for addressing the lack of population data in different countries. Thus, we decided to investigate the connection between remote sensing land cover data and population density in a context with better data coverage.

This type of model would be most beneficial to governments and government organizations. These users would most likely be hospital contractors, policy makers, emergency services providers such as ambulance crews and firefighters, and sociologists. Population census data is crucial for policy makers, as it assists in city management so that the equitable distribution of resources can be better calculated.

The implications extend beyond helping users. Real people would be affected by this technology. Those who are workers in fields such as emergency service work, or school teachers who might have been over-worked previously may be relieved by the building of new hospitals and schools to compensate for population changes. However, the negative effects are also extremely real. 

Imagine that this model expanded beyond the borders of Connecticut and is being used in countries with much less census data, such as Brazil. A forestry company might use it to decide where to continue harvesting wood from the Amazon without affecting populations. Our algorithm calculates that there are very few people in the area, as there is very dense land cover in the Amazon. The company starts to cut down trees and discovers that it is in an area inhabited by Indigenous peoples. A minority group that is already negatively affected continues to be disenfranchised. Underestimating the population density of an area can also reduce the amount of resources a policymaker provides to a region that in fact has a much greater population and is lacking resources. This would further harm an already negatively impacted area.

Ultimately, the world would be a more equitable and sustainable place if this type of technology could assist countries lacking population data. Providing data where there is none creates the potential for better resource partitioning and a better understanding of a country's population.

## Materials and Methods

### Necessary Data
With this project covering the entire state of Connecticut, we utilized landcover data, population data, shape files for graphing, and synthesized data which combined our various data sets into manageable datasets suitable for modeling.

The bread and butter of our data is 1-meter resolution landcover imagery covering the entire state of Connecticut.
Derived from NAIP, the data has already been processed such that every pixel represents a certain class of landcover.

At over 800 MB, the dataset is too large to share via GitHub, and is downloadable by clicking on the first option at [this link](https://coastalimagery.blob.core.windows.net/ccap-landcover/CCAP_bulk_download/High_Resolution_Land_Cover/Phase_2_Expanded_Categories/Legacy_Land_Cover_pre_2024/CONUS/index.html). This landcover dataset was one of the most complete datasets we could find, which is why we wanted to use it for our modelling.

Our other data sources are the geometries and population data on the Census tract level for the state of Connecticut.
We downloaded tract geometries directly into our Jupyter Notebook **final_project.ipynb** using the Pygris package, and we downloaded the population data from Social Explorer, storing it at **data/population.csv**.



### Methods

First, we clean and prepare our data for the model.
We start by combining our Tract Geometry of CT with the Population Data of CT to form a new dataset. We utilize both the CT Landcover Data and the Tracts Data in a calculation of Zonal Statistics. This means we calculate the proportion of pixels within each tract that are of a given landcover class. This then is saved as a combined dataset which we then continue to clean by imputing values, performing more advanced Zonal Statistics, and dropping any NA Columns. From there, we are left with data ready to be used in a model. 

The flowchart below outlines this process:

```{mermaid}
flowchart LR
  A(Population Data) --> B(Tracts Data)
  C(Tracts Geometry Data) --> B(Tracts Data)
  B --> D{Zonal Statistics}
  E(CT Landcover Data) --> D{Zonal Statistics}
  D{Zonal Statistics} --> F(Combined Data)
  F(Combined Data) --> |Impute Data| G[Cleaned Data]
  F --> |Additional Landcover Statistics| G[Cleaned Data]
  F --> |Drop Uncommon Landcover| G[Cleaned Data]
```
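
To make the zonal statistics step concrete, here is a minimal sketch of computing the proportion of each landcover class per tract. It is a toy illustration under stated assumptions, not the code from our notebook: the function name `zonal_class_proportions`, the tiny arrays, and the convention that zone id `0` means "outside every tract" are all hypothetical, and it assumes the tract geometries have already been rasterized onto the same grid as the landcover image.

```python
import numpy as np

def zonal_class_proportions(landcover, zones, classes):
    """Return {zone_id: {class_id: proportion of that zone's pixels}}.

    landcover: 2D array of landcover class ids (one per pixel).
    zones:     2D array of the same shape giving the tract id of each pixel.
               Assumption: 0 marks pixels outside every tract.
    classes:   iterable of landcover class ids to report.
    """
    result = {}
    for zone_id in np.unique(zones):
        if zone_id == 0:
            continue  # skip pixels that belong to no tract
        pixels = landcover[zones == zone_id]
        result[int(zone_id)] = {
            int(c): float(np.mean(pixels == c)) for c in classes
        }
    return result

# Hypothetical 3x4 raster with three landcover classes and two tracts.
landcover = np.array([[1, 1, 2, 2],
                      [1, 3, 2, 2],
                      [3, 3, 2, 1]])
zones = np.array([[1, 1, 2, 2],
                  [1, 1, 2, 2],
                  [1, 1, 2, 2]])

props = zonal_class_proportions(landcover, zones, classes=[1, 2, 3])
print(props[1])  # class proportions for tract 1
```

Each tract's proportions then become one row of features for the regression models.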

We then implement three types of Linear Regression: 

* Linear Regression with No Penalty Term 

* Linear Regression with $\ell_1$ Regularization (Lasso Regression)

* Linear Regression with $\ell_2$ Regularization (Ridge Regression)


Using $R^2$ and Mean Squared Error, we quantified the success of each of our models against one another and compared them to `scikit-learn`'s own implementations of each of these Linear Regression models.
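
As a rough illustration of how the unpenalized and Ridge models and the two metrics fit together, the sketch below uses the closed-form solution $w = (X^\top X + \lambda I)^{-1} X^\top y$ on synthetic data. This is an assumption made for brevity, not necessarily how our own implementations are fit; the helper names are hypothetical, and Lasso is omitted because it has no closed-form solution.

```python
import numpy as np

def fit_ridge(X, y, lam=0.0):
    """Closed-form least squares with an l2 penalty; lam=0 gives no penalty."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    penalty = lam * np.eye(X1.shape[1])
    penalty[0, 0] = 0.0                         # leave the intercept unpenalized
    return np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)

def predict(X, w):
    return np.column_stack([np.ones(len(X)), X]) @ w

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic, noiseless data standing in for landcover proportions (X)
# and population density (y).
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = 2.0 + X @ np.array([1.0, -3.0, 0.5])

w_ols = fit_ridge(X, y, lam=0.0)
w_ridge = fit_ridge(X, y, lam=1.0)

print("OLS   MSE:", mse(y, predict(X, w_ols)), "R^2:", r2(y, predict(X, w_ols)))
print("Ridge MSE:", mse(y, predict(X, w_ridge)), "R^2:", r2(y, predict(X, w_ridge)))
```

The same `mse` and `r2` functions can score predictions from any model, which is what makes a side-by-side comparison with a library implementation straightforward.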

Following Linear Regression, we then implemented two types of Spatial Autoregression:

* Endogenous Spatial Autoregression

* Exogenous Spatial Autoregression 

As our data can be plotted on a map of Connecticut, we felt it would be remiss not to explore Spatial Autoregression. Through this style of model, we can take into account the spatial aspect of each tract when we are predicting. We chose both Endogenous and Exogenous models. Endogenous models take into account the population densities of a given tract's neighbors. Exogenous models take into account the zonal statistics of a given tract's neighbors.

We merge our data with the shape file and calculate the spatial lag for each tract. The spatial lag in this case is the average population density of a given tract's neighbors. We also calculate the average landcover proportions of a given tract's neighbors.
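
The neighbor averages described above can be written as a matrix product with a row-standardized weights matrix. The sketch below is a toy example under stated assumptions: the four tracts are hypothetical and arranged in a line, so each tract borders only the tracts next to it, whereas in practice the adjacency would come from the tract geometries.

```python
import numpy as np

def row_standardize(adjacency):
    """Scale each row to sum to 1 so that W @ x averages over neighbors."""
    A = np.asarray(adjacency, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    # Tracts with no neighbors keep an all-zero row instead of dividing by 0.
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

# Hypothetical adjacency: tract i touches tract i-1 and tract i+1.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
W = row_standardize(adjacency)

# Made-up population densities and landcover proportions (two classes).
density = np.array([100.0, 200.0, 400.0, 800.0])
landcover_share = np.array([[0.9, 0.1],
                            [0.7, 0.3],
                            [0.4, 0.6],
                            [0.2, 0.8]])

lag_density = W @ density            # feature for the endogenous model
lag_landcover = W @ landcover_share  # features for the exogenous model

print(lag_density)
print(lag_landcover)
```

Here `lag_density` would be added as a predictor in the endogenous model, and the columns of `lag_landcover` in the exogenous model.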

In total, we create 8 models, which we compare in order to determine the best way to predict population density with landcover data:

```{mermaid}
flowchart 
A[Cleaned Data] --> B{No Penalty LR}
A --> C{Lasso LR}
B --> K{ours}
B --> L{scikit-learn}
C --> G{ours}
C --> H{scikit-learn}
A --> D{Ridge LR}
D --> I{ours}
D --> J{scikit-learn}
A --> |Spatial Lag Pop Density| E{Endogenous}
A --> |Spatial Lag Landcover| F{Exogenous}

```


## Group Contributions Statement


## Personal Reflection

Alex: Throughout the process of researching and implementing my project, I learned the background logic and the implementation of the spatial autoregressive model. I initially tried to use packages that perform spatial lag regression, but found them not helpful enough for train-test cross validation and model fitting. This motivated me to learn the background knowledge and mathematics of the model. The process of communicating the spatial autoregressive model to my teammates and during the presentation also improved my understanding of the model. I feel very satisfied with what we have accomplished, because the performance of the spatial lag model is better than that of OLS, as expected. We did not have enough time for the Poisson regression, which deviates from our initial goal. I would also have loved to add a penalty term to the regression of the spatial model, but unfortunately we did not have enough time for this. What I learned from this project, especially the spatial lag model, will contribute to future research on topics with a spatial component.