Google Summer of Code 2020
PySAL is inviting students to join in PySAL's development by applying for Google Summer of Code 2020. This is the fifth year PySAL is seeking to participate, and we are applying under the NumFOCUS umbrella organization.
PySAL is an open source library of spatial analysis functions written in Python, intended to support the development of high-level applications. See our documentation for more details. The developer guide describes in more detail how to contribute to PySAL and our workflow for contributing to the project. Our issues are also on GitHub and include bug reports, 'wishlist' items, and enhancement plans and ideas.
If you are interested in participating in GSoC as a student, the best approach is to become an active and engaged contributor to the project right away. You should take a look at some of the existing issues on GitHub and see if there are any you think you might be able to take a crack at. Try submitting a pull request for something and start getting the hang of the process and interacting with the PySAL code base and development community. It is a good idea to start on your proposal early, post a draft to the pysal chat room and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.
Below is a list of possible projects that students might consider. We also encourage students to propose their own projects, though several of the following topics are relatively high on our priority list. Our priority list is flexible, and it is important that the topic matches the interest and background of the student.
When considering the following projects, don't be put off by the knowledge prerequisites -- you don't need to be an expert, and there is some scope for research and learning within the GSoC period. However, familiarity with and interest in the subject area and involved technologies will be helpful!
Throughout PySAL, much of the currently implemented functionality is, in principle, suitable for both vector and raster data. Examples include global and local statistics in `esda` such as Moran's I, (spatial) Markov chains in `giddy`, and inequality and segregation measures in their respective packages. Although PySAL methods abstract from data sources, the library is currently designed to make workflows based on vector data much smoother than those based on raster data. However, applications of these methods that rely on raster rather than vector data will likely become more common as new sources of data traditionally generated in vector format are released as rasters (e.g. population data in the GHSL, or demographic statistics from the WorldPop project).
This project will develop a thin layer that interfaces functionality in PySAL with data provided in raster format. At their core, most methods in PySAL require an array-like data structure (e.g. a one- or two-dimensional `numpy.ndarray`, `pandas.Series`/`pandas.DataFrame`) and a spatial weights matrix expressed in the `pysal.lib.weights.W` form. An interface to raster data will provide functionality to seamlessly build these objects from rasters. Raster access will be offloaded to `rasterio`, but the project will cover the steps required to transform the data structures provided by `rasterio` into those that PySAL expects. This includes, for example, identifying missing values in the raster and excluding them from the computation of the spatial weights matrix, reshaping and aligning raster data with the spatial weights matrix, and processing outputs from PySAL so they can be written out using `rasterio`. Once these steps have been accomplished, the project could also identify areas in the analytical libraries of interest (e.g. `esda`, `giddy`) that could be optimised to scale to the size that raster datasets usually have.
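The transformation steps listed above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the function names `raster_to_pysal_inputs` and `to_raster` are hypothetical, the band is assumed to be already loaded as a 2-D array (for instance by `rasterio`), and rook contiguity is chosen arbitrarily as the neighbour definition.

```python
import numpy as np


def raster_to_pysal_inputs(band, nodata):
    """Flatten a 2-D raster band and build rook-contiguity neighbours.

    Hypothetical helper, not part of PySAL. Returns:
      values    -- 1-D array of the valid cells, in row-major order
      neighbors -- dict mapping each valid cell id to its valid rook neighbours
      ids       -- 2-D array of cell ids (-1 marks missing cells)
    """
    band = np.asarray(band, dtype=float)
    # A cell is valid if it is neither NaN nor equal to the nodata value.
    valid = ~np.isnan(band) & (band != nodata)
    n_rows, n_cols = band.shape

    # Number the valid cells 0..n-1 in row-major order.
    ids = np.full(band.shape, -1, dtype=int)
    ids[valid] = np.arange(valid.sum())
    values = band[valid]

    neighbors = {}
    for r, c in zip(*np.nonzero(valid)):
        nbrs = []
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < n_rows and 0 <= cc < n_cols
            if inside and valid[rr, cc]:
                nbrs.append(int(ids[rr, cc]))
        neighbors[int(ids[r, c])] = nbrs
    return values, neighbors, ids


def to_raster(result, ids, fill=np.nan):
    """Place per-cell results back on the raster grid, filling missing cells."""
    out = np.full(ids.shape, fill, dtype=float)
    out[ids >= 0] = result
    return out


band = np.array([[1.0, 2.0, -9999.0],
                 [4.0, -9999.0, 6.0]])
values, neighbors, ids = raster_to_pysal_inputs(band, nodata=-9999.0)
grid = to_raster(values * 10, ids)
```

A `neighbors` dictionary of this shape (id mapped to a list of neighbour ids) is the kind of input the `W` constructor accepts. Cells left without valid neighbours, such as the bottom-right cell here, end up as islands and would need explicit handling.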
- Experience in dealing with raster data (e.g. GeoTIFF), ideally in Python (through `rasterio`)
- Familiarity with PySAL main data structures
- General familiarity with `pandas` and `geopandas` data structures (e.g. `Geo/DataFrames`, `Geo/Series`), and data manipulation operations available in `pandas`
- Documentation for `esda`, `giddy`, `inequality`, `segregation`
- Documentation for `rasterio`
With the exception of seemingly unrelated regressions (SUR), the models covered in pysal.spreg only deal with cross-sectional data. There is a lack of support to deal with common spatial panel model settings, i.e., situations with observations in both the spatial and time domain.
The goal of this project is to extend the functionality in pysal.spreg with data handling, estimation methods and specification tests for both static and dynamic spatial panel models. This will cover fixed effects as well as random effects specifications. The initial focus will be on models where the cross-sectional dimension dominates (N >> T), and include estimation methods and specification tests for spatial lag, spatial error and spatial Durbin specifications. The ultimate goal is to also include functionality to deal with more general spatial effects in models with both large N and large T.
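Two building blocks that recur in static spatial panel models can be sketched as follows: the within (demeaning) transformation used for unit fixed effects, and the spatial lag of a stacked panel, (I_T ⊗ W) y. This is a minimal sketch, not `pysal.spreg` code. The function names are hypothetical, observations are assumed to be stacked as T consecutive blocks of N units, and dense arrays are used for brevity where `scipy.sparse` would be the realistic choice.

```python
import numpy as np


def within_transform(y, n, t):
    """Subtract each unit's time mean (removes unit fixed effects).

    y is a length n*t vector stacked as t consecutive blocks of n units.
    """
    panel = np.asarray(y, dtype=float).reshape(t, n)
    return (panel - panel.mean(axis=0)).reshape(-1)


def panel_spatial_lag(w, y, t):
    """Spatial lag of a stacked panel, equivalent to (I_t kron W) @ y.

    The Kronecker product is never formed: each period's block of y is
    multiplied by W separately.
    """
    w = np.asarray(w, dtype=float)
    n = w.shape[0]
    panel = np.asarray(y, dtype=float).reshape(t, n)
    return (panel @ w.T).reshape(-1)


# Three units on a line, row-standardised weights, two time periods.
w = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
y = np.arange(6, dtype=float)
wy = panel_spatial_lag(w, y, t=2)
y_within = within_transform(y, n=3, t=2)
```

Avoiding the explicit Kronecker product matters in the N >> T setting, since the full (NT x NT) matrix grows quickly even though it is block diagonal.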
- Familiarity with pysal.spreg, SciPy sparse matrices (scipy.sparse) and NumPy
- Solid understanding of panel data econometrics and fundamentals of spatial econometrics
- Anselin, Luc, Julie Le Gallo and Hubert Jayet (2008). Spatial panel econometrics. In L. Matyas and P. Sevestre (Eds.), The Econometrics of Panel Data, Fundamentals and Recent Developments in Theory and Practice (3rd Edition), pp. 627-662. Berlin: Springer-Verlag.
- Lee, Lung-Fei and Jihai Yu (2011). Estimation of spatial panels. Foundations and Trends in Econometrics 4, 1-164.
- Elhorst, J. Paul (2014). Spatial Econometrics, From Cross-Sectional Data to Spatial Panels. Berlin: Springer-Verlag.
Gaussian processes are becoming a critical part of machine learning. These models have fundamental utility in geographical applications because they allow us to model the effects of interactions between places. Spatial Gaussian processes are a two-dimensional extension of the classic one-dimensional Gaussian process. This project could include implementations of Kriging/co-Kriging or Bayesian Gaussian process regression like Gelfand et al. (2003) in Python tools. This project encourages students to examine the implementation and exploration of Gaussian processes for spatial data using scikit-learn, GPy, or PyMC3.
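To make the idea concrete, the following is a minimal NumPy sketch of zero-mean Gaussian process regression over two-dimensional coordinates. It is an illustration only and does not use the scikit-learn, GPy, or PyMC3 APIs. The function names are hypothetical, and the squared-exponential kernel and fixed hyperparameters are assumptions; a real implementation would estimate them from the data.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 2-D locations."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)


def gp_predict(coords, y, new_coords, length_scale=1.0, variance=1.0,
               noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at new locations."""
    coords = np.atleast_2d(np.asarray(coords, dtype=float))
    new_coords = np.atleast_2d(np.asarray(new_coords, dtype=float))
    y = np.asarray(y, dtype=float)

    # Covariance among observed sites, plus a small noise (nugget) term.
    k = rbf_kernel(coords, coords, length_scale, variance)
    k += noise * np.eye(len(coords))
    # Covariance between prediction sites and observed sites.
    k_star = rbf_kernel(new_coords, coords, length_scale, variance)

    # Solve through the Cholesky factor instead of inverting k.
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    mean = k_star @ alpha
    v = np.linalg.solve(chol, k_star.T)
    var = variance - (v ** 2).sum(axis=0)
    return mean, var


coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
mean, var = gp_predict(coords, y, [[0.5, 0.5]])
```

With a known constant mean of zero this corresponds to simple kriging with a squared-exponential covariance. Predictions far from any observation revert to the prior mean, with variance close to the prior variance.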
- Introductory example from `sklearn`
- Gaussian Processes for Machine Learning
- MacKay's Introduction to Gaussian Processes
- Interest in spatial statistics and machine learning
- Experience with PyMC3, GPy, or scikit-learn preferred
ESDA enhancement (esda#61)
- local join counts
- multinomial join counts (i.e. multiclass/color)
- multivariate join count statistics (i.e. more than one binary variate)
- cage statistic criterion for aggregation error
- local gamma
- multivariate Geary (d) ljwolf/multi_c.py
- local indicator of spatial heterogeneity (LOSH, Getis & Ord?)
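As an illustration of the first item, a univariate local join count for a binary variable can be written as BB_i = x_i * sum_j w_ij x_j with binary weights. The sketch below is not `esda` code: the function name is hypothetical, neighbours are given as a plain dictionary, and the conditional permutation inference that a full implementation would need is left out.

```python
def local_join_counts(x, neighbors):
    """Local join count BB_i = x_i * sum_j w_ij * x_j (binary weights).

    x         -- dict mapping observation id to 0 or 1
    neighbors -- dict mapping observation id to a list of neighbour ids
    """
    return {
        i: x[i] * sum(x[j] for j in nbrs)
        for i, nbrs in neighbors.items()
    }


# Four observations on a line: 0 - 1 - 2 - 3
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = {0: 1, 1: 1, 2: 0, 3: 1}

bb_local = local_join_counts(x, neighbors)
# With symmetric weights each 1-1 join is counted from both ends,
# so the global BB count is half the sum of the local counts.
bb_global = sum(bb_local.values()) / 2
```

Observation 3 has the value 1 but no neighbour with value 1, so its local count is zero.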
Point pattern analysis is fundamental to GIS and spatial statistics. It is the study of the spatial arrangement of points in space. PySAL/pointpats was developed to meet the need for high-level, easy-to-use functionality for statistical analysis of planar point patterns in Python.
The goal of this project is to extend the existing point pattern analysis methods in PySAL and pointpats, e.g. through clustering workflows, and to implement new functionality for analysing the relationship between two or more point patterns, while keeping an eye on performance. This project could include functionality to support a workflow for cluster analysis: reducing point cloud dimensionality, clustering point clouds with existing clustering methods, and hulling the identified clusters. Visualisations that allow for comparison of, and quick iteration over, different clustering methods could be developed for `pysal/splot` and exposed in `pysal/pointpats`. Additionally, this project encourages students to examine the implementation and suitability of methods for comparative or relational analysis between two or more point patterns in Python, e.g. the Wasserstein distance in SciPy.
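The hulling step of such a workflow can be sketched in pure Python. The sketch assumes cluster labels have already been produced by some clustering method, with `-1` marking noise points (the convention used by scikit-learn's DBSCAN). The function names are hypothetical, and Andrew's monotone chain is chosen only because it is short and dependency-free.

```python
from collections import defaultdict


def convex_hull(points):
    """Convex hull by Andrew's monotone chain, counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    # The last point of each half is the first point of the other.
    return lower[:-1] + upper[:-1]


def hull_clusters(points, labels, noise_label=-1):
    """Group points by cluster label and return one hull per cluster."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != noise_label:
            groups[label].append(point)
    return {label: convex_hull(members) for label, members in groups.items()}


points = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5),   # cluster 0
          (5, 5), (6, 5), (5.5, 6),                      # cluster 1
          (20, 20)]                                      # noise
labels = [0, 0, 0, 0, 0, 1, 1, 1, -1]
hulls = hull_clusters(points, labels)
```

Each hull is a list of vertices that could be turned into a polygon for plotting or for further analysis. Interior points, such as `(0.5, 0.5)` here, are dropped.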
- Ideally experience with spatial statistics, spatial clustering and point pattern analysis
- Familiarity with PySAL and pointpats main data structures
- Familiarity with visualisation in `matplotlib`, and data analysis with `sklearn` and `scipy`
- Documentation for `pointpats` and `sklearn.cluster`
- 'Introduction to point pattern analysis'
- Examples for Indices of Dependence Between Types in Multivariate Point Patterns
PySAL is an open source project and as such we invite contributions from any interested developer. If you have an idea for an enhancement for PySAL, please contact one of the developers to discuss the possibilities for the project in GSoC 2020.
Some of the above guidelines were 'borrowed' from previously successful GSoC Mentoring Organizations, such as Julia and Statsmodels.
- January 15 - February 5 - sub-organization applications due
- February 5-19 - organizations reviewed by Google
- February 20 - List of accepted mentoring organizations published
- February 20 - March 16 - Student participants discuss application ideas with mentoring organizations
- March 16 - Application period begins
- March 31 - Student application deadline
- April 27 - Accepted student proposals announced
- April 27 - May 18 - community bonding
- May 18 - Coding begins!
- August 10-17 - final week
- August 25 - Final results of Google Summer of Code 2020 announced
Source: https://developers.google.com/open-source/gsoc/timeline