
Google Summer of Code 2020

Stefanie Lumnitz edited this page Jan 27, 2020 · 7 revisions

PySAL invites students to join its development by applying for Google Summer of Code 2020. This is the fifth year PySAL is seeking to participate, and we are applying under the NumFOCUS organization.

Introduction

PySAL is an open source library of spatial analysis functions written in Python, intended to support the development of high-level applications. See our documentation for more details. The developer guide describes in more detail how to make contributions to PySAL and our workflow for contributing to the project. Our issues are also on GitHub; they include bug reports, 'wishlist' items, and enhancement plans and ideas.

If you are interested in participating in GSoC as a student, the best approach is to become an active and engaged contributor to the project right away. You should take a look at some of the existing issues on GitHub and see if there are any you think you might be able to take a crack at. Try submitting a pull request for something and start getting the hang of the process and interacting with the PySAL code base and development community. It is a good idea to start on your proposal early, post a draft to the pysal chat room and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.

Project Ideas

Below is a list of possible projects that students might consider. We also encourage students to propose their own projects, though several of the following topics are relatively high on our priority list. Our priority list is flexible, and it is important that the topic matches the interest and background of the student.

When considering the following projects, don't be put off by the knowledge prerequisites -- you don't need to be an expert, and there is some scope for research and learning within the GSoC period. However, familiarity with and interest in the subject area and involved technologies will be helpful!

GSOC brainstorming

Raster awareness in PySAL

Much of the functionality currently implemented throughout PySAL is, in principle, suitable for both vector and raster data. Examples include global and local statistics in esda such as Moran's I, (spatial) Markov chains in giddy, and inequality and segregation measures in their respective packages. Although PySAL methods abstract from data sources, the library is currently designed to make workflows based on vector data much smoother than those based on raster data. However, applications of these methods that rely on raster rather than vector data are likely to become more common, as data traditionally generated in vector format is increasingly released as raster (e.g. population data in the GHSL, or demographic statistics from the WorldPop project).

This project will develop a thin layer that interfaces functionality in PySAL with data provided in raster format. At their core, most methods in PySAL require an array-like data structure (e.g. a one- or two-dimensional numpy.ndarray, a pandas.Series or a pandas.DataFrame) and a spatial weights matrix expressed in the pysal.lib.weights.W form. An interface to raster data will provide functionality to seamlessly build these objects from rasters. Raster access will be offloaded to rasterio, but the project will cover the steps required to transform the data structures provided by rasterio into those that PySAL expects. These include, for example: identifying missing values in the raster and excluding them from the computation of the spatial weights matrix; reshaping and aligning raster data with the spatial weights matrix; and processing outputs from PySAL so that they can be written out using rasterio. Once these steps have been accomplished, the project could also identify areas in the analytical libraries of interest (e.g. esda, giddy) that could be optimised to scale to the size of typical raster datasets.
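The masking and alignment steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the small array stands in for a band that would normally be read through rasterio, the nodata value of -9999 is made up, rook contiguity is an arbitrary choice, and `raster_to_neighbors` is a hypothetical helper name, not an existing PySAL function. The resulting dictionary has the neighbors-by-id shape that a weights object is usually built from.

```python
import numpy as np

def raster_to_neighbors(band, nodata):
    """Build rook-contiguity neighbors for the valid cells of a 2-D band.

    Returns (values, neighbors): `values` holds the valid cell values in
    row-major order, and `neighbors` maps each valid cell's position in
    `values` to the positions of its valid rook neighbors.
    """
    valid = band != nodata
    # Give valid cells consecutive ids in row-major order; -1 marks nodata.
    ids = np.full(band.shape, -1, dtype=int)
    ids[valid] = np.arange(valid.sum())

    neighbors = {int(i): [] for i in ids[valid]}
    n_rows, n_cols = band.shape
    for r, c in zip(*np.nonzero(valid)):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < n_rows and 0 <= cc < n_cols
            if inside and valid[rr, cc]:
                neighbors[int(ids[r, c])].append(int(ids[rr, cc]))
    return band[valid], neighbors

# Hypothetical 3 x 3 band with two missing cells.
band = np.array([[1.0, 2.0, -9999.0],
                 [3.0, -9999.0, 4.0],
                 [5.0, 6.0, 7.0]])
values, neighbors = raster_to_neighbors(band, nodata=-9999.0)
```

Because missing cells never receive an id, `values` and `neighbors` stay aligned: position `i` in the data vector is observation `i` in the neighbor structure. Writing results back to a raster would reverse the mapping through the same `valid` mask.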

Skills

  • Experience in dealing with raster data (e.g. GeoTIFF), ideally in Python (through rasterio)
  • Familiarity with PySAL main data structures
  • General familiarity with pandas and geopandas data structures (e.g. Geo/DataFrames, Geo/Series), and data manipulation operations available in pandas

Related Readings

Difficulty Level: low/intermediate

Mentors

Panel Data Spatial Econometrics

With the exception of seemingly unrelated regressions (SUR), the models covered in pysal.spreg only deal with cross-sectional data. There is a lack of support to deal with common spatial panel model settings, i.e., situations with observations in both the spatial and time domain.

The goal of this project is to extend the functionality in pysal.spreg with data handling, estimation methods and specification tests for both static and dynamic spatial panel models. This will cover fixed effects as well as random effects specifications. The initial focus will be on models where the cross-sectional dimension dominates (N >> T), and include estimation methods and specification tests for spatial lag, spatial error and spatial Durbin specifications. The ultimate goal is to also include functionality to deal with more general spatial effects in models with both large N and large T.
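As a rough illustration of the data handling involved, the sketch below shows two building blocks of a static fixed-effects spatial lag panel model: the within transformation that removes unit fixed effects, and a spatial lag computed period by period with one N x N weights matrix. It is a NumPy-only sketch, not part of pysal.spreg; the function names are hypothetical, and it assumes the data are stacked as successive cross-sections (all N units for period 0, then all N units for period 1, and so on).

```python
import numpy as np

def within_transform(y, n_units, n_periods):
    """Remove unit fixed effects by subtracting each unit's time mean.

    Assumes `y` is stacked period by period: the first n_units entries
    are period 0, the next n_units entries are period 1, and so on.
    """
    panel = np.asarray(y, dtype=float).reshape(n_periods, n_units)
    demeaned = panel - panel.mean(axis=0, keepdims=True)
    return demeaned.reshape(-1)

def spatial_lag_by_period(y, w, n_periods):
    """Apply the same N x N weights matrix to every period's cross-section."""
    n_units = w.shape[0]
    panel = np.asarray(y, dtype=float).reshape(n_periods, n_units)
    # Each row of `panel` is one cross-section x; panel @ w.T computes w @ x.
    return (panel @ w.T).reshape(-1)

# Hypothetical panel: N = 3 units observed over T = 2 periods.
y = np.array([1.0, 10.0, 100.0,
              3.0, 20.0, 300.0])
w = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])  # row-standardized weights for a 3-unit chain

y_within = within_transform(y, n_units=3, n_periods=2)
wy = spatial_lag_by_period(y, w, n_periods=2)
```

The period-by-period lag is equivalent to multiplying the stacked vector by the Kronecker product of a T x T identity matrix and W, but it avoids forming the NT x NT matrix, which matters when N is large.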

Skills

  • Familiarity with pysal.spreg, SciPy sparse matrices (scipy.sparse) and NumPy
  • Solid understanding of panel data econometrics and fundamentals of spatial econometrics

Related Readings:

  • Anselin, Luc, Julie Le Gallo and Hubert Jayet (2008). Spatial panel econometrics. In L. Matyas and P. Sevestre (Eds.), The Econometrics of Panel Data, Fundamentals and Recent Developments in Theory and Practice (3rd Edition), pp. 627-662. Berlin: Springer-Verlag.
  • Lee, Lung-Fei and Jihai Yu (2011). Estimation of spatial panels. Foundations and Trends in Econometrics 4, 1-164.
  • Elhorst, J. Paul (2014). Spatial Econometrics, From Cross-Sectional Data to Spatial Panels. Berlin: Springer-Verlag.

Difficulty Level: intermediate

Mentors: Pedro Amaral, Luc Anselin, Sergio Rey

Spatial Gaussian Process Models

Gaussian processes are becoming a critical part of machine learning. These models have fundamental utility in geographical applications because they allow us to model the effects of interactions between places. Spatial Gaussian processes are a two-dimensional extension of the classic one-dimensional Gaussian process. This project could include implementations of Kriging/co-Kriging or Bayesian Gaussian process regression, like Gelfand et al. (2003), in Python tools. Students are encouraged to examine the implementation and exploration of Gaussian processes for spatial data using scikit-learn, GPy, or PyMC3.
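To make the idea concrete, here is a minimal NumPy sketch of zero-mean Gaussian process regression over two-dimensional coordinates, which is closely related to simple kriging. The squared-exponential kernel, its parameters, the toy data, and the function names are all illustrative assumptions. A project implementation would more likely build on scikit-learn, GPy, or PyMC3 than on hand-written linear algebra.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 2-D coordinates."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(coords, y, new_coords, noise=1e-6, **kernel_kwargs):
    """Posterior mean and variance of a zero-mean GP at new locations."""
    k_train = sq_exp_kernel(coords, coords, **kernel_kwargs)
    k_cross = sq_exp_kernel(new_coords, coords, **kernel_kwargs)
    k_new = sq_exp_kernel(new_coords, new_coords, **kernel_kwargs)
    # Observation noise on the diagonal also stabilizes the linear solves.
    k_noisy = k_train + noise * np.eye(len(coords))
    mean = k_cross @ np.linalg.solve(k_noisy, y)
    cov = k_new - k_cross @ np.linalg.solve(k_noisy, k_cross.T)
    return mean, np.diag(cov)

# Hypothetical observations at the corners of a unit square.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 2.0, 3.0])

# Predicting at an observed site with near-zero noise should return
# (almost) the observed value with (almost) zero variance.
mean, var = gp_predict(coords, y, coords[:1], noise=1e-8)
```

The covariance depends only on the distance between locations, which is how the model encodes the assumption that nearby places are more alike than distant ones.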

Related Reading

Skills

  • interest in spatial statistics and machine learning
  • experience with PyMC3, GPy, or scikit-learn preferred

Difficulty Level: high

Mentors

ESDA enhancement (esda#61)

  • local join counts

  • multinomial join counts (i.e. multiclass/color)

  • multivariate join count statistics (i.e. more than one binary variate)

  • cage statistic criterion for aggregation error

  • local gamma

  • multivariate geary (d) ljwolf/multi_c.py

  • local indicator of spatial heterogeneity (LOSH, getis & ord?)
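As an illustration of the first item, one common definition of the univariate local join count for a binary variable x is x_i times the sum of x_j over the neighbors j of i, using binary weights. The sketch below follows that definition; the function name, the neighbor dictionary, and the data are hypothetical, and this is not the esda API.

```python
def local_join_counts(x, neighbors):
    """Local join count per observation: x_i * sum of x_j over neighbors j.

    `x` is a sequence of 0/1 values and `neighbors` maps each index to a
    list of neighboring indices (binary contiguity).
    """
    return [x[i] * sum(x[j] for j in neighbors[i]) for i in range(len(x))]

# Hypothetical chain of four areas: 0 - 1 - 2 - 3.
x = [1, 1, 0, 1]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ljc = local_join_counts(x, neighbors)
```

With symmetric neighbors, the local counts sum to twice the global count of joins between two areas that both have x = 1, since each such join is seen from both ends.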

spopt models [Serge, Levi, ...]

Point pattern analysis extension

Point pattern analysis is fundamental to GIS and spatial statistics. It is the study of the spatial arrangement of points in space. PySAL/pointpats was developed to meet the need for high-level, easy-to-use functionality for statistical analysis of planar point patterns in Python.

The goal of this project is to extend existing point pattern analysis methods in PySAL and pointpats, e.g. through clustering workflows, and to implement new functionality for analysing the relationship between two or more point patterns, while keeping an eye on performance. This project could include functionality to facilitate a cluster analysis workflow: reducing point cloud dimensionality, clustering point clouds with existing clustering methods, and hulling the identified clusters. Visualisations that allow for comparison of, and quick iteration over, different clustering methods could be developed for pysal/splot and exposed in pysal/pointpats. Additionally, this project encourages students to examine the implementation and suitability of methods for comparative or relational analysis between two or more point patterns in Python, e.g. the Wasserstein distance in SciPy.
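The hulling step of such a workflow can be sketched in pure Python. The example assumes that cluster labels have already been produced by some clustering method and wraps each cluster in its convex hull using Andrew's monotone chain algorithm. The function names and data are hypothetical; in practice a library routine, or a tighter hull such as an alpha shape, might be preferred.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

def hull_by_cluster(points, labels):
    """Group points by cluster label and compute one hull per cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return {label: convex_hull(members) for label, members in clusters.items()}

# Hypothetical points with labels from an earlier clustering step.
points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (10, 10), (12, 10), (11, 12)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
hulls = hull_by_cluster(points, labels)
```

Here the interior point (1, 1) is dropped from the hull of cluster 0, so each cluster is summarised by its boundary vertices only.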

Skills

  • Ideally experience with spatial statistics, spatial clustering and point pattern analysis
  • Familiarity with PySAL and pointpats main data structures
  • Familiarity with visualisation in matplotlib, and data analysis with sklearn and scipy

Related Readings:

Difficulty Level: intermediate

Mentors:

Other

PySAL is an open source project and as such we invite contributions from any interested developer. If you have an idea for an enhancement for PySAL please contact one of the developers to discuss the possibilities for the project in GSOC20.

Some of the above guidelines were 'borrowed' from previously successful GSoC Mentoring Organizations, such as Julia and Statsmodels.

Timeline

  • January 15 - February 5 - sub-organization applications due
  • February 5-19 - organizations reviewed by Google
  • February 20 - List of accepted mentoring organizations published
  • February 20 - March 16 - Student participants discuss application ideas with mentoring organizations
  • March 16 - Application period begins
  • March 31 - Student application deadline
  • April 27 - Accepted student proposals announced
  • April 27 - May 18 - community bonding
  • May 18 - Coding begins!
  • August 10-17 - final week
  • August 25 - Final results of Google Summer of Code 2020 announced

Source: https://developers.google.com/open-source/gsoc/timeline
