"Geo Data Science with Python", Fall 2021

### Notebook Final Project Instructions

---

# Introduction

In the final project, we apply the Python and data analysis skills gained in this course to explore and mine climate or geoscience data for a region of interest.

The learning objectives of the final project are:

- Create a Python program that processes large geo datasets quickly and efficiently.
- Apply various data analysis and machine learning methods in Python to mine geo data.
- Visualize, summarize, and interpret results from a Python data analysis algorithm.



---
# Task

The major tasks of this project are:

1. Define a data analysis problem (a set of questions to explore) relevant to you, which includes analysis of gridded spatiotemporal data for a certain region and applies at least one machine learning method. I would like everyone to find a project that excites them. The next section lists some criteria for the problem.
2. Summarize your geospatial data analysis problem in written form and send it to me via e-mail by Saturday, November 20th, 2021.
3. Develop a Python algorithm to map the data and approach the defined problem for at least one region.
4. Perform the data analysis. Include code to visualize your results in a professional manner.
5. Summarize your work in a paper and present your dataset and results in a lightning talk.


## Criteria for choosing your data analysis problem:

- Choose a geographical region and, within it, as many non-uniform boundaries as you are interested in. These can be countries, US states, or river basins.

- Select at least **two geophysical variables** you want to study. Both should exhibit spatial variability and at least one of them temporal variability. Choose appropriate gridded datasets from NASA archives or other data services and use Python code to download at least one of the datasets. Example datasets are given below.
    - At least one dataset should be "large", with the slice you are using having a minimum spatial size of roughly $\mathrm{latsize} \times \mathrm{lonsize} > 100 \times 100$ and either covering a very long period (e.g. 100 years) or providing high temporal sampling (e.g. monthly for 20 years, daily for 5 years).
    - The second dataset can also be any other non-uniformly sampled dataset you are interested in (e.g. groundwater well time series, seismic data, deformation data, ...).


- **The data analysis problem should include**:
    - Characterization and visualization of spatial and/or temporal variations in the geophysical variables (e.g. via statistical metrics for each grid point and for the region you chose), and a comparison of the two geophysical variables (e.g. temporal or spatial changes, statistical metrics, and output of the next item). **Define at least 3 questions you want to approach for these tasks.**
    - **Apply at least one machine learning approach** exploring features in your dataset. **Define at least 1 question you want to approach.**
        - Build a regression model, either to estimate periodic and long-term changes in at least one dataset for the region of interest or to predict a third geophysical variable from two others.
        - Apply one of the unsupervised machine learning methods (clustering and/or PCA) to explore features in your datasets.
    - Discussion of the meaning of the extracted features and how they can support your research.


- Please consult me for ideas or discussion on how to utilize the methods. 

- **Undergraduate students** may focus on **one** gridded dataset; application of the machine learning approach will be graded for extra credit.


- If you work in a team, please list the names of all collaborators in the docstring of the script file and include a statement on individual contributions at the end of the report. Provide detail: who coded which part, which parts you worked on together, who wrote which part of the summary, etc.
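
One possible way to approach the regression option named in the criteria above (estimating periodic and long-term changes) is an ordinary least-squares fit of a linear trend plus an annual cycle at every grid point. The sketch below is only an illustration under stated assumptions: the function name `fit_trend_and_annual_cycle`, the synthetic data cube, and the choice of a single annual harmonic are not part of the assignment; time is assumed to be given in decimal years and the data are assumed to contain no missing values.

```python
import numpy as np


def fit_trend_and_annual_cycle(t_years, cube):
    """Least-squares fit of offset + linear trend + annual cycle per grid point.

    t_years : 1-D array of decimal years, length nt (assumed time convention)
    cube    : array of shape (nt, nlat, nlon) without NaNs (assumption)
    Returns the trend (units per year) and the annual amplitude,
    each with shape (nlat, nlon).
    """
    t = np.asarray(t_years, dtype=float)
    cube = np.asarray(cube, dtype=float)
    nt, nlat, nlon = cube.shape

    # Design matrix: constant, centered time, and one annual harmonic.
    design = np.column_stack([
        np.ones_like(t),
        t - t.mean(),
        np.cos(2 * np.pi * t),
        np.sin(2 * np.pi * t),
    ])

    # Solve for all grid points at once: each column of the reshaped
    # cube is the time series of one grid point.
    coeffs, *_ = np.linalg.lstsq(design, cube.reshape(nt, -1), rcond=None)

    trend = coeffs[1].reshape(nlat, nlon)
    amplitude = np.hypot(coeffs[2], coeffs[3]).reshape(nlat, nlon)
    return trend, amplitude


# Synthetic example (made-up numbers): 20 years of monthly data on a 4 x 5 grid
# with a trend of 0.5 per year, an annual amplitude of 3.0, and small noise.
rng = np.random.default_rng(0)
t = 2000 + np.arange(240) / 12
signal = 0.5 * (t - t[0]) + 3.0 * np.cos(2 * np.pi * t)
cube = signal[:, None, None] + 0.1 * rng.standard_normal((t.size, 4, 5))

trend, amplitude = fit_trend_and_annual_cycle(t, cube)
print("mean trend:", trend.mean(), "mean annual amplitude:", amplitude.mean())
```

Solving for all grid points in a single `np.linalg.lstsq` call avoids a Python loop over latitude and longitude, which matters for the "large" dataset required above. Real datasets usually contain gaps or masked cells, which would need to be handled before a fit like this.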

---
# Data Sets

## Region boundary

Please consult me for this. We have not covered importing shapefiles in Python. I can either provide a text/.csv file for the boundaries you are interested in or help with converting them into a text file.
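
Once a boundary is available as a text file, it can be turned into a mask on the data grid. The following is a minimal sketch, assuming a hypothetical file layout with one `lon,lat` vertex per row and no header, and assuming the boundary uses the same longitude convention as the grid. The function names `load_boundary` and `region_mask` are made up for illustration, and the point-in-polygon test is a plain ray-casting implementation in NumPy.

```python
import io

import numpy as np


def load_boundary(source):
    """Read polygon vertices from a text file with 'lon,lat' per row.

    The two-column, header-free layout is an assumption; adjust it to
    the boundary file you actually receive.
    """
    vertices = np.loadtxt(source, delimiter=",", ndmin=2)
    return vertices[:, 0], vertices[:, 1]


def region_mask(lon, lat, poly_lon, poly_lat):
    """Boolean mask of shape (nlat, nlon), True for grid points inside the polygon."""
    lon2d, lat2d = np.meshgrid(np.asarray(lon, float), np.asarray(lat, float))
    poly_lon = np.asarray(poly_lon, float)
    poly_lat = np.asarray(poly_lat, float)

    inside = np.zeros(lon2d.shape, dtype=bool)
    j = len(poly_lon) - 1  # index of the previous vertex (closes the polygon)
    for i in range(len(poly_lon)):
        xi, yi = poly_lon[i], poly_lat[i]
        xj, yj = poly_lon[j], poly_lat[j]
        # Does the edge (j -> i) cross the horizontal line through the point?
        crosses = (yi > lat2d) != (yj > lat2d)
        # Longitude at which the edge crosses that line (horizontal edges
        # divide by zero, but 'crosses' is False for them, so they are ignored).
        with np.errstate(divide="ignore", invalid="ignore"):
            lon_cross = (xj - xi) * (lat2d - yi) / (yj - yi) + xi
        # Each crossing to the east of the point toggles inside/outside.
        inside ^= crosses & (lon2d < lon_cross)
        j = i
    return inside


# Made-up example: a 10 x 10 degree square and a coarse grid.
boundary_file = io.StringIO("0,0\n10,0\n10,10\n0,10\n")
poly_lon, poly_lat = load_boundary(boundary_file)
lon = np.array([-5.0, 5.0, 15.0])
lat = np.array([5.0, 20.0])
mask = region_mask(lon, lat, poly_lon, poly_lat)
print(mask)
```

The mask can then be applied to a gridded field, for example with `np.where(mask, field, np.nan)`, before computing regional statistics. If matplotlib is available, `matplotlib.path.Path.contains_points` provides a ready-made point-in-polygon test that could replace the loop.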

## Gridded Datasets

Here is a list of suggested datasets you may explore during your project (see today's lecture slides and the links below for more information):


- Monthly GPM precipitation: 
    - https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGM_06/summary?keywords=%22IMERG%20final%22


- GISTEMP
    - https://data.giss.nasa.gov/gistemp/
    - https://data.giss.nasa.gov/pub/gistemp/gistemp250.nc.gz
    - If you want a higher spatial resolution, use the temperature variable from the GLDAS model dataset.


- Soil Moisture Model GLDAS, including soil moisture, but also other hydrological variables (evapotranspiration, temperature, precipitation, snow) 
    - https://daac.gsfc.nasa.gov/datasets/GLDAS_NOAH025_M_2.1/summary?keywords=GLDAS%20v2.1


- Terrestrial Water Storage from the GRACE satellites 
    - http://www2.csr.utexas.edu/grace/RL06_mascons.html
    - https://podaac.jpl.nasa.gov/dataset/TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06_V2?ids=Processing%20Levels&values=3%20-%20Gridded%20Observations&search=grace%20jpl%20mascon&provider=POCLOUD


- Global drought index SPEI:
    - https://spei.csic.es/
    - More information: https://climatedataguide.ucar.edu/climate-data/standardized-precipitation-evapotranspiration-index-spei


- Historical Land-Cover Change and Land-Use Conversions Global Dataset:
    - https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00814




- Any other data you find interesting on NASA Earth data distributions through ESDS: https://earthdata.nasa.gov/


- Other datasets:
    - Natural Earth: raster and vector map data for visualization: http://www.naturalearthdata.com/downloads/
    - Koeppen climate zones: http://climateviewer.org/history-and-science/geoscience-and-oceanography/maps/geography-of-the-koppen-climate-classification-system-overlay/
    - 15 free satellite imagery sources: https://gisgeography.com/free-satellite-imagery-data-list/  
    - Global Administrative Areas (GADM): http://gadm.org/
    - Global Lake and Wetland Database: https://www.worldwildlife.org/pages/global-lakes-and-wetlands-database
    - Shoreline, Coastlines: https://www.ngdc.noaa.gov/mgg/shorelines/gshhs.html
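
Since the criteria ask for at least one dataset to be downloaded with Python code, here is a minimal sketch of how that could look for the gzip-compressed GISTEMP file listed above, using only the standard library. The function name `download_gzipped` and the output file name are assumptions for illustration. The sketch does not handle authentication, so it will not work as-is for archives that require a login.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# URL copied from the GISTEMP entry in the dataset list.
GISTEMP_URL = "https://data.giss.nasa.gov/pub/gistemp/gistemp250.nc.gz"


def download_gzipped(url, out_path):
    """Download a gzip-compressed file and save the decompressed content.

    If out_path already exists, the download is skipped so that
    re-running the analysis does not transfer the data again.
    """
    out_path = Path(out_path)
    if out_path.exists():
        return out_path

    # Stream the response through the gzip decoder straight to disk,
    # so the whole file never has to fit in memory.
    with urllib.request.urlopen(url) as response:
        with gzip.GzipFile(fileobj=response) as unzipped:
            with open(out_path, "wb") as target:
                shutil.copyfileobj(unzipped, target)
    return out_path


# Example call (needs a network connection, so it is left commented out);
# the local file name is a free choice:
# nc_file = download_gzipped(GISTEMP_URL, "gistemp250.nc")
```

The resulting NetCDF file can then be opened with whichever reader you prefer. Checking for an existing local copy first keeps repeated runs of the script fast.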


---
# What to deliver and when?

1. Write down your geospatial data analysis problem (a set of questions to explore) and send it to me via e-mail. Wait for confirmation before beginning the analysis. (**Saturday, November 20th, 2021**)
2. Data analysis code (in the form of a script file or notebook) that you used to conduct the required data analysis. (**December 8**)
3. A paper summarizing the work (minimum 3 pages, incl. figures, introduction, data, methods, results, conclusions) as a pdf file. (**December 8**)
4. Google slides for a **3-minute** presentation that you will use on December 7 for presenting your methods and results. Submit as a link to the slides. (**December 6**)
5. Lightning talk on **December 7, 8.00-9.15 am**.

**Note:** Part 2 could also be implemented in a Jupyter notebook, but it should contain only the most important results to summarize the work. In this case, also submit a pdf version of the notebook and keep in mind that code does not count toward the 3-page minimum.

---
# Final Project Presentation

**When?**
December 7 during class, 8.00-9.15 am

**Where?**
Computer Lab. Please let me know, if you need anything specific for the presentation.

**How?** 
Prepare a 3-minute lightning talk presented via Google slides, followed by 1-2 minutes of questions. We will have a tight schedule on this day, so please practice your presentation in advance and stick to the time limit.

Presentations will be given in the following order:
- Mohammad            
- Aaron               
- Alexander           
- Alina               
- Ben                 
- Dhari               
- Faisal              
- Tyler      
- Junyao              
- Maddie              
- Rose      
- Leonard             
- Sonia        
- Ahmed               


---
# Grading

- Python code: 40%
- Report paper: 30%
- Presentation: 30%