# Welcome to SIO(C) 209 - Deep Learning for Geo and Environmental Science

 - This hands-on course will introduce you to the application of deep learning techniques in the field of environmental sciences. 
 - It covers the main classes of supervised and unsupervised machine learning algorithms and provides practical experience in training and validating real-world models. 
 - You will gain the skills necessary to analyze environmental data, make predictions, and uncover hidden patterns.

### Prerequisites 

This is not an 'introduction to ML' course. I'll assume you have:
 - Prior knowledge of machine learning fundamentals, including linear algebra, gradient descent and backpropagation. 
 - Basic programming skills (e.g., Python).


### Course Objectives
By the end of the course, students should be able to:
 - Understand the principles of machine learning and its applications in environmental sciences.
 - Apply supervised and unsupervised machine learning techniques to environmental data.
 - Evaluate and validate machine learning models using appropriate metrics.
 - Interpret and communicate results effectively.
 - Develop proficiency in using machine learning libraries and tools in Python.


### Course Objectives - specifics

Specifically, we will cover:
- Preparing your data for machine learning
- Choosing a model to use
- Basic introduction to different classes of models:
  - Deep Neural Networks
  - Gaussian processes
  - Tree-based models
- Fine-tuning / hyperparameter estimation for specific example tasks

### Course Objectives - extras

If we have time, we will also cover:
 - Generative models
 - Large language models
 - A deep dive on *why* these models work at all

### Course Objectives - not covering 

For the avoidance of doubt, we won't cover:
- The mathematical underpinnings of:
  - Linear algebra (including Ridge, SVM, PCA, etc.)
  - Neural networks and backpropagation
  - Gradient descent
- Time series analysis
- Designing your own machine learning algorithms

### Course Objectives

Instead, we will try and build your understanding of these methods through practice.

We will work through published examples of:
- Climate model emulation
- Plankton classification
- Whale call detection
- Learning stream modification patterns in a river basin
- Satellite image clustering  

## Introductions

### About me
 - Assistant Professor at SIO and HDSI, joined in Spring 2023
 - Run the [Climate Analytics Lab (CAL)](http://climate-analytics-lab.github.io/) - harnessing machine learning to improve climate projections

<center><img src="_images/CAL.png" alt="logo"/></center>

 - PhD from University of Manchester 2011 in Theoretical Physics
 - Worked as a software consultant 2011-2015 - many projects!
 - Postdoc + Senior Postdoc at University of Oxford 2015-2023 
 - Moved over from the UK with my wife and two kids (11 & 13)

### Tell me about you
 
  - What year are you in and which program?
  - Previous experience with Python
  - Previous experience with machine learning / statistics
  - Any related projects you might have worked on

### Expectations
 - This is a graduate course so please feel free to interrupt to ask questions
 - Let me know if you find the pace too fast, or too slow
 - I would encourage you to work in pairs during class exercises
 - Let me know if there is something I can do to make your learning easier
 - Be courteous and respectful of staff and other learners at all times 
 - Feel free to use ChatGPT and other AI assistants in your work!

### Logistics

 - Most course material will be made available on GitHub
 - Course specific material will be on Canvas:
   - Lecture schedule and recordings
   - Links to key datasets
   - Office hours
   - Grading policy
   - Academic integrity policy

### Grading and course structure

- **TBD**
- Coursework and final project can be completed collaboratively, but must be submitted individually. Co-authors should be explicitly declared.

### Course textbook and resources

There is no official textbook but we will be heavily leaning on both:
 - Deep Learning with Python by François Chollet
 - Understanding Deep Learning by Simon Prince

We also build heavily on the "[Machine Learning for Earth and Environmental Sciences](https://tbeucler.github.io/2023_MLEES_Published_Book/)" course by Tom Beucler at the University of Lausanne and "[An Introduction to Earth and Environmental Data Science](https://earth-env-data-science.github.io/intro.html)" by Ryan Abernathey.

### Course textbook and resources

As this is the first running of this class, many notes and lectures will appear as we go along, but a Jupyter Book with extended notes (and the lecture slides) is available on GitHub: https://github.com/climate-analytics-lab/sioc209-2024-sp

If you find any mistakes or have suggestions for improvements you can always make a pull request :-)

## Python Setup

During the course we will explore a number of real use-cases which use a variety of Python libraries, reflecting the broad options available. 

We have tried to harmonize the Python environment for the course to make it as straightforward as possible, but you may still want to use multiple environments to keep the examples clean and separate.

We will use the rest of this lecture to ensure everyone has at least the base environment set up somewhere.

### Where to set up your environment

As we will discover, deep learning often involves large amounts of data: tens of GB or more. Sometimes much more.
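
To get a feel for these numbers, here is a back-of-the-envelope sketch. The dataset is purely hypothetical (a single daily variable on a 0.25-degree global grid for 40 years, stored as `float32`, ignoring leap days and compression) and is not one of the course datasets.

```python
# Hypothetical dataset, for illustration only:
# one daily variable on a 0.25-degree global grid for 40 years.
n_lat, n_lon = 720, 1440   # 180 / 0.25 and 360 / 0.25 grid points
n_days = 365 * 40          # leap days ignored for simplicity
bytes_per_value = 4        # float32

# Uncompressed size of the full array in bytes and in (decimal) gigabytes.
total_bytes = n_lat * n_lon * n_days * bytes_per_value
total_gb = total_bytes / 1e9

print(f"{total_gb:.1f} GB for a single variable")  # prints "60.5 GB for a single variable"
```

A handful of variables at this resolution would already exceed the memory of most laptops, which is part of why where the data lives matters.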

At the same time, training these algorithms can be much more efficient using specific hardware, such as GPUs and TPUs. Getting the data onto these chips is invariably the main bottleneck and so we want the data as close to them as possible (and ideally in memory).

Therefore, we often find ourselves wanting, or needing, to use remote environments on large clusters which host the data and the hardware.

### Remote options

Ideally you will have access to a dedicated cluster in your lab, or perhaps at a larger facility (e.g. [SDSC](https://www.sdsc.edu)). 

For this course we have deliberately cut down data sizes to make working on your laptop feasible, but you also have a couple of remote options:
- UCSD DataHub
- Google Colab

### Remote options - DataHub

DataHub provides free JupyterHub instances for classes at UCSD: https://datahub.ucsd.edu/

Click 'Log In' and select the SIO 209 course to get a specific configuration that includes most of the key libraries we will use pre-installed.

Shared datasets will also be made available in the read-only `public` folder for ease of access.

### Remote options - Google Colab

Google also provides a free (as in beer) platform for running small machine learning pipelines in the cloud: http://colab.research.google.com

With this you can open notebooks in Drive, from GitHub or uploaded locally. You can also make use of Google's TPUs for efficient learning on small-ish tasks.

Shared data can be added directly by mounting the 'CAL Shared Data' Drive in your workspace. Specific libraries (such as `xarray`) need to be installed each time.
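
A minimal sketch of what a first notebook cell on Colab might look like. The helper name `running_in_colab` and the mount point `/content/drive` are my choices for illustration, and checking `sys.modules` is a common heuristic rather than an official API. Where the shared data appears under the mount point depends on how the Drive was shared with you, so no path is assumed here.

```python
import subprocess
import sys


def running_in_colab() -> bool:
    """Heuristic check: Colab kernels have google.colab loaded at start-up."""
    return "google.colab" in sys.modules


if running_in_colab():
    # Mount Google Drive; Colab will ask you to authorize access.
    from google.colab import drive

    drive.mount("/content/drive")

    # Libraries installed this way do not persist between Colab sessions,
    # so this needs to run each time.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "xarray"])
else:
    print("Not running in Colab; skipping Drive mount and install.")
```

The guard means the same notebook can be run unchanged on DataHub or on your laptop.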

### Local options

As noted, most of the examples used in this course should be runnable on your laptop, albeit slowly. 

Setting up a local environment is best done using `conda` and `pip`. 

I usually create a new environment using `conda` with e.g. `xarray` and `cartopy` and then install the ML libraries with `pip` which then includes optimizations for your hardware. YMMV!


See detailed instructions [here](python_env_setup.md). Note, you can make use of Mac GPUs with `pip install tensorflow-metal`
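
Once TensorFlow is installed, a quick way to see whether it can find a GPU is sketched below. The function name `visible_gpus` is hypothetical; the import is guarded so the snippet still runs (and says so) in an environment without TensorFlow.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment.")
elif gpus:
    print(f"TensorFlow can see {len(gpus)} GPU(s): {gpus}")
else:
    print("No GPU visible to TensorFlow; training will run on the CPU.")
```

An empty list is not an error: the course examples should still run on the CPU, just more slowly.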

### Python requirements

The full list of Python packages required is in the [environment.yml](../../environment.yml) file, but these are the important ones:
 - `python=3.10` (for compatibility)
 - `xarray`
 - `netcdf4`
 - `cartopy`
 - `scikit-learn`
 - `tensorflow`
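
To check that an environment has these packages, something like the following sketch can be run in a notebook or a Python shell. The names `REQUIRED` and `missing_packages` are my own, and the mapping reflects the fact that the import name sometimes differs from the package name (e.g. `scikit-learn` is imported as `sklearn`).

```python
import sys
from importlib.util import find_spec

# Package name (as installed) -> module name (as imported).
REQUIRED = {
    "xarray": "xarray",
    "netcdf4": "netCDF4",
    "cartopy": "cartopy",
    "scikit-learn": "sklearn",
    "tensorflow": "tensorflow",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose modules cannot be found, without importing them."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All key packages found.")
```

This only confirms the packages can be located; it does not check versions or that they import cleanly on your hardware.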

