<a href="https://colab.research.google.com/github/IsabelleLebTay/Python-for-Ecologists/blob/main/Python_for_Ecologists.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a course designed to give a high-level overview of what ecologists can do in Python. It is not intended to teach you how to program in Python; for in-depth learning, see the Useful resources listed below.

Participants who are comfortable in any programming language will get the most out of this workshop. Most ecologists work in R, so there will be an emphasis on comparing actions in R to Python.

This runs on Python 3.

This is a tentative program:

1. Design differences between the two languages, strengths of both, similarities
- where to find install and packages info, IDE
- running scripts in terminal as an option
- Everything Is An Object
- Troubleshooting

2. Syntax and basics
- syntax and naming conventions
- basic data types in python: string, int, float, list, dict, set
- basic operations like indexing, math
- include a cheat sheet from R to python (for loop, calling/defining a function, etc)

3. Data manipulation and filtering
- pandas: the one and only dataframe package
- lots of exercises in pandas
- introduce a few very useful basic packages, like numpy, os, datetime, rpy2

4. Data visualisation
- seaborn, matplotlib (and how they compare with ggplot)

5. Stats
- statsmodels for fitting models, and showcase a few homebrew packages that are built off of statsmodels
- sklearn for predicting, regression trees, etc
- TensorFlow for machine learning?
- cmdstanpy? :)

5.5 Machine Learning?

6. GIS
- how to read in and use GIS data, manipulation
- focus on geopandas and fiona to get the basics of the data types
- briefly introduce gee (Google Earth Engine integration) and ArcPy for more complex GIS exercises

7. API integrations
- requests package: how to send HTTPS requests, and using the json method



# Welcome to Python for Ecologists!


## Intent
⚛ *ChatGPT can write you the code* ⚛ <br><br>
The goal of this workshop is to help you develop a structural understanding of the language, so you can <br>

*   ask the best prompts
*   understand errors
*   troubleshoot
*   learn what tools are available




## Useful resources
Python for Data Analysis: https://wesmckinney.com/book/

# Language Design

**Overview**


*   **R:**
  - Statistical language
  - Strengths: data analysis, stats modelling, plotting
  - Package ecosystem: CRAN

* **Python:**
  - General purpose language
  - Web dev, scripting, stats, data science, machine learning
  - Package ecosystem: PyPI (installed with pip) or conda channels (installed with conda)





### *Interpreted and sequential*

Python is an interpreted language. As in R, most errors only come up **when** the offending line of code is run.

It is also sequential: the order in which lines are written matters.
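
As a minimal sketch of what "sequential" means in practice, the snippet below uses a name before the line that defines it has run. The variable `owl_count` is a hypothetical name chosen for illustration; the `try`/`except` is only there so the example keeps running after the error.

```python
# Hypothetical example: using a name before the line that defines it has run
try:
    print(owl_count)              # owl_count does not exist yet
except NameError as err:
    message = str(err)
    print("Error:", message)

owl_count = 3                     # from this line on, the name exists
print(owl_count)
```

Swapping the order (assignment first, `print` second) removes the error, which is the same behaviour you would see in an R script.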

In [8]:
# let's get some packages
import rpy2

In [9]:
%load_ext rpy2.ipython

In [6]:
# each line is executed fully before the next
for i in range(6):
  print(i)

0
1
2
3
4
5


In [10]:
%%R
# the same loop in R (note that 1:6 counts from 1 to 6, while range(6) counts from 0 to 5)
for (i in 1:6) {
  print(i)
}


[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6


## *Object-oriented*

The number one rule in Python: Everything Is An Object.

Some popular IDEs for Python are JupyterLab and Visual Studio Code. You can also run R in both of them. In addition, VS Code has an R extension that is made to resemble the RStudio layout and functionality. It supports all you love about R: the pane layouts, documentation search, linting, R Markdown, Quarto, etc.

Unlike in RStudio, installing and managing your Python packages is more hands-on. The key is understanding how your IDE finds your Python installation and its packages. This is the Python environment.

**Python Environment**

A Python environment is a self-contained directory that contains a Python installation and various additional packages. It allows you to manage project-specific dependencies and versions, separate from other Python projects or the system-wide Python installation.

Virtual Environment: A virtual environment is a tool that helps to keep dependencies required by different projects in separate places. They are created by using a module, such as venv (which is part of the standard library in Python 3.3 and later) or virtualenv (a third-party tool). By using a virtual environment, you can avoid installing Python packages globally which could break system tools or other projects.

Usually, your virtual environment is created in the terminal. This will come up in a practical tip in the GIS section again. I *highly* recommend creating new Virtual Environments when you are using a large package that dominates your work for the given task. For example, I have a virtual env for GIS tasks, and another one for the Bayesian tasks, and yet another for internet API SSL tasks.

Dependencies: These are external libraries that your Python project needs to function properly. Each dependency may, in turn, have its own dependencies. Managing dependencies is one of the main reasons to use isolated Python virtual environments. For example, suppose your GIS workflow includes geopandas, and the geopandas release you rely on only supports pandas up to a certain version. If your base environment is always updated to the latest pandas (because as an ecologist, you are using pandas all day, every day), it can end up at a version that geopandas cannot support. A separate GIS environment avoids this conflict.

Requirements File: Often called requirements.txt, this file lists all of the packages that a project depends on. This file can be used with pip to install all of the necessary packages at once with the command pip install -r requirements.txt.
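
To see which environment a notebook or script is actually using, a quick check with the standard-library `sys` module can help. This is a minimal sketch: the comparison of `sys.prefix` and `sys.base_prefix` detects environments created with `venv`, and is not expected to detect conda environments.

```python
import sys

print(sys.executable)   # path to the Python interpreter running this code
print(sys.prefix)       # root folder of the active environment

# Inside a venv-style virtual environment, sys.prefix differs from sys.base_prefix
in_virtual_env = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_virtual_env)
```

If the printed paths are not the ones you expect, your IDE is probably pointing at a different environment than the one where you installed your packages.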



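## Functions, methods, properties
Functions are called, and can take arguments.
Methods are functions that belong to an object; they are called with `object.method()`. Properties (attributes) are values stored on an object and are accessed without parentheses.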
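
The difference is easiest to see side by side. The sketch below uses only built-in types; the variable names are hypothetical.

```python
species = "barred owl"

# Function: called on its own, with the object passed in as an argument
n_characters = len(species)            # 10

# Method: a function attached to the object, called with a dot and ()
shouted = species.upper()              # 'BARRED OWL'

# Attribute (property): a value stored on the object, accessed without ()
ratio = 0.75
as_fraction = ratio.as_integer_ratio() # method, returns (3, 4)
real_part = ratio.real                 # attribute, returns 0.75

print(n_characters, shouted, as_fraction, real_part)
```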

# Syntax & Operations

**Assignment**
  - **R**: <-
```
> this_vector <- c(1, 2, 3)
```


  - **Python**: =
```
>>> this_list = [1, 2, 3]
>>> print(this_list, type(this_list))
[1, 2, 3] <class 'list'>
```

<br>

**Vectors**
Single-dimensional array of values
  - **R**: natural (built-in)
```
> cat(this_vector, class(this_vector))
1 2 3 numeric
```

  - **Python**: not built-in. Use the NumPy library
```
>>> import numpy as np
>>> numpy_vector = np.array(this_list)
>>> print(numpy_vector, type(numpy_vector))
[1 2 3] <class 'numpy.ndarray'>
```
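
One practical consequence, sketched below: arithmetic on a NumPy array is applied element by element (as with an R vector), whereas the same operator on a plain list does something different. The values reuse `this_list` from above.

```python
import numpy as np

this_list = [1, 2, 3]
numpy_vector = np.array(this_list)

print(this_list * 2)      # repeats the list: [1, 2, 3, 1, 2, 3]
print(numpy_vector * 2)   # multiplies each element: [2 4 6]

# With a plain list, element-wise maths needs an explicit loop or comprehension
doubled_list = [value * 2 for value in this_list]
print(doubled_list)       # [2, 4, 6]
```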
<br>

**Data Structure**
  - **R**: Rich structure, like built-in dataframes

  - **Python**: Workhorses are lists, tuples & dictionaries. No built-in dataframe structure.

<br>

**Semantics**
Python emphasises readability, clarity, and simplicity.
  - **R**: Curly {} brackets for code blocks

```{R}
array <- c(3, 8, 1, 9, 4)                 # example values, any numeric vector works
pivot <- 5                                # example threshold
less <- c()                               # initialise empty vectors
greater <- c()                            # initialise empty vectors

for (x in array) {                        # for loop, {}
    if (x < pivot) {
        less <- c(less, x)
    } else {
        greater <- c(greater, x)
    }
}

# if someone shared this code with you, could you read & understand what happens, without having to run it?
```

  - **Python**: Colons and indentation for code blocks

```{python}
array = [3, 8, 1, 9, 4]                   # example values, any list of numbers works
pivot = 5                                 # example threshold
less = []                                 # initialise empty list
greater = []                              # initialise empty list

for x in array:                           # for loop, : and indents
    if x < pivot:
        less.append(x)
    else:
        greater.append(x)

# How fast do you understand what is happening here?
```


<br>

**Libraries**
  - **R**
    - Install from CRAN to your local machine, and load to the script
```
install.packages("package.name")
library(package.name)
```
  - **Python**
    - Install with either pip or conda (depends on the package and your environment set-up). This is done outside the script, in a terminal/shell, and the package is installed into the environment of your choice. For example, when you run a .py file in the terminal, make sure the active environment is the one where the libraries were installed.
```
# In terminal:
$ pip install pandas
# In script:
>>> import pandas as pd
```

<br>

**Data Handling**
  - **R**: native dataframe handling
  ```
df <- read.csv("path/file.csv")
  ```

  - **Python**: through third-party libraries such as pandas
  ```
>>> import pandas as pd
>>> df = pd.read_csv("path/file.csv")
>>> type(df)
pandas.core.frame.DataFrame
  ```

<br>

**Plotting and Viz**
  - **R**: some built-in with plot(), plus libs like ggplot2
  - **Python**: no built-in plotting function; use libs like Matplotlib, Seaborn, etc

## Object-Oriented Programming: OOP

***Everything is an object***

*From the McKinney book:*

"Every number, string, data structure, function, class, module, and so on exists in the Python interpreter in its own “box,” which is referred to as a Python object. Each object has an associated type (e.g., integer, string, or function) and internal data. In practice this makes the language very flexible, as even functions can be treated like any other object."

<br>

One of the fundamental differences between the two languages is that OOP is central to Python. R has some OOP support too, so you should recognize some similarities.

**What is Object-Oriented Programming?**


The focus is to create objects which combine both data and the functions that operate on that data.

> Tip! Did you notice we always checked the _type_ of the objects we created? This tells us two important things:
> 1. How is the data stored
> 2. What functions can we apply to the object

In Python, everything is an object, from numbers and strings to functions and user-defined classes. Each object belongs to a specific type (aka class) and has attributes and methods.
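
A quick way to convince yourself is to ask a few very different things for their type. The sketch below also stores built-in functions in a list and calls them in a loop, which only works because functions are objects like any other; the names `summaries` and `counts` are hypothetical.

```python
print(type(3))        # <class 'int'>
print(type("owl"))    # <class 'str'>
print(type(len))      # <class 'builtin_function_or_method'>

# Functions are objects too, so they can be stored in a list and passed around
summaries = [min, max, sum]
counts = [4, 9, 2]
results = [summarise(counts) for summarise in summaries]
print(results)        # [2, 9, 15]
```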

**Attribute**
* A property of an object. The information is stored along with the data.

**Method**
* A function that acts on the object

Example: Every object in Python is an instance of a Class. Users can create custom objects, which have attributes and methods, and can interact with other objects and that object's methods.



In [None]:
class Animal:
    def __init__(self, name, species, age):
        self.name = name        # Name of the animal
        self.species = species  # Species/type of the animal
        self.age = age          # Age of the animal
        self.is_awake = False   # By default, the animal is not awake

    def speak(self):
        """
        Simulate the animal speaking. This will be more generic
        for the base Animal class and can be overridden in subclasses.
        """
        return f"I'm a {self.species}. My name is {self.name}."

    def celebrate_birthday(self):
        """
        Increase the age of the animal by 1 and return a birthday message.
        """
        self.age += 1
        return f"Happy Birthday, {self.name}! You are now {self.age} years old."

    def wake_up(self):
        """
        Wake up the animal.
        """
        self.is_awake = True
        return f"{self.name} is awake."

    def sleep(self):
        """
        Make the animal sleep.
        """
        self.is_awake = False
        return f"{self.name} is asleep."


We've created an Animal class. Let's create an object from this class, and learn more about it.

In [None]:
# Create an animal object
my_pet = Animal(name="Woolly", species="caterpillar", age=4)

# Accessing attributes. Note there are no () when calling an attribute
print(my_pet.name)
print(my_pet.species)
print(my_pet.age)
print(" ")
# Using methods. Note the (). This means the function can take arguments, if that option exists.
print(my_pet.speak())
print(my_pet.celebrate_birthday())
print(my_pet.wake_up())
print(my_pet.sleep())
print(my_pet.celebrate_birthday())
print(f"{my_pet.name}'s age: ", my_pet.age)

Woolly
caterpillar
4
 
I'm a caterpillar. My name is Woolly.
Happy Birthday, Woolly! You are now 5 years old.
Woolly is awake.
Woolly is asleep.
Happy Birthday, Woolly! You are now 6 years old.
Woolly's age:  6
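
The `speak` docstring mentions that subclasses can override the method. Here is a minimal sketch of what that could look like. `Owl` is a hypothetical subclass, and `Animal` is repeated in a trimmed-down form only so that the example runs on its own.

```python
class Animal:
    """Trimmed-down copy of the Animal class above, so this example is self-contained."""
    def __init__(self, name, species, age):
        self.name = name
        self.species = species
        self.age = age

    def speak(self):
        return f"I'm a {self.species}. My name is {self.name}."


class Owl(Animal):
    """Hypothetical subclass: inherits from Animal and overrides speak()."""
    def __init__(self, name, age):
        # Reuse the parent constructor, fixing the species
        super().__init__(name=name, species="barred owl", age=age)

    def speak(self):
        # Extend the generic message from the parent class
        return f"Hoot! {super().speak()}"


my_owl = Owl(name="Strix", age=2)
print(my_owl.speak())               # Hoot! I'm a barred owl. My name is Strix.
print(isinstance(my_owl, Animal))   # True: an Owl is also an Animal
```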


Objects of other classes can interact with an object of this class.

For example, let's write a class that monitors the time. If it is nighttime, the animal should be asleep.

In [None]:
class WhatTimeIsIt:
    def __init__(self, hour=0):
        self.hour = hour

    def update_time(self, hour):
        """
        Set the current time.
        """
        self.hour = hour

    def get_time(self):
        """
        Return the current time.
        """
        return self.hour

    def notify_animal(self, animal):
        """
        Notify the animal of the time change and have the animal act accordingly.
        """
        if 6 <= self.hour < 20:  # If it's day (6am to 8pm)
            return animal.wake_up()
        else:  # If it's night
            return animal.sleep()



In [None]:
# Creating a time object
current_time = WhatTimeIsIt(hour=2)  # Set to 2am
print(current_time.notify_animal(my_pet))


Woolly is asleep.


In [None]:
# Change the time to 5pm (hour 17). The object.update_time() method takes 1 argument
current_time.update_time(17)
print(current_time.notify_animal(my_pet))

Woolly is awake.


**Why Object-Oriented?**
We interacted with the Animal object to wake it up and put it to sleep. We could do that without having to change any of the code inside the Animal class _itself_!

This is **modularity**: each class encapsulates a single theme. If you've written long scripts where multiple objects interact, as in spatial analysis, evolving SDMs, or community dynamics, an OOP approach can make the code cleaner, compartmentalise the complexity, and make the work more reproducible.

## Lists, Dictionaries, Tuples

* List: single dimension, iterable
* You can iterate through a list

In [None]:
%load_ext rpy2.ipython

In [None]:
import random

# init an empty list
list_of_ints = []

# populate the list
for i in range(8):
  int_to_add = random.randrange(start= 0, stop=15)
  list_of_ints.append(i+int_to_add)
list_of_ints

[11, 12, 3, 14, 12, 5, 6, 14]

### Assigning and referencing variables

In Python, when assigning a name, we create a *reference* to the object on the righthand side.

In [None]:
site_a_count = [1, 2, 3]                                               # counting owls at site a
site_b_count = site_a_count                                            # actually, this is site b
site_b_count.append('maybe saw one more owl')                          # update owl spotting at site b

print(f"Conclusion for site a: {site_a_count}")
print(f"Conclusion for site b: {site_b_count}")

Conclusion for site a: [1, 2, 3, 'maybe saw one more owl']
Conclusion for site b: [1, 2, 3, 'maybe saw one more owl']


What do you think is happening here? Why did the list stored in variable site_a_count change as well?

*Tip: what is the object referenced by the variable name site_a_count?*

In R, assigning one variable to another behaves as if it creates a *copy*: R uses copy-on-modify, so changing `site_b_count` leaves `site_a_count` untouched.

In [None]:
%%R
site_a_count <- list(1, 2, 3)                                                     # counting owls at site a
site_b_count <- site_a_count                                                      # actually, this is site b
site_b_count <- c(site_b_count, 'maybe saw one more owl')                         # update owl spotting at site b

cat("Conclusion for site a:", toString(site_a_count), "\n")
cat("Conclusion for site b:", toString(site_b_count), "\n")


Conclusion for site a: 1, 2, 3 
Conclusion for site b: 1, 2, 3, maybe saw one more owl 


We saw another owl at site b, but this update is not reflected in the list assigned to site a.

<br>

*Note that in R, a vector generally holds a single data type.*
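
To get the R-like behaviour in Python, ask for a copy explicitly. The sketch below reuses the owl example with `list.copy()`. This makes a shallow copy, which is enough for a flat list of counts like this one.

```python
site_a_count = [1, 2, 3]                        # counting owls at site a
site_b_count = site_a_count.copy()              # a new list with the same contents
site_b_count.append('maybe saw one more owl')   # only site b is updated

print(site_a_count)                   # [1, 2, 3]
print(site_b_count)                   # [1, 2, 3, 'maybe saw one more owl']
print(site_a_count is site_b_count)   # False: two separate objects
```

The `is` operator checks whether two names refer to the same object, which is a handy way to diagnose surprises like the one above.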

### List comprehension: single line coding for nerds
You can write the above for loop in a single line:

In [None]:
list_of_ints = [i+random.randrange(start= 0, stop=15) for i in range(8)]
list_of_ints

[4, 4, 10, 8, 4, 10, 14, 11]

You can have lists of lists:
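
A minimal sketch of a nested list, using hypothetical owl counts with one inner list per site and one value per visit:

```python
# Hypothetical data: each inner list holds the counts from one site
counts_by_site = [[1, 2, 3], [0, 4], [5]]

first_site = counts_by_site[0]          # [1, 2, 3]
second_visit = counts_by_site[0][1]     # 2: first site, second visit
totals = [sum(site) for site in counts_by_site]

print(first_site, second_visit, totals)   # [1, 2, 3] 2 [6, 4, 5]
```

Note that the inner lists do not need to have the same length.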

### Dictionaries
Dictionaries are made of keys and values. Think of the keys as the way you look up entries in the dictionary. The values are what you get back.

```
>>> this_dict = {1: 'a', 2: 'b', 3: 'c'}
>>> print(this_dict, type(this_dict))
{1: 'a', 2: 'b', 3: 'c'} <class 'dict'>
```

In [None]:
this_dict = {1: 'a', 2: 'b', 3: 'c'}
print(this_dict)

{1: 'a', 2: 'b', 3: 'c'}


In [None]:
this_dict?
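
A short sketch of the most common dictionary operations, reusing `this_dict` from above: looking up a value by key, handling a missing key, adding an entry, and looping over key/value pairs.

```python
this_dict = {1: 'a', 2: 'b', 3: 'c'}

second_value = this_dict[2]               # look up by key
fallback = this_dict.get(4, 'missing')    # .get() returns a default instead of raising KeyError
this_dict[4] = 'd'                        # add a new key/value pair

for key, value in this_dict.items():      # iterate over pairs
    print(key, value)
```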


### If else statement

In [None]:
empty_list = []
type(empty_list)
isinstance(empty_list, list)

True
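
The check above can be used as the condition of an `if`/`elif`/`else` statement. This is a minimal sketch; the variable `status` is a hypothetical name.

```python
empty_list = []

if isinstance(empty_list, list) and len(empty_list) == 0:
    status = "an empty list"
elif isinstance(empty_list, list):
    status = "a list with values"
else:
    status = "not a list"

print(status)   # an empty list
```

As with `for` loops, each branch ends in a colon and its body is indented; there are no curly brackets.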

### Indexing
Access a value by its position

In [None]:
list_of_ints[0]  # first element of the list built earlier; Python counts positions from 0
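
A short sketch of positional indexing and slicing on a list with hypothetical values. The main difference from R is that Python starts counting at 0, and the end of a slice is not included.

```python
counts = [10, 20, 30, 40, 50]

first = counts[0]          # 10: positions start at 0, not 1
last = counts[-1]          # 50: negative indices count from the end
middle = counts[1:4]       # [20, 30, 40]: start included, stop excluded
every_other = counts[::2]  # [10, 30, 50]: step of 2

print(first, last, middle, every_other)
```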

In [None]:
# Define a list
my_list = [1, 2, 3, 4, 5]
print(type(my_list))

# 1. Append an element to the list
my_list.append(6)  # Now, my_list is [1, 2, 3, 4, 5, 6]

# 2. Extend the list with another list
my_list.extend([7, 8])  # Now, my_list is [1, 2, 3, 4, 5, 6, 7, 8]

# 3. Insert an element at a specific position
my_list.insert(2, 2.5)  # Now, my_list is [1, 2, 2.5, 3, 4, 5, 6, 7, 8]

# 4. Remove a specific element (first occurrence)
my_list.remove(2.5)  # Now, my_list is [1, 2, 3, 4, 5, 6, 7, 8]

# 5. Pop an element from the list by index (default is the last element)
popped_element = my_list.pop()  # popped_element is 8, and my_list is [1, 2, 3, 4, 5, 6, 7]

# 6. Find the index of an element (first occurrence)
index_of_4 = my_list.index(4)  # index_of_4 is 3

# 7. Count occurrences of an element
count_of_3 = my_list.count(3)  # count_of_3 is 1

# 8. Sort the list in place
my_list.sort(reverse=True)  # my_list is [7, 6, 5, 4, 3, 2, 1]

# 9. Reverse the list
my_list.reverse()  # my_list is [1, 2, 3, 4, 5, 6, 7]


<class 'list'>


In [None]:
this_list = [1, 2, 3]
import numpy as np
this_array = np.array(this_list)
print(this_list, type(this_list) )
print(this_array, type(this_array))
print(type(this_array))

[1, 2, 3] <class 'list'>
[1 2 3] <class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [None]:
%%R

x <- c(1, 5, 4, 9, 0)
print(x)
print(typeof(x))

cat(x, typeof(x))
length(x)
x <- c(1, 5.4, TRUE, "hello")
x
print(typeof(x))


this_vector <- c(1, 2, 3)
cat(this_vector, "    ", class(this_vector))

[1] 1 5 4 9 0
[1] "double"
1 5 4 9 0 double[1] "character"
1 2 3      numeric

# Data manipulation

First, connect your Colab work session to your Google Drive. You'll be able to load the CSV files this way.

*Alternative:* If you do not want to connect your drive, you can upload the data directly from your local machine using the upload button on the top left, under the folder section. The uploaded files are removed when you close this Colab session.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


R users: should I use the base data.frame? Tidyverse? data.table?

Python users: PANDAS!



Load the package

In [None]:
import pandas as pd

Read in some data with pandas

In [None]:
penguins = pd.read_csv("drive/MyDrive/Python for Ecologists/penguins.csv")

## Explore the penguins dataframe


In [None]:
type(penguins)

pandas.core.frame.DataFrame

df.head() by default looks at the top 5 rows. You can put any integer as an argument. Test it out.

In [None]:
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


df.tail() looks at the last x rows, where the default is x = 5.

In [None]:
penguins.tail(10)

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 334 | Chinstrap | Dream | 50.2 | 18.8 | 202.0 | 3800.0 | male | 2009 |
| 335 | Chinstrap | Dream | 45.6 | 19.4 | 194.0 | 3525.0 | female | 2009 |
| 336 | Chinstrap | Dream | 51.9 | 19.5 | 206.0 | 3950.0 | male | 2009 |
| 337 | Chinstrap | Dream | 46.8 | 16.5 | 189.0 | 3650.0 | female | 2009 |
| 338 | Chinstrap | Dream | 45.7 | 17.0 | 195.0 | 3650.0 | female | 2009 |
| 339 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
| 340 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
| 341 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
| 342 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |
| 343 | Chinstrap | Dream | 50.2 | 18.7 | 198.0 | 3775.0 | female | 2009 |


Check the names of the columns

What are the columns, and how are the values stored in each column?
Use the dtypes attribute (pandas.DataFrame object attribute)

In [None]:
penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
year                   int64
dtype: object

What more should we know about the object?

- Check the data type (ie, in what format is this data stored)

The object *penguins* is a pandas.core.frame.DataFrame object. This dataframe is an instance of a class. You can operate on the object by calling its methods or reading its attributes, for example `shape`:

In [None]:
penguins.shape

(344, 8)

In [None]:
type(penguins['species'])

pandas.core.series.Series

Filtering, viewing
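
A minimal sketch of filtering rows in pandas with a boolean condition. So that it runs without the CSV file, it builds a small dataframe called `penguins_sample` (a hypothetical name) from four of the rows shown in the `head()` output above, keeping only some of the columns.

```python
import pandas as pd

# Small stand-in for the penguins data, copied from rows 0, 1, 2 and 4 of head()
penguins_sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
    "sex": ["male", "female", "female", "female"],
})

# One condition: the comparison returns True/False per row, used to select rows
females = penguins_sample[penguins_sample["sex"] == "female"]

# Two conditions: wrap each in () and combine with & (and) or | (or)
heavy_females = penguins_sample[
    (penguins_sample["sex"] == "female") & (penguins_sample["body_mass_g"] > 3400)
]

# Select a subset of columns by passing a list of names
print(heavy_females[["bill_length_mm", "body_mass_g"]])
```

This plays roughly the same role as `filter()` followed by `select()` in the tidyverse.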

In [None]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


### Common operations in R (tidyverse example)
To run a cell in R, use the rpy2.ipython extension and the %R magic:


*   %%R to run the cell in R
*   %R to run the line in R

Learn more about R <-> Python magic here: https://rpy2.github.io/doc/latest/html/interactive.html


In [None]:
%%R
library(tidyverse)


Copy the penguins pandas dataframe to the R environment.

In [None]:
%R -i penguins


In [None]:
%%R
class(penguins)

[1] "data.frame"


In [None]:
%%R
filter(penguins, )

# Visualisation

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


# Stats

# GIS

In [None]:
import fiona
import geopandas as gpd

In [None]:
!pip install arcgis

Collecting arcgis
  Downloading arcgis-2.2.0.1.tar.gz (47.9 MB)
Collecting pandas<3,>=2.0.0 (from arcgis)
  Using cached pandas-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting pylerc (from arcgis)
  Using cached pylerc-4.0-py3-none-any.whl
Collecting ujson>=3 (from arcgis)
  Using cached ujson-5.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53 kB)
Collecting jupyterlab (from arcgis)
  Using cached jupyterlab-4.0.8-py3-none-any.whl (9.2 MB)
Collecting geomet (from arcgis)
  Using cached geomet-1.0.0-py3-none-any.whl (28 kB)
Collecting requests_toolbelt (from arcgis)
  Using cached requests_toolbelt-1.0.0-py2.py3-none-any.whl (54 kB)
Collecting pyspnego>=0.8.0 (from arcgis)
  Using cached pyspnego-0.10.2-py3-none-any.whl (129 kB)
Collecting requests-kerber

In [None]:
from arcgis.gis import GIS
# Create a GIS object, as an anonymous user for this example
gis = GIS()

In [None]:
from google.colab import output
output.enable_custom_widget_manager()

Support for third party widgets will remain active for the duration of the session. To disable support:

In [None]:
# Create a map widget
map1 = gis.map('Paris') # Passing a place name to the constructor
                        # will initialize the extent of the map.
map1

MapView(layout=Layout(height='400px', width='100%'))

In [None]:
map1.zoom


-1.0

In [None]:
map2 = gis.map() # creating a map object with default parameters
map2

MapView(layout=Layout(height='400px', width='100%'))


Support for third party widgets will remain active for the duration of the session. To disable support:

In [None]:
from google.colab import output
output.disable_custom_widget_manager()

# API integrations