<a href="https://colab.research.google.com/github/MScEcologyAndDataScienceUCL/BIOS0032_AI4Environment/blob/main/1_Work_environment/2_Intro_to_libraries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Our software toolset**

Python is a general-purpose programming language. This means you can, in principle, make Python do anything a computer can do.

It is also a lenient and intuitive language - compared to others - and useful for fast prototyping and exploration.

On its own, Python provides a limited set of tools for building code. These are the **built-in functions**, such as `print`, and the **standard library**. With these basic tools you can, in principle, build anything.

However, Python has been a popular language for decades, and its wide community of users has developed amazing tools - built upon the basic ones - to help write code for many tasks. These are third-party **libraries**, and you will find a huge variety of them: tools to do ML, create web apps, make games, make beautiful plots, do statistics, etc.

Using these third-party libraries is key if you wish to do complex things in Python **fast**. It is worthwhile to learn the tools in your programming niche. The popular libraries are generally well-written, safe, efficient and tested, so writing your own solution is not advised unless none exists, you have very good reasons to make a new one, or you are doing it for educational purposes.

This notebook is a brief introduction to some of the libraries we will be using across these Labs. We cannot go deeply into any one of them, but we encourage you to consult their documentation, tutorials or examples whenever you need more information.

## The Standard Library

The standard library is a set of tools that comes included in any Python installation. These tools are what Python developers consider essential, such as

* Exploring the filesystem
* Reading and writing files in basic formats (such as CSV, JSON)
* Parallel computation

And many others.
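
As a small illustration of the list above, the sketch below reads and writes JSON and CSV using only the standard library modules `json`, `csv` and `io`. The records are hypothetical and exist only for this example, and an in-memory buffer is used instead of a real file.

```python
import csv
import io
import json

# Hypothetical records used only for illustration
records = [
    {"site": "A", "count": 3},
    {"site": "B", "count": 7},
]

# JSON: serialise to a string and parse it back
json_text = json.dumps(records)
records_from_json = json.loads(json_text)

# CSV: write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["site", "count"])
writer.writeheader()
writer.writerows(records)

# Read the CSV text back; note that csv returns every value as a string
buffer.seek(0)
records_from_csv = list(csv.DictReader(buffer))

print(records_from_json)
print(records_from_csv)
```

Note that JSON keeps the numbers as numbers, while the `csv` module hands back text that you would need to convert yourself.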

### `pathlib` and `os`

Handling file paths is tricky because path syntax depends on the operating system. A simple example: Windows uses the backslash `\` as a separator, while Linux and macOS use the forward slash `/`.

If you are not careful, a Python program that runs on Windows might break on macOS.

To avoid this, use the `pathlib` module (or the older `os` module).
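
To make the separator difference concrete, here is a minimal sketch using the "pure" path classes from `pathlib`, which let you build paths in either flavour on any machine. The folder and file names are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same logical location, built from the same components
parts = ("data", "raw", "table.csv")  # hypothetical folder and file names

posix_path = PurePosixPath(*parts)
windows_path = PureWindowsPath(*parts)

print(posix_path)  # components joined with forward slashes
print(windows_path)  # components joined with backslashes

# The `/` operator joins components using the right separator for each flavour
joined = PurePosixPath("data") / "raw" / "table.csv"
print(joined == posix_path)
```

In everyday code you would normally use `Path`, which picks the flavour matching the machine the program runs on.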

In [None]:
# Import Path object from pathlib. pathlib is in the standard library and will always be available
from pathlib import Path

In [None]:
# Create a path instance pointing to this notebook
path = Path("2_Intro_to_libraries.ipynb")

In [None]:
# Print the full path
print(f"Full path: {path.resolve()}")

In [None]:
# Test if the file exists
print(f"{path.exists() = }, {path.is_file() = }, {path.is_dir() = }")

In [None]:
# Get the containing directory or folder
parent_directory = path.parent
print(f"Parent directory: {parent_directory.resolve()}")

In [None]:
# Use path objects to navigate
other_notebook = (
    parent_directory / "1_Intro_to_colab.ipynb"
)  # Get file in parent directory
parent_parent_dir = parent_directory / ".."  # Go up one folder

In [None]:
temp_file = parent_directory / "notes.txt"
print(f"{temp_file.exists() = }")

# Create empty file
temp_file.touch()
print(f"{temp_file.exists() = }")

# Write to file
temp_file.write_text("Hello world!")

# Read file
with temp_file.open() as fp:
    print(fp.read())

In [None]:
# Break path into its components
full_path = path.absolute()
print(f"{full_path = }")
print(f"{full_path.name = }")
print(f"{full_path.stem = }")
print(f"{full_path.suffix = }")
print(f"{full_path.root = }")
print(f"{full_path.parent = }")
print(f"{full_path.parts = }")

In [None]:
# Use the os module for communication with the OS
import os

# List all files in the parent directory
print(os.listdir(parent_directory))

In [None]:
# Check if file exists
if temp_file.exists():
    # And remove if it does
    os.remove(temp_file)

## Numpy

Python is considered very flexible and easy to write. This comes at a cost as, in general, Python is not the fastest language out there.

However, you **can** get fast computation if you use the right tools. In particular `numpy` is the **numerical computation** library for Python. 

All Python packages for scientific computing are built upon - or are compatible with - `numpy`.
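
The sketch below contrasts a plain Python loop with the equivalent single `numpy` expression. It only checks that both give the same result; the speed difference is not measured here and depends on your machine. Arrays themselves are introduced in the next section.

```python
import numpy as np

values = np.arange(1_000, dtype=float)

# Pure Python: visit every element one at a time
loop_total = 0.0
for value in values:
    loop_total += value * value

# Numpy: one vectorised expression, executed in compiled code
vectorised_total = np.sum(values * values)

print(np.isclose(loop_total, vectorised_total))
```

On large arrays the vectorised form is usually much faster, which is why it is the preferred style in scientific Python code.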

### Numpy arrays

In [None]:
# Import numpy. It is preinstalled in a Colab environment, but is not part of the standard library.
import numpy as np  # Usually numpy is abbreviated as np

In [None]:
# Numpy stores data in "arrays".
array = np.array([1, 2, 3, 4])

# Array values can be accessed through indices
print(array[0])

In [None]:
# They can have multiple "dimensions" or "axes"
matrix = np.array([[0, 1], [-1, 1]])
print(f"{matrix.ndim = }, {matrix.shape = }")

In [None]:
# Access array values using multiple indices
print(matrix[0, 1])

![numpy arrays](https://predictivehacks.com/wp-content/uploads/2020/08/numpy_arrays-1024x572.png)

A graphical depiction of arrays of dimensions 1, 2 and 3
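
Besides single values, you can select whole rows, columns or blocks with slices. The following is a minimal sketch on small made-up arrays.

```python
import numpy as np

# A 2D array with 3 rows and 4 columns
grid = np.arange(12).reshape(3, 4)

first_row = grid[0]  # all columns of row 0
last_column = grid[:, -1]  # every row, last column
top_left = grid[:2, :2]  # first two rows and first two columns

# A 3D array: 2 blocks, each with 3 rows and 4 columns
cube = np.zeros((2, 3, 4))

print(first_row, last_column, top_left.shape, cube.ndim)
```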

### Creating arrays

In [None]:
# Create array by providing its values
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

array

In [None]:
# Create array of zeros
zeros = np.zeros([3, 3])

zeros

In [None]:
# Create an array containing a range of values
range_array = np.linspace(0, 50, 11)  # Array of values from 0 to 50 of length 11

range_array

In [None]:
# Numpy provides many random number generators
# np.random.random provides random numbers in the interval [0, 1), i.e. 0 is possible but 1 is not.
random_array = np.random.random(size=(10, 10, 10))
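
If you need results that can be reproduced, one option is a seeded generator created with `np.random.default_rng`. This is a sketch of that approach, not something the rest of the notebook relies on; the seed value 42 is an arbitrary choice.

```python
import numpy as np

# A seeded generator always produces the same sequence
rng = np.random.default_rng(seed=42)  # 42 is an arbitrary choice
first_draw = rng.random(size=3)

# Re-creating the generator with the same seed repeats the draw
rng_again = np.random.default_rng(seed=42)
second_draw = rng_again.random(size=3)

print(np.array_equal(first_draw, second_draw))

# Other distributions are available as generator methods
normal_sample = rng.normal(loc=0.0, scale=1.0, size=(2, 2))
integer_sample = rng.integers(low=0, high=10, size=5)  # high is exclusive
```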

### Computing with arrays

In [None]:
# Compute the minimum, maximum, mean and total sum of array values
minimum = random_array.min()
maximum = random_array.max()
mean = random_array.mean()
total_sum = random_array.sum()

print(f"{minimum = }, {maximum = }, {mean = }, {total_sum = }")

In [None]:
# Numpy provides many common mathematical functions
x = 10

square_root = np.sqrt(x)
sine = np.sin(x)
exponential = np.exp(x)
logarithm = np.log(x)

# Apply function to each array value
np.cos(array)

In [None]:
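# Operate on arrays (element-wise unless noted otherwise)
y = array + np.sqrt(array)  # Element-wise addition
z = 10 * array  # Multiplication by a scalar
w = z @ y  # Matrix multiplication
u = w >= 1000  # Element-wise comparison, gives a boolean array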
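
A common source of confusion is the difference between `*` and `@`. The sketch below uses two small made-up matrices to show it, together with broadcasting and boolean selection.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

elementwise = a * b  # multiplies matching entries
matrix_product = a @ b  # rows of a times columns of b

# Broadcasting: the 1D row is added to every row of `a`
row = np.array([10, 20])
shifted = a + row

# Comparisons return boolean arrays, which can be used to select values
mask = a > 2
selected = a[mask]

print(elementwise, matrix_product, shifted, selected, sep="\n")
```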

## Pandas

Pandas is **the** Python library for handling **tabular data**.

If you are familiar with R, you will be used to handling `data.frame`s. Pandas provides a `DataFrame` object that is very similar to the R `data.frame`.

### Loading or creating dataframes

In [None]:
import pandas as pd  # Pandas is usually abbreviated as pd

In [None]:
numerical_column = [1, 2, 3]
categorical_column = ["a", "b", "a"]
boolean_column = [True, True, False]

# Build dataframe by specifying the column data and column names
dataframe = pd.DataFrame(
    {
        "numerical": numerical_column,
        "categorical": categorical_column,
        "boolean": boolean_column,
    },
)

dataframe

In [None]:
# Create a csv file
csv_file = Path("table.csv")

# Write a few columns
csv_file.write_text(
    """
column_a, column_b, column_c
1.3, open, 10-01-2023
4.2, closed, 09-01-2023
-9.3, semiopen, 30-12-2022
"""
)

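# Read csv with pandas. The file has a space after each comma, so
# skipinitialspace=True keeps those spaces out of column names and values.
dataframe_from_csv = pd.read_csv(csv_file, skipinitialspace=True)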

dataframe_from_csv

In [None]:
# Save a dataframe to disk
dataframe.to_csv("table2.csv")

# Print the contents of the new file
print(Path("table2.csv").read_text())

Pandas dataframes can be read and saved in many formats:
* csv
* json
* excel
* html
* ...
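
As an example of a format other than CSV, here is a minimal sketch of a JSON round trip with `to_json` and `read_json`. The table is hypothetical, and the text is kept in memory rather than written to disk.

```python
import io

import pandas as pd

table = pd.DataFrame({"site": ["A", "B"], "count": [3, 7]})  # hypothetical data

# Serialise to a JSON string, one object per row
json_text = table.to_json(orient="records")
print(json_text)

# Read it back; wrapping the string in StringIO makes it file-like
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```

Passing a file path to `to_json` or `read_json` works the same way when you want the data on disk.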

### Basic dataframe statistics

In [None]:
# Load a dataframe from a file on the internet
iris = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
)

In [None]:
# Show the first rows of the dataframe
iris.head()

In [None]:
# Describe the columns
iris.info()

In [None]:
# Get quick statistics on the column values
iris.describe()

# Note that only numerical columns get summarized

In [None]:
# Extract the "species" column
species = iris["species"]

# Count the number of entries per species
species.value_counts()

In [None]:
# Get the minimum value of a column
min_sepal_length = iris["sepal_length"].min()

# Get the mean value of another column
mean_petal_width = iris["petal_width"].mean()

print(
    f"Minimum sepal length = {min_sepal_length}; mean petal width = {mean_petal_width}"
)

In [None]:
# Note: the values of a dataframe can be retrieved as a 2D numpy array, and those of a column as a 1D numpy array.
underlying_array = iris.values

print(f"type = {type(underlying_array)}, shape = {underlying_array.shape} (rows, cols)")

species_column_array = iris["species"].values

print(
    f"type = {type(species_column_array)}, shape = {species_column_array.shape} (rows,)"
)
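
The type of the resulting array depends on which columns you include. The sketch below uses `to_numpy()`, the method the pandas documentation recommends over `.values`, on a small hypothetical table.

```python
import pandas as pd

# Hypothetical measurements, not taken from the iris dataset
flowers = pd.DataFrame(
    {
        "length": [1.4, 4.7, 6.0],
        "width": [0.2, 1.4, 2.5],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

# Numerical columns only: the array has a numerical dtype
numeric_array = flowers[["length", "width"]].to_numpy()
print(numeric_array.dtype, numeric_array.shape)

# Mixed columns: numpy falls back to the generic "object" dtype
mixed_array = flowers.to_numpy()
print(mixed_array.dtype, mixed_array.shape)
```

Arrays of dtype `object` lose most of the speed benefits of numpy, so it is usually better to select the numerical columns first.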

### Accessing & editing dataframe values

Pandas dataframes can be thought of as a collection of rows or columns. Each row and column has an identifier or index. The indices can be used to reference particular values in the dataframe.

Usually the column name is its index, and rows are indexed by the row number.

![pandas dataframe](https://pynative.com/wp-content/uploads/2021/02/dataframe.png)

(taken from [PyNative blog entry on pandas dataframes](https://pynative.com/python-pandas-dataframe/))
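
Row labels do not have to be numbers. The following sketch, on a hypothetical table indexed by site name, shows the difference between label-based access with `.loc` and position-based access with `.iloc`.

```python
import pandas as pd

# Hypothetical table whose rows are labelled by site name
sites = pd.DataFrame(
    {"count": [3, 7, 5], "habitat": ["forest", "wetland", "forest"]},
    index=["site_a", "site_b", "site_c"],
)

by_label = sites.loc["site_b", "count"]  # row label, column label
by_position = sites.iloc[1, 0]  # second row, first column

print(by_label == by_position)

# Slices differ: .loc includes the end label, .iloc excludes the end position
print(len(sites.loc["site_a":"site_b"]), len(sites.iloc[0:1]))
```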

In [None]:
# Get a specific entry in the dataframe

# Get value by indices
value = iris.loc[10, "species"]  # value @ row with index 10, species column

# Get value by numerical position
value2 = iris.iloc[10, 4]  # value @ 11th row, 5th column (Python is 0-indexed)

# A combination of the previous approaches
value3 = iris["species"][10]  # value with row index 10 in the species column

# All are equivalent ways of retrieving the same value
print(value == value2 == value3)

In [None]:
# Edit a single entry
iris.loc[0, "sepal_length"] = 10

# Create a new column
iris["petal_area"] = iris["petal_width"] * iris["petal_length"]

### Filtering values

In [None]:
# Check which rows have "setosa" in the species column
is_setosa = iris["species"] == "setosa"

# Select the rows for which the condition "is_setosa" is True.
filtered = iris[is_setosa]

filtered.head()

In [None]:
# Numerical comparisons
has_short_petal = iris["petal_length"] < 1.2

# Combine conditions
filtered = iris[is_setosa & has_short_petal]

filtered
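
Conditions can also be combined with OR (`|`), negated with NOT (`~`), or tested against a list of values with `isin`. Here is a minimal sketch on a small hypothetical table.

```python
import pandas as pd

flowers = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "versicolor", "virginica"],
        "petal_length": [1.1, 1.5, 4.5, 6.0],
    }
)  # hypothetical values for illustration

is_setosa = flowers["species"] == "setosa"
is_long = flowers["petal_length"] > 5.0

either = flowers[is_setosa | is_long]  # OR
not_setosa = flowers[~is_setosa]  # NOT
in_list = flowers[flowers["species"].isin(["setosa", "virginica"])]

# Wrap each comparison in parentheses when writing conditions inline
inline = flowers[(flowers["species"] == "setosa") & (flowers["petal_length"] < 1.2)]

print(len(either), len(not_setosa), len(in_list), len(inline))
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.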

### Compute on every row

In [None]:
# Iterate over every row
total_area = 0
for index, row in iris.iterrows():
    flower_area = (
        row["sepal_length"] * row["sepal_width"]
        + row["petal_length"] * row["petal_width"]
    )
    total_area += flower_area


# Apply a function to each row
iris_areas = iris.apply(
    lambda row: row["sepal_length"] * row["sepal_width"]
    + row["petal_length"] * row["petal_width"],
    axis=1,
)

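# The two totals are added up in a different order, so they may differ in the
# last decimal places. Compare floating-point numbers with a tolerance.
print(np.isclose(total_area, iris_areas.sum()))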

In [None]:
total_area

In [None]:
iris_areas.sum()
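
When the computation is simple arithmetic, you can often skip the row loop entirely and operate on whole columns. This sketch repeats the area calculation on a small hypothetical table and compares it with the row-by-row version.

```python
import numpy as np
import pandas as pd

flowers = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0],
        "sepal_width": [3.5, 3.2],
        "petal_length": [1.4, 4.7],
        "petal_width": [0.2, 1.4],
    }
)  # hypothetical values

# Whole-column arithmetic: no explicit loop over rows
areas = (
    flowers["sepal_length"] * flowers["sepal_width"]
    + flowers["petal_length"] * flowers["petal_width"]
)

# Row-by-row version, for comparison
row_total = sum(
    row["sepal_length"] * row["sepal_width"]
    + row["petal_length"] * row["petal_width"]
    for _, row in flowers.iterrows()
)

# Floating-point sums may differ in the last digits, so compare with a tolerance
print(np.isclose(areas.sum(), row_total))
```

Column arithmetic is usually both shorter and faster than `iterrows` or `apply` with `axis=1`.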

### Grouping by values

In [None]:
# Group dataframe rows by species and compute the standard deviation of column values within each group.
iris.groupby("species").std()
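
You can also request several statistics at once with `agg`, or count the rows in each group with `size`. The values below are hypothetical.

```python
import pandas as pd

flowers = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica", "virginica"],
        "petal_length": [1.0, 2.0, 5.0, 7.0],
    }
)  # hypothetical values

# Several statistics for one column, one row per group
summary = flowers.groupby("species")["petal_length"].agg(["mean", "min", "max"])
print(summary)

# Number of rows in each group
group_sizes = flowers.groupby("species").size()
print(group_sizes)
```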

## Matplotlib & Seaborn

### Creating a simple scatter and line plot

### Modifying titles and axes labels

### Creating compound figure

### Adding text

### Seaborn

## Other libraries

* Scikit-Learn
* Pillow and Scikit-Image
* Stats, PyStan and PyMC3
* Geopandas

## Further resources