# SciPy Stack introduction

This is a condensed introduction to the SciPy Stack, originally created for 2021 KIPAC summer research students. It borrows material from several other tutorials in the [KIPAC computing boot camp](http://kipac.github.io/BootCamp), but is designed to be covered in a single 90-minute bootcamp session.

Authors: Sidney Mau and Jessie Muir, adapting content from notebooks by [Yao-Yuan Mao](http://yymao.github.io), [Sean McLaughlin](https://github.com/mclaughlin6464), [Joe DeRose](https://github.com/j-dr), and [Mike Baumer](https://mbaumer.github.io).

### What is the SciPy Stack?

The [SciPy Stack](https://scipy.org/index.html) is a set of Python software designed for scientific computing.
Most scientific computing in Python relies heavily on SciPy Stack software, including:
- NumPy: numbers, vectors, arrays; implemented in C and Fortran so typically much faster than "normal" Python
- pandas: data structures and tables (more accessible than "raw" NumPy arrays)
- Matplotlib: plotting and visualization
- SciPy: general functions for optimization, fitting, integration, signal processing, etc.

Because these libraries are so widely used, it's important (1) to have some basic familiarity with them, and (2) to know how and where to look for more information on them.

In this notebook, we'll try to distill some of the basics of each of these packages and show you how to *teach yourself* more.
Each section will start with two links: (1) a more in-depth introductory tutorial for the software, and (2) the API reference for the software.
We encourage you to look into those resources after you feel comfortable with the basics covered in this notebook.

Definition: API
> An Application Programming Interface (API) is the primary interface between the user (i.e., you) and the code.
> For our purposes, the API of a Python package is how you will call functions, initialize classes, and more generally *use* that package.
> Understanding how to navigate the documentation of a package's API is a critical skill for learning how to use new software.

## NumPy

[NumPy quickstart](https://numpy.org/devdocs/user/quickstart.html)

[NumPy Reference](https://numpy.org/devdocs/reference/index.html)

In [None]:
# First, we need to import numpy, which is typically imported as "np" to reduce verbosity
import numpy as np

### Arrays

One of the fundamental objects in NumPy is the `ndarray`, or N-dimensional array ([documentation](https://numpy.org/devdocs/reference/arrays.ndarray.html)).

These objects cover most of the mathematical constructs we use in physics:
- scalar: 0-dimensional array
- vector: 1-dimensional array
- matrix: 2-dimensional array
- and so on...

Most coding in NumPy involves manipulating `ndarray`s: whether you're storing data, finding a mean, or doing linear algebra, you are working with `ndarray`s.
We'll start by making a few.

In [None]:
# First, a scalar array which holds the value 2.
sca = np.array(2)

# Second, a vector array, which holds the vector [1,0,0] (i.e., unit vector in the x-direction).
vec = np.array([1,0,0])

# Third, a matrix array, which holds the matrix [[1,0,0],[0,1,0],[0,0,1]] (i.e., the 3x3 identity matrix)
mat = np.array([[1,0,0],[0,1,0],[0,0,1]])

# Let's print them all out. The "\n" is a "newline" character so that the spacing is nice
print(f"scalar:\n{sca}\n")
print(f"vector:\n{vec}\n")
print(f"matrix:\n{mat}")

In [None]:
# We can "ask" each of these objects about themselves:
print(f"scalar:\n    ndim:{sca.ndim}\n    shape:{sca.shape}\n")
print(f"vector:\n    ndim:{vec.ndim}\n    shape:{vec.shape}\n")
print(f"matrix:\n    ndim:{mat.ndim}\n    shape:{mat.shape}\n")
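Typing out every element gets tedious for larger arrays. As a minimal sketch, NumPy's constructor functions `np.zeros`, `np.arange`, and `np.eye` build some common arrays directly (the variable names here are only for illustration):

```python
import numpy as np

zeros = np.zeros(3)      # 1-dimensional array of three zeros (floats by default)
counts = np.arange(5)    # the integers 0, 1, 2, 3, 4
identity = np.eye(3)     # 3x3 identity matrix (floats)

# Every array knows its shape and the type of its elements
print(zeros.shape, counts.shape, identity.shape)
print(counts.dtype, identity.dtype)
```

Note that `np.eye(3)` holds floats, while the hand-written `mat` above holds integers; the `dtype` attribute tells you which you have.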

Tip: if you want to know what you can do with an object, there are a few options:
- Check the API reference: https://numpy.org/devdocs/reference/arrays.ndarray.html
- Use the `help()` function to access the documentation within Python
- Use the `dir()` function to see what else you can "ask" the object about itself

In [None]:
# You can use help on `np.array` or on any array it creates (e.g., our `sca`, `vec`, and `mat`, which are instances of `np.ndarray`)
#help(np.array)
help(mat)

In [None]:
# You'll see that `ndim` and `shape` are both listed in here.
# Generally, names surrounded by double underscores (__) are for
# "internal" processing and you usually won't need or want to access these.
# Don't worry if a lot of these terms are arcane right now; you will learn
# them in time, or you may never even need them!
dir(mat)
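The output of `dir()` is long. As a small sketch, a list comprehension can hide the names that start with an underscore so that only the everyday attributes and methods remain (`public_names` is just an illustrative variable name):

```python
import numpy as np

mat = np.eye(3)

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(mat) if not name.startswith("_")]

print(len(public_names))
print(public_names[:10])
```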

Hopefully this gives you an idea of why NumPy is *so* useful for physics and scientific computing.
Pretty much everything we do in physics involves scalars, vectors, matrices, and tensors, all of which NumPy handles well and *fast*.

This also highlights an ideological difference between physics and programming:
- in physics, we are often taught that vectors are objects with a magnitude and direction (i.e., an arrow in space)
- in programming, vectors are lists of numbers (you might note that this is a vector with some choice of basis)

As long as you keep your bases straight (so let's imagine we're always working with Cartesian x, y, z coordinates), NumPy has a number of functions that make linear algebra really easy for us.

In [None]:
# Let's multiply `sca` and `vec`—we're scaling a vector by a constant
print(sca*vec)

In [None]:
# What about multiplying `mat` and `vec` (a matrix times a column vector)?
# Note that `*` multiplies element by element; the matrix product uses `@`
# This should give us a vector
print(mat @ vec)

In [None]:
# And if we want to scale this by a scalar...
print(sca * (mat @ vec))
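A common stumbling block is that NumPy's `*` operator always works element by element (with broadcasting), while `@` performs matrix multiplication. With an identity matrix and a unit vector the two are easy to confuse, so here is a sketch with made-up values that makes the difference visible:

```python
import numpy as np

vec = np.array([1, 2, 3])
mat = np.array([[1, 0, 0],
                [0, 2, 0],
                [0, 0, 3]])

# Elementwise: each row of `mat` is multiplied entry by entry with `vec`
elementwise = mat * vec
print(elementwise.shape)   # a 2-dimensional array, same shape as `mat`
print(elementwise)

# Matrix-vector product: each row of `mat` is dotted with `vec`
product = mat @ vec
print(product.shape)       # a 1-dimensional array
print(product)
```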

### Selections

One of the most useful things you can do in NumPy is to make selections of your data.
There are two primary ideas here:
1. Selecting by position in an array (e.g., the first 5 elements, or the last 3 elements, or every fourth element, etc.).
2. Selecting according to some "filter" (e.g., every element which meets some condition).

In [None]:
# First, we'll make an array of 11 evenly spaced numbers from 0 to 10:
xs = np.linspace(0, 10, 11)
print(xs)

In [None]:
# We can grab specific elements or slices:
print(f"Zeroth element of `xs`: {xs[0]}")
print(f"Last two elements of `xs`: {xs[-2:]}")
print(f"Every third element of `xs`: {xs[::3]}")

In [None]:
# We can find every element exactly equal to 1...
print(xs == 1)
# and we can use this boolean array (often called a "mask") to pick out elements:
print(xs[xs==1])

In [None]:
# Maybe we want every element which is even:
print(xs % 2 == 0) # % is the modulus operator
print(xs[xs%2==0])

While this may feel a bit obvious, it's this exact functionality which will allow us to extract particular features from more complex higher-dimensional data later on!
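As a sketch of how this carries over to two dimensions, the example below uses a small made-up array in which each row is a point and each column is a coordinate. A mask built from one column selects whole rows:

```python
import numpy as np

# Made-up data: each row is one point with columns (x, y)
points = np.array([[0.0, 1.0],
                   [2.0, -1.0],
                   [3.0, 4.0],
                   [-1.0, 0.5]])

# Boolean mask with one entry per row: is the y value positive?
mask = points[:, 1] > 0

# Indexing with the mask keeps only the rows where it is True
selected = points[mask]
print(selected)

# Conditions can be combined with & (and) and | (or); the parentheses are required
both_positive = points[(points[:, 0] > 0) & (points[:, 1] > 0)]
print(both_positive)
```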

### Mathematical Functions

Finally, NumPy has *lots* of [mathematical functions](https://numpy.org/devdocs/reference/routines.math.html).
If you need a mathematical function that's not in NumPy, it's probably either (1) in SciPy, or (2) relatively straightforward to implement *using NumPy functions*.
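As an illustration of the second case, here is a minimal sketch of a normalized Gaussian written with NumPy functions. The helper name `gaussian` and its parameters are our own choices for this example, not part of NumPy:

```python
import numpy as np

def gaussian(x, mu=0.0, sigma=1.0):
    """Normalized Gaussian built from NumPy functions (illustrative helper)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-5, 5, 1001)
ys = gaussian(xs)   # NumPy functions act on every element of the array at once

# Check the normalization with a simple Riemann sum; this should be close to 1
dx = xs[1] - xs[0]
area = np.sum(ys) * dx
print(area)
```

Because `np.exp` and `np.sqrt` operate on whole arrays, the function works unchanged for a single number or for a million of them.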

## pandas

[10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)

[API reference](https://pandas.pydata.org/docs/reference/index.html)

In [None]:
# First, we need to import pandas, which is typically imported as "pd" to reduce verbosity
import pandas as pd

NumPy is very fast and powerful, but it's not always the *easiest* tool, especially if you have data that is organized in something like a `.csv` file or spreadsheet.
In this case, pandas is the standard in Python.

Now, we're going to load some data into this notebook using pandas.
We're going to grab the [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which the seaborn plotting library hosts on GitHub.

In [None]:
# We load in a `.csv` file through pandas
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

In [None]:
# iris is a pandas DataFrame
# Notice that DataFrames have a nice visual representation in notebooks
iris

In [None]:
# If you just want to know the column names, you can get these with `.columns`:
iris.columns

In [None]:
# We can access any of these from the dataframe in a few ways:
print(f"You can use brackets:\n{iris['sepal_length']}\n")
print(f"You can also use attribute syntax:\n{iris.sepal_length}")

In [None]:
# You can also get a NumPy array out of your pandas DataFrame:
iris.to_numpy()

In [None]:
# You can sort your data
iris.sort_values(by='sepal_length')

In [None]:
# You can grab a subset of your data
iris[iris['species'] == 'setosa']

In [None]:
# Or you can use a filter!
iris[iris['sepal_length'] < 5]

In [None]:
# The mean and median are easy to get
print(f"Sepal length mean: {iris['sepal_length'].mean()}")
print(f"Sepal length median: {iris['sepal_length'].median()}")

In [None]:
# You can also pass the data from pandas into NumPy:
print(f"Sepal length mean: {np.mean(iris['sepal_length'])}")
print(f"Sepal length median: {np.median(iris['sepal_length'])}")
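Often you want such summaries separately for each group in your data. A minimal sketch using `groupby`, on a small made-up table so that it runs without downloading anything (the column names and values are invented for illustration):

```python
import pandas as pd

# Made-up measurements, only for illustration
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "length": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Mean length for each species
means = toy.groupby("species")["length"].mean()
print(means)

# describe() summarizes a numeric column (count, mean, std, quartiles, ...)
print(toy["length"].describe())
```

The same pattern should apply to the iris table, for example grouping by its `species` column.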

## Matplotlib

[Pyplot tutorial](https://matplotlib.org/stable/tutorials/introductory/pyplot.html)

[API Overview](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html)

In [None]:
# First, we need to import Matplotlib; usually only the pyplot interface is imported as "plt"
import matplotlib.pyplot as plt
# We use the following "magic" command to allow plots to display directly in this notebook
%matplotlib inline

Matplotlib is the standard plotting interface in Python.
Its defaults are not always the prettiest, but it is highly configurable.
A relatively new plotting library is [seaborn](https://seaborn.pydata.org/), which builds on top of Matplotlib to provide a higher-level interface that produces more visually pleasing results.

In [None]:
# We'll import seaborn as well; you usually see it imported as "sns"
import seaborn as sns

In [None]:
# Let's first plot some histograms (plots that show how often different values occur for some data)
plt.hist(iris['sepal_length'])
plt.xlabel('sepal_length')
plt.ylabel('N')

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(18,6))

axs[0].hist(iris['sepal_length'])
axs[0].set_xlabel('sepal_length')

axs[1].hist(iris['sepal_width'])
axs[1].set_xlabel('sepal_width')
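The `fig, axs = plt.subplots(...)` pattern is Matplotlib's object-oriented interface: you call methods on a specific axes object rather than on `plt`. As a sketch with made-up data (a noisy straight line; the random seed is arbitrary), the same interface handles scatter plots, lines, titles, and legends:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: a noisy straight line
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(0, 1, size=x.size)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, label="data")                 # individual points
ax.plot(x, 2 * x, color="k", label="y = 2x")   # reference line
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot with a reference line")
ax.legend()
```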

In [None]:
# Or, using seaborn...
sns.jointplot(data=iris, x="sepal_length", y="sepal_width", hue="species")

In [None]:
# If you want to see all of the correlations between the variables in your data
sns.pairplot(data=iris, hue='species')

Hopefully this gives you a sense for how useful tools like pandas and seaborn can be for exploring and investigating data.
This now leads us to...

## <span style="color:green">Practice!</span>

We've covered some basics in NumPy, pandas, and Matplotlib/seaborn.
Now, let's try and apply them in practice to do a basic analysis.

First, we're going to use seaborn to load in a different dataset—this one has data about penguins!

In [None]:
penguins = sns.load_dataset("penguins") # this loads the data into a pandas DataFrame
penguins.head() # let's look at the first few entries

Here are your objectives:
1. Make some visualizations of this data. This can be similar to the plots we showed above, or anything else that you think may be useful. Remember the [matplotlib.pyplot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html) and [seaborn](https://seaborn.pydata.org/api.html) APIs are useful references to learn what you can do.
2. Try to identify which variables can be used (either independently or together) to separate the penguins by species (not including, of course, the `species` column itself!). Don't worry about a "correct" or even optimal solution here. We want you to focus on using the data to extract information that you may not necessarily have (imagine this is your "training" data and you want to be able to determine the species of any new penguin you see using the other available information).
3. Using those variables, or maybe some function of those variables, try to separate the penguins by species.
4. What's *more important than being correct* is understanding *how well you did*. That is, how many penguins are misclassified? Is it a lot of them or a small number? Again, being able to answer this question is *more important* than whether you successfully classified each penguin.
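One way to approach the last objective is to count how often a rule disagrees with the known labels. The sketch below deliberately uses a made-up table rather than the penguin data; the column names, values, and threshold are all hypothetical, and it is only meant to show the mechanics:

```python
import numpy as np
import pandas as pd

# Made-up table; the columns are hypothetical and not taken from the penguins data
toy = pd.DataFrame({
    "label": ["small", "small", "small", "large", "large", "large"],
    "size": [1.0, 2.0, 3.5, 2.5, 4.0, 5.0],
})

# A simple rule: anything above a threshold is called "large"
threshold = 3.0
toy["predicted"] = np.where(toy["size"] > threshold, "large", "small")

# Compare the predictions with the true labels and count the disagreements
wrong = toy["predicted"] != toy["label"]
n_wrong = int(wrong.sum())
print(f"misclassified: {n_wrong} of {len(toy)}")

# Cross-tabulate true labels against predictions to see which classes get confused
print(pd.crosstab(toy["label"], toy["predicted"]))
```

The cross-tabulation is often more informative than a single number, because it shows which classes are mistaken for which.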

### 1. Visualizations

### 2. Discriminant Variables

### 3. Classification

### 4. Performance

## SciPy

[Introduction](https://docs.scipy.org/doc/scipy/tutorial/general.html)

[SciPy API](https://docs.scipy.org/doc/scipy/reference/index.html)

You will typically use SciPy for more specialized functions or in specific contexts, so we won't go into it in any depth here.
We encourage you to take a look at the general introduction and to skim through the modules and functions available through the API.
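To give a flavor of what is available, here is a minimal sketch using two SciPy routines, `scipy.integrate.quad` for numerical integration and `scipy.optimize.minimize_scalar` for one-dimensional minimization. The functions being integrated and minimized are toy examples chosen because their answers are known:

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate exp(-x^2) over the whole real line;
# the exact answer is sqrt(pi)
value, error_estimate = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(value, np.sqrt(np.pi))

# Find the minimum of the parabola (x - 2)^2, which sits at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)
print(result.x)
```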

# Other Useful Libraries

While the SciPy Stack is generally useful for anyone doing scientific computing, it does not cover every use case.
Listed below are a few other common libraries that are worth at least being aware of.

## Astrophysics Libraries

### Astropy

General purpose astrophysics routines

[Tutorials](https://learn.astropy.org/tutorials.html)

[Documentation](https://docs.astropy.org/en/stable/)

### Healpy

Working with HEALPix (Hierarchical Equal Area isoLatitude Pixelization) maps of the sphere

[healpy tutorial](https://healpy.readthedocs.io/en/latest/tutorial.html)

[Documentation](https://healpy.readthedocs.io/en/latest/)

## Machine Learning Libraries

### scikit-learn

Straightforward library for machine learning, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

[Getting Started](https://scikit-learn.org/stable/getting_started.html)

[API Reference](https://scikit-learn.org/stable/modules/classes.html)

### TensorFlow

Tensor algebra and machine learning library, including the Keras API.

[Quickstart](https://www.tensorflow.org/tutorials/quickstart/beginner)

[API](https://www.tensorflow.org/api_docs/python/tf)

### PyTorch

Another machine learning framework.

[Learn the Basics](https://pytorch.org/tutorials/beginner/basics/intro.html)

[Documentation](https://pytorch.org/docs/stable/index.html)