# Introduction to `Numpy` and `Scipy`

In this course we'll use several external Python libraries. If you have trouble installing Python locally, you can always use [Google Colab](https://colab.research.google.com).

1. `Numpy`, the fundamental package for scientific computing with Python. Numpy does all the heavy mathematical lifting, such as matrix multiplication and summing. We use Numpy due to its *speed* and *convenience*. The syntax of Numpy is very similar to that of [TensorFlow](https://www.tensorflow.org), which is used extensively in heavy-duty machine learning applications.
2. `Scipy`, which contains "fundamental algorithms for scientific computing in Python". We will mostly use Scipy for its statistical distributions. 
3. `Pandas` deals with data frames. Most of our data will be in the Pandas format.
4. `statsmodels`, the Python package for basic statistical analysis. It closely mimics `R` in syntax and functionality.
5. `scikit-learn` is somewhat similar to statsmodels, but contains much more functionality and is geared towards [machine learning instead of statistics](https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3). 
6. `matplotlib`, the basic plotting library in Python.
7. `seaborn`. A package that simplifies plotting. 
8. `tinybench`. Used for doing benchmarks, i.e., timing how long functions take to run. You can learn a lot about how to write efficient code by routinely using this package while coding. It's also quite fun - optimizing code is one of the pleasures of programming, and `tinybench` makes it easy to check if any of your optimizations make sense.

## Curriculum

### Numpy
We will use Numpy extensively in this course. [Numpy for absolute beginners](https://numpy.org/doc/stable/user/absolute_beginners.html) serves as the main Numpy curriculum. Expect to come back to it multiple times! Roughly speaking, you're expected to be able to quickly figure out how to solve a given Numpy task in, e.g., the home exam, using methods laid out in that document, the Numpy documentation, the lectures, and the lecture notes. Take Numpy seriously and do the exercises!

Below I write that you should *familiarize yourself with* the Numpy documentation and how Scipy handles distributions. This means that you should:
1. Fire up an instance of [Visual Studio Code](https://code.visualstudio.com) (recommended), Jupyter Notebook, or your preferred way to write Python.
2. Go to the supplied links and *actively* read them. You can't just print out the documents and read them in the shade of a tree! You should make a hypothesis about how a snippet of code works, copy the Python code to your editor, and then modify it to check if your hypothesis is true.
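
As a small illustration of what active reading can look like, here is a sketch built around `np.arange`. The hypothesis and the variable names are made up for this example. We guess what a snippet returns, run it, and then modify it.

```python
import numpy as np

# Hypothesis: np.arange(0, 5) contains the numbers 0, 1, 2, 3, 4, 5.
x = np.arange(0, 5)
print(x)       # the endpoint 5 is missing, so the hypothesis was wrong
print(len(x))  # 5 elements, not 6

# Modified snippet: what does a third argument (the step) do?
y = np.arange(0, 5, 2)
print(y)
```

The point is not the specific function, but the habit of predicting the output before you run the code.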

### The speed of `Numpy`

Python is a very slow language. So slow, in fact, that most optimizations in Python are about moving as many computations as possible to Numpy.


In [None]:
import numpy as np

def sum_python(n):
  numbers = range(0, n)
  acc = 0
  for i in numbers:
    acc = acc + i
  return acc

def sum_numpy(n):
  numbers = np.arange(0, n, dtype = np.int64)
  return numbers.sum()
  

sum_python(10 ** 6)
sum_numpy(10 ** 6)

Take note of the following:
1. The Numpy code is faster to type and arguably easier to read. There is no doubt what the `.sum` method does. (But to be fair, Python implements a `sum` function too.)
2. We use the `dtype = np.int64` argument in the `np.arange` function. This makes `int64` the data type of the resulting Numpy array. These are 64-bit (signed) integers. The default integer type depends on your platform and Numpy version, and may be only 32 bits (for example on Windows with older Numpy versions). The difference between the two lies in their maximum and minimum values. The maximal value of an `int64` is `9,223,372,036,854,775,807`, but the maximal value of an `int32` is merely `2,147,483,647`. To be safe, manually specify `int64` when dealing with big integers in Numpy. You do not need to do that in Python, as it can use integers of arbitrary size (at the cost of speed).
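
To see why the integer type matters, the sketch below looks up the limits with `np.iinfo` and then sums the numbers below `10 ** 6` with both a 64-bit and a 32-bit accumulator. The variable names are chosen for this example only.

```python
import numpy as np

# Largest representable values, as reported by Numpy itself.
print(np.iinfo(np.int32).max)  # 2147483647
print(np.iinfo(np.int64).max)  # 9223372036854775807

n = 10 ** 6
exact = sum(range(n))  # pure Python integers never overflow

# 64-bit integers are large enough to hold the sum.
total_64 = int(np.arange(0, n, dtype=np.int64).sum())

# Forcing a 32-bit accumulator makes the sum wrap around silently.
total_32 = int(np.arange(0, n, dtype=np.int32).sum(dtype=np.int32))

print(exact, total_64, total_32)
```

Numpy raises no error in the 32-bit case. The result is simply wrong, which is why it pays to think about the `dtype` up front.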

We compare the execution speed of these functions using the `benchmark` function from the `tinybench` package. As always, type `help(benchmark)` in a Python interpreter to get help for the function. Below, we sample `ntimes = 100` times and use a warmup of `10` (to get the processor running). The `g` argument tells `benchmark` where to find the functions in the list, and the argument `globals()` tells it to look at the top level.

In [None]:
from tinybench import benchmark, benchmark_env
# sum2_python is written in the exercises below; add 'sum2_python(10 ** 6)' to the list once it is defined.
bench = benchmark(['sum_python(10 ** 6)', 'sum_numpy(10 ** 6)'], ntimes = 100, warmup = 10, g = globals())
bench.plot()

The Numpy version is much faster. To pinpoint by exactly how much, we need to look at the mean execution times.

In [None]:
bench.means
bench.means['sum_python'] / bench.means['sum_numpy']

Thus, on my machine, the Numpy implementation is roughly $10$ times faster. One can expect speedups much larger than this in more complex applications.
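
If `tinybench` is not available, a comparable measurement can be made with the standard-library `timeit` module. The following is only a rough sketch and not a replacement for the benchmark above. The values of `n` and `repeats` are arbitrary choices, and the two functions are repeated so that the snippet runs on its own.

```python
import timeit
import numpy as np

def sum_python(n):
  acc = 0
  for i in range(0, n):
    acc = acc + i
  return acc

def sum_numpy(n):
  return np.arange(0, n, dtype=np.int64).sum()

n = 10 ** 5
repeats = 20

# timeit.timeit returns the total time in seconds for `number` calls.
time_python = timeit.timeit(lambda: sum_python(n), number=repeats) / repeats
time_numpy = timeit.timeit(lambda: sum_numpy(n), number=repeats) / repeats

print(f"pure Python: {time_python:.6f} s per call")
print(f"Numpy:       {time_numpy:.6f} s per call")
print(f"speedup:     {time_python / time_numpy:.1f}x")
```

The exact ratio will vary between machines and runs, so treat a single measurement as an indication only.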

## Exercises

### Data types
1. Answer the following questions. 
 * What is the minimal value of a 32 bit integer in Numpy? 
 * What is the minimal value of a 64 bit integer in Numpy?
 * What other integer types does Numpy allow? Is there an integer type even larger than `int64`, provided you restrict yourself to non-negative numbers?
2. Decimal numbers in computer science are called *floats*, or floating point numbers. 
 * What types of floats are available in Numpy?
 * What is the default float type when using `linspace`?
 * What are the maximal and minimal values of these float types?

### Benchmarking and Numpy


1. Python has a built-in function `sum` that sums every member of an iterable such as a list. Implement a function `sum2_python` that uses `sum` instead of a for-loop, but does not use Numpy. Compare its performance to my loop-based implementation. What do you see?
2. Write two functions that square and sum the numbers from `1` to `n`, one in Numpy and one in pure Python. Roughly how much faster is the Numpy implementation than the Python implementation when using `n=1000`, `n=10000`, or `n=10**6`? (*Hint*: Make sure to use a Numpy function to square all the elements!)


In [None]:
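def sum_sq_python(n):
  # Sum of squares of 1, 2, ..., n using a plain Python loop.
  numbers = range(1, n + 1)
  acc = 0
  for i in numbers:
    acc = acc + i**2
  return acc

def sum_sq_numpy(n):
  # Same computation, with squaring and summing done by Numpy.
  numbers = np.arange(1, n + 1, dtype = np.int64)
  return np.square(numbers).sum()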

from tinybench import benchmark, benchmark_env
bench = benchmark(['sum_sq_python(10 ** 6)', 'sum_sq_numpy(10 ** 6)'], ntimes = 100, warmup = 10, g = globals())
bench.plot()  

In [None]:
bench.means['sum_sq_python'] / bench.means['sum_sq_numpy']

3. Recall that the sample variance is defined as $\sum (x_i - \overline{x})^2 / (n-1)$, where $\overline{x}$ is the sample mean and $n$ is the number of observations. Compare a Numpy-free implementation to the `var` method of `Numpy` (using the optional argument `ddof = 1`) on the numbers `1 .. 10 ** 5`, but this time, let `dtype = np.float64`. Be sure to check that your functions return the same result!

(*Note*: Your algorithm and `Numpy` might give different results for very large `n`. This is due to a phenomenon called [numerical instability](https://en.wikipedia.org/wiki/Numerical_stability), which we ignore in this course.)


In [None]:
def var_python(n):
  numbers = range(0, n)
  mean = sum(numbers) / n
  return sum([(x - mean) ** 2 for x in numbers]) / (n - 1)

def var_numpy(n):
  numbers = np.arange(0, n, dtype = np.float64)
  return numbers.var(ddof = 1)

var_numpy(10**5)
var_python(10**5)
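
One way to check that the two functions return the same result is to compare them up to a small relative tolerance, since floating point results rarely agree to the last digit. The sketch below repeats the two functions so that it runs on its own. The tolerance `1e-9` is an arbitrary choice, and the closed form $n(n+1)/12$ for the sample variance of $0, 1, \ldots, n-1$ is included as an extra sanity check.

```python
import math
import numpy as np

def var_python(n):
  numbers = range(0, n)
  mean = sum(numbers) / n
  return sum([(x - mean) ** 2 for x in numbers]) / (n - 1)

def var_numpy(n):
  numbers = np.arange(0, n, dtype = np.float64)
  return numbers.var(ddof = 1)

n = 10 ** 5
result_python = var_python(n)
result_numpy = float(var_numpy(n))

print(result_python, result_numpy)
# True if the two results agree up to a relative error of 1e-9.
print(math.isclose(result_python, result_numpy, rel_tol = 1e-9))
# Closed form for the sample variance of 0, 1, ..., n - 1.
print(n * (n + 1) / 12)
```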

### Numpy exercises

1. **Some simple exercises.** How do you do the following in Numpy? Make sure to make your own example in Python!
    1. Make an identity matrix with `n` rows?
    2. Make a matrix consisting of $0$s only?
    3. Calculate the empirical mean of a vector? 
    4. Calculate the standard deviation of a vector, normalized so that the variance estimate is unbiased? (*Hint:* Read the docs to find out what I mean.) Which of the two normalizations do you think we are most interested in in this course?
    5. Calculate the "cumulative sum" of a vector `x`? (This operation creates a new vector `y` whose first element is `x[0]`, second `x[0] + x[1]`, etc.)
    6. Take the element-wise logarithm of a matrix?

2. **Codewars.** Do this [Codewars exercise](https://www.codewars.com/kata/52fba2a9adcd10b34300094c) using Numpy indexing, i.e., not the built-in function `transpose`. You need to register at Codewars to do this exercise (which is harmless, and highly recommended).

3. Do the [following exercise](https://www.codewars.com/kata/568ff914fc7a40a18500005c/train/python), both with and without Numpy. (*Hint*: Use the `round` method to round the Numpy arrays. Remember to convert between `list` and `array` types!) 

4. Do the [following exercise](https://www.codewars.com/kata/57102bbfd860a3369300089c/train/python), both with and without Numpy.

5. **numpy-100.** The Github repo [Numpy-100](https://github.com/rougier/numpy-100) contains 100 Numpy exercises of variable difficulty. The Github page also includes hints and solutions. You can read the 100 problems [here](https://hackmd.io/@JonasMoss/numpy-100). Do exercises 1 - 11 in this repo.

There is no upper limit to how many of the `numpy-100` exercises you should do, but I would recommend you do as many as you can find time for. (These exercises are pretty short!) The point is to [*grok*](https://www.vocabulary.com/dictionary/grok) Numpy. (*Hint:* Use the Numpy documentation, Google, StackExchange, and so on. Check the hints if you have to.)

### Scipy exercises
You will have to use the Scipy documentation to solve these exercises. You should also use Wikipedia liberally. Always remember, if you struggle a lot with an exercise, try the next one! You can always come back later.
1. **Exponential distribution.** Calculate the mean and standard deviation of the exponential distribution with scale parameter $\lambda$ using Scipy. Use Wikipedia to find the true values of the mean and standard deviation. Do they match?
2. **Normal distribution.** Use Scipy to calculate all $k$th moments (i.e., the expectation $E(X^k)$) of a normal distribution with mean $0$, for $\sigma = 1, 2$ and $k < 20$. Do you notice a pattern? Use Wikipedia to figure out exactly what the pattern is.
3. **Plotting.** Make a function that plots the `pdf` of a continuous Scipy distribution object (the distributions defined in `scipy.stats._continuous_distns`) on user-specified bounds. The function signature should look like `plotter(obj, lower = 0, upper = 4)`.
4. **Nakagami distribution.** The Nakagami distribution is a special case of the gamma distribution, but with a different parameterization. Figure out how to translate the parameters of the Nakagami distribution to the parameters of the gamma distribution. Then plot both densities in the same window to verify your calculations. Use $\nu = 3$ in the Nakagami distribution.
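
As a warm-up for the exercises above, the sketch below shows the general pattern for working with a Scipy distribution, using the uniform distribution as a deliberately simple stand-in. The variable names and the choice of interval are made up for this example. A distribution is "frozen" by calling it with its parameters, and the resulting object exposes methods such as `mean`, `std`, `moment`, and `pdf`.

```python
import numpy as np
from scipy import stats

# A "frozen" distribution: uniform on the interval [loc, loc + scale] = [0, 2].
dist = stats.uniform(loc = 0, scale = 2)

print(dist.mean())     # expectation E(X)
print(dist.std())      # standard deviation
print(dist.moment(2))  # second moment E(X^2)

# The pdf accepts Numpy arrays and is evaluated element-wise.
x = np.linspace(-1, 3, 5)
print(dist.pdf(x))
```

The other continuous distributions in `scipy.stats` follow the same pattern, but check the documentation of each one for how its parameters map to `loc`, `scale`, and any shape parameters.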

## Additional resources