<div class="alert alert-block alert-danger">
<b>Check the Kernel you are using:</b> Before we get started, if you are running this on HiPerGator, double check the kernel in use. This is shown in the top right of the window and should look like: <img src="images/kernel.python310.png" alt="Image showing that the notebook is using the Python 3.10 Full kernel" style="float:right">
</div>

# NumPy and inclusive communities

[NumPy](https://numpy.org/) is undoubtedly an important package for Python, and its developers (mostly volunteers) have provided a great service to the community, not only with NumPy itself, but also by enabling the development of packages that use NumPy under the hood to add even more functionality. The developers have, however, done a great disservice in failing to address issues of inclusion of diverse talents.

On September 16, 2020, *Nature* published the paper [Array Programming with NumPy](https://www.nature.com/articles/s41586-020-2649-2?amp%3Bcode=573df4db-16bd-47ad-b138-d0d9c14134f1) with 26 authors. **All** 26 authors are male! There have been many excuses offered, and commitments to improve ([NumPy Diversity and Inclusion Statement](https://numpy.org/diversity_sep2020/)).

This is not news, however. A [2018 analysis by Anthony Scopatz](https://nbviewer.jupyter.org/github/scopatz/nf-project-inequality/blob/9b83df3090c9b9b1b953d2905d428b71165ce607/nf-project-inequality.ipynb) found huge "Inequality of underrepresented groups in PyData Leadership" (this is an interesting read on its own and is presented as a Jupyter Notebook). Here's the main figure from Anthony's analysis:

![image from Anthony Scopatz's analysis, linked from Reshama Shaikh's article on "Why Women Are Flourishing In R Community But Lagging In Python"](https://reshamas.github.io/assets/images/numfocus_os.png)

The analysis showed high inequality in NumPy and many other Python projects.

There is also a very interesting analysis by Reshama Shaikh on "[Why Women Are Flourishing In R Community But Lagging In Python](https://reshamas.github.io/why-women-are-flourishing-in-r-community-but-lagging-in-python/)" which contrasts the Python and R communities. I highly recommend reading Reshama's article; it has many good insights into why R, in general, has succeeded in attracting a more diverse developer community.

In my Computational Tools for Research in Biology course, I discuss [Git and GitHub](https://comptoolsres.github.io/TLCL_4.html) and the need for developer communities to be more inclusive of racial diversity, stop using offensive terms, and actively work to foster racial diversity. The same is true for gender diversity (Anthony's article also makes a great point about including non-binary people in assessment of diversity).

While I am disappointed in the NumPy history and will encourage reforms, if we choose to stop using NumPy, we would not be able to use Python for a wide array (pun intended) of applications. So, we will use NumPy, but also commit to increasing diversity and acknowledge historical wrongs.

# Introduction to NumPy

This notebook is based on [chapter 2 of Jake VanderPlas' Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html). [<img src="images/PDSH-cover-small.png" alt="PDSH Cover Image" style="width: 50px;float:right"/>](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)

> This chapter, along with [chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html), outlines techniques for effectively loading, storing, and manipulating in-memory data in Python. The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else. **Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.**

> For example, images–particularly **digital images**–can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area. **Sound clips** can be thought of as one-dimensional arrays of intensity versus time. **Text** can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words. **No matter what the data are, the first step in making it analyzable will be to transform them into arrays of numbers.** (We will discuss some specific examples of this process later in [Feature Engineering](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html))

> For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science. We'll now take a look at the specialized tools that Python has for handling such numerical arrays: the NumPy package, and the Pandas package (discussed in Chapter 3).

> This chapter will cover NumPy in detail. NumPy (short for **Numerical Python**) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python's built-in list type, but **NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.** NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.



Now we can import and look at the version of NumPy:
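
A minimal sketch of that step, assuming NumPy is installed in the active kernel and using the conventional `np` alias. The version printed depends on your environment, so no output is shown:

```python
import numpy as np  # "np" is the conventional alias for NumPy

# The version string depends on the environment this runs in
print(np.__version__)
```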

<div class="alert alert-block alert-info">
    <b>Note:</b> names surrounded by double underscores, like <code>__version__</code>, are referred to as "double underscore" or "dunder" names (dunder methods, when they are methods) and are generally not intended to be directly interacted with by users.
</div>

Remember the built-in documentation with `<TAB>` and `?`.

## Understanding Data Types in Python

As we mentioned earlier, Python is **dynamically typed**. This flexibility can be handy, but it becomes a liability, especially as the sizes of datasets grow.

As the PDSH chapter points out, languages such as C and Java are **statically typed**, meaning that the programmer has to declare, when the variable is created, the type of data that the variable will store.

For example in C, you could make a loop like this:

```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

While in Python the same thing is done like this:

```Python
# Python code
result = 0
for i in range(100):
    result += i
```

Notice that in C, the data types (all `int`s in this example) are explicitly declared, while in Python they are dynamically inferred.

## A Python Integer is More Than Just an Integer

We also briefly saw in the Intro to Jupyter session that a lot of the underlying code for Python is actually written in C (that is why the `??` function can't always display the code for a function). 

In the Intro to Python session, we learned about Object Oriented Programming, and that everything in Python is an object--even strings and integers.

All of this leads to the reality that if we do something like `x=1000`, `x` is not just a "raw" integer--bits stored in memory. 

> It's actually a pointer to a compound C structure, which contains several values. Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```
> A single integer in Python 3.4 actually contains four pieces:
> * `ob_refcnt`, a reference count that helps Python silently handle memory allocation and deallocation
> * `ob_type`, which encodes the type of the variable
> * `ob_size`, which specifies the size of the following data members
> * `ob_digit`, which contains the actual integer value that we expect the Python variable to represent.

> This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:

<figure>
  <img src="images/cint_vs_pyint.png" alt="Memory storage for C vs Python integers from Python Data Science Handbook">
  <figcaption>Memory storage for C vs Python integers, from Python Data Science Handbook</figcaption>
</figure>

> Here `PyObject_HEAD` is the part of the structure containing the reference count, type code, and other pieces mentioned before.

> Notice the difference here: **a C integer is essentially a label for a position in memory whose bytes encode an integer value**. **A Python integer is a pointer to a position in memory containing all the Python object information**, including the bytes that contain the integer value. This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically. All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
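
One way to see this overhead for yourself is with `sys.getsizeof`. This is only a rough sketch, assuming the standard CPython interpreter; the exact number of bytes varies with the Python version and platform, so the values are printed rather than stated here:

```python
import sys

x = 1000

# Size in bytes of the whole Python integer object (header plus digits).
# A plain C integer would typically need only 4 or 8 bytes.
int_size = sys.getsizeof(x)
print(int_size)

# The object also carries its type along with it
print(type(x))
```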

## A Python List is More Than Just a List

Practice creating some lists, and remember that a list can contain one or more data types, i.e., a Python list can hold heterogeneous data types.

In [None]:
# Create some lists


As PDSH notes, this sets up the situation where each element of a list needs to store its own information about the element's data type:

<figure>
  <img src="images/PDSH_list.png" alt="Python list image from Python Data Science Handbook">
  <figcaption>Memory storage for a Python list, from Python Data Science Handbook</figcaption>
</figure>

As you can imagine, this becomes exceedingly inefficient if, for example, you have a list of 1,000,000 integers.
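
To put a rough number on that, here is a sketch comparing the memory used by a list of 1,000,000 integers with a NumPy array holding the same values. It assumes CPython, and the list figure is only an estimate (the list's pointer storage plus the size of each integer object); the variable names are just for illustration:

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores one pointer per element, and each element is a full object
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)

# The array stores only the raw 8-byte values in one contiguous buffer
array_bytes = np_array.nbytes

print("list: ", list_bytes, "bytes (estimate)")
print("array:", array_bytes, "bytes")
```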

## Creating Arrays from Python Lists

PDSH mentions that there is an `array` module, but I rarely see anyone using it, so let's skip to the NumPy [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)--a multi-dimensional array. NumPy not only provides the data structure, but also highly efficient operations on the data.

First, we can create a NumPy array from a Python list:

In [None]:
# Integer array from list


NumPy arrays are constrained such that **all** elements need to be of the same type. 

If possible, NumPy will *upcast* items to create an array of matching type.
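
As a sketch of what upcasting looks like in practice (the values and variable names here are made up for illustration), note how the `dtype` changes with the contents of the list:

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])         # all integers: an integer array
mixed = np.array([3.14, 4, 2, 3])        # integers are upcast to floats
with_str = np.array([1, 2.5, "three"])   # everything is upcast to strings
as_float32 = np.array([1, 2, 3, 4], dtype="float32")  # explicit type

# The exact integer and string dtypes depend on the platform
print(ints.dtype, mixed.dtype, with_str.dtype, as_float32.dtype)
```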

In [None]:
# Create an array with a mix of integers and floats



In [None]:
# What about strings?



In [None]:
# You can also specify the type if you want



And, as the `ndarray` name implies, arrays can be multidimensional.
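
For instance, a list of lists (or, as sketched here, a list of `range` objects) produces a two-dimensional array, with each inner sequence becoming one row. The name `grid` is just for illustration:

```python
import numpy as np

# Each inner range becomes one row of the two-dimensional array
grid = np.array([range(i, i + 3) for i in [2, 4, 6]])

print(grid)
print(grid.ndim, grid.shape)  # 2 (3, 3)
```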

## Creating Arrays from Scratch

> Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy. Here are several examples:
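
Here is one possible way to write these, as a sketch to check your own cells against. The variable names are mine, and the random values will differ from run to run unless a seed is set:

```python
import numpy as np

zeros = np.zeros(10, dtype=int)              # length-10 integer array of zeros
ones = np.ones((3, 5), dtype=float)          # 3x5 floating-point array of ones
steps = np.arange(0, 20, 2)                  # 0, 2, ..., 18 (stop value excluded)
spaced = np.linspace(0, 1, 5)                # five values from 0 to 1, inclusive
uniform = np.random.random((3, 3))           # uniform values in [0, 1)
normal = np.random.normal(0, 1, (3, 3))      # mean 0, standard deviation 1
integers = np.random.randint(0, 10, (3, 3))  # random integers in [0, 10)
identity = np.eye(3)                         # 3x3 identity matrix
uninit = np.empty(3, dtype=int)              # contents are arbitrary until assigned

print(steps)
print(spaced)
```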

In [None]:
# Create a length-10 integer array filled with zeros


In [None]:
# Create a 3x5 floating-point array filled with ones


In [None]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)


In [None]:
# Create an array of five values evenly spaced between 0 and 1


In [None]:
# Create a 3x3 array of uniformly distributed random values between 0 and 1


In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1


In [None]:
# Create a 3x3 array of random integers in the interval [0, 10)


In [None]:
# Create a 3x3 identity matrix


In [None]:
# Create an uninitialized array of three integers
# The values will be whatever happens to already exist at that memory location


### Exercise 1

Create the following types of NumPy arrays.
 * A 4X4 matrix with ones in every cell
 * A 6X6 matrix with ones on the diagonal from top left to bottom right.
 * A 3X3X3 matrix with normally distributed random numbers with mean of 5 and standard deviation of 2
 * A vector with 1,000,000 evenly spaced numbers between 5 and 10.

In [None]:
# Your code here


In [None]:
# Uncomment and run the line below for a solution
#%load snippets/NumPy_Ex_01.matrices.py

## NumPy Standard Data Types

> NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.

Notice that there are **a lot** of data types, and there are even more options if needed. This is one reason Python is handy. But again, that flexibility comes at a cost. 

Data type | Description
----------|------------
bool_ | Boolean (True or False) stored as a byte
int_ | Default integer type (same as C long; normally either int64 or int32)
intc | Identical to C int (normally int32 or int64)
intp | Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8 | Byte (-128 to 127)
int16 | Integer (-32768 to 32767)
int32 | Integer (-2147483648 to 2147483647)
int64 | Integer (-9223372036854775808 to 9223372036854775807)
uint8 | Unsigned integer (0 to 255)
uint16 | Unsigned integer (0 to 65535)
uint32 | Unsigned integer (0 to 4294967295)
uint64 | Unsigned integer (0 to 18446744073709551615)
float_ | Shorthand for float64.
float16 | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32 | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64 | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_ | Shorthand for complex128.
complex64 | Complex number, represented by two 32-bit floats
complex128 | Complex number, represented by two 64-bit floats
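
A short sketch of how these types can be used: a data type can be requested either as a string or as the corresponding NumPy object, and `np.iinfo` reports the range of an integer type. The variable names are only for illustration:

```python
import numpy as np

# Two equivalent ways to request a 16-bit integer array
a = np.zeros(4, dtype="int16")
b = np.zeros(4, dtype=np.int16)
print(a.dtype == b.dtype)  # True

# iinfo reports the representable range of an integer type
info = np.iinfo(np.int8)
print(info.min, info.max)  # -128 127

# Each element of a float32 array takes 4 bytes
f = np.zeros(4, dtype=np.float32)
print(f.itemsize)  # 4
```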

## The Basics of NumPy Arrays

> Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas ([Chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)) are built around the NumPy array. Note, though, that as datasets grow, tools like Nvidia's [RAPIDS](https://rapids.ai/) framework are replacing Pandas by moving calculations onto the GPU to accelerate them. But fear not, Pandas and RAPIDS are largely compatible.


### A note on random numbers

PDSH moves on to creating three arrays to use for the following examples. Before we get there, let's take a look at the first thing that is done:

`np.random.seed(0)`

This sets the random number generator seed to 0. What does that mean? Well, computers really can't make truly random numbers on their own. What they use is a complex series of manipulations to generate numbers that appear random, called *pseudorandom* numbers. If you start with the same number, the seed, the sequence of "random" numbers generated is **guaranteed** to be identical. This has good and bad properties. On the good side, we can set a seed so that we all get the same numbers, which also helps with troubleshooting. On the bad side, we are often lulled into a false sense of having simulated something repeatedly, only to find that we failed to consider the biases that may be introduced by the random number generator or, worse, that we repeatedly ran the simulation with the same seed!
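
A quick sketch of that guarantee: resetting the seed restarts the same sequence, so the two draws below match. The actual values are not shown, and the names `first` and `second` are just for illustration:

```python
import numpy as np

np.random.seed(0)
first = np.random.randint(10, size=5)

np.random.seed(0)  # resetting the seed restarts the same sequence
second = np.random.randint(10, size=5)

print(first)
print(second)
print((first == second).all())  # True
```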

Also, as a note, this guarantee only applies to identical code. PDSH used NumPy version 1.11.1, while we (using Python 3.8 full kernel on HiPerGator on 1/17/21) are using 1.13.1. While we will get consistent numbers from run to run and student to student, our numbers are different from those in the text.

### Create some arrays to use

In [None]:
# Create some sample arrays

  # Set random number generator seed for reproducibility

  # One-dimensional array
  # Two-dimensional array
  # Three-dimensional array

### NumPy Array Attributes:

> Each array has attributes `ndim` (the number of dimensions), `shape` (the size of each dimension), and `size` (the total size of the array):
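
As a sketch of the sample arrays and their attributes, assuming the names `x1`, `x2`, and `x3` for the one-, two-, and three-dimensional arrays (the integer `dtype`, and therefore `itemsize` and `nbytes`, depends on the platform):

```python
import numpy as np

np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array

print("x3 ndim: ", x3.ndim)    # 3
print("x3 shape:", x3.shape)   # (3, 4, 5)
print("x3 size: ", x3.size)    # 60
print("dtype:   ", x3.dtype)   # platform-dependent integer type
print("itemsize:", x3.itemsize, "bytes")  # bytes per element
print("nbytes:  ", x3.nbytes, "bytes")    # itemsize * size
```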

In [None]:
# Print the data type of the array


In [None]:
# Print the itemsize and nbytes


### Array Indexing: Accessing Single Elements

This is similar to using lists in Python.

In [None]:
# Indexing from the end of the array


In multi-dimensional arrays, items are accessed using a comma-separated list of indices:
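
A small sketch of both cases, using hand-written arrays so the results are predictable (the values themselves are arbitrary):

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])
x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

print(x1[0])      # 5, the first element
print(x1[-1])     # 9, the last element
print(x2[0, 0])   # 3, row 0, column 0
print(x2[2, -1])  # 7, last column of row 2

# Elements can be modified using the same index notation
x2[0, 0] = 12
print(x2[0, 0])   # 12
```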

### Array Slicing: Accessing Subarrays

As with lists, NumPy arrays support slices.

#### Slices for one-dimensional arrays

In [None]:
 # first 5 elements

In [None]:
 # elements from index 5 on

In [None]:
 # middle sub-array

In [None]:
 # every other element

In [None]:
 # every other element, starting at index 1

> A potentially confusing case is when the `step` value is negative. In this case, the defaults for `start` and `stop` are swapped. This becomes a convenient way to reverse an array:
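
To make the `x[start:stop:step]` patterns concrete, here is a sketch on a simple array of the integers 0 through 9, including the negative-step cases:

```python
import numpy as np

x = np.arange(10)  # the integers 0 through 9

print(x[:5])     # [0 1 2 3 4]  first five elements
print(x[5:])     # [5 6 7 8 9]  elements from index 5 on
print(x[4:7])    # [4 5 6]      middle sub-array
print(x[::2])    # [0 2 4 6 8]  every other element
print(x[1::2])   # [1 3 5 7 9]  every other element, starting at index 1
print(x[::-1])   # [9 8 7 6 5 4 3 2 1 0]  all elements, reversed
print(x[5::-2])  # [5 3 1]      reversed every other, from index 5
```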

In [None]:
  # all elements, reversed

In [None]:
  # reversed every other from index 5

#### Slices for multi-dimensional subarrays

> Multi-dimensional slices work in the same way, with multiple slices separated by commas. For example:
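
A sketch using a small 3x4 array with predictable values (rather than the random `x2` used in the cells):

```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

two_by_three = x2[:2, :3]  # two rows, three columns
every_other = x2[:, ::2]   # all rows, every other column
first_col = x2[:, 0]       # first column
first_row = x2[0, :]       # first row
also_first_row = x2[0]     # equivalent to x2[0, :]

print(two_by_three)
print(every_other)
print(first_col, first_row)
```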

In [None]:
  # two rows, three columns

In [None]:
 # all rows, every other column

In [None]:
 # first column of x2

In [None]:
  # first row of x2

In [None]:
 # equivalent to x2[0, :]

## Subarrays as no-copy views

An important--and at times both useful and confusing--thing to know about array slices is that they return *views* rather than *copies* of the array data. Changing data in a subarray changes the data in the originating array.
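
A sketch of this behavior, again with a small predictable array. The value 99 is arbitrary, and `np.shares_memory` is used only to confirm that the two arrays use the same buffer:

```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)

# Extract a 2x2 subarray; this is a view, not a copy
x2_sub = x2[:2, :2]

# Modifying the view also modifies the original array
x2_sub[0, 0] = 99
print(x2[0, 0])  # 99

print(np.shares_memory(x2, x2_sub))  # True
```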

In [None]:
# Extract a 2X2 subarray from this



In [None]:
# Modify element of x2_sub


> This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.

#### Creating copies of arrays

If what you want is really a copy, you can use the `.copy()` method.
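
For example (a sketch with an arbitrary replacement value of 42), changes to a copy leave the original untouched:

```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)

# An explicit copy has its own data buffer
x2_sub_copy = x2[:2, :2].copy()
x2_sub_copy[0, 0] = 42

print(x2_sub_copy[0, 0])  # 42
print(x2[0, 0])           # 0, the original is unchanged
```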

### Reshaping Arrays

Another common action is to reshape the dimensions of an array. The `.reshape()` method is the easiest way to do this. 
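
A sketch of a few common reshaping patterns. The total number of elements has to stay the same, and `np.newaxis` inside a slice is an alternative way to turn a one-dimensional array into a row or column:

```python
import numpy as np

# Put the numbers 1 through 9 into a 3x3 grid
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

x = np.array([1, 2, 3])

row = x.reshape((1, 3))       # row vector via reshape
row_alt = x[np.newaxis, :]    # row vector via newaxis
col = x.reshape((3, 1))       # column vector via reshape
col_alt = x[:, np.newaxis]    # column vector via newaxis

print(row.shape, col.shape)   # (1, 3) (3, 1)
```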

In [None]:
# Create a row array



In [None]:
# Reshape to column vector using newaxis


## Concatenation of arrays 

There are several functions to concatenate arrays in NumPy; `np.concatenate`, `np.vstack`, and `np.hstack` are the most common.
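
A sketch of the three functions on small hand-written arrays (the values are arbitrary and the names are only for illustration):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = np.array([99, 99, 99])
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
col = np.array([[99],
                [99]])

joined = np.concatenate([x, y, z])            # [1 2 3 3 2 1 99 99 99]
stacked_rows = np.concatenate([grid, grid])   # along the first axis: shape (4, 3)
stacked_cols = np.concatenate([grid, grid], axis=1)  # second axis: shape (2, 6)

# For arrays of mixed dimensions, vstack and hstack are clearer
v = np.vstack([x, grid])    # x becomes a new first row: shape (3, 3)
h = np.hstack([grid, col])  # col becomes a new last column: shape (2, 4)

print(joined)
print(v)
print(h)
```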

In [None]:
# Concatenating multiple arrays


In [None]:
# Concatenating 2-dimensional arrays



In [None]:
# Concatenating along the second axis (zero-indexed)



In [None]:
# For mixed dimension arrays, vstack and hstack are more clear



In [None]:
# Need to be careful of dimensions


### Splitting Arrays

In [None]:
# Split after 3rd and 5th elements
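
A minimal sketch of splitting, assuming `np.split` with a list of split points. The array values are made up for illustration; N split points produce N + 1 subarrays:

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# Split before index 3 and before index 5
part1, part2, part3 = np.split(x, [3, 5])

print(part1, part2, part3)  # [1 2 3] [99 99] [3 2 1]
```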



# Computation on NumPy Arrays: Universal Functions

> Up until now, we have been discussing some of the basic nuts and bolts of NumPy; in the next few sections, we will dive into the reasons that NumPy is so important in the Python data science world. Namely, it provides an easy and flexible interface to optimized computation with arrays of data.

> Computation on NumPy arrays can be very fast, or it can be very slow. The key to making it fast is to use vectorized operations, generally implemented through NumPy's universal functions (ufuncs). This section motivates the need for NumPy's ufuncs, which can be used to make repeated calculations on array elements much more efficient. It then introduces many of the most common and useful arithmetic ufuncs available in the NumPy package.

## The Slowness of Loops

Both the dynamic typing and the interpreted nature of Python lead to slowness. PDSH talks about several options to circumvent some of this, and we will return to some throughout the semester.

One thing to keep in mind, though, is that:

> The relative sluggishness of Python generally manifests itself in situations where many small operations are being repeated – for instance looping over arrays to operate on each element. For example, imagine we have an array of values and we'd like to compute the reciprocal of each. A straightforward approach might look like this:

In [None]:
import numpy as np
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

In [None]:
# Let's time this on a big array:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

> It turns out that the bottleneck here is not the operations themselves, but the type-checking and function dispatches that CPython must do at each cycle of the loop. Each time the reciprocal is computed, Python first examines the object's type and does a dynamic lookup of the correct function to use for that type. If we were working in compiled code instead, this type specification would be known before the code executes and the result could be computed much more efficiently.

## Introducing UFuncs

NumPy provides a convenient interface to statically typed, compiled routines through **vectorized** operations.

Let's compare the Python implementation to the UFunc that performs the same calculation.
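
A sketch of that comparison: the loop-based `compute_reciprocals` from above and the vectorized expression `1.0 / values` give the same result. The timing lines are left as comments because `%timeit` is a notebook magic rather than plain Python:

```python
import numpy as np

np.random.seed(0)

def compute_reciprocals(values):
    """Loop-based version: one Python-level operation per element."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 10, size=5)

loop_result = compute_reciprocals(values)
vectorized_result = 1.0 / values  # the division is applied to every element at once

print(np.allclose(loop_result, vectorized_result))  # True

# In a notebook, the two can be timed on a large array with:
#   big_array = np.random.randint(1, 100, size=1000000)
#   %timeit compute_reciprocals(big_array)
#   %timeit 1.0 / big_array
```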

I think we can skip the rest of this section...there may be things we come back to, but I think this gets a bit into the weeds here.

## Aggregations: Min, Max, and Everything In Between

### Summing the Values in an Array

Again the main take home here is that NumPy, both through its compiled code and its explicit typing, speeds up calculations. For example, summing a NumPy array of 1,000,000 random numbers can be done with both the built-in `sum()` function and the `np.sum()` function, which is much faster:
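
A sketch of the two approaches, with the timing lines left as comments because `%timeit` is a notebook magic. The two totals agree up to floating-point rounding, since the values are added in a different order:

```python
import numpy as np

np.random.seed(0)
big_array = np.random.rand(1_000_000)

python_total = sum(big_array)    # built-in sum: loops in Python
numpy_total = np.sum(big_array)  # NumPy sum: runs in compiled code

print(np.isclose(python_total, numpy_total))  # True

# In a notebook:
#   %timeit sum(big_array)
#   %timeit np.sum(big_array)
```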

### Minimum and Maximum

Similarly, the NumPy versions of these are faster:

> For min, max, sum, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:
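
As a sketch, the built-in functions, the NumPy functions, and the array methods all return the same values; the NumPy forms are the ones to prefer on arrays:

```python
import numpy as np

np.random.seed(0)
big_array = np.random.rand(1_000_000)

# Built-in versions (slow on large arrays)
py_min, py_max = min(big_array), max(big_array)

# NumPy function versions
np_min, np_max = np.min(big_array), np.max(big_array)

# Shorter method syntax on the array object itself
m_min, m_max, m_sum = big_array.min(), big_array.max(), big_array.sum()

print(py_min == np_min == m_min)  # True
print(py_max == np_max == m_max)  # True
```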

> Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!

## Multidimensional aggregates

For N-dimensional matrices, you can aggregate along different axes:

In [None]:
# E.g. a two-dimensional matrix



In [None]:
# By default, the aggregation is over the entire array


In [None]:
# You can specify the axis


> The way the axis is specified here can be confusing to users coming from other languages. The `axis` keyword specifies the *dimension of the array that will be collapsed*, rather than the dimension that will be returned. So specifying `axis=0` means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
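
A sketch on a small 3x4 array with predictable values makes the collapsing easier to see (the name `M` is just for illustration):

```python
import numpy as np

M = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

total = M.sum()            # 66, aggregated over the entire array
col_min = M.min(axis=0)    # [0 1 2 3], rows collapsed: one value per column
row_max = M.max(axis=1)    # [ 3  7 11], columns collapsed: one value per row
col_sum = M.sum(axis=0)    # [12 15 18 21]

print(total)
print(col_min, row_max, col_sum)
```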