# Numpy & Pandas
In this notebook, we'll encounter two packages for scientific computing in Python: NumPy and Pandas.

**At the end of this notebook, you'll be able to:**
* Install and import packages for Python
* Create NumPy arrays
* Execute methods & access attributes of arrays
* Create & manipulate Pandas dataframes


## Importing packages

Before we can use numpy or pandas, we need to import them. We can also nickname the modules when we import them.

The convention is to import `numpy` as `np` and `pandas` as `pd`.

In [None]:
# Import packages
import numpy as np
import pandas as pd

# Use whos 'magic command' to see available modules
%whos

## Numpy

**Numpy** is the fundamental package for scientific computing with Python. It'll allow us to work with bigger datasets more efficiently.

### Creating `numpy` arrays

A numpy **array** is a grid of values which are all the same type (they’re homogeneous).

We can create a numpy array in a few different ways:

* from Python lists or tuples
* by using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, `empty`, `zeros`, etc.
* reading data from files

In [None]:
# Create a list
lst = [1,2,3,4,5]

# Make our list into an array
my_vector = np.array(lst)
print(type(my_vector))
print(my_vector)

In [None]:
# If we give numpy a list of lists, it will create a matrix
my_matrix = np.array([lst,lst])
print(my_matrix)

### Accessing attributes of numpy arrays
We can check the shape and size of an array either by looking at its `shape` and `size` attributes, or by using the `np.shape()` and `np.size()` functions.

Other attributes that might be of interest are `ndim` and `dtype`.
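
As a minimal sketch of these attributes, the cell below uses a small hypothetical array called `demo` (it is not used elsewhere in this notebook) and compares the attribute form with the function form.

```python
import numpy as np

# Hypothetical 2 x 3 array used only for illustration
demo = np.array([[1, 2, 3], [4, 5, 6]])

print(demo.shape)      # (rows, columns)
print(demo.size)       # total number of elements
print(demo.ndim)       # number of dimensions
print(np.shape(demo))  # function form, same result as demo.shape
print(np.size(demo))   # function form, same result as demo.size
```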

In [None]:
# Check the data type of the vector
print(my_vector.dtype)

# Check the data type of the matrix
print(my_matrix.dtype)

Array data type is decided upon creation of the array.

You can explicitly define the data type by passing the `dtype=` argument to `np.array()`. You can set the dtype to be `int`, `float`, `complex`, `bool`, `object`, etc.

In [None]:
# my_matrix.dtype 

my_complex_array = np.array([lst,lst],dtype='complex')
my_complex_array.dtype


<div class="alert alert-success">

**Task**: Create an array of booleans called `bool_array` that is 2 rows x 3 columns. Access the `shape` and `ndim` attributes to confirm its size, and the `dtype` attribute to confirm that it is boolean.

</div>

In [None]:
# Your code here

### Indexing & slicing arrays

Indexing and slicing 1D arrays (vectors) is similar to indexing lists.

You can index matrices using `[row,column]`. If you omit the column, it will give you the whole row.

If you use `:` for either row or column, it will give you all of those values.
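
The indexing rules above can be sketched as follows. The array `demo_matrix` is a hypothetical stand-in with the same 2 x 5 layout as `my_matrix`.

```python
import numpy as np

# Hypothetical 2 x 5 matrix used only for illustration
demo_matrix = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])

print(demo_matrix[0, 1])    # single value: row 0, column 1
print(demo_matrix[1])       # column omitted: the whole second row
print(demo_matrix[:, 0])    # ':' for rows: every value in column 0
print(demo_matrix[0, 1:4])  # slice: columns 1 through 3 of row 0
```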

In [None]:
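# my_matrix has only two rows (0 and 1), so a row index of 2 would raise an IndexError
# This selects the values at (row 0, column 0) and (row 1, column 2)
my_matrix[[0,1],[0,2]]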

We can also index arrays using booleans or lists of coordinates. Indexing with booleans can be thought of as *filtering* the array. For example:

In [None]:
# Index with an operator
bool_matrix = my_matrix[my_matrix>2]

# Index with a list of coordinates
list_matrix = my_matrix[[0,1],[1,3]]

print(my_matrix)
print(bool_matrix)
print(list_matrix)

<div class="alert alert-success">

**Task**: Filter the `bool_array` you created above to create a `true_array` that only contains True values.

</div>

We can also change values in an array similar to how we would change values in a list.
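
A minimal sketch of assignment, using a hypothetical array `demo` so that `my_matrix` is left untouched. Note that, unlike with a list, assigning a single value to a whole row fills every element of that row.

```python
import numpy as np

demo = np.array([[1, 2, 3], [4, 5, 6]])  # hypothetical example array

demo[0, 0] = 10       # change a single value
demo[1] = 8           # assign one value to an entire row
demo[:, 2] = [0, -1]  # assign a different value to each row of column 2

print(demo)
```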

In [None]:
print(my_matrix)
#my_matrix[0] = 8
#print(my_matrix)

my_matrix[0]

### Benefits of using arrays
In addition to being less clunky & a bit faster than lists of lists, arrays can do a lot of things that lists can't. For example, we can add and multiply them element by element, whereas adding two lists simply joins them end to end. We can also use the `sum` method to sum across a specific axis.

In [None]:
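sum_list = [1,3,5] + [3,5,7]  # adding lists concatenates them
sum_array = np.array([1,3,5]) + np.array([3,5,7])  # element-wise addition
mult_array = np.array([1,3,5]) * np.array([3,5,7])  # element-wise multiplication

print(sum_list)
print(sum_array)
print(mult_array)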

In [None]:
this_array = np.array([[1,3,5],[3,5,7]])
sum_rows = this_array.sum(axis=1)
print(this_array)
print(sum_rows)

### Numpy also includes some very useful array generating functions:

* `arange`: like `range`, but returns a numpy array instead of a `range` object, and can use more than just integers
* `linspace`: creates an array with given start and end points, and a desired number of points
* `logspace`: like `linspace`, but the points are evenly spaced on a log scale
* `random`: a module that can create arrays of random numbers (there are <a href="https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html">many different ways to use this</a>)
* `concatenate`: joins two or more arrays along an existing axis [<a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html">documentation</a>]
* `hstack` and `vstack`: stack arrays horizontally or vertically

Whenever we call these, we need to use whatever name we imported numpy as (here, `np`).
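
A short sketch of several of the functions listed above. All array names and values here are made up for illustration, and `default_rng` is only one of several ways to draw random numbers (it assumes a reasonably recent NumPy version).

```python
import numpy as np

steps = np.arange(0, 1, 0.25)     # start, stop (excluded), step
powers = np.logspace(0, 3, 4)     # 4 points from 10**0 to 10**3

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

joined = np.concatenate([a, b])   # one 1D array of length 6
side_by_side = np.hstack([a, b])  # for 1D arrays, same result as concatenate
stacked = np.vstack([a, b])       # 2 rows x 3 columns

# np.random is a module; the seed value below is arbitrary
rng = np.random.default_rng(seed=0)
random_values = rng.random(3)     # 3 floats in the interval [0, 1)

print(steps)
print(powers)
print(joined)
print(stacked)
print(random_values)
```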

In [None]:
# When using linspace, both end points are included!
np.linspace(0,147,10)

<div class="alert alert-success">

**Task**: Create an array called `big_array` that has two rows. The first row should be a list of 10 numbers that are evenly spaced, and range from exactly 1 to 100. The second row should be a list of 10 numbers that begin at 0 and are exactly 10 apart. `big_array` should have a shape (2,10): two rows, and ten columns. Lastly, reassign the last value of each row in the array to be -100. 

*Hint*: Create your two arrays, and then use the `vstack` function to stack them.

</div>

Numpy also has built-in functions to save and load arrays: `np.save()` and `np.load()`. Numpy files have a .npy extension, which `np.save()` adds to the filename if it is missing.

See full documentation <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html">here</a>.

In [None]:
# Save method takes arguments 'filename' and then 'array':
np.save('matrix',my_matrix)

In [None]:
my_new_matrix = np.load('matrix.npy')
my_new_matrix

## Pandas

Pandas is a useful module that creates **dataframes** (think of these like Excel spreadsheets, but much faster!). 

We can think of Pandas as "numpy with labels".


### Benefits of Pandas
* Great for real-world, heterogeneous data
* Similar to Excel spreadsheets (but way faster!)
* Smartly deals with missing data
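
To illustrate the last point, here is a minimal sketch with made-up values (the gene names are hypothetical). By default, Pandas skips missing values (`NaN`) when computing a mean, whereas the plain NumPy `mean` method returns `nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical expression values with one missing measurement
values = pd.Series([1.0, np.nan, 3.0], index=['gene_a', 'gene_b', 'gene_c'])

print(values.isna())   # flags the missing entry
print(values.mean())   # the missing value is skipped by default

# The same numbers as a plain NumPy array: the mean is nan
print(np.array([1.0, np.nan, 3.0]).mean())
```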

We can work with the gene expression data from a2 as a Pandas dataframe!


In [None]:
# Import necessary packages
import pandas as pd

# Read in the tab-separated file as a dataframe, using the gene symbols as row labels
gene_df = pd.read_csv('brainarea_vs_genes_exp_w_reannotations.tsv',sep='\t',index_col = 'gene_symbol')
gene_df.head() # Show the first five rows

Indexing in Pandas works slightly differently. Similar to a dictionary, we can index values by their names.

* Use `df['column_name']` for columns, and `.loc` for rows (by label).
* Use `.iloc` to index by integer position.
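
The indexing options above can be sketched on a small hypothetical dataframe. The gene and region names below are made up and do not come from the gene expression file.

```python
import pandas as pd

# Hypothetical dataframe used only for illustration
demo_df = pd.DataFrame(
    {'region_1': [0.5, 1.5], 'region_2': [2.0, 4.0]},
    index=['gene_a', 'gene_b'],
)

column = demo_df['region_1']               # one column, by name
row = demo_df.loc['gene_b']                # one row, by label
first_row = demo_df.iloc[0]                # one row, by integer position
value = demo_df.loc['gene_b', 'region_2']  # one value: row label, column name

print(column)
print(row)
print(first_row)
print(value)
```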

In [None]:
DISC1_data = gene_df.loc['DISC1']
DISC1_data['CA4 field']
#print(DISC1_data)


Pandas has many, many useful methods that you can use on your data, including `describe`, `mean`, and more.
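
As a rough sketch of `describe` and `mean`, again on a hypothetical dataframe with made-up names and values:

```python
import pandas as pd

# Hypothetical dataframe used only for illustration
demo_df = pd.DataFrame(
    {'region_1': [0.5, 1.5, 2.5], 'region_2': [2.0, 4.0, 6.0]},
    index=['gene_a', 'gene_b', 'gene_c'],
)

summary = demo_df.describe()      # count, mean, std, min, quartiles, max per column
column_means = demo_df.mean()     # mean of each column
row_means = demo_df.mean(axis=1)  # mean of each row

print(summary)
print(column_means)
print(row_means)
```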

In [None]:
mean_DISC1 = DISC1_data.mean()
mean_DISC1

## Resources
Check out the <a href="https://docs.scipy.org/doc/numpy/user/index.html">NumPy user guide</a> if you ever have a question about a NumPy array!

## About this notebook
This notebook is largely derived from UCSD COGS18 Materials, created by Tom Donoghue & Shannon Ellis, as well as <a href="https://github.com/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb">JR Johansson's Scientific Python Lecture on Numpy</a>.


Want to run this notebook as a slideshow? If you have Python (or Anaconda), follow <a href="http://www.blog.pythonlibrary.org/2018/09/25/creating-presentations-with-jupyter-notebook/">these instructions</a> to set up your computer with the RISE plugin.