# NumPy & Pandas
In this notebook, we'll encounter two packages for scientific computing in Python: NumPy and Pandas.

**At the end of this notebook, you'll be able to:**
* Install and import packages for Python
* Create NumPy arrays
* Execute methods & access attributes of arrays
* Create & manipulate Pandas dataframes
<hr>

## Importing packages

Before we can use NumPy or Pandas, we need to import them. We can also nickname the modules when we import them.

The convention is to import `numpy` as `np` and `pandas` as `pd`.

In [1]:
# Import packages
import numpy as np
import pandas as pd

# Use whos 'magic command' to see available modules
%whos

Variable   Type      Data/Info
------------------------------
np         module    <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
pd         module    <module 'pandas' from '/h<...>ages/pandas/__init__.py'>


## NumPy

**NumPy** is the fundamental package for scientific computing with Python. It'll allow us to work with bigger datasets more efficiently.

### Creating `numpy` arrays

A numpy **array** is a grid of values which are all the same [data type](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html).

We can create a numpy array in a few different ways:

* from Python lists or tuples
* by using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, `empty`, `zeros`, etc.
* reading data from files

In [2]:
# Create a list
lst = [1,2,3,4,5]

# Make our list into an array
my_vector = np.array(lst)
my_vector

array([1, 2, 3, 4, 5])

In [3]:
# If we give np.array() a list of lists, it will create a matrix
my_matrix = np.array([lst,lst])
my_matrix

array([[1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5]])

### Accessing attributes of numpy arrays
We can check the shape and size of an array either by looking at its `shape` and `size` attributes, or by calling the functions `np.shape()` and `np.size()` on it.

Other attributes that might be of interest are `ndim` and `dtype`.

In [7]:
# Check the dimensions of vector
print(my_vector.ndim)
print(my_vector.shape)
print(my_vector.dtype)

# Check the dimensions of matrix
print(my_matrix.ndim)
print(my_matrix.shape)

1
(5,)
int64
2
(2, 5)
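
As a small sketch of the two approaches described above, the attribute form and the function form report the same information. The array `example` is a made-up illustration, not part of the original notebook.

```python
import numpy as np

# Hypothetical example array (2 rows x 3 columns)
example = np.array([[1, 2, 3],
                    [4, 5, 6]])

# Attribute form: no parentheses, because these are attributes of the array
print(example.shape)
print(example.size)

# Function form: pass the array to the NumPy function instead
print(np.shape(example))
print(np.size(example))
```

Both forms give a shape of 2 rows by 3 columns and a size of 6 elements.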


An array's data type is set when the array is created.

You can explicitly define the data type by passing the `dtype=` argument to `np.array()`. You can set the dtype to be `int`, `float`, `complex`, `bool`, `object`, etc.

In [8]:
# Use dtype here
float_matrix = np.array(lst,dtype='float')
float_matrix

array([1., 2., 3., 4., 5.])
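
When no `dtype` is given, NumPy infers one from the values. The sketch below uses made-up arrays (`ints`, `mixed`, `as_float`) to show that inference, plus `astype`, which returns a converted copy of an existing array.

```python
import numpy as np

# Only integers: NumPy picks an integer dtype (the exact width depends on the platform)
ints = np.array([1, 2, 3])
print(ints.dtype)

# One float in the list is enough to make the whole array float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# astype() returns a new array with the requested dtype; 'ints' itself is unchanged
as_float = ints.astype(float)
print(as_float.dtype)
```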

### Indexing & slicing arrays

Indexing and slicing 1D arrays (vectors) is similar to indexing lists.

You can index 2D NumPy arrays (matrices) using `array_name[row,column]` to select a single value. If you omit the column, it will give you the entire row. You can also use `:` in place of either `row` or `column` to indicate that you want all of the values along that dimension.

<div class="alert alert-success">

**Task**: Create an array of booleans called `bool_array` that is 2 rows x 3 columns. Access the `shape` and `ndim` attributes to confirm its size, and the `dtype` attribute to confirm that it is boolean.

</div>

In [9]:
# Option 1
bool_array = np.array([[1,0,1],[0,0,1]],dtype=bool)
bool_array

array([[ True, False,  True],
       [False, False,  True]])

In [12]:
# Option 2

bool_lst = [True,False,False]
bool_array = np.array([bool_lst,bool_lst])
bool_array

array([[ True, False, False],
       [ True, False, False]])

In [18]:
my_matrix

array([[1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5]])

In [20]:
my_matrix[0,2]

3

You may want to look at a slice of columns or a slice of rows. You can slice your array like the following: `array[start_row:stop_row, start_col:stop_col]`. As with lists, the stop index is not included.

In [21]:
# Look at the first 3 columns of each row 
my_matrix[: ,0:3]

array([[1, 2, 3],
       [1, 2, 3]])
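
The indexing rules above can be sketched on a small made-up array (`grid` is hypothetical and uses distinct values so that each selection is easy to recognize):

```python
import numpy as np

# Hypothetical 2 x 5 array with distinct values
grid = np.array([[1, 2, 3, 4, 5],
                 [6, 7, 8, 9, 10]])

first_row = grid[0]        # column omitted: the entire first row
second_col = grid[:, 1]    # ':' for rows: the second column of every row
block = grid[0:2, 1:3]     # rows 0-1, columns 1-2 (stop indices are excluded)

print(first_row)
print(second_col)
print(block)
```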

### Subsetting arrays
We can also subset our original array to only include data that meets our criteria. We can think of this as **subsetting** the array by applying a condition to it. The syntax for this is `new_array = original_array[condition]`. This is essentially indexing the array with an array of Booleans: only the values where the condition is `True` are kept.

In [13]:
my_array = np.array([lst, lst])

# Return only values greater than 3 from our array 
condition = (my_array > 3)
filtered_array = my_array[condition]
print(filtered_array)

[4 5 4 5]
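
To see what the condition itself looks like, here is a minimal sketch using a made-up array called `values`. The condition is a Boolean array with the same shape as the original, and the subset comes back as a one-dimensional array.

```python
import numpy as np

# Hypothetical 2 x 5 array
values = np.array([[1, 2, 3, 4, 5],
                   [1, 2, 3, 4, 5]])

# The comparison is applied to every element and returns a Boolean array
mask = values > 3
print(mask)
print(mask.shape)

# Indexing with the mask keeps only the elements where the mask is True
subset = values[mask]
print(subset)
print(subset.ndim)
```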


<div class="alert alert-success">

**Task**: Subset `bool_array` so that it is only the true values.

</div>

In [15]:
bool_array[bool_array==True]

array([ True,  True])

We can also change values in an array, much as we would change values in a list.

In [25]:
# Assign a value to an index in my_matrix
my_matrix[0,0] = 100
my_matrix

array([[100,   2,   3,   4,   5],
       [  1,   2,   3,   4,   5]])

### Benefits of using arrays
In addition to being less clunky & a bit faster than lists of lists, arrays can do a lot of things that lists can't. For example, we can add and multiply them element by element, whereas adding two lists simply joins them end to end. We can also use the `sum` method to sum across a specific axis.

In [27]:
# Demonstrate differences between lists and arrays
sum_list = [1,3,5] + [3,5,7]
sum_array = np.array([1,3,5]) + np.array([3,5,7])
mult_array = np.array([1,3,5]) * np.array([3,5,7])

print(sum_list)
print(sum_array)
print(mult_array)

[1, 3, 5, 3, 5, 7]
[ 4  8 12]
[ 3 15 35]


In [28]:
# Sum over a specific axis
this_array = np.array([[1,3,5],[3,5,7]])
sum_rows = this_array.sum(axis=1)
print(this_array)
print(sum_rows)

[[1 3 5]
 [3 5 7]]
[ 9 15]
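
The `axis` argument can be easy to mix up, so here is a short sketch comparing the options on a made-up array (`table` is hypothetical). With `axis=0` the sum runs down the rows, giving one value per column; with `axis=1` it runs across the columns, giving one value per row.

```python
import numpy as np

# Hypothetical 2 x 3 array
table = np.array([[1, 3, 5],
                  [3, 5, 7]])

col_sums = table.sum(axis=0)   # one sum per column
row_sums = table.sum(axis=1)   # one sum per row
total = table.sum()            # no axis: sum of every element

print(col_sums)
print(row_sums)
print(total)
```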


### NumPy also includes some very useful array-generating functions:

* `arange`: like `range`, but returns a numpy array instead of a `range` object, and can use more than just integers
* `linspace`: creates an array with given start and end points, and a desired number of evenly spaced points
* `logspace`: same as `linspace`, but the points are evenly spaced on a log scale
* `random`: a submodule whose functions create arrays of random numbers (there are <a href="https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html">many different ways to use this</a>)
* `concatenate`: joins two or more arrays along an existing axis [<a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html">documentation</a>]
* `hstack` and `vstack`: stack arrays horizontally or vertically

Whenever we call these, we need to use whatever name we imported numpy as (here, `np`).

In [None]:
# When using linspace, both end points are included!
np.linspace(0,147,10)
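
A brief sketch of the other functions listed above, using small made-up inputs. The variable names are illustrative only, and `np.random.default_rng` is just one of the several ways to draw random numbers.

```python
import numpy as np

# arange excludes the stop value and accepts non-integer steps
steps = np.arange(0, 1, 0.25)

# linspace includes both end points; the third argument is the number of points
points = np.linspace(0, 1, 5)

# logspace takes exponents: 4 points from 10**0 to 10**3
powers = np.logspace(0, 3, 4)

# Joining and stacking two hypothetical 1D arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
joined = np.concatenate([a, b])    # one 1D array of length 6
side_by_side = np.hstack([a, b])   # same result for 1D inputs
stacked = np.vstack([a, b])        # 2 rows x 3 columns

# Three random floats in [0, 1); the seed is arbitrary and only makes the run repeatable
rng = np.random.default_rng(seed=0)
random_vals = rng.random(3)

print(steps, points, powers)
print(joined, side_by_side)
print(stacked)
print(random_vals)
```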

Numpy also has built-in functions to save and load arrays: `np.save()` and `np.load()`. Numpy files have a `.npy` extension.

See full documentation <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html">here</a>.

In [None]:
# np.save takes the arguments 'filename' and then 'array'; '.npy' is appended to the filename if it is missing
np.save('matrix',my_matrix)

In [None]:
my_new_matrix = np.load('matrix.npy')
my_new_matrix
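
As a self-contained sketch of the save/load round trip, the example below writes a made-up array to a temporary directory (so it does not leave files behind) and checks that the loaded copy matches. The file name `example_matrix` is hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical array to save
original = np.array([[1, 2, 3],
                     [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_matrix')   # no extension given
    np.save(path, original)                          # written as example_matrix.npy
    restored = np.load(path + '.npy')                # load needs the full file name

# The values and the data type both survive the round trip
print(np.array_equal(original, restored))
print(restored.dtype == original.dtype)
```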

## Pandas

Pandas is a useful module that creates **dataframes** (think of these like Excel spreadsheets, but much faster!). We can think of Pandas as "numpy with labels".

### Benefits of Pandas
* Great for real-world, heterogeneous data
* Similar to Excel spreadsheets (but way faster!)
* Smartly deals with missing data

The two data structures of Pandas are the `Series` and the `DataFrame`. A `Series` is a one-dimensional object similar to a list. A `DataFrame` can be thought of as a two-dimensional numpy array or a collection of `Series` objects. Unlike a numpy array, a dataframe can hold a different data type in each column (such as integers, strings, and floats), similar to an Excel spreadsheet. Pandas also supports string labels for rows and columns, unlike numpy arrays, whose rows and columns are identified only by integer position. For a more in-depth explanation, please visit the [Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) section in the Pandas User Guide.

You can create a Pandas dataframe by inputting dictionaries into the Pandas function `pd.DataFrame()`, by reading files, or through functions built into the Pandas package. The function [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) reads a delimited text file (comma-separated by default; other delimiters, such as tabs, can be set with the `delimiter` argument) and returns it as a `DataFrame`.
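
Here is a minimal sketch of building both data structures by hand. All of the names and values (`scores`, `toy_df`, `geneA`, `area_1`, and so on) are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values with (optionally string) labels
scores = pd.Series([0.5, -1.2, 0.3],
                   index=['geneA', 'geneB', 'geneC'],
                   name='area_1')

# A DataFrame from a dictionary: each key becomes a column
toy_df = pd.DataFrame({'gene': ['geneA', 'geneB', 'geneC'],
                       'area_1': [0.5, -1.2, 0.3],
                       'area_2': [1.1, 0.0, -0.7]})

print(scores['geneB'])          # look up a value by its label
print(toy_df.dtypes)            # each column has its own data type
print(type(toy_df['area_1']))   # a single column is itself a Series
```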


## Loading in data as a dataframe
Below we will create a dataframe by reading the file `brainarea_vs_genes_exp_w_reannotations.tsv`, which contains information on gene expression across multiple brain areas.

>**About this dataset:**
This dataset was created by Derek Howard and Abigail Mayes for the purpose of accelerating advances in data mining of open brain transcriptome data for polygenetic brain disorders. The data come from normalized microarray datasets of gene expression from 6 adult human brains that were released by the Allen Brain Institute and then processed into the dataframe we will see below. For more information on this dataset, please visit the <a href = "https://github.com/derekhoward/HBAsets"> HBAsets repository</a>.


In [29]:
# Read in the file as a data frame
file_name = 'Data/brainarea_vs_genes_exp_w_reannotations.tsv'

# Use pd.read_csv
gene_df = pd.read_csv(file_name,delimiter='\t')

# '.head()' returns the first 5 rows in the dataframe
gene_df.head()

Unnamed: 0,gene_symbol,CA1 field,CA2 field,CA3 field,CA4 field,"Crus I, lateral hemisphere","Crus I, paravermis","Crus II, lateral hemisphere","Crus II, paravermis",Edinger-Westphal nucleus,...,"temporal pole, inferior aspect","temporal pole, medial aspect","temporal pole, superior aspect",transverse gyri,trochlear nucleus,tuberomammillary nucleus,ventral tegmental area,ventromedial hypothalamic nucleus,vestibular nuclei,zona incerta
0,A1BG,0.856487,-1.773695,-0.678679,-0.986914,0.826986,0.948039,0.935427,1.120774,-1.018554,...,0.27783,0.514923,0.733368,-0.104286,-0.910245,1.03961,-0.155167,-0.444398,-0.901361,-0.23679
1,A1BG-AS1,0.257664,-1.373085,-0.619923,-0.636275,0.362799,0.353296,0.422766,0.346853,-0.812015,...,1.074116,0.821031,1.219272,0.901213,-1.522431,0.598719,-1.709745,-0.054156,-1.695843,-1.155961
2,A1CF,-0.089614,-0.546903,0.282914,-0.528926,0.507916,0.577696,0.647671,0.306824,0.089958,...,-0.030265,-0.187367,-0.428358,-0.465863,-0.136936,1.229487,-0.11068,-0.118175,-0.139776,0.123829
3,A2M,0.552415,-0.635485,-0.954995,-0.259745,-1.687391,-1.756847,-1.640242,-1.73311,-0.091695,...,-0.058505,0.207109,-0.161808,0.18363,0.948098,-0.977692,0.911896,-0.499357,1.469386,0.557998
4,A2ML1,0.758031,1.549857,1.262225,1.33878,-0.289888,-0.407026,-0.358798,-0.589988,0.944684,...,-0.472908,-0.598317,-0.247797,-0.282673,1.396365,0.945043,0.158202,0.572771,0.073088,-0.88678


At the moment, the first column of information above, the **index**, just contains a list of numbers. We can reassign the row labels by using the method `set_index()`. We can choose any column in our present dataframe to provide the row labels. Let's set the row labels to be the `gene_symbol` column and reassign the dataframe.

In [30]:
# Set the index
gene_df = gene_df.set_index('gene_symbol')
gene_df.head()

gene_symbol,CA1 field,CA2 field,CA3 field,CA4 field,"Crus I, lateral hemisphere","Crus I, paravermis","Crus II, lateral hemisphere","Crus II, paravermis",Edinger-Westphal nucleus,Heschl's gyrus,...,"temporal pole, inferior aspect","temporal pole, medial aspect","temporal pole, superior aspect",transverse gyri,trochlear nucleus,tuberomammillary nucleus,ventral tegmental area,ventromedial hypothalamic nucleus,vestibular nuclei,zona incerta
A1BG,0.856487,-1.773695,-0.678679,-0.986914,0.826986,0.948039,0.935427,1.120774,-1.018554,0.170282,...,0.27783,0.514923,0.733368,-0.104286,-0.910245,1.03961,-0.155167,-0.444398,-0.901361,-0.23679
A1BG-AS1,0.257664,-1.373085,-0.619923,-0.636275,0.362799,0.353296,0.422766,0.346853,-0.812015,0.903358,...,1.074116,0.821031,1.219272,0.901213,-1.522431,0.598719,-1.709745,-0.054156,-1.695843,-1.155961
A1CF,-0.089614,-0.546903,0.282914,-0.528926,0.507916,0.577696,0.647671,0.306824,0.089958,0.14982,...,-0.030265,-0.187367,-0.428358,-0.465863,-0.136936,1.229487,-0.11068,-0.118175,-0.139776,0.123829
A2M,0.552415,-0.635485,-0.954995,-0.259745,-1.687391,-1.756847,-1.640242,-1.73311,-0.091695,0.003428,...,-0.058505,0.207109,-0.161808,0.18363,0.948098,-0.977692,0.911896,-0.499357,1.469386,0.557998
A2ML1,0.758031,1.549857,1.262225,1.33878,-0.289888,-0.407026,-0.358798,-0.589988,0.944684,-0.466327,...,-0.472908,-0.598317,-0.247797,-0.282673,1.396365,0.945043,0.158202,0.572771,0.073088,-0.88678
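
The same two steps (reading a tab-separated file, then setting the index) can be sketched without the data file by reading from an in-memory string. The contents of `tsv_text` are made up. Note that `set_index()` returns a new dataframe rather than changing the original, which is why the result is assigned to a variable.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a .tsv file
tsv_text = "gene_symbol\tarea_1\tarea_2\ngeneA\t0.5\t1.1\ngeneB\t-1.2\t0.0\n"

# io.StringIO lets read_csv treat the string like an open file
toy_df = pd.read_csv(io.StringIO(tsv_text), delimiter='\t')
print(list(toy_df.index))      # default index: 0, 1, ...

# set_index returns a new dataframe whose row labels come from the chosen column
indexed = toy_df.set_index('gene_symbol')
print(list(indexed.index))
print(list(indexed.columns))   # 'gene_symbol' is no longer a regular column
```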


## Indexing

Indexing in Pandas works slightly differently than in NumPy. Similar to a dictionary, we can index dataframes by their row and column labels.

* To index a single location, use `dataframe.loc[row_label,column_label]`.
* To index an individual column, use the shorthand `dataframe[column_label]`.
* To index an individual row by its label, use `dataframe.loc[row_label]`.
* To index a row by its integer position, use `dataframe.iloc[index_number]`.

Below are some examples of how to access rows and columns in our dataframe. For more information on indexing dataframes, visit the <a href = "https://pandas.pydata.org/docs/user_guide/indexing.html#indexing"> "Indexing and selecting data"</a> section in the Pandas User Guide.

In [31]:
# Get 'DISC1' data using loc
DISC1_data = gene_df.loc['DISC1']
DISC1_data

CA1 field                            0.102347
CA2 field                           -0.035143
CA3 field                           -0.140160
CA4 field                            0.377563
Crus I, lateral hemisphere          -1.288241
                                       ...   
tuberomammillary nucleus            -0.389785
ventral tegmental area               1.393981
ventromedial hypothalamic nucleus    0.269831
vestibular nuclei                    1.381197
zona incerta                         1.515921
Name: DISC1, Length: 232, dtype: float64

In [32]:
gene_df['CA1 field']

gene_symbol
A1BG        0.856487
A1BG-AS1    0.257664
A1CF       -0.089614
A2M         0.552415
A2ML1       0.758031
              ...   
ZYG11A     -0.496398
ZYG11B     -0.856866
ZYX        -1.941816
ZZEF1      -0.015748
ZZZ3       -0.924901
Name: CA1 field, Length: 20869, dtype: float64
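
For a compact view of all of these indexing forms side by side, here is a sketch on a tiny made-up dataframe (the labels `geneA`, `area_1`, etc. are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe with string row labels
toy_df = pd.DataFrame({'area_1': [0.5, -1.2, 0.3],
                       'area_2': [1.1, 0.0, -0.7]},
                      index=['geneA', 'geneB', 'geneC'])

single_value = toy_df.loc['geneB', 'area_2']   # one row label, one column label
column = toy_df['area_1']                      # a whole column, as a Series
row_by_label = toy_df.loc['geneC']             # a whole row, selected by label
row_by_position = toy_df.iloc[2]               # the same row, selected by position

print(single_value)
print(column)
print(row_by_label)
print(row_by_position)
```

Here `loc['geneC']` and `iloc[2]` return the same row, because `geneC` is the third row label.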

Pandas has many, many useful methods that you can use on your data, including `describe`, `mean`, and more. To learn more about all the different methods that can be used to manipulate and analyze dataframes, please visit the <a href = "https://pandas.pydata.org/docs/user_guide/index.html"> Pandas User Guide </a>.
* The `describe` method returns descriptive statistics for the numeric columns in our dataframe.
* The `mean` and `std` methods return the mean and standard deviation of each column in the dataframe, respectively. When called on a `Series`, they return a single value.

In [33]:
# Try out describe
gene_df.describe()

Unnamed: 0,CA1 field,CA2 field,CA3 field,CA4 field,"Crus I, lateral hemisphere","Crus I, paravermis","Crus II, lateral hemisphere","Crus II, paravermis",Edinger-Westphal nucleus,Heschl's gyrus,...,"temporal pole, inferior aspect","temporal pole, medial aspect","temporal pole, superior aspect",transverse gyri,trochlear nucleus,tuberomammillary nucleus,ventral tegmental area,ventromedial hypothalamic nucleus,vestibular nuclei,zona incerta
count,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,...,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0,20869.0
mean,0.003664,0.017002,0.015315,-0.016633,0.093686,0.08881,0.096118,0.087608,0.047124,-0.042517,...,-0.05605,-0.051731,-0.049905,-0.039528,0.059726,0.014856,0.009535,-0.002853,-0.002527,0.013018
std,0.924456,1.129368,1.078987,0.897192,1.146146,1.118501,1.172986,1.136823,0.973224,0.500526,...,0.56795,0.651098,0.636729,0.494012,1.158916,0.897387,0.68632,0.830602,0.723977,0.72577
min,-4.076424,-5.923691,-5.994731,-3.971984,-2.739924,-2.662897,-2.908676,-2.864308,-3.671242,-1.666268,...,-1.840486,-2.433961,-2.412614,-1.655962,-6.330275,-3.14149,-1.977225,-3.541112,-2.369304,-2.348784
25%,-0.570475,-0.644093,-0.631248,-0.573605,-0.802651,-0.778933,-0.824404,-0.794683,-0.651691,-0.414882,...,-0.472481,-0.517562,-0.529458,-0.404416,-0.751329,-0.602619,-0.497988,-0.542164,-0.532177,-0.493921
50%,-0.025821,0.011189,0.00614,-0.044566,0.100558,0.098665,0.109358,0.096585,0.02276,-0.057495,...,-0.091029,-0.083971,-0.089572,-0.051758,0.03741,-0.030474,-0.023799,-0.023471,-0.045003,-0.031623
75%,0.561571,0.706946,0.706398,0.547903,0.985159,0.951919,1.007791,0.960813,0.736614,0.312739,...,0.352796,0.412321,0.403382,0.310551,0.844723,0.587382,0.503613,0.50841,0.501191,0.492262
max,7.062717,7.387742,6.413603,7.178692,2.679149,2.717237,2.963899,2.857205,7.55244,2.234669,...,2.199291,2.631498,3.065735,2.238555,6.892682,5.968364,7.267837,6.650673,2.723777,2.845665


In [34]:
# Try out mean on a single gene (std works the same way)
DISC1_data.mean()

-0.006323929904501832
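
The sketch below runs these methods on a small made-up dataframe whose statistics are easy to verify by hand. The column names are hypothetical. Note that pandas computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical dataframe with simple values
toy_df = pd.DataFrame({'area_1': [1.0, 2.0, 3.0],
                       'area_2': [2.0, 4.0, 6.0]})

col_means = toy_df.mean()          # one mean per column
col_stds = toy_df.std()            # one (sample) standard deviation per column
row_means = toy_df.mean(axis=1)    # axis=1 averages across the columns of each row
summary = toy_df.describe()        # count, mean, std, min, quartiles, and max

print(col_means)
print(col_stds)
print(row_means)
print(summary)
```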

<div class="alert alert-success">

**Challenge Task**: Reading Malformed `.csv` Files
    
</div>

`Data/malformed.csv` is a file of comma-separated values, containing the following fields:

|column name|description|type|
|---|---|---|
|`'first'`|first name of person|`str`|
|`'last'`|last name of person|`str`|
|`'weight'`|weight of person (lbs)|`float`|
|`'height'`|height of person (in)|`float`|
|`'geo'`|location of person; comma-separated latitude/longitude|`str`|

Unfortunately, the entries contain errors in the placement of commas (`,`) and quotes (`"`) that cause `pandas`' `read_csv` function to fail when parsing the file with the default settings:

In [None]:
pd.read_csv('Data/malformed.csv')

As a result, instead of using `pd.read_csv`, you must read in the file manually using Python's built-in `open` function.

### Using the open function
The built-in function [`open`](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) takes in a file path and returns a file object (sometimes called a file handle), which we can then iterate over:

In [None]:
with open('Data/malformed.csv') as fh:
    for line in fh:
        print(line)

Below, complete the implementation of the function `parse_malformed`, which takes in a file path (`fp`) and returns a parsed, properly-typed DataFrame with the information in the corresponding file. For example, `fp` may be `'Data/malformed.csv'`. The DataFrame should contain the columns described in the data description table above (with the specified types).

**Note:**
* The only kinds of issues you need your function to handle are comma and quote misplacements; don't try to find any other issues with the CSV.
* With that said, you should assume that `Data/malformed.csv` is a sample of a larger file that has the same sorts of errors, but potentially in different lines. For example, `Data/malformed.csv` has an unnecessary quote `"` in line 4, but your function may be called on another CSV that has a perfectly fine line 4 but an unnecessary quote on some other line.
* So, **don't** implement `parse_malformed` assuming that the commas and quotes are mispositioned on specific lines; rather, implement `parse_malformed` such that it can handle these issues on every single line they appear in.
* A good way to proceed is to open `Data/malformed.csv` and look carefully at the comma and quote placements.
* You may want to use `fields = line.strip().split(',')`.
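
To illustrate what the suggested `split` does, here is a sketch on a single made-up, well-formed line. The line below is hypothetical and is not taken from `Data/malformed.csv`, whose contents may look different. Because the `geo` value itself contains a comma, splitting on commas yields more pieces than there are columns.

```python
# Hypothetical, well-formed line; the real file's contents may differ
line = 'Ada,Lovelace,130.0,65.0,"51.5,-0.1"\n'

# strip() removes the trailing newline; split(',') cuts at every comma
fields = line.strip().split(',')
print(len(fields))   # more pieces than the 5 columns in the table
print(fields[4:])    # the 'geo' value has been cut in two, quotes included

# Rejoin the trailing pieces and drop the quotes to recover the 'geo' value
geo = ','.join(fields[4:]).replace('"', '')
print(geo)
```

This is only one possible starting point; a full `parse_malformed` still has to cope with commas and quotes appearing in unexpected places.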

In [9]:
def parse_malformed(fp):
    ...

In [None]:
parse_malformed('Data/malformed.csv')

## Resources
Check out the <a href="https://docs.scipy.org/doc/numpy/user/index.html">NumPy user guide</a> if you ever have a question about a NumPy array!

## About this notebook
This notebook is largely derived from UCSD COGS18 Materials, created by Tom Donoghue & Shannon Ellis, <a href="https://github.com/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb">JR Johansson's Scientific Python Lecture on Numpy</a>, and DSC80 materials.


Want to run this notebook as a slideshow? If you have Python (or Anaconda), follow <a href="http://www.blog.pythonlibrary.org/2018/09/25/creating-presentations-with-jupyter-notebook/">these instructions</a> to set up your computer with the RISE plugin.