# NumPy & Pandas
In this notebook, we'll encounter two packages for scientific computing in Python: NumPy and Pandas.

**At the end of this notebook, you'll be able to:**
* Install and import packages for Python
* Create NumPy arrays
* Execute methods & access attributes of arrays
* Create & manipulate Pandas dataframes
<hr>

## Importing packages

Before we can use NumPy or Pandas, we need to import them. We can also nickname the modules when we import them.

The convention is to import `numpy` as `np` and `pandas` as `pd`.

In [None]:
# Import packages
import numpy as np
import pandas as pd

# Use the whos 'magic command' to list the variables and modules currently available
%whos

## NumPy

**NumPy** is the fundamental package for scientific computing with Python. It'll allow us to work with bigger datasets more efficiently.

### Creating `numpy` arrays

A numpy **array** is a grid of values which are all the same [data type](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html).

We can create a numpy array in a few different ways:

* from a Python list or tuple
* by using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, `empty`, `zeros`, etc.
* reading data from files
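
As a minimal sketch of the first two options, the example below builds arrays from a list, from a list of lists, and from two of the generator functions. The variable names are made up for illustration and are separate from the exercise cells that follow.

```python
import numpy as np

# From a Python list: a 1D array (vector)
example_vector = np.array([1, 2, 3, 4, 5])

# From a list of lists: a 2D array (matrix) with 3 rows and 3 columns
example_matrix = np.array([[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]])

# From dedicated functions
zeros_array = np.zeros((2, 3))      # 2 rows x 3 columns, filled with 0.0
range_array = np.arange(0, 10, 2)   # start at 0, stop before 10, step by 2

print(example_vector)
print(example_matrix)
print(zeros_array)
print(range_array)
```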

In [None]:
# Create a list
lst = [1,2,3,4,5]

# Make our list into an array
my_matrix = ...

In [None]:
# If we give np.array() a list of lists, it will create a matrix


### Accessing attributes of numpy arrays
We can check the shape and size of an array either by reading its `shape` and `size` attributes, or by calling the `np.shape()` and `np.size()` functions on it.

Other attributes that might be of interest are `ndim` and `dtype`.
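
For illustration, here is a small sketch of these attributes on a made-up 2 x 3 array (`example_matrix` is a hypothetical name, not one of the notebook's variables). Note that attributes are read without parentheses, while the `np.shape()` and `np.size()` functions take the array as an argument.

```python
import numpy as np

example_matrix = np.array([[1, 2, 3],
                           [4, 5, 6]])

# Attributes: no parentheses
print(example_matrix.shape)   # (number of rows, number of columns)
print(example_matrix.size)    # total number of elements
print(example_matrix.ndim)    # number of dimensions
print(example_matrix.dtype)   # data type of the elements

# Equivalent functions: the array is passed in as an argument
print(np.shape(example_matrix))
print(np.size(example_matrix))
```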

In [None]:
# Check the dimensions of vector

# Check the dimensions of matrix


Array data type is decided upon creation of the array.

You can explicitly define the data type by passing the `dtype=` argument to `np.array()`. You can set the dtype to be `int`, `float`, `complex`, `bool`, `object`, etc.
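
A short sketch of how `dtype=` changes the result, using made-up values: without it, NumPy infers the type from the data; with it, the values are converted on creation.

```python
import numpy as np

# dtype inferred from the values (an integer type)
inferred = np.array([1, 2, 3])

# Same values, stored as floats
as_float = np.array([1, 2, 3], dtype=float)

# Converted to booleans: 0 becomes False, any other number becomes True
as_bool = np.array([0, 1, 2], dtype=bool)

print(inferred.dtype, as_float.dtype, as_bool.dtype)
print(as_float)
print(as_bool)
```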

In [None]:
# Use dtype here

### Indexing & slicing arrays

Indexing and slicing 1D arrays (vectors) is similar to indexing lists.

You can index NumPy arrays using `array_name[row,column]` to select a single value. If you omit the column, it will give you the entire row. You can also use `:` in place of either `row` or `column` to indicate you want to return all those values.
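
The three cases described above can be sketched on a small made-up array (`grid` is a hypothetical name). Remember that rows and columns are counted from 0.

```python
import numpy as np

grid = np.array([[10, 11, 12],
                 [20, 21, 22],
                 [30, 31, 32]])

single_value = grid[1, 2]   # row 1, column 2
second_row = grid[1]        # column omitted: the entire row 1
first_column = grid[:, 0]   # ':' for the rows: every row of column 0

print(single_value)
print(second_row)
print(first_column)
```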

<div class="alert alert-success">

**Task**: Create an array of booleans called `bool_array` that is 2 rows x 3 columns. Access the `shape` and `ndim` attributes to confirm its size, and the `dtype` attribute to confirm that it is boolean.

</div>

In [None]:
# Your code here


In [None]:
my_matrix[[0,2],[0,2]]

You may want to look at a slice of columns or a slice of rows. You can slice your array like the following: `array[start_row:stop_row, start_col:stop_col]`. As with lists, the stop index is not included.
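
As a sketch on a made-up 3 x 4 array, the example below contrasts slicing, which returns a rectangular block, with indexing by two lists, which pairs the lists up and returns individual elements.

```python
import numpy as np

# 3 rows x 4 columns holding the numbers 0 through 11
grid = np.arange(12).reshape(3, 4)

# Slicing: rows 0-1 and columns 1-2 (stop indices are excluded)
block = grid[0:2, 1:3]

# Indexing with two lists: picks the elements at (0, 0) and (2, 2)
picked = grid[[0, 2], [0, 2]]

print(grid)
print(block)
print(picked)
```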

In [None]:
# Look at the first 3 columns of each row 
my_matrix[: ,0:3]

### Subsetting arrays
We can also subset our original array to only include data that meets our criteria. We can think of this as **subsetting** the array by applying a condition to it. The syntax for this is `new_array = original_array[condition]`. This is essentially indexing the array with an array of Booleans: only the values where the condition is `True` are kept.

In [None]:
my_array = np.array([lst, lst])

# Return only values greater than 3 from our array 
condition = (my_array > 3)
filtered_array = my_array[condition]
print(filtered_array)

<div class="alert alert-success">

**Task**: Subset `bool_array` so that it is only the true values.

</div>

We can also change values in an array similar to how we would change values in a list.
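
A minimal sketch of assignment on a made-up array: a single position can be assigned with `array[row, column] = value`, and a Boolean condition can be used to assign to every matching value at once.

```python
import numpy as np

values = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Change the value in row 0, column 1
values[0, 1] = 20

# Change every value greater than 4 to 0
values[values > 4] = 0

print(values)
```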

In [None]:
# Assign a value to an index in my_matrix


### Benefits of using arrays
In addition to being less clunky & a bit faster than lists of lists, arrays can do a lot of things that lists can't. For example, adding or multiplying two arrays works element by element, whereas adding two lists just joins them end to end. We can also use the `sum` method to sum across a specific axis.

In [None]:
# Demonstrate differences between lists and arrays
sum_list = [1,3,5] + [3,5,7]
sum_array = np.array([1,3,5]) + np.array([3,5,7])
mult_array = np.array([1,3,5]) * np.array([3,5,7])

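print(sum_list)    # lists are joined end to end
print(sum_array)   # arrays are added element by element
print(mult_array)  # arrays are multiplied element by element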

In [None]:
# Sum over a specific axis
this_array = np.array([[1,3,5],[3,5,7]])
sum_rows = this_array.sum(axis=1)
print(this_array)
print(sum_rows)

### Numpy also includes some very useful array generating functions:

* `arange`: like `range`, but returns a numpy array instead of a `range` object, and can use more than just integers
* `linspace`: creates an array with given start and end points, and a desired number of points
* `logspace`: same as `linspace`, but the points are evenly spaced on a log scale
* `random`: a module that can create arrays of random values (there are <a href="https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html">many different ways to use this</a>)
* `concatenate`: joins two arrays along an existing axis [<a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html">documentation</a>]
* `hstack` and `vstack`: stack arrays horizontally or vertically

Whenever we call these, we need to use whatever name we imported numpy as (here, `np`).
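
The functions listed above can be sketched as follows. The values are made up, and `np.random.default_rng` is just one of several ways to draw random numbers (it assumes NumPy 1.17 or newer).

```python
import numpy as np

steps = np.arange(0, 1, 0.25)          # 0 up to (not including) 1, in steps of 0.25
evenly_spaced = np.linspace(0, 1, 5)   # 5 points from 0 to 1, both ends included
log_spaced = np.logspace(0, 3, 4)      # 4 points from 10**0 to 10**3

# Random values between 0 and 1; the seed makes the draw repeatable
rng = np.random.default_rng(seed=0)
random_values = rng.random(3)

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
joined = np.concatenate([a, b])    # one array of 6 values
side_by_side = np.hstack([a, b])   # for 1D arrays, same result as concatenate
stacked = np.vstack([a, b])        # 2 rows x 3 columns

print(steps)
print(evenly_spaced)
print(log_spaced)
print(random_values)
print(joined)
print(stacked)
```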

In [None]:
# When using linspace, both end points are included!
np.linspace(0,147,10)

Numpy also has built-in functions to save and load arrays: `np.save()` and `np.load()`. Numpy files have a `.npy` extension.

See full documentation <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html">here</a>.

In [None]:
# np.save takes the file name and then the array; '.npy' is added to the name automatically
np.save('matrix',my_matrix)

In [None]:
my_new_matrix = np.load('matrix.npy')
my_new_matrix

## Pandas

Pandas is a useful module that creates **dataframes** (think of these like Excel spreadsheets, but much faster!). We can think of Pandas as "numpy with labels".

### Benefits of Pandas
* Great for real-world, heterogeneous data
* Similar to Excel spreadsheets (but way faster!)
* Smartly deals with missing data

The two data structures of Pandas are the `Series` and the `DataFrame`. A `Series` is a one-dimensional object similar to a list. A `DataFrame` can be thought of as a two-dimensional numpy array or a collection of `Series` objects. Like an Excel spreadsheet, Series and dataframes can contain multiple different data types, such as integers, strings, and floats. Pandas also supports string labels for rows and columns, unlike numpy arrays, which are indexed only by integer position. For a more in-depth explanation, please visit the [Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) section in the Pandas User Guide.

You can create a Pandas dataframe by passing a dictionary to the Pandas function `pd.DataFrame()`, by reading files, or through other functions built into the Pandas package. The function [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) reads a delimited text file and returns it as a `DataFrame`. It expects commas by default; for a tab-separated file, pass `sep='\t'`.
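
As a minimal sketch of both routes, the example below builds one dataframe from a dictionary and reads another from tab-separated text. All names and values are made up, and `io.StringIO` stands in for a file on disk so that the example runs on its own.

```python
import io
import pandas as pd

# From a dictionary: keys become column names, values become the columns
people_df = pd.DataFrame({'name': ['Ada', 'Grace'],
                          'age': [36, 45]})

# From tab-separated text; with a real file, pass its path instead of StringIO
tsv_text = "gene\tvalue\nA\t1.5\nB\t2.5\n"
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(people_df)
print(tsv_df)
```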


## Loading in data as a dataframe
Below we will create a dataframe by reading the file `brainarea_vs_genes_exp_w_reannotations.tsv`, which contains information on gene expression across multiple brain areas.

>**About this dataset:**
This dataset was created by Derek Howard and Abigail Mayes for the purpose of accelerating advances in data mining of open brain transcriptome data for polygenetic brain disorders. The data come from normalized microarray datasets of gene expression from 6 adult human brains that were released by the Allen Brain Institute and then processed into the dataframe we will see below. For more information on this dataset, please visit the <a href = "https://github.com/derekhoward/HBAsets"> HBAsets repository</a>.


In [None]:
# Read in the file as a data frame
file_name = 'Data/brainarea_vs_genes_exp_w_reannotations.tsv'

# Use pd.read_csv
gene_df = ...

# '.head()' returns the first 5 rows in the dataframe
gene_df.head()

At the moment, the leftmost column above, the **index**, just contains a list of numbers. We can reassign the row labels by using the method `set_index()`. We can choose any column in our present dataframe to supply the row labels. Let's set the row labels to the `gene_symbol` column and reassign the dataframe.

In [None]:
# Set the index
gene_df = gene_df.set_index('gene_symbol')
gene_df.head()

## Indexing

Indexing in Pandas works slightly differently than in NumPy. Similar to a dictionary, we can index dataframes by their row and column names.

The syntax depends on what we want to select:

* To index a single location, we use `dataframe.loc[row_label,column_label]`.
* To index an individual column, we use the shorthand syntax `dataframe[column_label]`.
* To index an individual row, we use `dataframe.loc[row_label]`.
* To index by integer position rather than by label, we use `dataframe.iloc[index_number]`.

Below are some examples of how to access rows, columns, and single values in our dataframe. For more information on indexing dataframes, visit the <a href = "https://pandas.pydata.org/docs/user_guide/indexing.html#indexing"> "Indexing and selecting data"</a> section in the Pandas User Guide.
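
To illustrate the four forms of indexing, here is a sketch on a small made-up dataframe. The row and column labels are hypothetical and are not taken from the gene expression dataset.

```python
import pandas as pd

toy_df = pd.DataFrame({'area_1': [0.1, 0.2, 0.3],
                       'area_2': [1.0, 2.0, 3.0]},
                      index=['gene_a', 'gene_b', 'gene_c'])

single_value = toy_df.loc['gene_b', 'area_2']   # one row label, one column label
one_column = toy_df['area_1']                   # a whole column, returned as a Series
one_row = toy_df.loc['gene_c']                  # a whole row, returned as a Series
first_row = toy_df.iloc[0]                      # the row at integer position 0

print(single_value)
print(one_column)
print(one_row)
print(first_row)
```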

In [None]:
# Get 'DISC1' data using loc
DISC1_data = ...

Pandas has many useful methods that you can use on your data, including `describe`, `mean`, and more. To learn more about all the different methods that can be used to manipulate and analyze dataframes, please visit the <a href = "https://pandas.pydata.org/docs/user_guide/index.html"> Pandas User Guide </a>.
* The `describe` method returns descriptive statistics for the numeric columns in our dataframe.
* The `mean` and `std` methods return the mean and standard deviation of each column in the dataframe, respectively.
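
A quick sketch of these methods on a made-up dataframe, so the results are easy to check by hand. Note that `std` in Pandas computes the sample standard deviation (it divides by n - 1) by default.

```python
import pandas as pd

toy_df = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                       'y': [10.0, 20.0, 30.0]})

summary = toy_df.describe()   # count, mean, std, min, quartiles, max per column
column_means = toy_df.mean()  # one mean per column
column_stds = toy_df.std()    # one sample standard deviation per column

print(summary)
print(column_means)
print(column_stds)
```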

In [None]:
# Try out describe


In [None]:
# Try out mean & std


<div class="alert alert-success">

**Challenge Task**: Reading Malformed `.csv` Files
    
</div>

`Data/malformed.csv` is a file of comma-separated values, containing the following fields:

|column name|description|type|
|---|---|---|
|`'first'`|first name of person|`str`|
|`'last'`|last name of person|`str`|
|`'weight'`|weight of person (lbs)|`float`|
|`'height'`|height of person (in)|`float`|
|`'geo'`|location of person; comma-separated latitude/longitude|`str`|

Unfortunately, the entries contain errors in the placement of commas (`,`) and quotes (`"`) that cause `pandas`' `read_csv` function to fail to parse the file with the default settings:

In [None]:
pd.read_csv('Data/malformed.csv')

As a result, instead of using `pd.read_csv`, you must read in the file manually using Python's built-in `open` function.

### Using the open function
The built-in function [`open`](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) takes in a file path and returns a file object (sometimes called a file handle), which we can then iterate over:

In [None]:
with open('Data/malformed.csv') as fh:
    for line in fh:
        print(line)

Below, complete the implementation of the function `parse_malformed`, which takes in a file path (`fp`) and returns a parsed, properly-typed DataFrame with the information in the corresponding file. For example, `fp` may be `'Data/malformed.csv'`. The DataFrame should contain the columns described in the data description table above (with the specified types).

**Note:**
* The only kinds of issues your function needs to handle are comma and quote misplacements; don't try to find any other issues with the CSV.
* With that said, you should assume that `Data/malformed.csv` is a sample of a larger file that has the same sorts of errors, but potentially in different lines. For example, `Data/malformed.csv` has an unnecessary quote `"` in line 4, but your function may be called on another CSV that has a perfectly fine line 4 but an unnecessary quote on some other line.
* So, **don't** implement `parse_malformed` assuming that the commas and quotes are mispositioned on specific lines; rather, implement `parse_malformed` such that it can handle these issues on every single line they appear in.
* A good way to proceed is to open `Data/malformed.csv` and look carefully at the comma and quote placements.
* You may want to use `fields = line.strip().split(',')`
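
To see why splitting on commas needs some care, here is a sketch on a single made-up line. The line is hypothetical and is not taken from `Data/malformed.csv`, so the real file may look different. Because the `geo` field itself contains a comma, `split(',')` breaks it into two pieces, which then have to be joined back together.

```python
# Made-up line for illustration only; not taken from Data/malformed.csv
sample_line = 'Ada,Lovelace,135.0,66.0,"51.5,-0.1"\n'

fields = sample_line.strip().split(',')
print(len(fields))   # more pieces than the 5 columns in the table
print(fields)

# Rejoin everything after the first four fields and drop the quotes
geo = ','.join(fields[4:]).replace('"', '')
print(geo)
```

This only handles the one pattern shown here; it is a starting point for inspecting the file, not a complete `parse_malformed`.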

In [None]:
def parse_malformed(fp):
    ...

## Resources
Check out the <a href="https://docs.scipy.org/doc/numpy/user/index.html">NumPy user guide</a> if you ever have a question about a NumPy array!

## About this notebook
This notebook is largely derived from UCSD COGS18 Materials, created by Tom Donoghue & Shannon Ellis, <a href="https://github.com/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb">JR Johansson's Scientific Python Lecture on Numpy</a>, and DSC80 materials.


Want to run this notebook as a slideshow? If you have Python (or Anaconda), follow <a href="http://www.blog.pythonlibrary.org/2018/09/25/creating-presentations-with-jupyter-notebook/">these instructions</a> to set up your computer with the RISE plugin.