# UBC MRI Research Python Workshop 2

## August 22 2017



1. Higher dimension numpy arrays
    * Indexing
    * Slicing
    * Boolean Masks
        * Exercise: Mask one array with another array of the same shape
   
2. Object-oriented programming
    * Writing classes
    * Initializing and manipulating objects
        * Exercise: Create an Image class
        
3. Matplotlib plotting
    * Plotting the object-oriented way
    * Changing plot attributes
    * Subplots
        * Exercise: Plot a 2D image
        
4. Curve fitting
    * Linear transform
        * Exercise: scipy.optimize.curve_fit()
        
5. Pandas
    * Importing and examining dataframes
    * Indexing dataframes
    * Condition indexing
    * Plotting
        * Exercise: Vancouver Open Data Catalogue
    

## Numpy indexing

https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

In [None]:
import numpy as np
arr = np.random.randint(0,100,(8,8))

In [None]:
print(arr)
print(arr.shape)

In [None]:
arr[1,2]

In [None]:
arr[1]

In [None]:
arr[1:6,2]

In [None]:
arr[1:,2]

In [None]:
arr[:,2]

In [None]:
arr[1:6:2,2]

### Slicing summary: (start:stop:step)

In [None]:
arr[1::2,2]

In [None]:
arr[::-1,2]
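
A compact way to check the `start:stop:step` rule is to slice a small 1D array whose values equal their own indices. The sketch below uses a hypothetical array `demo` (not part of the workshop data) so the results are easy to read off:

```python
import numpy as np

# Hypothetical demo array: values equal their own indices
demo = np.arange(10)

print(demo[1:6])     # start=1, stop=6 (exclusive)      -> [1 2 3 4 5]
print(demo[1:6:2])   # every 2nd element from 1 to 5    -> [1 3 5]
print(demo[1::2])    # from 1 to the end, step 2        -> [1 3 5 7 9]
print(demo[::-1])    # negative step reverses the array -> [9 8 7 6 5 4 3 2 1 0]
```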

### Boolean masks

In [None]:
arr == 19

In [None]:
arr > 19

In [None]:
arr[arr > 19]

In [None]:
arr[(arr > 19) & (arr < 50)]

## Exercise: Mask one array with another array of the same shape
* Create two arrays with random digits
* Find all entries in array 1 where array 2 is larger than N
* Take the mean of the result

Options
* Investigate the different ways to create random arrays in numpy
* Take the threshold N as the 90th percentile of array 2
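
One possible approach to this exercise is sketched below. The array names `a1` and `a2`, the 8x8 shape and the seed are assumptions chosen for illustration; `np.percentile` supplies the optional 90th-percentile threshold.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator so the run is repeatable
a1 = rng.integers(0, 10, size=(8, 8))   # random digits 0-9
a2 = rng.integers(0, 10, size=(8, 8))

N = np.percentile(a2, 90)               # threshold: 90th percentile of array 2
mask = a2 > N                           # boolean mask with the same shape as a1

selected = a1[mask]                     # entries of a1 where a2 exceeds N
if selected.size > 0:
    print(selected.mean())
else:
    print("No entries of a2 exceed the threshold")
```

The guard on `selected.size` is there because the 90th percentile of a small array of digits can equal its maximum, in which case the mask selects nothing.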

## Higher dimensional arrays

In [None]:
arr3d = np.random.randint(0,100,(512,512,48))

In [None]:
arr3d.shape

In [None]:
myslice = arr3d[:,:,20]

In [None]:
myslice.shape

In [None]:
myslices = arr3d[:,:,::2]

In [None]:
myslices.shape

In [None]:
arr4d = np.random.randint(0,100,(512,512,48,32))

In [None]:
arr4d.shape

## Exercise: Take the mean across the 4th dimension (temporal averaging)

In [None]:
arr_tempmean = arr4d.mean(axis=3)

In [None]:
arr_tempmean.shape
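
The effect of `axis` is easier to verify on a small array. This sketch uses a hypothetical 4D array `small4d` with a known shape and checks that averaging over the last axis removes that dimension:

```python
import numpy as np

# Hypothetical small 4D array laid out as (x, y, slice, time)
small4d = np.arange(4 * 4 * 3 * 5).reshape(4, 4, 3, 5)

tempmean = small4d.mean(axis=3)   # average over the 4th (time) dimension
print(tempmean.shape)             # (4, 4, 3)

# The first voxel's mean equals the mean of its own time series
print(tempmean[0, 0, 0], small4d[0, 0, 0, :].mean())
```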

## Classes and Objects

Classes are a smart way to organize your code. Instead of repeatedly calling functions in loops, define a class to describe subjects, timepoints, events, etc., and give the class attributes and methods.

Our first class will describe a subject in our study. The class will be called "Subject", and its only attribute will be the subject ID of a given subject.

In [None]:
class Subject():
    pass

In [None]:
sub1 = Subject()

In [None]:
sub1.subID = 'sub001'

Let's give the class some information when an object is first created.

In [None]:
class Subject():
    def __init__(self,subID):
        self.subID = subID
        

`__init__` is a special method that is called automatically when an object is created. The first argument to `__init__` is always "self", which gives a method access to all the attributes of the object. Any other arguments are supplied when the object is created, for example the subject ID in `Subject('01')`.

With this simple class definition, we can create subject objects, pass subject IDs on creation, then access the subject ID on demand.

In [None]:
sub1 = Subject('01')

In [None]:
sub2 = Subject('02')

In [None]:
sub1.subID

In [None]:
sub2.subID

Let's expand the class to add some additional attributes, and a method which modifies those attributes.

In [None]:
class Subject():
    def __init__(self,subID,data,date):
        self.subID = subID
        self.data = data
        self.date = date
        self.isclean = False
        
    def cleandata(self):
        # Set any negative values to zero
        self.data = [ 0 if x<0 else x for x in self.data ]
        self.isclean = True

In [None]:
sub3 = Subject('03',[-2,-1,0,1,2,3,4,5],'2017-08-20')

In [None]:
print(sub3.subID,sub3.data,sub3.date)

In [None]:
sub3.cleandata()

In [None]:
sub3.data

In [None]:
sub3.isclean

Finally, let's do some data validation. 

In the constructor, we'll check whether "data" is a list. If not we'll raise an error.

We'll also convert the "date" string into a date object that Python understands.

Let's imagine that there was a calibration error for all data collected in 2016, so we need to increase all data values by 1 for dates in 2016 but not in 2017. We can add this to the cleandata() method.

In [None]:
from datetime import datetime

class Subject():
    def __init__(self,subID,data,date):
        # Check that data is a list
        if not isinstance(data, list):
            raise TypeError("Argument data must be type 'list'")
        self.subID = subID
        self.data = data
        # Make the date attribute a python date
        self.date = datetime.strptime(date,'%Y-%m-%d')
        self.isclean = False
        
    def cleandata(self):
        # Set any negative values to zero
        self.data = [ 0 if x<0 else x for x in self.data ]
        self.isclean = True
        
        # Recalibrate data if collected in 2016
        if self.date.year == 2016:
            self.data = [x+1 for x in self.data]
            
    

First, let's make a new subject but pass the wrong type of data:

In [None]:
sub4 = Subject('04','-2,-1,0,1,2,3,4,5','2017-08-20')

Now let's create two subjects with identical data, but with acquisition dates in different years

In [None]:
sub5 = Subject('05',[-2,-1,0,1,2,3,4,5],'2016-05-11')
sub6 = Subject('06',[-2,-1,0,1,2,3,4,5],'2017-05-11')

In [None]:
if (not sub5.isclean) and (not sub6.isclean):
    sub5.cleandata()
    sub6.cleandata()

In [None]:
print('Subject {}: {}'.format(sub5.subID,sub5.data))
print('Subject {}: {}'.format(sub6.subID,sub6.data))

## Class inheritance

Classes can inherit from each other. So, you can write a general Subject class that contains all the typical attributes of a research subject, then write a sub-class to customize it for your specific study.

In [None]:
class MRISubject(Subject):
    pass

The above class inherits everything from the Subject class and adds nothing. We can do better! Let's add an attribute that's a list of scans acquired for this subject.

To do this, we need to modify the `__init__` method. If we wanted, we could just write a new definition of `__init__`, but that would lose the work we did in the base class. Instead, we will define a new `__init__` but bring in all the attributes from the base class as well.

Another change we will make is that the new attribute `scans` will be optional. We do this by assigning a default value in the `__init__` definition. When this is done, the user can either set the value of scans themselves or leave it blank.

In [None]:
class MRISubject(Subject):
    def __init__(self,subID,data,date,scans=None):
        # super() gives access to the base class, so this line runs the base class __init__ (setting subID, data, date and isclean)
        super().__init__(subID,data,date)
        self.scans = scans

In [None]:
mrisub = MRISubject('09',[1,2,3,4,5],'2017-01-10',scans=['DTI','3DT1','T2GRASE'])

In [None]:
mrisub.scans

## \*args and \*\*kwargs (optional)
We can generalize function inputs so that we don't have to type out all the inputs to the `__init__` function (or any function) every time. Instead, we can use \*args and \*\*kwargs

```python
def myfunction(*args,**kwargs):
    input1 = args[0]
    ...
    opt_input1 = kwargs['key1']
    ...
```


The single star bundles up all positional arguments into a tuple called `args`. The double star bundles up all keyword arguments into a dictionary called `kwargs` with key:value pairs. The names "args" and "kwargs" are only a convention; any names will work. Let's use this concept to simplify our class inputs.

In [None]:
class MRISubject(Subject):
    def __init__(self,*args,**kwargs):
        # super() gives access to the base class; pass the positional arguments on to its __init__
        super().__init__(*args)
        # scans is optional: .get() returns None if the keyword was not supplied
        self.scans = kwargs.get('scans')

In [None]:
mrisub = MRISubject('09',[1,2,3,4,5],'2017-01-10',scans=['DTI','3DT1','T2GRASE','ASL'])

In [None]:
mrisub.scans
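
To see what the two stars actually collect, here is a minimal sketch with a hypothetical function `show_inputs` (not part of the workshop code). Positional arguments arrive as a tuple and keyword arguments as a dictionary:

```python
def show_inputs(*args, **kwargs):
    """Return the positional and keyword arguments exactly as received."""
    return args, kwargs

args, kwargs = show_inputs('09', [1, 2, 3], scans=['DTI', 'ASL'])
print(type(args).__name__, args)       # tuple ('09', [1, 2, 3])
print(type(kwargs).__name__, kwargs)   # dict {'scans': ['DTI', 'ASL']}

# dict.get() returns None (or a chosen default) when a keyword was not supplied
print(kwargs.get('date'))              # None
```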

## Exercise: Image Object

Create a class that defines a 3D image object. 
* Define a class called something like Image
* Give the class an attribute called "header" that is a dictionary of image properties
* Write a method called "generate_image()" or similar that generates a 3D matrix of random values and assigns it as an attribute
* Write a method called "generate_mask()" that generates a 3D matrix of the same size as your first image. The mask should be all zeros except for a region of ones. Your mask can be simple or complex. Assign the mask as an attribute
* Write a method that takes the mean of the image matrix where mask values are 1

Things to think about:
* Which methods should be run automatically, and which should the user call?
* What other methods can we write?

In [None]:
class Image():
    pass
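
One possible starting point for this exercise is sketched below. Everything beyond the names given in the exercise is an assumption made for illustration: the `shape` key in the header, the block-shaped mask region in the centre, and the method name `masked_mean`.

```python
import numpy as np

class Image():
    def __init__(self, header):
        # header: dictionary of image properties; 'shape' is an assumed key
        self.header = header
        self.generate_image()
        self.generate_mask()

    def generate_image(self):
        # 3D matrix of random values in [0, 1)
        self.image = np.random.random(self.header['shape'])

    def generate_mask(self):
        # All zeros except a block of ones covering the middle half of each axis
        self.mask = np.zeros(self.header['shape'], dtype=int)
        nx, ny, nz = self.header['shape']
        self.mask[nx//4:3*nx//4, ny//4:3*ny//4, nz//4:3*nz//4] = 1

    def masked_mean(self):
        # Mean of the image where the mask equals 1
        return self.image[self.mask == 1].mean()

img = Image({'shape': (16, 16, 8)})
print(img.image.shape, img.mask.sum(), img.masked_mean())
```

Here both generators run automatically in `__init__`; leaving them for the user to call is an equally valid design.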

# Plotting with Matplotlib

There are two ways to interact with Matplotlib: the scripting interface (pyplot) and the object-oriented interface. Both produce the same results and are useful in different situations. This tutorial will mostly use the object-oriented technique since I like it more, but when looking things up online keep in mind that both exist.

There are two main objects in Matplotlib: the figure and the axis. Each figure is a separate image. Each axis contains one or more dataset visualizations. A figure can have any number of axes in it, but each axis belongs to a single figure.

The function `plt.subplots(n)` creates a figure with `n` axes arranged vertically. We'll start with one axis and then make it more complicated.

In [None]:
%matplotlib
from matplotlib import pyplot as plt

First, let's invent some data. Let's make 1000 evenly spaced points between 0 and 4$\pi$ on the x axis, and a cosine function as the y data:

In [None]:
xdata = np.linspace(0,4*np.pi,num=1000)

In [None]:
ydata = np.cos(xdata)

Now, make the figure and axes objects and plot the data

In [None]:
f, ax = plt.subplots(1)

In [None]:
cosline, = ax.plot(xdata,ydata)

So right now we have access to three major objects: the figure (`f`), the axis (`ax`), and the line (`cosline`). We can modify how the plot looks:

In [None]:
cosline.set_color('red')

In [None]:
cosline.set_marker('_')

In [None]:
cosline.set_alpha(0.5)

In [None]:
ax.set_axis_off()

In [None]:
ax.set_axis_on()

In [None]:
f.legend([cosline],['my data'])

Let's start again with a new figure with 3 axes. First, let's generate some random noise to plot alongside the cosine:

In [None]:
ydata2 = np.random.random(1000)/5

In [None]:
f,ax = plt.subplots(3,sharey=True)

Note that `ax` is now an array of axes. We access them with `ax[0]`, `ax[1]` and `ax[2]`.

In [None]:
noisecosline, = ax[0].plot(xdata,ydata+ydata2)

In [None]:
cosline, = ax[1].plot(xdata,ydata,color='orange')

In [None]:
noiseline, = ax[2].plot(xdata,ydata2,color='blue')

In [None]:
ax[0].set_ylim?

We can easily make subplots in different arrangements:

In [None]:
f,ax = plt.subplots(2,2)

Or make more complicated arrangements:

In [None]:
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(8, 8))
gs = gridspec.GridSpec(3, 3)
ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, :2])
ax3 = fig.add_subplot(gs[1:, 2])
ax4 = fig.add_subplot(gs[2, 0])
ax5 = fig.add_subplot(gs[2, 1])

## Exercise: Plot a 2D array with `matshow()`

* Create a 2D or 3D array
* Draw the 2D array (or a slice of the 3D array) with matshow
* Experiment with changing the properties of the plot

Optional:
* Add additional axes with more information
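
A minimal sketch of the first two steps, assuming a random 3D array named `vol` (a hypothetical name) and the object-oriented interface used above. `ax.matshow()` draws a 2D array as an image, and `f.colorbar()` adds a scale:

```python
import numpy as np
from matplotlib import pyplot as plt

vol = np.random.random((64, 64, 10))         # hypothetical 3D array

f, ax = plt.subplots(1)
im = ax.matshow(vol[:, :, 5], cmap='gray')   # draw one slice of the volume
ax.set_title('Slice 5')
f.colorbar(im, ax=ax)                        # add a colour scale for the image
```

From here, methods on `im` and `ax` such as `im.set_cmap()` or `ax.set_axis_off()` are a good place to start experimenting.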

# Curve Fitting

Let's generate an exponential decay, and add some noise to the data.

In [1]:
xdata = np.linspace(1,32,32)

In [2]:
xdata

array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.])

In [3]:
def exp_decay(t,A,T):
    return A*np.exp(-t/T)

In [5]:
exp_decay(xdata,100,20)

array([ 95.12294245,  90.4837418 ,  86.07079764,  81.87307531,
        77.88007831,  74.08182207,  70.46880897,  67.0320046 ,
        63.76281516,  60.65306597,  57.69498104,  54.88116361,
        52.20457768,  49.65853038,  47.23665527,  44.93289641,
        42.74149319,  40.65696597,  38.67410235,  36.78794412,
        34.99377491,  33.28710837,  31.66367694,  30.11942119,
        28.65047969,  27.2531793 ,  25.92402606,  24.65969639,
        23.45702881,  22.31301601,  21.22479738,  20.1896518 ])

Set the "true" values for A and T, then generate some sample data

In [6]:
A = 10
T = 20
ydata = exp_decay(xdata,A,T)

In [7]:
ynoise = np.random.random(32)/3
ydata = ydata + ynoise

In [9]:
%matplotlib notebook

In [11]:
f,ax = plt.subplots()
ax.scatter(xdata,ydata)

<matplotlib.collections.PathCollection at 0x11e4c3048>

## Linear transform fit
If we know the data is exponential, it's quickest to transform the data and do a linear fit.


$$ 
S = Ae^{{-t}/{T}}
$$
$$
\log{S} = \log{A} - \frac{t}{T} 
$$

Therefore, when we plot the log of the signal against time, we get a straight line with

$$
\text{slope} = -\frac{1}{T}, \qquad \text{intercept} = \log{A}
$$

In [12]:
ydatalog = np.log(ydata)

In [13]:
f2,ax2 = plt.subplots()
ax2.scatter(xdata,ydatalog)

<matplotlib.collections.PathCollection at 0x11e8c60b8>

In [15]:
np.polyfit?

In [16]:
m,b = np.polyfit(xdata,ydatalog,1)
m,b

(-0.047725977371510915, 2.3085737075634767)

In [17]:
ax2.plot(xdata,xdata*m+b)

[<matplotlib.lines.Line2D at 0x11eabf6a0>]

Generate some new data based on our measured A and T values

In [18]:
A_measured = np.exp(b)
A_measured

10.060065821707679

In [19]:
T_measured = -1/m
T_measured

20.952949631932114

In [20]:
ydata_linfit = exp_decay(xdata,A_measured,T_measured)

In [21]:
ax.plot(xdata,ydata_linfit)

[<matplotlib.lines.Line2D at 0x11eabfb38>]

## Exercise: curve_fit
Use `scipy.optimize.curve_fit()` to fit the same ydata directly to the exponential decay function that we defined. Plot the fit. Is it better or worse than the linear fit?

In [23]:
from scipy.optimize import curve_fit
curve_fit?

In [26]:
# bounds=(lower, upper): here 5 <= A <= 50 and 10 <= T <= 100
popt, pcov = curve_fit(exp_decay,xdata,ydata,bounds=[(5,10),(50,100)],p0=[10,20])

In [27]:
popt

array([ 10.11938587,  20.76528903])
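
One way to approach the "better or worse" question is to compare the sum of squared residuals of the two fits on the original (untransformed) data. The sketch below is self-contained and regenerates data in the same way as above, but with a fixed seed; the names `sse_lin` and `sse_cf` and the use of `np.random.default_rng` are assumptions for illustration. The square roots of the diagonal of `pcov` give one-standard-deviation uncertainty estimates for the fitted parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(t, A, T):
    return A * np.exp(-t / T)

rng = np.random.default_rng(0)
xdata = np.linspace(1, 32, 32)
ydata = exp_decay(xdata, 10, 20) + rng.random(32) / 3   # true A=10, T=20 plus noise

# Linear fit on log-transformed data
m, b = np.polyfit(xdata, np.log(ydata), 1)
A_lin, T_lin = np.exp(b), -1 / m

# Direct nonlinear least-squares fit
popt, pcov = curve_fit(exp_decay, xdata, ydata, p0=[10, 20])
perr = np.sqrt(np.diag(pcov))   # one-sigma parameter uncertainties

# Sum of squared residuals in the original data space
sse_lin = np.sum((ydata - exp_decay(xdata, A_lin, T_lin)) ** 2)
sse_cf = np.sum((ydata - exp_decay(xdata, *popt)) ** 2)
print(popt, perr)
print(sse_lin, sse_cf)
```

Because `curve_fit` minimizes exactly this quantity, its residual sum should not exceed that of the log-linear fit, although the difference may be small for low-noise data.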

## Exercise: Write an image class that can fit a curve across the time dimension

* Write an image generator method that generates a 4D image that contains an exponential decay timeseries in each voxel. Add some noise to make it realistic
* Keep the mask generator method from last time
* Write a method that computes the decay constant in each masked voxel
* Write a method that computes the mean time constant in masked voxels and assigns it to an attribute
* Write a method that plots some orthogonal slices
* Write a method that plots a histogram of the time constant distribution in masked voxels

Some tips:
* Take one step at a time
* Test frequently
* Use Google! Anything that seems tricky probably has a simple solution

In [None]:
class Image4D(Image):
    pass

## Pandas
[pandas](http://pandas.pydata.org/) is the data analysis package in Python. It provides a [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) object which acts like a spreadsheet. Let's import the package and some data:

In [28]:
import pandas as pd

The Vancouver Police Department publishes crime data through the City of Vancouver's Open Data Catalogue. Let's import the data (prepared and posted at math.ubc.ca/~pwalls) using the `pandas.read_csv()` function:

In [29]:
data = pd.read_csv('http://www.math.ubc.ca/~pwalls/data/van_crime.csv')

Examine the top few lines of the dataframe:

In [30]:
data.head()

|   | TYPE | YEAR | MONTH | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y |
|---|---|---|---|---|---|---|---|
| 0 | Mischief | 2015 | 3 | 26XX E 49TH AVE | Victoria-Fraserview | 496065.581256 | 5452452.0 |
| 1 | Theft from Vehicle | 2015 | 12 | 34XX WILLIAM ST | Hastings-Sunrise | 497850.8008 | 5457933.0 |
| 2 | Theft from Vehicle | 2015 | 4 | 34XX WILLIAM ST | Hastings-Sunrise | 497879.450446 | 5457923.0 |
| 3 | Theft from Vehicle | 2015 | 10 | 34XX WILLIAM ST | Hastings-Sunrise | 497901.62345 | 5457932.0 |
| 4 | Theft from Vehicle | 2015 | 9 | 34XX WILLIAM ST | Hastings-Sunrise | 497921.510576 | 5457932.0 |


Use the `info` method to learn about the columns in the dataframe:

In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45802 entries, 0 to 45801
Data columns (total 7 columns):
TYPE             45802 non-null object
YEAR             45802 non-null int64
MONTH            45802 non-null int64
HUNDRED_BLOCK    45802 non-null object
NEIGHBOURHOOD    41815 non-null object
X                45802 non-null float64
Y                45802 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 2.4+ MB


List the column names, then use the `unique` method on the `TYPE` column to see the different types of crimes in the dataset:

In [32]:
data.columns

Index(['TYPE', 'YEAR', 'MONTH', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD', 'X', 'Y'], dtype='object')

In [34]:
data['TYPE'].unique()

array(['Mischief', 'Theft from Vehicle', 'Other Theft', 'Theft of Vehicle',
       'Break and Enter Residential/Other', 'Offence Against a Person',
       'Homicide', 'Break and Enter Commercial'], dtype=object)

Notice that we select columns using brackets and the column name. There are some crimes that do not include the longitude and latitude coordinates due to privacy. Let's do a query and select the rows where the X coordinate is 0:

In [35]:
data[data['X'] == 0].head(10)

|   | TYPE | YEAR | MONTH | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y |
|---|---|---|---|---|---|---|---|
| 168 | Offence Against a Person | 2015 | 12 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 223 | Offence Against a Person | 2015 | 9 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 225 | Offence Against a Person | 2015 | 2 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 244 | Offence Against a Person | 2015 | 4 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 407 | Offence Against a Person | 2015 | 9 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 677 | Offence Against a Person | 2015 | 10 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 756 | Offence Against a Person | 2015 | 1 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 1047 | Offence Against a Person | 2015 | 5 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 1316 | Offence Against a Person | 2015 | 9 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |
| 1452 | Offence Against a Person | 2015 | 7 | OFFSET TO PROTECT PRIVACY |  | 0.0 | 0.0 |


To access individual cells, we can use the DataFrame indexers `.loc` (by label) or `.iloc` (by integer position):

In [None]:
data.loc[5,'TYPE']

In [None]:
data.iloc[5,0]
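
The difference between the two is easiest to see on a small table whose index labels are not 0, 1, 2. The sketch below uses a hypothetical three-row DataFrame `df` (made-up values, not taken from the crime dataset): `.loc` selects by label, `.iloc` by integer position.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {'TYPE': ['Mischief', 'Other Theft', 'Homicide'],
     'YEAR': [2015, 2015, 2016]},
    index=[10, 20, 30],
)

print(df.loc[20, 'TYPE'])   # row labelled 20, column 'TYPE' -> Other Theft
print(df.iloc[1, 0])        # second row, first column      -> Other Theft

# A boolean condition works inside .loc as well
print(df.loc[df['YEAR'] == 2016, 'TYPE'].tolist())   # ['Homicide']
```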

In [37]:
data[data['X']>0].plot(kind='scatter',x='X',y='Y',s=4,alpha=0.2,figsize=(8,8))


<matplotlib.axes._subplots.AxesSubplot at 0x121cef710>

In [None]:
data[data['X'] > 0].to_csv('van_crime_with_location.csv')

## Exercise: Vancouver Open Data

Choose your own dataset from the [Vancouver Open Data catalogue](http://data.vancouver.ca/datacatalogue/index.htm). Filter and plot the data.

In [38]:
from matplotlib import pyplot

In [None]:
import matplotlib.pyplot