# Project name: 
    The Functional Neuroanatomy of the Human Subthalamic Nucleus
    
    Initial code by Gilles de Hollander; 
    Edited by Steven Miletic and Max Keuken

# Goal of the project: 
    To investigate the internal organisation of the human subthalamic nucleus using histology. The non-demented control tissue has been originally analyzed by Gilles de Hollander. 

# Layout of the Notebook
### 1) Combine and store the data:
    1) import histo data into a HDF5 file that contains the histo data and the STN masks in the folder:
       /home/mkeuken1/data/post_mortem/new_data_format/

### 2) Plot the data:
    2) load in the HDF5 data files using the base.py script. The base.py script loads in the data, sets the resolution but also smooths the data with a number of smoothing kernels (0.15, 0.3, 0.6, 1.2, and 2.4 mm fwhm). The reason why we have such a large number of smoothing kernels is because we ran a simulation to generate a hypothesis figure. Here we noted that if the smoothing kernel was too small the histograms were very noisy, if the smoothing kernel was very large you cannot find small transition zones. So there needed to be a bit of a balance. We decided to run the entire analysis for the 5 different smoothing kernels. The final choice of fwhm to report in the main manuscript was based on consistency accross tissue blocks, the other kernels were placed in the supplements.  

### 3) Statistical analysis of the 27 PCA intensity sectors
    3a) Creating the 27 PCA sectors where for each stain, across the subjects we will test whether they differ from 0
    3b) Doing the actual statistical testing: t-tests which are FDR corrected for multiple comparisons.

### 4) Mixture models based 
    4a1) Creating the feature vectors (intensity and gradient) for the different stains
    4a2) Fit the mixture models to the feature vectors (intensity and gradient)
    4b) Sanity check for the gradient vectors: create same plots for gradient as is done in step 2)
    4c) How many clusters fit the data best?
    4c1) Model selection based on AIC and BIC
    4c2) Model selection based on cross validation within and between
    

### Differences compared to the initial analysis as shown in the thesis by Gilles de Hollander
- The input data contains a few more slices (were initially not included due to naming errors)
- Different amount of samples (now 15% of the data)
Mixture analysis has been changed substantially:

- different number of train / test partitions and number of samples. See "def make_feature_vector" in cluster.py for the updated code.
- treat the BIC and AIC seperately, see section 4c1).

THIS MEANS THAT YOU CANNOT USE THE ORIGINAL CODE ON GITHUB AS THERE ARE DIFFERENT FUNCTIONS USED.  


### 1) Combine and store the data
#### Importing the histological data as well as the masks of the STN and save them into a HDF5 file.
 

In [None]:
############
# What is the exact dataset that we are working with?
############

# The stain data of the following tissue blocks: 13095, 14037, 14051, 14069, 15033, 15035, 15055 
# 
# The specific data files are the processed files that will also be shared via DANS/Figshare. 
# The DANS/Figshare has the following folder structure:
#   Subject ID/
#              stain/
#                    unique stain/
#                                 orig/ (not relevant for this project, the multipage tiff as from the microscope)
#                                 proc/ (these are the files we will use for this project)
#              blockface/               (not relevant for this project)
#              MRI/                     (not relevant for this project)
#
# The stain data in the proc/ folder is aligned to the Blockface space
#
# All stain to blockface registration steps were visually inspected by Anneke Alkemade. If the registration failed, 
#   this stain and slice was excluded. See "exclusion_list.txt" for an overview. 
# 
# For this project the processed .png files (as indicated in the proc.DANS/Figshare folder) were renamed and
#   copied to the following folder:
#      data/STN_Histo/stacked_slides/
#
# How were the files renamed?
#    13095_vglut1_proc_1800_7561_2_blockface.png -> 13095_vglut1_1800_7561.png
#
#  and moved to their respective subjectID folder:
#    data/STN_Histo/stacked_slides/subjectID/
#
############
# Start code
############

# Importing a number of different tools
import re
import pandas
import glob
import h5py
import scipy as sp
from scipy import ndimage
import natsort
import numpy as np
import os

# Find the stains.png images per tissue blocks that have been registered to the blockface images
fns = glob.glob('/home/mkeuken1/data/post_mortem/stacked_slides/*/*')
reg = re.compile('.*/(?P<subject_id>[0-9]{5})_png/(?P<stain>[A-Za-z0-9]+)_(?P<slice>[0-9]+)_[0-9]+_(?P<id>[0-9]+)\.png')

df = pandas.DataFrame([reg.match(fn).groupdict() for fn in fns if reg.match(fn)])
df['subject_id'] = df['subject_id'].astype(int)
df['slice'] = df['slice'].astype(int)
df['fn'] = [fn for fn in fns if reg.match(fn)]
df['id'] = df['id'].astype(int)
# There were a number of stains where there were 2 images. The first image was before the tissue block
#  was moved forwards again during the cutting. The second image was once the cutting continued. We chose
#  to only keep the second image:
df = df.drop_duplicates(['subject_id', 'slice', 'stain'], keep='last')

# The naming conventions of the stains was lower case so rename to match to uppercase
def correct_stain(stain):
    if stain == 'calr':
        return 'CALR'
    
    if stain == 'fer':
        return 'FER'

    if stain == 'gabra3':
        return 'GABRA3'
    
    if stain == 'gad6567':
        return 'GAD6567'
    
    if stain == 'mbp':
        return 'MBP'
    
    if stain == 'parv':
        return 'PARV'    
        
    if stain == 'sert':
        return 'SERT' 
    
    if stain == 'smi32':
        return 'SMI32' 
    
    if stain == 'syn':
        return 'SYN'   
    
    if stain == 'th':
        return 'TH' 
    
    if stain == 'transf':
        return 'TRANSF' 
    
    if stain == 'vglut1':
        return 'VGLUT1'
    
    return stain

df['stain'] = df.stain.map(correct_stain).astype(str)

# Make a data structure that will be used for combining the histo data
df.to_pickle('/home/mkeuken1/data/post_mortem/data.pandas')

# Find the masks of the STN that were based of two raters who parcellated the STN using the PARV and SMI32 stains.
reg3 = re.compile('/home/mkeuken1/data/post_mortem/histo_masks/(?P<subject_id>[0-9]{5})_RegMasks_(?P<rater>[A-Z]+)/(?P<stain>[A-Z0-9a-z_]+)_(?P<slice>[0-9]+)_([0-9]+)_(?P<id>[0-9]+)\.png')

fns = glob.glob('/home/mkeuken1/data/post_mortem/histo_masks/*_RegMasks_*/*_*_*_*.png')

masks = pandas.DataFrame([reg3.match(fn).groupdict() for fn in fns])
masks['fn'] = fns
masks['subject_id'] = masks['subject_id'].astype(int)
masks['slice'] = masks['slice'].astype(int)

masks.set_index(['subject_id', 'slice', 'stain', 'rater'], inplace=True)
masks.sort_index(inplace=True)

masks.to_pickle('/home/mkeuken1/data/post_mortem/masks.pandas')

mask_stains = ['PARV', 'SMI32']
raters_a = ['KH', 'MT']

# There were a few masks missing (either due to not correct saving or skipping), so MCKeuken and AAlkemade parcellated the 
# remaing ones
raters_b = ['MCK', 'AA']

# A for loop that creates the .HDF5 files per tissue block 
for subject_id, d in df.groupby(['subject_id']):
    print subject_id
    
    slices = natsort.natsorted(d.slice.unique())
    
    print slices
    
    stains = natsort.natsorted(d.stain.unique())
    resolution = ndimage.imread(d.fn.iloc[0]).shape

    data_array = np.zeros((len(slices),) + resolution + (len(stains),))
    data_array[:] = np.nan
    
    print 'Storing data'
    for idx, row in d.iterrows():
        
        slice_idx = slices.index(row['slice'])
        stain_idx = stains.index(row['stain'])
        
        data_array[slice_idx, ..., stain_idx] = ndimage.imread(row.fn)
        
    mask_array = np.zeros((len(slices),) + resolution + (4,))
    
    print 'Storing masks'
    for idx, row in masks.ix[subject_id].reset_index().iterrows():
        
        slice_idx = slices.index(row['slice'])
        
        if row.rater in raters_a:
            last_idx = mask_stains.index(row.stain) * 2 + raters_a.index(row.rater)
        else:
            last_idx = mask_stains.index(row.stain) * 2 + raters_b.index(row.rater)
        
        im = ndimage.imread(row.fn)
        mask_array[slice_idx, ..., last_idx] = im > np.percentile(im, 70)
        
        
    print 'Creating HDF5 file'
    p = '/home/mkeuken1/data/post_mortem/new_data_format/%s/' % subject_id
    
    if not os.path.exists(p):
        os.makedirs(p)
    
    new_file = h5py.File(os.path.join(p, 'images.hdf5' % subject_id), )
    new_file.create_dataset('data', data=data_array)
    new_file.create_dataset('mask', data=mask_array.astype(bool))
    new_file.close()
    
    d.to_pickle(os.path.join(p, 'data.pandas'))
    masks.ix[subject_id].reset_index().to_pickle(os.path.join(p, 'masks.pandas'))


### 2) Plot the data:
#### There are two different types of plots that we are going for here. The first type is a plot that displays the intensity histogram of the stain which is combined with a tri-planner view of the STN. This is done per subject and stain. The second type of plot is used to check whether the MRI data aligns with the blockface images, whether the stains align with the blockface images, and finally whether the masks of the STN are located in a plausible location. 

#### It should be noted that we are not using the intensity per pixel but that we smooth the data a bit. Namely with a Gaussian smoothing kernel 0.3mm fwhm. For the original analysis we also used 0.15mm fwhm. 

In [None]:
############
# How does the data look like?
############
# To visualize the data we plot the stacked stains in a tri-planner view. This allows us to check whether there
#   are slices that are still completely misaligned. 
# We also create an intensity histogram to get an initial feeling for how the data distribution looks like.
#
# Given the high resolution of the data and that we are interested in the distribution thoughout the STN we decided 
#   to smooth the data a bit. Either with a [0.15, 0.3, 0.6, 1.2, 2.4] mm fwhm Gaussian kernel. 
#
############
# Start code
############
#
# Importing a number of different tools
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

from pystain import StainDataset
import os
import numpy as np

import seaborn as sns
sns.set_context('poster')
sns.set_style('whitegrid')

# Which tissue blocks are we going to visualize? 
subject_ids = [13095, 14037, 14051, 14069, 15033, 15035, 15055]

# Ensure that the color coding is normalized between the min and max per stain
def cmap_hist(data, bins=None, cmap=plt.cm.hot, vmin=None, vmax=None):
    n, bins, patches = plt.hist(data, bins=bins)
    bin_centers = 0.5 * (bins[:-1] + bins[1:])
    
    if vmin is None:
        vmin = data.min()
    if vmax is None:
        vmax = data.max()

    # scale values to interval [0,1]
    col = (bin_centers - vmin) / vmax

    for c, p in zip(col, patches):
        plt.setp(p, 'facecolor', cmap(c))

# Create the figures per stain, per tissue block, per smoothing kernel [0.15, 0.3, 0.6, 1.2, 2.4].
for subject_id in subject_ids[:]:
    for fwhm in [0.15, 0.3, 0.6, 1.2, 2.4]:
        dataset = StainDataset(subject_id, fwhm=fwhm)
        dataset.get_vminmax((0, 99))

        d = '/home/mkeuken1/data/post_mortem/visualize_stains_v2/%s/' % (subject_id)
        
        if not os.path.exists(d):
            os.makedirs(d) 

        fn = os.path.join(d, 'stains_%s.pdf' % fwhm)
        if os.path.isfile(fn):
            continue
        pdf = PdfPages(fn)
        
        for i, stain in enumerate(dataset.stains):
            print 'Plotting %s' % stain
            plt.figure()
            # thresholded mask area is where at least 3 masks overlay
            data = dataset.smoothed_data.value[dataset.thresholded_mask, i]
            data = data[~np.isnan(data)]
            bins = np.linspace(0, dataset.vmax[i], 100)
            cmap_hist(data, bins, plt.cm.hot, vmin=dataset.vmin[i], vmax=dataset.vmax[i])
            plt.title(stain)
            plt.savefig(pdf, format='pdf')

            plt.close(plt.gcf())

            plt.figure()

            if not os.path.exists(d):
                os.makedirs(d)

            for i, orientation in enumerate(['coronal', 'axial', 'sagittal']):
                for j, q in enumerate([.25, .5, .75]):
                    ax = plt.subplot(3, 3, i + j*3 + 1)
                    slice = dataset.get_proportional_slice(q, orientation)
                    dataset.plot_slice(slice=slice, stain=stain, orientation=orientation, cmap=plt.cm.hot)
                    ax.set_anchor('NW')

            plt.gcf().set_size_inches(20, 20)
            plt.suptitle(stain)
            plt.savefig(pdf, format='pdf')
            plt.close(plt.gcf())

        pdf.close()

/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/14037/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/14037/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/14037/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/14037/images.hdf5
calculating vmin
calculating vmax
/home/mkeuken1/data/post_mortem/new_data_format/14037/images.hdf5
calculating vmin
calculating vmax


### 3) Statistical analysis of the 27 PCA sectors
3a) For each subject the data is collected, masked so that we only have the data in the masks, a two component PCA is run of which the first component is along the dorsal axis, whereas the second component is via the lateral axis. Then in the Y direction, or anterior/posterior axis, the structure is devided into three parts. Afterwards, for the lateral and dorsal PCA components, the line is devided into 3 parts. This is done for each Y slices, resulting in 3x3x3: 27 sectors. 

The data of those 27 sectors are then combined across subjects per stain. 


In [1]:
############
# Is the data uniformly distributed over the STN?
############
#
# To test this question we divide the STN into 27 sectors based on a PCA analysis where we identify the three main 
#   axes which are then each divided into three parts. 
#
# The mean intensity per stain is subtracted of each elipsoid, so that if the data is uniformly distributed each
#   sector would be equal to zero. If there are sectors that have a signal lower than the overall mean these sectors
#   will have a negative value and vice versa for higher signals. 
# 

# Importing a number of different tools
from sklearn.decomposition import PCA
from matplotlib.backends.backend_pdf import PdfPages
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('poster')
sns.set_style('whitegrid')
import pandas
from pystain import StainDataset

# Start code
subject_id = 13095
ds = StainDataset(subject_id)

conversion_matrix = np.array([[0, 0, ds.xy_resolution],
                      [-ds.z_resolution, 0, 0],
                      [0, -ds.xy_resolution, 0]])
results = []

# What are the subject IDs?
subject_ids=[13095, 14037, 14051, 14069, 15033, 15035, 15055]

# What fwhm to use here?
fwhms = [0.15, 0.3, 0.6, 1.2, 2.4]
for fwhm in fwhms:
    for subject_id in subject_ids[:]:
        ds = StainDataset(subject_id, fwhm=fwhm)

    # Get coordinates of mask and bring them to mm
        x, y, z = np.where(ds.thresholded_mask)
        coords = np.column_stack((x, y, z))
        coords_mm = conversion_matrix.dot(coords.T).T
        coords_mm -= coords_mm.mean(0)

    # Fit two components and make sure first axis walks dorsal
    #   and second component lateral
        pca = PCA()
        pca.fit_transform((coords_mm - coords_mm.mean(0))[:, (0, 2)])

        components = pca.components_
        print components

        if components[0, 1] < 0:
            components[0] = -components[0]

        if components[1, 0] < 0:
            components[1] = -components[1]

        print components

        coords_dataframe = pandas.DataFrame(coords_mm, columns=['x_mm', 'y_mm', 'z_mm'])
        coords_dataframe['slice'] = x

        coords_dataframe['pc1'] = components.dot(coords_mm[:, (0, 2)].T)[0, :]
        coords_dataframe['pc2'] = components.dot(coords_mm[:, (0, 2)].T)[1, :]

        coords_dataframe[['pc1_slice_center', 'pc2_slice_center']] = coords_dataframe.groupby(['slice'])[['pc1', 'pc2']].apply(lambda x: x - x.mean())

        coords_dataframe['slice_3'] = pandas.qcut(coords_dataframe.y_mm, 3, labels=['posterior', 'middle', 'anterior'])    

        coords_dataframe['pc1_3'] = coords_dataframe.groupby('slice_3').pc1.apply(lambda d: pandas.qcut(d, 3, labels=['ventral', 'middle', 'dorsal']))
        coords_dataframe['pc2_3'] = coords_dataframe.groupby(['slice_3', 'pc1_3']).pc2.apply(lambda d: pandas.qcut(d, 3, labels=['medial', 'middle', 'lateral']))

        df= pandas.concat((ds.smoothed_dataframe, coords_dataframe), 1)
        tmp = df.pivot_table(index=['pc1_3', 'pc2_3', 'slice_3'], values=ds.stains, aggfunc='mean').copy()
        tmp['subject_id'] = subject_id
        tmp['fwhm'] = fwhm

        results.append(tmp.copy())

df = pandas.concat(results).reset_index().set_index(['subject_id', 'slice_3', 'pc1_3', 'pc2_3'])
df = pandas.melt(df.reset_index(), id_vars=['fwhm', 'subject_id', 'slice_3', 'pc1_3', 'pc2_3'], var_name='stain')
df['value'] = df.groupby(['fwhm', 'subject_id', 'stain']).transform(lambda x: (x - x.mean()) / x.std())




/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
[[-0.98094749 -0.19427308]
 [-0.19427308  0.98094749]]
[[ 0.98094749  0.19427308]
 [ 0.19427308 -0.98094749]]
/home/mkeuken1/data/post_mortem/new_data_format/14037/images.hdf5
[[-0.95611755 -0.29298334]
 [ 0.29298334 -0.95611755]]
[[ 0.95611755  0.29298334]
 [ 0.29298334 -0.95611755]]
/home/mkeuken1/data/post_mortem/new_data_format/14051/images.hdf5
[[-0.78933812 -0.61395874]
 [ 0.61395874 -0.78933812]]
[[ 0.78933812  0.61395874]
 [ 0.61395874 -0.78933812]]
/home/mkeuken1/data/post_mortem/new_data_format/14069/images.hdf5
[[-0.70764237 -0.70657079]
 [ 0.70657079 -0.70764237]]
[[ 0.70764237  0.70657079]
 [ 0.70657079 -0.70764237]]
/home/mkeuken1/data/post_mortem/new_data_format/15033/images.hdf5
[[-0.66358108 -0.74810437]
 [ 0.74810437 -0.66358108]]
[[ 0.66358108  0.74810437]
 [ 0.74810437 -0.66358108]]
/home/mkeuken1/data/post_mortem/new_data_format/15035/

### 3) Statistical analysis of the 27 PCA sectors
#### 3b) For each stain and sector we do a simple t-test to compare whether the intensity values are different from zero. This is corrected for multiple comparisons using a fdr correction, critical p-value of 0.05.

#### The sectors that survive the fdr correction are then plotted on the elipsoid, where red indicates above average intensity, blue indicates below average intensity. 
 

In [2]:
# this first half plots everything in a separate pdf-file for each stain

# # Importing a number of different tools
# from statsmodels.sandbox.stats import multicomp
# from matplotlib import patches
# import scipy as sp
# sns.set_style('white')
# df.stain.unique()

# def plot_ellipse_values(values, ellipse_pars=None, size=(1000, 1000), vmin=None, vmax=None, cmap=plt.cm.coolwarm, **kwargs):

#     ''' values is a n-by-m array'''

#     if ellipse_pars is None:
#         a = 350
#         b = 150
#         x = 500
#         y = 500

#         theta = 45. / 180 * np.pi

#     else:
#         a, b, x, y, theta = ellipse_pars

#     A = a**2 * (np.sin(theta))**2 + b**2 * (np.cos(theta))**2
#     B = 2 * (b**2 - a**2) * np.sin(theta) * np.cos(theta)
#     C = a**2 * np.cos(theta)**2 + b**2 * np.sin(theta)**2
#     D = -2 * A * x - B* y
#     E = -B * x - 2 * C * y
#     F = A* x**2 + B*x*y + C*y**2 - a**2*b**2

#     X,Y = np.meshgrid(np.arange(size[0]), np.arange(size[1]))

#     in_ellipse = A*X**2 + B*X*Y +C*Y**2 + D*X + E*Y +F < 0

#     pc1 = np.array([[np.cos(theta)], [np.sin(theta)]])
#     pc2 = np.array([[np.cos(theta - np.pi/2.)], [np.sin(theta - np.pi/2.)]])

#     pc1_distance = pc1.T.dot(np.array([(X - x).ravel(), (Y - y).ravel()])).reshape(X.shape)
#     pc2_distance = pc2.T.dot(np.array([(X - x).ravel(), (Y - y).ravel()])).reshape(X.shape)

#     pc1_quantile = np.floor((pc1_distance / a + 1 ) / 2. * values.shape[0])
#     pc2_quantile = np.floor((pc2_distance / b + 1 ) / 2. * values.shape[1])

#     im = np.zeros_like(X, dtype=float)

#     for pc1_q in np.arange(values.shape[0]):
#         for pc2_q in np.arange(values.shape[1]):
#             im[in_ellipse * (pc1_quantile == pc1_q) & (pc2_quantile == pc2_q)] = values[pc1_q, pc2_q]

#     im = np.ma.masked_array(im, ~in_ellipse)
#     cax = plt.imshow(im, origin='lower', cmap=cmap, vmin=vmin, vmax=vmax, **kwargs)
#     sns.despine()

#     return cax

# fwhms = [0.15, 0.3, 0.6, 1.2, 2.4]
# for fwhm in fwhms:
#     # What is the output folder for the PCA figures:
#     pca_folder = '/home/mkeuken1/data/post_mortem/visualize_stains_v2/PCA_sectors/fwhm_%s' %fwhm
#     if not os.path.exists(pca_folder):
#         os.makedirs(pca_folder) 

#     # For every stain and sector over the 7 subjects plot the data and test whether it differs from zero:
#     for stain, d in df.loc[df.fwhm==fwhm].groupby(['stain']):
#         fn = '/home/mkeuken1/data/post_mortem/visualize_stains_v2/PCA_sectors/fwhm_%s/%s_big_picture_coolwarm.pdf' %(fwhm, stain)
#         pdf = PdfPages(fn)

#         fig, axes = plt.subplots(nrows=1, ncols=3)

#         for i, (slice, d2) in enumerate(d.groupby('slice_3')):

#             ax = plt.subplot(1, 3, ['anterior', 'middle', 'posterior'].index(slice) + 1)

#             n = d2.groupby(['pc1_3', 'pc2_3']).value.apply(lambda v: len(v)).unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]
#             t = d2.groupby(['pc1_3', 'pc2_3']).value.apply(lambda v: sp.stats.ttest_1samp(v, 0,nan_policy='omit')[0]).unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]
#             p = d2.groupby(['pc1_3', 'pc2_3']).value.apply(lambda v: sp.stats.ttest_1samp(v, 0,nan_policy='omit')[1]).unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]
#             mean = d2.groupby(['pc1_3', 'pc2_3']).value.mean().unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]

#             # FDR: as we are doing 27 seperate t-tests we need to correct for multiple comparisons:
#             p.values[:] = multicomp.fdrcorrection0(p.values.ravel())[1].reshape(3, 3)

#             # Providing some parameters for plotting the figures
#             if i == 1:
#                 a, b, x, y, theta  = 350, 150, 300, 275, 45
#             else:
#                 a, b, x, y, theta  = 300, 125, 300, 275, 45.

#             plot_ellipse_values(t[p<0.05].values, size=(600, 550), ellipse_pars=(a, b, x, y,  theta / 180. * np.pi), vmin=-7, vmax=7, cmap=plt.cm.coolwarm)

#             e1 = patches.Ellipse((x, y), a*2, b*2,
#                              angle=theta, linewidth=2, fill=False, zorder=2)

#             ax.add_patch(e1)

#             plt.xticks([])
#             plt.yticks([])    

#             sns.despine(bottom=True, left=True)

#             print stain
#             print p.values  
#         plt.suptitle(stain, fontsize=24)
#         fig.set_size_inches(15., 4.)
#         pdf.savefig(fig, transparent=True)    
#         pdf.close()



# Plot all stains together in a single figure
# Importing a number of different tools
from statsmodels.sandbox.stats import multicomp
from matplotlib import patches
import scipy as sp
sns.set_style('white')
df.stain.unique()
%matplotlib inline

# gray 'background' of STN instead of white
cmap = matplotlib.colors.LinearSegmentedColormap.from_list('colormap', ['blue', 'lightgray', 'red'])
def plot_ellipse_values(values, ellipse_pars=None, size=(1000, 1000), vmin=None, vmax=None, cmap=plt.cm.coolwarm, ax=None, **kwargs):

    ''' values is a n-by-m array'''

    values[np.isnan(values)] = 0
    if ellipse_pars is None:
        a = 350
        b = 150
        x = 500
        y = 500

        theta = 45. / 180 * np.pi

    else:
        a, b, x, y, theta = ellipse_pars

    A = a**2 * (np.sin(theta))**2 + b**2 * (np.cos(theta))**2
    B = 2 * (b**2 - a**2) * np.sin(theta) * np.cos(theta)
    C = a**2 * np.cos(theta)**2 + b**2 * np.sin(theta)**2
    D = -2 * A * x - B* y
    E = -B * x - 2 * C * y
    F = A* x**2 + B*x*y + C*y**2 - a**2*b**2

    X,Y = np.meshgrid(np.arange(size[0]), np.arange(size[1]))

    in_ellipse = A*X**2 + B*X*Y +C*Y**2 + D*X + E*Y +F < 0

    pc1 = np.array([[np.cos(theta)], [np.sin(theta)]])
    pc2 = np.array([[np.cos(theta - np.pi/2.)], [np.sin(theta - np.pi/2.)]])

    pc1_distance = pc1.T.dot(np.array([(X - x).ravel(), (Y - y).ravel()])).reshape(X.shape)
    pc2_distance = pc2.T.dot(np.array([(X - x).ravel(), (Y - y).ravel()])).reshape(X.shape)

    pc1_quantile = np.floor((pc1_distance / a + 1 ) / 2. * values.shape[0])
    pc2_quantile = np.floor((pc2_distance / b + 1 ) / 2. * values.shape[1])

    im = np.zeros_like(X, dtype=float)

    for pc1_q in np.arange(values.shape[0]):
        for pc2_q in np.arange(values.shape[1]):
            im[in_ellipse * (pc1_quantile == pc1_q) & (pc2_quantile == pc2_q)] = values[pc1_q, pc2_q]

    im = np.ma.masked_array(im, ~in_ellipse)
#     cmap.set_bad('grey')
    if ax is None:
        cax = plt.imshow(im, origin='lower', cmap=cmap, vmin=vmin, vmax=vmax, **kwargs)
    else:
        ax.imshow(im, origin='lower', cmap=cmap, vmin=vmin, vmax=vmax, **kwargs)
        cax = ax
#    sns.despine()

    return cax

fwhms = [0.3, 0.6, 1.2, 2.4]
for fwhm in fwhms:
    # What is the output folder for the PCA figures:
    pca_folder = '/home/mkeuken1/data/post_mortem/visualize_stains_v2/PCA_sectors/fwhm_%s' %fwhm
    if not os.path.exists(pca_folder):
        os.makedirs(pca_folder) 

    fig, axes = plt.subplots(nrows=4, ncols=3*3)
    
    # For every stain and sector over the 7 subjects plot the data and test whether it differs from zero:
    for ii, (stain, d) in enumerate(df.loc[df.fwhm==fwhm].groupby(['stain'])):
        fn = '/home/mkeuken1/data/post_mortem/visualize_stains_v2/PCA_sectors/fwhm_%s/combined_big_picture_coolwarm.pdf' %(fwhm)
#        pdf = PdfPages(fn)
        column_set = int(np.floor(ii/4.))
        row_n = (ii)%4.

        for i, (slice, d2) in enumerate(d.groupby('slice_3')):
            print(row_n, ['anterior', 'middle', 'posterior'].index(slice) + 3*(column_set))
            ax = axes[row_n, ['anterior', 'middle', 'posterior'].index(slice) + 3*(column_set)]
#            ax = plt.subplot(1, 3, ['anterior', 'middle', 'posterior'].index(slice) + 1)

            n = d2.groupby(['pc1_3', 'pc2_3']).value.apply(lambda v: len(v)).unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]
            t = d2.groupby(['pc1_3', 'pc2_3']).value.apply(lambda v: sp.stats.ttest_1samp(v, 0,nan_policy='omit')[0]).unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]
            p = d2.groupby(['pc1_3', 'pc2_3']).value.apply(lambda v: sp.stats.ttest_1samp(v, 0,nan_policy='omit')[1]).unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]
            mean = d2.groupby(['pc1_3', 'pc2_3']).value.mean().unstack(1).ix[['ventral', 'middle', 'dorsal'], ['medial', 'middle', 'lateral']]

            # FDR: as we are doing 27 seperate t-tests we need to correct for multiple comparisons:
            p.values[:] = multicomp.fdrcorrection0(p.values.ravel())[1].reshape(3, 3)

            # Providing some parameters for plotting the figures
            if i == 1:
                a, b, x, y, theta  = 350, 150, 300, 275, 45
            else:
                a, b, x, y, theta  = 300, 125, 300, 275, 45.

            plot_ellipse_values(t[p<0.05].values, size=(600, 550), ellipse_pars=(a, b, x, y,  theta / 180. * np.pi), vmin=-7, vmax=7, cmap=cmap, ax=ax)

            e1 = patches.Ellipse((x, y), a*2, b*2,
                                 angle=theta, linewidth=2, fill=False, zorder=2)

            ax.add_patch(e1)
            ax.set_xticks([])
            ax.set_yticks([])
            
            sns.despine(bottom=True, left=True, right=True)
            
            if slice == 'middle':
                ax.set_title(stain, fontsize=24)
            
            print stain
            print p.values

    fig.set_size_inches(10.*2, 4.*2)
    fig.subplots_adjust(hspace=.275, wspace=0.00, bottom=0.01, left=0.0, top=.95, right=1)
    
    plt.plot([0, 0], [0, 1], color='black', lw=1, transform=plt.gcf().transFigure, clip_on=False)
    plt.plot([0, 1], [1, 1], color='black', lw=1, transform=plt.gcf().transFigure, clip_on=False)
    plt.plot([1, 1], [0, 1], color='black', lw=1, transform=plt.gcf().transFigure, clip_on=False)
    plt.plot([0, 1], [0, 0], color='black', lw=1, transform=plt.gcf().transFigure, clip_on=False)

    plt.plot([1/3., 1/3.], [0, 1], color='black', lw=1, transform=plt.gcf().transFigure, clip_on=False)
    plt.plot([2/3., 2/3.], [0, 1], color='black', lw=1, transform=plt.gcf().transFigure, clip_on=False)
#    fig.savefig(pdf,  format='pdf')#, bbox_inches='tight')
#    pdf.close()

NameError: name 'matplotlib' is not defined

### 4. Mixture analysis
   The plan is to fit a mixture of exGausian distributions to either the intensity or gradient vector data per stain. 
    The idea here is that through model comparison we can then figure out whether the data is better explained by a 
        single exGauss or by a mixture of maximum 6 exGuassians. 

#### 4a1) First we need to create the feature vectors which we are going to use for the model fitting.

In [None]:
# Create feature vectors
# For both the intensity and the gradient vectors
# These feature vectors are used to fit the exGaussian mixture models to. 
# This is done on three different test and training sets which are created in the cluster.py script
#
# Import several functions
import glob
import pystain
from pystain import StainDataset, StainCluster

# Get the data
fns = glob.glob('/home/mkeuken1/data/post_mortem/new_data_format/*')

# Get the subject id
subject_ids = [int(fn.split('/')[-1]) for fn in fns]

# For the subjects and stains create .pkl files (see cluster.py script)
for subject_id in subject_ids:
    for fwhm in [0.15, 0.3, 0.6, 1.2, 2.4]:
        dataset = StainDataset(subject_id, fwhm=fwhm)
        # If the data already exist, just grab it instead of recreating it again and then giving an error
        #  that it already exists:
        if 'data_smoothed_%s_thr_3' % str(fwhm) not in dataset.h5file.keys():
            _ = StainDataset(subject_id, thr=3, fwhm=fwhm)
        if 'data_smoothed_%s_thr_3_gradient_image' % str(fwhm) not in dataset.h5file.keys():
            dataset.get_gradient_images() 
        if 'data_smoothed_%s_thr_3_gradient_image_2D' % str(fwhm) not in dataset.h5file.keys():
            dataset.get_gradient_images_2D()

        cluster = StainCluster(dataset, fwhm=fwhm)
        cluster.make_feature_vector(save_directory_intensity='/home/mkeuken1/data/post_mortem/crossval_intensity_feature_vectors/', 
                                    save_directory_gradient='/home/mkeuken1/data/post_mortem/crossval_gradient_feature_vectors/',
                                    save_directory_gradient_2D='/home/mkeuken1/data/post_mortem/crossval_gradient_2D_feature_vectors/')


/home/mkeuken1/data/post_mortem/new_data_format/15033/images.hdf5
Calculcating gradient of CALR
Calculcating gradient of FER
Calculcating gradient of GABRA3
Calculcating gradient of GAD6567
Calculcating gradient of MBP
Calculcating gradient of PARV
Calculcating gradient of SERT
Calculcating gradient of SMI32
Calculcating gradient of SYN
Calculcating gradient of TH
Calculcating gradient of TRANSF
Calculcating gradient of VGLUT1
Calculcating gradient of CALR
Calculcating gradient of FER
Calculcating gradient of GABRA3
Calculcating gradient of GAD6567
Calculcating gradient of MBP
Calculcating gradient of PARV
Calculcating gradient of SERT
Calculcating gradient of SMI32
Calculcating gradient of SYN
Calculcating gradient of TH
Calculcating gradient of TRANSF
Calculcating gradient of VGLUT1
 *** CALR ***
All slices available for stain CALR!
 *** FER ***
All slices available for stain FER!
 *** GABRA3 ***
All slices available for stain GABRA3!
 *** GAD6567 ***
All slices available for stain G

#### 4a2) Once we have the feature vectors we now want to fit the actual model onto the data and cross validate.
    AIC and BIC will be done on the entire STN mask
    
    Cross validation will be done in two general ways:
    1. within a subject, within a stain
    2. betweeen a subject, within a stain
    
    For each given stain we have three different train and test sets:
    
    CV dataset 1: train on every odd slice, test on every even slice.
    CV dataset 2: train on every even slice and test on every even slice
                  Skip every second slice to decorrelate train and test set
    CV dataset 3: train on every odd slice and test on every odd slice.
                  Skip every second slice to decorrelate train and test set
    For every given train and test set pair we also swap them around. So in total we have 6 different model cross 
                    validations. 
                    
    Other things to note
    - Using the entire dataset was computationally too heavy, therefore we sample 15% of the datapoints evenly spaced
    

In [None]:
# The following script is used to fit the mixture models to the data.
# Note that this script has been ported to a .py file 
# The .py file was saved as fit_ml_cv_all_gradient_intensity_combined.py and run on lisa
#    This means that there are a few pieces of code that are lisa (surfsara.nl) specific.

# Import several functions and set a few style features:
import os
import pandas
import numpy as np
import itertools
from multiprocessing import Pool

import scipy as sp
from scipy import optimize
import glob
import pickle as pkl
import seaborn as sns
import matplotlib.pyplot as plt

# How many processes do you want to run simulataneously on Lisa? The node has 16 cores, it is advised due
#  to mememory issues you dont use all of them at the same time.
n_proc = 14

# Defining a number of functions:    
def exgauss_pdf(x, mu, sigma, nu):

    nu = 1./nu
    p1 = nu / 2. * np.exp((nu/2.)  * (2 * mu + nu * sigma**2. - 2. * x))
    p2 = sp.special.erfc((mu + nu * sigma**2 - x)/ (np.sqrt(2.) * sigma))

    return p1 * p2

def mixed_exgauss_likelihood(x, w, mu, sigma, nu):

    # Create indiviudal
    pdfs = w * exgauss_pdf(x[:, np.newaxis], mu, nu, sigma)
    ll = np.sum(np.log(np.sum(pdfs, 1)))

    if ((np.isnan(ll)) | (ll == np.inf)):
        return -np.inf

    return ll

def input_optimizer(pars, x, n_clusters):

    pars = np.array(pars)

    if np.sum(pars[:n_clusters-1]) > 1:
        return np.inf

    pars = np.insert(pars, n_clusters-1, 1 - np.sum(pars[:n_clusters-1]))

    if np.any(pars[:n_clusters] < 0.05):
        return np.inf

    w = pars[:n_clusters][np.newaxis, :]
    mu = pars[n_clusters:n_clusters*2][np.newaxis, :]
    nu = pars[n_clusters*2:n_clusters*3][np.newaxis, :]
    sigma = pars[n_clusters*3:n_clusters*4][np.newaxis, :]

    return -mixed_exgauss_likelihood(x, w, mu, sigma, nu)

def _fit(input_args, disp=False, popsize=100, **kwargs):

    sp.random.seed()

    x, n_clusters = input_args

    weight_bounds = [(1e-3, 1)] * (n_clusters - 1)
    mu_bounds = [(-1., 2.5)] * n_clusters
    nu_bounds = [(1e-3, 2.5)] * n_clusters
    sigma_bounds = [(1e-3, 2.5)] * n_clusters

    bounds = weight_bounds + mu_bounds + nu_bounds + sigma_bounds

    result = sp.optimize.differential_evolution(input_optimizer, bounds, (x, n_clusters), polish=True, disp=disp, maxiter=500, popsize=popsize, **kwargs)
    result = sp.optimize.minimize(input_optimizer, result.x, (x, n_clusters), method='SLSQP', bounds=bounds, **kwargs)

    return result

class SimpleExgaussMixture(object):

    def __init__(self, data, n_clusters):

        self.data = data
        self.n_clusters = n_clusters
        self.n_parameters = n_clusters * 4 - 1
        self.likelihood = -np.inf

        self.previous_likelihoods = []
        self.previous_pars = []

    def get_likelihood_data(self, data):
        
        return mixed_exgauss_likelihood(data, self.w, self.mu, self.sigma, self.nu)
    
    def get_bic_data(self, data):
        likelihood = self.get_likelihood_data(data)
        return - 2 * likelihood + self.n_parameters * np.log(data.shape[0])
       
    def get_aic_data(self, data):
        likelihood = self.get_likelihood_data(data)
        return 2 * self.n_parameters - 2  * likelihood
    
    def _fit(self, **kwargs):
        return _fit((self.data, self.n_clusters), **kwargs)

    def fit(self, n_tries=1, **kwargs):
        for run in np.arange(n_tries):

            result = self._fit(**kwargs)
            self.previous_likelihoods.append(-result.fun)

            if -result.fun > self.likelihood:

                pars = result.x
                pars = np.insert(pars, self.n_clusters-1, 1 - np.sum(pars[:self.n_clusters-1]))

                self.w = pars[:self.n_clusters][np.newaxis, :]
                self.mu = pars[self.n_clusters:self.n_clusters*2][np.newaxis, :]
                self.nu = pars[self.n_clusters*2:self.n_clusters*3][np.newaxis, :]
                self.sigma = pars[self.n_clusters*3:self.n_clusters*4][np.newaxis, :]

                self.likelihood = -result.fun

        self.aic = 2 * self.n_parameters - 2 * self.likelihood
        self.bic = - 2 * self.likelihood + self.n_parameters * np.log(self.data.shape[0])

    def fit_multiproc(self, n_tries=4, n_proc=4, disp=False):

        pool = Pool(n_proc)

        print 'starting pool'
        results = pool.map(_fit, [(self.data, self.n_clusters)] * n_tries)
        print 'ready'

        print results

        pool.close()

        for result in results:
            self.previous_likelihoods.append(-result.fun)
            self.previous_pars.append(result.x)

            if -result.fun > self.likelihood:

                pars = result.x
                pars = np.insert(pars, self.n_clusters-1, 1 - np.sum(pars[:self.n_clusters-1]))

                self.w = pars[:self.n_clusters][np.newaxis, :]
                self.mu = pars[self.n_clusters:self.n_clusters*2][np.newaxis, :]
                self.nu = pars[self.n_clusters*2:self.n_clusters*3][np.newaxis, :]
                self.sigma = pars[self.n_clusters*3:self.n_clusters*4][np.newaxis, :]

                self.likelihood = -result.fun

        self.aic = 2 * self.n_parameters - 2 * self.likelihood
        self.bic = - 2 * self.likelihood + self.n_parameters * np.log(self.data.shape[0])

    def plot_fit(self):
        # Create indiviudal pds

        t = np.linspace(0, self.data.max(), 100)
        pdfs = self.w * exgauss_pdf(t[:, np.newaxis], self.mu, self.nu, self.sigma)

        sns.distplot(self.data)
        plt.plot(t, pdfs, c='k', alpha=0.5)

        plt.plot(t, np.sum(pdfs, 1), c='k', lw=2)
        
class Scaler(object):
    
    def __init__(self):
        self.min = None
        self.max = None
    
    def fit(self, X):
        self.min = X.min()
        self.max = X.max()
    
    def transform(self, X):
        X -= self.min
        X /= self.max
        
        return X

# Starting the actual fitting:
 
# Select the smoothing kernel
fwhms = [0.15, 0.3, 0.6, 1.2, 2.4]
# Set the number of mixtures
ns = [1,2,3,4,5,6,7,8,9]
# Subject ids
subject_ids = [13095, 14037, 14051, 14069, 15033, 15035, 15055]
# Stains
stains = ['CALR', 'FER', 'GABRA3', 'GAD6567', 'MBP', 'PARV', 'SERT', 'SMI32', 'SYN', 'TH', 'TRANSF', 'VGLUT1']
# data type
data_types = ['intensity', 'gradient_2D']

# PBS array is used on lisa so that you can submit alot of similiar jobs to the cue. We have 2016 jobs to submit 
#    (12 stains * 9 number of clusters * 7 tissue blocks * 5 fwhms * 2 data types)
if 'PBS_ARRAYID' in os.environ.keys():
    PBS_ARRAYID = int(os.environ['PBS_ARRAYID'])
else:
    PBS_ARRAYID = 0

fwhm, n_clusters, stain, subject_id, data_type = list(itertools.product(fwhms, ns, stains, subject_ids, data_types))[PBS_ARRAYID]
print n_clusters, fwhm, stain, subject_id, data_type

results = []
# The different training and test set partitions
partitions = [{'train_name': 'All-Data-In-Mask'},
               {'train_name': 'CV_set1_1', 'test_name': 'CV_set1_2'},
               {'train_name': 'CV_set1_2', 'test_name': 'CV_set1_1'},
               {'train_name': 'CV_set2_1', 'test_name': 'CV_set2_2'},
               {'train_name': 'CV_set2_2', 'test_name': 'CV_set2_1'},
               {'train_name': 'CV_set3_1', 'test_name': 'CV_set3_2'},
               {'train_name': 'CV_set3_2', 'test_name': 'CV_set3_1'}]

for partition in partitions:
    print('Current train set: %s' % partition['train_name'])
   
    # Always fit model to train data
    # Load data
    train_set = pandas.read_pickle(os.path.join(os.environ['HOME'], 
                        'data/post_mortem/crossval_%s_feature_vectors/%s_%s_%s.pkl' %(data_type, subject_id, fwhm, partition['train_name'])))

    # Select stain
    train_set = train_set[stain]

    # Remove nan-values
    train_set = train_set[~pandas.isnull(train_set)]

    # Select reasonable subsample to decrease computational burden.
    # Get random evenly spaced (should be evenly spaced to make sure a representative sub sample
    # of the spatial organisation of the STN is selected) sub sample of 15% to reduce size
    # Getting exactly 15% is tricky because of varying original shapes (sample sizes), but this
    # approach gets the value closest to and minimum of 15%
    step_size = int(train_set.shape[0] / (train_set.shape[0]*.15))
    train_set = train_set[::step_size]

    # The train and test partitions are normalized per partition seperately. In a previous attempt we normalised 
    #   the test partition by the train partition. This however in the end led tot inf values for the the loglikelihood.
    #   Since it could happen that the test partition had values that were outside the min/max of the train partition and 
    #   would therefore get a zero as likelihood -> log(0) == inf.
    scaler = Scaler()
    scaler.fit(train_set)
    train_set = scaler.transform(train_set)

    # Check if model is already saved to disk. If so, load model
    path_name = os.path.join(os.environ['HOME'], 'data', 'post_mortem', 'ml_clusters_cross_validated_%s' %data_type)
    if not os.path.exists(path_name):
        os.makedirs(path_name)
    pickle_fn = os.path.join(os.environ['HOME'], 'data', 'post_mortem', 'ml_clusters_cross_validated_%s' %data_type, '%s_%s_%s_%s_%s.pkl' %(subject_id, fwhm, stain, n_clusters, partition['train_name']))

    if os.path.isfile(pickle_fn):
        # Model was already trained & saved, so load
        with open(pickle_fn, 'r') as f:
            model = pkl.load(f)
    else:
        # Model does not yet exist, so train now
        # Create & fit model on train set
        model = SimpleExgaussMixture(train_set, n_clusters)
        model.fit_multiproc(n_tries=n_proc, n_proc=n_proc)
        pkl.dump(model, open(pickle_fn, 'w'))

    # Append to results
    results.append({'train': partition['train_name'], 'test': partition['train_name'],
                   'll': model.get_likelihood_data(train_set),
                   'aic': model.get_aic_data(train_set),
                   'bic': model.get_bic_data(train_set)})


    # If a test-set is provided, check cross-validation model fit
    if 'test_name' in partition.keys():
        # Load test data
        test_set = pandas.read_pickle(os.path.join(os.environ['HOME'], 'data/post_mortem/crossval_%s_feature_vectors/%s_%s_%s.pkl' %(data_type, subject_id, fwhm, partition['test_name'])))
        
        # select stain
        test_set = test_set[stain]    

        # Remove nan/Nones
        test_set = test_set[~pandas.isnull(test_set)]

        # Subsample
        step_size = int(test_set.shape[0] / (test_set.shape[0]*.15))
        test_set = test_set[::step_size]

        # Scale
        scaler = Scaler()
        scaler.fit(test_set)
        test_set = scaler.transform(test_set)

        # Append to results
        results.append({'train': partition['train_name'], 'test': partition['test_name'],
                       'll': model.get_likelihood_data(test_set),
                       'aic': model.get_aic_data(test_set),
                       'bic': model.get_bic_data(test_set)})

results = pandas.DataFrame(results)
results['subject_id'], results['stain'], results['fwhm'], results['n_clusters'] = subject_id, stain, fwhm, n_clusters
# Saving the data to pandas
results_fn = os.path.join(os.environ['HOME'], 'data', 'post_mortem', 'ml_clusters_cross_validated_%s' %(data_type), '%s_%s_%s_%s_all.pandas' %(subject_id, fwhm, stain, n_clusters))
results.to_pickle(results_fn)

#### 4) Mixture analysis 
#### 4b1) After training and fitting the intensity and gradient vectors we want to plot the gradient vectors
    As a sanity check to see whether the gradients are:
        - not driven by borders
        - large changes in gradients match what we see in the intensity figures
        - the histograms look some what plausible
     we will plot the data in a similar manner as what we did for the intensity pdfs as in section 2.

In [1]:
# Plotting the gradient vector data 
# Plotting the gradients in a similair manner as the intensity pdf

# Import several functions and set a few style features:
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from pystain import StainDataset
import os
import numpy as np
import seaborn as sns
sns.set_context('poster')
sns.set_style('whitegrid')

# Ensure that the color coding is normalized between the min and max per stain
#   We are using the same color range as what we used for the intensity plots in section 2):
def cmap_hist(data, bins=None, cmap=plt.cm.hot, vmin=None, vmax=None):
    n, bins, patches = plt.hist(data, bins=bins)
    bin_centers = 0.5 * (bins[:-1] + bins[1:])
    
    if vmin is None:
        vmin = data.min()
    if vmax is None:
        vmax = data.max()

    # scale values to interval [0,1]
    col = (bin_centers - vmin) / vmax

    for c, p in zip(col, patches):
        plt.setp(p, 'facecolor', cmap(c))

# The code to visualize the data:

# Which tissue blocks are we going to visualize? 
subject_ids = [13095, 14037, 14051, 14069, 15033, 15035, 15055]

# Create the figures per stain, per tissue block, for the all three FWHM kernel:
for gradient_type in ['gradient_2D']: # could also plot 'gradient' here (ie 3D gradient) but we didnt fit these
    for subject_id in subject_ids:
        for fwhm in [0.15, 0.3, 0.6, 1.2, 2.4]:
            dataset = StainDataset(subject_id, fwhm=fwhm)

            d = '/home/mkeuken1/data/post_mortem/visualize_stains_v2/%s/' % (subject_id)

            if not os.path.exists(d):
                os.makedirs(d) 

            fn = os.path.join(d, 'intensity_%s_%s.pdf' % (gradient_type, fwhm))
                
            if os.path.isfile(fn):
                print('%s already plotted' %fn)
                continue

            pdf = PdfPages(fn)

            for i, stain in enumerate(dataset.stains):
                
                print 'Plotting %s' % stain
                plt.figure()
                
                # Get appropriate gradient type from h5File
                if gradient_type == 'gradient':
                    key = 'data_smoothed_%s_thr_%s_gradient_image' % (fwhm, 3)
                elif gradient_type == 'gradient_2D':
                    key = 'data_smoothed_%s_thr_%s_gradient_image_2D' % (fwhm, 3)
                
                # thresholded mask area is where at least 3 masks overlay
                data = dataset.h5file[key].value[dataset.thresholded_mask, i]
                data = data[~np.isnan(data)]
                vmax = np.percentile(data, 99)
                vmin = np.min(data)
                bins = np.linspace(0, vmax, 100)
                cmap_hist(data, bins, plt.cm.hot, vmin=vmin, vmax=vmax)
                plt.title(stain)
                plt.savefig(pdf, format='pdf')

                plt.close(plt.gcf())

                plt.figure()

                if not os.path.exists(d):
                    os.makedirs(d)

                for i, orientation in enumerate(['coronal', 'axial', 'sagittal']):
                    for j, q in enumerate([.25, .5, .75]):
                        ax = plt.subplot(3, 3, i + j*3 + 1)
                        slice = dataset.get_proportional_slice(q, orientation)
                        dataset.plot_slice(slice=slice, stain=stain, gradient=True, orientation=orientation, cmap=plt.cm.hot)
                        ax.set_anchor('NW')

                plt.gcf().set_size_inches(20, 20)
                plt.suptitle(stain)
                plt.savefig(pdf, format='pdf')
                plt.close(plt.gcf())

            pdf.close()
        

/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
/home/mkeuken1/data/post_mortem/visualize_stains_v2/13095/intensity_gradient_2D_0.15.pdf already plotted
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
/home/mkeuken1/data/post_mortem/visualize_stains_v2/13095/intensity_gradient_2D_0.3.pdf already plotted
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
/home/mkeuken1/data/post_mortem/visualize_stains_v2/13095/intensity_gradient_2D_0.6.pdf already plotted
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
/home/mkeuken1/data/post_mortem/visualize_stains_v2/13095/intensity_gradient_2D_1.2.pdf already plotted
/home/mkeuken1/data/post_mortem/new_data_format/13095/images.hdf5
/home/mkeuken1/data/post_mortem/visualize_stains_v2/13095/intensity_gradient_2D_2.4.pdf already plotted
/home/mkeuken1/data/post_mortem/new_data_format/14037/images.hdf5
/home/mkeuken1/data/post_mortem/visualize_stains_v2/14037/intensity_gradient_2D_0.1

#### 4) Mixture analysis 
#### 4b1) Do the exGaussian mixtures actually fit the data?


In [75]:
# Import several functions and set a few style features:
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
import itertools 
import os
import numpy as np
import pickle as pkl
import scipy as sp
sns.set_context('poster')
sns.set_style('ticks')


# It is a bit ugly but the plot_fit function used to generate the figures is not part of the pystain package. So the function 
#  needs to be defined here:
def exgauss_pdf(x, mu, sigma, nu):

    nu = 1./nu
    p1 = nu / 2. * np.exp((nu/2.)  * (2 * mu + nu * sigma**2. - 2. * x))
    p2 = sp.special.erfc((mu + nu * sigma**2 - x)/ (np.sqrt(2.) * sigma))

    return p1 * p2

class SimpleExgaussMixture(object):

    def __init__(self, data, n_clusters):

        self.data = data
        self.n_clusters = n_clusters
        self.n_parameters = n_clusters * 4 - 1
        self.likelihood = -np.inf

        self.previous_likelihoods = []
        self.previous_pars = []

    def get_likelihood_data(self, data):
        
        return mixed_exgauss_likelihood(data, self.w, self.mu, self.sigma, self.nu)
    
    def get_bic_data(self, data):
        likelihood = self.get_likelihood_data(data)
        return - 2 * likelihood + self.n_parameters * np.log(data.shape[0])
       
    def get_aic_data(self, data):
        likelihood = self.get_likelihood_data(data)
        return 2 * self.n_parameters - 2  * likelihood
    
    def _fit(self, **kwargs):
        return _fit((self.data, self.n_clusters), **kwargs)

    def fit(self, n_tries=1, **kwargs):
        for run in np.arange(n_tries):

            result = self._fit(**kwargs)
            self.previous_likelihoods.append(-result.fun)

            if -result.fun > self.likelihood:

                pars = result.x
                pars = np.insert(pars, self.n_clusters-1, 1 - np.sum(pars[:self.n_clusters-1]))

                self.w = pars[:self.n_clusters][np.newaxis, :]
                self.mu = pars[self.n_clusters:self.n_clusters*2][np.newaxis, :]
                self.nu = pars[self.n_clusters*2:self.n_clusters*3][np.newaxis, :]
                self.sigma = pars[self.n_clusters*3:self.n_clusters*4][np.newaxis, :]

                self.likelihood = -result.fun

        self.aic = 2 * self.n_parameters - 2 * self.likelihood
        self.bic = - 2 * self.likelihood + self.n_parameters * np.log(self.data.shape[0])

    def fit_multiproc(self, n_tries=4, n_proc=4, disp=False):

        pool = Pool(n_proc)

        print 'starting pool'
        results = pool.map(_fit, [(self.data, self.n_clusters)] * n_tries)
        print 'ready'

        print results

        pool.close()

        for result in results:
            self.previous_likelihoods.append(-result.fun)
            self.previous_pars.append(result.x)

            if -result.fun > self.likelihood:

                pars = result.x
                pars = np.insert(pars, self.n_clusters-1, 1 - np.sum(pars[:self.n_clusters-1]))

                self.w = pars[:self.n_clusters][np.newaxis, :]
                self.mu = pars[self.n_clusters:self.n_clusters*2][np.newaxis, :]
                self.nu = pars[self.n_clusters*2:self.n_clusters*3][np.newaxis, :]
                self.sigma = pars[self.n_clusters*3:self.n_clusters*4][np.newaxis, :]

                self.likelihood = -result.fun

        self.aic = 2 * self.n_parameters - 2 * self.likelihood
        self.bic = - 2 * self.likelihood + self.n_parameters * np.log(self.data.shape[0])
        
    def plot_fit(self):
        # Create indiviudal pds

        xlim_max = np.percentile(self.data, q=97.5)

        data_plot = self.data[self.data<=xlim_max]
        
        t = np.linspace(0, data_plot.max(), 1000)
        pdfs = self.w * exgauss_pdf(t[:, np.newaxis], self.mu, self.nu, self.sigma)

        sns.distplot(data_plot, kde=False, norm_hist=True)
        plt.plot(t, pdfs, c='k', alpha=0.5, lw=3, )
        plt.plot(t, np.sum(pdfs, 1), c='k', lw=3, linestyle=':')
        plt.xlabel('')#Stain intensity (a.u.)')
        plt.ylabel('')#Density')
        sns.despine()
        plt.gca().set_xlim([0, xlim_max])

# The actual plotting of the data:
# Select the smoothing kernel
fwhms = [0.3, 0.6, 1.2, 2.4]
# Set the number of mixtures
ns = [1,2,3,4,5,6]#,7,8,9]
# Subject ids
subject_ids = [13095, 14037, 14051, 14069, 15033, 15035, 15055]

# Stains
stains = ['CALR', 'FER', 'GABRA3', 'GAD6567', 'MBP', 'PARV', 'SERT', 'SMI32', 'SYN', 'TH', 'TRANSF', 'VGLUT1']

# data type
data_types = ['intensity','gradient_2D']

print len(list(itertools.product(fwhms, ns, stains, subject_ids, data_types)))
all_missing = []

for fwhm in fwhms:
    for subject_id in subject_ids:
        for stain in stains:
            for data_type in data_types:
                
                # load data of this stain,sub,n_clusters,data_type
                fns = ['/home/mkeuken1/data/post_mortem/ml_clusters_cross_validated_%s/%s_%s_%s_' %(data_type, subject_id, fwhm, stain) + str(x) + '_All-Data-In-Mask.pkl' for x in ns]
                
                fns_not_exist = [fn for fn in fns if not os.path.exists(fn)]
                all_missing.append(fns_not_exist)
                
                models = [pkl.load(open(fn)) if os.path.exists(fn) else 0 for fn in fns]
                
                f, ax = plt.subplots(2, 3, sharex=True, sharey=True)
                
                for model_n in range(1, np.max(ns)+1):
                    plt.subplot(2, 3, model_n)
                    
                    if not models[model_n-1] == 0:
                        models[model_n-1].plot_fit()
                        if model_n > 1:
                            plt.title('%d clusters' %model_n)
                        else:
                            plt.title('%d cluster' %model_n)
                            
#                plt.xlabel('staining %s' %stain)
                subj_id_formatted = str(subject_id)[:2] + '-' + str(subject_id)[2:]
                if data_type == 'gradient_2D':
                    dtype_formatted = '%s ' %stain
                    xlab = 'Gradient magnitude (a.u.)'
                else:
                    dtype_formatted = '%s' %stain
                    xlab = 'Immunoreactivity (a.u.)'
                    
                plt.suptitle('%s #%s' %(dtype_formatted, subj_id_formatted))
                plt.gcf().set_size_inches((20*2/3., 10*2/3.))

                plt.tight_layout(rect=[0, 0.03, 1, 0.95])
                plt.gcf().text(0.5, 0.00, xlab, ha='center')
                plt.gcf().text(0.00, 0.5, 'Density', va='center', rotation='vertical')
                
                
                f = plt.gcf()
                pdf = PdfPages('/home/mkeuken1/data/post_mortem/visualize_stains_v2/fits/%s_%s_%s_%s' %(subject_id, fwhm, stain, data_type))
                f.savefig(pdf,  format='pdf', bbox_inches='tight')
                pdf.close()
                plt.close()
#                break
#            break
        break
    break


4032


#### 4) Mixture analysis 
#### 4c) How many clusters fit the data best?

Now that we plotted the gradient vectors and the histograms seem plausible we want to see what the winning model is in terms of number of clusters per stain.
    We will do this in four different ways:
        - bic
        - aic
        - cross validation loglikelihood within tissue block
        
#### 4c1) Getting the winning models based on the AIC and BIC values
   What we want to do is to determine what the number of mixtures fits the data best per stain.
   We have 6 different train and test partitions so it is a bit of a shame if we wouldnt do anything with them.
       What we do in the following part seperately for AIC and BIC values is to
           - calculate mean rank per number of clusters (mean over training datasets, note that there is only a single train dataset)
           - plot them (ensure that the plot actually shows all the datapoints!)

In [None]:
# Getting the mean rank number for the different cluster nubmers based on the AIC and BIC 
# Once we know for a given subject what the prefered number of mixtures is per stain we can plot it. 

# Import several functions and set a few style features:
import pandas
import glob
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
# Select the smoothing kernel
fwhms = [1.2]

for fwhm in fwhms:
    for data_type in ['intensity', 'gradient_2D']:
        # Read all the panda data that have 1-7 clusters
        fns = glob.glob('/home/mkeuken1/data/post_mortem/ml_clusters_cross_validated_%s/pandas_%s/*_all.pandas' %(data_type, str(fwhm)))
        fns = [x for x in fns if int(x.split('_')[3]) < 8]
        df = pandas.concat([pandas.read_pickle(fn) for fn in fns])

        # Select all rows (idx) to use for BIC/AIC model comparison
        idx_bicaic = df['train']=='All-Data-In-Mask'
        dfBICAIC = df[idx_bicaic].copy()

        # Calculate normalized aic and bic scores (for ranking)
        dfBICAIC['aic_rank'] = np.nan 
        dfBICAIC['bic_rank'] = np.nan 

        # Loop over all combinations of subject, stain, training set, plus metric (aic / bic) to calculate ranks
        tmp = dfBICAIC.drop('fwhm', axis=1).pivot_table(index=['subject_id','stain', 'train'], columns=['n_clusters'])
        for metric in ['aic', 'bic']:
            for subj in tmp[metric].reset_index()['subject_id'].unique():
                for stain in tmp[metric].reset_index()['stain'].unique():
                    for train in tmp[metric].reset_index()['train'].unique():
                        tmp[metric + '_rank'].loc[subj].loc[stain].loc[train] = tmp[metric].loc[subj].loc[stain].loc[train].rank()


        # Calculate mean rank per number of clusters based on AIC
        # Reset index and groupby subj id and stain to calculate mean rank per number of clusters (mean over training datasets, note that for 
        #   AIC and BIC we only have a single training set: 15% of the total data in STN mask.)
        tmp2 = tmp.reset_index().copy()
        tmp2 = tmp2.groupby(['subject_id', 'stain']).mean()['aic_rank']
        tmp2 = pandas.melt(tmp2.reset_index(), id_vars=['subject_id', 'stain'], value_vars=[1,2,3,4,5,6], value_name='mean_rank')

        best_models_aic = tmp2.groupby(['subject_id', 'stain'])['mean_rank'].apply(lambda x: np.nanargmin(x)+1)
        best_models_aic = best_models_aic.reset_index(name='winning_model')

        # Calculate mean rank per number of clusters based on BIC
        # Reset index and groupby subj id and stain to calculate mean rank per number of clusters (mean over training datasets, note that for 
        #   AIC and BIC we only have a single training set: 15% of the total data in STN mask.
        tmp3 = tmp.reset_index().copy()
        tmp3 = tmp3.groupby(['subject_id', 'stain']).mean()['bic_rank']
        tmp3 = pandas.melt(tmp3.reset_index(), id_vars=['subject_id', 'stain'], value_vars=[1,2,3,4,5,6], value_name='mean_rank')

        best_models_bic = tmp3.groupby(['subject_id', 'stain'])['mean_rank'].apply(lambda x: np.nanargmin(x)+1)
        best_models_bic = best_models_bic.reset_index(name='winning_model')

        # Creating extra dummy columns and concatenating both AIC and BIC in one df
        best_models_aic['metric'] = 'aic'
        best_models_bic['metric'] = 'bic'
        best_models = pandas.concat([best_models_bic, best_models_aic])

        # Lets now plot the winning cluster per stain per subject for both AIC and BIC seperately. 
        loopdict = [{'name':'aic', 'df': best_models_aic}, {'name': 'bic', 'df': best_models_bic}]
        for metric in loopdict:
            fac = sns.swarmplot('stain', 'winning_model', 'subject_id', metric['df'], split=True, size=10, alpha=1, )
            sns.despine()
            fac.set_ylim((.5, 6.5))
            ylims = fac.get_ylim()
            xs = np.arange(0.5, 11.5, 1)
            fac.vlines(x=xs, ymin=0, ymax=ylims[1], linewidths=1, linestyle='--', alpha=.5)
            fac.set_ylabel('Number of clusters')

            plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

            plt.gcf().set_size_inches(30, 6)
            f = plt.gcf()
            pdf = PdfPages('/home/mkeuken1/data/post_mortem/visualize_stains_v1/model_comparison_clusters/fwhm_%s/'%str(fwhm) + metric['name'] + '_' + data_type )
            f.savefig(pdf,  format='pdf', bbox_inches='tight')
            pdf.close()
            plt.close()


#### 4c2a) Getting the winning models based on the withincross validation
   What we want to do is to determine what the number of mixtures fits the data best per stain.
   We have 6 different train and test partitions so it is a bit of a shame if we wouldnt do anything with them.
       What we do in the following part seperately for within and cross validation values is to
           - calculate mean rank per number of clusters (mean over training datasets)
           - plot them (ensure that the plot actually shows all the datapoints!)

In [None]:
# Getting the mean rank number for the different cluster nubmers based on the cross-validation performance 
# Once we know for a given subject what the prefered number of mixtures is per stain we can plot it. 

# Import several functions and set a few style features:
import pandas
import glob
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
%matplotlib inline

# Select the smoothing kernel
fwhms = [0.3, 0.6, 1.2, 2.4]
max_cluster = 6
consider_clusters = np.arange(1, max_cluster+1).tolist()

for fwhm in fwhms:
    for data_type in ['intensity', 'gradient_2D']:
        # Read all the panda data
        df = pandas.concat([pandas.read_pickle(fn) for fn in glob.glob('/home/mkeuken1/data/post_mortem/ml_clusters_cross_validated_%s/*_%s_*.pandas' %(data_type, str(fwhm)))])

        # Select all rows (idx) to use for BIC/AIC model comparison
        idx_cv = df['train']!=df['test']
        dfcv = df[idx_cv]

        # Calculate normalized cross-validated log likelihood scores (for ranking)
        dfcv['ll_rank'] = np.nan 

        # Loop over all combinations of subject, stain, training set, plus metric (log likelihood) to calculate ranks
        tmp = dfcv.drop('fwhm', axis=1).pivot_table(index=['subject_id','stain', 'train'], columns=['n_clusters'])
        for metric in ['ll']:
            for subj in tmp[metric].reset_index()['subject_id'].unique():
                for stain in tmp[metric].reset_index()['stain'].unique():
                    for train in tmp[metric].reset_index()['train'].unique():
                        tmp[metric + '_rank'].loc[subj].loc[stain].loc[train] = tmp[metric].loc[subj].loc[stain].loc[train].rank()

        # Calculate mean rank per number of clusters based on cross-validation
        # Reset index and groupby subj id and stain to calculate mean rank per number of clusters (mean over training datasets, note that for 
        #   cross validation we have 3 different partitions and 2 directions)
        tmp2 = tmp.reset_index().copy()
        tmp2 = tmp2.groupby(['subject_id', 'stain']).mean()['ll_rank']
        tmp2 = pandas.melt(tmp2.reset_index(), id_vars=['subject_id', 'stain'], value_vars=consider_clusters, value_name='mean_rank')

        # Here, we get the winning models per subject/stain combination. Note that, contrary to the AIC/BIC, 
        # the winning model has the *largest* likelihood (instead of smallest).
        # Therefore we are not using the nanargmin but the nanargmax
        best_models_ll = tmp2.groupby(['subject_id', 'stain'])['mean_rank'].apply(lambda x: np.nanargmax(x)+1)
        best_models_ll = best_models_ll.reset_index(name='winning_model')

        # Creating extra dummy columns and concatenating both AIC and BIC in one df
        best_models_ll['metric'] = 'll_cv'

        # Lets now plot the winning cluster per stain per subject for LL. 
        loopdict = [{'name':'ll_cv', 'df': best_models_ll}]
        for metric in loopdict:
            fac = sns.swarmplot('stain', 'winning_model', 'subject_id', metric['df'], split=True, size=10, alpha=1, )
            sns.despine()
            fac.set_ylim((.5, max_cluster + .5))
            ylims = fac.get_ylim()
            xs = np.arange(0.5, 11.5, 1)
            fac.vlines(x=xs, ymin=0, ymax=ylims[1], linewidths=1, linestyle='--', alpha=.5)
            fac.set_ylabel('Number of clusters')

            plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
            plt.gcf().set_size_inches(30, 6)
            f = plt.gcf()
            pdf = PdfPages('/home/mkeuken1/data/post_mortem/visualize_stains_v2/model_comparison_clusters/fwhm_%s/'%str(fwhm) + metric['name'] + '_' + data_type )
            f.savefig(pdf,  format='pdf', bbox_inches='tight')
            pdf.close()
            plt.close()

#### 4c2b) A sanity check is to see whether the rank orders correlate when switching train and test set.

In [None]:
## Checkout correlations between rank order of different partitionings of the data for cross-validation.
## i.e., check if the rank order obtained by training on CVset1_1 is similar to the rank order obtained by 
## training on CVset1_2.
tmp3 = tmp.reset_index()
cvset11_ranks = tmp3.loc[tmp3['train']=='CV_set1_1']['ll_rank'].values.ravel()
cvset12_ranks = tmp3.loc[tmp3['train']=='CV_set1_2']['ll_rank'].values.ravel()
print(np.corrcoef(cvset11_ranks, cvset12_ranks))
cvset21_ranks = tmp3.loc[tmp3['train']=='CV_set2_1']['ll_rank'].values.ravel()
cvset22_ranks = tmp3.loc[tmp3['train']=='CV_set2_2']['ll_rank'].values.ravel()
print(np.corrcoef(cvset21_ranks, cvset22_ranks))
cvset31_ranks = tmp3.loc[tmp3['train']=='CV_set3_1']['ll_rank'].values.ravel()
cvset32_ranks = tmp3.loc[tmp3['train']=='CV_set3_2']['ll_rank'].values.ravel()
print(np.corrcoef(cvset31_ranks, cvset32_ranks))

# All >0.6, so seems at least reasonable

### 5) What is the proportion that a voxel belongs to a given cluster?

Here we want to know what the proportion is that a voxel belongs to a clusters. What that means is that for the model with a single exGauss, all voxels will belong to that cluster. It becomes a bit more interesting when you have more exGauss distributions fit to the data. In case of a mixture model with 2 different clusters, is it then the case the for most voxels it is clear that they belong to a single cluster (with 95% probability)?

In [None]:
# Loading in the data
fwhms = [0.3, 1.2]
# Set the number of mixtures
ns = [1,2,3,4,5,6]
# Subject ids
subject_ids = [13095, 14037, 14051, 14069, 15033, 15035, 15055]
# Stains
stains = ['CALR', 'FER', 'GABRA3', 'GAD6567', 'MBP', 'PARV', 'SERT', 'SMI32', 'SYN', 'TH', 'TRANSF', 'VGLUT1']
# data type
data_types = ['intensity', 'gradient_2D']
print len(list(itertools.product(fwhms, ns, stains, subject_ids, data_types)))

df = pandas.DataFrame(columns=['data_type', 'test', 'train', 'subject_id', 'stain', 'fwhm', 'n_clusters', 'model'])
for subject_id in subject_ids:
    for data_type in data_types:
        for stain in stains:
            for n_clusters in ns:
                for fwhm in fwhms:
                    for cv_set in ['All-Data-In-Mask']:
                        fn = '/home/mkeuken1/data/post_mortem/ml_clusters_cross_validated_%s/%s_%s_%s_%s_%s.pkl' %(data_type, subject_id, fwhm, stain, n_clusters, cv_set)
                        print(fn)
                        with open(fn, 'r') as f:
                            model = pkl.load(f)
                        df_new_row = pandas.DataFrame({'subject_id': subject_id,
                                                  'fwhm': fwhm,
                                                  'train_set': cv_set,
                                                  'stain': stain,
                                                  'n_clusters': n_clusters,
                                                  'data_type': data_type,
                                                  'fit_object': model}, index=[0])
                        df = pandas.concat([df, df_new_row])
            
        
    

In [None]:
def get_proportion_surely_one_cluster(row):
    
    fit_object = row.fit_object
    pdfs = fit_object.w * exgauss_pdf(fit_object.data[:, np.newaxis], fit_object.mu, fit_object.nu, fit_object.sigma)
    
    return ((pdfs.max(1) / pdfs.sum(1)) > 0.95).mean()

df['Proportion of strong cluster assignments'] = df.apply(get_proportion_surely_one_cluster, 1)
df['Number of clusters'] = df['n_clusters']

In [None]:
# this was the previous plotting style
# #sns.palplot(sns.color_palette('Set3', 12))
# for fwhm in fwhms:
#     for data_type in data_types:
#         df_data_type = df.loc[(df['data_type']==data_type) & (df['fwhm']==fwhm)]
#         tmp = df_data_type.groupby(['subject_id', 'Number of clusters', 'stain']).mean()
#         tmp['dummy'] = 1
#         fac = sns.factorplot('dummy', 'Proportion of strong cluster assignments', 'stain', tmp.reset_index(), col='Number of clusters', col_wrap=2, kind='bar', palette=sns.color_palette('Set3', 12), aspect=4)

#         fac.set_ylabels('')
#         fac.set_axis_labels(x_var='')
#         fac.set_xlabels('')
#         fac.set_xticklabels([])
#         plt.savefig('/home/mkeuken1/data/post_mortem/visualize_stains_v1/voxel_assignment/voxel_assignments_%s_%s.pdf' %(data_type, str(fwhm)))


# updated, better plot below
import seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('whitegrid')

for fwhm in fwhms:
    for data_type in data_types:
        df_data_type = df.loc[(df['data_type']==data_type) & (df['fwhm']==fwhm) & (df['n_clusters']>1)]
        tmp = df_data_type.groupby(['subject_id', 'Number of clusters', 'stain']).mean()
        tmp['dummy'] = 1
        tmp = tmp.reset_index()
        tmp['x'] = tmp['stain'].map(dict(zip(np.unique(tmp.stain), np.arange(len(np.unique(tmp.stain))))))
        tmp['Number of clusters'] = tmp['Number of clusters'].astype(int)

        f, ax = plt.subplots(2, 3, sharex=True, sharey=True)
        
        for i, n_cluster in enumerate(np.arange(2, 7)):
            row_n = np.floor(i/3.)
            col_n = i-row_n*3
            print(row_n, col_n)
            sns.barplot(x='dummy', y='Proportion of strong cluster assignments', hue='stain', 
                        data=tmp.loc[tmp['Number of clusters']==n_cluster], 
                        ax=ax[row_n, col_n], errwidth=2.5, capsize=0.01,
                        linewidth=.5, edgecolor="k")#, legend=False)
            
            ax[row_n, col_n].legend_.remove()
            ax[row_n, col_n].set_ylabel('')
            ax[row_n, col_n].set_xlabel('')
            ax[row_n, col_n].set_xticklabels([])
            ax[row_n, col_n].set_title('%d clusters' %n_cluster)
        ax[-1, -1].axis('off')
        ax[0, 2].legend(loc='center right', bbox_to_anchor=(1., -.60), ncol=2, 
                        columnspacing=.5, handletextpad=.5,
                        fontsize='x-small')

        suptxt = 'Gradient magnitude' if data_type=='gradient_2D' else 'Immunoreactivity'
        plt.tight_layout()
        plt.gcf().text(0.00, 0.5, 'Proportion strong cluster assignments', va='center', rotation='vertical')
        plt.gcf().suptitle(suptxt)
        plt.subplots_adjust(left=.09, top=.85)#, right=.95)

        plt.gcf().set_size_inches(10, 6)
        save_dir = '/home/mkeuken1/data/post_mortem/visualize_stains_v2/voxel_assignment'
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        plt.savefig('/home/mkeuken1/data/post_mortem/visualize_stains_v2/voxel_assignment/voxel_assignments_%s_%s.pdf' %(data_type, str(fwhm)))

#### Everything below is tinkering around to see what's going on