# Visual Computing in the Life Sciences
## Assignment Sheet 2

### Exercise 1 (Producing a Scatterplot Matrix, 25 Points)

In the previous assignment, you wrote a reduced dataset to disk that is limited to the benign and malignant classes and five variables that most strongly distinguish between benign and malignant samples.
This week, you will create and interpret a basic visualization of that data.
In this assignment, your final visualization should be a 5 × 5 matrix whose rows and columns are the
measurements of the variables you selected last week. Diagonal cells visualize how the variables are
distributed; off-diagonal cells visualize the relationship between the values of pairs of variables.
Please proceed in the following steps and submit your final script, the final image, and answers to the
questions:


a) Each diagonal cell should contain two overlaid histograms, one for the benign and one for the
malignant class. In the histogram, variable values should be on the x axis, the frequency of
observing that value in each class should be on the y axis. Use different colors to distinguish
between the classes, and add a legend. Your visual design should make it easy to answer the
following questions (5P for implementation, 1P for justifying choice of colors, 3P for answering
questions):

- For which variable(s) can you find a range of values in which the class of the sample is
certain? Write down the ranges.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D


In [9]:
#The reduced dataset also contains the class column
data=pd.read_excel('reduced_dataset.xlsx')
data.head()
variables=['thickness','uniCelS','uniCelShape','bareNuc','blaChroma']



In [10]:
benign = data[data['class']==2]
malignant = data[data['class']==4]
benign.head()

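| Unnamed: 0 | thickness | uniCelS | uniCelShape | bareNuc | blaChroma | class |
|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1.0 | 3 | 2 |
| 1 | 5 | 4 | 4 | 10.0 | 3 | 2 |
| 2 | 3 | 1 | 1 | 2.0 | 3 | 2 |
| 3 | 6 | 8 | 8 | 4.0 | 3 | 2 |
| 4 | 4 | 1 | 1 | 1.0 | 3 | 2 |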


In [23]:

#Function to create an array of marker sizes, one per point, depending on the relative frequency of its value
def marker_size(variable): 
    """
    variable should be the column of a dataframe for which I wish to create an array of marker sizes.

    Each size is the fraction of rows that share the value of that element, multiplied by 1000.
    """
    #Relative frequency of every distinct value in the column
    freq=variable.value_counts(normalize=True)

    #Look up the frequency of each element and scale it to a marker size
    size_array=(variable.map(freq)*1000).astype(int).tolist()
    return size_array


In [30]:
def draw_histograms(data, variables, nrows, ncols):

    #Creating the figure where the subplots will be
    fig,axes=plt.subplots(nrows,ncols,sharex=True, figsize=(20,20))
    
    #Setting up the colours for the benign and malignant datasets, as well as the datasets themselves.
    c=['#d8b365','#5ab4ac']
    is_benign = (data['class']==2).to_numpy()
    is_malignant = (data['class']==4).to_numpy()
    benign = data[is_benign]
    malignant = data[is_malignant]
    
    for i in range(nrows):
        
        #Naming the x labels along the bottom row and the y labels along the left column
        axes[nrows-1,i].set_xlabel(variables[i])
        axes[i,0].set_ylabel(variables[i])

        for j in range(ncols):
            
            if j==i:
                #Diagonal cell: overlaid histograms of the same variable for both classes
                axes[i,j].hist(benign[variables[i]], bins=20, alpha=0.65, label='Benign',color=c[0])
                axes[i,j].hist(malignant[variables[i]], bins=20, alpha=0.65, label='Malignant',color=c[1])
            
            else:        
                #One marker size per sample, based on the frequency of its x value in the WHOLE dataset
                size=np.array(marker_size(data[variables[j]]))
    
                #Overlapped scatter plots for the benign and malignant datasets. The sizes are
                #split by class so that each size array matches the points it is drawn with.
                axes[i,j].scatter(benign[variables[j]],benign[variables[i]],alpha=0.3,color=c[0],s=size[is_benign])
                axes[i,j].scatter(malignant[variables[j]],malignant[variables[i]],alpha=0.3,color=c[1],s=size[is_malignant])
    
    fig.text(0.5,0.04,'Variables', ha='center', va='center', fontsize=30)
    fig.text(0.05,0.5,'Frequency', ha='center', va='center', rotation='vertical',fontsize=30)
    
    #Legend handles: one marker per class, drawn without a connecting line
    cami = [Line2D([0], [0], marker='o', linestyle='None', color=c[0], label='Benign', markersize=15),
            Line2D([0], [0], marker='o', linestyle='None', color=c[1], label='Malignant', markersize=15)]
    
    fig.legend(handles=cami,loc='upper right',prop={'size': 16})
    plt.show()


In [None]:
draw_histograms(data,variables,5,5)


- Which variable(s) have an almost uniform distribution for the malignant samples?

b) In each non-diagonal cell, display a scatter plot that visualizes the values of the corresponding
pair of variables. Use different colors and opacities so that it is simple to relate these scatter
plots to the density plots on the diagonal, and the size of the marker should reflect the number of
overlapping points. (5P for implementation, 2P for answering questions):
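
One possible way to make the marker size reflect the number of overlapping points is to count how often each exact (x, y) pair occurs. The sketch below uses only NumPy; the function name `overlap_counts`, the toy values, and the scaling factor of 20 are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def overlap_counts(x, y):
    """Return, for every point, how many points share its exact (x, y) pair."""
    pairs = np.column_stack([x, y])
    # Unique rows, the position of every point in that list, and the row counts
    _, inverse, counts = np.unique(
        pairs, axis=0, return_inverse=True, return_counts=True
    )
    # ravel() keeps this robust to NumPy versions that return a 2-D inverse
    return counts[inverse.ravel()]

# Toy values (hypothetical, not taken from the dataset)
x = np.array([1, 1, 2, 1, 2])
y = np.array([3, 3, 5, 4, 5])
counts = overlap_counts(x, y)

# Marker areas proportional to the number of overlapping points
sizes = 20 * counts
```

The resulting `sizes` array has one entry per point, so it can be passed as the `s` argument of `Axes.scatter` together with the same `x` and `y`.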


In [None]:
def draw_histograms(benign, malignant, variables, nrows, ncols):
    fig,axes=plt.subplots(nrows,ncols,sharex=True, sharey=True, figsize=(20,20))

    i=4


- Point out a pair of variables whose values have a positive correlation overall.


- Can you identify a pair of variables for which the values are highly correlated in one group
of subjects (e.g. malignant), but less so in the other group?
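
To support the visual impression with numbers, the Pearson correlation can be computed for all samples and separately per class. This is a minimal sketch; `correlation_by_class` and the toy arrays are hypothetical, and the class labels 2 and 4 follow the encoding used in the cells above.

```python
import numpy as np

def correlation_by_class(x, y, labels):
    """Pearson correlation of x and y overall and within each class label."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    result = {"all": float(np.corrcoef(x, y)[0, 1])}
    for label in np.unique(labels):
        mask = labels == label
        # Off-diagonal entry of the 2 x 2 correlation matrix
        result[label.item()] = float(np.corrcoef(x[mask], y[mask])[0, 1])
    return result

# Toy values (hypothetical): class 4 is perfectly correlated, class 2 is not
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1, 1, 2, 1, 5, 6, 7, 8])
labels = np.array([2, 2, 2, 2, 4, 4, 4, 4])
r = correlation_by_class(x, y, labels)
```

Note that `np.corrcoef` returns `nan` for a class in which one of the variables is constant.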

c) Compute the distance consistency of all scatter plots. Which pair of variables leads to the highest
distance consistency? (6P)
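
The following sketch assumes the common centroid-based definition of distance consistency: the fraction of points whose nearest class centroid is the centroid of their own class. If the lecture defines the measure differently, the code has to be adapted. The function name and the toy points are illustrative only.

```python
import numpy as np

def distance_consistency(points, labels):
    """Fraction of points that lie closest to the centroid of their own class."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One centroid per class, in the order given by `classes`
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    # Euclidean distances, shape (n_points, n_classes)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = classes[np.argmin(dists, axis=1)]
    return float(np.mean(nearest == labels))

# Toy scatter plot (hypothetical): the last point lies inside the other class
points = np.array([[1, 1], [2, 1], [1, 2], [9, 9], [8, 9], [2, 2]])
labels = np.array([2, 2, 2, 4, 4, 4])
dsc = distance_consistency(points, labels)
```

For the scatterplot matrix, `points` would be the two columns of one variable pair, and the function would be called once per pair.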

d) Imagine that, given only the values of two variables, you will be asked to decide whether they are
from a benign sample, or a malignant one. Which pair of variables would you choose to make
that decision? Why? (3P)

Hint: You can use the Python toolkit matplotlib to create plots. More information on it is available
from http://matplotlib.org/.

### Exercise 2 (Principal Component Analysis, 25 Points)


It is difficult to fully visualize a very high-dimensional space. In the first assignment sheet and the
previous exercise, we therefore focused on a few variables that we found to be particularly discriminative.
In this exercise, we will instead employ dimensionality reduction on the values of all variables.

a) Perform a Principal Component Analysis (PCA) on the values.
Write a program to read the breast-cancer-wisconsin.xlsx file again. Interpolate missing
values as before, but keep all variables this time. Make a plot that, for any number n, shows what
fraction of the overall variance in the data is contained in the first n principal components. How
many components do we need to cover 90% of the variance? (5P). Hint: You may use the implementation of PCA that is provided in the Python package scikit-learn.
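
As a rough illustration of the quantity to plot, the sketch below computes the cumulative fraction of variance from the singular values of the centred data, using NumPy only. The synthetic matrix `X` is a stand-in for the real measurements, and both function names are hypothetical. With scikit-learn, the per-component fractions are available as `PCA().fit(X).explained_variance_ratio_`.

```python
import numpy as np

def cumulative_variance_ratio(X):
    """Cumulative fraction of the total variance covered by the first n components."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Squared singular values are proportional to the component variances
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()

def components_needed(cumulative, threshold=0.9):
    """Smallest n whose first n components reach the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic stand-in data (not the breast cancer measurements):
# the first column has a much larger spread than the other three
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([10.0, 1.0, 1.0, 1.0])

cumulative = cumulative_variance_ratio(X)
n_90 = components_needed(cumulative, threshold=0.9)
```

Plotting `cumulative` against `range(1, len(cumulative) + 1)` gives the requested curve, and `n_90` marks where it first reaches 90%.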


b) Each sample is now characterized by a point in PCA space. Create a scatter plot matrix (in the
same manner as in the previous sheet) that shows the first five principal components. This time,
instead of histograms, each diagonal cell should contain two overlaid density plots, one for the
benign and one for the malignant class. In the density plot, variable values should be on the x
axis, the frequency of observing that value in each class should be on the y axis. Use different
colors to distinguish between the classes, and add a legend. (5P)
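
A density plot can be obtained from a kernel density estimate. The sketch below is a minimal Gaussian KDE written with NumPy; the bandwidth uses the normal reference rule of thumb, which is an assumption and not prescribed by the assignment. The scores are synthetic placeholders for the projections of each class onto one principal component. `scipy.stats.gaussian_kde` offers a ready-made alternative.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal reference rule of thumb (assumed choice)
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # Standardised distance between every grid position and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)

# Synthetic scores for two classes along one principal component
rng = np.random.default_rng(1)
benign_scores = rng.normal(loc=-2.0, scale=0.5, size=300)
malignant_scores = rng.normal(loc=3.0, scale=1.5, size=150)

grid = np.linspace(-15, 15, 3001)
benign_density = gaussian_kde_1d(benign_scores, grid)
malignant_density = gaussian_kde_1d(malignant_scores, grid)
```

Each density can then be drawn in a diagonal cell with `ax.plot(grid, density)` or `ax.fill_between(grid, density, alpha=...)`, using the same class colors as in the scatter plots.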

c) In which PCA modes do you see a clear difference between the benign samples and the malignant
samples, and in which modes is the difference smaller? (3P)

d) Outliers (points that lie far away from the rest of the data) can sometimes affect the data
analysis. Provide the sample code or row index of the farthest malignant sample in the
fourth PCA mode. Then remove that sample using its row index. (5P)
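
One way to read "farthest" is the malignant sample with the largest absolute deviation from the malignant median along the fourth component; this interpretation is an assumption. The sketch below uses a small hypothetical table of scores, with made-up column names `PC4` and `class` and made-up row indices.

```python
import pandas as pd

# Hypothetical PCA scores; in practice this would hold the projected samples
scores = pd.DataFrame(
    {
        "PC4": [0.1, -0.3, 0.2, 6.0, -0.2, 0.4],
        "class": [2, 4, 4, 4, 2, 4],
    },
    index=[10, 11, 12, 13, 14, 15],
)

# Restrict to the malignant class (label 4) and measure the distance
# of every sample from the malignant median along the fourth component
malignant_pc4 = scores.loc[scores["class"] == 4, "PC4"]
distance = (malignant_pc4 - malignant_pc4.median()).abs()

# Row index of the farthest malignant sample
outlier_index = distance.idxmax()

# Remove that sample by its row index
cleaned = scores.drop(index=outlier_index)
```

Because `idxmax` returns the row label, the same value can be used to look up the sample code in the original table before the row is dropped.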

e) See what happens when we re-weight the variables to emphasize those that discriminate well
between the benign and malignant classes. To do so, compute F scores (cf. sheet 2, task 1 d))
and multiply each data value by its corresponding F score. Create two scatter plots to compare
PCA results with and without the re-weighting. (5P)
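
The re-weighting step can be sketched as follows. The F score formula used here is an assumption, since the definition from the earlier sheet is not repeated in this one: the squared deviations of the two class means from the overall mean, divided by the sum of the two within-class sample variances. The function name, the default labels, and the toy matrix are illustrative.

```python
import numpy as np

def f_scores(X, labels, positive=4, negative=2):
    """Per-variable F score under the assumed definition described above."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    pos = X[labels == positive]
    neg = X[labels == negative]
    overall_mean = X.mean(axis=0)
    # Between-class part: deviation of each class mean from the overall mean
    numerator = (pos.mean(axis=0) - overall_mean) ** 2 \
        + (neg.mean(axis=0) - overall_mean) ** 2
    # Within-class part: sample variances of both classes
    denominator = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numerator / denominator

# Toy data (hypothetical): the first variable separates the classes, the second does not
X = np.array([[1, 5], [2, 4], [1, 6], [9, 5], [8, 6], [9, 4]], dtype=float)
labels = np.array([2, 2, 2, 4, 4, 4])

weights = f_scores(X, labels)
# Broadcasting multiplies every column by the F score of its variable
X_weighted = X * weights
```

PCA is then run once on `X` and once on `X_weighted`, and the two resulting scatter plots are compared side by side.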

f) In the breast cancer data set, all variables have a similar range of values, v ∈ [1, 10]. If the
variables of a data set have varying ranges, for example one variable has values around 1000 to
2000 and another around 1 to 5, how could this affect the PCA performance? Explain how you would
solve this problem. (2P)
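
The effect can be illustrated with synthetic data: without rescaling, the first principal component points almost entirely along the variable with the larger range. Standardizing each variable to zero mean and unit variance is one common remedy, shown here as a sketch with hypothetical function names and made-up data. scikit-learn provides `StandardScaler` for the same purpose.

```python
import numpy as np

def standardize(X):
    """Rescale every column to zero mean and unit sample variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def first_component(X):
    """Direction of the first principal component (unit vector)."""
    centred = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[0]

# Synthetic variables with very different ranges, as in the question
rng = np.random.default_rng(0)
large = rng.uniform(1000, 2000, size=200)
small = rng.uniform(1, 5, size=200)
X = np.column_stack([large, small])

raw_pc1 = first_component(X)                   # dominated by the large-range variable
scaled_pc1 = first_component(standardize(X))   # both variables contribute on equal footing
```

Comparing `raw_pc1` with `scaled_pc1` shows how strongly the unscaled result depends on the units of measurement rather than on the structure of the data.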