# Python for Data Analytics - Exercises 1

# Getting Started – Diagnosing Liver Disease

## Objectives

* To introduce a classification task using real data.
* To learn how to use iPython to visualize 1- and 2-D distributions.

# Introduction
This set of iPython notebooks uses the Python programming language for Data Analytics. The notebooks introduce some additional Python concepts, but the main focus is on using the Python programming skills that you have already developed to analyse real data sets.


# Loading the data

In this notebook we are going to experiment with a set of medical data concerned with the diagnosis of liver disease. The code in the next cell loads the data into this iPython notebook session from the file `files/liver_data.txt`. To execute the code in the next cell,

1. simply click on the cell so that it becomes the current cell (a green border should appear around it)
2. click on the 'run cell' button - the black triangle in the tool bar above, or use the Ctrl+Return combination

<font size ="4"><u>Note:</u></font> some lines of code end with a semicolon. The semicolon is added to suppress unnecessary output that the notebook would otherwise display after running the cell. Python does not require a semicolon at the end of each line of code.

In [None]:
import numpy as np
# Passing the file name lets NumPy open and close the file itself
data = np.loadtxt("files/liver_data.txt", delimiter=",")

This command uses [NumPy](https://docs.scipy.org/doc/numpy/user/basics.html) to load all the rows of comma-separated numbers from the file `liver_data.txt` and store them all in a matrix variable called `data`. (Note, the variable `data` did not need to be declared or given a type, or allocated memory.) 
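As a quick check of what `np.loadtxt` returns, the sketch below parses a few comma-separated rows from an in-memory string instead of from `liver_data.txt`. The three rows and the names `sample_text` and `sample` are made up for illustration and are not taken from the real file. The point is only that the result is a 2-D array whose `shape` gives (rows, columns).

```python
import io
import numpy as np

# Hypothetical rows in a 7-column, comma-separated layout (not real patient data)
sample_text = "10,20,30,40,50,0.5,1\n11,21,31,41,51,2.0,2\n12,22,32,42,52,4.0,1\n"

# np.loadtxt accepts a file name or any file-like object
sample = np.loadtxt(io.StringIO(sample_text), delimiter=",")

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtype)   # the values are stored as floating-point numbers
```

Applied to the real array, `data.shape` should report the 345 rows and 7 columns described in the next section.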

# Understanding the data

The data concerns the diagnosis of liver disease. Seven pieces of information (7 columns) have been recorded about 345 patients (345 rows). We will talk about there being 345 samples and each sample having 7 features. The data in the 7 columns has the following meaning:

1. mean corpuscular volume
2. alkaline phosphatase
3. alanine aminotransferase
4. aspartate aminotransferase
5. gamma-glutamyl transpeptidase
6. number of half-pint equivalents of alcoholic beverages drunk per day 
7. Class label (1=healthy; 2=diseased)

The first five are the results from blood tests (we do not need to worry about their precise meaning), the sixth concerns the patient’s alcohol consumption. The final column is a ‘class label’. A 1 in this column means the patient is healthy and a 2 means the patient has liver disease. Full information about the data can be found here, http://archive.ics.uci.edu/ml/datasets/Liver+Disorders

If designing a medical diagnostic test, our goal would be to try and predict the class label in the seventh column (healthy vs diseased) given the patient data recorded in the first 6 columns (the blood tests and alcohol consumption), i.e., it is a classification task (data in, label out).
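The "data in, label out" split can be sketched with array slicing. The example below assumes the same column layout (six feature columns followed by one label column) but uses a tiny made-up array called `toy` instead of the real data. The names `features` and `labels` are illustrative only. Column indexing itself is explained in the next section.

```python
import numpy as np

# Made-up array in the same layout: 6 feature columns followed by 1 label column
toy = np.array([[10, 20, 30, 40, 50, 0.5, 1],
                [11, 21, 31, 41, 51, 2.0, 2],
                [12, 22, 32, 42, 52, 4.0, 1]])

features = toy[:, :6]   # all rows, columns 0 to 5 (the input to a classifier)
labels = toy[:, 6]      # all rows, column 6 (the value we want to predict)

print(features.shape)   # (3, 6)
print(labels)           # [1. 2. 1.]
```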

# Examining the data

Let us now look at one of the features in more detail. We can extract the column with index $n$ from a matrix $M$ and store it in a vector, $v$, using the Python code,

    v = M[:, n]

Python counts from 0, so index $n$ refers to column $n+1$. So to extract the information about alcohol consumption (feature 6, which has index 5) we can run

In [None]:
drinks = data[:,5]
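Zero-based column indexing is easy to get wrong, so here is a small sketch of it on a made-up 3-by-4 matrix. The matrix `M` and the variable names are illustrative only.

```python
import numpy as np

# Small made-up matrix: 3 rows (samples) and 4 columns (features)
M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

first_column = M[:, 0]    # index 0 selects the 1st column
last_column = M[:, 3]     # index 3 selects the 4th column
same_last = M[:, -1]      # a negative index counts back from the end

print(first_column)  # [1 5 9]
print(last_column)   # [ 4  8 12]
```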

We can now plot this data using the `plot` command from matplotlib's `pyplot` module. The line `%matplotlib inline` tells iPython notebook that we want the plots to appear inline in the notebook (i.e. in the browser) and not in a separate window.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(drinks);

The plot should have appeared above. It will have the sample numbers along the x-axis (i.e. patient 1 to patient 345) and the patient's alcohol consumption on the y-axis. This is not a very clear way of viewing the data. It would be more appropriate to use a histogram (“a bar chart”). This can be easily done by using matplotlib's 'hist' function,

In [None]:
plt.hist(drinks);

Unfortunately, by default the histogram has 10 bins, which is too few to display the data clearly. The `hist` function has a parameter called `bins` that can take a vector specifying the positions of the boundaries between the histogram bins. We want bins centred at 0, 1, 2, ..., 20, so we need to specify boundaries at -0.5, 0.5, 1.5, ..., 20.5. This is most easily done using NumPy's `linspace` function, which produces a sequence of evenly spaced values between two specified limits,

In [None]:
plt.hist(drinks,bins=np.linspace(-0.5,20.5,22));
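To see why 22 points are needed, the sketch below prints the bin edges and counts a few made-up values with `np.histogram`, which does the same binning as `plt.hist` without drawing anything. The array `toy_drinks` is hypothetical and chosen only to show which bin each value falls into.

```python
import numpy as np

# 22 evenly spaced edges from -0.5 to 20.5 give 21 bins, each of width 1
edges = np.linspace(-0.5, 20.5, 22)
print(edges[:4])        # the first few edges

# Hypothetical values, chosen only to show which bin each one falls into
toy_drinks = np.array([0.0, 0.5, 1.0, 1.0, 2.0, 20.0])
counts, _ = np.histogram(toy_drinks, bins=edges)
print(counts[:3])       # counts for the bins centred at 0, 1 and 2
```

A value that lies exactly on an edge, such as 0.5, is counted in the bin to its right, so here it lands in the bin centred at 1.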

You can make the figure a little clearer by changing the range of the axes, try,

In [None]:
plt.hist(drinks,bins=np.linspace(-0.5,20.5,22));
plt.xlim([-0.5,20.5]);

So we can now see that the people in the dataset are drinking between 0 and 20 units of alcohol a day, with the mode (the most common value) being 1 unit. (Where would you lie on this distribution?) To calculate the average alcohol consumption use the command `np.mean(drinks)`. Try typing this in the empty cell below,

In [None]:
# Calculate the average alcohol consumption in this cell

# Using the class label

In the previous section we looked at the distribution of alcohol consumption for all 345 people. We now want to look at separate histograms for healthy and diseased people. Remember that the class label is in column 7 (column index 6). We can make Python test whether the entries in a column match a specific number, for example 1, by using syntax like,

In [None]:
# Uncomment the line below and execute the cell
#data[:,6]==1

Take care to note that it is == and not =.

The results are returned as a vector of Boolean values: `True` where the entry matches and `False` where it does not. This single line of code effectively performed 345 comparisons and returned 345 results all in one go.

We can store these results in a variable,

In [None]:
wellPeople = data[:,6]==1

We can use these Booleans to select rows belonging to the healthy people from a column of our matrix, e.g. to select the feature 6 data (column index 5) for just the healthy people,

In [None]:
wellDrinks = data[wellPeople, 5]

or putting it all in one line,

In [None]:
wellDrinks = data[data[:,6]==1, 5]

and similarly,

In [None]:
illDrinks = data[data[:,6]==2, 5]
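The comparison and selection steps can be tried on a small made-up array to see exactly what each one produces. In this sketch `toy`, `is_class_1` and the other names are illustrative only. The layout (one feature column and one label column) is a simplification of the real data.

```python
import numpy as np

# Made-up data: column 0 is a feature, column 1 is a class label (1 or 2)
toy = np.array([[0.5, 1],
                [3.0, 2],
                [1.0, 1],
                [6.0, 2]])

is_class_1 = toy[:, 1] == 1     # element-wise comparison gives a Boolean array
print(is_class_1)               # [ True False  True False]
print(is_class_1.sum())         # True counts as 1, so this is the number of matches

class_1_feature = toy[is_class_1, 0]    # column 0 for the rows where the mask is True
class_2_feature = toy[~is_class_1, 0]   # ~ inverts the mask (fine here: labels are only 1 or 2)
print(class_1_feature, class_2_feature)
```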

Now we want to compare the histograms for these two classes. We can place two plots in the same figure using the `subplot` command. See [here](http://matplotlib.org/api/pyplot_api.html?highlight=subplot#matplotlib.pyplot.subplot) for subplot documentation.

In [None]:
plt.subplot(2,1,1)
plt.hist(wellDrinks, bins=np.linspace(-0.5, 20.5, 22))
plt.xlim(-0.5,20.5);
plt.subplot(2,1,2)
plt.hist(illDrinks, bins=np.linspace(-0.5, 20.5, 22))
plt.xlim(-0.5,20.5);

Compare the two histograms. Are you surprised by how they appear? Is alcohol consumption by itself a good predictor of liver disease?

# Writing functions

So far we have looked at the distribution of the 'number of drinks' feature (i.e. feature 6, column index 5) for the well and unwell classes. Let's say we now want to look at the distribution for another feature. We'd need to repeat the commands executed so far but using a column index other than 5. It would be convenient to place these commands in a function body. We would pass the data and the feature index as parameters of the function. We can then call the function with different index values and see how different features are distributed.

Try implementing this function in the cell below. Hint: most of the lines can be copied and pasted from above. The column index `5` just needs replacing with a parameter variable.

In [None]:
# Implement the function described above in this cell

# Test your function with the rest of the columns from the "liver_data.txt" file

Which feature appears to be the best for distinguishing between the two classes? i.e. which feature generates a pair of histograms with the smallest overlap?

# Looking at pairs of features using a scatter plot

You’ll probably find that no one feature is very useful in isolation. The classes will be better separated if we use more features. When looking at one dimension we can visualize the data distribution using a histogram. When looking at the distribution of a pair of features it is often better to use a scatter plot. With a scatter plot the pair of features are represented as a point on a 2-D plane, i.e. each sample is plotted on the plane at a position that represents the value of the sample’s features.

Let’s say that we want to look at feature 4 and feature 5 (column indices 3 and 4). Again, we will separate the healthy people from the diseased people,

In [None]:
wellFeatureX = data[data[:,6]==1, 3]
wellFeatureY = data[data[:,6]==1, 4]

we can now plot these against each other using,

In [None]:
plt.scatter(wellFeatureX, wellFeatureY);

Examine how the healthy patients are spread. We can now do the same for the diseased patients by changing the ==1 to ==2, i.e.

In [None]:
illFeatureX = data[data[:,6]==2, 3]
illFeatureY = data[data[:,6]==2, 4]
plt.scatter(wellFeatureX, wellFeatureY);
plt.scatter(illFeatureX, illFeatureY);

The problem now is that both sets of data are plotted using the same symbol so we can’t see which is which. To use a different symbol and/or colour you can provide `scatter` with extra arguments. The argument `c` can be used to specify a colour, `marker` can be used to specify a marker style and `s` is a number that specifies the marker size (as an area, in points squared).

The example below uses green crosses for the healthy people and red dots for the diseased people,

In [None]:
plt.scatter(illFeatureX, illFeatureY, s=20, c='r', marker='o');
plt.scatter(wellFeatureX, wellFeatureY, s=20, c='g', marker='x');

How well are the two classes separated by this pair of features?

We can now generalise the code we have written by writing a Python function that takes the data and the indexes of any two features that we want to compare.

In [None]:
def compare_features(data, f1, f2):
    wellFeatureX = data[data[:,6]==1, f1]
    wellFeatureY = data[data[:,6]==1, f2]
    illFeatureX = data[data[:,6]==2, f1]
    illFeatureY = data[data[:,6]==2, f2]
    plt.scatter(illFeatureX, illFeatureY, s=20, c='r', marker='o')
    plt.scatter(wellFeatureX, wellFeatureY, s=20, c='g', marker='x')
    
# call compare_features to test the function
compare_features(data, 3, 4)

Once the function has been defined we can use it later in the notebook. For example the next cell uses the function to compare several different pairs of features,

In [None]:
plt.subplot(2,2,1)
compare_features(data, 2, 3)
plt.subplot(2,2,2)
compare_features(data, 2, 4)
plt.subplot(2,2,3)
compare_features(data, 1, 4)
plt.subplot(2,2,4)
compare_features(data, 0, 3)

Now, using a nested loop and the `compare_features` function, write a piece of code that will systematically compare all six features against each other and present the result as a six-by-six grid of scatter plots. Use the empty cell below.

In [None]:
# Add your code here to compare all six features against each other
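One detail that is easy to get wrong in the grid exercise is the third argument to `plt.subplot`, which numbers the panels from 1, row by row. The sketch below uses a hypothetical helper named `subplot_index` (not part of matplotlib) to show how a zero-based (row, column) pair from a nested loop could be converted to that panel number. It deliberately stops short of drawing any plots.

```python
def subplot_index(row, col, n_cols):
    """Return the 1-based panel number for a zero-based (row, col) position.

    Panels are numbered left to right, then top to bottom, which matches
    the third argument of plt.subplot(n_rows, n_cols, index).
    """
    return row * n_cols + col + 1

n_features = 6
for row in range(n_features):
    # Print the panel numbers for one row of a 6-by-6 grid
    print([subplot_index(row, col, n_features) for col in range(n_features)])
```

The first row prints the numbers 1 to 6 and the last row ends at 36, which is the index of the bottom-right panel.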

# Summary

***Python***: We have used iPython notebook to explore a small medical data set. We have seen how Python can store data in a matrix and how the data can be manipulated using a simple syntax. We have seen some of matplotlib's powerful plotting functions and used them to visualize some 1-D and 2-D data distributions using histograms and scatter plots.

***Classification***: We have been introduced to a simple classification task based on the diagnosis of liver disease using patient data. We have seen that for real world tasks it can be hard to find single features that clearly separate the classes we are interested in. We have seen that using more than one feature at a time can be a way of better separating the classes (i.e. the data in the scatter plots overlapped less than the data in the histograms).