Pattern recognition is a widely used tool in many areas of science, from machine vision to neuroscience to just about any case where you are trying to use multiple variables to predict the group (or 'class') that a novel observation falls into. At a high level, pattern recognition deals with a very general question: do the variables that you measure encode information about the different classes of outcome that you want to recognize (or identify)? To the extent that there is information, the pattern recognition algorithm will be able to assign different exemplars to their correct classes, and if there is no information, it will fail. [As a side note, pattern recognition performance is directly linked to Mutual Information, which is the degree to which measuring one variable (or set of variables) informs you about another variable (or set of variables). We can discuss this after the tutorial if people are interested.] The 'pattern' part of 'pattern recognition' refers to the fact that we're not just going to use a single variable to predict our outcome measure (i.e. a 'univariate' analysis) - we're going to use the joint information encoded by a series of variables to make that prediction (i.e. a 'multivariate' analysis). As we'll illustrate below in the tutorial, this can be a major advantage: you might infer a null relationship based on univariate methods when in fact there is a very robust relationship that can be revealed by multivariate analysis methods.

To start with, here is an example to illustrate the general framework: suppose you're trying to write a machine vision program that will recognize different types of fruit (for simplicity, let's say that the fruits are either apples or oranges). This is a classic problem in machine vision (and human vision!), because there are a nearly infinite number of unique images of apples and oranges that one might encounter in the real world if you consider slight variations in color, viewpoint, size, etc. One approach to solving this recognition problem would be to 'show' your program millions of pictures of apples and millions of pictures of oranges, and then to perform a 'template matching' process. Essentially, you would show your program a novel 'test' image of either an apple or an orange, and the program would perform an exhaustive search through its database of images and try to find an exact match to the novel test image. A bit of reflection should suggest how inefficient and prone to failure this method is: what if your program wasn't exposed to the exact image that you're asking it to identify? It almost surely was not, as a random picture of an apple/orange is likely to be entirely unique in terms of angle, lighting, color variation, etc. So most of the time this approach will fail unless you restrict your image set to be unrealistically over-simplified.

Another approach, and one that is much more robust and efficient, is to 'train' your program to learn a set of features that are characteristic of each type of fruit (i.e. features that are diagnostic of each type of fruit). Using this approach, you might train your program with a set of just 1,000 or so apples and another set of 1,000 oranges, and then when you present the program with a novel test image that it has not seen, it will rely on its training to produce a guess about the identity of the novel test image. In short, it will compare the test image to the set of features that it has learned to associate with apples, then it will compare the test image to the set of features that it has learned to associate with oranges, and then it will assign the test image to a category based on how closely the features of the test image match the learned features for each type of fruit.

Note that to do this kind of training/testing recognition procedure, you could use a variety of approaches. The simplest approach is just straightforward correlation: you correlate the test image with the mean of the learned images for apples and with the mean of the learned images for oranges, and then make your guess based on the higher of the two correlation coefficients. This method has a few advantages: it's very easy to implement, and it's relatively fast. However, a simple correlation as described above only uses the mean pattern associated with each fruit (or more generally with each 'class') and thus discards information that is encoded in the variability of each feature within a class. In addition, simple correlations do not take account of the relationships between the different features that have been learned. We'll see why this is important later, and to do this, we'll step through a series of algorithms that can be used to perform pattern recognition, starting with simple correlation, then using the Euclidean distance, then the Normalized Euclidean distance, and then the Mahalanobis distance (the last of which is quite common in the neuroscience literature). Note that this set of pattern recognition algorithms is by no means exhaustive and there are dozens (hundreds? thousands?) of different variations and approaches. I've selected these because they build on each other and provide a good introduction to the importance of weighting features based on their variance and on the covariance structure of the data. Also, as we go through these algorithms, take note of the similarity between the latter three approaches (the Euclidean, Normalized Euclidean, and Mahalanobis distance metrics) and the d-prime measure that we discussed when we talked about signal detection theory a few weeks ago. These pattern recognition approaches are essentially based on the same principles as d-prime, but with some added elements that we'll discuss as we go.
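To make the four comparison rules named above a bit more concrete before we work through them, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the feature values, the class means, the standard deviations, and the covariance matrix are invented, and the helper names (pearson, euclidean, normalized_euclidean, mahalanobis_2d) are hypothetical. The Mahalanobis helper is written for two features only so that the matrix inverse can be spelled out by hand.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def euclidean(x, mu):
    """Straight-line distance from pattern x to a class mean mu."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))

def normalized_euclidean(x, mu, sd):
    """Euclidean distance after scaling each feature by its standard deviation."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, mu, sd)))

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance for two features, using an explicit 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # dx' * inv(cov) * dx, with inv(cov) = [[d, -b], [-c, a]] / det
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.sqrt(quad)

# Correlation rule: pick the class whose mean pattern correlates best with the
# test pattern. These four-feature "mean images" are made-up numbers.
fruit_means = {"apple": [0.9, 0.8, 0.2, 0.5], "orange": [0.3, 0.9, 0.8, 0.5]}
test_image = [0.85, 0.75, 0.25, 0.55]
by_correlation = max(fruit_means, key=lambda c: pearson(test_image, fruit_means[c]))

# Distance rules: pick the class whose mean is closest to the test pattern.
# Both hypothetical classes share the same spread; the two features are
# strongly correlated (r = 0.9) and feature 1 is more variable than feature 2.
class_means = {"A": [0.0, 0.0], "B": [3.0, 0.0]}
sd = [2.0, 1.0]
cov = [[4.0, 1.8], [1.8, 1.0]]
x = [2.0, 1.0]

by_euclidean = min(class_means, key=lambda c: euclidean(x, class_means[c]))
by_normalized = min(class_means, key=lambda c: normalized_euclidean(x, class_means[c], sd))
by_mahalanobis = min(class_means, key=lambda c: mahalanobis_2d(x, class_means[c], cov))

print(by_correlation, by_euclidean, by_normalized, by_mahalanobis)
```

In this toy case the two Euclidean rules assign the test pattern to class B, while the Mahalanobis distance assigns it to class A, because the test pattern lies along the direction in which the two features tend to vary together around the mean of A. That is the kind of difference that taking the covariance structure into account can make.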

One more theoretical note. In practice, pattern recognition algorithms often rely on a principle called 'cross-validation' when you cannot collect an endless set of data such as described in the fruit example above. For instance, in our fruit example, we could characterize the performance, or the generalizability, of our pattern recognition algorithm by showing the algorithm a ton of novel images of apples/oranges and seeing how accurately each stimulus was classified. However, in other cases where we don't have easy access to a ton of training and testing data, we have to rely on another approach to assess generalizability. In these cases, which comprise most instances in neuroscience and social science, we instead rely on 'cross-validation'. Cross-validation simply refers to the notion that you train your pattern recognition algorithm (henceforth I'll call this a 'classifier') using one set of data, and then you validate, or 'test', the performance of the classifier using a novel set of data that was not part of the training set. The main purpose of cross-validation is to assess the generalizability of your classifier and its ability to correctly categorize novel inputs. This sounds easy enough, but it's actually a fairly deep issue. For instance, suppose you did an experiment that had 500 trials of stimulus type A and 500 trials of stimulus type B and you measured the response on each trial in 100 neurons. Then, to figure out how well the neural data respond systematically to changes in stimulus parameters, you fit a multivariate regression analysis to see how much variability in the stimulus is accounted for by changes in neural activity (i.e. you compute something akin to an R^2 value to assess goodness of fit). Suppose you run this analysis on all 1,000 trials and you get your R^2 value and it's nice and high - like .75 or so. You might be really happy with this, and rush off to write up a paper.
However, since you fit all of the data in your model, your estimate of how well the model fits the data is likely overestimating how good the model is at accounting for the relationship between the two factors in general. Here's why: your dependent variables (your measured neural responses) are corrupted by noise, and this noise is idiosyncratic in the sense that if you were to perform the experiment again, you'd get 1,000 different measurements that were similar to the first 1,000, but corrupted by unique noise. As a result, when you fit your regression model to the data, the resulting coefficients will reflect the true 'signal' in the data AND the idiosyncratic noise that was measured along with the signal. In effect, your model learns the relationship between the independent variable and the (signal + noise). This occurs because your regression model (or any other model, for that matter) has no a priori means of separating out signal and noise - it just gets a measure of the neural responses that were evoked by each stimulus, and its job is to relate those measurements to the independent variables. This is referred to as 'overfitting', and it is exacerbated by small data sets (where the signal is not likely to emerge from the noise due to the small sample size) and by models that have lots of free parameters (more free parameters means that the model can more flexibly account for random variations in the data, i.e. noise).
So - what to do? Instead of just fitting the model to all the data and assessing the goodness of fit, you should generally be using cross-validation. In our example above, you could train your classifier using 400 of the 500 trials associated with each stimulus (so 800 trials total), and then 'test' the classifier's performance at guessing the correct stimulus class using the remaining 200 trials (100 associated with each stimulus). Then you could repeat this train/test procedure several times, each time holding out a different set of 200 trials to test the performance of the classifier and using the remaining 800 trials to train it. Here is the cool part: if your classifier (or more generally, your model that relates the IV and the DV) is just learning the idiosyncratic noise in the data, then you might have a reasonable looking R^2 value based on your training data (i.e. the model fits the training data OK), but your ability to classify novel exemplars from the test set might be at chance (again, because your classifier or model just learned the random noise in the training set and there was no 'signal' that could actually discriminate between classes). So, the use of cross-validation can protect against overly optimistic assessments of model fit due to 'overfitting', and it also enables you to assess how well the model generalizes to novel exemplars. The degree to which a classifier generalizes to correctly predict novel stimuli is then really a measure of how much real signal - or information - there is in your data about the different exemplars that you're trying to classify.
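The hold-out procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated Gaussian noise (plus an optional mean shift that plays the role of 'signal'), the classifier is a simple nearest-class-mean rule rather than a regression model, performance is measured as classification accuracy rather than R^2, and all function names (make_data, class_means, predict, accuracy, cross_validate) are made up for this example. The trial counts mirror the example in the text: 500 trials per stimulus, 100 neurons, and 5 splits of 800 training and 200 test trials.

```python
import random

def make_data(n_per_class, n_features, signal, rng):
    """Simulate trials; class 'B' has its mean shifted by `signal` on every feature."""
    data, labels = [], []
    for label, shift in (("A", 0.0), ("B", signal)):
        for _ in range(n_per_class):
            data.append([rng.gauss(shift, 1.0) for _ in range(n_features)])
            labels.append(label)
    return data, labels

def class_means(data, labels):
    """'Training': compute the mean pattern for each class."""
    means = {}
    for label in sorted(set(labels)):
        rows = [x for x, y in zip(data, labels) if y == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def predict(x, means):
    """Assign x to the class whose mean pattern is closest (squared Euclidean)."""
    return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))

def accuracy(data, labels, means):
    return sum(predict(x, means) == y for x, y in zip(data, labels)) / len(labels)

def cross_validate(data, labels, n_folds, rng):
    """Return (mean training accuracy, mean held-out accuracy) across folds."""
    # Deal the trials of each class evenly into folds so every test set is balanced.
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            folds[k % n_folds].append(i)
    train_scores, test_scores = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        train_x = [data[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        means = class_means(train_x, train_y)  # fit on training trials only
        train_scores.append(accuracy(train_x, train_y, means))
        test_scores.append(
            accuracy([data[i] for i in test_idx], [labels[i] for i in test_idx], means)
        )
    return sum(train_scores) / n_folds, sum(test_scores) / n_folds

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Case 1: pure noise, so there is nothing real to learn.
noise_x, noise_y = make_data(500, 100, 0.0, rng)
train_acc_noise, cv_acc_noise = cross_validate(noise_x, noise_y, 5, rng)

# Case 2: a small mean shift on every feature, so there is real signal.
signal_x, signal_y = make_data(500, 100, 0.3, rng)
train_acc_signal, cv_acc_signal = cross_validate(signal_x, signal_y, 5, rng)

print("noise : train %.2f, held-out %.2f" % (train_acc_noise, cv_acc_noise))
print("signal: train %.2f, held-out %.2f" % (train_acc_signal, cv_acc_signal))
```

With pure noise, accuracy on the training trials comes out above chance even though there is nothing to learn, while accuracy on the held-out trials stays near 50%. That gap is the overly optimistic fit described above. When a real mean shift is present, accuracy on the held-out trials rises well above chance, which is what tells you there is genuine information in the data.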


PART I: Basic overview of classification and cross-validation using a univariate data set (i.e. a single variable was measured and we'll use the observed values from that variable to classify novel data points as belonging to condition A or condition B). Note that conditions A and B could be experimentally manipulated (e.g. two stimulus types that are randomly presented to an observer) or they could be observational outcomes like 'graduated from high school' and 'did not graduate from high school'. The important thing here is that we have one or more measured variables that we're going to use to figure out whether a given novel observation most likely belongs to condition A or to condition B. And as stated above, we're not going to get into actual pattern recognition or classification just yet, but instead we'll just try to develop an intuition about why looking at multivariate patterns of data is often more useful than just looking at each variable in isolation (univariate analysis). As in many of the previous tutorials, we'll start by setting up some fake data so that we can control all of the relevant parameters (amount of signal, noise, number of trials).