# Model Selection and Validation

Up to this point, you have been introduced to the major topics in machine learning, as well as the problems they aim to address:

* Supervised
    * Classification
    * Regression
* Unsupervised
    * Cluster Analysis
    * Dimensionality Reduction
    
You have also learned a number of models (Linear Regression, KMeans/KNN, Logistic Regression, etc.) as means of approaching these objectives. Unfortunately, knowledge of models does not a data scientist make. One has to understand when and where a particular model is effective and, perhaps more importantly, when that model is inferior to another at performing a given task.

This is what we mean by model selection: one cannot merely throw models at a problem and settle for whatever mediocre prediction results from the endeavor. Rather, we, as data scientists, must aspire to find the best approximation to the true solution of a problem, know which models may best achieve that objective, and be able to measure our success in achieving this goal.

Before we begin, let us import some packages that will be used throughout this section of notes.

In [None]:
import sklearn as sk
import sklearn.model_selection as ms
import pandas as pd
import matplotlib.pyplot as plt

## No Model Works in Every Situation

It should come as no surprise that there is no such thing as a perfect model (at least, not one that has been discovered) and no one-size-fits-all approach to solving a problem.

Therefore, in the following sections we will be discussing points of failure in models we have introduced to you so far in the course, conditions to consider when selecting your model and, finally, ways to adjust your chosen model so that it better suits the problem at hand.

### Model Failure and Why

In this section, we will discuss the shortcomings of a number of models that have previously been introduced in this course, as a means of demonstrating that one must be aware of _context_ when choosing their models.

#### The Problems With Linear Regression

    Linear regression and classification methods, and techniques derived from them, are widespread throughout the field of data science. To be specific: when we refer to linear regression in this section, we mean Ordinary Least Squares regression and its issues.

Linear Regression is a well-worn technique in Data Science, and quite often it is the first regression model that students are introduced to. Hopefully you are all familiar with this model, but those who aren't may refer to the previous weeks' lecture notes.

As a quick review: Linear Regression fits a weight to each input feature so as to minimize the Mean Squared Error between its predictions and the observed targets. In doing so, it fits a straight line (or, with several features, a flat plane) to the data, which is then used to generate quantitative predictions.

The first -- and likely the most significant -- shortcoming of Linear Regression lies in one of the assumptions that make it work: that there exists an underlying linear structure to the data. This issue alone renders Linear regression ineffective in a vast number of cases where this assumption does not hold true.

In [None]:
%matplotlib inline

The graph above demonstrates this failure. On the left, you can see a case where Linear Regression works fairly well, as there does seem to be a linear structure to the data. On the right, however, the "line of best fit" does not seem to fit very well at all; this is because there is no linear structure to the data.
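
The same contrast can be checked numerically. The sketch below is a minimal illustration using only the standard library, not the code that produced the course's graph: the synthetic datasets, the helper names `fit_line` and `r_squared`, and the noise levels are all assumptions made for this example. It fits a least-squares line to one dataset with a linear trend and to one with a symmetric curve, then reports how much of the variance each line explains.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(-3, 3) for _ in range(200)]

# Synthetic data (an assumption of this sketch, not the course's dataset).
linear_ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]  # linear trend
curved_ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]     # symmetric curve

r2_linear = r_squared(xs, linear_ys, *fit_line(xs, linear_ys))
r2_curved = r_squared(xs, curved_ys, *fit_line(xs, curved_ys))
print(f"R^2 on linear data: {r2_linear:.3f}")
print(f"R^2 on curved data: {r2_curved:.3f}")
```

On data like this, the first score should sit close to 1 while the second should sit close to 0: the curved data has plenty of structure, just not the kind a straight line can capture.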

A more subtle issue lies in Linear Regression's poor handling of collinearity. Consider the case where two input variables depend on each other, or have a high covariance, and where both are strongly related to the target variable. When both are used in a Linear Regression model, each may receive a significant weight, despite the fact that no additional information is gained from using both variables. This makes your model more complex, makes the individual weights unstable, and may lead to overfitting.
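
To see what this looks like in practice, here is a small sketch under stated assumptions: the data is synthetic, the target depends only on `x1` with a true weight of 3, `x2` is `x1` plus a tiny amount of noise, and the model has no intercept so that the normal equations stay 2x2. The helpers `fit_two_features` and `simulate` are hypothetical names introduced for this example.

```python
import random

def fit_two_features(x1, x2, y):
    """OLS with two features and no intercept, via the 2x2 normal equations."""
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    p = sum(u * t for u, t in zip(x1, y))
    q = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b  # close to zero when x1 and x2 are nearly collinear
    w1 = (d * p - b * q) / det
    w2 = (a * q - b * p) / det
    return w1, w2

def simulate(seed, n=200):
    """Draw a fresh synthetic sample in which x2 is almost a copy of x1."""
    rng = random.Random(seed)
    x1 = [rng.uniform(1, 10) for _ in range(n)]
    x2 = [u + rng.gauss(0, 0.01) for u in x1]    # nearly collinear with x1
    y = [3 * u + rng.gauss(0, 0.5) for u in x1]  # only x1 drives the target
    return fit_two_features(x1, x2, y)

weights = [simulate(seed) for seed in range(5)]
for w1, w2 in weights:
    print(f"w1 = {w1:7.2f}   w2 = {w2:7.2f}   w1 + w2 = {w1 + w2:.2f}")
```

Across resampled datasets the sum `w1 + w2` should stay close to 3, while the individual weights are free to swing widely, since the data cannot tell the two features apart. Dropping one of the two features would likely give a simpler model with the same predictive power.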

#### Why KNN is rarely used

K Nearest Neighbors is another model which the reader should have come across by this point. KNN is widely taught as one of the first classification methods presented to students, due largely to both its approachability and its usefulness for demonstrating a number of topics integral to machine learning. As a student of this course, KNN may be one of the first things you consider when solving a classification problem. It may come as a surprise to you, then, that KNN is actually quite rarely used in practice.

As a refresher, a KNN model classifies a test point by finding the k training samples nearest to it (its neighborhood) and taking a majority vote of their classes. The test point is given whichever class wins that vote.
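
The refresher can be sketched in a few lines. This is a minimal brute-force version for illustration, assuming Euclidean distance and a toy dataset invented for this example; `knn_predict` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, (1.2, 1.4)))  # query near the first group
print(knn_predict(points, labels, (6.4, 6.2)))  # query near the second group
```

Note that there is no training step at all: the "model" is simply the stored training data, and all of the work happens at prediction time.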

This leads us quite naturally to the first issue with KNN. KNN assumes that points close together in space will tend to be of the same class, and so it takes a majority vote over a small area to create predictions. However, this means that KNN has some fairly severe density requirements in order to function effectively. If the data is too sparse, then the neighborhoods will be quite spread out, and a majority vote will be rather ineffective because the points in the neighborhood cannot truly be considered close. Our assumption that nearby points tend to share a class then no longer helps, since the points in the dataset are not close enough for it to apply.

The next issue is that KNN does not work particularly well on unbounded data, or on datasets where the amount of data along the boundary is lacking. Keep in mind that K Nearest Neighbors takes a majority vote over the k training points nearest to the test point. The graph shown below demonstrates the k nearest neighbors in an effective use of the model, in an unbounded case, and in a case with insufficient data on the boundaries.

In [None]:
%matplotlib inline

As you can see, the second and third examples must go further afield to find their "neighbors". Thus, the assumption behind KNN described above is once again irrelevant.

The final issue we'll talk about should be familiar to those of you who have formal programming experience: the cost of K Nearest Neighbors at test time scales linearly with the amount of data in your training set, at least when neighbors are found by comparing the test point against every training point. In other words, as the size of the training dataset grows, so too does the time it takes to predict new values using the model.

Now, students who are familiar with algorithms may think that running in linear time is fairly good; after all, it's the best you could hope for from many common algorithms in Computer Science. However, while this may be acceptable for training your model, prediction should be near instant for use in production code. To this end, KNN simply fails to deliver.
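
The linear scaling can be made concrete by counting distance computations. The sketch below assumes a brute-force search over random 2-D points; `nearest_neighbor_brute_force` is a hypothetical helper written for this illustration.

```python
import math
import random

def nearest_neighbor_brute_force(train_points, query):
    """Return (index of nearest training point, number of distances computed)."""
    best_index, best_distance, distance_count = None, math.inf, 0
    for i, point in enumerate(train_points):
        distance = math.dist(point, query)
        distance_count += 1
        if distance < best_distance:
            best_index, best_distance = i, distance
    return best_index, distance_count

rng = random.Random(0)
for n in (1_000, 10_000, 100_000):
    train = [(rng.random(), rng.random()) for _ in range(n)]
    _, count = nearest_neighbor_brute_force(train, (0.5, 0.5))
    print(f"{n:>7} training points -> {count:>7} distance computations")
```

Every single prediction touches every training point. Spatial indexes such as KD-trees can reduce this cost when the number of dimensions is low (scikit-learn's `KNeighborsClassifier` exposes an `algorithm` parameter for this purpose), but the benefit tends to fade as dimensionality grows.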

#### Why Decision Trees and Sparsity

NEEDS TO BE LOOKED INTO

#### The Best Way to Tell Is to Test It Out

All in all, it proves quite difficult to keep track of every model and its respective shortcomings. Therefore, there may be times when you'd like to use a model without knowing exactly how well it will work. In those cases, the best way to tell whether a model does or doesn't work is simply to test it out and see.
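
One common way to "test it out" is k-fold cross-validation: hold part of the data back, fit on the rest, score on the held-out part, and repeat. The sketch below is a minimal standard-library version for illustration; the synthetic data, the two candidate models, and the helper names (`k_fold_indices`, `fit_mean`, `fit_line`, `cross_val_mse`) are all assumptions made for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """One-feature ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return lambda x: mean_y + slope * (x - mean_x)

def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]  # synthetic, for illustration only

for name, fit in [("mean baseline", fit_mean), ("linear regression", fit_line)]:
    print(f"{name}: held-out MSE = {cross_val_mse(fit, xs, ys):.3f}")
```

The model with the lower held-out error is the better candidate for this data. In practice you would rarely write this loop yourself: `cross_val_score` from `sklearn.model_selection` (imported as `ms` at the top of these notes) performs the same procedure for scikit-learn estimators.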

### Things to consider when choosing appropriate models

#### The Curse of Dimensionality

The Curse of Dimensionality sounds quite ominous, and has struck despair into the hearts of many a data scientist. In short, the Curse is a catch-all term for the problems which arise from adding too many dimensions to your data:

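* Sparsity Increases - a fixed number of samples is spread over an ever larger space, so points end up far from one another.
* Combinatorial Explosion - the number of possible combinations of feature values grows exponentially with the number of features.
* Difficult to Identify Anomalous Data - when every point is far from every other point, outliers no longer stand out.
* Increase in Data Requirements - keeping the same sample density requires far more data with each added dimension.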
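
The first of these effects is easy to observe directly. The sketch below assumes points drawn uniformly at random from the unit hypercube and measures how far, on average, each point is from its nearest neighbor as dimensions are added; `mean_nearest_neighbor_distance` is a hypothetical helper for this illustration.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each point to its nearest neighbor,
    for points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

for dims in (1, 2, 10, 50):
    d = mean_nearest_neighbor_distance(200, dims)
    print(f"{dims:>2} dimensions: mean nearest-neighbor distance = {d:.3f}")
```

With the number of samples held fixed at 200, the nearest neighbor drifts further away as dimensions are added, which is exactly the situation in which distance-based models such as KNN struggle.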

#### Size and Complexity Constraints

This note should not be surprising. Different models take different amounts of space and time based on a number of factors, so one must be quite careful when working in a situation where memory or time is constrained. Though a particular model may work better in a certain situation, one may sometimes need to choose the smaller, quicker option if the constraints demand it.

#### Underlying Structure

This is another point we have touched on briefly in this week's notes. One should consider the possibility that there is an underlying structure or trend to the data. Some models perform very well, or exclusively, in the presence of such structure, while others tend to perform well in more general scenarios. Therefore, one cannot neglect looking at and understanding the data and its implications. Doing so could save serious headaches later on, and leads to more effective, better understood models.

#### Available Data

This point is simple, but certainly not one that should be overlooked. Though the amount of data available to data scientists is growing day by day, one needs to be aware of what one is working with on a case-by-case basis. Certain models, such as those trained with Stochastic Gradient Descent, tend to require quite a large amount of data in order to function properly. Being aware of the amount and quality of your data, as well as the implications of these constraints, is necessary.

### Effectively Tuning Your Models

#### Why We Need To

#### How We Go About It

#### Approaching the Answer