### Introduction

Because there are so many algorithms, deciding which one to use can be daunting for a newcomer building a machine learning model.

Luckily, the kind folks behind Scikit-Learn have provided a nice map to help you get started: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


![algorithms](https://scikit-learn.org/stable/_static/ml_map.png)

The algorithms are shown as green boxes.

If we ignore dimensionality reduction for now, we can see three approaches: Classification, Clustering and Regression.

NOTE: Dimensionality reduction is a more advanced topic and will not be covered in this course.

<div class="alert-info">
    
---
_**DEFINITION: CLASSIFICATION**_
   
_Classification is used to determine the class of an object. To train the model, we require labelled data. In the Iris example, the class of Iris was the label._
    
---
</div>
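
As a rough illustration of the definition above, the sketch below trains a classifier on the labelled Iris data and predicts the class of flowers it has not seen. The choice of `KNeighborsClassifier`, the 25% test split and `random_state=0` are assumptions made for this example only; any classifier from the map could be used in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the labelled Iris data: measurements (features) and class (label).
iris = load_iris()

# Hold back a quarter of the samples to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Train on the labelled samples (classifier choice is an assumption).
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict a class for each unseen flower and measure the accuracy.
predicted_classes = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)

print(iris.target_names[predicted_classes[:5]])
print(accuracy)
```

The prediction for each flower is one of a fixed set of classes, which is what makes this a classification problem.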

<div class="alert-info">
    
---
_**DEFINITION: REGRESSION**_
   
_Regression is similar to Classification, but instead of predicting the class of an object, we predict a value on a continuous scale. An example is predicting a salary from features such as age, experience, etc._
    
---
</div>
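
The salary example can be sketched as follows. All numbers are made up for illustration: the ages, years of experience and the formula used to generate the salaries are hypothetical, and `LinearRegression` is just one possible regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: age and years of experience for six people.
ages = np.array([22, 25, 30, 35, 40, 45])
years_experience = np.array([1, 3, 5, 12, 15, 20])
X = np.column_stack([ages, years_experience])

# Hypothetical target: salaries generated from a made-up linear rule.
salaries = 20000 + 500 * ages + 2000 * years_experience

# Fit a regression model to the labelled examples.
reg = LinearRegression()
reg.fit(X, salaries)

# Predict a salary (a continuous value) for a 28-year-old
# with 4 years of experience.
predicted_salary = reg.predict(np.array([[28, 4]]))[0]
print(predicted_salary)
```

Unlike the classifier, the model returns a number on a continuous scale, not one of a fixed set of classes.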

<div class="alert-info">
    
---
_**DEFINITION: CLUSTERING**_
   
_Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)._
    
_Source: https://en.wikipedia.org/wiki/Cluster_analysis_
    
---
</div>
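
A minimal sketch of clustering, again using the Iris measurements but this time without the labels. The algorithm (`AgglomerativeClustering`) and the number of clusters (3) are assumptions chosen for the example; the map lists several other clustering algorithms.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

# Only the measurements are used; the class labels are never shown
# to the algorithm.
iris = load_iris()

# Group the flowers into 3 clusters (the cluster count is an assumption).
clusterer = AgglomerativeClustering(n_clusters=3)
cluster_ids = clusterer.fit_predict(iris.data)

# Each flower receives a cluster number; the numbers carry no meaning
# beyond "these samples were grouped together".
print(cluster_ids[:10])
```

Note that `fit_predict` receives only `iris.data`. The groups are found from similarity alone, so no labelled data is required.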

In the middle of the map, we see a decision point; "Do you have labelled data?".

<div class="alert-info">
    
---
_**DEFINITION: LABELLED DATA**_
   
_Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags._
    
_In the Iris dataset, the class of each flower was the label._
    
_Source: https://en.wikipedia.org/wiki/Labeled_data_
    
---
</div>
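
To make the distinction concrete, the sketch below separates the Iris dataset into its features and its labels. The variable names `features` and `labels` are chosen for this example only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The measurements of each flower: one row per sample, one column per feature.
features = iris.data

# The label attached to each sample: an integer code for the class of Iris.
labels = iris.target

print(features.shape, labels.shape)

# The first sample's measurements, followed by the name of its class.
print(features[0], iris.target_names[labels[0]])
```

If only `features` were available, the data would be unlabelled and the "Do you have labelled data?" decision point on the map would be answered with "no".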

### Exercises

**Exercise 01:** 

- Click the [link](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to open the map on the scikit-learn site.  
- Click on the Linear SVC algorithm - where does it take you?

**Exercise 02:**

- What type of algorithm is KMeans?

**Exercise 03:**

We have a diabetes dataset whose target is a continuous number:

In [1]:
import sklearn.datasets
import pandas as pd
import numpy as np

diabetes = sklearn.datasets.load_diabetes()

# Combine the features and the target into a single DataFrame.
diabetes_df = pd.DataFrame(
    data=np.c_[diabetes['data'], diabetes['target']],
    columns=diabetes['feature_names'] + ['target'],
)
diabetes_df.head(5)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |


The number of records and columns can be found with:

In [2]:
diabetes_df.shape

(442, 11)

Assume only a few of the features will be important.  

- Which algorithm would you choose using the scikit-learn algorithm cheat sheet?

**Click [Next](./07_estimator_parameters.ipynb) to continue with the next notebook.**