<h2>Lecture 1 Outline</h2>

<ul>
    <li>What is Machine Learning?</li>
    <li>Types of Machine Learning</li>
    <li>Supervised Learning</li>
    <li>Machine Learning Project Roadmap</li>
    <li>Getting started with the iris dataset</li>
</ul>

<h2>What is Machine Learning?</h2>

<img src="machine-learning2.jpg" alt="">

<h3>Some Definitions</h3>

<p>There is no single textbook definition, but here are some definitions I find useful:</p>

<ul>
    <li>"We view the knowledge discovery process as a set of various activities for making sense of data. At the core of this
         process is the application of data mining methods for pattern discovery" - KDD paper</li>
    <li>"We define machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!)" - Machine Learning: A Probabilistic Perspective</li>
</ul>

<p>In both definitions above, the word 'data' seems important. It seems that the goal of ML is to 'learn' patterns from data. But what exactly do we mean by 'learning'? Here is another useful definition:</p>

<p>“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” (Mitchell, 1997)</p>

<h3>Some Useful Terms From Mitchell's Definition</h3>

<p>In a loose sense, task T refers to a problem that is too difficult to solve with a fixed program designed and written by a human being. Examples: recognizing a cat in a photo, predicting the price of bitcoin, and so on. Generally, we want our program to 'adapt'. In supervised machine learning, the two main categories of tasks are regression tasks and classification tasks (more on this under Supervised Learning).</p>

<p>Experience E refers to data or, more precisely, a dataset. A dataset is a collection of examples concerning the task to be solved. To get the computer to recognize a cat in an image, we have to show it examples of what cats look like.</p>

<p>In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Performance measure P is usually specific to the task at hand. In our cat program, we might want to keep track of the misclassification rate. In our bitcoin task, we might be interested in some 'distance' between the actual price and the predicted price. All of this takes place within the context of 'training'.</p>
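
<p>As a rough sketch of what a performance measure P can look like, the snippet below computes a misclassification rate for the cat task and a mean absolute error for the price task. All labels and prices here are made-up numbers for illustration, and mean absolute error is only one possible choice of 'distance'.</p>

```python
import numpy as np

# Hypothetical cat classifier results: 1 = cat, 0 = not a cat (made-up labels)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Misclassification rate: fraction of examples the program got wrong
error_rate = np.mean(y_true != y_pred)
accuracy = 1.0 - error_rate

# Hypothetical actual vs. predicted prices (made-up numbers)
price_true = np.array([100.0, 102.0, 98.0, 105.0])
price_pred = np.array([101.0, 100.0, 99.0, 109.0])

# Mean absolute error: average 'distance' between actual and predicted price
mae = np.mean(np.abs(price_true - price_pred))

print(error_rate, accuracy, mae)  # 0.25 0.75 2.0
```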

<h2>Types of Machine Learning</h2>

<p>There are three broad categories:</p>

<img src="Types of Learning.png" alt="">

<h2>Supervised Learning</h2>

<p>Supervised learning means making predictions using data. Some examples:</p>

<ul>
    <li>Predicting who survived the sinking of the Titanic</li>
    <li>Is a given email "spam" or "ham"?</li>
    <li>What is the price of bitcoin?</li>
</ul>

<p>In each case there is an outcome we are trying to predict. The first two examples are classification tasks, while the last one is a regression task.</p>

<h3>How does Supervised Learning work?</h3>

<ol>
    <li>Train a machine learning model using labeled data
        <ul>
            <li>"labeled data" means you know the outcomes or target</li>
            <li>"machine learning model" learns the relationship between the features of the data and its outcome.</li>
        </ul>
    </li>
    <li>
        Make predictions on new data for which the label is unknown.
        <img src="ML-chart2.png" alt="">
    </li>
</ol>

<p>In simple terms, our model learns from some examples (training data) and then applies what it has learned to future data (test data). When it does this well, we say the model "generalizes". For example, we can train a model to differentiate between spam email and "ham" using a set of emails already known to be spam or ham (labeled data), so that when a new email arrives the model can tell whether it is spam or ham.</p>
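
<p>The two steps above can be sketched with a toy spam example. The features (number of links and number of exclamation marks per email) and all the values are invented for illustration, and K Nearest Neighbors is just one classifier that could be used here.</p>

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up labeled emails: [number of links, number of exclamation marks]
X_train = [[8, 5], [7, 6], [9, 4], [0, 1], [1, 0], [0, 0]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham (the known labels)

# Step 1: train a model on the labeled data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Step 2: predict labels for new emails whose label is unknown
X_new = [[6, 5], [1, 1]]
predictions = model.predict(X_new)
print(predictions)  # [1 0]
```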

<h2>Machine Learning Roadmap</h2>

<p>What does a typical ML pipeline look like?</p>

<img src="Pipeline2.jpg" alt="">

<h2>Getting Started with Iris Dataset</h2>

<p>This is going to be a tutorial-style class. I will explain concepts and terms as we use them, and I will try to emphasize things that help in practice and on real datasets.</p>

<h3>What is the Iris Dataset?</h3>

<p>In general, most datasets tend to be in CSV format (at least those in this bootcamp). Datasets can also be images or sounds, depending on the task you are trying to solve.</p>
<p>Irises are a kind of flower. The iris dataset contains 50 samples (observations) of each of 3 different species of iris, 150 samples in total. The data attributes are measurements: sepal length, sepal width, petal length, and petal width.</p>

<h3>How do we perform Machine Learning on the Iris Dataset?</h3>

<p>Framed as a supervised learning problem, we want to predict the species of an iris from its measurements. P.S. Before you start working on a dataset, always check the literature or available solutions for the task. Read up as much as possible or work alongside a domain expert.</p>


<h3>Loading the Iris Dataset</h3>

In [2]:
# import iris dataset from the dataset module
from sklearn.datasets import load_iris
from sklearn import datasets
import numpy as np

In [3]:
# Check the documentation of the datasets module
help(datasets)

Help on package sklearn.datasets in sklearn:

NAME
    sklearn.datasets

DESCRIPTION
    The :mod:`sklearn.datasets` module includes utilities to load datasets,
    including methods to load and fetch popular reference datasets. It also
    features some artificial data generators.

PACKAGE CONTENTS
    _svmlight_format
    base
    california_housing
    covtype
    kddcup99
    lfw
    mldata
    olivetti_faces
    openml
    rcv1
    samples_generator
    setup
    species_distributions
    svmlight_format
    tests (package)
    twenty_newsgroups

FUNCTIONS
    clear_data_home(data_home=None)
        Delete all the content of the data home cache.
        
        Parameters
        ----------
        data_home : str | None
            The path to scikit-learn data dir.
    
    dump_svmlight_file(X, y, f, zero_based=True, comment=None, query_id=None, multilabel=False)
        Dump the dataset in svmlight / libsvm file format.
        
        This format is a text-based format, wit

In [4]:
# Load the iris dataset; load_iris() returns a dictionary-like object
iris = load_iris()

In [5]:
# We can go ahead and print
print(iris)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [6]:
# Since iris is an object, we can try and see its attributes to understand what we have
print(iris.data)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

In [7]:
# Let's get the description of our data. P.S. DESCR is only available for datasets bundled with scikit-learn, not for real-life projects
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

<h3>Machine Learning Terminology</h3>

<ul>
    <li>Each row is an <strong>observation</strong> (also known as sample, instance, record, example)</li>
    <li>Each column is a <strong>feature</strong> (also known as predictor, attribute)</li>
</ul>
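
<p>To make the row/column terminology concrete, here is a small sketch that pulls one observation and one feature out of the feature matrix using NumPy indexing. The variable names <code>first_observation</code> and <code>sepal_length</code> are chosen here for illustration.</p>

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# One row = one observation: the four measurements of a single flower
first_observation = X[0]

# One column = one feature: sepal length (cm) for every flower
sepal_length = X[:, 0]

print(first_observation)   # [5.1 3.5 1.4 0.2]
print(sepal_length.shape)  # (150,)
```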

In [8]:
# print names of features
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [9]:
# Print out the species codes: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


<ul>
    <li>Each value we are predicting is the <strong>response</strong> (also known as: target, outcome, label)</li>
    <li><strong>Classification</strong> is supervised learning in which the response is categorical</li>
    <li><strong>Regression</strong> is supervised learning in which the response is continuous</li>
</ul>
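
<p>Since the response here is categorical, the integers in <code>iris.target</code> are just codes for species. A short sketch of how to map them back to names using the <code>target_names</code> attribute, and how to count the observations per class:</p>

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the classes, in the order of their integer codes
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']

# Translate every integer code into its species name
species = iris.target_names[iris.target]
print(species[0], species[-1])  # setosa virginica

# Count how many observations belong to each class
counts = np.bincount(iris.target)
print(counts)  # [50 50 50]
```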

<h3>Things to note while working with scikit-learn</h3>

<ol>
    <li>Features and response are separate objects</li>
    <li>Features and response should be numeric</li>
    <li>Features and response should be NumPy arrays with specific shapes: a 2D feature matrix of shape (n_samples, n_features) and a 1D response vector of shape (n_samples,)</li>
</ol>

In [10]:
# we can check type of data and target
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [11]:
# check shapes
print(iris.data.shape)
print(iris.target.shape)

(150, 4)
(150,)


In [12]:
# Store features(matrix) in X
X = iris.data

# Store response (vector) in y
y = iris.target

In [14]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

<h3>Training a Machine Learning Algorithm</h3>

<p>In scikit-learn, there are four key steps to training (modeling):</p>

<p><strong>Step 1: </strong>Import the class you want to use.</p>

In [21]:
# Here we will be using an algorithm called K Nearest Neighbors, plus a decision tree for comparison
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

<p><strong>Step 2: </strong>Instantiate the model (estimator)</p>

In [22]:
# We can set "hyperparameters" at this stage. Setting hyperparameters requires knowledge about the algorithm
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier()
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
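
<p>As a sketch of what setting hyperparameters looks like, the estimators below override some of the defaults printed above. The particular values (1 neighbor, 10 distance-weighted neighbors) are arbitrary choices for illustration, not recommendations.</p>

```python
from sklearn.neighbors import KNeighborsClassifier

# Override the default n_neighbors=5 with a single neighbor
knn_1 = KNeighborsClassifier(n_neighbors=1)

# Use 10 neighbors and weight each neighbor by the inverse of its distance
knn_weighted = KNeighborsClassifier(n_neighbors=10, weights='distance')

# get_params() returns the hyperparameters of an estimator as a dict
print(knn_1.get_params()['n_neighbors'])     # 1
print(knn_weighted.get_params()['weights'])  # distance
```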


In [16]:
help(KNeighborsClassifier)

Help on class KNeighborsClassifier in module sklearn.neighbors.classification:

class KNeighborsClassifier(sklearn.neighbors.base.NeighborsBase, sklearn.neighbors.base.KNeighborsMixin, sklearn.neighbors.base.SupervisedIntegerMixin, sklearn.base.ClassifierMixin)
 |  Classifier implementing the k-nearest neighbors vote.
 |  
 |  Read more in the :ref:`User Guide <classification>`.
 |  
 |  Parameters
 |  ----------
 |  n_neighbors : int, optional (default = 5)
 |      Number of neighbors to use by default for :meth:`kneighbors` queries.
 |  
 |  weights : str or callable, optional (default = 'uniform')
 |      weight function used in prediction.  Possible values:
 |  
 |      - 'uniform' : uniform weights.  All points in each neighborhood
 |        are weighted equally.
 |      - 'distance' : weight points by the inverse of their distance.
 |        in this case, closer neighbors of a query point will have a
 |        greater influence than neighbors which are further away.
 |      - [ca

<p><strong>Step 3: </strong> Fit your model with data.</p>

<ul>
    <li>This is the training step</li>
    <li>Model learns the relationship between X and y</li>
</ul>

In [25]:
# Fit both models; knn is fitted last so the cell output shows the KNN estimator
tree.fit(X, y)
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

<p><strong>Step 4: </strong>Predict the response for a new observation</p>

<ul>
    <li>New observations are called "out-of-sample" data</li>
    <li>The model uses the information it learned during the training process</li>
</ul>
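
<p>A minimal sketch of predicting for observations the model has never seen. The measurements in <code>X_new</code> are made up for illustration. Note that <code>predict</code> expects a 2D array with one row per observation, even when there is only one observation.</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(iris.data, iris.target)

# Two made-up flowers: sepal length, sepal width, petal length, petal width (cm)
X_new = np.array([[5.0, 3.5, 1.5, 0.25],
                  [6.5, 3.0, 5.5, 2.0]])

# One predicted class code per row of X_new
pred = knn.predict(X_new)
print(pred)

# Map the integer codes back to species names
print(iris.target_names[pred])
```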

In [24]:
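# For illustration we predict on X, the same data used for training (in-sample),
# rather than on truly new observations. Only the last result is displayed.
knn.predict(X)
tree.predict(X)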

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
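
<p>Predicting on the training data says little about how a model handles out-of-sample data. One common way to estimate that is to hold out part of the labeled data as a test set. The sketch below uses scikit-learn's <code>train_test_split</code>; the 70/30 split and <code>random_state=0</code> are arbitrary choices made here for illustration.</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the labeled data; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Train only on the training portion
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# score() returns accuracy: the fraction of correct predictions
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
print(train_acc, test_acc)
```

<p>The test accuracy is the better indicator of how well the model generalizes, because the model never saw those observations during training.</p>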