# Chapter 1

Machine Learning is about extracting knowledge from data.

It is a research field at the intersection of statistics, artificial intelligence, and computer science.

## Supervised Learning

_Automate decision-making processes by generalizing from known examples._

The user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output given an input. In particular, the algorithm is able to produce an output for an input it has never seen before, without any help from a human.

Example: The user provides the algorithm with a large number of emails (the inputs), together with information about whether each of these emails is spam (the desired outputs). Given a new email, the algorithm can then predict whether it is spam.

It is called supervised learning because the user, like a teacher, provides supervision to the algorithm in the form of the desired output for each example it learns from.

Examples: 
 - Identifying the ZIP code from handwritten digits on an envelope.
 - Determining whether a tumor is benign based on a medical image.
 - Detecting fraudulent activity in credit card transactions.
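
The spam example above can be sketched in a few lines. This is only an illustration under stated assumptions: the two features (number of links and number of exclamation marks per email) and all of the values are made up, and a 1-nearest-neighbor classifier from scikit-learn stands in for whatever algorithm a real spam filter would use.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training inputs: [number of links, number of exclamation marks]
X_train = [[0, 0], [1, 0], [0, 1], [8, 5], [7, 6], [9, 4]]
# Desired outputs supplied by the user: 0 = not spam, 1 = spam
y_train = [0, 0, 0, 1, 1, 1]

# Learn from the known input/output pairs
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Two emails the model has never seen before
X_new = [[6, 5], [1, 1]]
predictions = model.predict(X_new)
print(predictions)  # first email is predicted as spam (1), second as not spam (0)
```

The supervision is the `y_train` list: without it, the model would have no notion of what "spam" means.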

## Unsupervised Learning

In unsupervised learning, only the input data is known, and no output data is given to the algorithm. While there are many successful applications of these methods, they are usually harder to understand and evaluate.

Examples:
 - Identifying topics in a set of blog posts.
 - Segmenting customers into groups with similar preferences.
 - Detecting abnormal access patterns to a website.
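
As a rough sketch of the customer segmentation example, the snippet below groups a handful of customers using k-means clustering from scikit-learn. The features (annual spend and visits per month), the values, and the choice of two clusters are all assumptions made for illustration. Note that no desired outputs are passed to the algorithm.

```python
from sklearn.cluster import KMeans

# Hypothetical inputs: [annual spend, visits per month]; no labels are given
customers = [
    [200, 2], [220, 3], [210, 2],      # occasional shoppers
    [900, 12], [950, 11], [920, 13],   # frequent shoppers
]

# Ask for two groups; the algorithm decides which customer goes where
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(customers)
print(groups)
```

The cluster numbers themselves are arbitrary identifiers. What matters is which customers end up in the same group, and judging whether that grouping is meaningful is left to the user, which is part of why these methods are harder to evaluate.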


## Questions to Ask Yourself When Building a Machine Learning Solution

- What question am I trying to answer?
- Can the data collected answer that question?
- Have I collected enough data to represent the problem I want to solve?
- What features of the data did I extract? Will these enable the right predictions?
- How will I measure success in my application?
- How will the machine learning solution interact with other parts of my research or business product?

## First Application: Classifying Iris Species

We want to distinguish between the species of some iris flowers.  
We have measurements for each iris: the length and width of the petals and the length and width of the sepals, all in centimeters. We also have measurements of some irises that an expert botanist has previously identified as belonging to the species setosa, versicolor, or virginica.

![Iris Flower Parts](Assets/Images/Chapter-1/iris_parts.png)

Since we have measurements for which the correct species is known, this is a supervised learning problem.  
We want to predict one of several options, which makes this a _classification_ problem.  
The possible outputs (species) are called _classes_. Every iris in the dataset belongs to one of three classes, so this is a three-class classification problem. The desired output for a single data point (an iris) is the species it belongs to, and is called its _label_.

### Meet the Data

The data used in this example is the Iris dataset, a classic dataset in machine learning and statistics.  
It is included in scikit-learn in the datasets module.

In [2]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [4]:
print(f'Keys of iris_dataset:\n{iris_dataset.keys()}')

Keys of iris_dataset:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [6]:
print(iris_dataset['DESCR']+'\n...')

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [10]:
print(f"Target Names: {iris_dataset['target_names']}")

Target Names: ['setosa' 'versicolor' 'virginica']


In [11]:
print(f"Feature names: {iris_dataset['feature_names']}")

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The data itself is stored in the data and target fields. The data field holds the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array (numpy.ndarray), while the target field holds the species of each flower.

In [14]:
print(f"Type of Data: {type(iris_dataset['data'])}")

Type of Data: <class 'numpy.ndarray'>


The data array has 150 rows, one per flower, and 4 columns, one per measurement.

In [15]:
print(f"Shape of data: {iris_dataset['data'].shape}")

Shape of data: (150, 4)
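
To make the row and column layout concrete, here is a small sketch that pairs the first row of the data array with the feature names. It reloads the dataset so that it runs on its own. The variable names `n_samples`, `n_features`, and `first_flower` are chosen here for illustration.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset["data"]

# Each row is one flower (a sample); each column is one measurement (a feature)
n_samples, n_features = X.shape
print(f"{n_samples} samples, {n_features} features")

# Pair each feature name with the corresponding value for the first flower
first_flower = dict(zip(iris_dataset["feature_names"], X[0]))
for name, value in first_flower.items():
    print(f"{name}: {float(value)}")
```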


In [17]:
print(f'Data:\n{iris_dataset["data"]}')

Data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 

In [23]:
print(f'Type of Target: {type(iris_dataset["target"])}')

Type of Target: <class 'numpy.ndarray'>


In [25]:
print(f'Shape of target: {iris_dataset["target"].shape}')

Shape of target: (150,)
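
The target array holds one integer label per flower. As a minimal sketch (again reloading the dataset so the snippet stands alone), the labels can be counted per class and translated back into species names by indexing into target_names. The variable names `counts` and `species` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset["target"]

# Count how many flowers carry each integer label (0, 1, 2)
counts = np.bincount(target)
print(f"Flowers per class: {counts}")

# Use the integer labels as indices into target_names to get species names
species = iris_dataset["target_names"][target]
print(f"First flower: {species[0]}, last flower: {species[-1]}")
```

The counts agree with the dataset description above, which lists 50 instances in each of the three classes.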
