# Lecture 5: Machine Learning Basics

## Agenda: 
1. What is machine learning?
2. How does machine learning work?
3. Scikit-learn
4. NumPy

## 1. What is machine learning?
Machine learning fits a model to a set of **n observations (also known as samples, examples, instances, or records)** and then uses that model to predict **properties** of new, unseen data.

                                  
|![Figure 1: Machine Learning](ML_training.png)|
|-----------------------------|
|Figure 1. Machine Learning|

### Statistics and ML

1. Statistics is a subfield of mathematics, while ML is a subfield of computer science that grew out of AI with a focus on learning from data.

2. ML started to flourish as a separate field in the 1990s, when it shifted its focus toward methods borrowed from statistics and probability theory.

3. Statistics and ML are closely related in their methodological principles but differ in their primary goals:
    - ML concentrates on prediction, aiming to identify the best course of action with little or no understanding of the underlying mechanism. It is typically used for complex relationships and large data sets.
    - Statistics focuses on inference: it models the data-generating process to formalize understanding (although statistical models can make predictions as well). It has traditionally been used for small data sets.


### Two main categories of ML
1. Supervised learning, in which the data comes with additional ***labels/attributes that we want to predict***. This problem can be either: 
    1. Classification: the desired output consists of a finite number of **discrete categories** 
        1. Examples: handwritten digit recognition, Iris classification and spam or ham email classification
    2. Regression: the desired output consists of one or more **continuous variables**
        1. Example: predicting students' final scores (0-100) from their homework grades
![Figure 3: Machine Learning](handwritten.png)
2. Unsupervised learning, in which the training data consists of a set of input vectors x **without any corresponding target labels**. 
    1. Clustering: discover groups of similar examples within the data, e.g., group shoppers with similar behavior
![Figure 3: Machine Learning](clusters.png)
    2. Density estimation: determine the distribution of the data
    3. Dimensionality reduction: project the data from a high-dimensional space down to a lower-dimensional one
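
The unsupervised tasks listed above can be sketched with scikit-learn. This is a minimal illustration, not a recommended pipeline: the choice of the Iris inputs, `KMeans` with 3 clusters, and `PCA` with 2 components are all assumptions made for the example. Note that the labels are never passed to either estimator.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load only the inputs; the labels are deliberately ignored.
X = load_iris().data  # 150 samples, 4 features

# Clustering: ask for 3 groups (an assumption; in practice the
# right number of clusters is usually unknown).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)  # one cluster index per sample

# Dimensionality reduction: project the 4 features down to 2.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("cluster ids shape:", cluster_ids.shape)
print("reduced data shape:", X_2d.shape)
```

The cluster indices are arbitrary group IDs; they are not guaranteed to line up with the species labels stored in the dataset.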

## 2. How does machine learning work?
Take supervised learning as an example:
1. First, train a machine learning model using labeled data
    1. Labeled data pairs each input with a label (the desired output)
    2. The model learns the relationship between the input data and the output (labels)
2. Then make predictions on new data that was not used to train the model
    1. The primary goal of machine learning is to build a model that generalizes to new data
    
![Figure 2: Machine Learning](ML_tt.png)
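
The two steps above can be sketched as follows. This is one possible illustration under stated assumptions: the Iris dataset, a 70/30 split, and a k-nearest-neighbors classifier are choices made for the example, and any other scikit-learn classifier could take its place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: train the model on labeled data.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Step 2: predict on new data and measure how well the model generalizes.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # fraction of correct predictions
print("test accuracy:", accuracy)
```

Accuracy measured on the held-out samples estimates how the model would behave on data it has not seen, which is the goal stated above.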

## 3. Scikit-learn

1. Learn the basics from "An introduction to machine learning with scikit-learn" in the tutorials at https://scikit-learn.org/stable/tutorial/index.html
2. Use the datasets bundled with scikit-learn

In [1]:
# import the scikit-learn package
import sklearn as sk # importing the package runs its __init__.py first
print('sklearn version:', sk.__version__)

# check scikit-learn folder: C:\Program Files\Anaconda3\Lib\site-packages\sklearn 

sklearn version: 0.24.1


In [2]:
# explore iris dataset
import sklearn.datasets as ds
iris = ds.load_iris()
dir(iris) # show all keys

['DESCR',
 'data',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [3]:
# access the data in iris using keys
type(iris)
# iris['data']: input data, shape (# of samples, # of features)
# iris['target']: labels

sklearn.utils.Bunch

In [4]:
# access the data using attributes
ftrs = iris.data # both are NumPy arrays
labels = iris.target

In [5]:
ftrs.shape # size along each dimension: (# of samples, # of features)

(150, 4)

In [6]:
# get access to the last feature vector/sample
ftrs[-1]

array([5.9, 3. , 5.1, 1.8])
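
The remaining keys shown by `dir(iris)` can be explored the same way. The snippet below is a small sketch that prints the feature and class names, counts the samples per class, and translates the integer label of the last sample into its class name; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.feature_names)  # names of the 4 columns in iris.data
print(iris.target_names)   # names of the 3 classes encoded in iris.target

# Count how many samples belong to each class.
classes, counts = np.unique(iris.target, return_counts=True)
for c, n in zip(classes, counts):
    print(iris.target_names[c], n)

# Translate the integer label of the last sample into its class name.
last_label = iris.target[-1]
last_name = iris.target_names[last_label]
print(iris.data[-1], "->", last_name)
```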

## 4. Practice NumPy arrays

1. Learn the NumPy array: https://numpy.org/doc/stable/user/quickstart.html
  - NumPy is the fundamental package for scientific computing in Python.
  - NumPy arrays are more efficient than Python’s built-in data structures, e.g., lists.
  - Create a NumPy array using numpy.array()
  - Data inside an array must be of the same data type
  - Arrays support element-wise operations (arithmetic operators do not act element-wise on lists)
  
  
2. Functions and Methods: concatenate, diagonal, dsplit, dstack, hsplit, hstack, newaxis, ravel, repeat, reshape, resize, squeeze, swapaxes, take, transpose, vsplit, vstack

3. Ordering: argmax, argmin, argsort, max, min, searchsorted, sort

4. Math and statistics: cov, mean, median, std, var, all, any, inner, invert, maximum, minimum, nonzero, outer, prod, round, sum, trace
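
A short sketch of the points in the list above: array creation, the single data type, element-wise arithmetic, and a few of the listed manipulation, ordering, and statistics functions. The values are made up for illustration.

```python
import numpy as np

# Create an array from a list; mixed ints and floats are upcast to one dtype.
a = np.array([3, 1.5, 2])
print(a.dtype)                    # float64

# Element-wise arithmetic works directly on arrays.
b = np.array([10, 20, 30])
print(a * b)                      # element-wise products: 30, 30, 60
print(a + 1)                      # adds 1 to every element

# The same operator means something else for lists: repetition.
print([1, 2] * 2)                 # [1, 2, 1, 2]

# Shape manipulation.
m = np.arange(6).reshape(2, 3)    # rows: [0 1 2] and [3 4 5]
print(m.T.shape)                  # (3, 2)
print(m.ravel())                  # flattened back to 0..5
print(np.vstack([m, m]).shape)    # (4, 3)

# Ordering.
order = np.argsort(a)             # indices that would sort a: 1, 2, 0
print(np.sort(a), order, a.argmax())

# Statistics.
col_sums = m.sum(axis=0)          # column sums: 3, 5, 7
print(col_sums, m.mean())         # the mean of 0..5 is 2.5
```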



Time comparison between NumPy arrays and Python lists

In [29]:
# time taken for element-wise multiplication
import numpy
import time

sz = 1000000

# two lists (range() alone is lazy in Python 3, so materialize it)
l1 = list(range(sz))
l2 = list(range(sz))

# two numpy arrays
a1 = numpy.arange(sz)
a2 = numpy.arange(sz)

# initial time stamp for list computation
initialTime = time.time()
# perform element-wise multiplication
results = [(a * b) for a, b in zip(l1, l2)]
# execution time
print("Time taken by two lists:",
      (time.time() - initialTime), "seconds")

# initial time stamp for numpy computation
initialTime = time.time()
results = a1 * a2
# execution time
print("Time taken by NumPy arrays:",
      (time.time() - initialTime), "seconds")

Time taken by two lists: 0.12686777114868164 seconds
Time taken by NumPy arrays: 0.012990713119506836 seconds
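
A single measurement like the one above can be noisy. The sketch below repeats the same comparison with the standard-library `timeit` module and keeps the fastest of several runs. The helper names `multiply_lists` and `multiply_arrays` and the repeat count are assumptions made for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

sz = 1_000_000

l1 = list(range(sz))
l2 = list(range(sz))

# int64 is requested explicitly so the largest product (about 1e12)
# cannot overflow on platforms whose default integer is 32-bit.
a1 = np.arange(sz, dtype=np.int64)
a2 = np.arange(sz, dtype=np.int64)


def multiply_lists():
    """Element-wise product of two lists with a comprehension."""
    return [a * b for a, b in zip(l1, l2)]


def multiply_arrays():
    """Element-wise product of two NumPy arrays."""
    return a1 * a2


# Run each version several times and keep the fastest run,
# which is the least affected by other activity on the machine.
t_list = min(timeit.repeat(multiply_lists, number=1, repeat=5))
t_array = min(timeit.repeat(multiply_arrays, number=1, repeat=5))

print(f"lists:  {t_list:.4f} s")
print(f"arrays: {t_array:.4f} s")
print(f"speed-up: {t_list / t_array:.1f}x")
```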


In [None]:
# more practice on NumPy





