# ML History
* ML has been used for decades in some specialized applications, such as optical character recognition (OCR).
* 1990s: The spam filter was the first ML application to become mainstream. It is not a self-aware robot, but it is smart enough to distinguish spam from non-spam email.
* Hundreds of ML applications now quietly power products and features that we encounter regularly: voice prompts, automatic translation, image search, product recommendations, and many more.

# What is ML?
Machine learning is the science (and art) of programming computers so they can __learn__ from data.

A more engineering-oriented definition:

> A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
> 
> — Tom Mitchell, 1997

Example (Flag spam emails):
The task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy, and it is often used in classification tasks.
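
As a minimal sketch of the performance measure P described above, accuracy can be computed as the share of predictions that match the true labels. The six email labels below are made up for illustration, and the function name `accuracy` is an assumption rather than part of any library.

```python
def accuracy(predicted, actual):
    """Return the ratio of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels for six emails: True = spam, False = not spam
actual_labels = [True, False, False, True, False, True]
predicted_labels = [True, False, True, True, False, False]

# Four of the six predictions match the true labels
spam_accuracy = accuracy(predicted_labels, actual_labels)
print(f"accuracy = {spam_accuracy:.2f}")
```

In Mitchell's terms, the program "learns" if this number goes up as it sees more training data E.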



### Without Machine Learning:
![Without Machine Learning](attachment:image-2.png)

### With Machine Learning:
![With Machine Learning](attachment:image.png)


# How to Classify Machine Learning Systems

There are so many different types of machine learning systems that it is useful to classify them in broad categories, based on the following criteria:

* How they are supervised during training (supervised, unsupervised, semi-supervised, self-supervised, and others)

* Whether or not they can learn incrementally on the fly (online versus batch learning)

* Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do (instance-based versus model-based learning)

_________________________________________________________

# Training Supervision

ML systems can be classified by the amount and type of supervision they get during training.

Types: 
1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Self-supervised Learning
5. Reinforcement Learning

### 1. Supervised Learning

1. __Definition__:
Train a machine learning model on labeled instances, i.e. training data that includes the desired output (the label or target)

2. __Typical implementation__:
* Classification: predict a category, e.g. spam or not spam
* Regression: predict a target numeric value

> Fun facts:
> * Target and label are generally treated as synonyms <br>
a. The term _target_ is mostly used in __regression tasks__, where the predicted value is continuous <br>
b. The term _label_ is mostly used in __classification tasks__, where the predicted value is categorical<br><br>
> * Features are the input variables used to train a model. <br>
 a. Features are sometimes called predictors or attributes
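
To make the terms feature and target concrete, here is a small sketch of a supervised regression task: fitting a straight line to labeled examples with ordinary least squares. The house sizes and prices are hypothetical, and `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(features, targets):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(targets) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    variance = sum((x - mean_x) ** 2 for x in features)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: feature = size in square meters,
# target = price in thousands (constructed as exactly 3 * size)
sizes = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]

slope, intercept = fit_line(sizes, prices)

# Predict the target for a new, unseen feature value
predicted_price = slope * 70.0 + intercept
```

Because the toy targets lie exactly on a line, the fit recovers a slope of 3 and an intercept of 0. Real data would be noisy and the fit only approximate.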


### 2. Unsupervised Learning

1. __Definition__: Train a machine learning model on an unlabeled dataset. In other words, the system tries to learn without a teacher
 

2. __Typical Usage__:
#### 1. Clustering algorithm: 

* It detects groups of similar data points. The machine finds the connections between data points without human intervention.
* It is the main task of exploratory data mining and a common technique for statistical data analysis used in many fields including Machine Learning, Pattern Recognition, Image Analysis, Information Retrieval, Bioinformatics, Data Compression, and Computer Graphics.
* It is also often used to visualize datasets.

##### Types of clustering algorithms
* K-Means Clustering:
    * Partitions $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
    * Initially, $k$ centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster.
    * The algorithm iteratively assigns each data point to the nearest centroid and then recalculates the centroids based on the current cluster members.
    * This process repeats until the centroids no longer change significantly, indicating that the clusters are stable.
    * K-Means is efficient for large datasets but can be sensitive to the initial placement of centroids and may converge to a local minimum.

* Centroid-based clustering: 
    * The centroid (center) of a cluster is the mean of all the data points within that cluster.
    * Centroid-based clustering organizes the data into non-hierarchical clusters.
    * It is efficient but sensitive to initial conditions and outliers.
    * E.g.: K-Means, K-Modes <br> <br>
    ![image-2.png](attachment:image-2.png)

* Density-based clustering:
    * Groups contiguous areas of high example density into clusters.
    * Can identify clusters of arbitrary shape, while excluding outliers as noise.
    * Works best on low-dimensional data (e.g., 2D or 3D); it has difficulty with data of varying densities and with high dimensions <br><br>
 ![image.png](attachment:image.png)
  
* Distribution-based clustering:
    * Groups data points based on their likelihood of belonging to the same probability distribution (Gaussian, Binomial, etc.).
    * A cluster includes the data objects that have a higher probability of belonging to it.
    * Each cluster has a central point. The farther a data point is from the central point, the lower its probability of being included in the cluster.
    * When you're not comfortable assuming a particular underlying distribution of the data, you should use a different algorithm. <br><br>
![image-3.png](attachment:image-3.png)


* Hierarchical clustering:
    * Builds a hierarchy of clusters, which is why the result is a tree structure. The tree is usually drawn as a __dendrogram__.
    * The hierarchical clustering technique has two approaches:
        * Agglomerative: a bottom-up approach, in which the algorithm starts by treating every data point as its own cluster and keeps merging clusters until one cluster is left.
        * Divisive: the reverse of the agglomerative algorithm. It is a top-down approach that starts with all data points in one cluster and splits it step by step.
    > How does it work?
    > 1. At the bottom of the dendrogram, each leaf (or terminal node) represents a single data point.
    > 2. As you move upward, clusters that are close to each other (by a distance metric like Euclidean or Manhattan) merge to form larger clusters.
    > 3. The height at which two clusters are joined represents the distance between them.
    > 4. At the top, all data points belong to one single cluster.

![image-4.png](attachment:image-4.png)
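
The K-Means steps listed earlier (assign each point to its nearest centroid, recompute the centroids, repeat until they stop moving) can be sketched in plain Python. This is a minimal illustration: the six points and the hand-picked initial centroids are assumptions chosen so the run is reproducible, and the stopping rule checks for exact equality rather than "no significant change".

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Plain K-Means on a list of coordinate tuples."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place

        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]  # hand-picked starting centroids
centroids, clusters = k_means(points, initial)
```

With a different starting choice the algorithm may settle on different clusters, which is the sensitivity to initial placement noted in the K-Means bullets.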

<br> <br>

Time complexity: $O(n^2)$. Many clustering algorithms compute the similarity between all pairs of examples, which means their runtime increases as the square of the number of examples $n$.
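
A naive agglomerative clustering sketch shows both the bottom-up merging described in the hierarchical clustering section and where the pairwise cost comes from: the very first pass already compares every pair of points. The linkage rule used here (single linkage, the distance between the closest members of two clusters) is an assumption, since the notes do not name one, and the five points are made up.

```python
import math

def agglomerative(points, num_clusters):
    """Naive single-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters until num_clusters remain. Returns (clusters, merge_heights).
    """
    clusters = [[p] for p in points]
    merge_heights = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > num_clusters:
        # Compare every pair of clusters and keep the closest pair
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(d)  # the height of this merge in the dendrogram
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_heights

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, heights = agglomerative(points, num_clusters=3)
```

The recorded merge distances correspond to the heights at which branches join in a dendrogram. Production implementations avoid recomputing all pairwise distances on every merge.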


#### 2. Anomaly detection

Anomaly detection produces a binary output that predicts whether a data point is normal or abnormal. It is commonly used in both supervised and unsupervised learning to detect irregularities in the input data.

##### Types of anomaly:
* Outliers: abnormal or extreme data points that exist within the training data. These data points deviate significantly from other observations but may not necessarily indicate harmful or problematic behavior.
* Novelties: New or previously unseen instances compared to the original training data. They do not conform to the patterns learned during training and often highlight new or unexpected behaviors. Detecting them requires a very “clean” training set, devoid of any instance that you would like the algorithm to detect.
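
One simple way to produce the normal/abnormal output is a z-score rule: learn the mean and standard deviation from a clean training set, then flag new values that lie too many standard deviations away. This is only a sketch of the novelty-detection setting. The threshold of 3 and the response-time numbers are assumptions, and real detectors are usually more sophisticated.

```python
from statistics import mean, stdev

def fit_normal_profile(clean_values):
    """Summarize a clean training set by its mean and sample standard deviation."""
    return mean(clean_values), stdev(clean_values)

def is_anomaly(value, profile, threshold=3.0):
    """Return True when the value's z-score exceeds the threshold."""
    center, spread = profile
    return abs(value - center) / spread > threshold

# Hypothetical clean training data, e.g. response times in milliseconds
clean_training = [100, 102, 98, 101, 99, 100, 103, 97]
profile = fit_normal_profile(clean_training)

# New instances seen after training: only the last one is far from normal
flags = [is_anomaly(v, profile) for v in (101, 96, 130)]
```

If the training set itself contained values like 130, the learned spread would widen and such instances would no longer stand out, which is why the training set needs to be clean.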

<br><br>
Sources:
* https://medium.com/machine-learning-researcher/clustering-k-mean-and-hierarchical-cluster-fa2de08b4a4b
* https://developers.google.com/machine-learning/clustering/clustering-algorithms

### 3. Semi-supervised Learning
Definition: Uses a small amount of labeled data and a large amount of unlabeled data to improve learning accuracy. This approach is useful when labeling data is time-consuming and expensive.
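
A minimal sketch of one semi-supervised approach, self-training (also called pseudo-labeling): train on the few labeled examples, let that model label the unlabeled ones, then retrain on everything. The technique and the nearest-centroid classifier are choices made here for brevity, not something prescribed by the definition above, and the data is hypothetical.

```python
def class_centroids(values, labels):
    """Return the mean feature value of each class."""
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(label, []).append(v)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}

def nearest_label(value, centroids):
    """Return the label whose centroid is closest to value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

# Hypothetical 1-D data: two labeled examples, eight unlabeled ones
labeled_values = [1.0, 9.0]
labeled_classes = ["small", "large"]
unlabeled_values = [0.5, 1.5, 2.0, 2.5, 7.0, 8.0, 8.5, 9.5]

# Step 1: train on the labeled data only
centroids = class_centroids(labeled_values, labeled_classes)

# Step 2: pseudo-label the unlabeled data with the current model
pseudo_labels = [nearest_label(v, centroids) for v in unlabeled_values]

# Step 3: retrain on the labeled and pseudo-labeled data together
centroids = class_centroids(labeled_values + unlabeled_values,
                            labeled_classes + pseudo_labels)
```

After step 3 the class centroids reflect all ten points instead of two, even though only two labels were supplied by hand.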


### 4. Self-supervised Learning

Definition:  
* It is also known as predictive or pretext learning. 
* A machine learning process where the model trains itself to learn one part of the input from another part of the input.
* In this process, the unsupervised problem is transformed into a supervised problem by auto-generating the labels. 
* To make use of the huge quantity of unlabeled data, it is crucial to set the right learning objectives to get supervision from the data itself.

How does it work:
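
One common way to auto-generate labels is to hide part of each input and ask the model to predict it. The sketch below only builds such training pairs from raw text, using the next word as the label. The function name, the context size, and the sentence are illustrative assumptions, and no model is trained here.

```python
def make_next_word_pairs(sentence, context_size=2):
    """Turn raw text into (context, next_word) training pairs."""
    words = sentence.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])  # the visible part of the input
        pairs.append((context, words[i]))           # the hidden part becomes the label
    return pairs

unlabeled_text = "the cat sat on the mat"
pairs = make_next_word_pairs(unlabeled_text)

for context, label in pairs:
    print(context, "->", label)
```

Each pair is an ordinary supervised example, yet no human wrote a label: the supervision comes from the data itself.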


### 5. Reinforcement Learning
Definition: 
* The learning system (called an agent) observes the environment, selects and performs actions, and gets rewards in return for good actions or penalties for bad ones.
* It must then learn by itself what the best strategy is, called a policy, to get the most reward over time.
* A policy defines what action the agent should choose when it is in a given situation.
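
To illustrate the agent, reward, and policy vocabulary, here is a small sketch using tabular Q-learning, one common RL algorithm that the notes above do not name. The environment is a hypothetical five-cell corridor in which the agent starts at cell 0 and receives a reward of 1 on reaching cell 4. All names and parameter values are assumptions.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # move left or move right

def step(state, action):
    """Environment: return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from randomly explored episodes."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore with random actions
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_values = train()

# The policy maps each non-goal state to its highest-valued action
policy = {s: max(ACTIONS, key=lambda a: q_values[(s, a)]) for s in range(N_STATES - 1)}
```

The resulting `policy` dictionary is exactly the "what action to choose in a given situation" mapping: here it should tell the agent to move right in every cell.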

![image.png](attachment:image.png)

# Batch VS Online Learning

Another criterion used to classify machine learning systems is whether or not the system can learn incrementally from a stream of incoming data.

### Batch Learning
* The system is incapable of learning incrementally: it must be trained using all the available data.
* The model is trained on the entire dataset at once or in large chunks (batches).
* Learning happens __offline__, meaning the model does not update dynamically as new data arrives.
* Used in traditional supervised learning and in RL for training models with large datasets.

Common issue: 
A model's performance tends to decay slowly over time because the world evolves while the model remains unchanged. This issue is called _model rot_ or _data drift_.

Solution:
Regularly retrain the model on up-to-date data 
>How often you need to do that depends on the use case: if the model classifies pictures of cats and dogs, its performance will decay very slowly, but if the model deals with fast-evolving systems, for example making predictions on the financial market, then it is likely to decay quite fast.

### Online Learning
* Trains incrementally with incoming data, either individually or in mini-batches.
* Fast and cheap learning steps, allowing real-time updates.
* Ideal for streaming data or systems needing continuous adaptation.
* Can handle large datasets that don't fit in memory (out-of-core learning).
* Example: Updating stock price predictions with new data.

![image.png](attachment:image.png)

* Learning rate determines adaptation speed: high rate adapts quickly but forgets old data; low rate adapts slowly but retains old data.
* Risk: Bad data can degrade performance. Monitor and switch off learning if performance drops. Use anomaly detection to handle abnormal data.
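
The incremental updates described above can be sketched with a one-feature linear model trained by stochastic gradient descent, one instance at a time. The stream generator, the target relationship y = 2x + 1, and the learning rate of 0.1 are all assumptions made for illustration.

```python
def online_update(weight, bias, x, y, learning_rate):
    """One gradient step on the squared error for the model weight * x + bias."""
    error = (weight * x + bias) - y
    weight -= learning_rate * error * x
    bias -= learning_rate * error
    return weight, bias

def data_stream():
    """Hypothetical stream: yields (x, y) pairs one at a time, with y = 2x + 1."""
    for i in range(2000):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0

weight, bias = 0.0, 0.0
for x, y in data_stream():
    # Each instance is used for one cheap update and can then be discarded
    weight, bias = online_update(weight, bias, x, y, learning_rate=0.1)

print(f"weight = {weight:.3f}, bias = {bias:.3f}")
```

A larger `learning_rate` makes each new instance pull the parameters harder, so the model adapts faster but is also more easily thrown off by bad data. A smaller value does the opposite.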

![image-2.png](attachment:image-2.png)





# Instance-based Learning

Definition: 
* A type of learning that memorizes the labeled training examples, then generalizes to new cases by using a similarity measure to compare them to the learned examples <br><br>

Example: <br><br>
![image.png](attachment:image.png)<br><br>
In this case, the new instance would be classified as a triangle because the majority of the most similar instances belong to that class
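
The majority vote among the most similar instances can be sketched as a k-nearest-neighbors classifier. Euclidean distance as the similarity measure, k = 3, and the example points are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_classify(new_point, examples, k=3):
    """Classify new_point by majority vote among its k nearest stored examples."""
    by_distance = sorted(examples, key=lambda ex: math.dist(new_point, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical stored examples: (features, class)
examples = [
    ((1.0, 1.0), "triangle"), ((1.5, 2.0), "triangle"), ((2.0, 1.5), "triangle"),
    ((6.0, 6.0), "square"), ((7.0, 6.5), "square"), ((3.0, 2.5), "square"),
]

# The three nearest examples are two triangles and one square
predicted_class = knn_classify((2.0, 2.0), examples, k=3)
```

Note that there is no training step beyond storing the examples: all the work happens at prediction time.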

> Another way to generalize from a set of examples is to build a model of these examples and then use that model to make predictions. This is called model-based learning <br><br> ![image-2.png](attachment:image-2.png)



_________________________________________________________

# Summary of how to build ML models
1. **Study the Data**: Understand the dataset, identify features, and explore data patterns and distributions.

2. **Select a Model**: Choose an appropriate machine learning model based on the problem type (e.g., regression, classification) and data characteristics.

3. **Train the Model**: Split the data into training and validation sets. Train the model on the training data, optimizing its parameters to minimize a cost function.

4. **Evaluate the Model**: Assess the model's performance on the validation set using relevant metrics (e.g., accuracy, precision, recall).

5. **Tune Hyperparameters**: Adjust the model's hyperparameters to improve performance, using techniques like grid search or random search.

6. **Test the Model**: Evaluate the final model on a separate test set to estimate its generalization performance.

7. **Make Predictions**: Apply the trained model to new, unseen data to make predictions (inference).

8. **Deploy and Monitor**: Deploy the model into production and continuously monitor its performance, retraining as necessary to maintain accuracy.
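
A compact sketch of steps 3 to 6 on synthetic data: split the data, compare a few hyperparameter values on the validation set, then evaluate once on the test set. The dataset, the 60/20/20 split, the k-nearest-neighbors model, and the candidate values of `k` are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter

def knn_predict(x, train, k):
    """Majority label among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(data, train, k):
    """Share of examples in data that the model labels correctly."""
    return sum(knn_predict(x, train, k) == y for x, y in data) / len(data)

# Hypothetical dataset: one feature, label is 1 when the feature exceeds 5
rng = random.Random(42)
data = [(x, int(x > 5.0)) for x in (rng.uniform(0.0, 10.0) for _ in range(100))]

# Step 3: split into training, validation, and test sets (60 / 20 / 20)
rng.shuffle(data)
train, validation, test = data[:60], data[60:80], data[80:]

# Steps 4 and 5: score each candidate hyperparameter on the validation set
candidates = (1, 3, 5)
best_k = max(candidates, key=lambda k: accuracy(validation, train, k))

# Step 6: estimate generalization performance once, on the held-out test set
test_accuracy = accuracy(test, train, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

The test set is touched only once, after the hyperparameter has been chosen, so its score is not biased by the tuning process.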