This Jupyter notebook introduces many fundamental concepts and terms that every data scientist should know by heart. It is a high-level overview meant to give you a basic understanding of, and familiarity with, these terms.

Starting with this class, we will begin exploring the field of Machine Learning. The content in the related Jupyter notebooks will draw extensively from online learning resources. Specifically, the notebooks will rely heavily on the two machine learning books listed in our course outline.

- [Machine Learning with PyTorch and Scikit-Learn, Packt Publishing, Latest Edition](https://learning.oreilly.com/library/view/machine-learning-with/9781801819312/) 

- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O'Reilly Media, Inc., Latest Edition](https://learning.oreilly.com/library/view/hands-on-machinelearning/9781098125967/) 

Some other useful learning resources are also listed below for your reference:

- [Google Machine Learning Crash Course Series](https://developers.google.com/machine-learning)
- [StatQuest Video Series on Statistics and Machine Learning](https://statquest.org/video-index/)
- [Math Monk YouTube Video Series on Machine Learning](https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA)

# The Fundamentals of Machine Learning

Billions of people now rely on machine learning every day, often without noticing it. The field has been around for decades in specialized applications such as optical character recognition (OCR). The first mainstream ML application that significantly improved lives was the spam filter in the 1990s. While not a self-aware robot, it effectively learned to identify spam, reducing the need for manual intervention. Since then, hundreds of ML applications have emerged, quietly powering features we use regularly, such as voice prompts, automatic translation, image search, and product recommendations.

[How OCR works](https://www.youtube.com/embed/jO-1rztr4O0?si=LN7EnhYXnNX8Lvc2)


## What is Machine Learning?

A few great introductory videos to watch:

- [Machine Learning Explained in 100 Seconds](https://www.youtube.com/watch?v=PeMlggyqz0Y)
- [Machine Learning & Artificial Intelligence Crash Course](https://www.youtube.com/watch?v=z-EtmaFJieY)
- [The 7 steps of Machine Learning](https://www.youtube.com/watch?v=nKW8Ndu7Mjw)

### Using a Spam Filter as an Example

#### Without Machine Learning

Without machine learning, you need to work out for yourself the exact rules that detect spam emails. Because the problem is complex, your program will likely end up as a long list of intricate rules that is difficult to maintain.
<img src="images/fig1.png" width="400" height="400">

#### With Machine Learning

In contrast, a spam filter that uses machine learning techniques automatically identifies features that are strong indicators of spam by detecting patterns that are unusually frequent in spam emails compared to non-spam emails.

<img src="images/fig2.png" width="400" height="400">

New training data samples can be added through user feedback, allowing the machine learning model to learn from this data. Consequently, the model can be automatically updated without human intervention.

<img src="images/fig3.png" width="400" height="400">


Let's use the spam filter as an example to explain some key concepts of machine learning.

##### **Data Collection**
For a spam filter, we need a large dataset of emails labeled as "spam" or "not spam." This dataset, called the training set, is what the model learns from. Each training example is called a training instance (or sample).

##### **Feature Extraction**
Features are the pieces of information that the model uses to make decisions. In the case of a spam filter, features might include:
- The presence of certain keywords (e.g., "win," "free," "prize")
- The email's sender
- The presence of links
- The overall frequency of certain words
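The features listed above can be sketched in plain Python. This is only an illustration: the keyword list, the `extract_features` function, and the feature names are assumptions made for this example, not part of any library.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
SPAM_KEYWORDS = ("win", "free", "prize")

def extract_features(sender: str, body: str) -> dict:
    """Turn a raw email into a small dictionary of features."""
    words = re.findall(r"[a-z']+", body.lower())
    # One binary feature per keyword: 1 if the keyword appears, else 0.
    features = {f"has_{kw}": int(kw in words) for kw in SPAM_KEYWORDS}
    # Count links by looking for http:// or https:// prefixes.
    features["num_links"] = len(re.findall(r"https?://", body))
    # Fraction of words that are spam keywords (0.0 for an empty body).
    keyword_hits = sum(1 for w in words if w in SPAM_KEYWORDS)
    features["keyword_ratio"] = keyword_hits / len(words) if words else 0.0
    # Keep the sender's domain as a categorical feature.
    features["sender_domain"] = sender.rsplit("@", 1)[-1].lower()
    return features

example = extract_features(
    "promo@example.com",
    "Win a FREE prize now! Visit http://example.com/claim",
)
print(example)
```

A real spam filter would use far richer features, but the idea is the same: each email becomes a fixed set of values that a model can learn from.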

##### **Training the Model**
The part of a machine learning system that learns and makes predictions is called a model. The training process involves feeding the labeled dataset into a machine learning model. The model uses this data to learn patterns and correlations between the features and the labels (spam or not spam).

##### **Algorithm Selection**
For the machine learning model, there are various algorithms to choose from, such as decision trees, Naive Bayes, or neural networks. For simplicity, let's consider a Naive Bayes classifier, which is often used for spam filtering due to its efficiency and effectiveness.

##### **Model Training**
During training, the algorithm analyzes the features of each email and adjusts its internal parameters to improve its ability to classify emails correctly. For example, it might learn that emails containing the word "win" have a higher probability of being spam.
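To make "adjusting internal parameters" concrete, here is a minimal Naive Bayes sketch written with the standard library only. It is not the scikit-learn implementation used in the course books; the function names and the four training emails are made up for illustration. The learned parameters are simply word counts per class.

```python
import math
from collections import Counter

def train_naive_bayes(emails, labels):
    """Count words per class; these counts are the model's learned parameters."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(emails, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def spam_log_odds(model, text):
    """Return log P(spam | text) - log P(ham | text), with Laplace smoothing."""
    word_counts, class_counts, vocab = model
    total_spam = sum(word_counts["spam"].values())
    total_ham = sum(word_counts["ham"].values())
    score = math.log(class_counts["spam"] / class_counts["ham"])
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen during training
        p_spam = (word_counts["spam"][word] + 1) / (total_spam + len(vocab))
        p_ham = (word_counts["ham"][word] + 1) / (total_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Tiny made-up training set, for illustration only.
train_emails = ["win a free prize", "free money win now",
                "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_emails, train_labels)

print(spam_log_odds(model, "win a prize"))       # positive: leans spam
print(spam_log_odds(model, "meeting tomorrow"))  # negative: leans ham
```

A positive score means the email looks more like the spam examples, and a negative score means it looks more like the non-spam ones.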

##### **Evaluation**
Once trained, the model is evaluated on a separate set of emails that it hasn't seen before (test set). Metrics such as accuracy, precision, and recall are used to assess how well the model performs. For a spam filter, you want high precision (few false positives) and high recall (few false negatives).
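As a small worked example, accuracy, precision, and recall can be computed directly from true and predicted labels. The labels below are invented for illustration, and `precision_recall` is a helper written for this sketch rather than a library function.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up test-set labels: 3 spam emails and 5 non-spam ("ham") emails.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

Here one legitimate email was flagged as spam (a false positive, which lowers precision) and one spam email was missed (a false negative, which lowers recall).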

##### **Prediction**
When a new email arrives, the trained model extracts its features and predicts whether it is spam or not. If the model is confident enough, it can automatically move the email to the spam folder.
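The "confident enough" check can be sketched as a probability threshold. The example assumes the model outputs a log-odds score (positive means spam); the `route_email` function and the 0.9 threshold are hypothetical choices for illustration.

```python
import math

def route_email(spam_score: float, threshold: float = 0.9) -> str:
    """Convert a log-odds score to a probability and pick a folder."""
    probability = 1.0 / (1.0 + math.exp(-spam_score))
    return "spam folder" if probability >= threshold else "inbox"

print(route_email(3.0))   # probability about 0.95, above the threshold
print(route_email(0.5))   # probability about 0.62, below the threshold
```

Raising the threshold trades recall for precision: fewer legitimate emails are moved by mistake, but more spam stays in the inbox.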

##### **Continuous Learning**
Machine learning models can improve over time. The spam filter can continuously learn from new emails, especially those that users manually mark as spam or not spam. This process, known as online learning or incremental learning, helps the model stay up-to-date with evolving spam techniques.
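A minimal sketch of incremental learning, under the assumption that the model's state is a set of per-class word counts: each user-labeled email updates the counts in place, so no full retraining is needed. The class name and the example emails are made up for illustration.

```python
from collections import Counter

class IncrementalWordCounts:
    """Keeps per-class word counts that can be updated one email at a time."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def update(self, text: str, label: str) -> None:
        """Fold one newly labeled email into the existing counts."""
        self.counts[label].update(text.lower().split())

    def spam_ratio(self, word: str) -> float:
        """Fraction of a word's occurrences that came from spam (0.5 if unseen)."""
        spam = self.counts["spam"][word]
        ham = self.counts["ham"][word]
        return spam / (spam + ham) if spam + ham else 0.5

filter_state = IncrementalWordCounts()
filter_state.update("cheap crypto offer", "spam")
before = filter_state.spam_ratio("invoice")  # word not seen yet

# New user feedback arrives over time.
filter_state.update("urgent invoice attached", "spam")
filter_state.update("invoice overdue click here", "spam")
filter_state.update("invoice for march", "ham")
after = filter_state.spam_ratio("invoice")

print(before, after)
```

The estimate for "invoice" moves from the neutral default toward spam as feedback accumulates, which is how the filter can follow evolving spam techniques.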

##### Example Flow:
1. **Data Collection**: Gather a dataset of emails labeled as "spam" and "not spam."
2. **Feature Extraction**: Identify features like keywords, sender, and links.
3. **Training the Model**: Use a Naive Bayes classifier to learn patterns from the dataset.
4. **Evaluation**: Test the model on a separate set of emails and measure its performance.
5. **Prediction**: Apply the model to new emails to classify them as spam or not spam.
6. **Continuous Learning**: Update the model with new user-labeled emails to improve its accuracy.

By understanding these steps, you can see how a spam filter uses machine learning to effectively identify and filter out unwanted emails.

## Examples of Machine Learning Applications

- `Analyzing images of products on a production line to automatically classify them.`
This is image classification, typically performed using convolutional neural networks or sometimes transformers.

- `Detecting tumors in brain scans.`
This is semantic image segmentation, where each pixel in the image is classified (as we want to determine the exact location and shape of tumors), typically using CNNs or transformers.

- `Automatically classifying news articles.`
This is natural language processing (NLP), and more specifically text classification, which can be tackled using recurrent neural networks (RNNs) and CNNs, but transformers work even better.

- `Automatically flagging offensive comments on discussion forums.`
This is also text classification, using the same NLP tools.

- `Summarizing long documents automatically.`
This is a branch of NLP called text summarization, again using the same tools.

- `Creating a chatbot or a personal assistant.`
This involves many NLP components, including natural language understanding (NLU) and question-answering modules.

- `Forecasting your company’s revenue next year, based on many performance metrics.`
This is a regression task (i.e., predicting values) that may be tackled using any regression model, such as a linear regression or polynomial regression model. If you want to take into account sequences of past performance metrics, you may want to use artificial neural networks.

- `Making your app react to voice commands.`
This is speech recognition, which requires processing audio samples: since they are long and complex sequences, they are typically processed using artificial neural networks.

- `Detecting credit card fraud.`
This is anomaly detection.

- `Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment.`
This is clustering, which can be achieved using k-means, and more.

- `Representing a complex, high-dimensional dataset in a clear and insightful diagram.`
This is data visualization, often involving dimensionality reduction techniques.

- `Recommending a product that a client may be interested in, based on past purchases.`
This is a recommender system. One approach is to feed past purchases (and other information about the client) to an artificial neural network, and get it to output the most likely next purchase. This neural net would typically be trained on past sequences of purchases across all clients.

- `Building an intelligent bot for a game.`
This is often tackled using reinforcement learning, which is a branch of machine learning that trains agents (such as bots) to pick the actions that will maximize their rewards over time (e.g., a bot may get a reward every time the player loses some life points), within a given environment (such as the game). The famous AlphaGo program that beat the world champion at the game of Go was built using RL.

## Types of Machine Learning Systems


There are so many different types of machine learning systems that it is useful to classify them in broad categories, based on the following criteria:

- How they are supervised during training (supervised, unsupervised, semi-supervised, self-supervised, and others)

- Whether or not they can learn incrementally on the fly (online versus batch learning; not covered in this course)

- Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do (instance-based versus model-based learning; not covered in this course)

### Types of Machine Learning by Level of Training Supervision
ML systems can be classified according to the amount and type of supervision they get during training. There are many categories, but we’ll discuss the main ones: 
- supervised learning
- unsupervised learning
- self-supervised learning
- semi-supervised learning
- reinforcement learning

#### Supervised Learning
In supervised learning, the training set you feed to the algorithm includes the desired solutions, called `labels`. Labels can be either categorical (discrete) or numeric (continuous). 

A typical supervised learning task is `classification`. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.). This sort of task is called `regression`. To train the system, you need to give it many examples of cars, including both their features and their targets (i.e., their prices).

Classification:
<img src="images/fig4.png" width="400" height="400">

Regression:
<img src="images/fig5.png" width="400" height="400">

`The key difference between classification and regression problems lies in their label types`. Classification problems have discrete (categorical) labels, while regression problems have continuous (numeric) labels.
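To illustrate a continuous label, here is a minimal one-feature linear regression fitted by ordinary least squares. The car ages and prices are made-up numbers, and `fit_line` is a helper written for this sketch; a real model would use more features, such as mileage and brand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: car age in years and price in dollars.
ages = [1, 2, 3, 4, 5]
prices = [18000, 16000, 14000, 12000, 10000]

slope, intercept = fit_line(ages, prices)
predicted_price = slope * 6 + intercept  # prediction for a 6-year-old car
print(slope, intercept, predicted_price)
```

The prediction is a number on a continuous scale, whereas a classifier such as the spam filter outputs one of a fixed set of classes.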

#### Unsupervised Learning

In unsupervised learning, the training data is unlabeled. The system tries to learn without a teacher.

For example, say you have a lot of data about your blog’s visitors. You may want to run a `clustering algorithm` to try to detect groups of similar visitors. At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help.
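A minimal k-means sketch shows how groups can emerge without any labels. The visitor data (visits per week, minutes per visit) and the choice of starting centroids are assumptions made for this example; library implementations choose starting points more carefully.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means on 2D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up visitors: (visits per week, minutes per visit). No labels are given.
visitors = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
initial_centroids = [visitors[0], visitors[3]]

final_centroids, clusters = kmeans(visitors, initial_centroids)
print(final_centroids)
print(clusters)
```

The algorithm separates casual visitors from heavy readers purely from the structure of the data.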

Clustering:
<img src="images/fig6.png" width="400" height="400">


`Visualization algorithms` are also good examples of unsupervised learning: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted. These algorithms try to preserve as much structure as they can (e.g., trying to keep separate clusters in the input space from overlapping in the visualization) so that you can understand how the data is organized and perhaps identify unsuspected patterns.

<img src="images/fig7.png" width="400" height="400">

A related task is `dimensionality reduction`, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be strongly correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear.
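The idea of merging two correlated features can be sketched with a two-feature version of principal component analysis (PCA), one common dimensionality reduction technique. The car data and the `merge_correlated_features` helper are made up for illustration; the source does not say which algorithm performs the merge.

```python
import math

def merge_correlated_features(xs, ys):
    """Project two standardized features onto their first principal component."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Standardize so both features are on the same scale.
    zx = [(x - mean_x) / std_x for x in xs]
    zy = [(y - mean_y) / std_y for y in ys]
    var_x = sum(a * a for a in zx) / n
    var_y = sum(b * b for b in zy) / n
    cov_xy = sum(a * b for a, b in zip(zx, zy)) / n
    # Orientation of the major axis of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))
    merged = [a * direction[0] + b * direction[1] for a, b in zip(zx, zy)]
    return direction, merged

# Made-up cars: age in years and mileage in thousands of kilometers.
ages = [1, 2, 3, 4, 5]
mileages = [12, 26, 35, 52, 61]

direction, wear_and_tear = merge_correlated_features(ages, mileages)
print(direction)
print(wear_and_tear)
```

Each car is now described by a single "wear and tear" value instead of two features, and older, higher-mileage cars get larger values.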

Yet another important unsupervised task is `anomaly detection`—for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is shown mostly normal instances during training, so it learns to recognize them; then, when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly. 
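One very simple way to sketch this is a z-score rule: learn the mean and standard deviation of normal instances, then flag new values that fall too far from the mean. The transaction amounts and the threshold of 3 standard deviations are assumptions for illustration; real fraud detection systems use many features and more robust models.

```python
import statistics

# Made-up "normal" transaction amounts seen during training.
normal_amounts = [20.0, 35.5, 18.2, 42.0, 27.9, 31.4, 24.6, 38.1]
mean = statistics.mean(normal_amounts)
std = statistics.stdev(normal_amounts)

def is_anomaly(amount: float, threshold: float = 3.0) -> bool:
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(amount - mean) / std > threshold

print(is_anomaly(30.0))   # close to the typical amount
print(is_anomaly(950.0))  # far outside the range seen in training
```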

<img src="images/fig8.png" width="400" height="400">


#### Semi-Supervised Learning

Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled instances, and few labeled instances. Some algorithms can deal with data that’s partially labeled. This is called semi-supervised learning. 

<img src="images/fig9.png" width="400" height="400">


Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just add one label per person and it is able to name everyone in every photo, which is useful for searching photos.

Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, a clustering algorithm may be used to group similar instances together, and then every unlabeled instance can be labeled with the most common label in its cluster. Once the whole dataset is labeled, it is possible to use any supervised learning algorithm.
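The cluster-then-label step described above can be sketched as follows. The cluster assignments are assumed to come from some earlier clustering step, and the names and `propagate_labels` helper are hypothetical.

```python
from collections import Counter

def propagate_labels(cluster_ids, labels):
    """Give each unlabeled instance the most common label in its cluster."""
    votes = {}
    for cluster_id, label in zip(cluster_ids, labels):
        if label is not None:
            votes.setdefault(cluster_id, Counter())[label] += 1
    completed = []
    for cluster_id, label in zip(cluster_ids, labels):
        if label is None and cluster_id in votes:
            label = votes[cluster_id].most_common(1)[0][0]
        completed.append(label)  # stays None if the cluster has no labels
    return completed

# Made-up example: six face crops grouped into three clusters, two labeled.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["Alice", None, None, None, "Bob", None]

full_labels = propagate_labels(cluster_ids, labels)
print(full_labels)
```

Clusters with at least one labeled instance become fully labeled, while a cluster with no label at all stays unlabeled until someone provides one.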

#### Self-Supervised Learning
Another approach to machine learning involves actually generating a fully labeled dataset from a fully unlabeled one. Again, once the whole dataset is labeled, any supervised learning algorithm can be used. This approach is called self-supervised learning.

<img src="images/fig10.png" width="400" height="400">

For example, if you have a large dataset of unlabeled images, you can randomly mask a small part of each image and then train a model to recover the original image. During training, the masked images are used as the inputs to the model, and the original images are used as the labels.
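The way labels are generated from unlabeled data can be sketched like this. For simplicity, each "image" is a flat list of pixel intensities, and masking sets a short run of pixels to zero; the `make_masked_pairs` function, its parameters, and the data are assumptions for illustration.

```python
import random

def make_masked_pairs(images, mask_size=2, seed=0):
    """Build (masked input, original label) pairs from unlabeled images."""
    rng = random.Random(seed)
    pairs = []
    for image in images:
        # Pick a random start position for the masked region.
        start = rng.randrange(len(image) - mask_size + 1)
        masked = list(image)
        masked[start:start + mask_size] = [0] * mask_size
        # The untouched original serves as the label.
        pairs.append((masked, list(image)))
    return pairs

# Made-up unlabeled "images" (nonzero pixel intensities).
images = [[5, 3, 8, 1, 9, 2], [7, 7, 4, 6, 2, 1]]
pairs = make_masked_pairs(images)
for masked_input, label in pairs:
    print(masked_input, "->", label)
```

No human labeling is involved: every training pair is derived from the data itself, and a model can then be trained to map each masked input back to its label.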

The resulting model may be quite useful in itself—for example, to repair damaged images or to erase unwanted objects from pictures. But more often than not, a model trained using self-supervised learning is not the final goal. You’ll usually want to tweak and fine-tune the model for a slightly different task—one that you actually care about.

For example, suppose that what you really want is to have a pet classification model: given a picture of any pet, it will tell you what species it belongs to. If you have a large dataset of unlabeled photos of pets, you can start by training an image-repairing model using self-supervised learning. Once it’s performing well, it should be able to distinguish different pet species: when it repairs an image of a cat whose face is masked, it must know not to add a dog’s face. Assuming your model’s architecture allows it (and most neural network architectures do), it is then possible to tweak the model so that it predicts pet species instead of repairing images. The final step consists of fine-tuning the model on a labeled dataset: the model already knows what cats, dogs, and other pet species look like, so this step is only needed so the model can learn the mapping between the species it already knows and the labels we expect from it.

#### Reinforcement Learning
Reinforcement learning is a very different beast. The learning system, called an `agent` in this context, can observe the environment, select and perform actions, and get `rewards` in return (or penalties in the form of negative rewards). It must then learn by itself what is the best strategy, called a `policy`, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
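The agent, reward, and policy vocabulary can be made concrete with a tiny tabular Q-learning sketch. Q-learning is one common RL algorithm, chosen here as an assumption; the corridor environment, the reward of 1 for reaching the last cell, and all hyperparameters are made up for illustration.

```python
import random

N_STATES = 5         # corridor cells 0..4; reaching cell 4 ends the episode
GOAL = N_STATES - 1
ACTIONS = (-1, +1)   # move left or move right

def step(state, action):
    """Environment: return (next_state, reward) for the chosen action."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                best = max(q[state])
                # Exploit, breaking ties between equal values at random.
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            next_state, reward = step(state, ACTIONS[a])
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# The policy maps each non-terminal state to its highest-valued action.
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:GOAL]]
print(policy)
```

After training, the learned policy should choose "right" in every cell, because moving right reaches the reward soonest. The agent was never told this; it discovered it from rewards alone.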

- [Reinforcement Learning Crash Course](https://www.youtube.com/watch?v=nIgIv4IfJ6s)
- [Reinforcement Learning Example](https://www.youtube.com/watch?v=kopoLzvh5jY)