**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART I**


**The Fundamentals of Machine Learning**

---





**CHAPTER 1 - The Machine Learning Landscape**

---



Machine Learning (ML) is often associated with futuristic robots, but it's already an integral part of our lives. It has existed for decades in specialized fields like Optical Character Recognition (OCR) and became widely known in the 1990s through applications like spam filters, which have improved significantly over time. Today, ML powers various features such as recommendation systems and voice search.

This chapter aims to clarify what ML is, exploring fundamental concepts such as the differences between supervised and unsupervised learning, online versus batch learning, and instance-based versus model-based learning. It also covers the typical workflow of an ML project, common challenges, and how to evaluate and optimize ML systems.

The chapter provides a high-level introduction to the essential ML concepts and terminology that data scientists should master. While the content remains simple and without much code, it's crucial to understand these basics before diving deeper into the subject.


---



**What Is Machine Learning?**

Machine Learning is the science (and art) of programming computers so they can
learn from data.

Here is a slightly more general definition:


> "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed".

Arthur Samuel, 1959

And a more engineering-oriented one:
> "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E".

Tom Mitchell, 1997

A spam filter is an example of Machine Learning, where the system learns to identify spam emails using a training set of labeled examples (spam and non-spam). The task is to flag spam, the experience is the training data, and performance is measured by accuracy, the ratio of correctly classified emails. Simply downloading a large dataset like Wikipedia does not qualify as Machine Learning because the system does not learn or improve at a specific task.
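As a minimal sketch of the performance measure P in the spam example, accuracy can be computed by comparing predicted labels against true labels. The label lists below are made up purely for illustration.

```python
# Hypothetical labels for eight emails: 1 = spam, 0 = ham (not spam)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

def accuracy(y_true, y_pred):
    """Ratio of correctly classified emails (the performance measure P)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}")
```

Six of the eight predictions match the true labels here, so the accuracy is 0.75.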




---



**Why Use Machine Learning?**

To write a spam filter using traditional programming, you would:


1.   Identify patterns in spam emails, like certain words or phrases (e.g., "free," "credit card") in the subject line, sender's name, and body.

2.   Write algorithms to detect these patterns, flagging emails as spam when a certain number are found.

3.   Test and refine the program, repeating steps 1 and 2 until it performs well enough to launch.

This process relies on manually identifying and coding patterns, unlike Machine Learning, where the system learns patterns from data.

![Figure1-1.jpg](./01.Chapter-01/Figure1-1.jpg)





Traditional spam filters rely on complex, manually written rules, making them hard to maintain and update. If spammers change their tactics, like replacing "4U" with "For U," the filter would need constant updates to keep up.

In contrast, a Machine Learning-based spam filter automatically learns which words and phrases predict spam by detecting patterns in data. It adapts to new patterns, such as "For U," without needing manual updates, making it more efficient and accurate.
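The contrast can be sketched with a toy example. Instead of hard-coding suspicious words, the function below derives them from labeled emails. The emails, the `min_count` threshold, and both function names are hypothetical, and a real filter would use a proper statistical model rather than raw word counts.

```python
from collections import Counter

# Hypothetical toy training set: (text, is_spam)
training_emails = [
    ("free credit card offer for u", True),
    ("amazing deal for u act now", True),
    ("free prize waiting for u", True),
    ("meeting notes for the project", False),
    ("lunch tomorrow with the team", False),
    ("project deadline moved to friday", False),
]

def learn_spam_words(emails, min_count=2):
    """Keep words seen in at least `min_count` spam emails and in no ham email."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label_is_spam in emails:
        words = set(text.lower().split())
        (spam_counts if label_is_spam else ham_counts).update(words)
    return {w for w, c in spam_counts.items()
            if c >= min_count and ham_counts[w] == 0}

def looks_like_spam(text, spam_words, threshold=1):
    """Flag an email if it contains at least `threshold` learned spam words."""
    return len(set(text.lower().split()) & spam_words) >= threshold

spam_words = learn_spam_words(training_emails)
print(spam_words)
print(looks_like_spam("special gift for u", spam_words))
```

If spammers change their wording, rerunning `learn_spam_words` on fresh labeled data updates the word list without anyone editing rules by hand.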

![Figure1-2.jpg](./01.Chapter-01/Figure1-2.jpg)

![Figure1-3.jpg](./01.Chapter-01/Figure1-3.jpg)

Machine Learning is ideal for problems that are too complex for traditional methods or lack known algorithms, like speech recognition. While a simple approach might work for a few words, it doesn't scale for thousands of words or varied speakers in different environments. The solution is an algorithm that learns from examples.

Additionally, Machine Learning can assist humans in learning by revealing patterns in data. For instance, a trained spam filter can show which words predict spam, uncovering hidden correlations. This process, called data mining, helps discover patterns that aren't immediately obvious.

![Figure1-4.jpg](./01.Chapter-01/Figure1-4.jpg)

To summarize, Machine Learning is great for:
*   Problems for which existing solutions require a lot of fine-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better than the traditional approach.
*   Complex problems for which using a traditional approach yields no good solution: the best Machine Learning techniques can perhaps find a solution.
*   Fluctuating environments: a Machine Learning system can adapt to new data.
*   Getting insights about complex problems and large amounts of data.



---




**Examples of Applications**

Here are some examples of Machine Learning applications and the techniques used for them:
* Image classification (e.g., classifying products on a production line): Convolutional Neural Networks (CNNs)
* Tumor detection in brain scans: Semantic segmentation using CNNs
* Classifying news articles: Natural Language Processing (NLP) with techniques like RNNs, CNNs, or Transformers
* Flagging offensive comments: Text classification using NLP tools
* Document summarization: Text summarization using NLP tools
* Chatbots or personal assistants: NLP components like natural language understanding and question-answering modules
* Revenue forecasting: Regression models, such as Linear Regression, Random Forest, or Neural Networks, often using RNNs or Transformers for past performance data
* Voice command recognition: Speech recognition using RNNs, CNNs, or Transformers
* Credit card fraud detection: Anomaly detection
* Client segmentation for marketing: Clustering
* Data visualization: Dimensionality reduction techniques
* Product recommendation: Recommender systems using neural networks trained on past purchase data
* Game bots: Reinforcement Learning (RL), like the AlphaGo program, to maximize rewards in a given environment



---



**Types of Machine Learning Systems**

Machine Learning systems can be classified into broad categories based on several criteria:

1. Supervision:

    * Supervised Learning: Trained with labeled data.

    * Unsupervised Learning: Trained without labeled data.

    * Semi-supervised Learning: A mix of both labeled and unlabeled data.

    * Reinforcement Learning: Learns through trial and error to maximize rewards.

2. Learning Type:

    * Online Learning: Learns incrementally, updating as new data arrives.

    * Batch Learning: Learns from a fixed dataset all at once.

3. Learning Approach:

    * Instance-based Learning: Compares new data points to known data points.

    * Model-based Learning: Detects patterns in the data to build a predictive model.

**Supervised Learning**

In supervised learning, the training data includes both the input data and the corresponding labels (desired outputs).

![Figure1-5.jpg](./01.Chapter-01/Figure1-5.jpg)

* Classification: Tasks like spam detection, where emails are labeled as spam or ham.

* Regression: Tasks like predicting the price of a car based on features like mileage and age.

Key supervised learning algorithms:

* k-Nearest Neighbors

* Linear Regression

* Logistic Regression

* Support Vector Machines (SVMs)

* Decision Trees and Random Forests

* Neural Networks
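A minimal sketch of supervised classification with Scikit-Learn's `LogisticRegression`: each training instance pairs an input feature with a label. The single feature (a count of suspicious words per email) and all the numbers are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: number of suspicious words found in each email
X_train = [[0], [1], [0], [2], [5], [6], [4], [7]]
# Labels (the desired outputs): 0 = ham, 1 = spam
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Classify two new, unseen emails
predictions = clf.predict([[1], [6]])
print(predictions)
```

The same `fit` / `predict` pattern applies to the other algorithms in the list above; only the estimator class changes.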

**Unsupervised Learning**

In unsupervised learning, the training data is unlabeled, and the system tries to learn patterns without predefined labels.

![Figure1-7.jpg](./01.Chapter-01/Figure1-7.jpg)

* Clustering: Grouping similar data, e.g., clustering blog visitors into groups based on behavior (using algorithms like k-Means, DBSCAN, HCA).

* Anomaly detection: Identifying rare events or outliers, such as fraudulent credit card transactions (using methods like One-class SVM, Isolation Forest).

* Dimensionality reduction: Simplifying the data by combining correlated features, e.g., combining car mileage and age into one feature (using PCA, t-SNE).

* Association rule learning: Discovering relationships between attributes, e.g., finding that people who buy barbecue sauce and chips often buy steak.

In clustering, for example, the algorithm groups data based on similarities without being told which group each instance belongs to, making it useful for customer segmentation or pattern discovery.

Key Techniques in Unsupervised Learning:

* Clustering: k-Means, DBSCAN

* Anomaly detection: One-class SVM, Isolation Forest

* Dimensionality reduction: PCA, t-SNE

* Association rule learning: Apriori, Eclat

Unsupervised learning is useful for exploring data and finding patterns without requiring labels, making it suitable for tasks like grouping similar items, reducing data complexity, and detecting anomalies.
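As a rough illustration of clustering, the sketch below runs k-Means on synthetic 2-D points. The two groups are generated artificially (they stand in for two kinds of blog visitors), and the algorithm is never told which point belongs to which group.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: two well-separated groups of points
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Ask k-Means for two clusters; no labels are provided
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:5], labels[-5:])
print(kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is that points from the same group end up with the same cluster label.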

**Semi-supervised learning**

Semi-supervised learning is a method used when there is a large amount of unlabeled data and only a small amount of labeled data. This approach combines both unsupervised and supervised learning techniques to make the most of the available data.

![Figure1-11.jpg](./01.Chapter-01/Figure1-11.jpg)

For example, in services like Google Photos, the system automatically groups photos in which the same person appears (unsupervised learning, i.e., clustering). Once you label a few people, ideally one label per person, the system can name that person in every photo, which makes tasks like photo search much more efficient.

Most semi-supervised learning algorithms are hybrids of unsupervised methods (like clustering) and supervised methods (like classification). For instance, deep belief networks (DBNs) use unsupervised restricted Boltzmann machines (RBMs) followed by supervised fine-tuning, enabling them to work effectively with both labeled and unlabeled data.
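The hybrid idea can be sketched as "cluster first, then spread the few known labels". This is a simplified illustration, not how any particular photo service works: the 2-D points stand in for face features, only two instances carry a label, and `-1` is a convention chosen here to mark unlabeled instances.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.4, size=(30, 2)),  # photos of person 0
    rng.normal(loc=[4, 4], scale=0.4, size=(30, 2)),  # photos of person 1
])

# Only two instances are labeled; -1 marks "unlabeled"
y_partial = np.full(len(X), -1)
y_partial[0] = 0
y_partial[30] = 1

# Unsupervised step: group similar instances
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised step: give every instance the label known within its cluster
y_propagated = y_partial.copy()
for c in np.unique(clusters):
    in_cluster = clusters == c
    known = y_partial[in_cluster & (y_partial != -1)]
    if len(known) > 0:
        y_propagated[in_cluster] = np.bincount(known).argmax()

print((y_propagated == -1).sum(), "instances left unlabeled")
```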

**Reinforcement Learning**

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent takes actions and receives rewards (or penalties) based on those actions. The goal is for the agent to learn the best strategy, called a policy, to maximize rewards over time.

![Figure1-12.jpg](./01.Chapter-01/Figure1-12.jpg)

For example, robots use RL to learn tasks like walking, and AlphaGo, developed by DeepMind, used RL to learn how to play the game of Go. AlphaGo analyzed millions of games and played against itself to refine its strategy. It applied the learned policy to defeat the world champion, with no further learning occurring during the championship games.
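A much smaller illustration of the same loop (act, observe reward, update) is an epsilon-greedy agent on a three-armed bandit. This is far simpler than AlphaGo and is only a sketch: the reward probabilities, `epsilon`, and the number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward_prob = [0.2, 0.5, 0.8]   # hidden from the agent
n_actions = len(true_reward_prob)

value_estimates = np.zeros(n_actions)  # agent's estimate of each action's reward
action_counts = np.zeros(n_actions)
epsilon = 0.1                          # fraction of steps spent exploring

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))      # explore: random action
    else:
        action = int(np.argmax(value_estimates))   # exploit: best action so far
    reward = float(rng.random() < true_reward_prob[action])
    action_counts[action] += 1
    # Incremental average of the rewards observed for this action
    value_estimates[action] += (reward - value_estimates[action]) / action_counts[action]

best_action = int(np.argmax(value_estimates))
print(best_action, value_estimates.round(2))
```

The learned policy here is simply "pick the action with the highest estimated reward", which the agent discovers by trial and error.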

**Batch and Online Learning**

**Batch learning**

Batch learning is a machine learning approach where the system learns from the entire dataset at once. It is trained offline using all available data, and once trained, it is deployed into production without learning any further. If new data arises (e.g., a new type of spam), the system must be retrained from scratch with the updated dataset.

While this method works well for static datasets, it can be time-consuming and resource-intensive, requiring significant computing power. Additionally, for systems needing to adapt quickly to new data (e.g., predicting stock prices), batch learning may not be ideal. It is typically used for less dynamic situations and when data is not too large. For more adaptive systems, incremental learning is a better solution.

**Online learning**

Online learning trains a machine learning system incrementally by feeding it data instances one by one or in small groups called mini-batches, allowing the system to learn as new data arrives. This method is well suited to systems that need to adapt quickly to continuous data streams (e.g., stock prices) and to those with limited computing resources, since data can be discarded once the system has learned from it.

Online learning is also useful for large datasets that can't fit into a machine's memory, allowing training in smaller chunks (called out-of-core learning). One key parameter is the learning rate, which controls how quickly the system adapts to new data. A high learning rate makes the system adapt rapidly but may forget older data, while a low learning rate slows learning but reduces sensitivity to noise.

A challenge of online learning is that bad data can cause the system's performance to decline over time. Monitoring the system and detecting abnormal data (using anomaly detection) can help mitigate this risk.
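The incremental update can be sketched with plain NumPy: a linear model is adjusted one mini-batch at a time, and each mini-batch is thrown away after use. The simulated stream, the "true" parameters, and the learning rate of 0.3 are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w, true_b = 3.0, 2.0   # hypothetical relationship used to simulate the stream
w, b = 0.0, 0.0             # model parameters, updated incrementally
learning_rate = 0.3

for _ in range(300):
    # A new mini-batch of 10 instances arrives
    x = rng.uniform(0, 2, size=10)
    y = true_w * x + true_b + rng.normal(0, 0.1, size=10)

    # One gradient step on this mini-batch only
    error = (w * x + b) - y
    w -= learning_rate * np.mean(error * x)
    b -= learning_rate * np.mean(error)
    # The mini-batch is no longer needed and can be discarded

print(round(w, 2), round(b, 2))
```

Raising `learning_rate` makes the parameters react faster to each new mini-batch (and to its noise); lowering it makes them change more slowly. Scikit-Learn estimators that expose a `partial_fit` method, such as `SGDRegressor`, support the same style of training.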

**Instance-Based Versus Model-Based Learning**

**Instance-based learning**

Instance-based learning is a simple approach where the system "learns by heart" from the examples it has been given. For instance, a spam filter could flag only emails that are identical to previously flagged ones. A more useful version would also flag emails that are similar to known spam emails, using a similarity measure (e.g., counting the words they have in common). This method generalizes to new cases by comparing them to the most similar known examples, such as classifying a new instance as a "triangle" if most of the instances closest to it belong to that class.
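A minimal sketch of this idea, using the number of shared words as the similarity measure and a majority vote among the `k` most similar known emails. The example emails and the choice of `k=3` are hypothetical.

```python
def similarity(a, b):
    """Similarity measure: number of distinct words two emails share."""
    return len(set(a.lower().split()) & set(b.lower().split()))

# The "learned by heart" examples
known_emails = [
    ("win a free prize now", "spam"),
    ("free credit card offer", "spam"),
    ("claim your free prize today", "spam"),
    ("team meeting at noon", "ham"),
    ("project report due friday", "ham"),
    ("lunch with the project team", "ham"),
]

def classify(text, known, k=3):
    """Label a new email by majority vote among its k most similar known emails."""
    neighbors = sorted(known, key=lambda item: similarity(text, item[0]),
                       reverse=True)[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

print(classify("free prize inside", known_emails))
print(classify("project team meeting notes", known_emails))
```

No model parameters are fitted here: all the work happens at prediction time, when the new email is compared with the stored examples.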

![Figure1-15.jpg](./01.Chapter-01/Figure1-15.jpg)

**Model-based learning**

Model-based learning involves creating a model based on the training data and using it to make predictions.

![Figure1-16.jpg](./01.Chapter-01/Figure1-16.jpg)

Example: Predicting Life Satisfaction Based on GDP

Let's say we want to predict life satisfaction based on a country's GDP per capita. We use Linear Regression to model this relationship.

Step 1: Collect and Organize Data

We gather data from sources like the OECD for life satisfaction and the IMF for GDP per capita. An example dataset might look like this:

![Table1-1.jpg](./01.Chapter-01/Table1-1.jpg)

![Figure1-17.jpg](./01.Chapter-01/Figure1-17.jpg)

Step 2: Select a Model (Linear Regression)

We decide to use a linear model to predict life satisfaction based on GDP. The formula for a linear model is:

![Eq1-1.jpg](./01.Chapter-01/Eq1-1.jpg)

Where:

* θ₀ is the intercept (constant value),

* θ₁ is the slope (coefficient for GDP per capita).

Step 3: Train the Model (Find θ₀ and θ₁)

Training the Linear Regression model means finding the values of θ₀ and θ₁ that best fit the data, i.e., that minimize a cost function measuring the distance between the model's predictions and the training examples. In our case, after training the model, we get:

![Eq1-2.jpg](./01.Chapter-01/Eq1-2.jpg)

This means the model can predict life satisfaction by applying these values for the intercept and slope.
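To make the training step concrete, here is a sketch of the closed-form least-squares solution for a one-feature linear model, run on synthetic data. The "true" intercept and slope below are invented for the illustration and are not the values fitted on the real OECD/IMF data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data drawn around a known line (hypothetical values)
true_theta0, true_theta1 = 5.0, 5e-5
gdp = rng.uniform(10_000, 60_000, size=50)
satisfaction = true_theta0 + true_theta1 * gdp + rng.normal(0, 0.1, size=50)

# Least-squares estimates for a single feature:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
theta1 = np.cov(gdp, satisfaction, bias=True)[0, 1] / np.var(gdp)
theta0 = satisfaction.mean() - theta1 * gdp.mean()

print(f"theta0 = {theta0:.2f}, theta1 = {theta1:.2e}")

# Using the fitted line to predict for a new GDP value
prediction = theta0 + theta1 * 30_000
```

The recovered parameters land close to the ones used to generate the data, which is what "best fit" means in this setting.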

Step 4: Make Predictions

Now that we have the model, we can use it to predict life satisfaction for countries with known GDP per capita.

For Cyprus, we look up the GDP per capita, which is $22,587. We then apply the model:

![Eq1-3.jpg](./01.Chapter-01/Eq1-3.jpg)

Calculating the result:

![Eq1-4.jpg](./01.Chapter-01/Eq1-4.jpg)

So, the predicted life satisfaction for Cyprus, based on its GDP per capita of $22,587, is 5.96.



Python Code Example

Here's how you could implement this in Python using Scikit-Learn:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
# NOTE: prepare_country_stats() is not defined in this snippet. It is assumed to be
# a helper defined elsewhere that merges the two tables into one DataFrame with
# "GDP per capita" and "Life satisfaction" columns.
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]
```



---



**Main Challenges of Machine Learning**

**Insufficient Quantity of Training Data**

Unlike a toddler, who can learn to recognize an apple from a few examples, most ML algorithms need a lot of data to work properly. Even very simple problems typically require thousands of examples, and complex tasks like image or speech recognition may require millions, unless parts of an existing model can be reused. In general, more data leads to better performance.

**Nonrepresentative Training Data**

Nonrepresentative Training Data occurs when the data used to train a model does not accurately reflect the real-world scenarios the model will encounter. For example, if some countries are missing from the training data, the model's predictions can be skewed. Adding missing countries can significantly change the model, revealing that very rich countries are not necessarily happier than moderately rich ones, and some poor countries are happier than many rich countries.

![Figure1-21.jpg](./01.Chapter-01/Figure1-21.jpg)

Using nonrepresentative data leads to inaccurate predictions, especially at the extremes (e.g., very poor or very rich countries). Achieving a representative training set is challenging: even a very large dataset can suffer from sampling bias if the sampling method is flawed, which leads to misleading results.

**Poor-Quality Data**

Poor-Quality Data can significantly hinder the performance of machine learning models. Errors, outliers, and noise in the data make it difficult for the system to detect meaningful patterns, leading to inaccurate predictions.

Cleaning up the data is crucial, and data scientists often spend a lot of time on this task. For example:

* Outliers: Discarding or correcting data points that are extreme or erroneous.

* Missing features: Deciding how to handle missing data (e.g., ignoring, filling with the median, or creating models with and without the missing feature).

Proper data cleaning ensures that the model learns from reliable and accurate data, improving its performance.
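Two of the options for missing features can be sketched with pandas: dropping the affected instances, or filling the gap with the median. The small table below is invented for illustration.

```python
import pandas as pd

# Hypothetical customer data with one missing age
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0, 38.0],
    "income": [30_000, 42_000, 39_000, 58_000, 51_000],
})

# Option 1: ignore the instances that are missing the feature
df_dropped = df.dropna(subset=["age"])

# Option 2: fill the missing values with the median of the known ones
median_age = df["age"].median()   # NaN values are skipped by default
df_filled = df.assign(age=df["age"].fillna(median_age))

print(median_age)
print(df_filled)
```

If the median is computed on the training set, the same value should be reused to fill gaps in any data the model sees later.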

**Irrelevant Features**

Irrelevant Features in training data can negatively impact the performance of a machine learning model. If the data contains too many irrelevant features, the model may struggle to learn useful patterns.

Feature engineering is the process of selecting and creating the most relevant features for training:

* Feature selection: Choosing the most important features from existing data.

* Feature extraction: Combining features to create more useful ones (e.g., using dimensionality reduction).

* Creating new features: Gathering additional data to improve the model's predictive power.

Carefully selecting and engineering features is essential for building an effective model.
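Feature extraction can be sketched with PCA: two strongly correlated features are merged into a single one. The car data here is synthetic, and the name `wear` for the combined feature is just a label chosen for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic cars: mileage grows with age, plus some noise
age_years = rng.uniform(1, 15, size=200)
mileage_km = 12_000 * age_years + rng.normal(0, 5_000, size=200)

# Put both features on the same scale before combining them
features = np.column_stack([age_years, mileage_km])
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

# Project the two correlated features onto one component
pca = PCA(n_components=1)
wear = pca.fit_transform(standardized)
explained = pca.explained_variance_ratio_[0]

print(wear.shape, round(explained, 3))
```

Because the two original features carry almost the same information, the single extracted feature keeps nearly all of the variance.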

**Overfitting the Training Data**

Overfitting the Training Data occurs when a model performs well on the training data but fails to generalize to new, unseen data. This happens when the model becomes too complex and fits the noise or random patterns in the training data, rather than the true underlying trends.

For example, a complex model might learn irrelevant patterns, such as associating countries with names containing a "w" to higher life satisfaction, which is purely coincidental. This kind of model is overfitted and won't perform well on new data.

To prevent overfitting, regularization is used to simplify the model. It constrains the model, reducing its complexity and helping it generalize better. A regularized model might not fit the training data perfectly, but it will perform better on unseen data.

The degree of regularization is controlled by a hyperparameter, which is a parameter of the learning algorithm (not of the model) and must be set before training. A high regularization value leads to a simpler model that is less likely to overfit, but if it is set too high the model may become too simple to find a good solution. Properly tuning hyperparameters is crucial for creating an effective machine learning model.
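The effect of regularization can be sketched by fitting the same over-flexible model with and without a penalty. Ridge regression is used here as one example of a regularized linear model; the synthetic data, the polynomial degree of 10, and `alpha=1.0` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Small noisy dataset whose true relationship is a straight line
X = rng.uniform(-1, 1, size=(20, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.3, size=20)

# Deliberately over-flexible model: degree-10 polynomial features
X_poly = PolynomialFeatures(degree=10, include_bias=False).fit_transform(X)

unregularized = LinearRegression().fit(X_poly, y)
regularized = Ridge(alpha=1.0).fit(X_poly, y)   # alpha is the regularization hyperparameter

norm_unreg = np.linalg.norm(unregularized.coef_)
norm_reg = np.linalg.norm(regularized.coef_)
print(f"coefficient norm without regularization: {norm_unreg:.1f}")
print(f"coefficient norm with regularization:    {norm_reg:.1f}")
```

The penalty keeps the coefficients small, which constrains the model; raising `alpha` constrains it further.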

**Underfitting the Training Data**

Underfitting the Training Data occurs when a model is too simple to capture the underlying patterns in the data. For example, using a linear model for a complex problem like predicting life satisfaction can lead to inaccurate predictions, even on the training data.

To fix underfitting, you can:

* Select a more powerful model with more parameters.

* Improve the features (feature engineering) provided to the model.

* Reduce constraints on the model, such as decreasing the regularization hyperparameter, to allow more flexibility.

The goal is to create a model that is complex enough to learn the data's structure without overfitting.
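Underfitting shows up as a high error even on the training data. In this sketch, a straight line is fitted to data that follows a curve, and then compared with a model that has one more parameter. The data is synthetic and the quadratic shape is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a curved (quadratic) underlying relationship
x = np.linspace(-3, 3, 50)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 0.5, size=50)

def training_mse(degree):
    """Fit a polynomial of the given degree and return its error on the training data."""
    coefs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

mse_linear = training_mse(1)      # too simple: underfits
mse_quadratic = training_mse(2)   # one more parameter: captures the curve

print(f"linear:    {mse_linear:.2f}")
print(f"quadratic: {mse_quadratic:.2f}")
```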

**Stepping Back**


Stepping Back provides a recap of the key points in Machine Learning:

Machine Learning allows systems to improve at tasks by learning from data instead of relying on explicitly programmed rules.

There are various types of ML systems, such as supervised vs unsupervised, batch vs online, and instance-based vs model-based.

In an ML project, data is gathered in a training set, which is then used by a learning algorithm. A model-based algorithm adjusts parameters to fit the data, while an instance-based algorithm learns examples and generalizes based on similarity.

For a system to perform well, the training data must be sufficient, representative, and clean. The model should be complex enough to capture patterns but simple enough to avoid overfitting.

Finally, once the model is trained, it's important to evaluate and fine-tune it to ensure it generalizes well to new, unseen data.



---



**Testing and Validating**

Testing and validating are essential for evaluating how well a model will generalize to new data. Rather than putting the model straight into production and watching how it performs, a better method is to split the data into two sets: the training set (used to train the model) and the test set (used to evaluate the model's performance on new data).

The error rate on new cases is called the generalization error (or out-of-sample error). Evaluating the model on the test set gives an estimate of this error, which indicates how well the model will perform on unseen data. If the training error is low but the generalization error is high, the model is overfitting: it performs well on the training data but struggles with new data.
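A minimal sketch of this split using Scikit-Learn's `train_test_split`. The data is synthetic, and the 80/20 ratio is a common choice used here as an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 5 + rng.normal(0, 1, size=100)

# Hold out 20% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))   # estimate of generalization error
print(f"training error: {train_mse:.2f}, test error: {test_mse:.2f}")
```

A test error far above the training error would be a sign of overfitting.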

**Hyperparameter Tuning and Model Selection**

Hyperparameter Tuning and Model Selection involves selecting the best model and setting its hyperparameters for optimal performance.

To choose between models (e.g., linear vs. polynomial), you can train both and compare how well they generalize using a test set. For fine-tuning, you might train multiple models with different hyperparameter values and select the one with the lowest generalization error. However, adjusting hyperparameters based on test set performance can lead to overfitting on the test set, causing poor performance on new data.

To avoid this, holdout validation is used: part of the training set is held out as a validation set to evaluate different models and hyperparameters. The best model is selected based on its performance on the validation set, then retrained on the full training set, and finally tested on the test set.

Repeated cross-validation can further improve the process by using many small validation sets. Each model is evaluated once per validation set after being trained on the rest of the data, and averaging these evaluations gives a more accurate measure of its performance. The drawback is that training time is multiplied by the number of validation sets.
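Holdout validation can be sketched as follows. The polynomial degree plays the role of the hyperparameter being tuned, and the data, candidate degrees, and split sizes are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=120)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 0.5, size=120)

# Split: 60 training, 30 validation, 30 test (the data is already in random order)
x_train, y_train = x[:60], y[:60]
x_val, y_val = x[60:90], y[60:90]
x_test, y_test = x[90:], y[90:]

def mse(coefs, x, y):
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

# 1. Train one candidate per hyperparameter value, evaluate on the validation set
degrees = [1, 2, 5]
val_errors = {d: mse(np.polyfit(x_train, y_train, d), x_val, y_val) for d in degrees}
best_degree = min(val_errors, key=val_errors.get)

# 2. Retrain the winner on the full training set (training + validation)
x_full = np.concatenate([x_train, x_val])
y_full = np.concatenate([y_train, y_val])
final_coefs = np.polyfit(x_full, y_full, best_degree)

# 3. Estimate the generalization error once, on the untouched test set
test_error = mse(final_coefs, x_test, y_test)
print(val_errors, best_degree, round(test_error, 2))
```

The test set is used only once, at the very end, so its error estimate is not biased by the model selection. For cross-validation, Scikit-Learn's `cross_val_score` automates the repeated train-and-evaluate loop.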

**Data Mismatch**

Data Mismatch occurs when the training data is not representative of the data that will be used in production. For example, if you're building a mobile app to identify flower species from pictures, you might have millions of web-sourced flower images for training, but they may not be representative of the pictures users will actually take with the app.

To address this, the validation and test sets must be as representative as possible of the data expected in production, so they should consist exclusively of pictures taken with the app. If the model is trained on web pictures and then performs poorly on the validation set, you cannot tell whether it overfit the training set or whether the cause is the mismatch between web and app pictures. One solution is to hold out some of the training (web) pictures in a separate train-dev set. The model is trained on the rest of the training set and then evaluated on the train-dev set. If it performs well on the train-dev set but poorly on the validation set, the problem is likely data mismatch; in this case, you can preprocess the web images to resemble those from the mobile app and retrain the model. If it performs poorly on the train-dev set, it has likely overfit the training data, and you should simplify or regularize the model, get more training data, or clean up the training data.
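The train-dev reasoning can be written down as a small decision helper. Here the train-dev set is understood as a held-out slice of the training (web) pictures, while the validation set contains app pictures. The function name, the `tolerance` threshold, and the example error rates are all hypothetical; in practice, what counts as a "large" gap depends on the task.

```python
def diagnose(train_error, train_dev_error, val_error, tolerance=0.05):
    """Rough diagnosis from three error rates.

    train_error:     error on the training set (web pictures)
    train_dev_error: error on held-out web pictures (same distribution as training)
    val_error:       error on the validation set (app pictures)
    """
    if train_dev_error - train_error > tolerance:
        # Fails even on unseen data from the training distribution
        return "overfitting"
    if val_error - train_dev_error > tolerance:
        # Fine on held-out web pictures, poor on app pictures
        return "data mismatch"
    return "no obvious problem"

print(diagnose(train_error=0.02, train_dev_error=0.15, val_error=0.18))
print(diagnose(train_error=0.02, train_dev_error=0.03, val_error=0.20))
print(diagnose(train_error=0.02, train_dev_error=0.03, val_error=0.04))
```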