---
title: Machine Learning Algorithms - Home assignment
subtitle: Machine Learning class of Master 2 Data Science at IDMC school, FRANCE
authors:
  - name: Marianne CLAUSEL
    email: marianne.clausel@univ-lorraine.fr
    affiliations:
      - name: Université de Lorraine, FRANCE
        ror: 04vfs2w97
date: 2025-04-04
---

% concept of machine learning, graphic of deep learning, diagram of AI presented in blueprint Pro Vector
:::{figure} https://static.vecteezy.com/system/resources/previews/045/781/906/non_2x/concept-of-machine-learning-graphic-of-deep-learning-diagram-of-ai-presented-in-blueprint-vector.jpg
:label: fig:machine_learning_illustration

Illustration created by the Thai artist [Jakarin Niamklang](https://www.vecteezy.com/members/jackie-niam).
:::

# **What are Machine Learning Algorithms?**

"[Machine learning](#fig:machine_learning) algorithms are **mathematical models trained on data**. They use statistical and predictive analytics techniques to learn patterns and relationships within the data. Then, they use this knowledge to make predictions or take action on new, untested data.

The main advantage of these algorithms is their ability to process training data to new, previously unknown forms, allowing them to make accurate predictions in real-world scenarios."<br>

<u>Source:</u> website of the Polish software engineering and advisory company [VM PL Software house](https://vmsoftwarehouse.com/8-machine-learning-algorithms-for-predictions)

% A figure of a diagram on Machine Learning, followed by a caption
:::{figure} https://vmsoftwarehouse.com/wp-content/uploads/2024/05/ML2_en-.png
:label: fig:machine_learning

Diagram taken from the website of the Polish software engineering and advisory company [VM PL Software house](https://vmsoftwarehouse.com/8-machine-learning-algorithms-for-predictions).
:::

# **Algorithm Selection Criteria**

Choosing the right algorithm depends on many variables. Even the most experienced data scientists cannot pinpoint the best algorithm without first testing candidates on the specific dataset; until several algorithms have been tried on the given data, the choice remains largely speculative.

However, a set of rules based on several variables helps you narrow your search down to two or three algorithms that best fit your particular case. You can then test these selected algorithms on a real dataset, so that the final choice becomes little more than a formality.

## **1. Type of Task**

**We usually start with the simplest methods to determine whether it is necessary to move on to more sophisticated and complex algorithms.** First, we analyze the type of task at hand. Is it a classification task where we want to predict specific categories? Or is it a regression task where we aim to predict continuous values? The better we understand the nature of the task, the more accurately we can choose the appropriate algorithm.
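As a rough illustration of this first question, the sketch below guesses the task type from the target column alone. The helper `guess_task_type` and its `max_classes` threshold are hypothetical and purely heuristic; in practice the decision comes from knowing what the target means.

```python
def guess_task_type(targets, max_classes=20):
    """Heuristically label a supervised task from its target values.

    Hypothetical helper: `max_classes` is an arbitrary threshold (assumption).
    """
    distinct = set(targets)
    # Text or boolean targets are treated as category labels.
    if all(isinstance(t, (str, bool)) for t in distinct):
        return "classification"
    # Floats with a fractional part suggest a continuous quantity.
    if any(isinstance(t, float) and not t.is_integer() for t in distinct):
        return "regression"
    # A small set of whole-number codes usually denotes classes.
    return "classification" if len(distinct) <= max_classes else "regression"


print(guess_task_type(["spam", "ham", "spam"]))   # classification
print(guess_task_type([0, 1, 1, 0]))              # classification
print(guess_task_type([199.5, 250.0, 310.75]))    # regression
```

A heuristic like this can be wrong, for instance for an integer target that is really a count to be predicted by regression.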

## **2. Size and Type of Data**

Understanding the data is key to success. That’s why we always analyze the specific data we are dealing with; the right data provides the necessary information. Exploratory data analysis is the first step performed during a project.  

**Understanding the data is also helpful at intermediate stages:**

- Before moving on to data cleaning, we collect information on missing values.
- Before transforming the data, we need to know what types of variables are in the set.
- Before starting the modeling process, we check for outlier observations and variables with unusual distributions.
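The three checks above can be sketched with the Python standard library alone. The toy records, the column names, and the 1.5 × IQR outlier rule are assumptions made for illustration; a real project would typically rely on a data-frame library.

```python
import statistics

# Toy records standing in for a real dataset; None marks a missing value.
rows = [
    {"age": 34, "balance": 1200.0, "city": "Nancy"},
    {"age": None, "balance": 950.0, "city": "Metz"},
    {"age": 45, "balance": 1100.0, "city": None},
    {"age": 29, "balance": 1050.0, "city": "Nancy"},
    {"age": 52, "balance": 98000.0, "city": "Paris"},
    {"age": 41, "balance": 1300.0, "city": "Metz"},
]
columns = list(rows[0])


def missing_counts(rows, columns):
    """Count missing values per column (input for data cleaning)."""
    return {c: sum(r[c] is None for r in rows) for c in columns}


def column_types(rows, columns):
    """Label each column as numeric or categorical (input for transformations)."""
    types = {}
    for c in columns:
        values = [r[c] for r in rows if r[c] is not None]
        numeric = all(isinstance(v, (int, float)) for v in values)
        types[c] = "numeric" if numeric else "categorical"
    return types


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


balances = [r["balance"] for r in rows]
print(missing_counts(rows, columns))
print(column_types(rows, columns))
print(iqr_outliers(balances))
```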

% A figure of a diagram on Data modeling rules, followed by a caption
:::{figure} https://vmsoftwarehouse.com/wp-content/uploads/2024/05/reguly_EN.png
:label: fig:data_modeling_rules

Diagram taken from the website of the Polish software engineering and advisory company [VM PL Software house](https://vmsoftwarehouse.com/8-machine-learning-algorithms-for-predictions).
:::

**Some algorithms are better suited for small datasets, while others can effectively handle large datasets** and complex relationships between variables.

If you have a small dataset with simple relationships between variables, algorithms such as linear regression or logistic regression may be sufficient. If you have a large dataset with complex relationships, algorithms such as random forests or support vector machines may be more appropriate, keeping in mind that kernel-based support vector machines become costly to train as the number of samples grows.

## **3. Interpretation vs. Performance**

Another factor to consider is the **trade-off between interpretability and performance.** Some algorithms, such as decision trees, are easy to interpret and provide clear explanations for their predictions. Other algorithms, such as neural networks, may perform better but lack interpretability.

If interpretability is essential to your project, algorithms such as decision trees or logistic regression are good choices. If performance is the main goal and interpretability is not a priority, neural networks or deep learning models may be more appropriate.

## **4. The Complexity of the Algorithm**

The complexity of the algorithm is also an essential factor. **Some algorithms are relatively simple and easy to implement, while others are more complex and require advanced programming skills or computing resources.**

If you have limited programming skills, algorithms such as linear regression or decision trees are a good starting point. If you have more advanced programming skills and computing resources, you can explore more complex algorithms, such as neural networks or deep learning models. 

Given these factors, you can **narrow your options and choose a suitable machine learning algorithm for your project.** It remains important to experiment with different algorithms and evaluate their performance on your specific task and data.
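A minimal sketch of such an experiment, assuming a toy regression dataset and two deliberately simple candidates (a mean baseline and a 1-nearest-neighbor regressor): each candidate is fitted on the training part and scored on held-out points. The data, the split, and the function names are illustrative assumptions.

```python
def fit_mean(train_x, train_y):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def fit_nearest_neighbor(train_x, train_y):
    """1-nearest neighbor: predict the target of the closest training input."""
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: roughly y = 2x + 1 with small, fixed perturbations.
xs = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.2]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

# Hold out every third point for testing; train on the rest.
held_out = {2, 5, 8}
train_x = [x for i, x in enumerate(xs) if i not in held_out]
train_y = [y for i, y in enumerate(ys) if i not in held_out]
test_x = [xs[i] for i in sorted(held_out)]
test_y = [ys[i] for i in sorted(held_out)]

candidates = {"mean baseline": fit_mean, "1-nearest neighbor": fit_nearest_neighbor}
scores = {
    name: mean_squared_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: test MSE = {score:.2f}")
print("best candidate:", best)
```

Scoring on points that were not used for fitting is what makes the comparison meaningful; a single small split like this one is only a sketch, and cross-validation would give a more stable estimate.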

# **Division of Algorithms in Machine Learning**

The most general way to divide algorithms is based on the type of learning: supervised and unsupervised.

## **Supervised Learning**

Supervised learning algorithms are **trained on labeled data, where the input data is associated with the correct output or target variable.** The algorithm learns to map inputs to the correct outputs by finding patterns and relationships in the data. This type of algorithm is commonly used in tasks such as classification and regression.

We use these algorithms to predict a value based on input characteristics. This value could be numerical, for example an estimated creditworthiness score or the fraud risk for a selected transaction, or binary, indicating whether a given bank customer will be a good or bad borrower. In summary, in this case, **we know exactly what we are looking for and what we will base our decisions on.**

An example would be a dataset of bank customers, described by variables such as date of birth, ID number, account balance, home address, credit history data, transaction history, etc.

% A figure of a diagram on the Division of Algorithms in Machine Learning, followed by a caption
:::{figure} https://vmsoftwarehouse.com/wp-content/uploads/2024/05/ML_EN.png
:label: fig:division_algorithms_machine_learning

Diagram taken from the website of the Polish software engineering and advisory company [VM PL Software house](https://vmsoftwarehouse.fr/8-algorithmes-dapprentissage-automatique).
:::

## **Unsupervised Learning**

Unsupervised learning algorithms are trained on **unlabeled data, where only input data is available without a corresponding output or target label.** The goal of unsupervised learning is to discover hidden patterns or structures in the data. These algorithms are beneficial when the data’s underlying structure is unknown. 

We often use unsupervised learning algorithms for tasks such as clustering and dimensionality reduction. For example, in clustering tasks, the algorithm groups similar data points based on their internal similarities. This is useful for tasks such as customer segmentation, where the algorithm can identify groups of customers with similar preferences or behaviors.
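To make the clustering idea concrete, here is a minimal k-means sketch on toy customer data. The two features (monthly visits, average basket value), the data points, and the deterministic initialization from the first `k` points are assumptions for illustration; library implementations normally use randomized or k-means++ initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Minimal k-means: alternate between assigning points and moving centroids."""
    # Assumption: start from the first k points to keep the example deterministic.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, labels


# Toy customers: (visits per month, average basket value).
customers = [(2, 15), (25, 80), (3, 18), (27, 90), (1, 12), (30, 85)]
centroids, labels = kmeans(customers, k=2)
print(labels)   # [0, 1, 0, 1, 0, 1]
```

No labels are given to the algorithm: the two segments emerge only from the distances between customers.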

# **Popular Machine Learning Algorithms**

Machine learning algorithms come in many forms and formats, each with unique characteristics. In this section, we will discuss some popular algorithms and their applications in various industries.

## **1. Binary Classification**

**In binary classification tasks, the algorithm learns to assign input data to one of two predefined categories or classes.**

Classification is used in situations such as object detection, various kinds of automation, and counting objects. It is also used in medicine, for example to detect changes in medical imaging that distinguish a sick person from a healthy one.

For example, a supervised learning algorithm can be trained to determine whether an email is spam or not by analyzing a labeled email dataset. **Binary classification** is commonly used because it **allows us to sift through a given dataset and separate it into two groups.**

What questions do classification algorithms answer? For example:

- Will the customer be a good borrower? (Will they repay the loan in full without significant delays?) | 0/1 (yes or no).  
- Will a given customer want to cancel our services? | 0/1 (yes or no).
- Is the transaction fraudulent? | 0/1 (yes or no).
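Once a model answers such 0/1 questions, its answers can be compared with the known labels. The sketch below computes accuracy, precision, and recall from a confusion matrix; the label vectors are invented toy values (1 = fraudulent, 0 = legitimate) and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of true positives that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # known labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model answers
print(confusion_counts(y_true, y_pred))   # (3, 1, 1, 3)
print(binary_metrics(y_true, y_pred))
```

For rare events such as fraud, precision and recall are usually more informative than accuracy, since always predicting 0 can already give a high accuracy.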

## **2. Multi-class Classification**

**Multi-class classification is similar to binary classification but involves predicting a single outcome from more than two classes.** Sometimes, we want to differentiate between more complex categories. For example, when diagnosing a disease, we might want to know the grade of a cancer, the stage it is in, or which specific type of cancer it is among several.
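As a small sketch of predicting one class out of several, the example below uses a k-nearest-neighbors vote, one of many possible multi-class methods. The three classes "A", "B", "C" and the two-dimensional toy points are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to `query`."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set with three classes.
train_points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),   # class A
    (5.0, 5.0), (5.2, 4.8), (4.9, 5.1),   # class B
    (9.0, 1.0), (9.2, 0.9), (8.8, 1.2),   # class C
]
train_labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

print(knn_predict(train_points, train_labels, (1.1, 1.0)))   # A
print(knn_predict(train_points, train_labels, (5.1, 4.9)))   # B
print(knn_predict(train_points, train_labels, (9.0, 1.1)))   # C
```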

% A figure showing the different methods of multi-class supervision, followed by a caption
:::{figure} https://vmsoftwarehouse.com/wp-content/uploads/2024/05/EN-6-1400x1160.png
:label: fig:mulit-class_classification

Diagram taken from the website of the Polish software engineering and advisory company [VM PL Software house](https://vmsoftwarehouse.com/8-machine-learning-algorithms-for-predictions).
:::

In the image above, we see several applications of supervised learning algorithms to images. The methods used are:

- **CLASSIFICATION** – Using classification, we can identify that in the picture, there is a dog, there are plush toys, and there is a cup.
- **OBJECT DETECTION** – We want to find a dog or a specific mug. This method allows us to determine the object’s boundaries (a bounding rectangle) and the probability that this specific object is in the frame.
- **SEGMENTATION** – This method attempts to find and then mark individual objects as precisely as possible, separating them from each other.
- **SEMANTIC SEGMENTATION** – This method labels each region by object type, so that all objects of the same type are marked as one object.

## **3. Linear Regression**

Linear regression fits a **linear equation that describes the relationship between a target variable and one or more input variables.**

**The algorithm learns to find the best-fitting line that minimizes the sum of squared errors between predicted and actual values.** It is often used in numerical prediction. For example, an algorithm can predict the value of a house based on characteristics such as its location, number of bedrooms, and area. 
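For a single input variable, the best-fitting line has a closed-form solution, sketched below. The house areas and prices are invented toy values that lie exactly on a line, and `fit_line` is an illustrative name rather than a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept).

    slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sum_squared_errors(slope, intercept, xs, ys):
    """The quantity that the fitted line minimizes."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


# Toy data: house area in square meters and price.
areas = [50, 70, 90, 110]
prices = [170_000, 230_000, 290_000, 350_000]

slope, intercept = fit_line(areas, prices)
print(slope, intercept)            # 3000.0 20000.0
print(slope * 100 + intercept)     # predicted price for 100 square meters: 320000.0
```

With several input variables (location, number of bedrooms, area), the same principle applies, but the solution is computed with matrix operations, which is where a numerical library becomes useful.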

Linear regression is widely used in **finance, economics, and social sciences** to analyze relationships between variables and make predictions. For instance, it can be used to predict stock prices based on historical data.

## **4. Logistic Regression**

Logistic regression is a popular algorithm for predicting a **binary outcome, such as “yes” or “no,” based on previous observations in a dataset.**

It determines the relationship between a binary dependent variable and one or more independent variables by fitting a logistic function to the data. The algorithm learns to find the best-fitting curve that separates two classes. 
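The fitting step can be sketched for a single input variable with plain gradient descent on the log-loss. The toy data (number of support tickets versus churn), the learning rate, and the number of epochs are assumptions; libraries use more robust solvers and usually add regularization.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data: support tickets opened by a customer, and whether they churned (1) or not (0).
tickets = [0, 1, 2, 3, 4, 5, 6, 7]
churned = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(tickets, churned)
for x in (0, 7):
    print(f"{x} tickets -> churn probability {sigmoid(w * x + b):.2f}")
```

The prediction "yes" or "no" is obtained by thresholding the probability, commonly at 0.5, although the threshold can be adjusted to the cost of each type of error.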

Logistic regression is widely used in marketing, healthcare, and social sciences to predict churn, detect fraud, and diagnose diseases. For example, it can be used to predict whether a customer is likely to abandon a purchase based on past behavior or to diagnose whether a patient has a particular disease based on their symptoms and medical history.

# **Step 1 - Select a dataset**

Select a dataset from the [UCI ML](https://archive.ics.uci.edu/) repository.

# **Step 2 - Present the dataset**

Present the dataset and explain the task selected for it:<br>
exploratory analysis, clustering, classification, regression, etc.

# **Step 3 - Methodology**

Present the selected methodology.

# **Step 4 - Results**

Present the results, clearly indicating the metrics used.