# Machine Learning Models

## Machine Learning Models


A machine learning (ML) model is a mathematical representation or computational function that learns patterns from data to make predictions or decisions **without being explicitly programmed for specific tasks.**

### Taxonomies of Machine Learning Models

**Taxonomies of machine learning models** provide a structured way to categorize algorithms based on how they learn from data and make predictions. These classifications help us distinguish between different types of models.

1. Supervised vs. Unsupervised Models
2. Model-based vs. Model-free
3. Probabilistic vs. Non-probabilistic Models

### 1. Supervised vs. Unsupervised Models

**Supervised Models:** Supervised models learn from labeled data by mapping inputs to known outputs to make accurate predictions.
Examples: Linear Regression, Decision Trees, Support Vector Machines (SVM), Random Forest, Neural Networks

**Unsupervised Models:** Unsupervised models learn from unlabeled data by identifying patterns, structures, or groupings without predefined outputs.
Examples: K-Means Clustering, Hierarchical Clustering, DBSCAN, Principal Component Analysis (PCA), Autoencoders

**The key difference is whether the model is learning with answers or figuring things out on its own.**
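In scikit-learn, this difference shows up directly in how a model is fitted. The sketch below is illustrative only: the Iris dataset, `LogisticRegression`, and `KMeans` are choices made for this example, not something the discussion above prescribes. The supervised model receives both inputs and labels, while the unsupervised model receives inputs only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Small built-in dataset, chosen here only for illustration
X, y = load_iris(return_X_y=True)

# Supervised: labels y are passed to fit(), so the model learns an
# input-to-label mapping.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
predicted_labels = clf.predict(X[:5])

# Unsupervised: only X is passed. The model groups samples on its own,
# and the cluster ids are arbitrary (they need not match the class labels).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("Predicted labels:", predicted_labels)
print("Cluster ids of the first five samples:", cluster_ids[:5])
```

Note that `fit(X, y)` versus `fit(X)` is the only structural difference between the two calls.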


#### 🧠 Build the Intuition – Supervised vs. Unsupervised Learning

| **Question**                                                                 | **Type of Learning**       |
|------------------------------------------------------------------------------|-----------------------------|
| We predict the chance of a patient having a disease based on medical history. | ✅ Supervised Learning       |
| We group movies based on viewer preferences without knowing genres.          | 🔍 Unsupervised Learning     |
| We detect unusual credit card transactions without labeled fraud data.       | 🔍 Unsupervised Learning     |
| We teach a model to recognize handwritten digits using labeled images.       | ✅ Supervised Learning       |


### 2. Model-based vs. Model-free

**Model-Based Models:** Model-based models build an internal representation (a "map") of the environment or system and use it to plan and predict outcomes. This distinction is used mainly in reinforcement learning and control.
Examples: Dyna-Q, Kalman Filter, Hidden Markov Model (HMM), Model Predictive Control (MPC)

**Model-Free Models:** Model-free models don’t try to understand how things work. Instead, they directly learn what actions or predictions lead to good results, based only on experience.
Examples: Q-Learning, Deep Q-Network (DQN), SARSA, Policy Gradient Methods

**The key difference is whether the model builds a mental map of how things work, or just learns what actions lead to success.**
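To make the contrast concrete, here is a minimal sketch on a made-up five-state corridor in which the agent earns a reward for reaching the rightmost state. The environment, the `step` and `backup` helpers, and all constants are hypothetical and exist only for this illustration. The model-based agent plans with the known transition function (value iteration), while the model-free agent learns action values purely from sampled experience (tabular Q-learning).

```python
import numpy as np

N_STATES, GOAL = 5, 4   # states 0..4; state 4 is terminal
MOVES = (-1, +1)        # action 0 moves left, action 1 moves right
GAMMA = 0.9             # discount factor

def step(state, action):
    """Deterministic toy environment: reward 1.0 on reaching the goal."""
    next_state = min(max(state + MOVES[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def backup(state, action, values):
    """One-step lookahead that uses the environment model."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]

# Model-based: plan with the known model via value iteration.
V = np.zeros(N_STATES)
for _ in range(100):
    for s in range(GOAL):  # the terminal state keeps value 0
        V[s] = max(backup(s, a, V) for a in range(2))
planned_policy = [int(np.argmax([backup(s, a, V) for a in range(2)]))
                  for s in range(GOAL)]

# Model-free: tabular Q-learning from experience only. step() is called
# as a black box, and its rules are never inspected for planning.
rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, 2))
for episode in range(500):
    s = int(rng.integers(GOAL))  # random non-terminal start state
    for t in range(50):          # cap the episode length
        explore = rng.random() < 0.2
        a = int(rng.integers(2)) if explore else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        Q[s, a] += 0.5 * (r + GAMMA * Q[s2].max() - Q[s, a])
        s = s2
        if s == GOAL:
            break
learned_policy = [int(np.argmax(Q[s])) for s in range(GOAL)]

print("Planned policy:", planned_policy)
print("Learned policy:", learned_policy)
```

In this toy setting both approaches should end up preferring "move right" in every state. The difference is how they get there: one reasons over a model, the other relies on trial and error.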

#### 🧠 Build the Intuition – Model-Based vs. Model-Free Learning

| **Question**                                                                 | **Type of Learning**    |
|------------------------------------------------------------------------------|--------------------------|
| A self-driving car predicts traffic behavior using a learned map of road rules. | 🚗 Model-Based Learning  |
| A robot learns to walk by trying different movements and adjusting from results. | 🤖 Model-Free Learning   |
| A game-playing agent studies how actions affect the game world to plan moves.   | 🎮 Model-Based Learning  |
| A recommendation system tries options and learns what users click most.        | 📦 Model-Free Learning   |


### 3. Probabilistic vs. Non-probabilistic Models

**Probabilistic Models:** Probabilistic models don’t just make predictions—they also tell you how confident they are. These models use probability distributions to express uncertainty.
Examples: Naive Bayes, Hidden Markov Model (HMM), Bayesian Networks, Gaussian Mixture Models (GMM)

**Non-Probabilistic Models:** In their basic form, non-probabilistic models give a single, fixed output without expressing any uncertainty. They focus on finding a clear boundary or rule to separate classes or make predictions. Some implementations can still report scores such as vote fractions, but these do not come from an explicit probability model of the data.
Examples: Support Vector Machines (SVM), Decision Trees, K-Nearest Neighbors (KNN), Linear Regression (standard form)

**The key difference is that probabilistic models express uncertainty, while non-probabilistic models make fixed decisions.**
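The following sketch shows how this difference can look in scikit-learn. It assumes the built-in Breast Cancer dataset and picks `GaussianNB` and `LinearSVC` as representatives of each family. Both choices are illustrative, and the features are scaled only to help the linear SVM converge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scaling helps the linear SVM converge

# Probabilistic: Naive Bayes returns a probability for each class.
nb = GaussianNB().fit(X, y)
probabilities = nb.predict_proba(X[:3])  # one row per sample, one column per class

# Non-probabilistic: a linear SVM returns hard labels. It also exposes
# signed distances to the decision boundary, but those are not probabilities.
svm = LinearSVC(max_iter=10000).fit(X, y)
labels = svm.predict(X[:3])
margins = svm.decision_function(X[:3])

print("Class probabilities:\n", probabilities)
print("Hard labels:", labels)
print("Has predict_proba:", hasattr(svm, "predict_proba"))
```

`LinearSVC` has no `predict_proba` method at all, which reflects the idea that the model itself does not define a probability distribution over classes.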

#### 🧠 Build the Intuition – Probabilistic vs. Non-Probabilistic Models

| **Question**                                                                 | **Type of Model**              |
|------------------------------------------------------------------------------|--------------------------------|
| A spam filter gives a 90% chance that an email is spam.                     | 🎯 Probabilistic Model         |
| A classifier labels emails as spam or not without showing confidence.       | ⚙️ Non-Probabilistic Model     |
| A medical system estimates the probability of disease based on symptoms.    | 🎯 Probabilistic Model         |
| A decision tree makes a yes/no loan approval based only on thresholds.      | ⚙️ Non-Probabilistic Model     |


## Building a Machine Learning Model

Building a machine learning model from scratch can be challenging—it requires deep knowledge of algorithms, data structures, math, and efficient programming. Doing this every time for every problem would be slow and impractical.

That’s why we use **Scikit-learn**, a powerful and easy-to-use Python library that provides ready-to-use tools for building, training, and evaluating ML models.

Features of scikit-learn:

1. 🔧 Easy-to-use API
2. 🧹 Built-in data preprocessing
3. 🔁 Model selection and tuning
4. 📏 Evaluation metrics
5. 🧩 Pipeline support
6. 🧪 Well-documented & actively maintained
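As a rough illustration of how several of these features fit together, the sketch below chains preprocessing and a model into a pipeline and evaluates it with 5-fold cross-validation. The dataset, the choice of `StandardScaler` with `LogisticRegression`, and the number of folds are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Pipeline: scaling (preprocessing) followed by a classifier.
# Inside cross-validation the scaler is refit on each training fold only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection tool + evaluation metric: 5-fold cross-validated accuracy
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Wrapping preprocessing in the pipeline keeps information from the held-out fold from leaking into the scaling step.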




#### Introduction to Building ML Models Using Scikit-learn

Scikit-learn comes with several built-in datasets that we can use for practice and demonstration. Let's start by importing the necessary libraries and loading one of these datasets.

In [None]:
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

The Breast Cancer dataset is loaded, converted into a DataFrame with feature names as columns, and the target labels are added as a new column.

In [None]:
# Load the dataset
cancer = load_breast_cancer()

# Build a DataFrame with the feature names as column headers
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
# Add the target labels (0 = malignant, 1 = benign) as a new column
df['target'] = cancer.target

df.head()

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


Here, we separate our data into two parts:

* X contains the features — these are the input variables or measurable properties (like mean radius, mean texture, mean area, etc.) that describe each data point.

* y contains the target — this is the output label or value that the model is trying to predict (in this case, whether a tumor is malignant or benign).

In simple terms, features are the inputs, and the target is the answer the model tries to learn.

In [None]:
X = df[cancer.feature_names]
y = df['target']
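A quick sanity check after separating features and target is to look at their shapes and at how the classes are distributed. The sketch below rebuilds `X` and `y` so that it runs on its own. The variable name `class_counts` is introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

X = df[cancer.feature_names]  # features: one column per measurement
y = df["target"]              # target: 0 = malignant, 1 = benign

print(X.shape)  # (569, 30): 569 samples, 30 features
print(y.shape)  # (569,)

# Map each numeric label to its class name
print({i: str(name) for i, name in enumerate(cancer.target_names)})

# Number of samples per class
class_counts = y.value_counts().to_dict()
print(class_counts)  # 357 benign (1) and 212 malignant (0)
```

Checking the class balance early is useful because it affects how metrics such as accuracy should be interpreted.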

In machine learning, we use scikit-learn's train_test_split function to divide our dataset into two parts:

* Training set: Used to train the model (learn patterns).

* Testing set: Used to evaluate how well the model performs on unseen data.

This helps us measure the model’s **generalization** ability: how well it performs on new, unseen data rather than only on the data it was trained on. Good generalization is the key goal of any ML model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X: feature variables
# y: target labels
# test_size=0.2: reserves 20% of the data for testing and uses 80% for training
# random_state=42: ensures that the split is reproducible (same result each time you run the code)

# The output gives us four sets: X_train, X_test, y_train, and y_test used for training and testing the model respectively.
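One optional variation, not used in the split above, is to pass `stratify=y` so that the training and test sets keep roughly the same class proportions as the full dataset. The sketch below is self-contained and shows the resulting set sizes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# as_frame=True returns a pandas DataFrame (X) and Series (y)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of the rows go to the test set
    random_state=42,  # reproducible shuffle
    stratify=y,       # optional: preserve class proportions in both sets
)

print(len(X_train), len(X_test))  # 455 114

# Fraction of benign samples (label 1) in each set
print(round(y.mean(), 3), round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Stratification matters most when classes are imbalanced or the dataset is small, because a purely random split can leave one class under-represented in the test set.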

We train a machine learning model using RandomForestClassifier, an ensemble of decision trees. Setting random_state ensures reproducibility. Using .fit(X_train, y_train), the model learns patterns from the training data to make accurate predictions on unseen data.

In [None]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
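Once the forest is fitted, it can be inspected. The sketch below repeats the split and training steps so it runs independently, then looks at the number of trees in the ensemble and at the impurity-based feature importances. The variable name `importances` is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# The ensemble is a list of fitted decision trees (100 by default)
print("Number of trees:", len(model.estimators_))

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```

Impurity-based importances give only a rough ranking and can be biased toward features with many distinct values, so they are best read as a first look, not a definitive answer.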

**First step:** We use the trained model to predict on test data with model.predict(X_test), storing results in y_pred. This inference step applies learned patterns to unseen data.

**Second step:** We evaluate the model using classification_report(y_test, y_pred), which quickly summarizes precision, recall, F1-score, and accuracy to assess performance.

In [None]:
y_pred = model.predict(X_test)

print("📊 Classification Report:\n", classification_report(y_test, y_pred))

📊 Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
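The numbers in the report can be traced back to the confusion matrix. Treating class 1 as the positive class, precision is TP / (TP + FP), recall is TP / (TP + FN), and accuracy is (TP + TN) / total. The sketch below recomputes these by hand and compares them with scikit-learn's own metric functions. It repeats the earlier steps so that it can run on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes (label order 0, 1)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()  # class 1 is treated as "positive"

precision_1 = tp / (tp + fp)     # of all predicted 1s, how many are correct
recall_1 = tp / (tp + fn)        # of all true 1s, how many were found
accuracy = (tp + tn) / cm.sum()  # overall fraction of correct predictions

print(cm)
print("Precision (class 1):", round(precision_1, 2))
print("Recall (class 1):", round(recall_1, 2))
print("Accuracy:", round(accuracy, 2))
print("Matches sklearn precision:", abs(precision_1 - precision_score(y_test, y_pred)) < 1e-12)
```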



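The classification report shows about 96% accuracy on the test set, with per-class precision and recall between 0.93 and 0.99. Because these scores are computed on data the model did not see during training, they are evidence of good generalization on this particular split rather than a sign of overfitting. Overfitting, where the model learns the training data too well and struggles with new data, would show up as training scores that are much higher than test scores. A single split of a small dataset is still a limited check, so results on other data may differ.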

**Generalization is key:** a good model performs well not just on training data but also on unseen data, balancing underfitting and overfitting.
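A simple way to look for overfitting is to compare accuracy on the training set with accuracy on the test set. The sketch below does this with `accuracy_score`, repeating the earlier steps so it can run on its own. The variable names `train_accuracy`, `test_accuracy`, and `gap` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Accuracy on data the model has seen vs. data it has not seen
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
gap = train_accuracy - test_accuracy

print("Train accuracy:", round(train_accuracy, 3))
print("Test accuracy:", round(test_accuracy, 3))
print("Gap:", round(gap, 3))
```

A small gap suggests the model generalizes reasonably well on this split, while a large gap (high training accuracy, much lower test accuracy) points toward overfitting. Low scores on both sets would instead suggest underfitting.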