# 6.0 Supervised learning algorithms
This lesson provides an overview of several key supervised learning algorithms and their practical implementation.

**Lesson Objectives:** By the end of the lesson, students should be able to:
* Understand the basic concepts of supervised learning algorithms.
* Implement key supervised learning algorithms using real-world datasets.

**Supervised Learning** involves learning from labeled data to predict outcomes. The goal is to train a model that maps input data to a target (output) variable. Supervised learning algorithms are typically categorized by the type of problem they solve: regression or classification.

**Key Types of Supervised Learning:**
* **Regression:** Predicting continuous outputs (e.g., house prices, stock prices).
* **Classification:** Predicting categorical outputs (e.g., spam detection, image classification).

# 6.1. Linear Regression

**Key Concepts:**
* **Simple Linear Regression:** Involves finding the line of best fit for a set of data points using the formula `y = mx + b` where *m* is the slope (coefficient), and *b* is the y-intercept.
* **Multiple Linear Regression:** Extends simple linear regression to multiple predictors (features). The model is given by y = w<sub>1</sub>x<sub>1</sub> + w<sub>2</sub>x<sub>2</sub> + ... + w<sub>n</sub>x<sub>n</sub> + b, where each w<sub>i</sub> is the weight of feature x<sub>i</sub> and *b* is the intercept.
* **Cost Function:** Mean Squared Error (MSE), which quantifies the difference between predicted and actual values.
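* **Gradient Descent:** An optimization technique that minimizes the cost function by iteratively adjusting the model parameters in the direction of the negative gradient. Note that scikit-learn's `LinearRegression` does not use gradient descent; it solves the least-squares problem directly.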
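
To make the cost function and gradient descent concrete, here is a minimal sketch that fits `y = mx + b` to toy data in plain Python. The data, the learning rate `lr`, and the iteration count `n_iters` are illustrative assumptions, not recommended values.

```python
# Minimal sketch: fit y = m*x + b by batch gradient descent on the MSE.
# The toy data, learning rate, and iteration count are illustrative assumptions.

def mse(xs, ys, m, b):
    """Mean squared error of the line y = m*x + b on the data."""
    n = len(xs)
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, lr=0.05, n_iters=2000):
    """Return (m, b) after n_iters gradient steps, starting from (0, 0)."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Partial derivatives of the MSE with respect to m and b
        grad_m = (2.0 / n) * sum((m * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient to reduce the cost
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b


# Toy data generated from y = 2x + 1 (no noise)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = gradient_descent(xs, ys)
print(f"m = {m:.3f}, b = {b:.3f}, MSE = {mse(xs, ys, m, b):.6f}")
```

On this noise-free data the fitted slope and intercept should approach 2 and 1. A learning rate that is too large makes the updates diverge, so it is worth experimenting with `lr`.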

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
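from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt

# Load dataset
# load_boston was removed in scikit-learn 1.2; load_diabetes, a bundled regression dataset, is used here instead
data = load_diabetes()
X = data.data
y = data.target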

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Visualize
plt.scatter(y_test, y_pred)
plt.xlabel("True values")
plt.ylabel("Predicted values")
plt.title("Linear Regression: True vs Predicted")
plt.show()

# 6.2. Logistic Regression
**Key Concepts:**
* **Sigmoid Function:** Maps the linear output to a value between 0 and 1, which can be interpreted as the probability of the positive class in a binary classification problem.
* **Binary Classification:** Logistic regression is used to classify data into two categories (e.g., yes/no, 0/1).
* **Model Evaluation Metrics:** Accuracy alone can be misleading, especially when classes are imbalanced, so it is usually reported alongside other metrics:
  - **Accuracy:** The proportion of correctly classified instances.
  - **Precision:** The ratio of true positives to predicted positives.
  - **Recall:** The ratio of true positives to actual positives.
  - **F1-Score:** The harmonic mean of precision and recall.
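
The sigmoid function and the four metrics above can be computed by hand. The sketch below uses hypothetical linear scores and labels, and a 0.5 decision threshold, all chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical linear outputs (w.x + b) and true labels, for illustration only
scores = [-2.0, -0.5, 0.3, 1.2, 2.5, -1.0]
y_true = [0, 1, 1, 1, 1, 0]

probs = [sigmoid(z) for z in scores]
y_pred = [1 if p >= 0.5 else 0 for p in probs]  # threshold at 0.5

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here the model misses one positive example, so recall drops below precision even though no negative example is misclassified.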

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
data = load_iris()
X = data.data
y = (data.target == 0).astype(int)  # Convert to binary (setosa or not)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
print(classification_report(y_test, y_pred))

# 6.3. Decision Trees and Random Forests
**Key Concepts:**
* **Decision Trees:** A model that splits data into subsets based on feature values, creating a tree-like structure. Gini Index and Entropy are criteria used to choose the best splits.
* **Random Forests:** An ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting. It relies on bagging (bootstrap aggregating): each tree is trained on a different random sample of the data drawn with replacement, and the trees' predictions are combined by voting.
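
As a rough illustration of how split criteria work, the sketch below computes the Gini index and entropy of a set of labels and compares two candidate splits. The labels and splits are made up for the example, and `weighted_impurity` is a hypothetical helper, not a scikit-learn function.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over classes."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Made-up parent node with two balanced classes
parent = ["a"] * 5 + ["b"] * 5

# Candidate split 1 separates the classes fairly well
good_left, good_right = ["a"] * 4 + ["b"], ["a"] + ["b"] * 4
# Candidate split 2 leaves both children as mixed as the parent
bad_left, bad_right = ["a", "a", "b", "b"], ["a", "a", "a", "b", "b", "b"]

print("Parent Gini:", gini(parent), "entropy:", entropy(parent))
print("Good split Gini:", weighted_impurity(good_left, good_right, gini))
print("Bad split Gini:", weighted_impurity(bad_left, bad_right, gini))
```

A tree-building algorithm prefers the split with the lower weighted impurity, which here is the first candidate.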

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split into train and test sets so the models are evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# Train Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_tree = tree_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)

# Evaluate models
print("Decision Tree Classification Report")
print(classification_report(y_test, y_pred_tree))

print("Random Forest Classification Report")
print(classification_report(y_test, y_pred_rf))

# 6.4. Support Vector Machines (SVM)
**Key Concepts:**
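* **Hyperplanes and Margins:** SVM tries to find the hyperplane that separates the classes with the maximum margin, that is, the largest distance to the nearest training points of each class.
* **Kernel Trick:** A kernel function implicitly maps input data into a higher-dimensional space, so that classes that are not linearly separable in the original space can still be separated.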
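
The idea behind the kernel trick can be checked numerically. The sketch below uses a simplified degree-2 polynomial kernel, `(x . z)^2`, on 2-D inputs and shows that it equals an ordinary dot product in an explicit 3-D feature space. The kernel form and the sample points are assumptions chosen for illustration; scikit-learn's `'poly'` kernel has additional parameters.

```python
import math


def poly_kernel(x, z):
    """Simplified degree-2 polynomial kernel on 2-D inputs: (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Two arbitrary sample points
x = (1.0, 2.0)
z = (3.0, 0.5)

k_implicit = poly_kernel(x, z)                     # computed in the original 2-D space
k_explicit = dot(feature_map(x), feature_map(z))   # computed in the mapped 3-D space

print("Kernel value:", k_implicit)
print("Dot product in feature space:", k_explicit)
```

Both computations give the same value, but the kernel never constructs the higher-dimensional vectors. This is what keeps kernels such as RBF practical even though their feature space is infinite-dimensional.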

In [None]:
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVM model with linear kernel
svm_model = SVC(kernel='linear')  # other kernel options include 'poly', 'rbf', and 'sigmoid'
svm_model.fit(X_train, y_train)

# Evaluate model
y_pred = svm_model.predict(X_test)
print(classification_report(y_test, y_pred))

# 6.5. K-Nearest Neighbors (KNN)
**Key Concepts:**
* **KNN:** A simple, instance-based learning algorithm that classifies a data point based on the majority class of its k-nearest neighbors.
* **Distance Metrics:** The most common are Euclidean distance and Manhattan distance.
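
To show what happens inside KNN, here is a minimal from-scratch sketch with both distance metrics. The points and labels are invented for the example, and `knn_predict` is a hypothetical helper rather than part of scikit-learn. Ties in the vote are not handled specially.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    # Sort training indices by distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: distance(train_points[i], query))
    # Count the labels of the k nearest neighbors and take the most common
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Invented 2-D training data with two well-separated groups
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]

pred_a = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
pred_b = knn_predict(train_points, train_labels, (6.0, 6.5), k=3, distance=manhattan)
print(pred_a, pred_b)
```

Because KNN depends directly on distances, features measured on very different scales are usually standardized first.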

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN model
# Tune the value of the number of neighbors and observe its effect on performance
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

# Evaluate model
y_pred = knn_model.predict(X_test)
print(classification_report(y_test, y_pred))

**Homework:**
* Implement a machine learning pipeline that includes preprocessing, model training, and evaluation on a new dataset (e.g., a Kaggle dataset).
* Experiment with hyperparameter tuning and model selection techniques.
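
As one possible starting point for the homework, the sketch below chains preprocessing and a model in a scikit-learn `Pipeline` and tunes one hyperparameter with `GridSearchCV`. The bundled wine dataset, the choice of KNN, and the parameter grid are placeholder assumptions; substitute your own dataset and models.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; replace with the dataset chosen for the homework
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model in one object, so scaling is fit on training folds only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid: step name, double underscore, then the parameter name
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}

# 5-fold cross-validated search over the grid
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into preprocessing during the search.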