
# Data Preprocessing Techniques

1. ### Rescaling Data (Normalisation)

  Rescaling is essential when features have different ranges. For instance, if one
  variable ranges from 0 to 1000 while another lies between 0 and 1, models that
  depend on distance calculations, such as k-Nearest Neighbours (KNN) and Support Vector Machines (SVMs), can be dominated by the larger-scale feature.
  - Min-Max Scaling: Converts each feature to a scale between 0 and 1.
  - Formula: $X' = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}}$

  Example using Scikit-Learn

In [2]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[200], [400], [600], [800], [1000]])
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]
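
The min-max formula can also be applied by hand to check what the scaler does. The following is a minimal sketch that reuses the same five values; the names `col_min`, `col_max` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[200], [400], [600], [800], [1000]], dtype=float)

# Apply X' = (X - X_min) / (X_max - X_min) to each column
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

# The hand-computed values should agree with MinMaxScaler
assert np.allclose(manual, MinMaxScaler().fit_transform(data))
print(manual.ravel())
```

In practice the scaler object is still preferable, because it stores the fitted minimum and maximum and can apply them to new data with `transform`.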


2. ### Standardisation (Z-Score scaling)

  Standardisation transforms data so that it has a mean of 0 and a standard
  deviation of 1. This technique helps models that are sensitive to feature scale, such as regularised logistic and linear regression and models trained with gradient descent. It neither requires nor produces normally distributed data.
  - Z-Score Formula: $X' = \dfrac{X - \mu}{\sigma}$
  - where μ is the mean and σ is the standard deviation of the feature.
  
  Example using Scikit-Learn

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[50], [100], [150], [200], [250]])
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
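
To see where the standardised values come from, the z-score can be computed directly. This is a small sketch on the same data; `mu`, `sigma` and `manual` are illustrative names, and it relies on the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[50], [100], [150], [200], [250]], dtype=float)

mu = data.mean(axis=0)
sigma = data.std(axis=0)  # population standard deviation (ddof=0)
manual = (data - mu) / sigma

# The hand-computed z-scores should agree with StandardScaler
assert np.allclose(manual, StandardScaler().fit_transform(data))
print(manual.ravel().round(4))
print(f"mean={manual.mean():.2f}, std={manual.std():.2f}")
```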

3. ### Binarisation
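  Binarisation maps each value to 0 or 1 by comparing it with a threshold: values
  above the threshold become 1 and all others become 0. It is often required for
  algorithms like Bernoulli Naive Bayes, which expect binary features.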

  Example using Scikit-Learn

In [None]:
from sklearn.preprocessing import Binarizer
import numpy as np

data = np.array([[1.5], [0.3], [2.8], [0.5]])
binarizer = Binarizer(threshold=1.0)
binary_data = binarizer.fit_transform(data)
print(binary_data)
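
The thresholding rule is simple enough to reproduce with a NumPy comparison. The sketch below uses the same data and threshold; `manual` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

data = np.array([[1.5], [0.3], [2.8], [0.5]])
threshold = 1.0

# Values strictly greater than the threshold become 1, the rest become 0
manual = (data > threshold).astype(float)

# The comparison should match Binarizer's output
assert np.array_equal(manual, Binarizer(threshold=threshold).fit_transform(data))
print(manual.ravel())
```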

## Feature Selection in Machine Learning
Feature selection aims to identify the most relevant variables for model training, eliminating redundant or irrelevant ones.

## Methods of feature selection
**📌 Univariate selection:** Uses statistical tests to evaluate feature
importance.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-scores
best_features = SelectKBest(score_func=f_classif, k=2)
X_new = best_features.fit_transform(X, y)
print(X_new.shape)
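
The shape alone does not show which features were kept. As a hedged sketch, the fitted selector's `scores_` attribute and `get_support` method can be used to inspect the result; the variable names `selector` and `selected` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# Print the ANOVA F-score computed for each feature
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.1f}")

# Map the selected column indices back to feature names
selected = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected:", selected)
```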

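**📌 Principal Component Analysis (PCA):** Reduces dimensionality by projecting
the data onto the directions of greatest variance. Strictly speaking, this is
feature extraction rather than selection, since each component is a combination
of the original features.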

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca.shape)
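
How much variance the retained components preserve can be checked with the fitted model's `explained_variance_ratio_` attribute. A minimal sketch on the Iris data follows; `ratios` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_
print(ratios)
print(f"Total variance retained: {ratios.sum():.3f}")
```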

# Training and Testing in Machine Learning
The goal of a machine learning model is to learn patterns from data and generalise to new, unseen inputs. This requires a structured approach to model evaluation.

## Why do we need separate training and testing sets?
- **To detect overfitting:** If a model is tested on the same data it was trained on, a model that has simply memorised the training examples will still score well, so its failure to generalise goes unnoticed.
- **To measure real-world performance:** A model should be evaluated on unseen data to understand how it will behave in practical applications.
- **To compare models effectively:** Different models can be assessed fairly by using the same test dataset for evaluation.

## Splitting Data for Training and Testing
### Holdout method
The simplest technique is to split the dataset into two parts:
1. Training set: Used to train the model.
2. Testing set: Used to evaluate performance.

🔹 Common split ratios:
- 80% training / 20% testing
- 70% training / 30% testing
- 67% training / 33% testing (used in small datasets)

**Advantage:** Simple and fast.

**Disadvantage:** Results can vary noticeably depending on which samples land in the test set, especially on small datasets.
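
To make the split ratios listed above concrete, the sketch below applies each one to the 150-sample Iris dataset and reports the resulting set sizes. The dictionary `split_sizes` is an illustrative helper, not part of scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

split_sizes = {}
for test_size in (0.2, 0.3, 0.33):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Record how many samples ended up on each side of the split
    split_sizes[test_size] = (len(X_tr), len(X_te))
    print(f"test_size={test_size}: {len(X_tr)} train / {len(X_te)} test")
```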

Example using Scikit-learn

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load dataset
X, y = load_iris(return_X_y=True)

# Split dataset (80% train, 20% test); random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model (max_iter raised so the solver has room to converge)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate model on the held-out test set
accuracy = model.score(X_test, y_test)
print(f"Test Accuracy: {accuracy:.2f}")
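
The sensitivity of the holdout method to the particular split can be illustrated by repeating the experiment with different seeds. This is a rough sketch, not a benchmark: the five seeds, the `stratify=y` option and `max_iter=1000` are assumptions chosen for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(5):
    # A different random_state gives a different 80/20 partition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))

print([f"{a:.2f}" for a in accuracies])
print(f"min={min(accuracies):.2f}, max={max(accuracies):.2f}")
```

Any spread between the minimum and maximum comes only from which samples were held out, which is the motivation for cross-validation.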

### k-Fold Cross-Validation
Instead of using a single train-test split, k-fold cross-validation divides the
dataset into k roughly equal parts (folds). The model is trained on k-1 folds and
tested on the remaining fold. This process is repeated k times, with each fold
used as the test set exactly once, and the k scores are averaged.
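
The procedure described above can be written out as an explicit loop. The sketch below uses `KFold` with shuffling (an assumption made here because the Iris rows are ordered by class) and a fresh `LogisticRegression` per fold; `fold_scores` and `tested` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Assumption: shuffle before splitting, with a fixed seed for reproducibility
folds = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
tested = []
for train_idx, test_idx in folds.split(X):
    # Train on k-1 folds, evaluate on the remaining fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    tested.append(test_idx)

print(np.round(fold_scores, 2))
print(f"Mean accuracy: {np.mean(fold_scores):.2f}")
```

Collecting the test indices makes it easy to confirm that every sample is used for testing exactly once.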

#### Common k values:
- k = 5 (A common default for most datasets)
- k = 10 (Each model trains on 90% of the data, at the cost of more fits)
- Leave-One-Out Cross-Validation (LOOCV) (Extreme case where k = number of instances)

**Advantage:** More reliable than a single train-test split, because every sample is used for testing once.

**Disadvantage:** Computationally expensive, since the model is trained k times.
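
The leave-one-out case mentioned above can be sketched by passing a `LeaveOneOut` splitter as the `cv` argument. On Iris this means 150 separate fits, which shows why the approach becomes costly on larger datasets; `loo_scores` is an illustrative name and `max_iter=1000` is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# Each fit leaves out exactly one sample, so every score is either 0 or 1
loo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut()
)

print(f"Number of fits: {len(loo_scores)}")
print(f"LOOCV accuracy: {loo_scores.mean():.2f}")
```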

Example:

In [None]:
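from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load dataset and define the (unfitted) model to evaluate
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Perform 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation Accuracy: {scores.mean():.2f}")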
