## MACHINE LEARNING

- Machine learning is an approach that enables systems to learn patterns from data and use those patterns to make predictions about new, unseen inputs.

### 1. Categories of Machine Learning
**A. Supervised Learning**

Supervised learning works with labeled data: for every training example, the target variable y is known.

Two major tasks:

**1. Regression**

Used when the target variable y is continuous.

**Examples:**

- Predicting house price

- Predicting temperature

**Model Used:** Linear Regression

- Simple Linear Regression (one independent variable)

- Multiple Linear Regression (multiple independent variables)


**2. Classification**

Used when the target variable y is discrete.

**Examples:**

- Spam or not spam

- Disease present or not

**Model Used:** Logistic Regression

- Simple Logistic Regression (one independent variable)

- Multiple Logistic Regression (multiple independent variables)
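As a minimal sketch of how a logistic regression model turns features into a class label, the snippet below passes a weighted sum of the features through the sigmoid function and thresholds the result at 0.5. The weights, bias, threshold, and function names are hypothetical values chosen for illustration; they are not fitted parameters.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (positive class) or 0 (negative class)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical weights for two made-up features of an email
weights = [1.5, -2.0]
bias = -0.5

print(predict_label([3.0, 0.5], weights, bias))  # z = 4.5 - 1.0 - 0.5 = 3.0
print(predict_label([0.0, 1.0], weights, bias))  # z = -2.0 - 0.5 = -2.5
```

With a single feature this corresponds to the simple case, and with several features to the multiple case.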

**B. Unsupervised Learning**

Unsupervised learning has no specific output variable. The system identifies hidden patterns within the data.

Common task:

**1. Clustering**

Groups similar data points together based on patterns.

**Examples:**

- Customer segmentation

- Grouping similar images
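To make the grouping idea concrete, here is a small sketch using k-means, one common clustering algorithm (the notes do not name a specific one, so this choice is an assumption). The data, starting centroids, and function name are made up for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny 1-D k-means: returns (centroids, labels)."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return centroids, labels

# Made-up annual spend values for six customers
spend = [10, 12, 11, 95, 100, 105]
centroids, labels = kmeans_1d(spend, centroids=[0, 50])
print(centroids)
print(labels)
```

No target column is supplied anywhere: the two groups (low spenders and high spenders) emerge from the feature values alone.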

### 2. Example Dataset Structure

A general dataset contains multiple feature columns:

| X1 | X2 | X3 | X4 | X5 |
| -- | -- | -- | -- | -- |
|    |    |    |    |    |
|    |    |    |    |    |

Here,

- X1, X2, X3... are input features

- Output depends on the type of learning:

  - Regression: continuous value

  - Classification: category label

  - Clustering: no output column
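The difference between the three tasks can be sketched as the same feature rows paired with different kinds of output. All values and variable names below are made up for illustration.

```python
# Feature rows shared by all three tasks (made-up values)
X = [
    [1.0, 0.2, 3.1],
    [2.0, 0.4, 2.9],
    [3.0, 0.1, 3.5],
]

# Regression: one continuous target per row
y_regression = [10.5, 20.1, 29.8]

# Classification: one category label per row
y_classification = ["spam", "not spam", "spam"]

# Clustering: no target column at all; only X is given
y_clustering = None

for name, y in [("regression", y_regression),
                ("classification", y_classification),
                ("clustering", y_clustering)]:
    has_target = y is not None
    print(f"{name}: {len(X)} rows, target column present: {has_target}")
```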

## Summary Diagram

**Machine Learning**

- **Supervised Learning**
  - Regression
    - y is continuous
    - Linear Regression
      - Simple
      - Multiple
  - Classification
    - y is discrete
    - Logistic Regression
      - Simple
      - Multiple
- **Unsupervised Learning**
  - No specific output
  - Clustering

## Linear Regression

- Linear regression is a supervised learning technique used to model the relationship between a dependent variable and one or more independent variables. It helps predict a numeric, continuous output.

## 1. Core Idea

Linear regression finds the best-fitting straight line that represents the relationship between:

- X (independent variable, for example, experience)

- Y (dependent variable, for example, salary)

The goal is to estimate the line:

**y = m x + c**

Where:

- m is the slope (rate of change of y with respect to x)

- c is the intercept (value of y when x is zero)

## 2. Example Dataset

| Experience (X) | Salary (Y) |
| -------------- | ---------- |
| 1              | 10         |
| 2              | 20         |
| 3              | 30         |
| 4              | 40         |

This dataset shows a perfect linear pattern. As experience increases, salary increases proportionally.

## 3. Slope and Intercept
**Slope (m)**

Measures how much Y changes when X increases by 1 unit.

Mathematically:

m = change in Y divided by change in X

**Example from the dataset:**

Between (1,10) and (2,20):

m = (20 - 10) divided by (2 - 1) = 10

This means salary increases by 10 for every 1 year of experience.

**Intercept (c)**

c is the predicted salary when experience is zero.

From the line y = mx + c:

c = y minus m x

Using point (1,10):

c = 10 minus (10 multiplied by 1)

c = 0

So the regression equation becomes:

**y = 10 x**
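The two-point calculation works here because the data is perfectly linear. For noisy data, the slope and intercept are commonly estimated by ordinary least squares, which uses every point. A minimal sketch, assuming a single feature; the function and variable names are hypothetical:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y)
    c = mean_y - m * mean_x
    return m, c

experience = [1, 2, 3, 4]
salary = [10, 20, 30, 40]
m, c = fit_line(experience, salary)
print(m, c)
```

On the example dataset this recovers the same slope of 10 and intercept of 0 as the hand calculation.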

## 4. Prediction Example

**To predict salary for someone with 7 years of experience:**

y = 10 x

y = 10 multiplied by 7

y = 70

So the predicted salary is 70.
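The same prediction can be sketched as a small function. The name `predict_salary` is hypothetical, and the default slope and intercept are the values worked out for the example dataset.

```python
def predict_salary(years, m=10.0, c=0.0):
    """Apply y = m*x + c with the slope and intercept derived above."""
    return m * years + c

for years in [1, 4, 7]:
    print(years, predict_salary(years))
```

Note that 7 years lies outside the observed range of 1 to 4 years, so this prediction is an extrapolation and relies on the linear pattern continuing.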

## Assumptions of Linear Regression

- Linear Regression is a parametric model, which means it comes with a set of assumptions. If these assumptions are not met, the model's predictions and coefficient estimates may be unreliable in real-world deployment.

## **1. Linearity of Data**

**The relationship between the independent variables (X) and the dependent variable (Y) must be linear in nature.**

How to check linearity:

- Scatter Plot: used to visually check whether X and Y show a linear trend, either positive or negative.

- Correlation Matrix: measures the strength and direction of the linear relationship between each pair of variables. A correlation coefficient close to +1 or -1 indicates a strong linear relationship.
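Each entry of a correlation matrix is a Pearson correlation coefficient. The sketch below computes it for one pair of variables; the function name and the `curved` data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

experience = [1, 2, 3, 4]
salary = [10, 20, 30, 40]          # perfectly linear in experience
curved = [1, 4, 9, 16]             # made-up values following x squared

r_linear = pearson_r(experience, salary)
r_curved = pearson_r(experience, curved)
print(r_linear, r_curved)
```

The curved data still scores above 0.98, which is why a high coefficient is best read together with a scatter plot and not as proof of linearity on its own.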

## **2. Homoscedasticity**

**The variance of residuals must remain constant across all levels of the independent variable.**

If variance is similar at every point along the fitted line, the data satisfies homoscedasticity.

Violations (heteroscedasticity) lead to:

- Unequal spread of errors

- Poor prediction stability

- Unreliable confidence intervals
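One rough way to look for an unequal spread is to compare the residual variance for low and high values of X. The sketch below is a crude heuristic, loosely in the spirit of the Goldfeld-Quandt test, and not a formal statistical test; the residuals and function names are made up for illustration.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def spread_ratio(xs, residuals):
    """Ratio of residual variance in the upper half of x to the lower half."""
    pairs = sorted(zip(xs, residuals))          # order residuals by x
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return variance(high) / variance(low)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
# Made-up residuals: similar spread everywhere vs. spread growing with x
steady = [1, -1, 1, -1, 1, -1, 1, -1]
fanning = [1, -1, 1, -1, 4, -4, 4, -4]

print(spread_ratio(xs, steady))
print(spread_ratio(xs, fanning))
```

A ratio near 1 is consistent with homoscedasticity, while a ratio far from 1 suggests the spread of errors changes with X. The sketch assumes the lower half has non-zero variance.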

## **3. No Multicollinearity**

**Input features should not be highly correlated with one another.**

High multicollinearity leads to:

- Unstable coefficient estimates

- Overfitting

- Difficulty interpreting feature importance

How to detect multicollinearity:

- Correlation matrix

- Variance Inflation Factor (VIF)

VIF values above 5 or 10 indicate a problem.
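The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the other features. A minimal sketch for the two-feature case, where that regression has a single predictor; the data and function names are made up for illustration, and the general case needs a multiple regression instead.

```python
def r_squared_simple(xs, ys):
    """R squared from regressing ys on the single feature xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def vif_two_features(x1, x2):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing x1 on x2."""
    return 1.0 / (1.0 - r_squared_simple(x2, x1))

# Made-up features
years_experience = [1, 2, 3, 4, 5, 6]
age = [23, 24, 26, 27, 28, 30]        # moves almost in step with experience
commute_km = [5, 1, 4, 2, 6, 3]       # roughly unrelated to experience

vif_age = vif_two_features(years_experience, age)
vif_commute = vif_two_features(years_experience, commute_km)
print(vif_age, vif_commute)
```

Pairing experience with age gives a VIF far above 10, while pairing it with commute distance gives a VIF close to 1.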

## **4. No Autocorrelation**

**The residuals should not be correlated with one another over time (this assumption is sometimes loosely called "no autoregression"). The output variable Y should not depend on time unless the model is specifically meant for time series.**

Linear Regression assumes that:

- Residuals are not time-dependent

- Past values of Y do not influence current values of Y

Time series data requires special models like ARIMA or SARIMA.
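One common diagnostic for time-dependent residuals is the Durbin-Watson statistic, which the notes do not mention, so treat it as one option among several. It ranges from 0 to 4, with values near 2 suggesting little first-order correlation between neighbouring residuals. A minimal sketch with made-up residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals, listed in time order
trending = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]     # neighbours look alike
alternating = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]  # neighbours flip sign

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

The trending series lands well below 2 (positive correlation between neighbours) and the alternating series well above 2 (negative correlation).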

## **5. Zero Mean of Residuals**

**After fitting the regression line, the residuals (errors) must have a mean close to zero.**

Interpretation:

The fitted line should be an unbiased representation of the data.
If the mean of the residuals is far from zero, the model is biased: it systematically over-predicts or under-predicts.
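The check can be sketched by fitting a least-squares line and averaging the residuals. The data below is made up (noisy, unlike the perfectly linear salary table), and the "shifted" line is a deliberately wrong model included for contrast.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

# Made-up noisy data
xs = [1, 2, 3, 4, 5]
ys = [12, 18, 33, 39, 52]

m, c = fit_line(xs, ys)
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / len(residuals)

# Residuals from a deliberately shifted line (intercept raised by 5)
shifted = [y - (m * x + c + 5) for x, y in zip(xs, ys)]
mean_shifted = sum(shifted) / len(shifted)
print(mean_residual, mean_shifted)
```

For a least-squares fit that includes an intercept, the residual mean is zero up to floating-point error, whereas the shifted line over-predicts by 5 on average.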

## **Summary of Assumptions**

- Linearity

- Homoscedasticity

- No multicollinearity

- No autocorrelation (no autoregression)

- Zero mean of residuals

Meeting these assumptions supports valid predictions and stable model performance.