#### Regression
##### Regression is a fundamental supervised learning technique in ML used for predicting a continuous target variable based on one or more input features.
##### Unlike classification, which predicts categorical classes, regression predicts continuous values like price, temperature, or age.
##### It finds a function that best describes the relationship between the independent variables (features) and the dependent variable (target).
##### The goal of regression is to find a "best-fit" line or curve that can be used to make predictions on new, unseen data.

#### Supervised Machine Learning
##### Uses labeled data (input features + corresponding outputs).
##### Predicts outcomes or classifies data based on known labels.
##### Less complex, as the model learns from labeled data with clear guidance.
##### Model can be tested and evaluated using labeled test data.

##### Example
##### The machine is given a dataset with input features (like age, salary, or temperature) and corresponding labels (like “yes/no,” “high/low,” or “rainy/sunny”).
##### The machine then learns from the dataset by finding patterns in it. For example, it might learn that if the temperature is high, it’s likely to be sunny.

#### Unsupervised Machine Learning
##### Uses unlabeled data (only input features, no outputs).
##### Discovers hidden patterns, structures, or groupings in data.
##### More complex, as the model must find patterns without any guidance.
##### Cannot be tested in the traditional sense, as there are no labels to compare the results against.

##### Example
##### Imagine visiting a new city without a map or guide. With no labels to rely on, you group what you see by similarity:
##### Buildings with tall spires might be grouped as churches.
##### Open spaces with greenery might be categorized as parks.
##### Streets with lots of shops could be grouped as markets.


#### Underfitting:
##### It occurs when a model is too simple to capture the underlying patterns in the data. It hasn't learned enough from the training data.
##### Causes: A model that is too simple, insufficient training, or poor feature selection.
##### Analogy: Like a student who didn't study enough for the exam. They only have a very basic understanding of the subject.


#### Overfitting:
##### It occurs when a model learns the training data too well, including the noise and random fluctuations, rather than the true patterns.
##### Causes: Noisy training data or a model that is too complex.
##### Analogy: Like a student who memorized every single practice question and answer but didn't truly understand the concepts. The model performs almost perfectly on the training data but poorly on unseen data.


#### Common Regression Algorithms
##### Linear Regression
##### Polynomial Regression
##### Logistic Regression (used for classification, despite the name)
##### Support Vector Regression
##### Decision Tree Regression


#### What Problems Is Regression Used For?
##### Forecasting: Predicting future sales, stock prices, or demand for a product.
##### Estimation: Estimating the price of a house.
##### Healthcare: Predicting patient recovery time, disease progression, or the cost of treatment.


#### Basic Concepts to Know
##### y = mx + b
##### Here m is the coefficient (slope) and b is the intercept.
##### y is the target value and x is the feature (the independent variable). The slope m controls how steep the line is; the intercept b is the value of y when x = 0, i.e. where the line crosses the y-axis.
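
##### The line equation can be tried out directly. This is a minimal sketch; the function name `predict` and the values of `m` and `b` are illustrative assumptions, not learned from any data.

```python
def predict(x, m, b):
    """Return the point on the line y = m*x + b for a single input x."""
    return m * x + b

# Hypothetical parameters: slope 2.0 and intercept 1.0.
m, b = 2.0, 1.0
inputs = [0.0, 1.0, 2.5]
predictions = [predict(x, m, b) for x in inputs]

# At x = 0 the prediction equals the intercept b.
for x, y in zip(inputs, predictions):
    print(f"x={x} -> y={y}")
```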


#### Loss Functions
##### Loss vs. cost: A loss function captures the difference between the actual and predicted values for a single record, whereas a cost function aggregates that difference over the entire training dataset. In practice the two terms are often used interchangeably.
##### Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.
##### --> Advantage: Robust to outliers.
##### --> Disadvantage: Not differentiable at zero error, because the graph has a sharp "V" shape there.
##### Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.
##### --> Advantage: Works well as a loss function because it is differentiable everywhere.
##### --> Disadvantage: Squaring makes the error of outliers very large, so they dominate the loss.
##### Root Mean Squared Error (RMSE): The square root of the MSE, providing an error metric in the same units as the target variable.
##### --> Advantage: The error is in the same unit as the target (for example, if y is in LPA, the RMSE is also in LPA).
##### --> Disadvantage: Sensitive to outliers, since it is built on the MSE.
##### Error for one record: e_i = y_i - ŷ_i = y_i - (mx_i + b). The metrics above aggregate |e_i| or e_i^2 over all records.
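
##### The three metrics can be compared on a toy example. This is a sketch in plain Python; the function names and the data are assumptions made for illustration. The last prediction is deliberately far off to show how one outlier affects each metric.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE, in the target's unit."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up data: three perfect predictions and one that is off by 10.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 19.0]

print("MAE :", mae(y_true, y_pred))
print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

##### The single error of 10 contributes 10 to the MAE sum but 100 to the MSE sum, which is why MSE and RMSE react more strongly to outliers.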



#### Goodness-of-Fit Scores
##### R2 Score: It tells you how much of the variation in the target your model explains.
##### Bad --> (0 <-- R2 --> 1) <-- Good. A value near 1 means the model explains most of the variation; a value near 0 means it does no better than always predicting the mean.
##### Misleading impression: On the training data, R2 never decreases when features are added, so a model with many irrelevant features can appear to fit better (higher R2) than a simpler model with only relevant features.
##### Adjusted R2 Score: It addresses this drawback of R2 by penalizing the number of independent variables in the model.
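
##### Both scores can be computed by hand. The sketch below assumes the standard definitions R2 = 1 - SS_res / SS_tot and Adjusted R2 = 1 - (1 - R2)(n - 1) / (n - p - 1), where n is the number of samples and p the number of features; the function names and data are illustrative.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R2 for the number of features used by the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)
adj_one_feature = adjusted_r_squared(r2, n_samples=5, n_features=1)
adj_three_features = adjusted_r_squared(r2, n_samples=5, n_features=3)

print(r2, adj_one_feature, adj_three_features)
```

##### For the same R2, the adjusted score drops as the feature count grows, so extra features only pay off if they raise R2 enough.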


#### Optimizers
##### An optimizer finds the best-fit line by minimizing the sum of squared errors.
##### To find m and b in y = mx + b, we have two approaches.

##### 1: Ordinary Least Squares (OLS): A closed-form solution; practical when the data is small or has few features.
##### 2: Gradient Descent: An iterative method; preferred for large data or many features.
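
##### For a single feature, OLS has a simple closed form: m = ∑(x_i - x̄)(y_i - ȳ) / ∑(x_i - x̄)^2 and b = ȳ - m·x̄. The sketch below applies it to made-up data that lies exactly on y = 2x + 1; the function name `ols_fit` is an assumption for illustration.

```python
def ols_fit(xs, ys):
    """Closed-form least-squares fit of y = m*x + b for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b

# Made-up data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = ols_fit(xs, ys)
print("slope:", m, "intercept:", b)
```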

##### Gradient descent: An optimization algorithm used in ML to find the best parameters for a model by iteratively adjusting them to minimize a cost function.
##### Once we can measure the error, we apply gradient descent to reach the minimum of the loss (possibly only a local minimum) and use the parameters at that point for prediction.

#### Steps:
##### Start with random values of m and b. The line they give is not yet the best-fit line.
##### Compute the gradients (slopes of the loss L = (1/n) ∑ (y_i - (mx_i + b))^2 with respect to m and b):
##### --> dL/dm = -(2/n) ∑ x_i(y_i - (mx_i + b))
##### --> dL/db = -(2/n) ∑ (y_i - (mx_i + b))
##### If the slope is positive, decrease the parameter; if the slope is negative, increase it. Subtracting the gradient does both automatically.
##### Update m and b: new m = old m − learning_rate × dL/dm and new b = old b − learning_rate × dL/db
##### Repeat until a stopping condition is met.

#### When to Stop:
##### 1- The difference between b_old and b_new (and likewise for m) falls below a small threshold, e.g. 0.001.
##### 2- A fixed number of iterations is reached, e.g. 100 to 1000 epochs.
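
##### The steps and both stopping rules can be sketched in plain Python. The function name, the learning rate, the tolerance, and the epoch cap are assumptions chosen for this toy data; a fixed start at m = b = 0 replaces the random start so the run is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, max_epochs=5000, tol=1e-6):
    """Fit y = m*x + b by batch gradient descent on the mean squared error."""
    n = len(xs)
    m, b = 0.0, 0.0  # fixed start for reproducibility (the notes assume a random start)
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        epochs_run = epoch
        errors = [y - (m * x + b) for x, y in zip(xs, ys)]
        # Gradients of the MSE with respect to m and b.
        dL_dm = -(2.0 / n) * sum(x * e for x, e in zip(xs, errors))
        dL_db = -(2.0 / n) * sum(errors)
        new_m = m - learning_rate * dL_dm
        new_b = b - learning_rate * dL_db
        # Stopping rule 1: the parameters barely changed in this step.
        converged = abs(new_m - m) < tol and abs(new_b - b) < tol
        m, b = new_m, new_b
        if converged:
            break
    # Stopping rule 2 is the max_epochs cap on the loop itself.
    return m, b, epochs_run

# Made-up data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m_gd, b_gd, epochs_used = gradient_descent(xs, ys)
print(m_gd, b_gd, epochs_used)
```

##### A looser tolerance such as 0.001 stops earlier but leaves m and b further from the best-fit values, so the threshold trades speed against accuracy.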


#### Types of Gradient Descent

##### 1- Batch GD
##### The loss function is calculated using the entire training dataset for each parameter update. This means that the model parameters are updated only once per epoch.
##### Advantage:
##### Using the entire dataset provides a true gradient direction, leading to precise parameter updates.
##### For convex loss functions, BGD is guaranteed to converge to the global minimum with a suitable learning rate.
##### Convex loss function: One that has a single, global minimum, making it easier for GD to converge to the optimal solution.
##### Disadvantage:
##### Bad for big data, because the whole dataset may not fit in memory at once.
##### Processing the entire dataset for each update can be very slow and computationally intensive.

##### 2- Stochastic GD
##### It updates the model parameters one row (one training example) at a time.
##### Advantage:
##### Each update is fast, and rows are picked at random rather than sequentially.
##### Because it processes one row at a time, it is memory-efficient and suitable for large datasets.
##### The noisy updates due to processing one row at a time can help SGD jump out of shallow local minima and potentially find a better minimum.
##### Disadvantage:
##### It does not settle on a steady solution; it ends up near the minimum rather than exactly at it.
##### Even near the minimum it keeps fluctuating. To overcome this we use a "learning schedule", in which the learning rate is gradually reduced as the epochs progress.

##### 3- Mini-Batch GD
##### It divides the training dataset into small batches and updates the parameters once per batch.
##### Advantage:
##### It introduces some randomness that can help in escaping local minima.
##### We set a batch size; the gradient is computed on that batch and then the weights are updated.
##### It reduces the noise compared to SGD, leading to a smoother convergence path.
##### Disadvantage:
##### The size of the mini-batch is an additional hyperparameter that needs to be tuned.
##### While less noisy than SGD, the convergence path is not as smooth as BGD.
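
##### The first three variants differ only in how many rows feed each update, so one sketch can cover them: batch_size equal to the dataset size gives Batch GD, batch_size = 1 gives Stochastic GD, and anything in between gives Mini-Batch GD. The function name, learning rate, epoch count, and seed are assumptions for this toy data.

```python
import random

def minibatch_gd(xs, ys, batch_size, learning_rate=0.02, epochs=2000, seed=0):
    """Fit y = m*x + b, updating the parameters once per batch of rows."""
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit rows in random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            k = len(batch)
            errors = [ys[i] - (m * xs[i] + b) for i in batch]
            # MSE gradients computed on this batch only.
            dL_dm = -(2.0 / k) * sum(xs[i] * e for i, e in zip(batch, errors))
            dL_db = -(2.0 / k) * sum(errors)
            m -= learning_rate * dL_dm
            b -= learning_rate * dL_db
    return m, b

# Made-up data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# batch_size 1 = Stochastic GD, 2 = Mini-Batch GD, 4 = Batch GD.
results = {size: minibatch_gd(xs, ys, batch_size=size) for size in (1, 2, 4)}
for size, (m, b) in results.items():
    print(f"batch_size={size}: m={m:.4f}, b={b:.4f}")
```

##### Because this toy data has no noise, all three variants reach the same line; on real, noisy data the SGD and mini-batch paths fluctuate around the minimum as described above.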

##### 4- Adam (Adaptive Moment Estimation)
##### It combines the ideas of Momentum and RMSprop.
##### It remembers the general downhill direction through a moving average of past gradients. If the slope keeps pointing in one direction, the effective step grows so it can take bigger steps; if the slope keeps changing direction, the effective step shrinks.
##### Advantage:
##### Adjusts the learning rate for each parameter individually, which can lead to faster convergence.
##### Requires storing only two moving averages per parameter, making it memory-efficient.
##### Each parameter has its own adaptive learning rate based on past gradients and their magnitudes.
##### Disadvantage:
##### While generally robust, performance can still be influenced by the initial learning rate and the β values.
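
##### The Adam update can be sketched for the same line-fitting problem. The code follows the commonly published update rule (two bias-corrected moving averages per parameter); the function name, the hyperparameter values, and the centered toy data are assumptions chosen to keep the example small and well conditioned.

```python
import math

def adam_fit(xs, ys, learning_rate=0.05, beta1=0.9, beta2=0.999,
             eps=1e-8, steps=2000):
    """Fit y = m*x + b with Adam on the full-batch MSE gradient."""
    n = len(xs)
    params = [0.0, 0.0]         # [m, b]
    first_moment = [0.0, 0.0]   # moving average of gradients (momentum part)
    second_moment = [0.0, 0.0]  # moving average of squared gradients (RMSprop part)
    for t in range(1, steps + 1):
        m, b = params
        errors = [y - (m * x + b) for x, y in zip(xs, ys)]
        grads = [
            -(2.0 / n) * sum(x * e for x, e in zip(xs, errors)),
            -(2.0 / n) * sum(errors),
        ]
        for j, g in enumerate(grads):
            first_moment[j] = beta1 * first_moment[j] + (1 - beta1) * g
            second_moment[j] = beta2 * second_moment[j] + (1 - beta2) * g * g
            # Bias correction compensates for the zero initialization.
            m_hat = first_moment[j] / (1 - beta1 ** t)
            v_hat = second_moment[j] / (1 - beta2 ** t)
            # Each parameter gets its own effective step size.
            params[j] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]

# Made-up, centered inputs with targets from y = 2x + 1.
xs = [-1.5, -0.5, 0.5, 1.5]
ys = [-2.0, 0.0, 2.0, 4.0]

m_adam, b_adam = adam_fit(xs, ys)
print(m_adam, b_adam)
```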
 

#### 2- Logistic Regression.
##### It is primarily used for binary classification problems, predicting which of two classes (0 or 1) an input belongs to.
##### It works best when the classes are (approximately) linearly separable.
##### The sigmoid function maps the output of a linear combination of inputs to a probability between 0 and 1: σ(z) = 1 / (1 + e^(-z))
##### Less sensitive to outliers compared to linear regression due to the sigmoid function compressing the output, but extreme outliers can still have an influence.
##### For multiclass problems we import LogisticRegression from scikit-learn and create an instance of it. This was traditionally requested with LogisticRegression(multi_class='multinomial'); recent scikit-learn releases deprecate the parameter and apply the multinomial formulation automatically when there are more than two classes. For more than two classes the "softmax" function is used.
##### Softmax acts as the activation function: it turns the scores into one probability per class, and the input is assigned to the class with the highest probability.
##### Models such as Support Vector Machines (SVMs), Random Forests, and Decision Trees can also solve classification problems.
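
##### Sigmoid and softmax are short enough to write out. This is a sketch in plain Python; the function names, weights, and inputs are illustrative assumptions, not a trained model.

```python
import math

def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_binary(features, weights, bias, threshold=0.5):
    """Logistic regression prediction: linear combination, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical weights and one input row (z = 1.5).
probability, label = predict_binary([2.0, 0.5], weights=[1.0, -1.0], bias=0.0)
class_probs = softmax([1.0, 2.0, 3.0])
predicted_class = class_probs.index(max(class_probs))

print(probability, label)
print(class_probs, predicted_class)
```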


#### Loss Functions:
##### 1- Step Function: A function where the output value stays the same across an interval, and then "jumps" to a different constant value at the end of that interval.
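##### 2- Maximum Likelihood Estimation: It finds the model parameters that maximize the likelihood, i.e. the probability of the observed labels under the model.
##### 3- Cross Entropy: It is minimized rather than maximized, because taking the negative log turns the likelihood product into a sum to minimize.
##### The summation of the negative log of the per-record likelihoods is the cross entropy.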
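
##### The link between likelihood and cross entropy can be checked numerically. The sketch below assumes binary labels and predicted probabilities of class 1; it reports the average (rather than the plain sum) of the negative logs, which only rescales the value. Function names and data are illustrative.

```python
import math

def likelihood(y_true, p_pred):
    """Probability of the observed labels: product of p for y=1 and (1-p) for y=0."""
    result = 1.0
    for y, p in zip(y_true, p_pred):
        result *= p if y == 1 else (1.0 - p)
    return result

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood over all records."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels and predicted probabilities.
y_true = [1, 0, 1]
p_pred = [0.9, 0.2, 0.7]

bce = binary_cross_entropy(y_true, p_pred)
neg_log_likelihood = -math.log(likelihood(y_true, p_pred)) / len(y_true)

print(bce, neg_log_likelihood)
```

##### Maximizing the likelihood and minimizing the cross entropy therefore pick the same parameters; the log form is preferred because sums are easier to differentiate than products.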


#### Evaluation Metrics:
##### 1- Confusion Matrix: It counts the correct and incorrect predictions.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

##### 2- Precision: Out of all the fruits your model said were red apples, how many were actually red apples? P = TP / (TP + FP)
##### Focus: Minimizing False Positives (incorrectly labeling other fruits as red apples).

##### 3- Recall: Out of all the actual red apples in the basket, how many did your model successfully find? R = TP / (TP + FN)
##### Focus: Minimizing False Negatives (missing actual red apples).   

##### 4- F1-Score: Provides a balance between Precision and Recall. F1 = 2PR / (P + R)
##### Concerned with: Achieving a good performance on both Precision and Recall simultaneously.
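
##### All four metrics follow from the confusion-matrix counts. This is a sketch with made-up labels where 1 means "red apple"; the function names are assumptions, and zero denominators are mapped to 0.0 as one possible convention.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1; returns 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up basket: 4 red apples (1) and 6 other fruits (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("precision, recall, F1:", precision, recall, f1)
```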


#### 3- Polynomial Regression:
##### It allows you to fit a curved line to your data. Use this when data is non-linear.
##### The higher the "degree" of the polynomial (the highest power of x you use), the more flexible the curve can be.
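
##### Polynomial regression is still linear in its coefficients: x is expanded into powers (x, x^2, ..., x^degree) and a linear model is applied to those. The sketch below shows only the expansion and the prediction step; the function names and the coefficient values are assumptions, not fitted results.

```python
def polynomial_features(x, degree):
    """Expand a single value x into [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]

def predict_polynomial(x, coefficients, intercept):
    """Evaluate intercept + c1*x + c2*x^2 + ... for the given coefficients."""
    features = polynomial_features(x, len(coefficients))
    return intercept + sum(c * f for c, f in zip(coefficients, features))

# Hypothetical degree-2 model: y = 1 + 0*x + 1*x^2.
coefficients = [0.0, 1.0]
intercept = 1.0

expanded = polynomial_features(2.0, 3)
curve = [predict_polynomial(x, coefficients, intercept) for x in [-2.0, 0.0, 3.0]]

print(expanded)
print(curve)
```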

#### Regularization
##### Try to fit the data well, but also keep the curve smooth by not letting the coefficients get too large. This is done by adding a penalty on the coefficients to the loss function.
##### --> λ (lambda) is the regularization parameter.
##### --> If λ=0, there is no penalty, and the model behaves like ordinary (unregularized) regression.
##### --> If λ is large, the penalty is strong, forcing the coefficients to be very small.

#### 1- Ridge Regularization (L2):
##### This adds a penalty proportional to the square of the magnitude of the coefficients to the loss function.
##### For three features the penalty looks like λ(β1^2 + β2^2 + β3^2), where β1, β2, β3 are the coefficients.
##### Use when all features are important.
##### As lambda increases, the coefficients shrink toward zero but never become exactly zero.
##### Coefficients with larger values are shrunk more strongly than small ones.
##### Lambda decrease --> bias decrease --> variance increase --> risk of overfitting.
##### Lambda increase --> bias increase --> variance decrease --> risk of underfitting.
##### alpha (scikit-learn's name for λ) increase --> coefficients decrease, while the error on the training data typically increases.

#### 2- Lasso Regularization (L1): λ×(∣β1∣ + ∣β2∣ + ∣β3∣)
##### This adds a penalty proportional to the absolute value of the coefficients to the loss function.
##### Use when only some of the features are important.
##### As lambda increases, some coefficients go exactly to zero.
##### It tells us which features are important by driving the coefficients of unimportant features to zero.
##### It creates sparsity: as alpha increases, more coefficients become exactly zero.
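
##### The different shrinkage behavior shows up even with a single feature. The sketch below assumes a model y = βx with no intercept and the losses ∑(y_i - βx_i)^2 + λβ^2 (ridge) and ∑(y_i - βx_i)^2 + λ∣β∣ (lasso); under those assumptions both have closed-form solutions. Function names and data are illustrative.

```python
import math

def ridge_slope(xs, ys, lam):
    """Ridge solution for one feature: beta = sum(xy) / (sum(x^2) + lambda)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """Lasso solution for one feature via soft-thresholding of sum(xy)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # hits exactly zero for large lambda
    return math.copysign(shrunk, sxy) / sxx

# Made-up data on the line y = 2x, so the unregularized slope is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 14.0, 56.0, 1000.0):
    print(f"lambda={lam}: ridge={ridge_slope(xs, ys, lam):.4f}, "
          f"lasso={lasso_slope(xs, ys, lam):.4f}")
```

##### With growing λ the ridge slope keeps getting smaller but stays positive, while the lasso slope reaches exactly zero, which is the sparsity effect described above.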



#### Drawbacks of Regression
##### Outliers can mislead a regression model, making it learn an inaccurate relationship between variables. This can result in predictions that are way off.
##### When independent variables are strongly related to each other (multicollinearity), it becomes hard to determine the true effect of each variable on the outcome.
##### Linear regression is meant for continuous target variables and is not suitable for classification problems (predicting categories). Logistic regression, despite its name, is a classification algorithm.



