Linear regression models the relationship between variables by assuming a linear connection between the independent and dependent variables. It seeks the line that minimizes the sum of squared differences between predicted and actual values.
It extends to multiple linear regression, which involves several independent variables. Logistic regression is a related model suited to binary classification problems.

This form of analysis estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. There are simple linear regression calculators that use a “least squares” method to discover the best-fit line for a set of paired data. You then estimate the value of Y (dependent variable) from X (independent variable).

Simple Linear Regression : 
This algorithm explains the linear relationship between the dependent (output) variable y and the independent (predictor) variable X using a straight line y = B0 + B1 X.

![Linear_Regression_1D_Image.jpg](attachment:cdfa326f-bcc2-446a-85b8-6bc735961fb4.jpg)

The goal of the linear regression algorithm is to find the values of B0 and B1 that give the best-fit line. The best-fit line is the line with the least error, meaning the error between predicted values and actual values is as small as possible.
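As a minimal sketch of how B0 and B1 can be found, the closed-form least-squares solution for one predictor is shown below. The function name `fit_simple_linear_regression` and the sample data are assumptions made for illustration; they do not come from the original notes.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the least-squares line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Illustrative data lying exactly on y = 1 + 2x (made up for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)  # 1.0 2.0
```

Because the sample points lie exactly on a line, the fit recovers the intercept and slope that generated them.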

COST FUNCTION for Linear Regression : 
The cost function helps to work out the optimal values for B0 and B1, which give the best-fit line for the data points.

In linear regression, the Mean Squared Error (MSE) cost function is generally used. It is the average of the squared errors between the predicted values y_predicted and the actual values yi.
We calculate MSE using the simple linear equation y = mx + b, where m corresponds to B1 and b to B0:

![Linear_Regression_MSE.jpg](attachment:85a271ed-9fa4-4682-aaaa-2613d653a6ec.jpg)
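The MSE calculation for a candidate line y = mx + b can be sketched as follows. The helper names and the three data points are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
def predict(xs, m, b):
    """Predicted values for the line y = m * x + b."""
    return [m * x + b for x in xs]


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


# Made-up points: the line y = 2x fits the first two exactly and misses the third by 1.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
mse = mean_squared_error(ys, predict(xs, m=2.0, b=0.0))
print(mse)  # (0 + 0 + 1) / 3
```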

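MULTIPLE LINEAR REGRESSION : Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a linear equation. With more than one independent variable, the fitted model is a plane or hyperplane rather than a single straight line.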

How to perform a multiple linear regression? 
Multiple Linear Regression Formula : f(x) = β0x0 + β1x1 + β2x2 + . . . + βpxp.
With x0 defined as 1, the above equation becomes : 

![Multiple_Linear_Regression_Formula.JPG](attachment:2929bfd9-7484-44d4-b3d5-e22ddba92c79.JPG)

Using this notation, the Mean Squared Error loss function becomes : 

![MSE_multiple_variables.JPG](attachment:5a1f2a86-5a11-4e00-83b2-55e760d3ea8a.JPG)
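One way to minimize this loss is gradient descent. The sketch below is an illustration under stated assumptions, not the only approach: each row includes the constant feature x0 = 1, and the learning rate, step count, and data set are arbitrary choices made for the example.

```python
def predict(beta, row):
    """f(x) = beta0*x0 + beta1*x1 + ... where row[0] is the constant x0 = 1."""
    return sum(b * x for b, x in zip(beta, row))


def mse(beta, X, y):
    """Mean squared error of the linear model over all rows."""
    return sum((predict(beta, row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


def gradient_descent(X, y, learning_rate=0.05, steps=5000):
    """Minimize the MSE by repeatedly stepping against its gradient."""
    beta = [0.0] * len(X[0])
    n = len(y)
    for _ in range(steps):
        errors = [predict(beta, row) - yi for row, yi in zip(X, y)]
        # Partial derivative of the MSE with respect to each coefficient.
        grad = [2.0 / n * sum(e * row[j] for e, row in zip(errors, X))
                for j in range(len(beta))]
        beta = [b - learning_rate * g for b, g in zip(beta, grad)]
    return beta


# Made-up data generated from y = 1 + 2*x1 + 3*x2, with x0 = 1 prepended.
features = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
X = [[1.0, float(x1), float(x2)] for x1, x2 in features]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in features]

beta = gradient_descent(X, y)
print([round(b, 4) for b in beta], mse(beta, X, y))
```

On this noise-free data the coefficients converge to approximately 1, 2, and 3. For larger problems a linear algebra library would normally solve the least-squares system directly.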

Concepts of OVERFITTING and UNDERFITTING : 

MODEL BASICS : 
In order to talk about underfitting vs overfitting, we need to start with the basics: what is a model? A model is simply a system for mapping inputs to outputs. For example, if we want to predict house prices, we could make a model that takes in the square footage of a house and outputs a price. A model represents a theory about a problem: there is some connection between the square footage and the price and we make a model to learn that relationship. Models are useful because we can use them to predict the values of outputs for new data points given the inputs.

A model learns relationships between the inputs, called features, and outputs, called labels, from a training dataset. During training the model is given both the features and the labels and learns how to map the former to the latter. A trained model is evaluated on a testing set, where we only give it the features and it makes predictions. We compare the predictions with the known labels for the testing set to calculate accuracy. Models can take many shapes, from simple linear regressions to deep neural networks, but all supervised models are based on the fundamental idea of learning relationships between inputs and outputs from training data.

TRAINING AND TESTING DATA : 

To make a model, we first need data that has an underlying relationship. For this example, we will create our own simple dataset with x-values (features) and y-values (labels). 

An important part of our data generation is adding random noise to the labels. In any real-world process, whether natural or man-made, the data does not exactly fit a trend. There is always noise, or there are other variables in the relationship that we cannot measure. In the house price example, the trend between area and price is roughly linear, but the prices do not lie exactly on a line because other factors also influence house prices.

Our data similarly has a trend (which we call the true function) and random noise to make it more realistic. After creating the data, we split it into random training and testing sets. The model will attempt to learn the relationship on the training data and be evaluated on the test data.

![Test_AND_train_data.webp](attachment:f72daf02-0336-40cc-b26a-87534761b179.webp)
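The data generation and split described above might look like the following sketch. The notes do not specify the true function, so the linear trend, the noise level, the seeds, and the 75/25 split are all assumptions made for illustration.

```python
import random


def true_function(x):
    # Hypothetical underlying trend, chosen only for this example.
    return 3.0 * x + 2.0


def make_dataset(n_points, noise_std, seed):
    """Generate (x, y) pairs where y is the true function plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n_points)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_std)) for x in xs]


def train_test_split(data, test_fraction, seed):
    """Shuffle a copy of the data and split it into training and testing sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = make_dataset(n_points=100, noise_std=1.0, seed=0)
train, test = train_test_split(data, test_fraction=0.25, seed=1)
print(len(train), len(test))  # 75 25
```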

OVERFITTING : 

Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data. When data scientists use machine learning models for making predictions, they first train the model on a known data set. Then, based on this information, the model tries to predict outcomes for new data sets. An overfit model can give inaccurate predictions and cannot perform well for all types of new data.

Why does overfitting occur?
You only get accurate predictions if the machine learning model generalizes to all types of data within its domain. Overfitting occurs when the model cannot generalize and fits too closely to the training dataset instead. Overfitting happens due to several reasons, such as:
•    The training data size is too small and does not contain enough data samples to accurately represent all possible input data values.
•    The training data contains large amounts of irrelevant information, called noisy data.
•    The model trains for too long on a single sample set of data.
•    The model complexity is high, so it learns the noise within the training data.
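The last cause in the list, a model complex enough to learn the noise, can be illustrated with a deliberately extreme sketch. The "memorizer" below simply copies the label of the nearest training point, while a straight line is fitted for comparison. The true trend, noise level, and seed are assumptions made for this example.

```python
import random

rng = random.Random(42)


def make_points(n):
    # Assumed true trend y = 3x + 2 plus Gaussian noise (illustrative values).
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        points.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)))
    return points


train, test = make_points(100), make_points(100)


def memorizer(x):
    # Overly flexible model: return the label of the nearest training point,
    # so every training label (noise included) is reproduced exactly.
    return min(train, key=lambda p: abs(p[0] - x))[1]


def fit_line(points):
    """Least-squares straight line, returned as a prediction function."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    b1 = (sum((x - x_mean) * (y - y_mean) for x, y in points)
          / sum((x - x_mean) ** 2 for x, _ in points))
    b0 = y_mean - b1 * x_mean
    return lambda x: b0 + b1 * x


def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


line = fit_line(train)
memorizer_train, memorizer_test = mse(memorizer, train), mse(memorizer, test)
line_train, line_test = mse(line, train), mse(line, test)
print("memorizer train/test MSE:", memorizer_train, memorizer_test)
print("line      train/test MSE:", line_train, line_test)
```

The memorizer scores a training error of zero but a clearly larger error on unseen data, which is the signature of overfitting. The straight line typically shows similar errors on both sets.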

OVERFITTING EXAMPLES : 

![Overfitting_example.webp](attachment:2fba14ea-826c-48ba-a4ed-fc71803baa70.webp)

Example of Dog Recognition in a given set of images : 

Consider a use case where a machine learning model has to analyze photos and identify the ones that contain dogs. If the model was trained on a data set in which most photos showed dogs outside in parks, it may learn to use grass as a feature for classification and may not recognize a dog inside a room.

Example of University student's academic performance and graduation outcome : 

Consider a model that predicts a university student's academic performance and graduation outcome by analyzing several factors like family income, past academic performance, and academic qualifications of parents. However, the training data only includes candidates from a specific gender or ethnic group. In this case, overfitting causes the algorithm's prediction accuracy to drop for candidates whose gender or ethnicity is not represented in the training dataset.

WAYS TO PREVENT OVERFITTING : 

    Using K-fold cross-validation
    Using regularization techniques such as Lasso and Ridge
    Implementing ensembling techniques
    Picking a less parameterized/complex model
    Training the model with sufficient data
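The first item, K-fold cross-validation, can be sketched in plain Python as follows. The helper names are hypothetical, the folds are contiguous for simplicity (real data would usually be shuffled first), and a simple least-squares line stands in for the model being validated.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(points):
    """Least-squares fit returning (b0, b1) for y = b0 + b1 * x."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    b1 = (sum((x - x_mean) * (y - y_mean) for x, y in points)
          / sum((x - x_mean) ** 2 for x, _ in points))
    return y_mean - b1 * x_mean, b1


def line_mse(params, points):
    b0, b1 = params
    return sum((b0 + b1 * x - y) ** 2 for x, y in points) / len(points)


def cross_validate(points, k):
    """Average validation MSE over k folds, each held out once."""
    scores = []
    for held_out in k_fold_indices(len(points), k):
        held = set(held_out)
        train = [p for i, p in enumerate(points) if i not in held]
        valid = [points[i] for i in held_out]
        scores.append(line_mse(fit_line(train), valid))
    return sum(scores) / len(scores)


# Made-up noise-free data on y = 2x + 1, so every fold validates almost perfectly.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
print(k_fold_indices(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
cv_score = cross_validate(points, k=3)
print(cv_score)
```

A validation score much worse than the training score across folds would suggest the model is overfitting.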

UNDERFITTING : 

Underfitting is a scenario in which a model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.

Underfitting occurs when a model is too simple, which can result from too little training time, too few input features, or too much regularization.
An underfitted model cannot establish the dominant trend within the data, resulting in training errors and poor performance. As with overfitting, a model that cannot generalize well to new data cannot be leveraged for classification or prediction tasks. Generalization of a model to new data is ultimately what allows us to use machine learning algorithms every day to make predictions and classify data.

![Underfitting.png](attachment:3c8a4875-a05f-4ac0-9025-566c5bdc05d0.png)

EXAMPLES OF UNDERFITTING : 

![Underfitting.png](attachment:89ee05cb-2dad-4211-8261-13639fec7e6e.png)

WAYS TO TACKLE UNDERFITTING : 

    Preprocessing the data to reduce noise
    Training the model for longer
    Increasing the number of features in the dataset
    Increasing the model complexity
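A small sketch of underfitting and of one remedy from the list above, adding a feature: a straight line is fitted to data that actually follows y = x², and then the squared input is supplied as the feature instead. The data and helper names are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit returning (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - b1 * x_mean, b1


def mse(xs, ys, b0, b1):
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up noise-free data following a quadratic trend.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]
new_xs = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
new_ys = [x ** 2 for x in new_xs]

# Underfit: a straight line in x cannot follow the curve.
b0, b1 = fit_line(xs, ys)
train_mse = mse(xs, ys, b0, b1)
new_mse = mse(new_xs, new_ys, b0, b1)
print("line in x:   train MSE", train_mse, "unseen MSE", new_mse)

# Remedy: use z = x^2 as the feature, so the model can express the trend.
zs = [x ** 2 for x in xs]
new_zs = [x ** 2 for x in new_xs]
c0, c1 = fit_line(zs, ys)
print("line in x^2: train MSE", mse(zs, ys, c0, c1),
      "unseen MSE", mse(new_zs, new_ys, c0, c1))
```

The straight line in x has a high error on both the training points and the unseen points, while the model with the extra feature fits both almost exactly.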

IMPACT OF OUTLIERS ON MEAN SQUARED ERROR (MSE)

What is Mean Squared Error (MSE)?

In the fields of regression analysis and machine learning, the Mean Square Error (MSE) is a crucial metric for evaluating the performance of predictive models. 
It measures the average squared difference between the predicted and the actual target values within a dataset. The primary objective of the MSE is to assess the quality of a model's predictions by measuring how closely they align with the ground truth.

![OUTLIERS_MSE.png](attachment:343aed6e-c0df-42c8-9fc9-0105745e92d5.png)

What are OUTLIERS?

Outliers are data points that lie far away from the normal range of values in your dataset. They can be caused by various factors, such as measurement errors, data entry errors, sampling errors, or natural variability. Outliers can distort the overall distribution of your data and influence the mean, variance, and standard deviation. They can also make the regression residuals skewed or non-normal, which violates the normality-of-errors assumption that linear regression inference relies on.

How do OUTLIERS affect linear regression?

Linear regression tries to find the best-fitting line that minimizes the mean squared error (MSE) between the observed and predicted values. However, outliers can have a large influence on the slope and intercept of the line, as they pull the line towards themselves. This can result in a poor fit and a high MSE for the rest of the data. 

Outliers can also inflate the variance of the error term, which affects the confidence intervals and hypothesis tests for the regression coefficients. Furthermore, outliers can mask the true relationship between the variables and create spurious correlations.

Here's how outliers can affect MSE in 2D data:

    Increased Variance: Outliers introduce additional variability into the data set. When computing the MSE, the squared differences between the estimated and actual values are summed. Outliers, with their large deviations from the rest of the data, contribute disproportionately to this sum, increasing the overall variance and hence the MSE.

    Bias: Outliers can bias the estimated parameters of a model, leading to biased predictions. This bias can propagate into the MSE calculation, resulting in an overestimation or underestimation of the true error.

    Model Sensitivity: Some models are more sensitive to outliers than others. For instance, linear regression is sensitive to outliers since it aims to minimize the sum of squared errors. Therefore, outliers can pull the regression line towards them, leading to a higher MSE.
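The pull that a single outlier exerts on a least-squares line can be seen in a small sketch. The data below are made up: five points lie exactly on y = 2x, and one extreme point is then added.

```python
def fit_line(xs, ys):
    """Least-squares fit returning (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - b1 * x_mean, b1


def mse(xs, ys, b0, b1):
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


clean_xs = [1.0, 2.0, 3.0, 4.0, 5.0]
clean_ys = [2.0, 4.0, 6.0, 8.0, 10.0]

# Fit on the clean points only.
b0_clean, b1_clean = fit_line(clean_xs, clean_ys)

# Fit again after adding one outlier at (6, 40).
b0_out, b1_out = fit_line(clean_xs + [6.0], clean_ys + [40.0])

print("slope without outlier:", b1_clean)
print("slope with outlier:   ", b1_out)
# Error on the clean points under each fitted line.
mse_clean_line = mse(clean_xs, clean_ys, b0_clean, b1_clean)
mse_pulled_line = mse(clean_xs, clean_ys, b0_out, b1_out)
print("MSE on clean points:", mse_clean_line, "vs", mse_pulled_line)
```

A single point moves the slope from about 2 to about 6, and the line fitted with the outlier describes the remaining points much worse.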

To mitigate the impact of outliers on mean squared error (MSE) in 2D data, we can employ various strategies:

DATA CLEANING : Identify and remove outliers from the dataset using statistical methods like Z-score, IQR (Interquartile Range), or visual inspection techniques like scatter plots.
Use domain knowledge to determine whether outliers are valid data points or errors that need to be addressed.
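The IQR rule mentioned above can be sketched with the standard library. The multiplier 1.5 is the conventional default rather than something specified in these notes, and the sample values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


values = [2, 4, 6, 8, 10, 12, 14, 100]
outliers = iqr_outliers(values)
print(outliers)  # [100]
```

Flagged points should still be checked against domain knowledge before being removed, since a flagged value may be a valid observation.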


TRANSFORMATIONS : Apply data transformations such as log transformation or Box-Cox transformation to reduce the impact of outliers and stabilize the variance before computing MSE.

WEIGHTED MSE: Assign lower weights to outliers during the computation of MSE. This can be achieved by incorporating weights based on the proximity of data points to the estimated model or by using robust loss functions that downweight the influence of outliers.
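Two ways to downweight outliers are sketched below: a weighted MSE with hand-chosen weights, and the Huber loss as one example of a robust loss function. The weights, the threshold `delta`, and the data are illustrative assumptions.

```python
def mean_squared_error(y_true, y_pred):
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)


def weighted_mse(y_true, y_pred, weights):
    """Squared errors averaged with per-point weights."""
    total = sum(w * (yt - yp) ** 2 for yt, yp, w in zip(y_true, y_pred, weights))
    return total / sum(weights)


def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small residuals, linear for residuals larger than delta."""
    losses = []
    for yt, yp in zip(y_true, y_pred):
        r = abs(yt - yp)
        if r <= delta:
            losses.append(0.5 * r ** 2)
        else:
            losses.append(delta * (r - 0.5 * delta))
    return sum(losses) / len(losses)


# Made-up example: three perfect predictions and one outlying target.
y_true = [1.0, 2.0, 3.0, 100.0]
y_pred = [1.0, 2.0, 3.0, 4.0]

plain = mean_squared_error(y_true, y_pred)
weighted = weighted_mse(y_true, y_pred, weights=[1.0, 1.0, 1.0, 0.01])
huber = huber_loss(y_true, y_pred, delta=1.0)
print(plain, weighted, huber)
```

Both alternatives give a much smaller value than the plain MSE here because the single large residual no longer dominates the average.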

RESAMPLING TECHNIQUES : Utilize resampling techniques like bootstrapping or cross-validation to assess the stability of the model's performance metrics, including MSE, in the presence of outliers.
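As one possible sketch of the bootstrapping idea, the squared errors of a fixed set of predictions can be resampled with replacement to see how much the MSE varies from sample to sample. The function name, the number of resamples, the seed, and the data are assumptions made for this example.

```python
import random


def bootstrap_mse_interval(y_true, y_pred, n_resamples=1000, seed=0):
    """Approximate 95% percentile interval for the MSE via bootstrapping."""
    rng = random.Random(seed)
    squared_errors = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
    n = len(squared_errors)
    estimates = []
    for _ in range(n_resamples):
        # Draw n squared errors with replacement and average them.
        sample = [squared_errors[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high


# Made-up predictions with small errors, plus one outlying target at the end.
y_pred = [float(i) for i in range(20)]
y_true = [p + (0.5 if i % 2 == 0 else -0.5) for i, p in enumerate(y_pred)]
y_true[-1] = 60.0

low, high = bootstrap_mse_interval(y_true, y_pred)
print("approximate 95% interval for MSE:", low, high)
```

A wide interval suggests that the MSE depends heavily on whether the outlier happens to be included in a given sample.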