In [None]:
#https://www.geeksforgeeks.org/ml-linear-regression/

# Linear Regression

Linear regression is a simple and widely used statistical technique for modeling the relationship between a dependent variable (target) and one or more independent variables (features). The goal is to find the best-fitting linear relationship, the one that minimizes the sum of squared differences between the predicted and actual values of the dependent variable. The equation of a linear regression model is typically written as:

\[ y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n \]

Here:
- \( y \) is the dependent variable (target).
- \( b_0 \) is the y-intercept (constant term).
- \( b_1, b_2, \ldots, b_n \) are the coefficients of the independent variables.
- \( x_1, x_2, \ldots, x_n \) are the independent variables.

The goal during training is to find the values of \( b_0, b_1, b_2, \ldots, b_n \) that minimize the residual sum of squares (RSS), which is the sum of the squared differences between the predicted and actual values.
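
To make the idea of minimizing the RSS concrete, here is a minimal sketch using NumPy's least-squares solver. The data are hypothetical and noise-free (generated from \( y = 1 + 2 x_1 + 3 x_2 \)), and the helper `rss` is an illustrative name, not part of any library:

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that b_0 is estimated as the intercept
A = np.column_stack([np.ones(len(X)), X])

def rss(b):
    """Residual sum of squares for the coefficient vector b = [b_0, b_1, b_2]."""
    residuals = y - A @ b
    return float(residuals @ residuals)

# Least-squares solution: the coefficient vector that minimizes the RSS
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("Estimated coefficients:", b_hat)
print("RSS at the estimate:", rss(b_hat))
# Moving any coefficient away from the estimate increases the RSS
print("RSS after nudging b_1 by 0.1:", rss(b_hat + np.array([0.0, 0.1, 0.0])))
```

Because the data contain no noise, the solver should recover coefficients very close to 1, 2 and 3, with an RSS close to zero. With real, noisy data the minimum RSS is greater than zero.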

### Steps in Linear Regression:

1. **Data Preparation:**
   - Organize the dataset into feature variables (\( x \)) and the target variable (\( y \)).
   - Split the dataset into training and testing sets.

2. **Model Definition:**
   - Define the linear regression model with the appropriate number of features.

3. **Training:**
   - Use the training data to find the optimal values for the coefficients (\( b_0, b_1, b_2, \ldots, b_n \)) that minimize the RSS.

4. **Prediction:**
   - Use the trained model to make predictions on new or unseen data.

### Example Using Python and Scikit-Learn:

Here's a simple example of linear regression using the `LinearRegression` class from the scikit-learn library:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Plot the data and the linear regression line
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
```

In this example, synthetic data is generated with a linear relationship. The linear regression model is trained on the training set, and predictions are made on the test set. The performance of the model is evaluated using mean squared error, and the results are visualized using a scatter plot and the linear regression line.

- **Simple linear regression** (one feature): \( y = b_0 + b_1 x \)
- **Multiple linear regression** (several features): \( y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n \)
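
For the simple (one-feature) case, the least-squares coefficients have a well-known closed form: \( b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2 \) and \( b_0 = \bar{y} - b_1 \bar{x} \). Here is a small sketch of that calculation on hypothetical data that lies exactly on the line \( y = 1 + 2x \):

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: co-variation of x and y divided by the variation of x
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: forces the fitted line through the point (x_mean, y_mean)
b0 = y_mean - b1 * x_mean

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # prints: b0 = 1.00, b1 = 2.00
```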

### Assumptions
- Linearity
- Independence
- Normality
- Homoscedasticity


## Assumptions of Simple Linear Regression
Linear regression is a powerful tool for understanding and predicting the behavior of a variable. However, its results are accurate and dependable only when a few conditions are met.

**Linearity**: The independent and dependent variables have a linear relationship: a change in an independent variable produces a proportional change in the expected value of the dependent variable. With a single feature, this means a straight line can be drawn through the data points. If the relationship is not linear, a linear model will fit the data poorly.

**Independence**: The observations in the dataset are independent of each other: the value of the dependent variable for one observation does not depend on its value for another observation. If the observations are not independent, the model's error estimates become unreliable.

**Homoscedasticity**: The variance of the errors is constant across all levels of the independent variable(s); that is, the value of the independent variable(s) has no effect on the variance of the errors. If the variance of the residuals is not constant (heteroscedasticity), the standard errors and confidence intervals of the coefficients become unreliable.

[Figure: Homoscedasticity in Linear Regression]

**Normality**: The residuals should be normally distributed, that is, they should follow a bell-shaped curve. If they are not, confidence intervals and hypothesis tests based on the model may be unreliable, especially for small samples.
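
Some of these assumptions can be checked roughly from the residuals of a fitted model. The following is only a sketch under stated assumptions: the data are synthetic (a true line \( y = 4 + 3x \) plus standard normal noise), the fit uses `np.polyfit`, and the split-half variance comparison and moment-based normality check are informal stand-ins for proper statistical tests:

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 4 + 3x + standard normal noise
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.standard_normal(100)

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Sanity check of the fit: by construction, least-squares residuals have
# (numerically) zero mean and zero correlation with x
print("Mean residual:", residuals.mean())
print("corr(x, residuals):", np.corrcoef(x, residuals)[0, 1])

# Homoscedasticity (informal): residual variance should be similar
# for small and large values of x
order = np.argsort(x)
low, high = residuals[order[:50]], residuals[order[50:]]
print("Residual variance (low x, high x):", low.var(), high.var())

# Normality (informal): skewness and excess kurtosis are both 0
# for a normal distribution, so sample values should be near 0
z = (residuals - residuals.mean()) / residuals.std()
print("Skewness:", np.mean(z ** 3))
print("Excess kurtosis:", np.mean(z ** 4) - 3)
```

In practice, a plot of residuals against fitted values and a Q-Q plot are common visual complements to numeric checks like these.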

In [None]:
# https://www.geeksforgeeks.org/how-to-calculate-mean-absolute-error-in-python/

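# Evaluating a Linear Regression Model

Here \( y_i \) is the actual value, \( \hat{y}_i \) is the predicted value, \( \bar{y} \) is the mean of the actual values, and \( n \) is the number of observations.

1. **Mean Absolute Error (MAE)**: \( \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \)
2. **Mean Squared Error (MSE)**: \( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)
3. **Root Mean Squared Error (RMSE)**: \( \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \)
4. **Coefficient of Determination (R-squared)**: \( R^2 = 1 - \frac{RSS}{TSS} \)

where:
- **Residual Sum of Squares (RSS)**: \( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)
- **Total Sum of Squares (TSS)**: \( \sum_{i=1}^{n} (y_i - \bar{y})^2 \)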
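
Here is a small worked example of these metrics computed directly with NumPy. The actual and predicted values are made up for illustration. Note that MAE uses the absolute value of each error, and that RSS and TSS are plain sums (not divided by the number of observations):

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                 # Mean Absolute Error
mse = np.mean(errors ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                           # Root Mean Squared Error

rss = np.sum(errors ** 2)                     # Residual Sum of Squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares
r2 = 1 - rss / tss                            # Coefficient of determination

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
# prints: MAE=0.500  MSE=0.375  RMSE=0.612  R^2=0.925
```

scikit-learn provides `mean_absolute_error`, `mean_squared_error` and `r2_score` in `sklearn.metrics` for the same quantities.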