
### Hypothesis Function

In Linear Regression, the hypothesis function (also known as the model or prediction function) is a linear equation that represents the relationship between the input variables (often represented by "x") and the output variable (often represented by "y"). The equation is typically represented as:

y = mx + b

Where "m" represents the slope of the line, "x" represents the input variable(s), "b" represents the y-intercept, and "y" represents the output variable. The goal of linear regression is to find the best values for "m" and "b" that minimize the error between the predicted values and the actual values of the output variable.

The notation can vary: instead of b we can write b0, and instead of m we can write b1.

In multiple linear regression, the hypothesis function has several input variables, and the equation is written as:

y = b0 + b1*x1 + b2*x2 + … + bn*xn

Where "b0" represents the y-intercept, and "b1", "b2", ..., "bn" represent the coefficients of the input variables x1, x2, ..., xn respectively.

In [None]:
import numpy as np

def predict(parameters, x):
    x = np.insert(x, 0, 1, axis=1)
    y_hat = np.dot(x, parameters)
    return y_hat

# Example input data
x = np.array([[1, 2], [3, 4], [5, 6]])

# Example parameters
parameters = np.array([1, 2, 3])

# Get predictions
predictions = predict(parameters, x)

print(predictions)

[ 9 19 29]


**Explanation of the above output:**

Here the input data `x` is a 2-dimensional NumPy array containing 3 samples, where each sample has 2 features.
`parameters` is a 1-dimensional NumPy array containing 3 values: the intercept b0 followed by the coefficients b1 and b2 of the hypothesis function.

The output is a 1-dimensional NumPy array with 3 predictions, one for each sample in the input.

You can validate the output by hand: prepend a 1 to each sample and take its dot product with the parameters. For the first sample, 1*1 + 2*1 + 3*2 = 9.

Note that in this example, the input data does not include the column of 1s for the y-intercept term, so the `predict` function inserts that column.
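
The manual check mentioned above can be sketched as follows. It reuses the same toy inputs and evaluates b0 + b1*x1 + b2*x2 term by term instead of calling `predict`; the name `manual` is introduced here only for illustration.

```python
import numpy as np

# Same toy inputs as in the example above
x = np.array([[1, 2], [3, 4], [5, 6]])
parameters = np.array([1, 2, 3])  # b0, b1, b2

# Evaluate b0 + b1*x1 + b2*x2 explicitly for each sample
b0, b1, b2 = parameters
manual = np.array([b0 + b1 * x1 + b2 * x2 for x1, x2 in x])

print(manual)  # [ 9 19 29]
```

The result matches the output of `predict`, which confirms that inserting the column of 1s and taking the dot product is equivalent to evaluating the hypothesis function directly.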

### Case-Study

The problem is to predict the sales of a product based on the amount spent on TV, radio, and newspaper advertising. We have historical data that includes the advertising budget for each medium and the corresponding sales figures. We will use linear regression to model the relationship between the budgets and sales, that is, to find the set of parameters of the linear model that best fits the data. The input to the model is the budget for each of the three media, and the output is the predicted sales. With the fitted model, we can then predict sales for a new advertising budget.

In [None]:
import pandas as pd

advertising_df = pd.read_csv("/content/drive/MyDrive/CS001-B03 Notebooks/data/Advertising.csv")
print(advertising_df.head())
print(advertising_df.shape)

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3   12.0
3  151.5   41.3       58.5   16.5
4  180.8   10.8       58.4   17.9
(200, 4)


In [None]:
x = advertising_df[['TV', 'Radio', 'Newspaper']]
y = advertising_df['Sales']

parameters = np.random.rand(x.shape[1]+1)
predictions = predict(parameters, x.values)
print(parameters)
print(predictions[0:5])

[0.74064913 0.53778626 0.77732263 0.69871949]
[202.21945234  86.73316616  94.09094215 155.19378273 147.17270804]


Here, I'm using the pandas library to load the advertising data from a CSV file.

Then I select the input variables 'TV', 'Radio', and 'Newspaper' and the output variable 'Sales' from the loaded dataframe, and convert the input variables to a NumPy array through the `.values` attribute. The rest of the code is the same as in the previous example: it generates random parameters and passes them to the predict function.


Please note that this is just an example. Predictions made with random parameters will not be accurate, because the model has not been trained. To get accurate predictions, you need to fit the parameters with an optimization technique such as gradient descent.

### Cost Function

In linear regression, the goal is to find the best set of parameters (coefficients) of the linear model that can be used to predict the output variable based on the input variables. The cost function is a measure of how well the model is able to make predictions for the given data.

The most commonly used cost function in linear regression is the mean squared error (MSE) cost function. The MSE cost function measures the average squared difference between the predicted output and the actual output for all the samples in the data.

The MSE cost function is defined as:

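J = 1/(2m) * ∑(y^ - y)^2

Where "J" is the cost, "m" is the number of samples, "y^" is the predicted output, "y" is the actual output, and the summation runs over all the samples.

Dividing by m averages the squared errors over the samples. The extra factor of 1/2 is a convenience: it cancels the 2 that appears when the squared term is differentiated, which makes the gradient simpler. Strictly speaking, this cost is half of the mean squared error, but both are minimized by the same parameters.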
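
To make the role of the 1/2 factor concrete, here is a short derivation sketch. The index notation (i for samples, j for parameters, and x_{i0} = 1 for the intercept column) is an assumption introduced for this block and is not used elsewhere in the notebook.

```latex
J(b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2,
\qquad
\hat{y}_i = b_0 + b_1 x_{i1} + \dots + b_n x_{in}

\frac{\partial J}{\partial b_j}
  = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( \hat{y}_i - y_i \right) x_{ij}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right) x_{ij},
\qquad x_{i0} = 1
```

The 2 produced by differentiating the square cancels against the 1/2, leaving an average of error times input for each parameter.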

The goal of linear regression is to find the set of parameters (coefficients) that minimize the MSE cost function. The parameters that minimize the cost function are considered to be the best fit for the data and can be used to make predictions on new data.

Because the cost is proportional to the average squared difference between the predicted and actual values, a lower cost means the model fits the data it was trained on more closely, which is what we want to achieve.

In [None]:
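def cost_function(parameters, x, y):
    # x must already contain the column of ones for the intercept
    y_hat = np.dot(x, parameters)
    # (1 / 2m) * sum of squared errors
    cost = np.sum((y_hat - y) ** 2) / (2 * len(y))
    return cost

# predict() puts the intercept first, so prepend the ones column here as well
x_with_ones = np.insert(x.values, 0, 1, axis=1)

# Get cost/error (the value depends on the random parameters drawn earlier)
cost = cost_function(parameters, x_with_ones, y.values)

print(cost)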


In [None]:
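x = x.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
x['ones'] = np.ones(len(y))  # intercept column, appended as the last column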


### Gradient Descent

Gradient descent is an optimization algorithm that is commonly used to minimize a cost function in machine learning. It is used to find the set of parameters (such as weights and biases) that minimize the cost function for a given dataset.

The basic idea behind gradient descent is to start with an initial set of parameters, and then iteratively update the parameters in the direction of the negative gradient of the cost function. The negative gradient points in the direction of the steepest decrease in the cost function, and so the parameters are moved in that direction.

In simple terms, gradient descent works by repeatedly moving in the opposite direction of the slope of the cost function. The algorithm starts at some initial point on the cost surface (here, randomly chosen parameters), and at each step it moves a little in the downhill direction, until it approaches the bottom. The size of each step is controlled by the learning rate (alpha in the code). The bottom is where the cost function is at its minimum and the model's predictions fit the training data best.

The process of updating the parameters is repeated until the cost function converges to a minimum value or a set number of iterations is reached. Once the best set of parameters is found, the model can be used to make predictions on new data by passing it through the hypothesis function with those parameters.

In [None]:
def gradient_descent(x, y, alpha, num_iters):
    # initialize parameters
    parameters = np.random.rand(x.shape[1])
    # run gradient descent
    for i in range(num_iters):
        y_hat = np.dot(x, parameters)
        error = y_hat - y
        gradient = np.dot(x.T, error) / len(y)
        parameters = parameters - alpha * gradient
    return parameters

parameters = gradient_descent(x, y, 0.0001, 100)

In the gradient descent function above, the line `gradient = np.dot(x.T, error) / len(y)` calculates the gradient of the cost function with respect to the parameters.

Here's how the line works:

- `x.T`: the transpose of the input data, taken so that the shapes of x and error are compatible for the dot product.
- `error`: a 1-dimensional array containing the difference between the predicted output and the actual output for each sample.
- `np.dot(x.T, error)`: the dot product of the transposed input variables and the error, which gives, for each parameter, the sum over all samples of the error multiplied by the corresponding input variable.
- `/ len(y)`: divides each sum by the number of samples (m) to get the average gradient.

The gradient is an (n+1)-dimensional vector containing the partial derivative of the cost function with respect to each parameter. It points in the direction of steepest increase of the cost function, so the parameters are moved in the opposite direction (along the negative gradient) to reduce the cost.
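
One way to convince yourself that this expression really is the gradient of the cost is to compare it with a finite-difference approximation. The sketch below does that on small synthetic data; the helper names (`cost`, `analytic_gradient`, `numeric_gradient`) and the data are assumptions made for this illustration, not part of the notebook's pipeline.

```python
import numpy as np

def cost(parameters, x, y):
    # (1 / 2m) * sum of squared errors
    error = np.dot(x, parameters) - y
    return np.sum(error ** 2) / (2 * len(y))

def analytic_gradient(parameters, x, y):
    # Same expression as in gradient_descent
    error = np.dot(x, parameters) - y
    return np.dot(x.T, error) / len(y)

def numeric_gradient(parameters, x, y, eps=1e-6):
    # Central differences: perturb one parameter at a time
    grad = np.zeros(len(parameters))
    for j in range(len(parameters)):
        step = np.zeros(len(parameters))
        step[j] = eps
        grad[j] = (cost(parameters + step, x, y)
                   - cost(parameters - step, x, y)) / (2 * eps)
    return grad

# Synthetic data: 2 features plus a column of ones for the intercept
rng = np.random.default_rng(0)
x_demo = np.column_stack([rng.normal(size=(50, 2)), np.ones(50)])
y_demo = rng.normal(size=50)
params_demo = rng.normal(size=3)

grad_a = analytic_gradient(params_demo, x_demo, y_demo)
grad_n = numeric_gradient(params_demo, x_demo, y_demo)
print(np.allclose(grad_a, grad_n, atol=1e-6))
```

If the two gradients agree, the vectorized formula is consistent with the cost function it is meant to minimize.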

In [None]:
y_hat = np.dot(x, parameters)

In [None]:
cost_advertising = cost_function(parameters, x.values, y.values)

In [None]:
cost_advertising

4.750290364655798e+62
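
A cost on the order of 1e62 very likely means the updates diverged rather than converged: with unscaled budgets in the hundreds, a learning rate of 0.0001 is probably still too large, so each step overshoots the minimum. A common remedy is to standardize the features before running gradient descent. The sketch below illustrates this on synthetic data of a similar scale; the data, the coefficients, and the name `gradient_descent_demo` are assumptions made for illustration, not the advertising dataset.

```python
import numpy as np

def gradient_descent_demo(x, y, alpha, num_iters):
    # Same update rule as gradient_descent, but starting from zeros
    parameters = np.zeros(x.shape[1])
    for _ in range(num_iters):
        error = np.dot(x, parameters) - y
        parameters = parameters - alpha * np.dot(x.T, error) / len(y)
    return parameters

rng = np.random.default_rng(0)
# Synthetic features on a scale similar to advertising budgets (assumption)
raw = rng.uniform(0, 300, size=(200, 3))
target = 5.0 + raw @ np.array([0.05, 0.1, 0.0]) + rng.normal(0, 1, size=200)

# Standardize each feature to zero mean and unit variance,
# then append the column of ones for the intercept
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
design = np.column_stack([scaled, np.ones(len(target))])

params = gradient_descent_demo(design, target, alpha=0.1, num_iters=1000)
cost = np.sum((design @ params - target) ** 2) / (2 * len(target))
print(cost)
```

After scaling, a much larger learning rate is stable and the cost stays finite and small. Note that the learned coefficients then refer to the standardized features, so new inputs must be scaled with the same means and standard deviations before prediction.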

### Closed Form Solution

The closed-form solution is an analytical method for finding the best set of parameters for the linear regression model. It is also called the normal equation method. Writing X for the matrix of input variables (including the column of ones) and y for the output variable, the parameters are computed as (X^T X)^-1 X^T y: the inverse of X^T X multiplied by X^T y.




In [None]:
def closed_form_solution(x, y):
    # x is expected to already contain the column of ones (for the y-intercept term)
    # calculate the inverse of the dot product of x transpose and x
    x_inv = np.linalg.inv(np.dot(x.T, x))
    # calculate the dot product of x transpose and y
    x_ty = np.dot(x.T, y)
    # calculate the optimal parameters
    parameters = np.dot(x_inv, x_ty)
    return parameters

parameters = closed_form_solution(x, y)


In [None]:
parameters

array([5.44457803e-02, 1.07001228e-01, 3.35657922e-04, 4.62512408e+00])
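
Explicitly inverting X^T X works for a small, well-conditioned problem like this one, but it can be numerically fragile when features are strongly correlated. As an alternative, the sketch below compares the explicit inverse with `np.linalg.solve` and `np.linalg.lstsq` on synthetic data; the data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic design matrix: 3 features plus a column of ones (intercept last)
x_demo = np.column_stack([rng.normal(size=(100, 3)), np.ones(100)])
true_params = np.array([2.0, -1.0, 0.5, 4.0])
y_demo = x_demo @ true_params + rng.normal(0, 0.1, size=100)

# Normal equation with an explicit inverse
via_inverse = np.linalg.inv(x_demo.T @ x_demo) @ (x_demo.T @ y_demo)

# Same normal equation, solved as a linear system without forming the inverse
via_solve = np.linalg.solve(x_demo.T @ x_demo, x_demo.T @ y_demo)

# Least-squares solver that works on x directly
via_lstsq, *_ = np.linalg.lstsq(x_demo, y_demo, rcond=None)

print(np.allclose(via_inverse, via_solve), np.allclose(via_inverse, via_lstsq))
```

On well-behaved data all three approaches should agree; the solver-based versions are generally the safer choice when X^T X is close to singular.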

### T-Test for testing significance of individual regression coefficients

This example uses the scipy library to perform the t-test. First, the residual variance (mse) is estimated as the residual sum of squares divided by the degrees of freedom, len(y) - len(parameters). The standard error of each coefficient is then the square root of the corresponding diagonal entry of mse multiplied by the inverse of the dot product of the transposed input variables and the input variables. The t-value is calculated by dividing the coefficient by its standard error. Finally, the two-sided p-value is calculated as twice (1 - cdf) of the t-distribution, evaluated at the absolute value of the t-value with the same degrees of freedom. This is equivalent to twice the survival function (sf).

In [None]:
import numpy as np
from scipy import stats
# calculate the predicted values
y_hat = np.dot(x, parameters)

# calculate the residuals
residuals = y - y_hat

# calculate the residual sum of squares
rss = np.sum(residuals ** 2)

# calculate the degrees of freedom
df = len(y) - len(parameters)

# calculate the mean squared error
mse = rss / df

# calculate the standard error of the coefficients
se = np.sqrt(np.diag(mse * np.linalg.inv(np.dot(x.T, x))))

# calculate the t-values
t_values = parameters / se

# calculate the p-values
p_values = (1 - stats.t.cdf(np.abs(t_values), df)) * 2


In [None]:
p_values

array([0.       , 0.       , 0.9538145, 0.       ])
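
The p-values printed as exactly 0 are most likely a floating-point effect: for a large t-value the CDF rounds to 1, so 1 - cdf loses the tiny tail probability. The survival function `stats.t.sf` computes the tail directly. The sketch below uses an illustrative t-value and degrees of freedom chosen for this example, not values taken from the fit above.

```python
from scipy import stats

# Illustrative values (assumptions): a large t-statistic and 196 degrees of freedom
t_value, dof = 20.0, 196

# Two-sided p-value via 1 - cdf: the tail is lost to rounding
p_via_cdf = (1 - stats.t.cdf(t_value, dof)) * 2

# Two-sided p-value via the survival function: the tail is kept
p_via_sf = stats.t.sf(t_value, dof) * 2

print(p_via_cdf, p_via_sf)
```

Either way the coefficient is clearly significant, but the survival function reports a small positive number instead of collapsing to zero.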

In [None]:
def identify_significant_coefficients(p_values, significance_level=0.05):
    significant_coefficients = []
    for i, p_value in enumerate(p_values):
        if p_value < significance_level:
            significant_coefficients.append(i)
    return significant_coefficients

# Example usage:
significant_coefficients = identify_significant_coefficients(p_values)
print(significant_coefficients)
# Indices follow the column order of x: TV, Radio, Newspaper, ones (intercept)


[0, 1, 3]
