# Linear Regression
Linear regression is a statistical method used to model the relationship between one or more independent variables (predictors) and a dependent variable (outcome). It assumes that this relationship is linear, meaning that a one-unit change in an independent variable is associated with a constant change in the dependent variable.

## Regression Table
The output from linear regression can be summarized in a regression table.

The content of the table includes:

Information about the model
Coefficients of the linear regression function
Regression statistics
Statistics of the coefficients from the linear regression function
Other information that we will not cover in this module

In [11]:
import pandas as pd 
import statsmodels.formula.api as smf  


df = pd.read_csv("study_performance.csv", header=0, sep=",")

# Creating OLS regression model
model = smf.ols('writing_score ~ reading_score', data=df)

# Fitting model and storing results
results = model.fit()
print(results.summary())



                            OLS Regression Results                            
Dep. Variable:          writing_score   R-squared:                       0.911
Model:                            OLS   Adj. R-squared:                  0.911
Method:                 Least Squares   F-statistic:                 1.025e+04
Date:                Sun, 21 Apr 2024   Prob (F-statistic):               0.00
Time:                        23:50:58   Log-Likelihood:                -2928.4
No. Observations:                1000   AIC:                             5861.
Df Residuals:                     998   BIC:                             5871.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.6676      0.694     -0.962      0.336      -2.029       0.694
reading_score     0.9935      0.010    101.233      0.000       0.974       1.013

## The "Information Part" in Regression Table

Dep. Variable: short for "Dependent Variable". Here the dependent variable is writing_score, which the model assumes is explained by reading_score.
Model: OLS is short for Ordinary Least Squares, a type of model that is fitted with the least squares method.
Date: and Time: show the date and time at which the output was calculated in Python.
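To make the least squares idea concrete, here is a minimal sketch of the closed-form fit for a single predictor. The data are made up for illustration and do not come from study_performance.csv, and the helper name `ols_fit` is hypothetical rather than part of statsmodels.

```python
# Toy data, made up for illustration (not taken from study_performance.csv).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]


def ols_fit(x, y):
    """Return (intercept, slope) of the line that minimizes the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = ols_fit(x, y)
print("Intercept:", round(intercept, 4))
print("Slope:", round(slope, 4))
```

For one predictor, this is the same quantity that an OLS fit estimates; libraries such as statsmodels additionally report the statistics discussed in the rest of this module.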


## The "Coefficients Part" in Regression Table
Intercept (-0.6676): the predicted writing_score when reading_score is zero. A reading score of zero may not occur in practice, so this value is not necessarily meaningful on its own.
reading_score (0.9935): for every one-point increase in reading_score, writing_score is estimated to increase by approximately 0.9935 points.
Remember that the intercept shifts the fitted line up or down so that its predictions match the data as closely as possible!

The linear regression function can be rewritten mathematically as:

In [None]:
writing_score = 0.9935 * reading_score - 0.6676

### Define the Linear Regression Function in Python
Define the linear regression function in Python to perform predictions.

What is writing_score if reading_score is 10, 20, 30, or 40?

In [14]:
def predict_writing_score(reading_score):
    """Predict writing_score from reading_score using the fitted coefficients."""
    return 0.9935 * reading_score - 0.6676

# Try some different values:
print(predict_writing_score(10))
print(predict_writing_score(20))
print(predict_writing_score(30))
print(predict_writing_score(40))


9.2674
19.2024
29.1374
39.0724


## The "Statistics of the Coefficients Part" in Regression Table
Four components describe the statistics of the coefficients:

std err stands for Standard Error
t is the "t-value" of the coefficients
P>|t| is called the "P-value"
[0.025  0.975] represents the 95% confidence interval of the coefficients
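These columns are related to one another, and a short sketch can show how. It uses the rounded values printed in the coefficient table of this module and assumes a normal approximation to the t critical value, which is reasonable with 998 residual degrees of freedom. The helper names `t_value` and `conf_interval` are hypothetical.

```python
from statistics import NormalDist

# Rounded values as printed in the coefficient table of this module.
coef_intercept, se_intercept = -0.6676, 0.694
coef_reading, se_reading = 0.9935, 0.010


def t_value(coef, std_err):
    """t-value: the coefficient divided by its standard error."""
    return coef / std_err


def conf_interval(coef, std_err, level=0.95):
    """Approximate confidence interval (normal approximation, an assumption here)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * std_err, coef + z * std_err


t_intercept = t_value(coef_intercept, se_intercept)
ci_intercept = conf_interval(coef_intercept, se_intercept)
ci_reading = conf_interval(coef_reading, se_reading)

print("t (intercept):", round(t_intercept, 3))
print("95% CI (intercept):", ci_intercept)
print("95% CI (reading_score):", ci_reading)
```

Because the table prints the standard error of reading_score rounded to 0.010, dividing 0.9935 by it would give roughly 99 rather than the reported 101.233; the intervals are far less sensitive to that rounding and land close to the table.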
 
## The P-value 


The P-value tests whether the relationship between reading_score and writing_score is statistically significant.
If it is small (< 0.05), the data provide evidence of a relationship.
If it is large (> 0.05), the data do not provide enough evidence of a relationship.

We test two hypotheses: no relationship (the null hypothesis) and a relationship (the alternative hypothesis).
For reading_score, the null hypothesis is "no relationship (coefficient of reading_score = 0)" and the alternative is "a relationship (coefficient of reading_score ≠ 0)".

A small P-value means we reject the null hypothesis. A large P-value means we cannot reject it, which is not the same as proving that there is no relationship.

The regression output reports a P-value of 0.000 (rounded to three decimals) for reading_score, which is well below 0.05.
So we reject the null hypothesis and conclude that reading_score and writing_score are related.
We don't worry much about the intercept's P-value (0.336): the intercept helps prediction and says nothing about the relationship between the variables.
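As a rough illustration of where a P-value comes from, the sketch below converts the t-values reported in this module into two-sided P-values. It assumes a normal approximation to the t distribution, which is close with about 1000 observations; statsmodels itself uses the t distribution. The function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(t):
    """Two-sided P-value for a t-value, using a normal approximation (an assumption)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# t-values as printed in the coefficient table of this module.
p_intercept = two_sided_p_value(-0.962)
p_reading = two_sided_p_value(101.233)

print("P-value (intercept):", round(p_intercept, 3))
print("P-value (reading_score):", p_reading)
```

A t-value near zero yields a large P-value, while a t-value as extreme as 101.233 yields a P-value that is indistinguishable from zero at floating-point precision.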


## R-Squared
R-Squared and Adjusted R-Squared describe how well the linear regression model fits the data points.

The value of R-Squared is always between 0 and 1 (0% to 100%). It is the share of the variation in the dependent variable that the model explains.

A high R-Squared value means that the data points lie close to the linear regression line.
A low R-Squared value means that the linear regression line does not fit the data well.
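The definition can be sketched by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The data below are made up for illustration and are not from study_performance.csv; the helper name `r_squared` is hypothetical.

```python
# Toy data, made up for illustration (not taken from study_performance.csv).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def r_squared(x, y):
    """R-squared of a one-predictor least squares fit: 1 - SS_res / SS_tot."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Fit the line with the closed-form least squares solution
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


r2 = r_squared(x, y)
print("R-squared:", round(r2, 4))
```

An R-squared of 1 would mean every point lies exactly on the line, while a value near 0 would mean the line predicts no better than the mean of y.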

In [25]:
import pandas as pd
import statsmodels.api as sm

# Read data
df = pd.read_csv("study_performance.csv", header=0, sep=",")

# Calculate R-squared for writing_score
results_writing = sm.OLS(df['writing_score'], sm.add_constant(df['reading_score'])).fit()
r_squared_writing = results_writing.rsquared

# Calculate R-squared for reading_score
results_reading = sm.OLS(df['reading_score'], sm.add_constant(df['writing_score'])).fit()
r_squared_reading = results_reading.rsquared

print("R-squared for Writing Score:", r_squared_writing)
print("R-squared for Reading Score:", r_squared_reading)


R-squared for Writing Score: 0.9112574888913136
R-squared for Reading Score: 0.9112574888913139


In [26]:
import pandas as pd
import statsmodels.api as sm

# Read data
df = pd.read_csv("study_performance.csv", header=0, sep=",")

# Calculate p-value for writing_score
results_writing = sm.OLS(df['writing_score'], sm.add_constant(df['reading_score'])).fit()
p_value_writing = results_writing.pvalues['reading_score']

# Calculate p-value for reading_score
results_reading = sm.OLS(df['reading_score'], sm.add_constant(df['writing_score'])).fit()
p_value_reading = results_reading.pvalues['writing_score']

print("P-value for Writing Score:", p_value_writing)
print("P-value for Reading Score:", p_value_reading)


P-value for Writing Score: 0.0
P-value for Reading Score: 0.0


In [27]:
import pandas as pd
import statsmodels.api as sm

# Read data
df = pd.read_csv("study_performance.csv", header=0, sep=",")

# Calculate coefficient for writing_score
results_writing = sm.OLS(df['writing_score'], sm.add_constant(df['reading_score'])).fit()
coefficient_writing = results_writing.params['reading_score']

# Calculate coefficient for reading_score
results_reading = sm.OLS(df['reading_score'], sm.add_constant(df['writing_score'])).fit()
coefficient_reading = results_reading.params['writing_score']

print("Coefficient for Writing Score:", coefficient_writing)
print("Coefficient for Reading Score:", coefficient_reading)


Coefficient for Writing Score: 0.9935311142409599
Coefficient for Reading Score: 0.9171906906886349
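The two R-squared values printed earlier are identical even though the two slopes differ. For a regression with a single predictor, the slope of y on x is S_xy / S_xx and the slope of x on y is S_xy / S_yy, so their product is the squared correlation, which equals R-squared in either direction. The sketch below checks this with the numbers printed in this module; the variable names are chosen here for illustration.

```python
# Values copied from the outputs printed in this module.
slope_writing_on_reading = 0.9935311142409599
slope_reading_on_writing = 0.9171906906886349
r_squared_reported = 0.9112574888913136

# For simple regression, the product of the two slopes equals R-squared.
product = slope_writing_on_reading * slope_reading_on_writing

print("Product of slopes:", product)
print("Matches R-squared:", abs(product - r_squared_reported) < 1e-9)
```

This also shows why the two slopes are not reciprocals of each other: that would only happen if R-squared were exactly 1.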


In [29]:
import pandas as pd
import statsmodels.api as sm

# Read data
df = pd.read_csv("study_performance.csv", header=0, sep=",")

# Regression info for writing_score
X_writing = sm.add_constant(df['reading_score'])
model_writing = sm.OLS(df['writing_score'], X_writing)
results_writing = model_writing.fit()
print("Regression Info for Writing Score:")
print(results_writing.summary().tables[1])

# Regression info for reading_score
X_reading = sm.add_constant(df['writing_score'])
model_reading = sm.OLS(df['reading_score'], X_reading)
results_reading = model_reading.fit()
print("\nRegression Info for Reading Score:")
print(results_reading.summary().tables[1])


Regression Info for Writing Score:
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.6676      0.694     -0.962      0.336      -2.029       0.694
reading_score     0.9935      0.010    101.233      0.000       0.974       1.013

Regression Info for Reading Score:
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             6.7505      0.632     10.685      0.000       5.511       7.990
writing_score     0.9172      0.009    101.233      0.000       0.899       0.935


In [31]:
import pandas as pd
from scipy.stats import linregress

# Read data
df = pd.read_csv("study_performance.csv", header=0, sep=",")

# Perform linear regression for writing_score
slope_writing, intercept_writing, _, _, _ = linregress(df['reading_score'], df['writing_score'])
print("Writing Score Linear Regression:")
print("Slope:", slope_writing)
print("Intercept:", intercept_writing)

# Perform linear regression for reading_score
slope_reading, intercept_reading, _, _, _ = linregress(df['writing_score'], df['reading_score'])
print("\nReading Score Linear Regression:")
print("Slope:", slope_reading)
print("Intercept:", intercept_reading)


Writing Score Linear Regression:
Slope: 0.9935311142409595
Intercept: -0.6675536409329226

Reading Score Linear Regression:
Slope: 0.9171906906886339
Intercept: 6.750504735875701
