### `SIMPLE LINEAR REGRESSION INTUITION`
- Simple linear regression is a statistical method used to model the relationship between two variables:
  1. **Independent Variable (X)**: The variable you use to make predictions (input).
  2. **Dependent Variable (Y)**: The variable you are trying to predict (output).

- **Simple linear regression** uses mathematical techniques to find the best estimates of the parameters (intercept and slope) in:

$$
\text{intercept} + \text{slope} \times \text{independent variable value} = \text{estimated dependent variable value}
$$

- The model is a straight line (a linear equation in one variable) with the algebraic equation \( c + mx = y \), where:
  - \( c \) is the intercept
  - \( m \) is the slope
  - \( x \) is the independent variable
  - \( y \) is the dependent variable

In [12]:
## importing necessary libraries  
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt  
import seaborn as sns  
import plotly.express as px 
import plotly.graph_objects as go
import plotly.io as io  
io.templates.default='plotly_dark'
from scipy import stats  
from statsmodels.formula.api import ols 
from statsmodels.stats.anova import anova_lm
import patsy
plt.rcParams['figure.figsize']=(12,6)
import warnings
warnings.filterwarnings("ignore")
from rich.console import Console
console = Console()

- `Visualize a linear equation in one variable ( y = -3 + 2x)`:

In [26]:
import numpy as np
import plotly.graph_objects as go

# Creating x values
xvals = np.arange(-1, 5, 0.01)

# Creating a figure
fig = go.Figure()

# Adding a scatter trace to the figure
fig.add_trace(
    go.Scatter(
        x=xvals,
        y=-3 + 2*xvals,
        mode='lines',
        name='y = -3 + 2x'
    )
)

fig.update_layout(
    title='EQUATION OF A STRAIGHT LINE',
    showlegend=True, 
    legend=dict(
        x=0.8, 
        y=0.9
    ),
    title_font=dict(size=20),  # You can also adjust font size if needed
    title_x=0.5,  # Centers the title
    title_xanchor='center'  # Ensures the title is anchored at the center
)

# Display the plot
fig.show()



In statistical notation:

- $c = \hat{\beta}_0$
- $m = \hat{\beta}_1$
- $\mathbf{\hat{y}}$ is a column vector of estimated dependent variable values (whereas $\mathbf{y}$ is the column vector of actual values)
- $\mathbf{x}$ is a column vector of independent variable values

$$
\hat{\beta}_0 + \hat{\beta}_1 x = \hat{y}
$$

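A model finds the best estimates of the true population parameters $\beta_0$ and $\beta_1$ in the equations below, where:

- $\epsilon$ is the vector of errors (the unobserved deviations from the population line; the residuals are their sample estimates).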

$$
\beta_0 + \beta_1 x + \epsilon = y
$$

$$
\beta_0 + \beta_1 x_i + \epsilon_i = y_i
$$

#### `Equation Explanation:`

- $\beta_0$: Population intercept
- $\beta_1 x_i$: Population slope multiplied by a specific individual's independent variable value
- $\epsilon_i$: Specific individual's error
- $y_i$: Specific individual's dependent variable value

#### `The Residual or Error:`

The residual (the sample estimate of the error) is the difference between the actual and estimated dependent variable values:

$$
e = y - \hat{y}
$$
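
As a minimal sketch of the residual definition, the snippet below subtracts estimated values from observed values. The numbers are hypothetical and chosen only for illustration; they are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical observed and estimated values (illustration only)
y = np.array([3.0, 5.0, 8.0])
y_hat = np.array([2.5, 5.5, 7.5])

# Residual: observed value minus estimated value
e = y - y_hat
print(e)  # [ 0.5 -0.5  0.5]
```

A positive residual means the observation lies above the fitted line, and a negative residual means it lies below.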

Below is the matrix representation of simple linear regression for two observations (the estimated-line equation $\hat{\beta}_0 + \hat{\beta}_1 x = \hat{y}$ written with values for $x$ and $\hat{y}$). Each row of the design matrix holds a 1 for the intercept and the observation's $x$ value:

$$
\begin{bmatrix}
1 & x_1 \\
1 & x_2
\end{bmatrix}
\begin{bmatrix}
\hat{\beta}_0 \\
\hat{\beta}_1
\end{bmatrix} =
\begin{bmatrix}
\hat{y}_1 \\
\hat{y}_2
\end{bmatrix}
$$
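
A small sketch of the matrix form: a design matrix with a column of ones and a column of $x$ values, multiplied by the coefficient vector, gives the estimated values. The two $x$ values are hypothetical, and the coefficients reuse the line $y = -3 + 2x$ plotted earlier.

```python
import numpy as np

x = np.array([1.0, 2.0])            # hypothetical independent variable values
beta_hat = np.array([-3.0, 2.0])    # [intercept, slope] of the line y = -3 + 2x

# Design matrix: a column of ones (intercept) next to the x values
X = np.column_stack([np.ones_like(x), x])

# Matrix-vector product gives one estimated value per row of X
y_hat = X @ beta_hat
print(y_hat)  # [-1.  1.]
```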


Visualize a nonlinear equation in one variable: $y = x^2$:

In [25]:
import numpy as np
import plotly.graph_objects as go

# Creating x values
xvals = np.arange(-5, 5, 0.01)

# Creating a figure
fig = go.Figure()

# Adding a scatter trace to the figure
fig.add_trace(
    go.Scatter(
        x=xvals,
        y=xvals**2,
        mode='lines',
        name='y = x^2'
    )
)

fig.update_layout(
    title='NONLINEAR CURVE',
    showlegend=True, 
    legend=dict(
        x=0.8, 
        y=0.9
    ),
    title_font=dict(size=20),  # You can also adjust font size if needed
    title_x=0.5,  # Centers the title
    title_xanchor='center'  # Ensures the title is anchored at the center
)

# Display the plot
fig.show()


`GENERATING THE RESEARCH QUESTION`

- Does the independent variable have a significant effect on the dependent variable?

#### Hypothesis:

- **Null Hypothesis ($H_0$):** $\beta_1 = 0$ (The independent variable has no effect on the dependent variable.)
- **Alternative Hypothesis ($H_a$):** $\beta_1 \neq 0$ (The independent variable has a significant effect on the dependent variable.)


`GENERATING SYNTHETIC DATA`

In [11]:
## seed for reproducibility
np.random.seed(42)
## draw 20 values from a normal distribution with mean 100 and standard deviation 10
Independent = np.round(stats.norm.rvs(loc=100,scale=10,size=20),1)
## add zero-mean normal noise (standard deviation 10) to each value
Dependent = Independent + np.round(stats.norm.rvs(loc=0,scale=10,size=20),1)

In [14]:
## Generating a dataframe from the random value vectors  
df = pd.DataFrame( 
    { "Independent":Independent, 
     "Dependent":Dependent

    }
)
## show the first five rows  
df.head()

| | Independent | Dependent |
|---|---|---|
| 0 | 105.0 | 119.7 |
| 1 | 98.6 | 96.3 |
| 2 | 106.5 | 107.2 |
| 3 | 115.2 | 101.0 |
| 4 | 97.7 | 92.3 |


In [16]:
### checking the shape of the data  
df.shape

(20, 2)
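
Because `Dependent` is built as `Independent` plus zero-mean noise, the data-generating process has a true intercept of 0 and a true slope of 1. The sketch below illustrates that estimates move toward those values as the sample grows. The seed, the sample size, and the use of `np.polyfit` are assumptions made for this illustration and are not part of the notebook's workflow.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed seed, different from the notebook's
n = 10_000                       # assumed large sample size

# Same recipe as above: x ~ N(100, 10), y = x + N(0, 10) noise
x = rng.normal(loc=100, scale=10, size=n)
y = x + rng.normal(loc=0, scale=10, size=n)

# Degree-1 least-squares fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With only 20 observations, as in this notebook, the estimates can sit noticeably further from 1 and 0.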

`visualize the data using a scatter plot`

In [27]:

fig = px.scatter( 
    df, 
    x='Independent', 
    y='Dependent', 
    title='RELATIONSHIP BETWEEN THE DEPENDENT AND INDEPENDENT VARIABLES', 
    trendline='ols', 
    hover_data=['Independent']
)

fig.update_layout(
    title={'x': 0.5, 'xanchor': 'center'}
)

fig.show()


`SIMPLE LINEAR REGRESSION MODEL`
* The `ols` function takes a formula as input
* The `fit` method fits the model to the data
* The `summary` method returns a full report of the fitted model (coefficients, fit statistics, and diagnostics)

In [28]:
## creating the model  
linear_Model = ols("Dependent~Independent",data=df).fit()

In [30]:
## use the summary method  
console.print("Linear Regression Model Output Summary:",style='bold underline')
linear_Model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Dependent | R-squared: | 0.417 |
| Model: | OLS | Adj. R-squared: | 0.384 |
| Method: | Least Squares | F-statistic: | 12.87 |
| Date: | Tue, 21 Jan 2025 | Prob (F-statistic): | 0.00211 |
| Time: | 18:49:22 | Log-Likelihood: | -73.025 |
| No. Observations: | 20 | AIC: | 150.0 |
| Df Residuals: | 18 | BIC: | 152.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 12.7618 | 23.210 | 0.550 | 0.589 | -36.001 | 61.525 |
| Independent | 0.8432 | 0.235 | 3.587 | 0.002 | 0.349 | 1.337 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.113 | Durbin-Watson: | 2.121 |
| Prob(Omnibus): | 0.573 | Jarque-Bera (JB): | 0.679 |
| Skew: | 0.444 | Prob(JB): | 0.712 |
| Kurtosis: | 2.832 | Cond. No. | 1040.0 |
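
The slope's t statistic, p-value, and 95% confidence interval in the coefficient table can be roughly reproduced from the reported `coef` and `std err`. This is a sketch using the rounded values from the summary, so small differences in the last digit are expected.

```python
from scipy import stats

# Rounded values read from the summary table above
coef, se, df_resid = 0.8432, 0.235, 18

# t statistic: estimate divided by its standard error
t_stat = coef / se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

# 95% confidence interval: estimate +/- critical value * standard error
t_crit = stats.t.ppf(0.975, df_resid)
ci_low, ci_high = coef - t_crit * se, coef + t_crit * se

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, CI = [{ci_low:.3f}, {ci_high:.3f}]")
```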


`CALCULATION OF THE COEFFICIENTS`

With linear algebra, we find an orthogonal projection onto a subspace (the column space of the design matrix $X$). The linear algebra statement of the problem is:

$$
\small X \hat{\beta} = \hat{y}
$$

---

#### Left-multiply by the transpose of the design matrix on both sides, shown in **(10)**:

$$
\small X^T X \hat{\beta} = X^T \hat{y}
$$

---

#### Left-multiply by the inverse of the resultant square matrix $X^T X$ on both sides, shown in **(11)**:

$$
\small (X^T X)^{-1} (X^T X) \hat{\beta} = (X^T X)^{-1} X^T \hat{y}
$$

---

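#### Final derivation solving for the vector of best-fit parameter estimates is shown in **(12)**:

$$
\small \hat{\beta} = (X^T X)^{-1} X^T \hat{y}
$$

Because the residual vector $e = y - \hat{y}$ is orthogonal to the columns of $X$ (that is, $X^T e = 0$), we have $X^T \hat{y} = X^T y$, so the estimates can be computed from the observed values:

$$
\small \hat{\beta} = (X^T X)^{-1} X^T y
$$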
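
The orthogonality behind this projection can be checked numerically. The sketch below uses freshly simulated data (the seed and sample are assumptions, not the notebook's data) and confirms that the residuals are orthogonal to the columns of the design matrix, which is why $X^T \hat{y}$ and $X^T y$ coincide.

```python
import numpy as np

rng = np.random.default_rng(1)   # assumed seed for this illustration
x = rng.normal(100, 10, size=20)
y = x + rng.normal(0, 10, size=20)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

# Residuals are orthogonal to each column of X (up to rounding error)
print(X.T @ e)
# Hence X^T y_hat equals X^T y
print(np.allclose(X.T @ y_hat, X.T @ y))
```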

---

#### `We follow the calculations below:`


In [31]:
## first create the dependent-variable vector and a design matrix with an intercept column (1s) and the independent variable
y , X = patsy.dmatrices('Dependent~Independent',df) 

In [32]:
### convert y and X to NumPy arrays, since patsy returns DesignMatrix objects
y = np.array(y)
X = np.array(X)

In [33]:
## show the first five rows
y[:5]

array([[119.7],
       [ 96.3],
       [107.2],
       [101. ],
       [ 92.3]])

In [34]:
X[:5]

array([[  1. , 105. ],
       [  1. ,  98.6],
       [  1. , 106.5],
       [  1. , 115.2],
       [  1. ,  97.7]])

`Coefficient Calculation Based On the Formula` (computed with the observed $y$, since $X^T \hat{y} = X^T y$):

 $$
\small \hat{\beta} = (X^T X)^{-1} X^T y
$$

In [35]:
## coefficient calculation based on the formula 
XT = X.transpose()
XTX = np.matmul(XT , X)
XTXI = np.linalg.inv(XTX)
XTXIXT = np.matmul(XTXI,XT)
beta =  np.matmul(XTXIXT , y)

* The values of the coefficients are shown below

In [38]:
## extracting the intercept and the slope coefficient  
beta0 = beta[0]
beta1 = beta[1]
beta0 , beta1

(array([12.7618174]), array([0.84320853]))
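
Forming the explicit inverse works here, but a least-squares solver is a common alternative that avoids it. The sketch below compares the two approaches on the five rows of `X` and `y` displayed above. Because it uses only those five rows, its coefficients differ from the full-sample estimates.

```python
import numpy as np

# First five rows of the design matrix and response, as displayed above
X = np.array([[1., 105.], [1., 98.6], [1., 106.5], [1., 115.2], [1., 97.7]])
y = np.array([[119.7], [96.3], [107.2], [101.0], [92.3]])

# Closed-form estimate via the explicit inverse
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver: no explicit inverse is formed
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv.ravel())
print(beta_lstsq.ravel())
```

Both routes solve the same minimization problem, so they should agree up to rounding error.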

In [39]:
console.print("Calculation of the fitted values",style="bold underline")
## fitted values  
fitted_values = beta0 + beta1*df.Independent
## print out the fitted values  
fitted_values[:10]

0    101.298713
1     95.902179
2    102.563526
3    109.899440
4     95.143291
5     95.143291
6    110.405365
7    103.575376
8     93.119590
9    101.635997
Name: Independent, dtype: float64

In [41]:
console.print("Using the built-in attribute of the linear model gives the same output as before",style='bold underline')
## same values as the fittedvalues attribute of the fitted model
linear_Model.fittedvalues[:10]

0    101.298713
1     95.902179
2    102.563526
3    109.899440
4     95.143291
5     95.143291
6    110.405365
7    103.575376
8     93.119590
9    101.635997
dtype: float64

In [44]:
console.print("Residual Calculation (Residual Means Deviation from the regression line):",style='bold underline')
## The Residual    
Residual = df.Dependent - fitted_values
Residual[:5]

0    18.401287
1     0.397821
2     4.636474
3    -8.899440
4    -2.843291
dtype: float64

In [45]:
console.print("Using the built-in resid attribute gives the same output as before:",style='bold underline')
## same values as the manually computed residuals
linear_Model.resid[:5]


0    18.401287
1     0.397821
2     4.636474
3    -8.899440
4    -2.843291
dtype: float64

In [47]:
## create a new column of fitted values in the df
df['fittedvalues'] = linear_Model.fittedvalues
## display the first five rows
df.head()

| | Independent | Dependent | fittedvalues |
|---|---|---|---|
| 0 | 105.0 | 119.7 | 101.298713 |
| 1 | 98.6 | 96.3 | 95.902179 |
| 2 | 106.5 | 107.2 | 102.563526 |
| 3 | 115.2 | 101.0 | 109.899440 |
| 4 | 97.7 | 92.3 | 95.143291 |


`visualizing the actual and estimated values using plotly`

In [53]:
## creating a visualization  
fig = go.Figure()
## adding traces   
fig.add_trace( 
    go.Scatter( 
        x = df.Independent, 
        y= df.Dependent, 
        mode='markers', 
        name='actual values'
    )
)
## create x values spanning the observed range of the independent variable
xvals = np.arange(df.Independent.min(),df.Independent.max(),1)
fig.add_trace( 
    go.Scatter( 
        x = xvals, 
        y = beta0 + beta1*xvals, 
        mode='lines', 
        name='model:(ols)'

    )
)
fig.add_trace( 
   go.Scatter( 
       x = df.Independent, 
       y = df.fittedvalues, 
       mode='markers', 
       name='estimated values'
   )  
)
fig.update_layout(
    title='SIMPLE LINEAR REGRESSION MODEL',
    title_font=dict(size=20),
    title_x=0.5,
    title_xanchor='center'
)

# Display the plot
fig.show()

`MODEL STATISTIC`
* We also investigate the analysis of variance (ANOVA) table using the `anova_lm` function

In [54]:
## anova lm table 
console.print("The Analysis of Variance table",style='bold underline') 
anova_lm(linear_Model)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| Independent | 1.0 | 1242.210171 | 1242.210171 | 12.867168 | 0.002107 |
| Residual | 18.0 | 1737.739329 | 96.541074 | | |


`MEASURES OF VARIATION (DECOMPOSITION OF VARIANCE)`
- There are three sums of squares in our model, related by SST = SSE + SSR (note that some texts swap the SSE and SSR abbreviations):
* SSE --> Explained Sum of Squares: `it represents the variation explained by the regression/model`
* SSR --> Residual Sum of Squares: `it represents the variation not explained by the regression/model`
* SST --> Total Sum of Squares: `it represents the total sample variation in the dependent variable`

In [55]:
## SSE (explained sum of squares)
SSE =  np.sum((df.fittedvalues - df.Dependent.mean())**2)
console.print(f"Explained Sum of Squares (SSE) = {SSE:.6f}",style='bold underline')
## SSR (residual sum of squares)
SSR = np.sum((df.Dependent - df.fittedvalues)**2)
console.print(f"Residual Sum of Squares (SSR) = {SSR:.6f}",style='bold underline')
## SST (total sum of squares)
SST = SSE + SSR
console.print(f"Total Sum of Squares (SST) = {SST:.6f}",style='bold underline')
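
The total sum of squares can also be computed directly from its definition, which gives a check on the decomposition. The sketch below uses a small hypothetical dataset and `np.polyfit` (both assumptions for illustration) to show that the explained and residual parts add up to the directly computed total.

```python
import numpy as np

# Small hypothetical dataset (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares line and fitted values
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sse = np.sum((y_hat - y.mean()) ** 2)   # explained
ssr = np.sum((y - y_hat) ** 2)          # residual
sst = np.sum((y - y.mean()) ** 2)       # total, from its definition

print(f"SSE = {sse:.2f}, SSR = {ssr:.2f}, SST = {sst:.2f}")  # SSE = 3.60, SSR = 2.40, SST = 6.00
```

The identity SST = SSE + SSR holds for least-squares fits that include an intercept.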


`THE F-statistic`


In [56]:
## The F statistic: model mean square (1 df) over residual mean square (n - 2 = 18 df)
F = (SSE/1)/(SSR/18)
console.print(f"The F-statistic is = {F:.6f}")
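
As a cross-check, the F statistic can be rebuilt from the sums of squares reported in the ANOVA table. In simple linear regression it also equals the square of the slope's t statistic. The sketch below uses the rounded values reported earlier, so agreement is only approximate.

```python
# Values read from the ANOVA table and the coefficient table above
ss_model, df_model = 1242.210171, 1
ss_resid, df_resid = 1737.739329, 18
t_slope = 3.587

# F is the ratio of the two mean squares
F_check = (ss_model / df_model) / (ss_resid / df_resid)

print(f"F from sums of squares = {F_check:.4f}")
print(f"t squared              = {t_slope ** 2:.4f}")
```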

`THE P-VALUE`


In [69]:
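## The p-value: upper-tail probability of the F distribution with (1, 18) degrees of freedom
P_Value = stats.f.sf(F,1,18)
console.print(f"The P Value is = {P_Value:.6f}")
## using the built-in attribute (in simple regression the slope's t-test p-value equals the F-test p-value)
console.print(f"Using the built-in attribute = {linear_Model.pvalues['Independent']:.6f}\n",style='bold underline')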
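
The p-value is the area in the upper tail of the F distribution beyond the observed statistic. The sketch below recomputes it from the F value reported in the ANOVA table and shows that the survival function `sf` matches `1 - cdf`.

```python
from scipy import stats

F_reported = 12.867168   # from the ANOVA table above
df_model, df_resid = 1, 18

# Survival function: P(F > observed value)
p_sf = stats.f.sf(F_reported, df_model, df_resid)

# Equivalent form used in the cell above
p_cdf = 1 - stats.f.cdf(F_reported, df_model, df_resid)

print(f"p (sf)      = {p_sf:.6f}")
print(f"p (1 - cdf) = {p_cdf:.6f}")
```

For very small p-values, `sf` is generally the safer choice because `1 - cdf` can lose precision.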



`THE R-SQUARED (Coefficient of determination)`

In [67]:
## R-squared: share of the total variation explained by the model
R_SQUARED = SSE/SST
console.print(f"The R SQUARED is = {R_SQUARED:.6f}")
console.print(f"Using the built-in attribute = {linear_Model.rsquared:.6f}\n",style='bold underline')
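
R-squared and adjusted R-squared can both be recovered from the ANOVA table. The sketch below uses the reported sums of squares with $n = 20$ observations and one predictor. The adjusted formula shown is the standard one, stated here as an assumption about how the summary computes it.

```python
# Values read from the ANOVA table above
ss_model = 1242.210171
ss_resid = 1737.739329
n, k = 20, 1   # observations and number of predictors

ss_total = ss_model + ss_resid

# R-squared: explained share of total variation (equivalently 1 - SSR/SST)
r2 = ss_model / ss_total

# Adjusted R-squared penalizes for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R-squared = {r2:.3f}, adjusted R-squared = {adj_r2:.3f}")  # R-squared = 0.417, adjusted R-squared = 0.384
```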


`Answer to the Research Question`

Based on the results of the Ordinary Least Squares (OLS) regression analysis:

1. The **null hypothesis** ($H_0: \beta_1 = 0$) states that the independent variable has no effect on the dependent variable.  
2. The **alternative hypothesis** ($H_a: \beta_1 \neq 0$) states that the independent variable does have an effect.

`Key Findings:`
- The estimated coefficient ($\hat{\beta}_1$) for the independent variable is **0.8432**, with a **p-value of 0.002**.  
- Since the p-value is less than the common significance level ($\alpha = 0.05$), we reject the null hypothesis:
  $$
  H_0: \beta_1 = 0
  $$
  in favor of the alternative:
  $$
  H_a: \beta_1 \neq 0
  $$
- This indicates statistically significant evidence of a positive linear relationship between the independent variable and the dependent variable.

`Additional Insights:`
- The model's $R^2$ value is **0.417**, meaning approximately **41.7% of the variation** in the dependent variable is explained by the independent variable.
- The F-statistic is **12.87** with a corresponding p-value of **0.00211**, further confirming the model's significance.
  
`Interpretation of the Coefficient`
- $\hat{\beta}_1$ = **0.8432**: for a 1-unit increase in the independent variable, the dependent variable is estimated to increase by 0.8432 on average, in whatever units the dependent variable is measured. If it is measured in dollars, for example, the estimated increase is 0.8432 dollars. In a model with additional independent variables, this interpretation would hold while keeping those other variables constant; here there is only one independent variable.
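
The slope interpretation can be illustrated by predicting at two inputs that are one unit apart. The sketch below uses the rounded coefficients from the summary. The helper `predict` and the input values 100 and 101 are hypothetical, introduced only for this illustration.

```python
# Rounded estimates from the regression summary
intercept, slope = 12.7618, 0.8432

def predict(x):
    """Estimated dependent value for a given independent value (hypothetical helper)."""
    return intercept + slope * x

# Predictions one unit apart differ by exactly the slope
difference = predict(101) - predict(100)
print(f"predict(100) = {predict(100):.4f}")
print(f"predict(101) = {predict(101):.4f}")
print(f"difference   = {difference:.4f}")  # difference   = 0.8432
```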

`Practical Significance`
- To evaluate practical significance, we look at the magnitude of the coefficient and the unit of measurement of the dependent variable.


`Conclusion:`
The results suggest that the independent variable is an important predictor of the dependent variable. This insight could be valuable for understanding the dynamics in this dataset and making informed decisions based on the relationship identified.



