# Pre-lecture HW 06 - Cayden Wong 2024/11/05

### 1. Explain the theoretical Simple Linear Regression model in your own words by describing its components (of predictor and outcome variables, slope and intercept coefficients, and an error term) and how they combine to form a sample from normal distribution; then, create python code explicitly demonstrating your explanation using numpy and scipy.stats

Session link for Q1: https://chatgpt.com/share/672a32ef-2e68-8012-b368-486c9ee02929

The theoretical simple linear regression model describes a relationship between a predictor variable $x_i$ and an outcome variable $Y_i$. The model is defined by the equation $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and the $\epsilon_i$ are error terms drawn independently from a normal distribution with mean 0 and standard deviation $\sigma$. Because $Y_i$ is the fixed quantity $\beta_0 + \beta_1 x_i$ plus the normally distributed $\epsilon_i$, the outcome $Y_i$ is itself normally distributed for any given value of the predictor: $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma)$.

In [1]:
import numpy as np
import scipy.stats as stats
import plotly.graph_objects as go

# Define parameters
n = 100  # number of data points
beta0 = 2  # intercept
beta1 = 1.5  # slope
sigma = 1  # standard deviation of errors

# Generate predictor variable (x) from a uniform distribution
np.random.seed(42)
x = stats.uniform.rvs(0, 10, size=n)

# Generate error terms from a normal distribution
errors = stats.norm.rvs(0, sigma, size=n)

# Generate outcome variable (y) using the theoretical linear model
y = beta0 + beta1 * x + errors

# Visualization of the data and the true regression line
fig = go.Figure()

# Scatter plot of the data points (x, y)
fig.add_trace(go.Scatter(x=x, y=y, mode='markers', name='Data'))

# Add the true theoretical line (without noise)
x_line = np.array([x.min(), x.max()])
y_line = beta0 + beta1 * x_line
fig.add_trace(go.Scatter(x=x_line, y=y_line, mode='lines', name='True Line', line=dict(color='orange', dash='dot')))

# Show the plot
fig.update_layout(title='Simple Linear Regression: True Model vs Simulated Data',
                  xaxis_title='Predictor (x)',
                  yaxis_title='Outcome (y)')
fig.show()
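
To make the "sample from a normal distribution" part of the explanation concrete, here is a minimal sketch that fixes a single predictor value and draws many outcomes at that value. The value `x_fixed = 4.0`, the seed, and the number of draws are assumptions chosen only for illustration. If the explanation above is right, both routes should give a mean near $\beta_0 + \beta_1 x$ and a standard deviation near $\sigma$.

```python
import numpy as np
import scipy.stats as stats

# Same parameter values as the simulation above
beta0, beta1, sigma = 2, 1.5, 1

x_fixed = 4.0                    # hypothetical predictor value, chosen for illustration
n_draws = 10_000                 # illustrative number of draws
rng = np.random.default_rng(0)   # illustrative seed

# Route 1: intercept + slope * x + a normally distributed error term
errors = stats.norm.rvs(loc=0, scale=sigma, size=n_draws, random_state=rng)
y_from_errors = beta0 + beta1 * x_fixed + errors

# Route 2: draw Y directly from a normal distribution centred on the line
y_direct = stats.norm.rvs(loc=beta0 + beta1 * x_fixed, scale=sigma,
                          size=n_draws, random_state=rng)

# Both samples should have mean close to beta0 + beta1 * x_fixed and sd close to sigma
print("line value at x_fixed:", beta0 + beta1 * x_fixed)
print("route 1 mean, sd:", y_from_errors.mean(), y_from_errors.std())
print("route 2 mean, sd:", y_direct.mean(), y_direct.std())
```

The two routes describe the same model: adding a normal error to the line is equivalent to sampling the outcome from a normal distribution whose mean sits on the line.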


### 2. Use a dataset simulated from your theoretical Simple Linear Regression model to demonstrate how to create and visualize a fitted Simple Linear Regression model using pandas and import statsmodels.formula.api as smf

Session link for Q2: https://chatgpt.com/share/672a32ef-2e68-8012-b368-486c9ee02929

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go

# Combine x and y into a pandas DataFrame
df = pd.DataFrame({'x': x, 'y': y})


In [3]:
# Specify and fit the OLS regression model
model_data_specification = smf.ols("y ~ x", data=df)  # Specifies the linear model y ~ x
fitted_model = model_data_specification.fit()  # Fits the model to the data

# Model summary and parameters
fitted_model.summary()  # Provides a detailed summary of the regression results
fitted_model.summary().tables[1]  # Provides the coefficient table
fitted_model.params  # Provides the estimated intercept and slope
fitted_model.params.values  # Array of the estimated intercept and slope values
fitted_model.rsquared  # Provides the R-squared value of the model

# Create a scatter plot of the data and add the fitted model line
df['Data'] = 'Data'  # Hack to add the data label to the legend
fig = px.scatter(df, x='x', y='y', color='Data', title='y vs. x with Fitted Line')

# Add the fitted regression line from the OLS model
fig.add_scatter(x=df['x'], y=fitted_model.fittedvalues,
                line=dict(color='blue'), name="trendline='ols'")


# Show the plot
fig.show()

### 3. Add the line from Question 1 on the figure of Question 2 and explain the difference between the nature of the two lines in your own words; but, hint though: simulation of random sampling variation

Session link for Q3: https://chatgpt.com/share/672a32ef-2e68-8012-b368-486c9ee02929

The dotted orange line is the theoretical (population) model, whose parameters we chose ourselves and which we used to generate the samples. The solid blue line is the regression model fitted to those samples. It is an estimate of the population line based on a single sample, so it differs slightly from the orange line because of random sampling variation, and it would change if we drew a new sample.

In [4]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go

# Define parameters
n = 100  # number of data points
beta0 = 2  # intercept
beta1 = 1.5  # slope
sigma = 1  # standard deviation of errors

# Generate predictor variable (x) from a uniform distribution
x = stats.uniform.rvs(0, 10, size=n)

# Generate error terms from a normal distribution
errors = stats.norm.rvs(0, sigma, size=n)
# Generate outcome variable (y) using the theoretical linear model
y = beta0 + beta1 * x + errors

# The figure is built further down with plotly express, once the model has been fitted

# Combine x and y into a pandas DataFrame
df = pd.DataFrame({'x': x, 'y': y})

# Specify and fit the OLS regression model
model_data_specification = smf.ols("y ~ x", data=df)  # Specifies the linear model y ~ x
fitted_model = model_data_specification.fit()  # Fits the model to the data

# Model summary and parameters
fitted_model.summary()  # Provides a detailed summary of the regression results
fitted_model.summary().tables[1]  # Provides the coefficient table
fitted_model.params  # Provides the estimated intercept and slope
fitted_model.params.values  # Array of the estimated intercept and slope values
fitted_model.rsquared  # Provides the R-squared value of the model

# Create a scatter plot of the data and add the fitted model line
df['Data'] = 'Data'  # Hack to add the data label to the legend
fig = px.scatter(df, x='x', y='y', color='Data', title='y vs. x with Fitted Line')

# Add the fitted regression line from the OLS model
fig.add_scatter(x=df['x'], y=fitted_model.fittedvalues,
                line=dict(color='blue'), name="trendline='ols'")


# Adding the theoretical line from Question 1
x_range = np.array([df['x'].min(), df['x'].max()])
y_line = beta0 + beta1 * x_range  # y = beta0 + beta1 * x

# Add the theoretical line (dotted and orange) to the plot
fig.add_scatter(x=x_range, y=y_line, mode='lines',
                name=str(beta0)+' + '+str(beta1)+' * x', 
                line=dict(dash='dot', color='orange'))

# Show the updated plot with both the fitted and theoretical lines
fig.show()
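
The sampling-variation point in the explanation above can be sketched by repeating the simulation many times and refitting a line each time. This is only an illustration under stated assumptions: the seed and the 1000 repetitions are arbitrary, and `np.polyfit` with `deg=1` is used here as a lightweight least squares stand-in for `smf.ols(...).fit()`.

```python
import numpy as np

# Same theoretical parameters as in the cells above
beta0, beta1, sigma, n = 2, 1.5, 1, 100

rng = np.random.default_rng(0)   # illustrative seed
n_repetitions = 1000             # illustrative number of simulated datasets

slopes = []
intercepts = []
for _ in range(n_repetitions):
    # Draw a fresh sample from the same theoretical model
    x_sim = rng.uniform(0, 10, size=n)
    y_sim = beta0 + beta1 * x_sim + rng.normal(0, sigma, size=n)

    # Least squares straight-line fit; coefficients come back as [slope, intercept]
    slope_hat, intercept_hat = np.polyfit(x_sim, y_sim, deg=1)
    slopes.append(slope_hat)
    intercepts.append(intercept_hat)

slopes = np.array(slopes)
intercepts = np.array(intercepts)

# The true line never changes; the fitted lines scatter around it
print("true slope:", beta1, "| mean fitted slope:", slopes.mean(), "| sd:", slopes.std())
print("true intercept:", beta0, "| mean fitted intercept:", intercepts.mean(), "| sd:", intercepts.std())
```

Each fitted slope and intercept is a little different, but they cluster around the fixed values of the theoretical line, which is the difference in nature between the two lines.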


##### Interaction summary for Q1-Q3

Here’s a summary of our interaction for your homework on Simple Linear Regression:

1. **Theoretical Model Explanation and Code (Question 1)**:
   - We discussed the components of the Simple Linear Regression (SLR) model, including predictor $x$, outcome $y$, slope $\beta_1$, intercept $\beta_0$, and error term $\epsilon$.
   - A Python code snippet demonstrated creating the data using arbitrary values for these parameters, with errors sampled from a normal distribution, then visualizing data points and the theoretical line with Plotly.

2. **Fitted Model Creation and Explanation (Question 2)**:
   - We used `statsmodels.formula.api` to fit a model to the simulated data and explored various outputs:
     - **`fitted_model.summary()`**: Provides a statistical summary.
     - **`fitted_model.params`**: Lists the estimated coefficients (intercept and slope).
     - **`fitted_model.rsquared`**: Shows the R-squared value, indicating model fit quality.
   - Plotly was used to visualize the simulated data points and fitted model line.

3. **Comparison of Theoretical and Fitted Lines (Question 3)**:
   - We added the theoretical line (from Question 1) to the plot with the fitted line (from Question 2) for comparison. 
   - Key Insight: The theoretical line represents the true relationship, while the fitted line is an approximation influenced by random errors in the data. Differences between the two lines highlight the impact of noise on model fitting.

This interaction provided a comprehensive overview of theoretical versus fitted models in SLR, using Python and visualization to illustrate differences. If you need further clarifications, feel free to ask!

### 4. Explain how fitted_model.fittedvalues are derived on the basis of fitted_model.summary().tables[1] (or more specifically fitted_model.params or fitted_model.params.values)

Session link for Q4: https://chatgpt.com/share/672a37b9-2e10-8012-975d-f822e512c9b3

In [9]:
fitted_model.fittedvalues

0     14.008395
1      9.544463
2     10.679135
3      9.389855
4      4.847996
        ...    
95    13.138838
96    10.334611
97    11.211078
98     8.275796
99     5.649925
Length: 100, dtype: float64

In [10]:
fitted_model.params

Intercept    1.865011
x            1.527832
dtype: float64

In [8]:
fitted_model.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.8650 | 0.189 | 9.866 | 0.000 | 1.490 | 2.240 |
| x | 1.5278 | 0.031 | 48.563 | 0.000 | 1.465 | 1.590 |


Fitted values are calculated from the fitted model: each observed predictor value $x_i$ is multiplied by the fitted slope coefficient and then added to the fitted intercept, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. With the estimates shown above, this is $\hat{y}_i = 1.865011 + 1.527832 \, x_i$. The values in `fitted_model.fittedvalues` are the results of applying this equation to each of the 100 predictor values in the data.
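
As a check on this description, the sketch below recomputes the fitted values by hand from `fitted_model.params` and compares them with `fitted_model.fittedvalues`. The data are re-simulated here with an arbitrary seed so the block runs on its own, which means the estimates will not match the numbers printed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Re-simulate data from the same theoretical model (illustrative seed)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 1.5 * x + rng.normal(0, 1, size=100)
df = pd.DataFrame({'x': x, 'y': y})

fitted_model = smf.ols("y ~ x", data=df).fit()

# Pull the estimated intercept and slope out of the fitted model
intercept_hat = fitted_model.params['Intercept']
slope_hat = fitted_model.params['x']

# Apply the fitted equation to every observed predictor value
manual_fitted = intercept_hat + slope_hat * df['x']

# The hand-computed values should agree with statsmodels' fitted values
print(np.allclose(manual_fitted, fitted_model.fittedvalues))
```

The same numbers are available positionally through `fitted_model.params.values`, where the first entry is the intercept and the second is the slope.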

##### Interaction summary for Q4:
Here's a brief summary of our interaction:

You asked about interpreting a question related to Simple Linear Regression in Python, specifically about explaining how `fitted_model.fittedvalues` are derived using model parameters. I clarified that the question requires illustrating how fitted values (predicted values) are calculated using the estimated intercept and slope from the fitted model, contrasting them with theoretical values. I explained that `fitted_model.fittedvalues` provides these predicted values based on the regression equation, allowing you to assess model predictions against actual values.

### 5. Explain concisely in your own words what line is chosen for the fitted model based on observed data using the "ordinary least squares" method (as is done by trendline='ols' and smf.ols(...).fit()) and why it requires "squares"

Session link for Q5: https://chatgpt.com/share/672a3b94-c000-8012-99c0-aa18588c8bc4

The chosen line is the one with the minimum sum of squared residuals, that is, the smallest sum of the squared vertical distances between each data point and the line. Squaring removes the sign of each distance, so points above and below the line cannot cancel each other out, and it penalizes large deviations more heavily than small ones. We minimize this sum because we want the line that is closest overall to the observed data, which gives our best estimate of the population line when the error terms are assumed to be normally distributed.
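
A small numerical sketch of this idea follows. It uses the standard closed-form least squares formulas for the slope and intercept rather than `smf.ols`, and the helper name `sum_of_squared_residuals`, the seed, and the 0.5 shift are assumptions made for illustration.

```python
import numpy as np

# Simulated data from the same theoretical model (illustrative seed)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 1.5 * x + rng.normal(0, 1, size=100)

def sum_of_squared_residuals(intercept, slope):
    """Sum of squared vertical distances between the data and a candidate line."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

# Closed-form least squares estimates for a straight line
slope_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_hat = y.mean() - slope_hat * x.mean()

# Compare the least squares line with two other candidate lines
ssr_ols = sum_of_squared_residuals(intercept_hat, slope_hat)
ssr_true = sum_of_squared_residuals(2, 1.5)                        # theoretical line
ssr_shifted = sum_of_squared_residuals(intercept_hat + 0.5, slope_hat)  # fitted line moved up

print("least squares line:", ssr_ols)
print("theoretical line:  ", ssr_true)
print("shifted line:      ", ssr_shifted)

# Why squares: the raw residuals of the fitted line cancel out to about zero,
# so their plain sum cannot measure how far the points are from the line
raw_sum = np.sum(y - (intercept_hat + slope_hat * x))
print("sum of raw residuals:", raw_sum)
```

No other straight line gives a smaller sum of squares on this sample than the least squares line, not even the theoretical line that generated the data.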