Q1
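Y = beta0 + beta1*x + epsilon, where epsilon ~ Normal(mean 0, standard deviation sigma)
Y: outcome variable
x: predictor
beta0: intercept
beta1: slope
epsilon: random error term
sigma: standard deviation of the error term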

In [32]:
import numpy as np
import plotly.graph_objects as go
from scipy.stats import norm

# Set random seed for reproducibility
np.random.seed(0)

# Parameters for the model
n = 100  # number of data points
beta0 = 2  # intercept
beta1 = 0.5  # slope
sigma = 1  # standard deviation of the error term

# Generate predictor variable x as n evenly spaced values on [0, 10]
x = np.linspace(0, 10, n)

# Generate random errors from a normal distribution with mean 0 and standard deviation sigma
errors = norm.rvs(0, sigma, n)

# Generate outcome variable Y according to the linear model
Y = beta0 + beta1 * x + errors

# Create plotly figure
fig = go.Figure()

# Add the "theoretical" line (without noise)
fig.add_trace(go.Scatter(x=x, y=beta0 + beta1 * x, mode='lines', name='Theoretical Line'))

# Add simulated data points
fig.add_trace(go.Scatter(x=x, y=Y, mode='markers', name='Simulated Data'))

# Customize layout
fig.update_layout(
    title="Simple Linear Regression Model (Theoretical Line and Simulated Data)",
    xaxis_title="Predictor Variable x",
    yaxis_title="Outcome Variable Y"
)

fig.show()

In [None]:
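import pandas as pd
import statsmodels.formula.api as smf  # what is this library for?
import plotly.express as px  # this is a plotting library

# `df` was not defined above; assume it holds the simulated x and Y from the previous cell
df = pd.DataFrame({'x': x, 'Y': Y})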

# what are the following two steps doing?
model_data_specification = smf.ols("Y~x", data=df) 
fitted_model = model_data_specification.fit() 

# what do each of the following provide?
fitted_model.summary()  # simple explanation? 
fitted_model.summary().tables[1]  # simple explanation?
fitted_model.params  # simple explanation?
fitted_model.params.values  # simple explanation?
fitted_model.rsquared  # simple explanation?

# what two things does this add onto the figure?
df['Data'] = 'Data' # hack to add data to legend 
fig = px.scatter(df, x='x',  y='Y', color='Data', 
                 trendline='ols', title='Y vs. x')

# This is essentially what above `trendline='ols'` does
fig.add_scatter(x=df['x'], y=fitted_model.fittedvalues,
                line=dict(color='blue'), name="trendline='ols'")

fig.show()
x_range = np.array([df['x'].min(), df['x'].max()])
# beta0 and beta1 are assumed to be defined
y_line = beta0 + beta1 * x_range
fig.add_scatter(x=x_range, y=y_line, mode='lines',
                name=str(beta0)+' + '+str(beta1)+' * x', 
                line=dict(dash='dot', color='orange'))

fig.show()

Q3
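The true (theoretical) line, beta0 + beta1*x, is the actual relationship between x and Y that was used to generate the data.
The fitted trend line is an estimate of that relationship computed from one simulated sample, so it differs from the true line because of the random errors in that sample.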
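
The difference between the two lines can be illustrated by repeating the simulation many times. The sketch below is only an illustration: it reuses the parameter values from the Q1 cell, and it uses `np.polyfit` as a stand-in for the least squares fit (an assumption, not the `smf.ols` call used elsewhere in this notebook).

```python
import numpy as np

# Same settings as the Q1 simulation (true intercept 2, true slope 0.5, error sd 1)
n, beta0, beta1, sigma = 100, 2, 0.5, 1
x = np.linspace(0, 10, n)
rng = np.random.default_rng(0)

def simulate_and_fit(rng):
    """Simulate one data set from the true line and return the fitted (intercept, slope)."""
    Y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    slope, intercept = np.polyfit(x, Y, deg=1)  # coefficients come back highest degree first
    return intercept, slope

# Each row is the fitted line from one simulated sample
fits = np.array([simulate_and_fit(rng) for _ in range(1000)])

print("true (intercept, slope):      ", (beta0, beta1))
print("average fitted values:        ", fits.mean(axis=0))
print("spread (sd) of fitted values: ", fits.std(axis=0))
```

Every simulated sample gives a slightly different fitted line, while the true line never changes; on average the fitted lines should sit close to the true one.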

Q4
fitted_model.fittedvalues holds the values of the dependent variable predicted by the fitted linear regression model, one for each observation.
fitted_model.params gives the estimated coefficients: the intercept term and one coefficient for each independent variable. Together these define the fitted linear regression model.
To calculate fitted_model.fittedvalues, the model multiplies each coefficient by the value of its independent variable, sums these products, and adds the intercept.
SUMMARY
Coefficients (fitted_model.params):

fitted_model.params contains the estimated coefficients for each variable in the model, including the intercept. These coefficients are obtained from fitting the regression model to the data.
Fitted Values Calculation:

Each fitted (predicted) value in fitted_model.fittedvalues is calculated using the equation:
$$\hat{y}_i = \beta_0 + \beta_1 \cdot X_{i1} + \beta_2 \cdot X_{i2} + \cdots + \beta_n \cdot X_{in}$$
where $\beta_0$ is the intercept, and $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients for each independent variable.
Dot Product:

Internally, this is computed as a dot product of the coefficient vector (fitted_model.params) with each row in the independent variable matrix (including the intercept term).
https://chatgpt.com/share/67296b9c-04e8-8000-9985-8356a84cc239
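
The dot product description can be checked on a small example. This is a minimal sketch with made-up numbers: `np.linalg.lstsq` plays the role of the model fit and `params` plays the role of `fitted_model.params` (both are assumptions for illustration, not the statsmodels internals).

```python
import numpy as np

# Hypothetical small data set, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 2.4, 3.2, 3.4, 4.1])

# Design matrix: a column of ones (the intercept term) next to x
X = np.column_stack([np.ones_like(x), x])

# Least squares estimates: params[0] is the intercept, params[1] is the slope
params, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Fitted values as one dot product per row of the design matrix
fitted_dot = X @ params

# The same fitted values written out as intercept + slope * x
fitted_manual = params[0] + params[1] * x

print(np.allclose(fitted_dot, fitted_manual))
```

Both ways of computing the fitted values agree, because the matrix product is just the equation above applied to every row at once.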

Q8
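The null hypothesis is that there is no linear association between waiting and duration (the slope is 0). The p-value for the waiting coefficient is very small (smaller than 0.001), so we have very strong evidence against the null hypothesis, which suggests there is a linear association.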


In [33]:
import seaborn as sns
import statsmodels.formula.api as smf

# The "Classic" Old Faithful Geyser dataset
old_faithful = sns.load_dataset('geyser')

linear_for_specification = 'duration ~ waiting'
model = smf.ols(linear_for_specification, data=old_faithful)
fitted_model = model.fit()
fitted_model.summary()

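==============================================================================
Dep. Variable:               duration   R-squared:                       0.811
Model:                            OLS   Adj. R-squared:                  0.811
Method:                 Least Squares   F-statistic:                    1162.0
Date:                Thu, 07 Nov 2024   Prob (F-statistic):          8.13e-100
Time:                        01:08:04   Log-Likelihood:                -194.51
No. Observations:                 272   AIC:                             393.0
Df Residuals:                     270   BIC:                             400.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.8740      0.160    -11.702      0.000      -2.189      -1.559
waiting        0.0756      0.002     34.089      0.000       0.071       0.080
==============================================================================
Omnibus:                        4.133   Durbin-Watson:                   2.561
Prob(Omnibus):                  0.127   Jarque-Bera (JB):                3.173
Skew:                          -0.138   Prob(JB):                        0.205
Kurtosis:                       2.548   Cond. No.                        384.0
==============================================================================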
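
The `0.000` shown for `waiting` is a rounded p-value. As a rough check, it can be recomputed from the t statistic and the residual degrees of freedom reported in the summary. The sketch below assumes the usual two-sided t test for a regression coefficient and uses `scipy.stats.t`; the variable names are made up for illustration.

```python
from scipy import stats

t_waiting = 34.089   # t statistic for `waiting` in the coefficient table
f_stat = 1162.0      # F-statistic reported in the summary
df_resid = 270       # Df Residuals reported in the summary

# With a single predictor, the F-statistic is the square of the slope's t statistic
print(t_waiting ** 2, f_stat)

# Two-sided p-value: chance of a t statistic at least this extreme if the slope were 0
p_value = 2 * stats.t.sf(abs(t_waiting), df_resid)
print(p_value)
```

The p-value is a tiny positive number rather than exactly zero, which is why it is better described as "smaller than 0.001".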


In [34]:
import plotly.express as px
import statsmodels.formula.api as smf


short_wait_limit = 62 # 64 # 66 #
short_wait = old_faithful.waiting < short_wait_limit

print(smf.ols('duration ~ waiting', data=old_faithful[short_wait]).fit().summary().tables[1])

# Create a scatter plot with a linear regression trendline
fig = px.scatter(old_faithful[short_wait], x='waiting', y='duration', 
                 title="Old Faithful Geyser Eruptions for short wait times (<"+str(short_wait_limit)+")", 
                 trendline='ols')

fig.show() # USE `fig.show(renderer="png")` FOR ALL GitHub and MarkUs SUBMISSIONS

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.6401      0.309      5.306      0.000       1.025       2.255
waiting        0.0069      0.006      1.188      0.238      -0.005       0.019


Q9
The p-value for the waiting coefficient is 0.238, which is bigger than 0.1, so we have no evidence against the null hypothesis of no linear association. For short wait times only (waiting < 62), the data provide no evidence of a relationship between duration and wait time like the one seen in the full data set.

Q11
smf.ols('duration ~ waiting', data=old_faithful): this model is fit to the full data set.
smf.ols('duration ~ waiting', data=old_faithful[short_wait]): this model is fit only to observations with waiting smaller than 68, which is a subset of the data.
smf.ols('duration ~ waiting', data=old_faithful[long_wait]): this model is fit only to observations with waiting bigger than 68, which is a subset of the data.
smf.ols('duration ~ C(kind, Treatment(reference="short"))', data=old_faithful): the independent variable of this model specification is an indicator variable that only has the value 0 (short) or 1 (long).

In [35]:
from IPython.display import display

display(smf.ols('duration ~ C(kind, Treatment(reference="short"))', data=old_faithful).fit().summary().tables[1])

fig = px.box(old_faithful, x='kind', y='duration', 
             title='duration ~ kind',
             category_orders={'kind': ['short', 'long']})
fig.show() # USE `fig.show(renderer="png")` FOR ALL GitHub and MarkUs SUBMISSIONS

                                                    coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                         2.0943      0.041     50.752      0.000       2.013       2.176
C(kind, Treatment(reference="short"))[T.long]     2.2036      0.052     42.464      0.000       2.101       2.306


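The p-value is reported as 0.000 (smaller than 0.001, not exactly 0), so we have very strong evidence against the null hypothesis of no difference in average duration between the two groups. This means there is a difference between the "short" and "long" groups.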
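
To see what the two coefficients of an indicator variable model mean, here is a minimal sketch on made-up numbers (the durations below are hypothetical, not the Old Faithful data, and `np.linalg.lstsq` stands in for the `smf.ols` fit).

```python
import numpy as np

# Hypothetical durations for two groups, for illustration only
short = np.array([1.8, 2.0, 2.2, 2.1])
long_ = np.array([4.1, 4.4, 4.3, 4.6, 4.2])

duration = np.concatenate([short, long_])
# Indicator variable: 0 for the reference group ("short"), 1 for "long"
is_long = np.concatenate([np.zeros(len(short)), np.ones(len(long_))])

# Least squares fit of duration on an intercept and the indicator
X = np.column_stack([np.ones_like(is_long), is_long])
(intercept, coef_long), *_ = np.linalg.lstsq(X, duration, rcond=None)

print(intercept, short.mean())                 # intercept vs. mean of the reference group
print(coef_long, long_.mean() - short.mean())  # coefficient vs. difference in group means
```

Read this way, the Intercept of 2.0943 in the table above is the average duration of the "short" group, and the coefficient 2.2036 is the estimated difference in average duration between the "long" and "short" groups.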