In [None]:
Simple Linear Regression (SLR) is a technique used to model the relationship between one dependent variable (Y) and one independent variable (X). 

In [None]:
# Explanation:
# Simple Linear Regression (SLR) is a statistical method that models the relationship between a dependent variable (outcome) and an independent variable (predictor). 
# The equation for SLR is:
#
#     Y = beta_0 + beta_1 * X + epsilon
#
# Where:
# - `Y` is the dependent variable (outcome).
# - `X` is the independent variable (predictor).
# - `beta_0` (intercept) is the expected value of `Y` when `X` is zero.
# - `beta_1` (slope) is the expected change in `Y` for a one-unit increase in `X`.
# - `epsilon` is the error term: the random deviation of `Y` from the true line `beta_0 + beta_1 * X`, assumed to be normally distributed with mean zero.

# In SLR, we assume that for any given value of `X`, the `Y` values are drawn from a normal distribution centered at `beta_0 + beta_1 * X` with a constant variance.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy.stats import norm

# Set random seed for reproducibility
np.random.seed(42)

# Parameters for the regression model
beta_0 = 5  # Intercept
beta_1 = 2  # Slope
sigma = 1.5  # Standard deviation of the error term

# Generate sample data
X = np.linspace(0, 10, 100)  # Predictor variable
epsilon = np.random.normal(0, sigma, len(X))  # Random error term
Y = beta_0 + beta_1 * X + epsilon  # Outcome variable

# Create a DataFrame for use with statsmodels
data = pd.DataFrame({"X": X, "Y": Y})

# Fit the Simple Linear Regression model using statsmodels
model = smf.ols(formula='Y ~ X', data=data).fit()

# Print the summary of the fitted model
print(model.summary())

# Plot the generated data and the fitted regression line
plt.figure(figsize=(10, 6))
plt.scatter(data["X"], data["Y"], color='blue', label='Generated Data')
plt.plot(data["X"], model.fittedvalues, color='green', label='Fitted Regression Line')
plt.plot(data["X"], beta_0 + beta_1 * X, color='red', linestyle='--', label='True Regression Line (from Question 1)')
plt.title('Fitted Simple Linear Regression Model with True Line Overlay')
plt.xlabel('Predictor Variable (X)')
plt.ylabel('Outcome Variable (Y)')
plt.legend()
plt.show()

# Display a histogram of the residuals (the sample estimates of the error terms) to check whether they look approximately normal
residuals = model.resid
plt.figure(figsize=(8, 5))
plt.hist(residuals, bins=20, density=True, alpha=0.6, color='gray')

# Plot the normal distribution curve for comparison
x_axis = np.linspace(min(residuals), max(residuals), 100)
plt.plot(x_axis, norm.pdf(x_axis, 0, np.std(residuals)), color='red')
plt.title('Histogram of Residuals with Normal Distribution Curve')
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.show()

# Explanation of the two lines:
# The plot now includes two lines:
# 1. **True Regression Line (Red Dashed Line)**: This line represents the theoretical relationship defined by `beta_0` and `beta_1` used to generate the data. It shows the true underlying relationship without the influence of random error.
# 2. **Fitted Regression Line (Green Line)**: This line is the result of fitting the model to the simulated data. It represents the estimated relationship based on the sample data and can vary due to random sampling variation.
# The gap between the two lines reflects random sampling variation: the true line is fixed by the parameters, while the fitted line is estimated from one noisy sample and would shift slightly if the data were regenerated.
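
In [None]:
The sampling variation described above can be made concrete by repeating the simulation many times and refitting the line each time. The sketch below is an illustration only, not part of the original analysis. It reuses beta_0 = 5, beta_1 = 2, and sigma = 1.5 from the cell above, while the helper name simulate_slopes, the number of repetitions n_sims, and the use of np.polyfit in place of statsmodels are assumptions made for brevity.

```python
import numpy as np

def simulate_slopes(beta_0=5.0, beta_1=2.0, sigma=1.5, n=100, n_sims=1000, seed=0):
    """Refit a straight line to fresh noisy samples and collect the fitted slopes."""
    rng = np.random.default_rng(seed)
    X = np.linspace(0, 10, n)
    slopes = np.empty(n_sims)
    for i in range(n_sims):
        # Same true line every time; only the random errors change
        Y = beta_0 + beta_1 * X + rng.normal(0, sigma, n)
        # np.polyfit with deg=1 returns [slope, intercept] from a least-squares fit
        slopes[i], _ = np.polyfit(X, Y, 1)
    return slopes

slopes = simulate_slopes()
print(f"mean of fitted slopes: {slopes.mean():.3f}")
print(f"std of fitted slopes:  {slopes.std():.3f}")
```

The fitted slopes scatter around the true slope of 2; their spread is the sampling variation that separates any single fitted line from the true line.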


In [None]:
The fitted_model.fittedvalues of a statsmodels Simple Linear Regression model are computed from the model's estimated parameters, fitted_model.params: the estimated intercept (\hat{\beta}_0) and the estimated slope (\hat{\beta}_1).

How fitted_model.fittedvalues are Derived:
The fitted_model.params attribute holds the estimated coefficients for the regression model, which are computed from the sample data during model fitting.
The fittedvalues are calculated by applying the estimated regression equation to the predictor values X:

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i

where X represents the independent variable data.

Relationship to fitted_model.summary().tables[1]:
The fitted_model.summary().tables[1] displays the parameter estimates (\hat{\beta}_0 and \hat{\beta}_1), their standard errors, t-values, and p-values. These estimates are used to form the fittedvalues by substituting them into the linear regression equation.
fitted_model.params.values directly gives you an array of the parameter values: [intercept, slope].
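
The relationship between params and fittedvalues can be checked directly. The following is a minimal sketch on simulated data; the DataFrame name df and the simulation settings are assumptions chosen for illustration and are not taken from any particular dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (assumed settings, for illustration only)
rng = np.random.default_rng(42)
df = pd.DataFrame({"X": np.linspace(0, 10, 100)})
df["Y"] = 5 + 2 * df["X"] + rng.normal(0, 1.5, len(df))

fitted_model = smf.ols("Y ~ X", data=df).fit()

# params is a pandas Series indexed by term name: 'Intercept' and 'X'
b0_hat = fitted_model.params["Intercept"]
b1_hat = fitted_model.params["X"]

# Rebuild the fitted values by hand from the estimated coefficients
manual_fitted = b0_hat + b1_hat * df["X"]

# Compare the hand-built values with the ones statsmodels stores
print(np.allclose(manual_fitted, fitted_model.fittedvalues))
```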

In [None]:
The line chosen for the fitted model using the "ordinary least squares" (OLS) method is the one that minimizes the sum of the squared differences between the observed data points and the predicted values from the line. This line is often called the "best fit" line.

Why "Squares" are Used:
The reason OLS uses the squares of the differences (residuals) is to ensure that all deviations contribute positively to the total error and to give larger deviations disproportionately more weight. Squaring emphasizes larger errors, making the model more sensitive to outliers. This approach helps identify the line that results in the smallest overall error when predicting Y from X.

In essence, OLS chooses the intercept and slope that minimize the sum of squared residuals. Because large residuals are penalized heavily, the resulting line stays close to the bulk of the data, although a few extreme points can pull it noticeably.
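
To see the minimization at work, the sketch below computes the least-squares intercept and slope from the standard closed-form expressions for a single predictor and confirms that nudging either value increases the sum of squared residuals. The simulated data and the helper name sum_squared_residuals are assumptions for illustration.

```python
import numpy as np

# Simulated data (assumed settings, for illustration only)
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 100)
Y = 5 + 2 * X + rng.normal(0, 1.5, X.size)

def sum_squared_residuals(b0, b1):
    """Sum of squared vertical distances between the data and the line b0 + b1*X."""
    return np.sum((Y - (b0 + b1 * X)) ** 2)

# Closed-form least-squares estimates for a single predictor
b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0_hat = Y.mean() - b1_hat * X.mean()

best = sum_squared_residuals(b0_hat, b1_hat)

# Any nudge to the intercept or the slope gives a larger sum of squares
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sum_squared_residuals(b0_hat + db0, b1_hat + db1) > best

print(f"b0_hat = {b0_hat:.3f}, b1_hat = {b1_hat:.3f}, SSR = {best:.2f}")
```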


In [None]:
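The first expression being referred to, fitted_model.rsquared, is the coefficient of determination, or R-squared. It compares the variation left over after fitting the line (the sum of squared residuals) with the total variation of Y around its mean: R-squared = 1 - (sum of squared residuals) / (total sum of squares). This is why it is read as "the proportion of variation in (outcome) Y explained by the model": in Simple Linear Regression, a value near 1 means the fitted line accounts for most of the variation in Y, while a value near 0 means it accounts for very little.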
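
R-squared is defined as 1 minus the ratio of the sum of squared residuals to the total sum of squares of Y around its mean. The sketch below computes it by hand on simulated data and compares it with the squared correlation between X and Y, which it equals when there is one predictor and an intercept. The data settings and variable names are assumptions for illustration, and np.polyfit stands in for the statsmodels fit.

```python
import numpy as np

# Simulated data (assumed settings, for illustration only)
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 100)
Y = 5 + 2 * X + rng.normal(0, 1.5, X.size)

# Least-squares line (np.polyfit returns [slope, intercept] for deg=1)
b1_hat, b0_hat = np.polyfit(X, Y, 1)
fitted = b0_hat + b1_hat * X

ss_total = np.sum((Y - Y.mean()) ** 2)   # total variation in Y
ss_resid = np.sum((Y - fitted) ** 2)     # variation left unexplained by the line
r_squared = 1 - ss_resid / ss_total

# With one predictor and an intercept, R-squared equals the squared correlation
corr_squared = np.corrcoef(X, Y)[0, 1] ** 2

print(f"R-squared: {r_squared:.4f}, squared correlation: {corr_squared:.4f}")
```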


In [None]:
To evaluate the compatibility of the Simple Linear Regression model assumptions with a dataset, we must consider several assumptions integral to SLR. Here are a couple of key assumptions that may not align with certain data characteristics:

1. Linearity Assumption:
The SLR model assumes a linear relationship between the independent variable X and the dependent variable Y. If the data suggests a non-linear pattern (e.g., a curve or other complex relationship), the linear model is not appropriate.

2. Homoscedasticity (Constant Variance of Errors):
SLR assumes that the variance of the residuals (errors) is constant across all levels of X. If the spread of the residuals increases or decreases as X changes (e.g., forming a funnel shape), this violates the homoscedasticity assumption.

Checking Compatibility with Example Data:
Visual Inspection for Linearity: Plotting Y versus X and observing if the data follows a roughly straight line helps identify non-linearity.
Residual Plot for Homoscedasticity: Plotting residuals against fitted values can reveal if the variance of errors is consistent. A pattern or change in spread suggests heteroscedasticity.
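
As a rough numeric companion to the residual plot, the sketch below compares the spread of the residuals in the lower and upper halves of the fitted values. This is an informal check and not a formal test; the helper name spread_ratio and both simulated datasets are assumptions made for illustration.

```python
import numpy as np

def spread_ratio(X, Y):
    """Residual std in the upper half of fitted values divided by that in the lower half."""
    b1_hat, b0_hat = np.polyfit(X, Y, 1)
    fitted = b0_hat + b1_hat * X
    resid = Y - fitted
    order = np.argsort(fitted)
    half = len(order) // 2
    lower, upper = resid[order[:half]], resid[order[half:]]
    return upper.std() / lower.std()

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 400)

# Constant error variance: the ratio should be close to 1
Y_const = 5 + 2 * X + rng.normal(0, 1.5, X.size)
# Error standard deviation grows with X (a funnel shape): the ratio is well above 1
Y_funnel = 5 + 2 * X + rng.normal(0, 0.5 * X, X.size)

ratio_const = spread_ratio(X, Y_const)
ratio_funnel = spread_ratio(X, Y_funnel)
print(f"constant-variance ratio: {ratio_const:.2f}")
print(f"funnel-shaped ratio:     {ratio_funnel:.2f}")
```

A ratio far from 1 points in the same direction as a funnel shape in the residual plot, but the plot remains the better tool for spotting curved or irregular patterns.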

In [None]:
# Null hypothesis for Simple Linear Regression:
# H0: beta_1 = 0 (There is no linear association between X and Y)
# This hypothesis states that the slope of the regression line is zero, implying that changes in X do not predict changes in Y.

# Using the fitted model to assess the evidence against H0:
# The p-value associated with the slope coefficient (beta_1) from model.summary().tables[1] indicates the strength of the evidence.
# If the p-value is below a certain threshold (e.g., 0.05), we reject H0, suggesting there is a statistically significant linear relationship.

# Interpretation for the Old Faithful Geyser dataset:
# The same test applies when a Simple Linear Regression model is fitted to the geyser data: check whether the p-value for beta_1 is low enough to reject the null hypothesis.
# A low p-value suggests that the waiting time (predictor) and the duration of eruptions (outcome) have a significant linear relationship.
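
In [None]:
The slope test described above can be illustrated by contrasting data generated with a real slope against data generated with no slope. This is a sketch on simulated data, not the geyser data. It uses scipy.stats.linregress in place of statsmodels (for one predictor it reports the same two-sided p-value for H0: slope = 0), and the variable names and simulation settings are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
X = np.linspace(0, 10, 100)

# One dataset with a real slope (beta_1 = 2) and one with no slope (beta_1 = 0)
Y_slope = 5 + 2 * X + rng.normal(0, 1.5, X.size)
Y_flat = 5 + 0 * X + rng.normal(0, 1.5, X.size)

for label, Y in [("beta_1 = 2", Y_slope), ("beta_1 = 0", Y_flat)]:
    result = stats.linregress(X, Y)
    # result.pvalue is the two-sided p-value for the null hypothesis that the slope is zero
    print(f"{label}: slope = {result.slope:.3f}, p-value = {result.pvalue:.3g}")
```

With a true slope of 2 the p-value is extremely small, so H0 is rejected. With a true slope of 0 the p-value varies from sample to sample and falls below 0.05 only about 5% of the time.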


In [None]:
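# Restricting the dataset to short wait times and evaluating evidence against the null hypothesis.
# The synthetic X values span 0 to 10, so the limits of 62, 64, and 66 minutes are scaled by 1/10 here (an assumption);
# with the unscaled values every row satisfies X < limit and all three fits would be identical.
short_wait_limits = [6.2, 6.4, 6.6]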
for limit in short_wait_limits:
    short_data = data[data["X"] < limit]
    short_model = smf.ols(formula='Y ~ X', data=short_data).fit()
    print(f"\nSummary for wait times less than {limit} minutes:")
    print(short_model.summary())

    # Plot the restricted data and fitted regression line
    plt.figure(figsize=(10, 6))
    plt.scatter(short_data["X"], short_data["Y"], color='blue', label=f'Data for wait < {limit} min')
    plt.plot(short_data["X"], short_model.fittedvalues, color='green', label='Fitted Regression Line')
    plt.title(f'Fitted Simple Linear Regression Model for Wait < {limit} min')
    plt.xlabel('Predictor Variable (X)')
    plt.ylabel('Outcome Variable (Y)')
    plt.legend()
    plt.show()

    # Check the p-value for beta_1 to assess evidence against H0
    p_value = short_model.pvalues['X']
    if p_value < 0.05:
        print(f"Evidence suggests a significant linear relationship for wait < {limit} min (p-value: {p_value:.4f})")
    else:
        print(f"No significant linear relationship detected for wait < {limit} min (p-value: {p_value:.4f})")


In [None]:
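# Considering only long wait times and analyzing the relationship.
# A limit of 160 exceeds every synthetic X value (0 to 10) and would leave no rows to fit; 160 appears to be the
# long-wait sample size (n=160) rather than a threshold. A limit of 6.8 on the synthetic scale is assumed here instead.
long_wait_limit = 6.8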
long_data = data[data["X"] >= long_wait_limit]
long_model = smf.ols(formula='Y ~ X', data=long_data).fit()
print(f"\nSummary for wait times greater than or equal to {long_wait_limit} minutes:")
print(long_model.summary())

# Plot the long wait times data and the fitted regression line
plt.figure(figsize=(10, 6))
plt.scatter(long_data["X"], long_data["Y"], color='blue', label=f'Data for wait >= {long_wait_limit} min')
plt.plot(long_data["X"], long_model.fittedvalues, color='green', label='Fitted Regression Line')
plt.title(f'Fitted Simple Linear Regression Model for Wait >= {long_wait_limit} min')
plt.xlabel('Predictor Variable (X)')
plt.ylabel('Outcome Variable (Y)')
plt.legend()
plt.show()

# Check the p-value for beta_1 to assess evidence against H0 for long wait times
p_value_long = long_model.pvalues['X']
if p_value_long < 0.05:
    print(f"Evidence suggests a significant linear relationship for wait >= {long_wait_limit} min (p-value: {p_value_long:.4f})")
else:
    print(f"No significant linear relationship detected for wait >= {long_wait_limit} min (p-value: {p_value_long:.4f})")


In [None]:
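# Add an indicator variable for short (< 6.8) and long (>= 6.8) wait times; on the synthetic 0-10 scale, 6.8 stands in for 68 minutes
# The explicit category order makes 'short' the baseline, so the fitted coefficient is named C(wait_category)[T.long]
data['wait_category'] = pd.Categorical(np.where(data['X'] < 6.8, 'short', 'long'), categories=['short', 'long'])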

# Fit the Simple Linear Regression model using statsmodels with the indicator variable
indicator_model = smf.ols(formula='Y ~ C(wait_category)', data=data).fit()

# Print the summary of the fitted model with the indicator variable
print(indicator_model.summary())

# Plot the data with different categories and fitted lines
plt.figure(figsize=(10, 6))
plt.scatter(data[data['wait_category'] == 'short']["X"], data[data['wait_category'] == 'short']["Y"], color='blue', label='Short Wait Times')
plt.scatter(data[data['wait_category'] == 'long']["X"], data[data['wait_category'] == 'long']["Y"], color='orange', label='Long Wait Times')
plt.axhline(y=indicator_model.params['Intercept'], color='green', linestyle='--', label='Fitted Line (Short Wait)')
plt.axhline(y=indicator_model.params['Intercept'] + indicator_model.params['C(wait_category)[T.long]'], color='red', linestyle='-', label='Fitted Line (Long Wait)')
plt.title('Fitted Simple Linear Regression Model with Indicator Variable for Wait Times')
plt.xlabel('Predictor Variable (X)')
plt.ylabel('Outcome Variable (Y)')
plt.legend()
plt.show()

# Explanation:
# The model includes an indicator variable (wait_category) that differentiates between short and long wait times.
# The parameter `C(wait_category)[T.long]` represents the difference in the mean outcome (Y) for long wait times compared to short wait times.
# The green dashed line represents the fitted value for short wait times, while the red solid line represents the fitted value for long wait times.
# This helps us understand if there is a significant difference in the mean outcome based on the length of the wait time.

# Interpretation of the p-value for `C(wait_category)[T.long]`:
# - If the p-value is less than a chosen threshold (e.g., 0.05), we reject the null hypothesis of no difference and conclude that the mean outcome (Y) differs between short and long wait times.
# - Otherwise, we fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference, which is not the same as showing that there is none.
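
In [None]:
The reading of the indicator coefficient as a difference in group means can be verified with a small least-squares fit. The sketch below uses plain NumPy rather than statsmodels; the simulated data, the 6.8 cut-off on a 0 to 10 scale, and the variable names are assumptions for illustration.

```python
import numpy as np

# Simulated data (assumed settings, for illustration only)
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 100)
Y = 5 + 2 * X + rng.normal(0, 1.5, X.size)

# Indicator: 1 for the "long" group (X >= 6.8), 0 for the "short" group
is_long = (X >= 6.8).astype(float)

# Design matrix with an intercept column and the indicator column
design = np.column_stack([np.ones_like(X), is_long])
(intercept, long_effect), *_ = np.linalg.lstsq(design, Y, rcond=None)

mean_short = Y[is_long == 0].mean()
mean_long = Y[is_long == 1].mean()

print(f"intercept = {intercept:.3f}, mean of short group = {mean_short:.3f}")
print(f"indicator coefficient = {long_effect:.3f}, difference in means = {mean_long - mean_short:.3f}")
```

The intercept matches the mean of the baseline (short) group, and the indicator coefficient matches the difference between the two group means.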


In [None]:
# Display a histogram of the residuals (the sample estimates of the error terms) to check whether they look approximately normal
residuals = indicator_model.resid
plt.figure(figsize=(8, 5))
plt.hist(residuals, bins=20, density=True, alpha=0.6, color='gray')

# Plot the normal distribution curve for comparison
x_axis = np.linspace(min(residuals), max(residuals), 100)
plt.plot(x_axis, norm.pdf(x_axis, 0, np.std(residuals)), color='red')
plt.title('Histogram of Residuals with Normal Distribution Curve')
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.show()

# Evaluation of histograms and the normality assumption:
# - The histogram of residuals for the indicator model helps judge whether the error terms are approximately normally distributed.
# - If the histogram closely follows the overlaid normal curve (red line), the normality assumption is plausible.
# - If the histogram deviates clearly from the normal curve (e.g., skewness, heavy tails, outliers, or multiple modes), the normality assumption may not hold.

# Among the different models and histograms:
# - In the current histogram, the residuals appear to roughly match the normal curve, suggesting that the normality assumption may be reasonable for this model.
# - For any other model whose residual histogram is skewed (left or right), heavy-tailed, or multimodal, or shows obvious outliers, the assumption of normally distributed error terms is not supported, which may point to a problem with the model specification.
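
# A visual check of a histogram can be backed up with a numerical one. The sketch
# below is a minimal illustration, not part of the original analysis: it applies the
# Shapiro-Wilk test and the Q-Q correlation from `scipy.stats` to two hypothetical
# residual vectors (`normal_resid` and `skewed_resid` are simulated stand-ins for
# `indicator_model.resid`; the helper name `normality_report` is an assumption).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical residuals: one roughly normal, one clearly right-skewed
normal_resid = rng.normal(0, 1.5, 100)
skewed_resid = rng.exponential(1.5, 100) - 1.5  # centred near 0 but skewed


def normality_report(residuals):
    """Summarise two simple normality diagnostics for a residual vector."""
    # Shapiro-Wilk: a small p-value is evidence against normality
    shapiro_stat, shapiro_p = stats.shapiro(residuals)
    # Q-Q plot correlation: values close to 1 mean the points lie near a straight line
    (_, _), (_, _, qq_r) = stats.probplot(residuals, dist="norm")
    return {"shapiro_stat": shapiro_stat, "shapiro_p": shapiro_p, "qq_r": qq_r}


normal_report = normality_report(normal_resid)
skewed_report = normality_report(skewed_resid)
print("normal-like:", normal_report)
print("skewed:     ", skewed_report)
```

# With real model output, the same helper could be called as
# `normality_report(indicator_model.resid)`. A large Shapiro-Wilk p-value does not
# prove normality; it only means the test found no strong evidence against it.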


In [None]:
# Explanation:
# Simple Linear Regression (SLR) is a statistical method that models the relationship between a dependent variable (outcome) and an independent variable (predictor). 
# The equation for SLR is:
#
#     Y = beta_0 + beta_1 * X + epsilon
#
# Where:
# - `Y` is the dependent variable (outcome).
# - `X` is the independent variable (predictor).
# - `beta_0` (intercept) is the expected value of `Y` when `X` is zero.
# - `beta_1` (slope) is the expected change in `Y` for a one-unit increase in `X`.
# - `epsilon` is the error term: the deviation of an observed `Y` from the line `beta_0 + beta_1 * X`. It is assumed to be normally distributed with mean zero and constant variance.

# In SLR, we assume that for a given value of `X`, the `Y` values are drawn from a normal distribution centered at `beta_0 + beta_1 * X` with the same variance for every `X`.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy.stats import norm

# Set random seed for reproducibility
np.random.seed(42)

# Parameters for the regression model
beta_0 = 5  # Intercept
beta_1 = 2  # Slope
sigma = 1.5  # Standard deviation of the error term

# Generate sample data
X = np.linspace(0, 10, 100)  # Predictor variable
epsilon = np.random.normal(0, sigma, len(X))  # Random error term
Y = beta_0 + beta_1 * X + epsilon  # Outcome variable

# Create a DataFrame for use with statsmodels
data = pd.DataFrame({"X": X, "Y": Y})

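# Add an indicator variable for short (X < 6.8) and long (X >= 6.8) wait times.
# An explicit category order makes 'short' the reference level, so the fitted
# coefficient is named `C(wait_category)[T.long]`. Plain strings would be sorted
# alphabetically, which would make 'long' the reference level instead.
data['wait_category'] = pd.Categorical(np.where(data['X'] < 6.8, 'short', 'long'), categories=['short', 'long'])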

# Fit the Simple Linear Regression model using statsmodels with the indicator variable
indicator_model = smf.ols(formula='Y ~ C(wait_category)', data=data).fit()

# Print the summary of the fitted model with the indicator variable
print(indicator_model.summary())

# Plot the data with different categories and fitted lines
plt.figure(figsize=(10, 6))
plt.scatter(data[data['wait_category'] == 'short']['X'], data[data['wait_category'] == 'short']['Y'], color='blue', label='Short Wait Times')
plt.scatter(data[data['wait_category'] == 'long']['X'], data[data['wait_category'] == 'long']['Y'], color='orange', label='Long Wait Times')
plt.axhline(y=indicator_model.params['Intercept'], color='green', linestyle='--', label='Fitted Line (Short Wait)')
plt.axhline(y=indicator_model.params['Intercept'] + indicator_model.params['C(wait_category)[T.long]'], color='red', linestyle='-', label='Fitted Line (Long Wait)')
plt.title('Fitted Simple Linear Regression Model with Indicator Variable for Wait Times')
plt.xlabel('Predictor Variable (X)')
plt.ylabel('Outcome Variable (Y)')
plt.legend()
plt.show()

# Explanation:
# The model includes an indicator variable (wait_category) that differentiates between short and long wait times.
# The parameter `C(wait_category)[T.long]` represents the difference in the mean outcome (Y) for long wait times compared to short wait times.
# The green dashed line represents the fitted value for short wait times, while the red solid line represents the fitted value for long wait times.
# This helps us understand if there is a significant difference in the mean outcome based on the length of the wait time.

# Interpretation of the p-value for `C(wait_category)[T.long]`:
# - If the p-value is below a pre-chosen significance level (e.g., 0.05), we reject the null hypothesis of equal means and conclude that the mean outcome (Y) differs between short and long wait times.
# - Otherwise, we fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference, which is not the same as showing that no difference exists.
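
# To make the meaning of the indicator coefficient concrete, the sketch below fits
# the same kind of model by ordinary least squares with plain NumPy. It is an
# illustration under assumptions, not the statsmodels computation itself: the data
# are re-simulated, and the names `is_long`, `design`, and `long_effect` are
# hypothetical. It shows that the intercept equals the mean of the short group, the
# indicator coefficient equals the difference in group means, and the coefficient's
# t-test matches a pooled-variance two-sample t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 5 + 2 * x + rng.normal(0, 1.5, x.size)
is_long = (x >= 6.8).astype(float)  # 1 for "long", 0 for "short" (reference)

# Design matrix: a column of ones (intercept) and the 0/1 indicator
design = np.column_stack([np.ones_like(x), is_long])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, long_effect = coef

# Standard error, t statistic, and two-sided p-value for the indicator coefficient
resid = y - design @ coef
dof = x.size - design.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(design.T @ design)
t_stat = long_effect / np.sqrt(cov[1, 1])
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Compare with group means and with a pooled-variance two-sample t-test
mean_short = y[is_long == 0].mean()
mean_long = y[is_long == 1].mean()
pooled = stats.ttest_ind(y[is_long == 1], y[is_long == 0], equal_var=True)

print(f"intercept = {intercept:.4f}, mean(short) = {mean_short:.4f}")
print(f"indicator coefficient = {long_effect:.4f}, mean(long) - mean(short) = {mean_long - mean_short:.4f}")
print(f"t = {t_stat:.4f} (pooled t-test: {pooled.statistic:.4f})")
```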

# Display a histogram of the residuals to check whether they look approximately normal
residuals = indicator_model.resid
plt.figure(figsize=(8, 5))
plt.hist(residuals, bins=20, density=True, alpha=0.6, color='gray')

# Plot the normal distribution curve for comparison
x_axis = np.linspace(min(residuals), max(residuals), 100)
plt.plot(x_axis, norm.pdf(x_axis, 0, np.std(residuals)), color='red')
plt.title('Histogram of Residuals with Normal Distribution Curve')
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.show()

# Evaluation of histograms and the normality assumption:
# - The histogram of residuals for the indicator model helps judge whether the error terms are approximately normally distributed.
# - If the histogram closely follows the overlaid normal curve (red line), the normality assumption is plausible.
# - If the histogram deviates clearly from the normal curve (e.g., skewness, heavy tails, outliers, or multiple modes), the normality assumption may not hold.

# Among the different models and histograms:
# - In the current histogram, the residuals appear to roughly match the normal curve, suggesting that the normality assumption may be reasonable for this model.
# - For any other model whose residual histogram is skewed (left or right), heavy-tailed, or multimodal, or shows obvious outliers, the assumption of normally distributed error terms is not supported, which may point to a problem with the model specification.

# Two-sample hypothesis testing using permutation test and bootstrap confidence interval
# Separate the data into two groups: short and long wait times
short_wait_data = data[data['wait_category'] == 'short']['Y']
long_wait_data = data[data['wait_category'] == 'long']['Y']

# Permutation test to assess the difference in means
observed_diff = long_wait_data.mean() - short_wait_data.mean()

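# Permutation test function: returns the two-sided p-value for the difference in means
def permutation_test(data1, data2, num_permutations=10000):
    # Compute the observed difference here rather than relying on a global variable
    observed = data2.mean() - data1.mean()
    combined = np.concatenate([data1, data2])  # pooled copy, safe to shuffle in place
    count = 0
    for _ in range(num_permutations):
        np.random.shuffle(combined)
        new_data1 = combined[:len(data1)]
        new_data2 = combined[len(data1):]
        new_diff = new_data2.mean() - new_data1.mean()
        # Count shuffles whose difference is at least as extreme as the observed one
        if abs(new_diff) >= abs(observed):
            count += 1
    return count / num_permutations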

p_value_permutation = permutation_test(short_wait_data.values, long_wait_data.values)
print(f"Permutation test p-value: {p_value_permutation:.4f}")

# Bootstrap confidence interval for the difference in means
num_bootstrap_samples = 10000
diff_means = []
for _ in range(num_bootstrap_samples):
    short_sample = np.random.choice(short_wait_data, size=len(short_wait_data), replace=True)
    long_sample = np.random.choice(long_wait_data, size=len(long_wait_data), replace=True)
    diff_means.append(long_sample.mean() - short_sample.mean())

# Calculate the 95% confidence interval
lower_bound = np.percentile(diff_means, 2.5)
upper_bound = np.percentile(diff_means, 97.5)
print(f"95% Bootstrap Confidence Interval for Difference in Means: ({lower_bound:.4f}, {upper_bound:.4f})")

# Interpretation:
# - The permutation test p-value is the proportion of label shufflings that produce a difference in means at least as extreme as the observed one, i.e., how surprising the observed difference would be if wait category had no effect on Y.
# - A p-value less than 0.05 suggests that the difference in means is statistically significant at the 5% level.
# - The bootstrap confidence interval gives a range of plausible values for the true difference in means. If the 95% interval excludes 0, this agrees with a significant difference at the 5% level.
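
# The permutation loop can also be written without an explicit Python loop. The
# sketch below is one possible vectorised variant, assuming NumPy 1.20 or later for
# `Generator.permuted`. The function name `permutation_p_value`, the `seed`
# argument, and the simulated groups are hypothetical. It also adds 1 to the
# numerator and denominator, a common convention that keeps the estimated p-value
# from being exactly zero.

```python
import numpy as np


def permutation_p_value(group_a, group_b, num_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the difference in means (b minus a)."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = group_b.mean() - group_a.mean()

    # One row per permutation; each row is shuffled independently
    combined = np.concatenate([group_a, group_b])
    shuffled = rng.permuted(np.tile(combined, (num_permutations, 1)), axis=1)

    n_a = group_a.size
    diffs = shuffled[:, n_a:].mean(axis=1) - shuffled[:, :n_a].mean(axis=1)
    extreme = np.count_nonzero(np.abs(diffs) >= abs(observed))
    return (extreme + 1) / (num_permutations + 1)


# Simulated groups: one pair with equal means, one pair with a large shift
rng = np.random.default_rng(1)
same = permutation_p_value(rng.normal(0, 1, 60), rng.normal(0, 1, 40))
shifted = permutation_p_value(rng.normal(0, 1, 60), rng.normal(3, 1, 40))
print(f"equal means:   p = {same:.4f}")
print(f"shifted means: p = {shifted:.4f}")
```

# With the notebook's data, the call would be
# `permutation_p_value(short_wait_data.values, long_wait_data.values)`.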


In [None]:
Yes