# Content

### Introduction to Inference about Slope

#### Theory
When we create a scatterplot and find the line of best fit (the least-squares regression line), we get an equation of the form **`ŷ = a + bx`**.
*   `b` is the **sample slope**. It's a statistic calculated from our sample data.
*   `a` is the **sample y-intercept**.

However, this is just an *estimate* based on one sample. We assume there is a "true" but unknown regression line for the entire population, with the equation **`y = α + βx`**.
*   **`β` (beta)** is the **true population slope**. This is the parameter we are interested in.
*   **`α` (alpha)** is the **true population y-intercept**.
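
To make the sample quantities concrete, the sketch below computes `a` and `b` directly from the standard least-squares formulas. The helper name `least_squares_line` and the five data points are hypothetical, chosen only for illustration.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the least-squares line passes through (x-bar, y-bar)
    a = y_bar - b * x_bar
    return a, b

# Hypothetical sample: five points that lie near, but not exactly on, a line
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(x, y)
print(f"y-hat = {a:.3f} + {b:.3f}x")
```

A different sample from the same population would give slightly different values of `a` and `b`, which is why they are treated as estimates of `α` and `β`.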

The core question of inference is: **Is the relationship we see in our sample (`b`) strong enough to conclude that there is a real linear relationship in the population (`β`)?**

What if there were actually no linear relationship at all between the two variables? In that case, the true population slope would be zero (`β = 0`). The goal of a significance test is to see if our sample slope `b` is far enough from zero that we can reject the idea that the true slope `β` is zero.

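The **sampling distribution of the slope** describes what the values of the sample slope `b` would look like if we took many, many random samples from the same population. Under the right conditions, this distribution is approximately Normal and centered at the true slope `β`. Because we have to estimate its spread from the data (using the standard error of the slope, `SE_b`), the standardized statistic `(b - β) / SE_b` follows a t-distribution with `n - 2` degrees of freedom.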
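
A short simulation can make the idea of a sampling distribution concrete. The sketch below assumes a population line with made-up values `alpha = 50`, `beta = 0.15`, and error standard deviation `sigma = 20`, draws many samples of size 20, and records the slope from each one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" population line and error SD (illustrative values only)
alpha, beta, sigma = 50.0, 0.15, 20.0
n, n_samples = 20, 5000

# Keep the x-values fixed so only the random errors change between samples
x = np.linspace(1500, 3500, n)
sxx = np.sum((x - x.mean()) ** 2)

slopes = np.empty(n_samples)
for i in range(n_samples):
    y = alpha + beta * x + rng.normal(0, sigma, size=n)
    # Least-squares slope for this sample
    slopes[i] = np.sum((x - x.mean()) * (y - y.mean())) / sxx

# With Normal errors and fixed x, the slopes have SD sigma / sqrt(Sxx)
theoretical_sd = sigma / np.sqrt(sxx)

print(f"Mean of sample slopes: {slopes.mean():.4f} (true slope = {beta})")
print(f"SD of sample slopes:   {slopes.std(ddof=1):.5f}")
print(f"Theoretical SD:        {theoretical_sd:.5f}")
```

The sample slopes should cluster around the true slope, with a spread close to the theoretical value.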

#### Hypotheses
*   **Null Hypothesis (H₀): `β = 0`**. This implies there is **no linear relationship** between the two variables in the population. Any slope we see in our sample is just due to random chance.
*   **Alternative Hypothesis (Hₐ):** Can be one of three forms:
    *   `β > 0` (There is a positive linear relationship).
    *   `β < 0` (There is a negative linear relationship).
    *   `β ≠ 0` (There is *some* linear relationship, either positive or negative).
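
Each form of the alternative hypothesis leads to a different P-value from the same test statistic, `t = (b - 0) / SE_b`, with `df = n - 2`. The sketch below shows one way to compute all three with `scipy.stats.t`. The function name `slope_p_value` and the summary values are hypothetical.

```python
from scipy import stats

def slope_p_value(b, se_b, n, alternative="two-sided"):
    """Return (t, P-value) for H0: beta = 0, using df = n - 2."""
    t_stat = (b - 0) / se_b
    df = n - 2
    if alternative == "greater":        # Ha: beta > 0
        p = stats.t.sf(t_stat, df)
    elif alternative == "less":         # Ha: beta < 0
        p = stats.t.cdf(t_stat, df)
    elif alternative == "two-sided":    # Ha: beta != 0
        p = 2 * stats.t.sf(abs(t_stat), df)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return t_stat, p

# Hypothetical summary values, chosen only for illustration
b, se_b, n = 0.8, 0.5, 12

for alt in ("greater", "less", "two-sided"):
    t_stat, p = slope_p_value(b, se_b, n, alt)
    print(f"Ha: {alt:<10} t = {t_stat:.2f}, P-value = {p:.4f}")
```

Note that the two-sided P-value is twice the smaller one-sided P-value, so the choice of alternative should be made before looking at the data.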

---

### Conditions for Inference on Slope

For our calculations (both tests and intervals) to be valid, a set of conditions must be met. These can be remembered with the acronym **LINER**.

*   **L - Linear:** The actual relationship between the variables in the population must be linear.
    *   **How to check:** Examine a **scatterplot** of the data. It should show a roughly linear pattern. Also, create a **residual plot** (a plot of the residuals vs. the x-values). The residual plot should show no obvious pattern, just random scatter around zero.

*   **I - Independent:** Individual observations must be independent of each other.
    *   **How to check:** Consider how the data were collected. When sampling without replacement, apply the **10% Rule**: the sample size `n` should be no more than 10% of the population size `N`.

*   **N - Normal:** The residuals must be approximately Normally distributed. This means that for any given x-value, the y-values are Normally distributed around the true regression line.
    *   **How to check:** Create a **histogram** or a **Normal probability plot** of the residuals. The histogram should be roughly symmetric and bell-shaped.

*   **E - Equal Variance (Homoscedasticity):** The standard deviation of the residuals must be the same for all x-values.
    *   **How to check:** Look at the **residual plot** again. The points should form a random band of consistent width around the zero line. There should be no "fanning" (cone shape) in either direction, which would indicate that the spread of the residuals changes with `x`.

*   **R - Random:** The data must be collected from a well-designed random sample or randomized experiment.
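
The Linear, Normal, and Equal Variance checks all start from the residuals. The sketch below is a rough numeric stand-in for the plots described above, run on simulated data with made-up parameters. In practice you would look at the actual scatterplot, residual plot, and Normal probability plot rather than rely on these summaries alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data that satisfies the conditions (illustrative values only)
x = rng.uniform(0, 10, size=40)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=40)

# Fit the least-squares line; for deg=1, np.polyfit returns [slope, intercept]
b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

# Least-squares residuals average to zero (up to rounding error)
print(f"Mean residual: {residuals.mean():.2e}")

# Equal variance: compare residual spread for the smaller vs. larger x-values
order = np.argsort(x)
low_sd = residuals[order[:20]].std(ddof=1)
high_sd = residuals[order[20:]].std(ddof=1)
print(f"Residual SD (low x): {low_sd:.3f}, (high x): {high_sd:.3f}")

# Normal: correlation from a Normal probability plot (near 1 is reassuring)
(_, _), (_, _, r) = stats.probplot(residuals, dist="norm")
print(f"Normal probability plot correlation: {r:.3f}")
```

Similar spreads in the two halves and a probability-plot correlation near 1 are consistent with the conditions, but they do not replace a visual check for curvature or outliers.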

---

### Confidence Interval for the Slope of a Regression Line

#### Theory
A confidence interval for the slope provides a range of plausible values for the true population slope, `β`. The structure is the same as every other interval we've studied.

*   **Formula:** **`b ± t* * SE_b`**
*   `b` is the sample slope from our data.
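*   `SE_b` is the **Standard Error of the Slope**. It estimates how far the sample slope `b` typically falls from the true slope `β`. It is computed as `SE_b = s / (s_x * sqrt(n - 1))`, where `s` is the standard deviation of the residuals and `s_x` is the sample standard deviation of the x-values. In practice, it is usually read from computer output.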
*   `t*` is the critical t-value for a given confidence level. The degrees of freedom for regression slope inference are **df = n - 2**.
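
The standard error of the slope is usually read from computer output, but it can be computed by hand. The sketch below uses two equivalent forms, `s / sqrt(Σ(x - x̄)²)` and `s / (s_x * sqrt(n - 1))`, where `s = sqrt(Σ residuals² / (n - 2))`. The six data points are hypothetical and serve only to show that the two forms agree.

```python
import numpy as np

# Hypothetical small data set, used only to illustrate the arithmetic
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 4.1, 5.8, 8.4, 9.9, 12.2])
n = len(x)

# Fit the least-squares line; for deg=1, np.polyfit returns [slope, intercept]
b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

# s: standard deviation of the residuals, with n - 2 degrees of freedom
s = np.sqrt(np.sum(residuals**2) / (n - 2))

# Form 1: divide by the square root of the sum of squared x-deviations
se_b_form1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# Form 2: divide by s_x * sqrt(n - 1), where s_x is the sample SD of x
se_b_form2 = s / (np.std(x, ddof=1) * np.sqrt(n - 1))

print(f"SE_b (form 1): {se_b_form1:.5f}")
print(f"SE_b (form 2): {se_b_form2:.5f}")
```

Both forms show why `SE_b` shrinks when the residuals are smaller, when the x-values are more spread out, or when the sample is larger.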

#### Interpretation
The most important part of interpreting the interval is to **look for the value 0**.
*   If the confidence interval for the slope **contains 0**, then it is plausible that the true slope is zero. This means we do not have convincing evidence of a linear relationship between the variables.
*   If the confidence interval **does not contain 0** (it is entirely positive or entirely negative), then we have strong evidence that the true slope is not zero. This means there is a statistically significant linear relationship between the variables.

#### Step-by-Step Example
**Scenario:** A real estate agent wants to model the relationship between the size of a house (in square feet) and its selling price (in thousands of dollars). She takes a random sample of 20 houses. Computer output for the regression analysis is shown below. Construct a 95% confidence interval for the true slope.

**Computer Output:**
*   Predictor: House Size
*   Sample size `n = 20`
*   Coefficient (Slope `b`): 0.150
*   Standard Error of the Slope (`SE_b`): 0.025

**STATE:**
*   We want to find a 95% confidence interval for `β`, the true slope of the regression line relating house price to its size.

**PLAN:**
*   **Procedure:** We will construct a t-interval for the slope `β`.
*   **Conditions:** We would check the LINER conditions using scatterplots and residual plots of the data. For this example, we will assume the conditions have been met.

**DO:**
1.  **Find `t*`:**
    *   `df = n - 2 = 20 - 2 = 18`.
    *   For 95% confidence with df=18, the critical value is **`t* = 2.101`**.
2.  **Calculate the interval:**
    *   `b ± t* * SE_b`
    *   `0.150 ± 2.101 * 0.025`
    *   `0.150 ± 0.0525`
    *   Interval: **(0.0975, 0.2025)**

**CONCLUDE:**
*   "We are 95% confident that the interval from 0.0975 to 0.2025 captures the true slope of the regression line relating a house's price to its size."
*   **Interpretation in context:** Since the interval is entirely positive and does not contain 0, we have convincing evidence of a significant positive linear relationship between size and price. For each additional square foot of size, we are 95% confident that the true mean increase in price is between $97.50 and $202.50 (since prices were in thousands).
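
The arithmetic in this example can be checked in a few lines. The sketch below takes `b`, `SE_b`, and `n` from the computer output above and gets `t*` from `scipy.stats.t.ppf` instead of a table, so the last digits may differ slightly from the hand calculation.

```python
from scipy import stats

# Values from the computer output in the example
b, se_b, n = 0.150, 0.025, 20
confidence_level = 0.95

df = n - 2
# Critical value with (1 - C)/2 of the area in each tail
t_star = stats.t.ppf((1 + confidence_level) / 2, df)

margin = t_star * se_b
lower, upper = b - margin, b + margin

print(f"t* (df = {df}): {t_star:.3f}")
print(f"Interval for the slope: ({lower:.4f}, {upper:.4f})")

# Price is in thousands of dollars, so multiply by 1000 for dollars per sq ft
print(f"In dollars per square foot: ${lower * 1000:.2f} to ${upper * 1000:.2f}")
```

The dollar bounds come out a few cents away from $97.50 and $202.50 because the hand calculation rounded the margin of error to 0.0525.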

---

### Python Code Illustration

This code uses `scikit-learn` to fit the regression line, then manually computes the standard error of the slope, the t-test for `H₀: β = 0`, and the 95% confidence interval. The data are simulated, so the numbers differ from the worked example above.




```python
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats

# --- Setup: Create sample data for our housing price example ---
np.random.seed(42)
house_size = np.random.randint(1500, 3500, size=20)
true_slope = 0.15
noise = np.random.normal(0, 20, size=20)
price = 50 + true_slope * house_size + noise

# Scikit-learn requires X to be a 2D array
X = house_size.reshape(-1, 1)
y = price

# --- Fit the Model ---
model = LinearRegression()
model.fit(X, y)

# Extract the slope (b) and intercept (a)
slope_b = model.coef_[0]
intercept_a = model.intercept_
print("--- Regression Model Fit ---")
print(f"Sample Slope (b): {slope_b:.4f}")
print(f"Sample Intercept (a): {intercept_a:.4f}\n")
print("="*50)

# --- Manually Calculate for Inference ---
print("\n--- Inference Calculations ---")
n = len(X)
df = n - 2  # Degrees of freedom for regression

# 1. Calculate the residuals
predictions = model.predict(X)
residuals = y - predictions

# 2. Calculate the Standard Error of the Slope (SE_b)
# This is the most complex part of the manual calculation.
# SE_b = sqrt( Σ(residuals²) / (n-2) ) / sqrt( Σ(x - x̄)² )
se_residuals = np.sqrt(np.sum(residuals**2) / df)
ssx = np.sum((house_size - np.mean(house_size))**2)
se_b = se_residuals / np.sqrt(ssx)

print(f"Standard Error of the Slope (SE_b): {se_b:.4f}")

# 3. Perform the significance test for H₀: β = 0
t_statistic = (slope_b - 0) / se_b
# Two-sided p-value; the survival function (sf) is more accurate
# than 1 - cdf when the tail area is very small
p_value = 2 * stats.t.sf(np.abs(t_statistic), df=df)

print(f"t-statistic for H₀: β=0: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Since P-value < 0.05, we reject H₀ and conclude a significant linear relationship exists.\n")
else:
    print("Conclusion: Since P-value >= 0.05, we fail to reject H₀.\n")


# 4. Construct the 95% confidence interval
confidence_level = 0.95
t_star = stats.t.ppf((1 + confidence_level) / 2, df=df)
margin_of_error = t_star * se_b
lower_bound = slope_b - margin_of_error
upper_bound = slope_b + margin_of_error

print("--- Confidence Interval ---")
print(f"Critical t* for 95% confidence (df={df}): {t_star:.4f}")
print(f"Margin of Error: {margin_of_error:.4f}")
print(f"95% Confidence Interval for β: ({lower_bound:.4f}, {upper_bound:.4f})")
# Only claim agreement with the test when the interval actually excludes 0
if lower_bound > 0 or upper_bound < 0:
    print("Interpretation: Since this interval does not contain 0, it confirms our test result.")
else:
    print("Interpretation: Since this interval contains 0, a true slope of 0 remains plausible.")
```


```text
--- Regression Model Fit ---
Sample Slope (b): 0.1318
Sample Intercept (a): 86.6481

==================================================

--- Inference Calculations ---
Standard Error of the Slope (SE_b): 0.0062
t-statistic for H₀: β=0: 21.1824
P-value: 0.0000
Conclusion: Since P-value < 0.05, we reject H₀ and conclude a significant linear relationship exists.

--- Confidence Interval ---
Critical t* for 95% confidence (df=18): 2.1009
Margin of Error: 0.0131
95% Confidence Interval for β: (0.1187, 0.1449)
Interpretation: Since this interval does not contain 0, it confirms our test result.
```
