In [1]:
import numpy as np
import matplotlib.pyplot as plt

import os
import sys

# Append the parent directory to sys.path so that utils can be found
sys.path.append(os.path.join(sys.path[0], os.path.pardir))

from utils import StringHandler as SH
from utils import LinearRegression as LR
from utils import CorrelationCoefficient as CC

### **7.1. TWO ROUGH PREDICTIONS**

#### **Predict "Relatively Large Number":**

#### **Predict "between 14 and 18 Cards":**

### **7.2. A REGRESSION LINE**

#### **Placement of Line:**
- The regression line is a straight line positioned to pass through the main cluster of dots in a scatterplot, possibly touching some dots but missing others.
- It is used to predict the value of one variable given a known value of the other. Because a single line cannot pass through every dot, the trade-off is some error between the actual and predicted values.

#### **Predictive Errors:**
- When predicted values, taken from the regression line, differ from the actual values of the variable, the differences are called predictive errors.

#### **Total Predictive Error:**
- The smaller the total of all predictive errors, the more trustworthy the predictions. Therefore, it is desirable to place the regression line in the position that minimizes the total predictive error, that is, the discrepancies between the predicted and actual values.
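
A minimal sketch of these ideas, using made-up data and two hand-picked candidate lines (the data, the function name `predictive_errors`, and both lines are illustrative assumptions, not taken from the text). It also suggests why least squares works with squared errors: raw errors of opposite sign can cancel.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

def predictive_errors(x, y, slope, intercept):
    """Return actual minus predicted Y for the line Y' = slope * X + intercept."""
    return y - (slope * x + intercept)

# A flat line through the mean of Y versus a line that follows the trend
flat = predictive_errors(x, y, slope=0.0, intercept=y.mean())
trend = predictive_errors(x, y, slope=0.9, intercept=1.3)

# Raw errors cancel for both lines, so their plain sum cannot rank them
print(flat.sum(), trend.sum())

# Squared errors do not cancel and favor the line that follows the trend
flat_total = (flat ** 2).sum()
trend_total = (trend ** 2).sum()
print(flat_total, trend_total)
```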

#### **Progress Check 7.1:** To check your understanding of the first part of this chapter, make predictions using the following graph.
![image.png](attachment:4864c2ec-578b-4324-bfdd-9e11cfbc7470.png)

(a) Predict the approximate rate of inflation, given an unemployment rate of 5 percent. <br>
5.35%. <br>
(b) Predict the approximate rate of inflation, given an unemployment rate of 15 percent. <br>
2.65%.

### **7.3. LEAST SQUARES REGRESSION LINE**

#### **Need a Mathematical Solution:**

#### **Least Squares Regression Equation:**
<center><b>LEAST SQUARES REGRESSION EQUATION</b></center>
<center>$\Large Y^{'} = bX + a$</center>
<br>
Where: <br>
+ Y': predicted value <br>
+ X: known (given) value <br>
+ a, b: numbers calculated from the original correlation analysis.

#### **Finding Values of b and a:**

<center><b>SOLVING FOR b</b></center>
<center>$\Large b = r\sqrt{\frac{SS_y}{SS_x}}$</center>
<br>
Where: <br>
+ r: correlation coefficient between variables X and Y <br>
+ $SS_x$, $SS_y$: sum of squares of X and Y, respectively.

<center><b>SOLVING FOR a</b></center>
<center>$\Large a = \overline{Y} - b\overline{X}$</center>
<br>
Where: <br>
+ $\overline{X},\, \overline{Y}$: sample means for all X and Y scores, respectively.
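
The two formulas above can be sketched in NumPy as follows. The function name `least_squares_line` and the sample scores are assumptions made for illustration; `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (b, a) for Y' = bX + a, using b = r * sqrt(SS_y / SS_x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_x = np.sum((x - x.mean()) ** 2)   # sum of squares for X
    ss_y = np.sum((y - y.mean()) ** 2)   # sum of squares for Y
    r = np.corrcoef(x, y)[0, 1]          # Pearson correlation coefficient
    b = r * np.sqrt(ss_y / ss_x)         # slope
    a = y.mean() - b * x.mean()          # intercept
    return b, a

# Hypothetical scores, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

b, a = least_squares_line(x, y)
slope, intercept = np.polyfit(x, y, deg=1)  # cross-check with a degree-1 fit
print(b, a)
print(slope, intercept)
```

Both approaches should agree up to floating-point error, since a degree-1 polynomial fit is itself a least squares line.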

#### **Key Property:**
- Definition:
    + Least Squares Regression Equation: the equation that minimizes the total of all squared prediction errors for known Y scores in the original correlation analysis.
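
As a rough numerical illustration of this property (a sketch on invented data, not a proof), nudging the fitted slope or intercept in any direction should increase the total of squared prediction errors. The helper `sum_squared_errors` is a hypothetical name.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

def sum_squared_errors(slope, intercept):
    """Total of squared prediction errors for the line Y' = slope * X + intercept."""
    return float(np.sum((y - (slope * x + intercept)) ** 2))

# Least squares slope and intercept
b, a = np.polyfit(x, y, deg=1)
best = sum_squared_errors(b, a)

# Every small change to b and/or a gives a larger total
nudged = []
for db in (-0.1, 0.0, 0.1):
    for da in (-0.1, 0.0, 0.1):
        if (db, da) != (0.0, 0.0):
            nudged.append(sum_squared_errors(b + db, a + da))

print(best, min(nudged))
```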

#### **Solving for Y':**
- The value of b (the slope) is the change in the predicted value Y' for each one-unit increase in X.
- The value of a (the intercept) is the predicted value when X is 0 (that is, if X = 0 then Y' = a).

#### **A Limitation:**
- Again, neither the predicted value nor the regression equation provides evidence of a cause-effect relationship. The output of the equation should therefore be treated with caution, and further investigation is needed to determine whether it is trustworthy.

#### **Progress Check 7.2:** Assume that an r of .30 describes the relationship between educational level (highest grade completed) and estimated number of hours spent reading each week. More specifically:
![image.png](attachment:48f1fc15-93f9-43ce-8166-74e8a9f58dc7.png)

(a) Determine the least squares equation for predicting weekly reading time from educational level. <br>
Y' = 0.42X + 2.54. <br> 
(b) Faith’s education level is 15. What is her predicted reading time? <br>
8.84. <br>
(c) Keegan’s educational level is 11. What is his predicted reading time? <br>
7.16

In [18]:
# Solution:
    # a);
coefficient = np.round(0.3 * np.sqrt(50 / 25), decimals=2)
constant = 8 - (coefficient * 13)
coefficient, constant

(0.42, 2.54)

In [19]:
# b):
faith_rtime = LR.calc_pvalue(15, coefficient, constant)
faith_rtime

8.84

In [20]:
# c):
keegan_rtime = LR.calc_pvalue(11, coefficient, constant)
keegan_rtime

7.16

#### **Graphs or Equations:**
- An equation is preferable because it yields more precise predictions and is easy to evaluate, whereas a graph takes more time and effort to construct.

### **7.4. STANDARD ERROR OF ESTIMATE, $s_{y|x}$**

#### **Finding the Standard Error of Estimate:**
- Although predictive error is minimized by the least squares regression equation, it is not eliminated. Therefore, an estimate of the error should accompany the predictions.
<br> <br>
<center><b>STANDARD ERROR OF ESTIMATE (DEFINITION FORMULA)</b></center>
<center>$\Large s_{y|x} = \sqrt{\frac{SS_{y|x}}{n - 2}} = \sqrt{\frac{\sum{(Y-Y^{'})^2}}{n-2}}$</center>
<br>
Where:<br>
    + $SS_{y|x}$: sum of squares for predictive errors <br>
    + n - 2: degrees of freedom <br>
<br>
<center><b>STANDARD ERROR OF ESTIMATE (COMPUTATION FORMULA)</b></center>
<center>$\Large s_{y|x} = \sqrt{\frac{SS_y(1-r^2)}{n-2}}$</center>

#### **Key Property:**
- Definition:
    + Standard Error of Estimate ($s_{y|x}$): a rough measure of the average amount of predictive error.
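
A short sketch, on invented data, checking that the definition formula and the computation formula give the same standard error of estimate. The variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
n = len(x)

# Least squares line and its predictions
b, a = np.polyfit(x, y, deg=1)
y_pred = b * x + a

ss_y = np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]

# Definition formula: based on the squared predictive errors
s_definition = np.sqrt(np.sum((y - y_pred) ** 2) / (n - 2))

# Computation formula: based on SS_y and r
s_computation = np.sqrt(ss_y * (1 - r ** 2) / (n - 2))

print(s_definition, s_computation)
```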

#### **Importance of <i>r</i>:**

#### **Progress Check 7.3:**

(a) Calculate the standard error of estimate for the data in Question 7.2 on page 132, assuming that the correlation of .30 is based on n = 35 pairs of observations. <br>
(b) Supply a rough interpretation of the standard error of estimate. <br>
Roughly indicates the average amount by which the prediction is in error.

In [None]:
# a):
s_yx = np.sqrt(50 * (1 - 0.3**2) / 33)
s_yx

### **7.5. ASSUMPTIONS**

#### **Linearity:**
- Use of the regression equation requires that the underlying relationship be linear. If the scatterplot or some statistical procedure reveals that the relationship is non-linear, other methods should be applied.

#### **Homoscedasticity:**
- Use of the standard error of estimate requires that the dots in the original scatterplot be equally dispersed about all segments of the regression line.

#### **Example: Violation of homoscedasticity assumption. (Dots lack equal variability about all line segments)**
![image.png](attachment:57a90125-6a88-4d4d-8a64-dfb3ed30efac.png)
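
One informal way to look for this kind of violation is to compare the spread of the predictive errors in different segments of the line. The sketch below simulates data whose noise deliberately grows with X (the simulated data, seed, and the split at the median are all assumptions chosen for illustration; this is a rough check, not a formal test).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the noise standard deviation grows with x,
# which deliberately violates homoscedasticity
x = np.linspace(1, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5 * x)

# Fit the least squares line and compute the predictive errors
b, a = np.polyfit(x, y, deg=1)
residuals = y - (b * x + a)

# Compare the spread of errors in the lower and upper halves of x
lower = residuals[x <= np.median(x)]
upper = residuals[x > np.median(x)]
spread_lower = lower.std(ddof=1)
spread_upper = upper.std(ddof=1)

print(spread_lower, spread_upper)
```

A clearly larger spread in one segment suggests that a single standard error of estimate would understate the error there and overstate it elsewhere.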

### **7.6. INTERPRETATION OF $r^2$**

#### **Repetitive Prediction of the Mean:** 
- When no information about X is used, repeatedly predicting the mean $\overline{Y}$ for every case is a valid procedure, because the deviations of the scores from their mean always sum to 0.

#### **Predictive Errors:** 
- The errors produced by the least squares regression tend to be lower than in repetitive prediction of the mean.

#### **Error Variability (Sum of Squares):** 

#### **Proportion of Predicted Variability:** 
- $SS_y$ measures the total variability of Y scores that occurs after only primitive predictions based on $\overline{Y}$ are made.
- $SS_{y|x}$ measures the residual variability of Y scores that remains after customized least squares predictions are made.
- To obtain an SS measure of the actual <i>gain in accuracy</i>, subtract the residual variability ($SS_{y|x}$) from the total variability ($SS_y$): <br> <br>
<center>$\Large SS_y - SS_{y|x}$</center> <br>
- To express the difference as a gain in accuracy relative to the original variability for the repetitive prediction, divide the actual gain in accuracy by the total variability: <br> <br>
<center>$\Large \frac{SS_y - SS_{y|x}}{SS_y}$</center> <br>
- Definition:
    + Squared Correlation Coefficient ($r^2$): the proportion of the total variability in one variable that is predictable from its relationship with the other variable.
    + Formula:
    <center>$\Large r^2 = \frac{SS_{y^{'}}}{SS_y} = \frac{SS_y - SS_{y|x}}{SS_y}$</center>

#### **$r^2$ Does Not Apply to Individual Scores:** 
- Interpretation of $r^2$: the value of squared correlation coefficient indicates the percentage/proportion of variation of one variable that can be explained/predicted by the other variable.
- Note: $r^2$ does not indicate the proportion of individual scores that can be predicted perfectly.
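
The proportion interpretation of $r^2$ can be checked numerically. The sketch below uses invented data and illustrative variable names to confirm that $r^2$ equals both $SS_{y'}/SS_y$ and $(SS_y - SS_{y|x})/SS_y$.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

b, a = np.polyfit(x, y, deg=1)
y_pred = b * x + a

ss_y = np.sum((y - y.mean()) ** 2)          # total variability
ss_resid = np.sum((y - y_pred) ** 2)        # residual variability, SS_y|x
ss_pred = np.sum((y_pred - y.mean()) ** 2)  # predicted variability, SS_y'

r = np.corrcoef(x, y)[0, 1]

r_squared = r ** 2
gain_ratio = (ss_y - ss_resid) / ss_y
pred_ratio = ss_pred / ss_y

print(r_squared, gain_ratio, pred_ratio)
```

The value describes variability across the whole set of scores; it says nothing about how well any single score is predicted.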

#### **Small Values of $r^2$:** 
- Strength of predictions: the value of $r^2$ in the vicinity of
    + 0.01: implies a weak relationship
    + 0.09: a moderate relationship
    + 0.25: a strong relationship <br>
    between the actual data and the data that can be predicted using the least squares regression model.

#### **$r^2$ Doesn't Ensure Cause-Effect:** 
- Again, no value of $r$ can by itself imply a cause-effect relationship between variables. Rather, $r^2$ gives the approximate proportion of variability in one variable that is explainable or predictable from the other; in other words, changes in one variable are statistically associated with changes in the other.

#### **Progress Check 7.4:** Assume that an r of .30 describes the relationship between educational level and estimated hours spent reading each week.

(a) According to r², what percent of the variability in weekly reading time is predictable from its relationship with educational level? <br>
9% of the variability in weekly reading time is predictable from its relationship with educational level. <br>
(b) What percent of variability in weekly reading time is not predictable from this relationship?<br>
91%. <br>
(c) Someone claims that 9 percent of each person’s estimated reading time is predictable from the relationship. What is wrong with this claim? <br>
The 9% refers to the variability among the weekly reading times of the entire group, not to a share of any individual's reading time.

#### **Progress Check 7.5:** As indicated in Figure 6.3 on page 111, the correlation between the IQ scores of parents and children is .50, and that between the IQ scores of foster parents and foster children is .27.

(a) Does this signify, therefore, that the relationship between foster parents and foster children is about one-half as strong as the relationship between parents and children? <br>
No. The value of $r$ is not a proportion or percentage, so two values of $r$ cannot be compared as a simple ratio; their strengths should be compared using $r^2$. <br>
(b) Use $r^2$ to compare the strengths of these two correlations. <br>
Since $r^2$ = 0.25 for parents and children and $r^2$ = 0.07 for foster parents and foster children, the relationship between parents and children is about three and a half times as strong as the one between foster parents and foster children.
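
The arithmetic behind this comparison, as a small sketch using the two correlations quoted in the question (the variable names are illustrative):

```python
# Correlations quoted in Progress Check 7.5
r_parents = 0.50
r_foster = 0.27

# Comparing the r values directly understates the difference in strength
ratio_r = r_parents / r_foster

# Comparing the squared correlations gives the appropriate ratio
ratio_r2 = r_parents ** 2 / r_foster ** 2

print(round(ratio_r, 2), round(ratio_r2, 2))
```

With unrounded values the ratio of squared correlations is about 3.4; rounding $.27^2$ to .07 first gives about 3.6.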

### **7.7. MULTIPLE REGRESSION EQUATIONS**

- Definition:
    + Multiple Regression Equation: a least squares regression that contains more than one predictor or X variable.
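
A minimal sketch of a least squares fit with two predictors, using `np.linalg.lstsq` on simulated data. The true coefficients (3, 2, and -1), the seed, and the variable names are assumptions chosen only so the result can be checked.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Simulated predictors and an outcome built from known coefficients plus noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept, then one column per predictor
design = np.column_stack([np.ones(n), x1, x2])

# Least squares solution for Y' = a + b1 * X1 + b2 * X2
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b1, b2 = coeffs

print(a, b1, b2)
```

The estimates should land close to the coefficients used to simulate the data.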

#### **Common Features:**

### **7.8. REGRESSION TOWARD THE MEAN**

- Definition:
    + Regression Toward the Mean: a tendency for scores, particularly extreme scores, to shrink toward the mean.

#### **Appears in Many Distributions:**
- Extremely low or high scores in a distribution commonly tend to regress toward the mean when the same individuals are measured again.
- Remark: the tendency applies to a subset of (extreme) scores, not to the entire distribution.
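
The tendency can be illustrated with a small simulation. The sketch below assumes each observed score is a stable component plus independent chance error on each testing occasion; the model, the seed, and all numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Assumed model: observed score = stable component + chance error
stable = rng.normal(100.0, 10.0, size=n)
test1 = stable + rng.normal(0.0, 10.0, size=n)
test2 = stable + rng.normal(0.0, 10.0, size=n)

# Select the extreme scorers on the first test
top = test1 >= np.percentile(test1, 90)
bottom = test1 <= np.percentile(test1, 10)

# Extreme groups move toward the mean on retest...
print(test1[top].mean(), test2[top].mean())
print(test1[bottom].mean(), test2[bottom].mean())

# ...while the spread of the entire distribution stays about the same
print(test1.std(), test2.std())
```

No treatment was applied between the two tests, so the shift in the extreme groups is a chance effect, and the unchanged overall spread shows that the tendency concerns a subset of scores rather than the whole distribution.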

#### **The Regression Fallacy**:
- Definition:
    + Regression Fallacy: occurs whenever regression toward the mean is interpreted, without experimental verification, as a real effect rather than a chance effect.
- Conclusion: summaries of observational data should be interpreted cautiously, especially when the data were collected without experimental control (that is, without independent variables manipulated by the investigators). Apparent relationships between variables, and their tendencies, should be verified by experiment.

#### **Avoiding Regression Fallacy:**

#### **Progress Check 7.6:** After a group of college students attended a stress-reduction clinic, declines were observed in the anxiety scores of those who, prior to attending the clinic, had scored high on a test for anxiety.

(a) Can this decline be attributed to the stress-reduction clinic? Explain your answer. <br>
No. The decline might be attributable to chance (regression toward the mean) rather than to the treatment. <br>
(b) What type of study, if any, would permit valid conclusions about the effect of the stress-reduction clinic? <br>
If the group of students that scored high on the prior anxiety test were randomly assigned into two different groups, one would (repeatedly) receive the treatment, and the other would not, then the declines in anxiety scores of the group that received treatment could be attributed to the treatment.

### **REVIEW QUESTIONS**

#### **7.7.** Assume that an r of –.80 describes the strong negative relationship between years of heavy smoking (X) and life expectancy (Y). Assume, furthermore, that the distributions of heavy smoking and life expectancy each have the following means and sums of squares:
![image.png](attachment:062969af-5715-4183-8f56-6fff1b4d8d32.png)

(a) Determine the least squares regression equation for predicting life expectancy from years of heavy smoking. <br>
Y' = -1.13X + 65.65. <br>
(b) Determine the standard error of estimate, $s_{y|x}$, assuming that the correlation of –.80 was based on n = 50 pairs of observations. <br>
0.72. <br>
(c) Supply a rough interpretation of $s_{y|x}$. <br>
Roughly indicates the average amount by which predictions from the least squares regression equation are in error. <br>
(d) Predict the life expectancy for John, who has smoked heavily for 8 years. <br>
56.61 +- 0.72 years. <br>
(e) Predict the life expectancy for Katie, who has never smoked heavily. <br>
65.65 +- 0.72 years.

In [2]:
# Solution:
r = -0.8
x_mean = 5
y_mean = 60
ss_x = 35
ss_y = 70

In [6]:
# a):
coefficient = np.round(r * np.sqrt(ss_y / ss_x), decimals=2)
constant = y_mean - coefficient * x_mean
coefficient, constant

(-1.13, 65.65)

In [7]:
# b):
size = 50
predictive_error = np.sqrt( ss_y*(1 - r**2) / (size - 2) )
predictive_error

0.7245688373094717

In [9]:
# d):
john_expect = coefficient * 8 + constant
katie_expect = coefficient * 0 + constant
john_expect, katie_expect

(56.61000000000001, 65.65)

#### **7.8.** Each of the following pairs represents the number of licensed drivers (X) and the number of cars (Y) for seven houses in my neighborhood:
![image.png](attachment:1381ad1b-0178-4ffb-8fe5-439c73b4644b.png)

(a) Construct a scatterplot to verify a lack of pronounced curvilinearity. <br>
(b) Determine the least squares equation for these data. (Remember, you will first have to calculate r, $SS_y$, and $SS_x$.) <br>
Y' = 0.556X + 0.697. <br>
(c) Determine the standard error of estimate, $s_{y|x}$, given that n = 7. <br>
0.41. <br>
(d) Predict the number of cars for each of two new families with two and five drivers. <br>
1.81 +- 0.41 and 3.48 +- 0.41 cars.

In [None]:
# Solution:
data_str = ["5 4","5 3","2 2",
            "2 2","3 2","1 1","2 2"]
ignored, data = SH.to_data(data_str, ncol=2, name_maxlength=0)
data

In [None]:
# a):
fig, ax = CC.draw_scatterplot(data[:, 0], data[:, 1])

In [None]:
# b):
coefficient = LR.calc_slope(data[:, 0], data[:, 1])
constant = LR.calc_constant(data[:, 0], data[:, 1])
coefficient, constant

In [None]:
fig, ax = LR.draw_linregression(data[:, 0], data[:, 1])

In [None]:
# c):
predictive_error = LR.predictive_error(data[:, 0], data[:, 1], size=7, dataset_type="sample")
predictive_error

In [None]:
# d):
params = [coefficient, constant]
fam1 = LR.calc_pvalue(2, *params)
fam2 = LR.calc_pvalue(5, *params)
fam1, fam2
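
As an independent cross-check that does not rely on the `utils` helpers (whose implementation is not shown here), the same quantities can be computed with plain NumPy from the formulas in sections 7.3 and 7.4. The variable names are illustrative.

```python
import numpy as np

# Data pairs from the question: licensed drivers (X) and cars (Y)
drivers = np.array([5, 5, 2, 2, 3, 1, 2], dtype=float)
cars = np.array([4, 3, 2, 2, 2, 1, 2], dtype=float)
n = len(drivers)

ss_x = np.sum((drivers - drivers.mean()) ** 2)
ss_y = np.sum((cars - cars.mean()) ** 2)
r = np.corrcoef(drivers, cars)[0, 1]

# b) slope and intercept of the least squares line
b = r * np.sqrt(ss_y / ss_x)
a = cars.mean() - b * drivers.mean()

# c) standard error of estimate (computation formula)
s_yx = np.sqrt(ss_y * (1 - r ** 2) / (n - 2))

# d) predictions for families with two and five drivers
pred_two = b * 2 + a
pred_five = b * 5 + a

print(b, a, s_yx, pred_two, pred_five)
```

With unrounded intermediate values this gives a slope of about 0.558, an intercept of about 0.692, and a standard error of about 0.40. Results can differ in the second or third decimal place if r is rounded before b is computed, while the predictions still round to 1.81 and 3.48.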

#### **7.9.** At a large bank, length of service is the best single predictor of employees’ salaries. Can we conclude, therefore, that there is a cause-effect relationship between length of service and salary?

No. Being the best single predictor establishes only a correlation between length of service and salary; it does not show that length of service causes changes in employees' salaries.

#### **7.10.** Assume that $r^2$ equals .50 for the relationship between height and weight for adults. Indicate whether the following statements are true or false.

(a) Fifty percent of the variability in heights can be explained by variability in weights. <br>
True. <br>
(b) There is a cause-effect relationship between height and weight. <br>
False. <br>
(c) The heights of 50 percent of adults can be predicted exactly from their weights. <br>
False. <br>
(d) Fifty percent of the variability in weights is predictable from heights. <br>
True

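#### **7.11.** Regression toward the mean occurs between the heights of fathers and the heights of their adult sons. Indicate whether the following statements are true or false.

(a) Sons of tall fathers will tend to be shorter than their fathers. <br>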
True. <br>
(b) Sons of short fathers will tend to be taller than the mean for all sons. <br>
False. Sons of short fathers will tend to be shorter than the mean for all sons. <br>
(c) Every son of a tall father will be shorter than his father. <br>
False. Regression toward the mean is only a tendency, so there will be exceptions. <br>
(d) Taken as a group, adult sons are shorter than their fathers. <br>

(e) Fathers of tall sons will tend to be taller than their sons. <br>
(f) Fathers of short sons will tend to be taller than their sons but shorter than the mean for all fathers. <br>

#### **7.12.** Someone suggests that it would be a good investment strategy to buy the five poorest-performing stocks on the New York Stock Exchange and capitalize on regression toward the mean. Comments?

- It may be true that some of the poorest-performing stocks on the NYSE will eventually regress toward the mean.
- However, neither the exact nor even the approximate number of these stocks that will perform better is known, so the strategy might not be rewarding overall.
- Furthermore, one needs to investigate the fundamental factors that caused these stocks to perform poorly. If those factors persist, the stocks might never recover.

#### **7.13.** In the original study of regression toward the mean, Sir Francis Galton noted a tendency for offspring of both tall and short parents to drift toward the mean height for offspring and referred to this tendency as “regression toward mediocrity.” What is wrong with the conclusion that eventually all heights will be close to their mean?

- The flaw lies in the word "all". Regression toward the mean is a chance tendency that applies to extreme scores, not to the entire distribution. While offspring of very tall or very short parents tend to be closer to the mean, offspring of average parents can, by chance, be very tall or very short, so the overall variability of heights persists.